How to read a PDF file through C#?

    Question

  • Hi,

    I have a PDF file and I need to read its text into a variable.

    How do I do this in C# (WinForms)?

    Thanks in advance.
    Monday, May 31, 2010 7:27 AM

All replies

  • Hi there.

    Take a look at these two projects on codeplex.com:

    PDFSharp

    PDF Library

    Hope this helps.

    Regards,

    Magnus


    My blog: InsomniacGeek.com
    Monday, May 31, 2010 7:39 AM
  • Thanks for the help,

    but with those samples I can only create a new PDF file.

    I only need to read the text from a PDF file into a variable in my C# program.

    Monday, May 31, 2010 8:15 AM
  • Hope this helps:

    http://www.codeproject.com/KB/cs/PDFToText.aspx

    Regards,

    jayant

    • Marked as answer by Liliane Teng Friday, June 04, 2010 9:14 AM
    Monday, May 31, 2010 8:24 AM
  • This question has been asked many times by users. I suggest searching for your question on Google first and then posting it here.

    Moderators: Make this question available as a FAQ.

    Monday, May 31, 2010 8:47 AM
  • Thanks for the help,

    but with those samples I can only create a new PDF file.

    I only need to read the text from a PDF file into a variable in my C# program.

    Hi there.

    Well, I don't agree with you. They have classes for reading the contents of PDF documents. Please at least download the samples. And here is a sample to do that.

    Regards,

    Magnus


    • Marked as answer by Liliane Teng Friday, June 04, 2010 9:15 AM
    Monday, May 31, 2010 10:41 AM
  • Hello E_gold,
    Thanks for your post.
    The following pages should give you an idea of how to achieve this.
    http://jadn.co.uk/w/ReadPdfUsingCsharp.htm
    (How to read pdf files using C# .NET)
    http://social.msdn.microsoft.com/forums/en-US/xmlandnetfx/thread/4a9fb479-b48e-4366-ad39-02b2dac674f5/
    (read pdf content into text file using c#.net)

    If you have any problems, please feel free to follow up.
    Best regards,
    Liliane


    Please mark the replies as answers if they help and unmark them if they provide no help. Thanks
    • Marked as answer by Liliane Teng Friday, June 04, 2010 9:15 AM
    Wednesday, June 02, 2010 7:53 AM
  • Hello Agalo,

    Yes, this question is very common. Your suggestion is a good one and we will consider it. Thanks.

    If you have any problems or suggestions, please feel free to contact me.

    Best regards,

    Liliane


    Wednesday, June 02, 2010 8:03 AM
  • For those with Acrobat installed, there are a couple of C# methods that can get all the words of a PDF document:

    http://social.msdn.microsoft.com/forums/en-US/xmlandnetfx/thread/4a9fb479-b48e-4366-ad39-02b2dac674f5/

    as posted by gg1.

    Of course, this option may not be viable if you plan to distribute your application for profit or if you need to preserve the formatting.

    In summary, the link has the following sample code and some references to the Adobe website:

    // Requires Acrobat to be installed and a COM reference to its type library.
    using System;
    using System.Collections.Generic;
    using Acrobat;

    // The following extracts the words of a PDF document given its file spec.
    // Opening the PDF document is rather crude and needs to be more robust.
    public static string getTextFromPDF(string filespec)
    {
        Acrobat.AcroAppClass gAppClass = new Acrobat.AcroAppClass();
        // AcroAVDoc is a visible PDF document with a UI window.
        Acrobat.AcroAVDoc avDoc = (Acrobat.AcroAVDoc)gAppClass.GetInterface("Acrobat.AcroAVDoc");
        avDoc.Open(System.IO.Path.GetFullPath(filespec), System.IO.Path.GetFileName(filespec));

        AcroPDDoc doc = (AcroPDDoc)avDoc.GetPDDoc();
        string txt = PdDocGetText(doc);
        doc.Close();
        avDoc.Close(1); // 1 = close without saving
        gAppClass.Exit();
        return txt;
    }

    // Slightly modified version of a post in the Adobe forum, originally by Eldrarak82.
    private static string PdDocGetText(AcroPDDoc pdDoc)
    {
        AcroPDPage page;
        int pages = pdDoc.GetNumPages();
        string pageText = "";
        for (int i = 0; i < pages; i++)
        {
            page = (AcroPDPage)pdDoc.AcquirePage(i);
            object jso, jsNumWords, jsWord;
            List<string> words = new List<string>();
            try
            {
                jso = pdDoc.GetJSObject();
                if (jso != null)
                {
                    // Late-bound call into the Acrobat JavaScript object: word count of page i.
                    object[] args = new object[] { i };
                    jsNumWords = jso.GetType().InvokeMember("getPageNumWords", System.Reflection.BindingFlags.InvokeMethod, null, jso, args, null);
                    int numWords = Int32.Parse(jsNumWords.ToString());
                    // Word indexes are zero-based, so the last valid index is numWords - 1.
                    for (int j = 0; j < numWords; j++)
                    {
                        // false = do not strip punctuation and white space from the word.
                        object[] argsj = new object[] { i, j, false };
                        jsWord = jso.GetType().InvokeMember("getPageNthWord", System.Reflection.BindingFlags.InvokeMethod, null, jso, argsj, null);
                        words.Add((string)jsWord);
                    }
                }
                foreach (string word in words)
                {
                    pageText += word;
                }
            }
            catch
            {
                // Errors are swallowed here, so a failing page contributes no text.
            }
        }
        return pageText;
    }

    The above code sample has yet to be fully tested and may need improvement. Nonetheless, it is a good starting point.
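
    The late-bound word loop can be tried without Acrobat by pointing it at a plain .NET object. The following is only a sketch: FakeJsObject and LateBoundWords are hypothetical names made up for illustration (they are not part of any Adobe API), and the stand-in only roughly imitates how getPageNumWords and getPageNthWord behave. It shows the same Type.InvokeMember pattern, with word indexes running from 0 to numWords - 1.

```csharp
using System;
using System.Reflection;
using System.Text;

// Hypothetical stand-in for the object returned by GetJSObject(); not an Adobe type.
public class FakeJsObject
{
    private readonly string[][] pages =
    {
        new[] { "Hello, ", "PDF " },
        new[] { "world." }
    };

    public int getPageNumWords(int page)
    {
        return pages[page].Length;
    }

    public string getPageNthWord(int page, int word, bool strip)
    {
        string w = pages[page][word];
        // Rough imitation of the strip flag: drop trailing spaces and punctuation.
        return strip ? w.TrimEnd(' ', ',', '.') : w;
    }
}

public static class LateBoundWords
{
    // Explicit Public | Instance flags are needed for a managed object.
    private const BindingFlags Flags =
        BindingFlags.InvokeMethod | BindingFlags.Public | BindingFlags.Instance;

    // Reads all words of one page through late-bound calls and concatenates them.
    public static string ReadPage(object jso, int page)
    {
        Type t = jso.GetType();
        object count = t.InvokeMember("getPageNumWords", Flags, null, jso, new object[] { page });
        int numWords = Convert.ToInt32(count);

        StringBuilder text = new StringBuilder();
        for (int j = 0; j < numWords; j++) // zero-based: last index is numWords - 1
        {
            object word = t.InvokeMember("getPageNthWord", Flags, null, jso, new object[] { page, j, false });
            if (word != null)
            {
                text.Append(word);
            }
        }
        return text.ToString();
    }
}
```

    With this stand-in, LateBoundWords.ReadPage(new FakeJsObject(), 0) returns "Hello, PDF ". Because the words are requested unstripped, they keep their trailing spaces and punctuation, which is why plain concatenation is enough here.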

    For those interested in tables, rows and columns, look up Adobe documents such as

    http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/plugin_apps_developer_guide.pdf

    around pages 130 to 136. The same guide may also be helpful for many other tasks.

    • Edited by fs - ab Friday, June 01, 2012 4:44 PM
    Thursday, May 31, 2012 12:32 AM