Read PDF content into a text file using C#.NET

  • Question

  • Dear all,

    I'm trying to read PDF content into a text file using C#.NET.
    When I try to read the PDF, the content comes back as Unicode characters. How can I read the characters from the PDF?

    Thanks,
    ram krishna
    Friday, October 13, 2006 10:27 AM

All replies

  • As far as I know, PDF is not an XML format, so folks on this thread will not be much help to you...
    Thursday, October 26, 2006 6:24 AM
  • Sunday, May 13, 2007 11:46 PM
  • Hi,

    Why don't you use "tab-delimited files" to read the PDF data?

    Friday, June 27, 2008 6:12 AM
  • A free CodeProject sample here as well

    Tuesday, November 11, 2008 11:40 PM
  • Hi! Have you tried "PDF Focus .Net"? This library supports converting PDF to text and PDF to images, and doesn't require MS Office to be installed.

    I think this sample code will be useful for you:

        string pathToPdf = @"C:\Text.pdf";
        string pathToText = @"C:\Result.txt";

        // Convert the PDF file to a text file
        SautinSoft.PdfFocus f = new SautinSoft.PdfFocus();

        f.OpenPdf(pathToPdf);

        if (f.PageCount > 0)
        {
            int result = f.ToText(pathToText);

            // Show the text document (this sample treats a result of 0 as success)
            if (result == 0)
            {
                System.Diagnostics.Process.Start(pathToText);
            }
        }

    Monday, February 13, 2012 7:28 AM
  • To extract all words from a PDF document, for those with Adobe Acrobat installed (for example the Standard version 9; it may work with earlier versions, but I have not tested against them):

    Here are a couple of C# members that will compile provided you have added a reference to acrobat.dll in your project and added using Acrobat; to your class file (the code also needs using System; and using System.Collections.Generic; for Int32 and List<string>):

    // The following extracts the words from a PDF given its file spec.
    // Opening the PDF document is rather crude and needs to be more robust.
     public static string getTextFromPDF(string filespec)
     {
      Acrobat.AcroAppClass gAppClass = new Acrobat.AcroAppClass();
      Acrobat.AcroAVDoc avDoc = (Acrobat.AcroAVDoc)gAppClass.GetInterface("Acrobat.AcroAVDoc"); //Visible pdf document with a UI Window
      avDoc.Open(System.IO.Path.GetFullPath(filespec), System.IO.Path.GetFileName(filespec));
       
      AcroPDDoc doc = (AcroPDDoc)avDoc.GetPDDoc();
      string txt = PdDocGetText(doc);
      doc.Close();
      avDoc.Close(1);
      gAppClass.Exit();
      return txt;
     }
    // Slightly modified version of a post in the Adobe forum, originally by Eldrarak82
     private static string PdDocGetText(AcroPDDoc pdDoc)
     {
      AcroPDPage page;
      int pages = pdDoc.GetNumPages();
      string pageText = "";
      for (int i = 0; i < pages; i++)
      {
       page = (AcroPDPage)pdDoc.AcquirePage(i);
       object jso, jsNumWords, jsWord;
       List<string> words = new List<string>();
       try
       {
        jso = pdDoc.GetJSObject();
        if (jso != null)
        {
         object[] args = new object[] { i };
         jsNumWords = jso.GetType().InvokeMember("getPageNumWords", System.Reflection.BindingFlags.InvokeMethod, null, jso, args, null);
         int numWords = Int32.Parse(jsNumWords.ToString());
         for (int j = 0; j < numWords; j++) // word indices are zero-based, so stop before numWords
         {
          object[] argsj = new object[] { i, j, false };
          jsWord = jso.GetType().InvokeMember("getPageNthWord", System.Reflection.BindingFlags.InvokeMethod, null, jso, argsj, null);
          words.Add((string)jsWord);
         }
        }
        foreach (string word in words)
        {
         pageText += word;
        }
       }
       catch
       {
        // Errors from the JavaScript bridge are ignored here; the text of the failing page is skipped.
       }
      }
      return pageText;
     }

    The above code sample has yet to be fully tested and may need improvement. Nonetheless, it is a good starting point.

    For those interested in tables, rows and columns, look up Adobe documents such as

    http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/plugin_apps_developer_guide.pdf

    around pages 130 to 136.

    The same guide may also be helpful for a lot of other tasks.



    • Proposed as answer by devintbb Tuesday, March 4, 2014 3:09 PM
    • Edited by gg1 Tuesday, November 18, 2014 7:12 AM fix typos
    Wednesday, May 30, 2012 10:10 PM
  • For those who do not have Acrobat installed but do have PageMaker installed, take a look at AdobePDFMakerX.dll.


    • Edited by gg1 Wednesday, May 30, 2012 11:04 PM
    Wednesday, May 30, 2012 10:42 PM
  • On Windows 8.1,

     Acrobat.AcroAppClass gAppClass = new Acrobat.AcroAppClass();
     Acrobat.AcroAVDoc avDoc = (Acrobat.AcroAVDoc)gAppClass.GetInterface("Acrobat.AcroAVDoc"); //Visible pdf document with a UI Window

    may leave avDoc set to null. In that case, the following fallback helps:

    if (avDoc == null) avDoc = new Acrobat.AcroAVDoc();

    Friday, November 21, 2014 11:59 PM
  • Here is another alternative in case Acrobat is not present on the executing machine:

    // Load PDF file.
    var document = DocumentModel.Load("Sample.pdf");
                
    // Retrieve the PDF file's text content.
    string content = document.Content.ToString();
    
    // Or save as TXT file.
    document.Save("Sample.txt");

    You can read about this sample code in the following article, which demonstrates reading PDF files and extracting their text via C#.

    Tuesday, December 15, 2015 10:03 AM
  • The Gemsoft component mentioned above by Easked is no doubt easy to use and efficient, with high performance, so long as one does not mind the free version's 20-paragraph limit or paying US$580 for a full single-developer licence.
    Saturday, February 20, 2016 6:38 AM