How to fix the encoding when extracting text from a PDF using iTextSharp?

  • Question

  • User-260554092 posted

    Hi,

    I am extracting text from a PDF, and the encoding does not seem to work. I have two methods for extracting the text, because method 1 works for some PDFs and method 2 works for others. I want to combine both but don't understand how.

    Also, with method 2 the encoding gets messed up: whitespace characters come back with ASCII code 63 for some reason. Is there a way to fix this, so that calling IndexOf with a string containing a space will match the whitespace in the extracted text?

            // Requires: using System; using System.IO;
            //           using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
            public static bool does_document_text_have_keyword(string keyword, string pdf_src)
            {
                PdfReader pdfReader = null;
                try
                {
                    pdfReader = new PdfReader(pdf_src);
                    string currentText;
                    int count = pdfReader.NumberOfPages;
                    for (int page = 1; page <= count; page++)
                    {
                        // method_1
                        currentText = PDFParser.ExtractTextFromPDFBytes(pdfReader.GetPageContent(page)) + " ";
                        if (currentText.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) != -1)
                            return true;

                        // method_2 (reuses pdfReader instead of opening a second reader per page)
                        StringWriter output = new StringWriter();
                        output.WriteLine(PdfTextExtractor.GetTextFromPage(pdfReader, page, new SimpleTextExtractionStrategy()));
                        currentText = fix_encoding(output.ToString());
                        if (currentText.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) != -1)
                            return true;
                    }
                    return false;
                }
                catch
                {
                    return false;
                }
                finally
                {
                    // Close the reader on every exit path, including the early returns
                    if (pdfReader != null) pdfReader.Close();
                }
            }

    Thursday, December 20, 2012 4:10 PM

All replies

  • User-1760637409 posted

    Hi,

    You can try the link below:

    http://stackoverflow.com/questions/4784385/extract-data-from-pdf-files

    or

    private string ExtractText()
    {
        PdfReader reader = new PdfReader(Server.MapPath(P_InputStream3));
        // Extracts page 2 only; loop up to reader.NumberOfPages for the whole document.
        string txt = PdfTextExtractor.GetTextFromPage(reader, 2, new LocationTextExtractionStrategy());
        reader.Close();
        return txt;
    }
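
As for combining the two methods from the question, one option is to run each extractor on every page and stop at the first match. This is only a sketch: `KeywordSearch` and `DocumentContainsKeyword` are illustrative names, not part of iTextSharp, and the extractors are passed in as delegates so that each one can wrap one of your existing calls.

```csharp
using System;

public static class KeywordSearch
{
    // Hypothetical helper: tries every extractor on every page and returns
    // true as soon as one of them yields text containing the keyword.
    public static bool DocumentContainsKeyword(
        int pageCount, string keyword, params Func<int, string>[] extractors)
    {
        for (int page = 1; page <= pageCount; page++)   // PDF pages are 1-based
        {
            foreach (Func<int, string> extract in extractors)
            {
                string text;
                try
                {
                    text = extract(page);
                }
                catch (Exception)
                {
                    continue;   // this method failed on this page; try the next one
                }

                if (!string.IsNullOrEmpty(text) &&
                    text.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
                {
                    return true;
                }
            }
        }
        return false;
    }
}
```

With the code from the question, the call could look like `KeywordSearch.DocumentContainsKeyword(reader.NumberOfPages, keyword, p => PDFParser.ExtractTextFromPDFBytes(reader.GetPageContent(p)), p => PdfTextExtractor.GetTextFromPage(reader, p, new SimpleTextExtractionStrategy()))`, so a page that one method cannot handle still gets checked by the other.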

    Hope this will solve your problem.

    Thursday, December 20, 2012 4:32 PM
  • User151468730 posted

    Hello ryand!

    To fix the encoding when extracting text from a PDF using iTextSharp, you may want to try the LocationTextExtractionStrategy.

    Its documentation states: text extraction renderer that keeps track of relative position of text on page. The resultant text will be relatively consistent with the physical layout that most PDF files have on screen.
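
On the whitespace showing up as code 63: 63 is the character `?`, which .NET encoders substitute by default for any character the target encoding cannot represent. A likely cause (an assumption, since `fix_encoding` is not shown) is that the PDF contains a non-breaking space (U+00A0) or a similar character, and a conversion through ASCII turns it into `?`. A minimal sketch that skips the encoding round trip and maps every Unicode whitespace character to a plain space; `TextCleanup` and `NormalizeWhitespace` are hypothetical names:

```csharp
using System.Text;

public static class TextCleanup
{
    // Replaces every Unicode whitespace character (non-breaking space, tab,
    // line breaks, ...) with an ordinary space (U+0020).
    public static string NormalizeWhitespace(string input)
    {
        if (input == null) return string.Empty;

        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            sb.Append(char.IsWhiteSpace(c) ? ' ' : c);
        }
        return sb.ToString();
    }
}
```

Applying this to the extracted text before calling IndexOf should let a keyword that contains a regular space match, provided the extracted string has not already been converted to `?` by an earlier step.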

    Hopefully this helps,

    Best of Luck!

    With Kind Regards,

    Thursday, December 20, 2012 10:29 PM