parsing from pdf to text

Question
-
User351619809 posted
Hi All,
I need to convert a PDF document to a text file, and I am using the iTextSharp DLL for that purpose. The whole document parses correctly except for one table that is drawn with ruled lines. The table itself is parsed, but if a cell of the table contains only blank space, that space is missing from the converted text document. Below is the format of the table:
Col1 Col2 Col3 Col4 Col5
1 Test1 2 5 Test6
2 3 Test7
3 Test6 9 Test8
The output that I see is like this:
1 Test1 2 5 Test6<LF>
2 3 Test7<LF>
3 Test6 9 Test8<LF>
where <LF> is a line feed.
Is there any way I can keep those spaces in the output too? Below is the PDF parsing code:
Public Sub ExtractTextFromPdf(path As String)
    Dim its As ITextExtractionStrategy = New LocationTextExtractionStrategy()
    Dim HeadLine As String
    Using reader As New PdfReader(path)
        Dim str As New StringBuilder()
        For i As Integer = 1 To reader.NumberOfPages
            Dim thePage As String = PdfTextExtractor.GetTextFromPage(reader, i, its)
            Dim pdf31460Lines As String() = thePage.Split(ControlChars.Lf)
            For Each EachLine As String In pdf31460Lines
                str.AppendLine(EachLine)
                If EachLine.Contains("SNEW") Then
                    HeadLine = EachLine
                End If
            Next
        Next
        InsertParsedFileHeader(str.ToString(), HeadLine)
    End Using
End Sub
I have been searching for this for 3-4 days and couldn't find the right answer. I am using Visual Studio 2010. Any help in C# or VB.NET will be greatly appreciated.
Monday, September 15, 2014 6:18 PM
Answers
-
User281315223 posted
Parsing PDFs as Text
The most reliable way of handling this would be to use an Optical Character Recognition (OCR) library, which attempts to "read" the contents of a specific object (such as a PDF or an image) and gives you back the actual content.
Tesseract is one of the best-known open-source OCR libraries and should be fairly simple to integrate into your project. Tessnet2 is also available; it is essentially a .NET wrapper that lets you use Tesseract from managed code.
You may also want to look into this Stack Overflow discussion, which covers several different techniques: one uses iTextSharp to read the content of a PDF, and another uses the PdfBox library to accomplish the same thing.
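If you stay with a text-extraction library rather than OCR, one common idea is to use the position of each piece of text to decide how many spaces to put in front of it. The following is only a rough sketch of that idea and not a tested solution for your document. TextChunk, LayoutLine and the charWidth parameter are hypothetical names made up for this example, and the code does not call iTextSharp at all; it assumes you have already collected the starting x-coordinate of each piece of text on a line, for instance from a custom extraction strategy.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Text

Module ColumnLayoutSketch

    ' Hypothetical container: one piece of text plus the x-coordinate
    ' (in PDF points) at which it starts on the page.
    Public Structure TextChunk
        Public StartX As Single
        Public Content As String

        Public Sub New(x As Single, text As String)
            Me.StartX = x
            Me.Content = text
        End Sub
    End Structure

    ' Rebuilds one line of text, padding with spaces so that each chunk
    ' starts at the character column implied by its x-coordinate.
    ' charWidth is an assumed average character width in points.
    Public Function LayoutLine(chunks As IEnumerable(Of TextChunk),
                               charWidth As Single) As String
        Dim line As New StringBuilder()
        For Each chunk As TextChunk In chunks.OrderBy(Function(c) c.StartX)
            ' Character column implied by the x-coordinate (never negative).
            Dim column As Integer = Math.Max(0, CInt(Math.Floor(chunk.StartX / charWidth)))
            ' Keep at least one space between neighbouring chunks.
            If line.Length > 0 AndAlso column <= line.Length Then
                column = line.Length + 1
            End If
            line.Append(" "c, column - line.Length)
            line.Append(chunk.Content)
        Next
        Return line.ToString()
    End Function

    Sub Main()
        ' Made-up positions for the row "2 3 Test7": the values are placed
        ' as if the table columns were 60 points apart.
        Dim row As New List(Of TextChunk) From {
            New TextChunk(0.0F, "2"),
            New TextChunk(120.0F, "3"),
            New TextChunk(240.0F, "Test7")
        }
        Console.WriteLine(LayoutLine(row, 6.0F))
    End Sub

End Module
```

With a layout like this, an empty cell simply shows up as a wider run of spaces, because the next value is still placed at its own column. You would call LayoutLine once per text line, after grouping the chunks that share (roughly) the same y-coordinate, and the right charWidth depends on the font used in your PDF.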
You can see a few more related methods of handling this below:
- Marked as answer by Anonymous Thursday, October 7, 2021 12:00 AM
Wednesday, September 17, 2014 8:22 AM
All replies
-
User177399542 posted
Try giving your table inline styles:
<table border="1"> OR <table style="border:1px solid black;">
Wednesday, September 17, 2014 8:20 AM