parsing from pdf to text

  • Question

  • User351619809 posted

    Hi All,

    I need to convert a PDF document to a text file, and I am using the iTextSharp DLL for that purpose. The whole PDF document parses correctly except for one table that is drawn with ruled lines. The table itself is parsed, but if a cell of the table is blank, that blank space does not appear in the converted text document. Below is the format of the table:

    Col1    Col2   Col3   Col4   Col5 
    1       Test1   2     5       Test6
    2               3             Test7
    3       Test6         9       Test8

    The output that I see is like this:

    1 Test1 2 5 Test6 <LF>
    2 3 Test7<LF>
    3 Test6 9 Test8<LF>
    <LF> is line feed.

    Is there any way I can see those spaces too? Below is the PDF parsing code:

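    Imports System.Text
    Imports iTextSharp.text.pdf
    Imports iTextSharp.text.pdf.parser

    Public Sub ExtractTextFromPdf(path As String)
        Dim HeadLine As String = String.Empty
        Using reader As New PdfReader(path)
            Dim str As New StringBuilder()
            For i As Integer = 1 To reader.NumberOfPages
                ' Create the strategy per page; a reused LocationTextExtractionStrategy
                ' keeps the text of earlier pages and returns it again.
                Dim its As ITextExtractionStrategy = New LocationTextExtractionStrategy()
                Dim thePage As String = PdfTextExtractor.GetTextFromPage(reader, i, its)
                ' Collect the page text so it can be passed on after the loop.
                str.Append(thePage)
                Dim pdf31460Lines As String() = thePage.Split(ControlChars.Lf)
                For Each EachLine As String In pdf31460Lines
                    ' Remember the last line that contains the "SNEW" marker.
                    If EachLine.Contains("SNEW") Then
                        HeadLine = EachLine
                    End If
                Next
            Next
            ' InsertParsedFileHeader is my own routine (not shown here).
            InsertParsedFileHeader(str.ToString(), HeadLine)
        End Using
    End Sub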
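
    To make the goal concrete, here is a minimal sketch of the layout I am after. It does not use iTextSharp at all. It assumes each piece of text on a line is available together with its x-coordinate, which my parsing code above does not give me. PlaceChunks, the x values, and the unitsPerChar width are all hypothetical and made up for illustration.

    ```vbnet
    Imports System
    Imports System.Collections.Generic
    Imports System.Text

    Module ColumnLayoutSketch
        ' Hypothetical helper: puts each chunk at the character column derived
        ' from its x-coordinate, so a blank cell shows up as a run of spaces.
        Function PlaceChunks(chunks As IEnumerable(Of KeyValuePair(Of Single, String)),
                             unitsPerChar As Single) As String
            Dim line As New StringBuilder()
            For Each chunk As KeyValuePair(Of Single, String) In chunks
                Dim column As Integer = CInt(Math.Floor(chunk.Key / unitsPerChar))
                If line.Length < column Then
                    ' Pad with spaces up to the column where this chunk starts.
                    line.Append(" "c, column - line.Length)
                ElseIf line.Length > 0 Then
                    ' Chunks that would touch or overlap still get one separator.
                    line.Append(" "c)
                End If
                line.Append(chunk.Value)
            Next
            Return line.ToString()
        End Function

        Sub Main()
            ' Made-up x positions for the second table row (Col2 and Col4 blank).
            Dim row As New List(Of KeyValuePair(Of Single, String)) From {
                New KeyValuePair(Of Single, String)(0.0F, "2"),
                New KeyValuePair(Of Single, String)(96.0F, "3"),
                New KeyValuePair(Of Single, String)(180.0F, "Test7")
            }
            Console.WriteLine(PlaceChunks(row, 6.0F))
        End Sub
    End Module
    ```

    With positions like these, the blank cells would stay visible as gaps instead of collapsing to a single space. What I do not know is how to get such positions out of iTextSharp.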

    I have been searching for this for 3-4 days and couldn't find the right answer. I am using Visual Studio 2010; any help in C# or VB.NET will be appreciated.

    Any help will be greatly appreciated.

    Monday, September 15, 2014 6:18 PM


All replies