How to read and extract data from a PDF file in VB

  • Question

  • User869176912 posted

    Hi all,

    When I open the PDF file in a viewer, everything looks fine. But when I read and parse the same file in code, the extracted text contains extra characters or tags, so my code cannot find the specific string it is looking for.

    For example:

    When I open the pdf file I see this:

    Membership ID: 1111111

    But when I open and parse each line I get this:

    MembershipMembership ID:ID: <<MemberId>>1111111

    Can someone explain why those extra characters or tags are there? And how can I get rid of them, or account for them in my code, when I'm reading and parsing PDF files?

    I am currently using the Aspose.PDF library.

    Thank you 

    Wednesday, January 10, 2018 6:05 PM

Answers

  • User-1838255255 posted

    Hi MikeT89,

    Based on your description, please check the following tutorials on using iTextSharp or another library to extract data. Both include example code you can test:

    Read and Extract PDF Text in C# and VB.NET:

    https://www.gemboxsoftware.com/document/examples/c-sharp-read-pdf/305

    How to read a PDF file using iTextSharp in ASP.NET:

    http://www.devasp.net/net/articles/display/1447.html 

    Best Regards,

    Eric Du 

    • Marked as answer by Anonymous Thursday, October 7, 2021 12:00 AM
    Thursday, January 11, 2018 7:56 AM
  • User1536599556 posted

    You may try this code:

    Imports System
    Imports System.Text
    Imports GemBox.Document
    Imports System.Text.RegularExpressions
    
    Module Program
    
        Sub Main()
    
            ' If using Professional version, put your serial key below.
            ComponentInfo.SetLicense("FREE-LIMITED-KEY")
    
            Dim document As DocumentModel = DocumentModel.Load("CustomInvoice.pdf")
    
            Dim sb As New StringBuilder()
    
            ' Read PDF file's document properties.
            sb.AppendFormat("Author: {0}", document.DocumentProperties.BuiltIn(BuiltInDocumentProperty.Author)).AppendLine()
            sb.AppendFormat("DateContentCreated: {0}", document.DocumentProperties.BuiltIn(BuiltInDocumentProperty.DateLastSaved)).AppendLine()
    
            ' Sample's input parameter.
            Dim pattern As String = "(?<WorkHours>\d+)\s+(?<UnitPrice>\d+\.\d{2})\s+(?<Total>\d+\.\d{2})"
            Dim regex As Regex = New Regex(pattern)
    
            Dim row As Integer = 0
            Dim line As StringBuilder = New StringBuilder()
    
            ' Read PDF file's text content and match a specified regular expression.
            For Each match As Match In regex.Matches(document.Content.ToString())
                line.Length = 0
                ' VB has no ++ operator ("++row" is only unary plus), so increment explicitly.
                row += 1
                line.AppendFormat("Result: {0}: ", row)
    
                ' Either write only successfully matched named groups or entire match.
                Dim hasAny As Boolean = False
                For i As Integer = 1 To match.Groups.Count - 1
                    Dim groupName As String = regex.GroupNameFromNumber(i)
                    Dim matchGroup As Group = match.Groups(i)
                    If matchGroup.Success AndAlso groupName <> i.ToString() Then
                        line.AppendFormat("{0}= {1}, ", groupName, matchGroup.Value)
                        hasAny = True
                    End If
                Next
    
                If (hasAny) Then
                    line.Length -= 2
                Else
                    line.Append(match.Value)
                End If
    
                sb.AppendLine(line.ToString())
            Next
    
            Console.WriteLine(sb.ToString())
    
        End Sub
    
    End Module
    • Marked as answer by Anonymous Thursday, October 7, 2021 12:00 AM
    Wednesday, December 19, 2018 9:11 AM
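
The thread does not establish why the extracted line contains doubled words and a `<<MemberId>>` marker, so the following is only a sketch of one way to account for them after extraction. It assumes the extra text always follows the pattern in the question: placeholders wrapped in `<<...>>`, and whole tokens repeated back to back. `CleanLine` is a hypothetical helper, not part of Aspose.PDF or GemBox, and it uses only `System.Text.RegularExpressions`.

```vb
Imports System
Imports System.Text.RegularExpressions

Module CleanExtractedText

    ' Hypothetical helper: normalizes one line of extracted PDF text.
    ' Assumption: noise looks like "<<Name>>" markers and doubled tokens.
    Function CleanLine(raw As String) As String
        ' 1. Drop placeholder markers such as <<MemberId>>.
        Dim cleaned As String = Regex.Replace(raw, "<<[^<>]*>>", "")

        ' 2. Collapse a whitespace-delimited token that is exactly one string
        '    repeated twice (e.g. "ID:ID:"). Only tokens starting with a letter
        '    are touched, so numeric IDs such as "11111111" are left alone.
        cleaned = Regex.Replace(cleaned, "(?<!\S)([A-Za-z]\S*?)\1(?!\S)", "$1")

        ' 3. Normalize runs of whitespace.
        Return Regex.Replace(cleaned, "\s+", " ").Trim()
    End Function

    Sub Main()
        Dim raw As String = "MembershipMembership ID:ID: <<MemberId>>1111111"
        Dim cleaned As String = CleanLine(raw)
        Console.WriteLine(cleaned)

        ' The original search string can now be matched.
        Dim m As Match = Regex.Match(cleaned, "Membership ID:\s*(\d+)")
        If m.Success Then Console.WriteLine(m.Groups(1).Value)
    End Sub

End Module
```

Note that step 2 would also shorten a genuine word that happens to be a doubled string (for example "byebye"), so it is worth checking the pattern against real extracted lines before relying on it.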
