get data from pdf

  • Question

  • Hello,

    I'm wondering how I can get data from a PDF file.

    For example,

    I have the name of a person and their emails:

    steve

    steve@example.com; stevosky@example.com

    How can I get the emails steve@example.com; stevosky@example.com and split them, so that email1 = steve@example.com and email2 = stevosky@example.com?

    Please help me, it's important.

    Tuesday, April 21, 2015 11:36 AM

Answers

  • Hello,

    To start off: I have been extracting data from PDF documents for a long time using a third-party library, which is much better than the free libraries out there, but I won't go into it because of its price tag of over $1,000.

    I would suggest looking at the iTextSharp library, with code such as that below. It is best to install this library from NuGet inside Visual Studio: in Solution Explorer, right-click the solution name and select Manage NuGet Packages for Solution.

    The original code below came from Stack Overflow (see the second code block for VB.NET).

    using iTextSharp.text.pdf;
    using iTextSharp.text.pdf.parser;
    using System.IO;
    using System.Text; // needed for StringBuilder and Encoding
    
    public string ReadPdfFile(string fileName)
    {
        StringBuilder text = new StringBuilder();
    
        if (File.Exists(fileName))
        {
            PdfReader pdfReader = new PdfReader(fileName);
    
            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
    
                currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
                text.Append(currentText);
            }
            pdfReader.Close();
        }
        return text.ToString();
    }

    I took this and added it to a class library project, as shown below.

    Imports System.Text
    Imports System.IO
    Imports iTextSharp.text.pdf
    Imports iTextSharp.text.pdf.parser
    
    Public Class Extracter
        Public Property FileName As String
        Public Sub New()
    
        End Sub
        Public Function ReadPdfFile() As String
            Dim text As New StringBuilder()
    
            If File.Exists(FileName) Then
                Dim pdfReader As New PdfReader(FileName)
    
                For page As Integer = 1 To pdfReader.NumberOfPages
    
                    Dim strategy As ITextExtractionStrategy = New SimpleTextExtractionStrategy()
                    Dim currentText As String = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy)
    
                    currentText = Encoding.UTF8.GetString(
                        ASCIIEncoding.Convert(
                            Encoding.Default,
                            Encoding.UTF8,
                            Encoding.Default.GetBytes(currentText))
                    )
    
                    ' 4/21/2015 Karen Payne added the following
                    currentText = currentText.Replace(vbLf, Environment.NewLine)
    
                    text.Append(currentText)
                    text.AppendLine(Environment.NewLine)
    
                Next page
                pdfReader.Close()
            Else
                '
                ' You decide on how to handle file not found
                '
            End If
    
            Return text.ToString()
    
        End Function
    End Class
    

    I then created a console project, added a reference to the class library above, and simply extracted the text and saved it to a text file to confirm that it works.

    Imports iTextSharpHelper
    Module Module1
        Sub Main()
            Dim Extracter As New Extracter With
                {
                    .FileName = IO.Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "File1.pdf")
                }
            IO.File.WriteAllText(
                IO.Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "file1.txt"),
                Extracter.ReadPdfFile)
        End Sub
    End Module
    

    The method ReadPdfFile returns a string, which you can then search to find and extract text; see the String class for methods that find things in strings.

    Comment: the above, as is, treats the extracted text as a single string, while the library I use treats each page as a memory stream, with a single method to convert the stream to a List(Of String).
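
    As a rough illustration of the String-class approach, the sketch below assumes the extracted text has the name on one line and the addresses on the next non-empty line, as in the original question. FindLineAfter is a hypothetical helper, not part of iTextSharp, and the sample text only stands in for what ReadPdfFile would return.

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module FindAfterNameDemo
    ' Hypothetical helper: empty lines are dropped first, then the line that
    ' follows the one equal to "name" is returned (trimmed).
    ' Returns Nothing when the name is not found or has no following line.
    Function FindLineAfter(ByVal text As String, ByVal name As String) As String
        Dim lines As String() = text.Split({vbCrLf, vbLf}, StringSplitOptions.RemoveEmptyEntries)
        For i As Integer = 0 To lines.Length - 2
            If String.Equals(lines(i).Trim(), name, StringComparison.OrdinalIgnoreCase) Then
                Return lines(i + 1).Trim()
            End If
        Next
        Return Nothing
    End Function

    Sub Main()
        ' Sample text standing in for the string returned by ReadPdfFile.
        Dim pdfText As String = "steve" & vbCrLf & vbCrLf &
                                "steve@example.com; stevosky@example.com" & vbCrLf
        Console.WriteLine(FindLineAfter(pdfText, "steve"))
    End Sub
End Module
```

    Real PDFs rarely come out this tidy, so the layout assumption is the first thing to check against your own extracted text.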

    Hope this helps.


    Please remember to mark the replies as answers if they help and unmark them if they provide no help, this will help others who are looking for solutions to the same or similar problem. Contact via my webpage under my profile but do not reply to forum questions.

    • Proposed as answer by IronRazerz Tuesday, April 21, 2015 5:33 PM
    • Marked as answer by Ko0kiE Wednesday, April 22, 2015 4:01 PM
    Tuesday, April 21, 2015 1:13 PM
  • I have never tried reading from PDF files, but there are libraries available that allow you to work with them. The one I hear of most often is iTextSharp. If you need help using the library that you pick, you will need to find the support website for that product.

    If you need help splitting a string that looks like this "steve@example.com; stevosky@example.com", you can use the String.Split method like this.

    Dim combined As String = "steve@example.com; stevosky@example.com"
    Dim emails() As String = combined.Split("; ".ToCharArray, StringSplitOptions.RemoveEmptyEntries)

    The result will be a string array with two elements:

    • emails(0) will contain "steve@example.com"
    • emails(1) will contain "stevosky@example.com"
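
    If the spacing around the addresses varies, a slightly different sketch is to split on the semicolon only and trim each part. SplitEmails is a hypothetical helper name used just for this illustration; email1 and email2 are the variable names from the original question.

```vbnet
Imports System
Imports System.Collections.Generic

Module SplitEmailsDemo
    ' Splits on ";" only, then trims each part, so stray spaces or tabs
    ' around an address do not end up in the result.
    Function SplitEmails(ByVal combined As String) As List(Of String)
        Dim result As New List(Of String)
        For Each part As String In combined.Split(";"c)
            Dim trimmed As String = part.Trim()
            If trimmed.Length > 0 Then result.Add(trimmed)
        Next
        Return result
    End Function

    Sub Main()
        Dim emails As List(Of String) = SplitEmails("steve@example.com; stevosky@example.com")
        Dim email1 As String = emails(0)
        Dim email2 As String = emails(1)
        Console.WriteLine("email1=" & email1)
        Console.WriteLine("email2=" & email2)
    End Sub
End Module
```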


    • Edited by Blackwood Tuesday, April 21, 2015 1:28 PM
    • Marked as answer by Ko0kiE Wednesday, April 22, 2015 4:00 PM
    Tuesday, April 21, 2015 1:27 PM
  • In this example I used the small class that Karen provided, with a small change to pass the filename as a parameter instead of using a separate property.  I also used regular expressions to find all the email addresses in the text that is read from the PDF file and listed them in a RichTextBox.  You can experiment with it to make it work the way you want.

    Form1 Code - Form1 has 1 Button and 1 RichTextBox added to it.

    Imports System.Text.RegularExpressions
    
    Public Class Form1
        Private txtExtractor As New Extracter
    
        Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
            Using ofd As New OpenFileDialog
                ofd.Filter = "Pdf files|*.pdf"
                If ofd.ShowDialog = DialogResult.OK Then
                    'read the text from the pdf file
                    Dim pdftxt As String = txtExtractor.ReadPdfFile(ofd.FileName)
    
                    'find all email addresses 
                    Dim rgx As New Regex("([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})")
                    Dim mtchs As MatchCollection = rgx.Matches(pdftxt)
    
                    'adds each email address that was found to a RichTextBox
                    For Each m As Match In mtchs
                        RichTextBox1.AppendText(m.Value & vbNewLine)
                    Next
                End If
            End Using
        End Sub
    End Class
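
    To try the matching on its own, without a form or a PDF, here is a minimal console sketch. It reuses the pattern from the Form1 code above; FindEmails is a hypothetical helper, and the case-insensitive de-duplication is my own addition, not something the Form1 code does.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module EmailRegexDemo
    ' Same pattern as in the Form1 code above.
    Private ReadOnly EmailPattern As New Regex("([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})")

    ' Returns each address once, in order of first appearance,
    ' treating upper and lower case as the same address.
    Function FindEmails(ByVal text As String) As List(Of String)
        Dim seen As New HashSet(Of String)(StringComparer.OrdinalIgnoreCase)
        Dim result As New List(Of String)
        For Each m As Match In EmailPattern.Matches(text)
            If seen.Add(m.Value) Then result.Add(m.Value)
        Next
        Return result
    End Function

    Sub Main()
        Dim sample As String = "steve@example.com; STEVE@example.com; stevosky@example.com"
        For Each address As String In FindEmails(sample)
            Console.WriteLine(address)
        Next
    End Sub
End Module
```

    Note that the [a-zA-Z]{2,4} part of the pattern caps the last label at four letters, so an address ending in a longer top-level domain would be cut short.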
    


     

     The code I used for the Extracter class Karen posted:

    Imports System.Text
    Imports System.IO
    Imports iTextSharp.text.pdf
    Imports iTextSharp.text.pdf.parser
    
    Public Class Extracter
        Public Sub New()
        End Sub
    
        Public Function ReadPdfFile(ByVal filename As String) As String
            Dim text As New StringBuilder()
    
            If File.Exists(filename) Then
                Dim pdfReader As New PdfReader(filename)
    
                For page As Integer = 1 To pdfReader.NumberOfPages
                    Dim strategy As ITextExtractionStrategy = New SimpleTextExtractionStrategy()
                    Dim currentText As String = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy)
    
                    currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)))
    
                    currentText = currentText.Replace(vbLf, Environment.NewLine)
    
                    text.Append(currentText)
                    text.AppendLine(Environment.NewLine)
                Next
                pdfReader.Close()
            End If
    
            Return text.ToString()
        End Function
    End Class
     

     After opening a small PDF file I made that has 4 email addresses in it, this is the result.


    If you say it can`t be done then i`ll try it

    • Proposed as answer by KareninstructorMVP Tuesday, April 21, 2015 7:27 PM
    • Unproposed as answer by Ko0kiE Wednesday, April 22, 2015 2:47 PM
    • Marked as answer by Ko0kiE Wednesday, April 22, 2015 4:00 PM
    Tuesday, April 21, 2015 5:32 PM
  • Do you have all of the pdf file paths stored in a List(Of String) or an Array already? If you do then you can use a For Each loop to iterate through the file paths and get the links from each file.

     If you just want to be able to select more than one file using the OpenFileDialog, you can set its Multiselect property to True and then iterate through each file.

            Using ofd As New OpenFileDialog
                ofd.Filter = "Pdf files|*.pdf"
                ofd.Multiselect = True 'this allows you to select more than one file at a time
    
                If ofd.ShowDialog = DialogResult.OK Then
    
                    For Each fn As String In ofd.FileNames 'iterate through each filename that was selected
    
                        'read the text from the pdf file
                        Dim pdftxt As String = txtExtractor.ReadPdfFile(fn)
    
                        'find all email addresses 
                        Dim rgx As New Regex("([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})")
                        Dim mtchs As MatchCollection = rgx.Matches(pdftxt)
    
                        'adds each email address that was found to a RichTextBox
                        'you can add them to a List(Of String) instead, if you want
                        For Each m As Match In mtchs
                            RichTextBox1.AppendText(m.Value & vbNewLine)
                        Next
    
                    Next
    
                End If
            End Using
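
    Since each file has to go to the addresses found inside it, one option is to keep the results per file instead of in a RichTextBox. The sketch below is only an outline under some assumptions: CollectEmails is a hypothetical helper, readText stands in for txtExtractor.ReadPdfFile so that the code runs without iTextSharp, and the pattern is a simplified one for illustration.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module EmailsPerFileDemo
    ' Simplified pattern for illustration only.
    Private ReadOnly EmailPattern As New Regex("[A-Za-z0-9_\-\.]+@([A-Za-z0-9\-]+\.)+[A-Za-z]{2,}")

    ' Maps each file name to the list of addresses found in its text.
    ' readText is a stand-in for the extractor call.
    Function CollectEmails(ByVal fileNames As IEnumerable(Of String),
                           ByVal readText As Func(Of String, String)) As Dictionary(Of String, List(Of String))
        Dim byFile As New Dictionary(Of String, List(Of String))
        For Each fn As String In fileNames
            Dim found As New List(Of String)
            For Each m As Match In EmailPattern.Matches(readText(fn))
                found.Add(m.Value)
            Next
            byFile(fn) = found
        Next
        Return byFile
    End Function

    Sub Main()
        ' Fake "PDF text" keyed by file name, so the sketch runs without iTextSharp.
        Dim fakeText As New Dictionary(Of String, String) From {
            {"file1.pdf", "steve@example.com; stevosky@example.com"},
            {"file2.pdf", "no addresses here"}
        }
        Dim result = CollectEmails(fakeText.Keys, Function(fn) fakeText(fn))
        For Each pair In result
            Console.WriteLine(pair.Key & ": " & String.Join(", ", pair.Value))
        Next
    End Sub
End Module
```

    In the loop above you would pass ofd.FileNames and AddressOf txtExtractor.ReadPdfFile instead of the fake data.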


    • Edited by IronRazerz Wednesday, April 22, 2015 3:23 PM
    • Marked as answer by Ko0kiE Wednesday, April 22, 2015 3:33 PM
    Wednesday, April 22, 2015 3:22 PM

All replies

  • @Blackwood, the code I posted uses iTextSharp.

    Tuesday, April 21, 2015 1:42 PM
  • I'm trying to test it but I can't find the iTextSharpHelper.
    Tuesday, April 21, 2015 1:42 PM
  • @Blackwood, the code I posted uses iTextSharp.

    Sorry. I didn't see your reply (in fact, I could have sworn there were no replies) or I would not have answered.
    Tuesday, April 21, 2015 2:09 PM
  • That is because you need to create it as a class project. Here is the one I created for my first reply


    Tuesday, April 21, 2015 2:15 PM
  • @Blackwood, the code I posted uses iTextSharp.


    Sorry. I didn't see your reply (in fact, I could have sworn there were no replies) or I would not have answered.
    I must have been in stealth mode :-) 

    Tuesday, April 21, 2015 2:16 PM
  • Sorry for my ignorance, but how do I add that class to my project?

    I studied VB in school, but only now am I really learning it, sorry.

    Tuesday, April 21, 2015 2:24 PM
  • Sorry for my ignorance, but how do I add that class to my project?

    I studied VB in school, but only now am I really learning it, sorry.

    You add the project to the solution. From the IDE menu select View -> Solution Explorer. Next, right-click the solution name in Solution Explorer, then select Add and then Existing Project.

    Once this is done, add a reference to the class project in your own project. Then, wherever it will be used, add this line as the first line in the file (form, code module or class):

    Imports iTextSharpHelper

    EDIT

    I will be unavailable for most of today, working with my team, so I may not respond for a while to any questions after the next hour.

    Tuesday, April 21, 2015 2:29 PM
  • Dim num As Integer = 0
            Dim extracter As New extracter
            num = InputBox("Quantos documentos existem?")
            For i = 1 To num - 1
    
                With extracter
    
    
                    .filename = IO.Path.Combine(AppDomain.CurrentDomain.BaseDirectory, num & "_Cotações da Semana 17 MAIL 20-04 a 26-04.pdf ")
    
                End With
                IO.File.WriteAllText(IO.Path.Combine(AppDomain.CurrentDomain.BaseDirectory, num & "_Cotações da Semana 17 MAIL 20-04 a 26-04.text"), extracter.readpdffile)
                f_data.ListBox1.Items.Add(num & "_Cotações da Semana 17 MAIL 20-04 a 26-04")
            Next
        End Sub

    I'm doing that because I have 400 PDF files to export data from. It's a file with prices that I'm sending to 400 people, and I need to get the email addresses from those files so that I can send each PDF to them. Am I making sense?

    Tuesday, April 21, 2015 2:37 PM
  • Thanks Blackwood, your code will help me once I get the emails from the files!

    Tuesday, April 21, 2015 2:38 PM
  • Not sure if there is a question here, but if you mean parsing for information, that is all about using the String class.

    Tuesday, April 21, 2015 3:30 PM
  • Sorry, I'm doing this:

     Dim num As Integer = 0
            Dim extracter As New extracter
            num = InputBox("Quantos documentos existem?")
            For i = 1 To num

                With extracter


                    .FileName = IO.Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "" & i & "_Cotações da Semana 17 MAIL 20-04 a 26-04.pdf ")

                End With
                IO.File.WriteAllText(IO.Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "" & i & "_Cotações da Semana 17 MAIL 20-04 a 26-04.text"), extracter.ReadPdfFile)
                f_data.ListBox1.Items.Add(i & "_Cotações da Semana 17 MAIL 20-04 a 26-04")
            Next

    and it gives me an error: Could not find image data or EI

    The question about the Extracter class is: how do I find the email address in the PDF?

    Tuesday, April 21, 2015 3:43 PM
  • Did you get those emails from the PDF file, or did you type them in yourself?
    Tuesday, April 21, 2015 5:52 PM
  • Did you get those emails from the PDF file, or did you type them in yourself?

     I got them from the text that was read from the PDF file.

     The Extracter class reads the text from the PDF file.  I store that text in a string and then use regular expressions to find all the email addresses in the string.  Then I display them in a RichTextBox.

    • Edited by IronRazerz Tuesday, April 21, 2015 6:12 PM
    Tuesday, April 21, 2015 6:08 PM
  • It is giving me this error:

     Could not find image data or EI

    Tuesday, April 21, 2015 6:14 PM
  • I don't have any clue what those errors are from.  Maybe if you post your code and show which line the error is on, it would help someone spot the problem.

    Tuesday, April 21, 2015 6:43 PM
  • It's difficult to know what the exception is without being able to see your code and work with your documents, but between what I provided and what IronRazerz provided, you have the pieces to work through the problem at hand.

    As mentioned earlier, I have done a lot of extracting data from PDF documents, and so I know all too well that parsing the data can be troublesome.

    Tuesday, April 21, 2015 7:31 PM
  • Actually, I would save the PDF as a text file, and import the text.  Of course, you'll get the 'text' stuff and not the images.

    So, you want email addresses.  What do you want to do, exactly?  Import these emails into a table in a database?  Import them all into a DataGridView?  Collect them all and move them all into an Excel file perhaps?


    Knowledge is the only thing that I can give you, and still retain, and we are both better off for it.

    Tuesday, April 21, 2015 8:03 PM
  • I will post the code. Just to answer your question, ryguy72: I need to get the emails from the PDF files and send each file to the emails that I get from it:

    file1-email@email.com

    file1-email2@email.com

    file2-emailot@email.com

    Am I making sense?

    Let me post the code for getting the emails from the PDF.

    Wednesday, April 22, 2015 8:31 AM
  • 

    Imports System.Text
    Imports System.IO
    Imports iTextSharp.text.pdf
    Imports iTextSharp.text.pdf.parser

    Public Class extracter
        ' the name of the file that is saved on my release directory
        Public filename As String = "1_Cotações da Semana 17 MAIL 20-04 a 26-04.pdf"

        Public Function ReadPdfFile(ByVal filename As String) As String
            Dim text As New StringBuilder
            If File.Exists(filename) Then
                Dim pdfreader As New PdfReader(filename)
                For page As Integer = 1 To pdfreader.NumberOfPages
                    Dim strategy As ITextExtractionStrategy = New SimpleTextExtractionStrategy()
                    'error on line below
                    Dim currenttext As String = PdfTextExtractor.GetTextFromPage(pdfreader, page, strategy)
                    currenttext = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currenttext)))
                    currenttext = currenttext.Replace(vbLf, Environment.NewLine)
                    text.Append(currenttext)
                    text.AppendLine(Environment.NewLine)
                Next page
                pdfreader.Close()
            Else
                MessageBox.Show("Ficheiro não encontrado! Contacte informatica@lubrfuel.pt!", "Mensagem", MessageBoxButtons.OK, MessageBoxIcon.Exclamation)
            End If
            Return text.ToString
        End Function
    End Class

    more information about the error:

    An unhandled exception of type 'iTextSharp.text.pdf.parser.InlineImageUtils.InlineImageParseException' occurred in itextsharp.dll

    Additional information: Could not find image data or EI



    • Edited by Ko0kiE Wednesday, April 22, 2015 8:37 AM
    Wednesday, April 22, 2015 8:33 AM
  • Hello,

    I am not going to address the error you are getting, as it comes from the iTextSharp library, not from the code supplied. What I will address is picking out text via a LINQ/lambda statement. In this case I picked a random PDF document and something to find in it, here Imports, and then get the text to its right.

    Sample from a five page PDF where the highlighted text is what I want

    So keeping with the function in iTextSharpHelper I changed the calling code as follows

    The LINQ/lambda is complex if you have never done anything like this. In short, I create a List(Of LineData) holding the text and line number of every line in the data returned from the iTextSharp helper, then use a Where condition to return only the LineData whose line contains "Imports"; each of those lines is then split to get the text after Imports. In your case you would need logic that searches for emailx, where x is an integer, which I took directly from your original question.

     
    Imports iTextSharpHelper
    Module Module1
        Sub Main()
            Dim Extracter As New Extracter With
                {
                    .FileName = IO.Path.Combine(
                        AppDomain.CurrentDomain.BaseDirectory,
                        "File1.pdf")
                }
    
            Dim Result =
                  (
                      From T In Extracter.ReadPdfFile.Split(
                      {Environment.NewLine}, StringSplitOptions.None).ToList _
                      .Select(
                      Function(l, i)
                          Return New LineData With
                                 {
                                     .Index = i,
                                     .Line = l.Replace(Environment.NewLine, "")
                                 }
                      End Function) _
                      .Where(
                      Function(data)
                          Return data.Line.Contains("Imports")
                      End Function)
                  ).ToList
    
    
            If Result.Count > 0 Then
    
                Dim arr As String() = {}
                For i As Integer = 0 To Result.Count - 1
                    arr = Result(i).Line.Split(" "c)
                    If arr.Count > 1 Then
                        Console.WriteLine("Line: {0} data: {1}",
                                          Result(i).Index, arr(1))
                    End If
                Next
    
            End If
    
            Console.ReadLine()
        End Sub
    End Module
    Public Class LineData
        Public Property Index As Integer
        Public Property Line As String
        Public Sub New()
        End Sub
    End Class
    

    NOTE: Yes, I know you are not looking for Imports, but this is the best I can do to show how one might go about parsing a PDF document. The logic is sound and can work for you, but an effort on your part is required.
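
    For what it is worth, here is a rough sketch of the same token idea with "@" as the token instead of Imports, written with plain loops rather than LINQ to keep it short. FindEmailLines is a hypothetical helper, and the sample text is made up; it only stands in for the string returned by ReadPdfFile.

```vbnet
Imports System
Imports System.Collections.Generic
Imports Microsoft.VisualBasic

Module EmailLinesDemo
    ' Returns "line index: address" entries for every line that contains "@".
    ' Lines are split on ";" because the question shows addresses separated that way.
    Function FindEmailLines(ByVal text As String) As List(Of String)
        Dim found As New List(Of String)
        Dim lines As String() = text.Split({vbCrLf, vbLf}, StringSplitOptions.None)
        For i As Integer = 0 To lines.Length - 1
            If Not lines(i).Contains("@") Then Continue For
            For Each part As String In lines(i).Split(";"c)
                Dim candidate As String = part.Trim()
                If candidate.Contains("@") Then
                    found.Add(String.Format("{0}: {1}", i, candidate))
                End If
            Next
        Next
        Return found
    End Function

    Sub Main()
        ' Made-up sample text in place of real extracted PDF text.
        Dim pdfText As String = "Price list" & vbLf & "steve" & vbLf &
                                "steve@example.com; stevosky@example.com"
        For Each entry As String In FindEmailLines(pdfText)
            Console.WriteLine("Line " & entry)
        Next
    End Sub
End Module
```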


    Please remember to mark the replies as answers if they help and unmark them if they provide no help, this will help others who are looking for solutions to the same or similar problem. Contact via my webpage under my profile but do not reply to forum questions.


    Wednesday, April 22, 2015 1:13 PM
  • But how do I solve this?

    An unhandled exception of type 'iTextSharp.text.pdf.parser.InlineImageUtils.InlineImageParseException' occurred in itextsharp.dll

    Additional information: Could not find image data or EI

    Wednesday, April 22, 2015 1:40 PM
  • For that exception, ask in the following forum:

    http://support.itextpdf.com/forum



    Wednesday, April 22, 2015 2:00 PM
  • How does the code you posted help me get the emails from the PDF?

    Thanks for your time, I owe you a coffee :p

    Wednesday, April 22, 2015 2:26 PM
  • It helps by showing how to parse for data in text extracted from a PDF document. In my example I used Imports as the token to find text; in your case you would focus on the email addresses rather than Imports. As mentioned before, I don't have your PDF, and you still need to resolve the internal error, which I recommended asking about in the iTextSharp support forum.



    Wednesday, April 22, 2015 2:42 PM
  • I tried this but without any success; it's giving me that error:

    An unhandled exception of type 'iTextSharp.text.pdf.parser.InlineImageUtils.InlineImageParseException' occurred in itextsharp.dll

     
    Wednesday, April 22, 2015 2:48 PM
  • I solved the problem. I can get the emails from the PDF, but only one file at a time. How can I make it read a number of files?

    I'm doing this:

            Using ofd As New OpenFileDialog
                ofd.Filter = "Pdf files|*.pdf"
                If ofd.ShowDialog = DialogResult.OK Then
                    'read the text from the pdf file
                    Dim pdftxt As String = txtExtractor.ReadPdfFile(ofd.FileName)
    
                    'find all email addresses 
                    Dim rgx As New Regex("([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})")
                    Dim mtchs As MatchCollection = rgx.Matches(pdftxt)
    
                    'adds each email address that was found to a RichTextBox
                    For Each m As Match In mtchs
                        RichTextBox1.AppendText(m.Value & vbNewLine)
                    Next
                End If
            End Using

    It works!

    But I have to select each file myself.

    Wednesday, April 22, 2015 3:07 PM
  • Do you already have all of the PDF file paths stored in a List(Of String) or an array? If you do, you can use a For Each loop to iterate through the file paths and get the email addresses from each file.

    If you just want to be able to select more than one file using the OpenFileDialog, you can set its Multiselect property to True and then iterate through each selected file.

            Using ofd As New OpenFileDialog
                ofd.Filter = "Pdf files|*.pdf"
                ofd.Multiselect = True 'this allows you to select more than one file at a time
    
                If ofd.ShowDialog = DialogResult.OK Then
    
                    For Each fn As String In ofd.FileNames 'iterate through each filename that was selected
    
                        'read the text from the pdf file
                        Dim pdftxt As String = txtExtractor.ReadPdfFile(fn)
    
                        'find all email addresses 
                        Dim rgx As New Regex("([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})")
                        Dim mtchs As MatchCollection = rgx.Matches(pdftxt)
    
                        'adds each email address that was found to a RichTextBox
                        'you can add them to a List(Of String) instead, if you want
                        For Each m As Match In mtchs
                            RichTextBox1.AppendText(m.Value & vbNewLine)
                        Next
    
                    Next
    
                End If
            End Using
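
    If the results should end up in a collection rather than a RichTextBox, the loop above could be wrapped up roughly as follows. This is only a sketch under some assumptions: CollectEmails and BatchEmailExtraction are made-up names, readText stands in for whatever function returns a PDF's text (for example txtExtractor.ReadPdfFile, assuming it takes a path and returns a String), and the pattern is the same one used in the thread. A file that throws while being read is reported and skipped so one bad PDF does not stop the whole batch.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module BatchEmailExtraction
    ' Same pattern as in the thread, created once instead of once per file.
    Private ReadOnly EmailPattern As New Regex("([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})")

    ' Maps each file path to the distinct addresses found in its text.
    ' readText is a hypothetical stand-in for the PDF text extractor.
    Function CollectEmails(filePaths As IEnumerable(Of String),
                           readText As Func(Of String, String)) As Dictionary(Of String, List(Of String))
        Dim results As New Dictionary(Of String, List(Of String))

        For Each filePath As String In filePaths
            Dim found As New List(Of String)
            Try
                For Each m As Match In EmailPattern.Matches(readText(filePath))
                    ' Keep each address only once per file.
                    If Not found.Contains(m.Value) Then found.Add(m.Value)
                Next
            Catch ex As Exception
                ' A file that cannot be read is reported and left with an empty list.
                Console.WriteLine("Skipped {0}: {1}", filePath, ex.Message)
            End Try
            results(filePath) = found
        Next

        Return results
    End Function
End Module
```

    With the dialog from above, a call might look like CollectEmails(ofd.FileNames, AddressOf txtExtractor.ReadPdfFile). Keeping the path as the dictionary key makes it easy to show which file each address came from.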


    If you say it can't be done then I'll try it

    • Edited by IronRazerz Wednesday, April 22, 2015 3:23 PM
    • Marked as answer by Ko0kiE Wednesday, April 22, 2015 3:33 PM
    Wednesday, April 22, 2015 3:22 PM
  • I'm doing this to "create fields":

     Private Sub Form1_Load(sender As Object, e As EventArgs) Handles MyBase.Load
            ListBox1.Items.Add("Emails")
            ListBox2.Items.Add("Document Number")
        End Sub

    And this to read the files and write to the different list boxes.

    I created an OpenFileDialog and set Multiselect to True, but it only gives me one email, repeated.

    If I select 4 files, it gives me the same email 4 times, and the document number appears 4 times too.

       ofd.Filter = "Pdf files|*.pdf"
            If ofd.ShowDialog = DialogResult.OK Then
                'read the text from the pdf file
                For Each fn As String In ofd.FileNames
    
                    Dim pdftxt As String = txtExtractor.ReadPdfFile(ofd.FileName)
    
                    'find all email addresses 
                    Dim rgx As New Regex("([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})")
                    Dim mtchs As MatchCollection = rgx.Matches(pdftxt)
    
                    'adds each email address and its file name to the list boxes
    
                    For Each m As Match In mtchs
                       
    
                            ListBox1.Items.Add(m.Value)
                            ListBox2.Items.Add(ofd.FileName)
                      
                    Next
                Next
            End If


    • Marked as answer by Ko0kiE Wednesday, April 22, 2015 3:33 PM
    • Unmarked as answer by Ko0kiE Wednesday, April 22, 2015 3:33 PM
    Wednesday, April 22, 2015 3:31 PM
  • Can you create a mock PDF in which the only change is fake email addresses, so we can better assist you?


    Wednesday, April 22, 2015 3:32 PM
  • I saw the error :D I was still passing ofd.FileName inside the loop instead of fn.

     ofd.Filter = "Pdf files|*.pdf"
            If ofd.ShowDialog = DialogResult.OK Then
                'read the text from the pdf file
                For Each fn As String In ofd.FileNames
    
                    Dim pdftxt As String = txtExtractor.ReadPdfFile(fn)
    
                    'find all email addresses 
                    Dim rgx As New Regex("([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})")
                    Dim mtchs As MatchCollection = rgx.Matches(pdftxt)
    
                    'adds each email address and its file name to the list boxes
    
                    For Each m As Match In mtchs
                      
    
                            ListBox1.Items.Add(m.Value)
                            ListBox2.Items.Add(fn)
                       
                    Next
                Next
            End If

    Wednesday, April 22, 2015 3:33 PM
  • I can only choose one as the answer, but I want to say thanks to you both for helping me!

    This is really important!

    Have a nice day. If I ever go to the US, I'll buy you a coffee :p

    Wednesday, April 22, 2015 3:35 PM
  • You can choose more than one answer if you wish, since several people assisted in the solution, e.g. my code to read the data in.


    Wednesday, April 22, 2015 3:48 PM
  • I'm such a noob, hahaha

    Done, thank you all!

    Wednesday, April 22, 2015 4:01 PM