Answered read .pdf file

  • Monday, March 05, 2012 3:08 PM
     
     

    Hi,
    At present, my code can access a .pdf file and read a few properties.
    Question:
    How can I extend my C# code to read the text inside the txtLastname control, which is on the second page of the .pdf file?
    Thanks

    Existing code:

    PdfReader reader = new PdfReader(@"D:\test.pdf");
    // total number of pages
    int n = reader.NumberOfPages;
    // size of the first page
    Rectangle psize = reader.GetPageSize(1);

    // file properties
    Dictionary<string, string> infodict = reader.Info;
    foreach (KeyValuePair<string, string> kvp in infodict)
        Console.WriteLine(kvp.Key + " => " + kvp.Value);
                   

All Replies

  • Monday, March 05, 2012 6:13 PM
     
     

    There are a number of open-source projects for reading and writing PDF files. You can look at these projects and use whichever fits your requirements.

    http://pdflib.codeplex.com/
    http://sourceforge.net/projects/itextsharp/
    http://pdfsharp.codeplex.com/releases/view/37054


    Gaurav Khanna | Microsoft VB.NET MVP

  • Tuesday, March 06, 2012 7:18 AM
    Moderator
     
     
    Hi arkiboys,

    If I understand correctly, you are using iText.NET, is that right?
    If so, please close this thread by marking any useful replies as answers, and then post in a dedicated forum such as the iText.NET Forum: http://sourceforge.net/projects/itextdotnet/forums/forum/268615.

    Otherwise you need to find other dedicated forums to post your question in.
    Thanks for your understanding.

    Have a nice day,

    Leo Liu [MSFT]
    MSDN Community Support | Feedback to us

  • Sunday, March 11, 2012 9:47 AM
     
     Answered Has Code

    arkiboys,

    You can parse the page by using the following code:

    PdfReader PDFreader = new PdfReader("input file");
    // Region of the page to extract text from
    RectangleJ rect = new RectangleJ(
        x: 100f,     // distance from the left edge of the page
        y: 100f,     // distance from the bottom edge of the page
        width: 25f,  // width of the region
        height: 25f  // height of the region
        );

    // Only text that falls inside the region passes the filter
    RenderFilter filter = new RegionTextRenderFilter(rect);

    ITextExtractionStrategy strategy;
    strategy = new FilteredTextRenderListener(
        deleg: new LocationTextExtractionStrategy(),
        filters: new RenderFilter[] { filter }
        );

    string foundText = PdfTextExtractor.GetTextFromPage(
        reader: PDFreader,
        pageNumber: 1,   // page numbers start at 1; use 2 for the second page
        strategy: strategy
        );

    Michéle Johl

  • Tuesday, March 13, 2012 6:21 AM
    Moderator
     
     
    Hi arkiboys,

    Could you please verify the solution provided by Michéle?
    We are looking forward to hearing from you.

    Have a nice day,

    Leo Liu [MSFT]
    MSDN Community Support | Feedback to us

  • Tuesday, March 13, 2012 11:36 AM
     
     

    Sorry,

    I forgot to mention that I used a library called iTextSharp to read the PDF document.

    I use it so often that I almost consider it part of .NET.


    Michéle Johl

  • Wednesday, March 21, 2012 4:02 PM
     
      Has Code
    Hi,
    My goal is to read the text typed into the controls in the .pdf file.
    I am using the iTextSharp library.
    The C# code below only reads the text that was originally part of the file, but NOT the text typed into the controls.
    Do you know how to solve this, please?

    private void button2_Click(object sender, EventArgs e)
    {
        string strText = string.Empty;
        string filename = @"C:\pdfTestFile.pdf";
        PdfReader pdfReader = new PdfReader(filename);
        for (int nPage = 1; nPage <= pdfReader.NumberOfPages; nPage++)
        {
            ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy();
            PdfReader reader2 = new PdfReader(filename);
            String s = PdfTextExtractor.GetTextFromPage(reader2, nPage, its);

            s = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
            strText = strText + s;
            reader2.Close();
        }
        pdfReader.Close();
    }

    Thanks
  • Thursday, March 22, 2012 9:20 PM
     
     Proposed Has Code

    Hi,

    Sorry for the late reply, I have been rebuilding my Laptop...
    The following code might be of use for you.

    string pdfTemplate = "my.pdf"; 
    PdfReader pdfReader = new PdfReader(pdfTemplate); 
    AcroFields fields = pdfReader.AcroFields.Fields; 
    string val = fields.GetField("fieldname");

    If you do not know what fields are in the PDF document you can extract all form fields with the following:

    private void ListPdfFormFields()
    {
        string pdfTemplate = Application.StartupPath + "\\a.pdf";
        lstfields.Items.Clear();
        // create a new PDF reader based on the PDF template document
        PdfReader pdfReader = new PdfReader(pdfTemplate);
        // add the name of each field in the subject PDF to the list
        foreach (DictionaryEntry de in pdfReader.AcroFields.Fields)
        {
            var currentfield = de.Key.ToString();
            currentfield = ParseFormField(currentfield);
            lstfields.Items.Add(currentfield);
        }
        lstfields.Sorted = true;
    }

    private string ParseFormField(string fieldname)
    {
        fieldname = fieldname.Replace("form1[0].#subform[0].", "");
        fieldname = fieldname.Replace("[0]", "");
        // TrimEnd returns a new string, so the result must be kept
        return fieldname.TrimEnd();
    }

    Hope this puts you on the right path.

    ;)
    Michéle


    Michéle Johl

    • Proposed As Answer by Michéle Johl Friday, March 23, 2012 10:28 AM
  • Thursday, March 22, 2012 9:43 PM
     
     
    Can I ask how and where I can get the "PdfReader" class (a DLL file)?

    Mitja

  • Friday, March 23, 2012 10:28 AM
     
     

    Hi Mitja,

    The PdfReader class is part of the iTextSharp library.
    You can download the latest version from the following link:

    iTextSharp - Sourceforge

    Cheers,


    Michéle Johl

  • Friday, March 23, 2012 3:15 PM
     
     

    Thank you very much. 

    Voted  +1 :)


    Mitja

  • Friday, March 23, 2012 4:49 PM
     
      Has Code

    Hi,

    I get an error as follows:
    Error 2 Cannot implicitly convert type 'System.Collections.Generic.IDictionary<string,iTextSharp.text.pdf.AcroFields.Item>' to 'iTextSharp.text.pdf.AcroFields'. An explicit conversion exists (are you missing a cast?) 

  • Friday, March 23, 2012 7:10 PM
     
     

    Hi,

    Could you possibly send me a link to a sample PDF document and tell me which field you would like to read?
    I will write some proper code for you.

    Cheers,


    Michéle Johl

  • Friday, March 23, 2012 7:35 PM
     
      Has Code

    Hi again,

    I tested the following code with a sample interactive PDF document from adobe.com and it works well.
    With the new version of iTextSharp, DictionaryEntry is no longer the correct type. (The code I gave you is 4 years old.)
    I downloaded the latest version (5.2.0), changed DictionaryEntry to KeyValuePair, and it is now working like a peach.

    Sample Code:

    using System;
    using System.Collections.Generic;
    using iTextSharp.text.pdf;
    
    namespace ReadPDFFields
    {
        internal class Program
        {
            private static void Main(string[] args)
            {
                ListPDFFormFields(@"C:\Sample.pdf");
            }
    
            private static void ListPDFFormFields(string pdfFile)
            {
                Console.Clear();
    
                PdfReader reader = new PdfReader(pdfFile);
    
                foreach (KeyValuePair<string, AcroFields.Item> de in reader.AcroFields.Fields)
                {
                    var currentField = de.Key.ToString();
                    currentField = ParseFormField(currentField);
                    Console.WriteLine(currentField);
                }
            }
    
            private static string ParseFormField(string fieldname)
            {
                fieldname = fieldname.Replace("form1[0].#subform[0].", "");
                fieldname = fieldname.Replace("[0]", "");
                // TrimEnd returns a new string, so the result must be kept
                return fieldname.TrimEnd();
            }
        }
    }

    Shout if you still get stuck...

    Michéle Johl
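
A side note on the ParseFormField helper used in this thread: strings in C# are immutable, so TrimEnd() returns a new string and leaves the original untouched. Its result has to be assigned or returned to have any effect. The sketch below shows the difference without needing iTextSharp. The class name FieldNames and the sample field name are made up for illustration, and the prefix "form1[0].#subform[0]." is simply the one from the code above, so other documents may use different names.

```csharp
using System;

public static class FieldNames
{
    // Strips the prefix and the "[0]" index suffixes from a field name,
    // then removes trailing whitespace.
    public static string Parse(string fieldName)
    {
        fieldName = fieldName.Replace("form1[0].#subform[0].", "");
        fieldName = fieldName.Replace("[0]", "");
        // TrimEnd returns a new string; return it instead of discarding it.
        return fieldName.TrimEnd();
    }

    public static void Main()
    {
        // Hypothetical raw name with two trailing spaces.
        string raw = "form1[0].#subform[0].firstname[0]  ";

        raw.TrimEnd(); // no effect: the returned string is thrown away
        Console.WriteLine("[" + raw + "]");        // still has the trailing spaces
        Console.WriteLine("[" + Parse(raw) + "]"); // prints "[firstname]"
    }
}
```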

  • Monday, March 26, 2012 9:07 AM
     
      Has Code

    Now that I can get the names of the fields (for example, the first field name it picks up is "firstname"), how can I get the value of that control? It shows "james" in that control.

    Thanks
  • Monday, March 26, 2012 10:32 AM
     
     Answered Has Code

    Hi,

    The following code will do the trick:

    string pdfTemplate = @"C:\Sample.pdf";
    PdfReader pdfReader = new PdfReader(pdfTemplate);
    
    // Get the form
    AcroFields form = pdfReader.AcroFields;

    // Go through all fields in the form
    foreach (var field in form.Fields)
    {
        // Get the field's value
        string value = form.GetField(field.Key);

        // Print the field name and its value
        Console.WriteLine("{0}, {1}",
            field.Key,
            value);
    }


    Michéle Johl

    • Marked As Answer by arkiboys Monday, March 26, 2012 1:07 PM
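
To get back to the original question (reading one specific control such as txtLastname), the field keys seen earlier in the thread can be long, for example "form1[0].#subform[0].firstname[0]". A possible approach is to look up the full key from the short name first. The sketch below is only an illustration under that assumption: the FieldLookup class, the FindFullName helper and the sample keys are hypothetical and not part of iTextSharp.

```csharp
using System;
using System.Collections.Generic;

public static class FieldLookup
{
    // Returns the first full field name whose last dotted segment, with any
    // index suffix such as "[0]" removed, equals shortName; null if none match.
    public static string FindFullName(IEnumerable<string> fullNames, string shortName)
    {
        foreach (string fullName in fullNames)
        {
            // Keep only the part after the last '.', e.g. "txtLastname[0]".
            string last = fullName.Substring(fullName.LastIndexOf('.') + 1);

            // Drop an index suffix like "[0]" if present.
            int bracket = last.IndexOf('[');
            if (bracket >= 0)
                last = last.Substring(0, bracket);

            if (string.Equals(last, shortName, StringComparison.Ordinal))
                return fullName;
        }
        return null;
    }

    public static void Main()
    {
        // Hypothetical keys, shaped like the names seen in the thread.
        var keys = new List<string>
        {
            "form1[0].#subform[0].firstname[0]",
            "form1[0].#subform[0].txtLastname[0]"
        };

        Console.WriteLine(FindFullName(keys, "txtLastname"));
    }
}
```

With iTextSharp, the candidate names would be form.Fields.Keys and the value would then come from form.GetField(fullName), as in the loop above. Form fields are looked up by name rather than by page, so it should not matter that the control sits on the second page.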
  • Monday, March 26, 2012 1:07 PM
     
      Has Code

    Thank you