parse pdf file

  • Question

  • User1080785583 posted

    I want to parse a pdf file and locate the xml inside of the document. Suggestions?

    Saturday, February 21, 2015 11:21 PM

Answers

  • User-434868552 posted

    @xequence  TIMTOWTDI

    Depending on how your data is structured and given that this seems like a one time event for you, you might be successful by simply exporting to text and then parsing your text file.

    .pdf files look a bit messy inside.

    There are products, for example https://bytescout.com/products/developer/pdfextractorsdk/index.html, some of which have demo versions, but the demo versions may be restricted in the number of pages that they can handle.

    you might want to try searches like:

    extract xml from pdf

    msdn .net .pdf to .xml

    et cetera

    • Marked as answer by Anonymous Thursday, October 7, 2021 12:00 AM
    Sunday, February 22, 2015 6:04 PM
  • User-271186128 posted

    Hi xequence,

    For this issue, I suggest reading the PDF content with iTextSharp and then pulling the XML data out of the extracted text with string or Regex methods. Here are some relevant articles, please refer to them.

    How to read PDF content using iTextSharp in .NET

    How to Extract Text From PDF File Using C#.Net

    String Methods

    Regex Methods
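The "extract the text, then use string or Regex methods" step can be sketched roughly as follows. This is only an illustration in Python, not iTextSharp code: it assumes the PDF text has already been extracted into a string, and the names `OPEN_TAG`, `extract_xml_fragments` and `sample` are made up for the example. The same idea should carry over to .NET with `Regex` for the search and `XDocument.Parse` for the well-formedness check.

```python
import re
import xml.etree.ElementTree as ET

# Matches an opening tag such as <root> or <ns:root attr="x"> and captures its name.
OPEN_TAG = re.compile(r"<([A-Za-z_][\w.\-]*(?::[\w.\-]+)?)(?:\s[^<>]*)?>")


def extract_xml_fragments(text):
    """Return the well-formed XML fragments found in plain text, in order."""
    fragments = []
    pos = 0
    while True:
        match = OPEN_TAG.search(text, pos)
        if match is None:
            break
        closing = "</" + match.group(1) + ">"
        end = text.find(closing, match.end())
        found = False
        # Try each closing tag in turn so nested same-name elements still parse.
        while end != -1:
            candidate = text[match.start():end + len(closing)]
            try:
                ET.fromstring(candidate)  # raises ParseError if not well formed
            except ET.ParseError:
                end = text.find(closing, end + 1)
                continue
            fragments.append(candidate)
            pos = end + len(closing)
            found = True
            break
        if not found:
            # No parsable fragment starts here; resume after this tag.
            pos = match.end()
    return fragments


# Hypothetical extracted text: page furniture and prose mixed with XML.
sample = (
    "Page 12 of 275\n"
    '<order id="7"><item>widget</item></order>\n'
    "Some prose with 3 < 5 in it.\n"
    "<note>done</note>"
)

for fragment in extract_xml_fragments(sample):
    print(fragment)
```

Validating each candidate with a real XML parser, rather than trusting the regex alone, keeps stray `<` characters in ordinary prose from being mistaken for markup. It does not repair tags that the text extraction has split or reordered across pages, so the output still needs a manual check.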

    Best Regards,
    Dillion

    • Marked as answer by Anonymous Thursday, October 7, 2021 12:00 AM
    Wednesday, February 25, 2015 4:24 AM

All replies

  • User-434868552 posted

    @xequence

    via Google, or your favourite search engine, 
           c# read .pdf
    will return a gazillion search results.

    a program like PSPad (free) will let you study the file you wish to examine in hexadecimal.
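If no hex editor is at hand, a few lines of script give a similar view. The following is a minimal sketch in Python; the function name `hex_dump` and the sample bytes are only illustrative.

```python
def hex_dump(data, width=16):
    """Format bytes as offset, hex columns and printable ASCII, one row per line."""
    hex_width = width * 3 - 1  # two hex digits per byte plus separating spaces
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{byte:02x}" for byte in chunk)
        # Replace non-printable bytes with "." in the text column.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{hex_width}}  {text_part}")
    return "\n".join(lines)


# A PDF file starts with a header like this; real data would come from
# open(path, "rb").read(...) on the file being studied.
print(hex_dump(b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\n"))
```

Dumping the first few hundred bytes is usually enough to see whether the interesting parts are readable as plain text or stored in compressed streams.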

    Nick, given that you've posted to forums.asp.net c. 2000 times, you're likely already exceptional with Google et al., so I'm wondering what it is I do not understand about your O.P.

    Have you been to http://partners.adobe.com/public/developer/xml/topic.html ?

    also:  http://partners.adobe.com/public/developer/en/xml/AdobeXMLFormsSamples.pdf 

    "I want to parse a pdf file and locate the xml inside of the document. Suggestions?"

    is this a one time, one file event?

    are you thinking about any generic .pdf file?   a special .pdf file?

    http://weblogs.asp.net/gerrylowry/clarity-is-important-both-in-question-and-in-answer 

    Sunday, February 22, 2015 12:39 AM
  • User1080785583 posted

    Thanks for the links. I only want to read the XML out of a 275-page .pdf file and then generate objects using xsd.exe (or something that turns XML into code), which I can then use to generate a hierarchical table structure using DbSet<T>.

    Sunday, February 22, 2015 4:36 PM