pdf to text

  • Question

  • User2045693258 posted

    Can anyone provide a simple solution for reading a PDF in VB.NET?

    I've tried iTextSharp, but it's way too complicated for me, and PDFBox is a bit much as well (I get a lot of Java initialization-type errors). All I need to do is get the text out of a PDF, with no regard to formatting. Anything quick and dirty will do, as long as it can get text from PDFs that are stored online. Any help is greatly appreciated.


    Friday, February 5, 2010 9:55 AM

Answers

  • User1364706731 posted


    This approach requires the full version of Adobe Acrobat installed on your PC so that you can access the Adobe APIs (so it doesn't technically qualify as a free way to do it).

    Here is the code I used to read the contents of a PDF. You will have to add a reference to the Adobe APIs in your project:

    ' Requires a project reference to the Acrobat COM type library.
    ' doctext and doctype are assumed to be String variables declared elsewhere.
    Dim objPDFDoc As New AcroPDDoc
    Dim objPDFPage As AcroPDPage
    Dim objPDFRectTemp As Object
    Dim objPDFRect As New AcroRect
    Dim objPDFTextSelection As AcroPDTextSelect
    Dim lngPageCount As Integer
    Dim lngTextRangeCount As Integer
    Dim Fora As Integer

    objPDFDoc.Open(tbdocdisplaypath.Text)
    lngPageCount = objPDFDoc.GetNumPages

    ' Page numbers are zero-based.
    For Fora = 0 To lngPageCount - 1

        ' Build a rectangle that covers the whole page.
        objPDFPage = objPDFDoc.AcquirePage(Fora)
        objPDFRectTemp = objPDFPage.GetSize
        objPDFRect.Left = 0
        objPDFRect.right = objPDFRectTemp.x
        objPDFRect.Top = objPDFRectTemp.y
        objPDFRect.bottom = 0

        ' Select all the text inside that rectangle on this page.
        objPDFTextSelection = objPDFDoc.CreateTextSelect(Fora, objPDFRect)

        ' Skip pages where no text selection could be created.
        If objPDFTextSelection IsNot Nothing Then
            ' Append each piece of text in the selection.
            For lngTextRangeCount = 0 To objPDFTextSelection.GetNumText - 1
                doctext = doctext & objPDFTextSelection.GetText(lngTextRangeCount)
            Next
        End If

        ' Separate pages with a line break.
        doctext = doctext & vbCrLf

    Next

    doctype = "PDF"

    objPDFDoc.Close()
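
    The question mentions PDFs that are stored online, while the code above opens a file by path. One way to bridge that gap is to download the PDF to a temporary file first. This is only a minimal sketch: the module name PdfDownloadSketch and the function name DownloadPdfToTempFile are made up for illustration, and it assumes the URL can be fetched without authentication.

    ```vbnet
    Imports System
    Imports System.IO
    Imports System.Net

    Module PdfDownloadSketch

        ' Hypothetical helper: downloads the PDF at pdfUrl to a uniquely named
        ' file in the temp folder and returns the local path.
        Function DownloadPdfToTempFile(ByVal pdfUrl As String) As String
            Dim localPath As String = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N") & ".pdf")

            Using client As New WebClient()
                ' Blocks until the whole file has been written to localPath.
                client.DownloadFile(pdfUrl, localPath)
            End Using

            Return localPath
        End Function

    End Module
    ```

    The returned path could then be passed to objPDFDoc.Open in place of tbdocdisplaypath.Text, and the temporary file deleted with File.Delete after objPDFDoc.Close().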


    • Marked as answer by Anonymous Thursday, October 7, 2021 12:00 AM
    Monday, February 8, 2010 5:01 AM

All replies

  • User1485622831 posted

    I've been looking all over for this sort of code, but I can't find any documentation anywhere. Does anybody know where to get the Adobe documentation?

    Thanks

    Thursday, February 18, 2010 5:42 AM