locked
How to use VBA to View PDF file and use an OCR object? RRS feed

  • Question

  • There are problems to view PDF with VBA.

    I have 2 questions:

    1. How to get text contents from PDF via VBA.

    2. If PDF is a scaned file, is there any OCR object to convert image to text and get the contents?

    Monday, November 19, 2012 11:57 AM

All replies

  • Hi MaerDam,

    If you have OneNote, you can paste the scanned image onto a OneNote page and have that convert the image to text.


    Regards, Jan Karel Pieterse|Excel MVP|http://www.jkp-ads.com
    Monday, November 19, 2012 12:12 PM
  • Hi Jan,

    Open PDF file, OCR, find out key words should be completed by one macro. Which means that I click one button, all job will be done.

    Monday, November 19, 2012 2:35 PM
  • Hi MaerDam,

    Have you got OneNote? If so, perhaps it is better to ask your question in a forum about onenote.


    Regards, Jan Karel Pieterse|Excel MVP|http://www.jkp-ads.com
    Monday, November 19, 2012 3:34 PM
  • VBA can't do what you want so you need to get a utility app that can do what you want and be controlled by VBA. You have some research ahead of you!

    Rod Gill

    The one and only Project VBA Book

    Rod Gill Project Management

    Monday, November 19, 2012 8:12 PM
  • You should to have fill version of Acrobat (not reader)

    Take look on this procedure:

    Sub test_with_PDF()
    Dim objApp As Object
    Dim objPDDoc As Object
    Dim objjso As Object
    Dim wordsCount As Long
    Dim page As Long
    Dim i As Long
    Dim strData As String
    Dim strFileName As String
    
    strFileName = "C:\Temp\File.pdf"
    
    Set objApp = CreateObject("AcroExch.App")
    Set objPDDoc = CreateObject("AcroExch.PDDoc")
    'AD.1 open file, if =false file is damage
    If objPDDoc.Open(strFileName) Then
        Set objjso = objPDDoc.GetJSObject
        For page = 0 To objPDDoc.GetNumPages - 1
            wordsCount = objjso.GetPageNumWords(page)
            For i = 0 To wordsCount
                'AD.2 Set text to variable strData
                strData = strData & " " & objjso.getPageNthWord(page, i)
            Next i
        Next
        MsgBox strData
    Else
        MsgBox "error!"
    End If
    End Sub 


    Oskar Shon, Office System MVP

    Press if Helpful; Answer when a problem solved

    Monday, November 19, 2012 9:39 PM
    Answerer
  • Hi MaerDam,

    Have you got OneNote? If so, perhaps it is better to ask your question in a forum about onenote.


    Regards, Jan Karel Pieterse|Excel MVP|http://www.jkp-ads.com
    Yes, I have got OneNote already. How to use OCR function of OneNote via VBA?
    Tuesday, November 20, 2012 2:20 AM
  • Hi MaerDam,

    How to use OCR function of OneNote via VBA?

    Ask your question here:

    http://answers.microsoft.com/en-us/office/forum/onenote


    Regards, Jan Karel Pieterse|Excel MVP|http://www.jkp-ads.com
    Tuesday, November 20, 2012 6:35 AM
  • I did this several years ago using Office 2007.  Office 2003/2007 comes with an OCR scanner.  It is under Start >> Microsoft Office >> Microsoft Tools >> Document Imaging.  It takes tif files and converts them to word docs.  You can control it with VBA.  It only took a few lines of code to work.  I used a free command line program called ghostview to convert the pdf to tif.  I used VBA to call ghostview using win32 commands.  I think I did it page by page but maybe it works with the whole document.  I seem to remember getting reasonable results.  The pdf was a scan of a lost document.   I set it all up in less than a day.  I remember playing with the resolution of the tif document to improve the results.
    Tuesday, November 20, 2012 4:31 PM
  • but it is not workable in office 2010..
    Monday, December 10, 2012 7:13 AM
  • Hi Oskar Shon, thanks a lot for your answer.

    Are the codes will get all the words from a PDF file?

    Monday, December 10, 2012 7:16 AM
  • Note that Microsoft have not supported their MODI  ( ocr imaging ) in 2010 versions.. works OK in 2003  ( Library 11) and 2007 (library 12)

    C:\programfiles\commonfiles\microsoftshared\modi\12.0\modivc12.dll

      or 11 not 12 for 2003

    .. has a excellent compiled help file from MS with good VBA examples.. an excellent OCR for images

    Kills my version of excell 2010

    A-pdf.com  has one  has a-pdf text extractor  as a free offer at the moment

    http://www.a-pdf.com

    It does a good job on text from a pdf.. ( any better free about ??)

    But does any one know how to get MODI or some equal OCR working with excel 2012

    MS has an article on this but it works to set up 2007 byt not for me on 2010

    So that Image text may be extracted.. without going to PDF or using acrobat


    farmer

    Thursday, December 13, 2012 8:10 PM
  • You could use tesseract-ocr, an open free OCR program. I think Google uses it.  There is a Windows version.

    http://code.google.com/p/tesseract-ocr/downloads/list

    You could use ghostview to convert pdf to tiff.  You can call the programs using Win32 commands in VBA.


    Saturday, December 15, 2012 9:06 PM
  • If works only with full ver. of Acrobate (course if PDF is not from image).

    Oskar Shon, Office System MVP

    Press if Helpful; Answer when a problem solved

    Sunday, December 16, 2012 10:02 PM
    Answerer
  • You should to have fill version of Acrobat (not reader)

    Take look on this procedure:

    Sub test_with_PDF()
    Dim objApp As Object
    Dim objPDDoc As Object
    Dim objjso As Object
    Dim wordsCount As Long
    Dim page As Long
    Dim i As Long
    Dim strData As String
    Dim strFileName As String
    
    strFileName = "C:\Temp\File.pdf"
    
    Set objApp = CreateObject("AcroExch.App")
    Set objPDDoc = CreateObject("AcroExch.PDDoc")
    'AD.1 open file, if =false file is damage
    If objPDDoc.Open(strFileName) Then
        Set objjso = objPDDoc.GetJSObject
        For page = 0 To objPDDoc.GetNumPages - 1
            wordsCount = objjso.GetPageNumWords(page)
            For i = 0 To wordsCount
                'AD.2 Set text to variable strData
                strData = strData & " " & objjso.getPageNthWord(page, i)
            Next i
        Next
        MsgBox strData
    Else
        MsgBox "error!"
    End If
    End Sub 


    Oskar Shon, Office System MVP

    Press if Helpful; Answer when a problem solved

    Hi Oskar, How to output the contents with format?
    Monday, December 17, 2012 7:28 AM
  • You have seen two loop inside code.

    Variable strData collects strings from PDF file.

    You can cut that to pieces, if you need for or in loop write somewhere page by page.


    Oskar Shon, Office System MVP

    Press if Helpful; Answer when a problem solved

    Monday, December 17, 2012 1:00 PM
    Answerer
  • Got it! Thanks a lot.

    I can use instr function to find the key words..

    But I find the codes cannot output Characters. And if I need to output the tables in PDF. Can it be achieved?

    Tuesday, December 18, 2012 3:10 AM
  • You can use Split Function to find and cut sentence from the string.

    Add to variant or to collection.


    Oskar Shon, Office System MVP

    Press if Helpful; Answer when a problem solved

    Tuesday, December 18, 2012 8:29 AM
    Answerer
  • VBA can't do what you want so you need to get a utility app that can do what you want and be controlled by VBA. You have some research ahead of you!

    Why does it have to be so complicated?
    Wednesday, October 7, 2020 5:04 PM
  • It is a little complicated but I did it a few years ago.  Install SharePointDesigner.exe 2007. It includes Microsoft Office Document Imaging (MODI).  Only install MODI.  MODI can be called from VBA.  It uses .TIFF files for OCR.  I used ghostscript (free command line tool) to convert PDF to image files.  I called it from VBA.  Add reference to Microsoft Office Document Imaging.  I couldn't find doc on internet so I put it on my OneDrive.

    Note: This works in Excel 2010.  It seems to fail in later versions of Excel.

    UPDATE: I found my test case and ran it with Excel 2010 and it crashed Excel.  It was working a few years ago so I assume some Windows update is causing it to fail.



    • Edited by mogulman52 Saturday, October 10, 2020 9:59 PM
    Saturday, October 10, 2020 1:22 PM
  • yeah it's a little complicated as I was using this on Excel 2007 version. 


    Saturday, November 21, 2020 6:38 AM