How to search each file in a folder to see if it contains a certain text?

  • Question

  • Hi,

    I would like to search every file in a certain folder to check whether it contains a certain string of text (possibly with wildcard characters).

    In the old days of Office XP I used Application.FileSearch with something like the following:

    If MySearchedText <> "" Then
      Set fs = Application.FileSearch
      With fs
        .NewSearch
        .LookIn = strMyFolder
        .FileName = "*.doc"
        .TextOrProperty = MySearchedText
        If .Execute > 0 Then
          Set MyArrFiles = .FoundFiles
        End If
      End With
    End If
    

    But now that Application.FileSearch is no longer available, what should I use? How can I accomplish the same task?

    I was trying this...

    1. Using Dir or FileSystemObject I can get a collection of filenames (in the folder)
    2. Then for each file I can call one of the following functions.

    A)

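    Public Function FindTextInFile(strFileName As String, strTextSearched As String) As Boolean
       Dim oWord As Word.Application
       Dim oDoc As Word.Document
       Dim bResult As Boolean
    
       bResult = False
       On Error GoTo FindTextInFile_Err
       ' Start a separate, hidden Word instance for the search
       Set oWord = CreateObject("Word.Application")
       oWord.Visible = False
       ' Open read-only so the search can never modify the file
       Set oDoc = oWord.Documents.Open(FileName:=strFileName, ReadOnly:=True, AddToRecentFiles:=False)
       With oDoc.Content.Find
          .ClearFormatting
          .Text = strTextSearched
          .Forward = True
          bResult = .Execute
       End With
    
    FindTextInFile_Exit:
       ' Clean up here so that no hidden Word instance is left running after an error
       On Error Resume Next
       If Not oDoc Is Nothing Then oDoc.Close SaveChanges:=wdDoNotSaveChanges
       If Not oWord Is Nothing Then oWord.Quit SaveChanges:=wdDoNotSaveChanges
       FindTextInFile = bResult
       Exit Function
    FindTextInFile_Err:
       bResult = False
       Resume FindTextInFile_Exit
    End Function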
    

    But I think

    • it will be TOO SLOW
    • what if the file or Word is already open?

    Another solution that would avoid the problems above would be:

    B)

    Public Function FindTexTInFile2(strFileName As String, strCriteria() As String) As Boolean
      Dim intFile As Integer
      Dim strFileContent As String
      Dim bResult As Boolean
      Dim i As Long
         
      On Error GoTo errHandler
      ' Read the whole file into a string as raw bytes
      intFile = FreeFile
      Open strFileName For Binary Access Read As #intFile
      strFileContent = String(LOF(intFile), " ")
      Get #intFile, , strFileContent
        
      ' Stop at the first criterion that is found (case-insensitive comparison)
      For i = LBound(strCriteria) To UBound(strCriteria)
        bResult = (InStr(1, strFileContent, strCriteria(i), vbTextCompare) <> 0)
        If bResult Then Exit For
      Next i
      FindTexTInFile2 = bResult
    
    exitRoutine:
      Close #intFile
      Exit Function
      
    errHandler:
      MsgBox Err.Number & ": " & Err.Description, vbExclamation, "Error while running FindTexTInFile2()"
      Resume exitRoutine
    End Function
    

    This way I don't have to worry about whether the file is already open, and it will probably be much faster. BUT it doesn't work! It searches the whole bunch of bytes stored in the docx/zip file; if only I could look at just the "real" content, i.e. what we humans :-) see on the screen!

    Do you have any suggestions, please?

    Thanks, Lauro

    Thursday, July 12, 2012 7:45 PM

Answers

  • Hi Lauro

    You can use VBA with the Open XML file format, but not using only what VBA offers. It can be done - especially if all you want to do is read (not write) content. See this article:
    http://msdn.microsoft.com/en-us/library/dd819387(office.12).aspx

    Scroll down to "Using class modules" (about half-way through) to find the relevant code that lets you unzip a file in order to access the xml parts in the file. For Word, you'd need only the document.xml part.
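
    As a rough illustration of the unzip step, the sketch below pulls word\document.xml out of a .docx. It does not reproduce the class module from the article: it swaps in the Windows Shell's built-in zip folder support instead. The function name ExtractDocumentXml, the temporary file name package_copy.zip and the 10-second timeout are all hypothetical, and it assumes a Windows machine, a writable output folder and a .docx that is not locked by another program.

```vba
Option Explicit

' Copies word\document.xml out of a .docx package into strOutFolder and
' returns the path of the extracted file ("" if anything went wrong).
' Hypothetical helper: name, temp file name and timeout are assumptions.
Public Function ExtractDocumentXml(ByVal strDocx As String, _
                                   ByVal strOutFolder As String) As String
    Dim oFso As Object, oShell As Object, oPart As Object
    Dim vZipWordFolder As Variant   ' Shell.Namespace expects Variant arguments
    Dim vOutFolder As Variant
    Dim strZip As String, strTarget As String
    Dim dtDeadline As Date

    On Error GoTo errHandler
    Set oFso = CreateObject("Scripting.FileSystemObject")
    strZip = oFso.BuildPath(strOutFolder, "package_copy.zip")
    strTarget = oFso.BuildPath(strOutFolder, "document.xml")
    If oFso.FileExists(strTarget) Then oFso.DeleteFile strTarget, True

    ' The Shell only treats the package as a folder if it is named *.zip
    oFso.CopyFile strDocx, strZip, True

    vZipWordFolder = strZip & "\word"
    vOutFolder = strOutFolder
    Set oShell = CreateObject("Shell.Application")
    Set oPart = oShell.Namespace(vZipWordFolder).ParseName("document.xml")
    If Not oPart Is Nothing Then
        oShell.Namespace(vOutFolder).CopyHere oPart

        ' CopyHere may return before the copy has finished, so wait for the file
        dtDeadline = Now + TimeSerial(0, 0, 10)
        Do Until oFso.FileExists(strTarget) Or Now > dtDeadline
            DoEvents
        Loop
        If oFso.FileExists(strTarget) Then ExtractDocumentXml = strTarget
    End If

exitRoutine:
    ' Remove the temporary zip copy; ignore errors during cleanup
    On Error Resume Next
    oFso.DeleteFile strZip, True
    Exit Function

errHandler:
    ExtractDocumentXml = ""
    Resume exitRoutine
End Function
```

    FileSystemObject is used for the existence checks instead of Dir, so that calling this function from inside a Dir loop does not reset that loop. Because the temporary names are fixed, the sketch handles one file at a time.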

    You'll find more on the code to access the XML parts of an Open XML file here:
    http://www.jkp-ads.com/articles/Excel2007FileFormat02.asp

    In execution it will be much faster than automating Word from VBA. But there'd definitely be a major learning curve.

    The only VBA approach that can work is the one you mention in your first message, which you do not like because it would be (and is) slow: using Dir to loop the files in a folder, open them, perform the Find, then close.
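
    A minimal sketch of that loop follows. Much of the cost in the original FindTextInFile is likely starting and quitting Word for every file, so this version starts one hidden Word instance and reuses it for the whole folder. The function name FindFilesContainingText and its parameters are hypothetical; late binding is an assumption made so that the code also runs outside Word without a reference to the Word object library.

```vba
Option Explicit

' Returns a Collection with the full paths of the files in strFolder that
' match strPattern (for example "*.doc*") and contain strText.
' Hypothetical helper: the name and the parameters are assumptions.
Public Function FindFilesContainingText(ByVal strFolder As String, _
                                        ByVal strPattern As String, _
                                        ByVal strText As String, _
                                        Optional ByVal bWildcards As Boolean = False) As Collection
    Const wdDoNotSaveChanges As Long = 0
    Dim colFound As New Collection
    Dim oWord As Object
    Dim oDoc As Object
    Dim strFile As String

    On Error GoTo errHandler
    If Right$(strFolder, 1) <> "\" Then strFolder = strFolder & "\"

    ' Start Word once and reuse the same instance for every file
    Set oWord = CreateObject("Word.Application")
    oWord.Visible = False

    strFile = Dir(strFolder & strPattern)
    Do While strFile <> ""
        Set oDoc = Nothing
        On Error Resume Next            ' skip files that Word cannot open
        Set oDoc = oWord.Documents.Open(FileName:=strFolder & strFile, _
                                        ReadOnly:=True, AddToRecentFiles:=False)
        On Error GoTo errHandler
        If Not oDoc Is Nothing Then
            With oDoc.Content.Find
                .ClearFormatting
                .Text = strText
                .Forward = True
                .MatchWildcards = bWildcards
                If .Execute Then colFound.Add strFolder & strFile
            End With
            oDoc.Close wdDoNotSaveChanges
            Set oDoc = Nothing
        End If
        strFile = Dir                   ' next file that matches the pattern
    Loop

exitRoutine:
    ' Always quit the hidden instance, even after an error
    On Error Resume Next
    If Not oDoc Is Nothing Then oDoc.Close wdDoNotSaveChanges
    If Not oWord Is Nothing Then oWord.Quit wdDoNotSaveChanges
    Set FindFilesContainingText = colFound
    Exit Function

errHandler:
    Resume exitRoutine
End Function
```

    A call such as FindFilesContainingText(strMyFolder, "*.doc", MySearchedText) returns the matching paths, much as .FoundFiles did with Application.FileSearch.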


    Cindy Meister, VSTO/Word MVP

    • Marked as answer by Lauro2 Sunday, July 15, 2012 9:30 PM
    Sunday, July 15, 2012 8:34 AM
    Moderator
  • Thanks again Cindy,

    I will give it a try and I'll let you know if I succeed!

    Lauro

    • Marked as answer by Lauro2 Sunday, July 15, 2012 9:30 PM
    Sunday, July 15, 2012 9:29 PM

All replies

  • You're right, the fastest, most resource-efficient way would be to read the binary file from the hard disk. And you're also right that you can't (or shouldn't) just search the binary file for text. For one thing, characters within a Word binary are not necessarily contiguous. Word binaries have a file system, which means the text stream can jump around from place to place within the file.

    Microsoft, in a rare generous spirit, have released the Word binary specification. Using it, I spent a weekend writing a binary file reader (in C++, so probably no good to you). It will take you a solid couple of days, unless you already have an understanding of FAT32/DOS file allocation systems, which it is based upon.

    These two documents are probably the most useful. They're what I referred to.

    Compound File Specification (MS Word binaries are a type of 'compound file')

    MS Word Binary Specification (You will find you need to know how to read a compound file before you can use this information)

    Edit: Actually all probably useless to you, since you're dealing with .docx files rather than .doc files. You should use OpenXML, which is much faster than Interop. There's a separate forum for OpenXML.

    • Edited by JosephFox Thursday, July 12, 2012 10:04 PM
    Thursday, July 12, 2012 9:46 PM
  • Hi Lauro

    Supplementing Joseph's answer:

    I recommend you first go to OpenXMLdeveloper.org where you'll find a lot of basic information on using Office Open XML and the Open XML SDK, which "streamlines" some of the work for you.

    There are also forums on OpenXMLDeveloper.org, as well as http://social.msdn.microsoft.com/forums/en-US/oxmlsdk/threads/ here on MSDN.

    This approach will certainly be much faster in execution than automating the Word application. Indeed, Word does not even have to be installed in the environment doing the processing.

    You will have issues similar to those Joseph described about the text not being contiguous. But judicious use of XPath can filter out all the "interference" so that you can pick up just the text.
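
    To make the XPath idea concrete, here is a minimal sketch that collects the text of every paragraph in an already extracted document.xml. The function name GetTextFromDocumentXml is hypothetical, and the code assumes MSXML 6.0 is installed. It reads only w:t elements, so deleted text from tracked changes and anything stored in other parts (headers, footers, footnotes) is not included.

```vba
Option Explicit

' Returns the text of a WordprocessingML main part (document.xml),
' one paragraph per line, or "" if the file cannot be parsed.
' Hypothetical helper; assumes MSXML 6.0 is available.
Public Function GetTextFromDocumentXml(ByVal strXmlPath As String) As String
    Const W_NS As String = _
        "xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'"
    Dim oXml As Object
    Dim oPara As Object
    Dim oRunText As Object
    Dim strResult As String

    Set oXml = CreateObject("MSXML2.DOMDocument.6.0")
    oXml.async = False
    oXml.preserveWhiteSpace = True      ' keep the spaces between words
    If Not oXml.Load(strXmlPath) Then Exit Function

    ' The w: prefix has to be declared before it can be used in XPath
    oXml.setProperty "SelectionNamespaces", W_NS

    ' The text of one paragraph is split over many w:t elements;
    ' joining them per w:p makes it contiguous again
    For Each oPara In oXml.selectNodes("//w:body//w:p")
        For Each oRunText In oPara.selectNodes(".//w:t")
            strResult = strResult & oRunText.Text
        Next oRunText
        strResult = strResult & vbCrLf
    Next oPara

    GetTextFromDocumentXml = strResult
End Function
```

    The returned string can then be searched with InStr(1, strText, strCriteria, vbTextCompare), as in FindTexTInFile2.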


    Cindy Meister, VSTO/Word MVP

    Friday, July 13, 2012 6:36 AM
    Moderator
  • What she said.

    If you need to search both .doc files and .docx, you will need to utilize both methods. Word interop is the only technology capable of reading both .docx and .doc files (and as discussed it's too slow for your task).
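
    If both formats sit in the same folder, one way to choose the method per file is to look at the first bytes rather than trust the extension. The sketch below is only an illustration: the function name DetectWordFileFormat and its return values are hypothetical, and it recognizes the containers (zip package, compound file), not Word documents as such.

```vba
Option Explicit

' Returns "docx" for a zip-based package, "doc" for a compound file
' (the container used by Word binaries), or "" if neither signature matches.
' Hypothetical helper: only the first 8 bytes are inspected, so any zip or
' any compound file is reported, not only Word documents.
Public Function DetectWordFileFormat(ByVal strFileName As String) As String
    Dim intFile As Integer
    Dim bytHeader(0 To 7) As Byte
    Dim vSig As Variant
    Dim i As Long
    Dim bCompound As Boolean

    intFile = FreeFile
    Open strFileName For Binary Access Read Shared As #intFile
    If LOF(intFile) >= 8 Then Get #intFile, 1, bytHeader
    Close #intFile

    ' Zip local file header: "PK" followed by 03 04
    If bytHeader(0) = &H50 And bytHeader(1) = &H4B _
       And bytHeader(2) = &H3 And bytHeader(3) = &H4 Then
        DetectWordFileFormat = "docx"
        Exit Function
    End If

    ' Compound file signature: D0 CF 11 E0 A1 B1 1A E1
    vSig = Array(&HD0, &HCF, &H11, &HE0, &HA1, &HB1, &H1A, &HE1)
    bCompound = True
    For i = 0 To 7
        If bytHeader(i) <> vSig(i) Then bCompound = False
    Next i
    If bCompound Then DetectWordFileFormat = "doc"
End Function
```

    Files reported as "docx" could go through the Open XML route and files reported as "doc" through Word automation.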

    Friday, July 13, 2012 10:21 AM
  • Hi Cindy, hi Joseph,

    thanks to both of you.

    I had a quick look at OpenXMLdeveloper.org, and I found it very interesting but discouraging: I think looking inside the docx package for the desired text string is beyond my ability, time and tools (VBA), even if it would be the most efficient way.

    I think I will stay with the much easier and more familiar way: Word automation and the Office 2007/2010 Find object. Will the process take very long with around 100 or 200 files of 1 or 2 pages each?

    I also did some research on the web and didn't find very much; it seems strange, since the problem I'm having should be a very common one...

    Bye, Lauro

    Saturday, July 14, 2012 10:06 PM