none
Binary Search Algorithm to Find a Paragraph in Word RRS feed

  • General discussion

  • Hi,

    Oftentimes one needs to find some text in a document. As far as I know, Word can only tell you if the paragraph you're searching exists or not in the range being searched. It doesn't tell you the paragraph number, which is of great importance if you want to manipulate the paragraph or do something with it.

    I first coded a brute-force algorithm, one that searches paragraph by paragraph, starting from the first one. If your document happened to be very large and have the word sought in the last paragraph, the performance of the application was terrible. And with VSTO, which is slower than VBA, things were even worse.

    After thinking for a while, I concluded that the Find class wasn't so useless at all. It could still search more efficiently than the brute-force algorithm. The problem was that it didn't give you the paragraph number. Then I noticed that a binary search algorithm could be implemented using the this class.

    Here's how the binary search algorithm works:

    1. Search the whole document first.

    2. Return -1 if the word wasn't found.

    3. If it did find the word, search on the first half.

    4. If the word isn't in the first half, search on the second half.

    5. Repeat steps 3 to 4 until you find the paragraph with the word you're looking for.

    Here's the code:

      ''' <summary>
      ''' Returns the paragraph number where the sought text was found.
      ''' </summary>
      ''' <param name="SoughtText">The text sought.</param>
      ''' <param name="Doc">The document being searched.</param>
      ''' <param name="FirstParagraph">
      ''' The first paragraph of the range being searched.
      ''' </param>
      ''' <param name="LastParagraph">
      ''' The last paragraph of the range being searched.
      ''' </param>
      ''' <returns></returns>
      ''' <remarks></remarks>
      Public Function FindParagraphIndex( _
        ByVal SoughtText As String, _
        ByVal Doc As Word.Document, _
        ByVal FirstParagraph As Integer, _
        ByVal LastParagraph As Integer) As Integer
    
        ' Check that the text isn't null.
        If String.IsNullOrEmpty(SoughtText) Then
          Throw New NullTextSoughtException()
        End If
    
        ' Case-sensitive search.
        Const matchCase As Boolean = True
    
        ' The search is over when the first paragraph equals the last 
        ' paragraph, but be careful if the document has only one paragraph!
        If FirstParagraph = LastParagraph AndAlso FirstParagraph <> 1 Then
          Return FirstParagraph
        ElseIf FirstParagraph = LastParagraph AndAlso FirstParagraph = 1 Then
          If Doc.Paragraphs(FirstParagraph).Range.Find.Execute(CStr(SoughtText), CBool(matchCase)) Then
            Return FirstParagraph
          Else
            Return -1
          End If
        End If
    
        Dim rangeSearched As Word.Range
    
        ' Search the whole range.
        rangeSearched = Doc.Range(CInt(Doc.Paragraphs(FirstParagraph).Range.Start), _
                     CInt(Doc.Paragraphs(LastParagraph).Range.End))
        If rangeSearched.Find.Execute(CStr(SoughtText), CBool(matchCase)) Then
          ' Calculate the mid point of the range being searched.
          Dim rangeMidPoint As Integer = 0
          rangeMidPoint = FirstParagraph + CInt(Math.Floor((LastParagraph - FirstParagraph) / 2))
    
          ' Search the first half of the range.
          rangeSearched = Doc.Range(CInt(Doc.Paragraphs(FirstParagraph).Range.Start), _
                       CInt(Doc.Paragraphs(rangeMidPoint).Range.End))
          If rangeSearched.Find.Execute(CStr(SoughtText), CBool(matchCase)) Then
            ' Start a new search in the first half of the range again.
            FindParagraphIndex = FindParagraphIndex(SoughtText, Doc, FirstParagraph, rangeMidPoint)
          Else
            ' Start a new search in the second half of the range.
            FindParagraphIndex = FindParagraphIndex(SoughtText, Doc, rangeMidPoint + 1, LastParagraph)
          End If
        Else
          Return -1
        End If
    
      End Function
    
    
    Here's the definition of the NullTextSoughtException exception:

    ''' <summary>
    ''' Thrown when the text sought in a document is null.
    ''' </summary>
    ''' <remarks></remarks>
    <Serializable()> _
    Public Class NullTextSoughtException
      Inherits ApplicationException
    
      Public Sub New()
        MyBase.New(My.Resources.NullTextSought)
      End Sub
    
      Public Sub New(ByVal message As String)
        MyBase.New(message)
      End Sub
    
      Public Sub New(ByVal message As String, ByVal inner As Exception)
        MyBase.New(message, inner)
      End Sub
    
      Public Sub New( _
        ByVal info As System.Runtime.Serialization.SerializationInfo, _
        ByVal context As System.Runtime.Serialization.StreamingContext)
        MyBase.New(info, context)
      End Sub
    End Class
    
    
    And My.Resources.NullTextSought = "The text sought cannot be null."

    Hope you find this post useful!

    Carlos Mallen

     

    Wednesday, March 16, 2011 2:03 AM