How to get all words in Word Documents

  • Question

  • Hi everyone.

    I want to get all the words in a Word document, list them in ascending order, check them, and finally replace each wrong word with the correct one.

    Can anyone help me with solutions or suggestions?

    • Moved by Cindy Meister MVPModerator Sunday, January 23, 2011 6:21 AM Word, not VSTO-specific (From:Visual Studio Tools for Office)
    Saturday, January 22, 2011 4:11 PM

All replies

  • Use:

    Sub WordFrequency()
        Const maxwords = 9000              'Maximum unique words allowed
        Dim SingleWord As String           'Raw word pulled from doc
        Dim Words(maxwords) As String      'Array to hold unique words
        Dim Freq(maxwords) As Long         'Frequency counter for unique words
        Dim WordNum As Long                'Number of unique words
        Dim ByFreq As Boolean              'Flag for sorting order
        Dim ttlwds As Long                 'Words still to be processed
        Dim Totalwords As Long             'Word count from the document properties
        Dim Excludes As String             'Words to be excluded
        Dim Found As Boolean               'Temporary flag
        Dim j As Long, k As Long, l As Long, Temp As Long 'Temporary variables
        Dim tword As String                'Temporary word used while swapping
        Dim Ans As String                  'Sort order typed by the user
        Dim aword As Range                 'Current word in the document
        Dim tmpName As String              'Full name of the attached template

        ' Set up excluded words, for example:
        ' [the][a][of][is][to][for][this][that][by][be][and][are]
        Excludes = InputBox$("Enter words that you wish to exclude, surrounding each word with [ ].", "Excluded Words", "")

        ' Find out how to sort
        ByFreq = True
        Ans = InputBox$("Sort by WORD or by FREQ?", "Sort order", "FREQ")
        If Ans = "" Then Exit Sub
        If UCase(Ans) = "WORD" Then ByFreq = False

        Selection.HomeKey Unit:=wdStory
        System.Cursor = wdCursorWait
        WordNum = 0
        ttlwds = ActiveDocument.Words.Count
        Totalwords = ActiveDocument.BuiltInDocumentProperties(wdPropertyWords)

        ' Loop over every word in the document
        For Each aword In ActiveDocument.Words
            SingleWord = Trim(aword.Text)
            If SingleWord < "A" Or SingleWord > "z" Then SingleWord = "" 'Out of range?
            If InStr(Excludes, "[" & SingleWord & "]") > 0 Then SingleWord = "" 'On exclude list?
            If Len(SingleWord) > 0 Then
                Found = False
                For j = 1 To WordNum
                    If Words(j) = SingleWord Then
                        Freq(j) = Freq(j) + 1
                        Found = True
                        Exit For
                    End If
                Next j
                If Not Found Then
                    WordNum = WordNum + 1
                    Words(WordNum) = SingleWord
                    Freq(WordNum) = 1
                End If
                If WordNum > maxwords - 1 Then
                    MsgBox "The maximum array size has been exceeded. Increase maxwords.", vbOKOnly
                    Exit For
                End If
            End If
            ttlwds = ttlwds - 1
            StatusBar = "Remaining: " & ttlwds & "     Unique: " & WordNum
        Next aword

        ' Now sort by word or by frequency (selection sort)
        For j = 1 To WordNum - 1
            k = j
            For l = j + 1 To WordNum
                If (Not ByFreq And Words(l) < Words(k)) Or (ByFreq And Freq(l) > Freq(k)) Then k = l
            Next l
            If k <> j Then
                tword = Words(j)
                Words(j) = Words(k)
                Words(k) = tword
                Temp = Freq(j)
                Freq(j) = Freq(k)
                Freq(k) = Temp
            End If
            StatusBar = "Sorting: " & WordNum - j
        Next j

        ' Now write out the results in a new document
        tmpName = ActiveDocument.AttachedTemplate.FullName
        Documents.Add Template:=tmpName, NewTemplate:=False
        Selection.ParagraphFormat.TabStops.ClearAll
        With Selection
            For j = 1 To WordNum
                .TypeText Text:=Words(j) & vbTab & Trim(Str(Freq(j))) & vbCrLf
            Next j
        End With
        ActiveDocument.Range.Select
        Selection.ConvertToTable
        Selection.Collapse wdCollapseStart
        With ActiveDocument.Tables(1)
            .Rows.Add BeforeRow:=Selection.Rows(1)
            .Cell(1, 1).Range.InsertBefore "Word"
            .Cell(1, 2).Range.InsertBefore "Occurrences"
            .Range.ParagraphFormat.Alignment = wdAlignParagraphCenter
            .Rows.Add
            .Cell(.Rows.Count, 1).Range.InsertBefore "Total words in Document"
            .Cell(.Rows.Count, 2).Range.InsertBefore CStr(Totalwords)
            .Rows.Add
            .Cell(.Rows.Count, 1).Range.InsertBefore "Number of different words in Document"
            .Cell(.Rows.Count, 2).Range.InsertBefore Trim(Str(WordNum))
        End With
        System.Cursor = wdCursorNormal
        MsgBox "There were " & Trim(Str(WordNum)) & " different words ", vbOKOnly, "Finished"
        Selection.HomeKey wdStory
    End Sub
     -- Hope this helps.

    Doug Robbins - Word MVP,
    dkr[atsymbol]mvps[dot]org
    Posted via the Community Bridge
    • Marked as answer by Bessie Zhao Monday, February 7, 2011 10:02 AM
    Sunday, January 23, 2011 7:33 AM
  • Hello Mr Doug Robbins,

    Thanks for your post. Your code is very good, and I will certainly use it in future projects.

    I have finished my project, but getting the words from the Word document is slow: when the list has more than 1000 words, the process takes too long.

    Here is my code:

    using System;
    using System.Collections;
    using System.Windows.Forms;
    using Microsoft.Office.Tools.Ribbon;
    using Word = Microsoft.Office.Interop.Word;

    public void BtnRibbon1_Click(object sender, RibbonControlEventArgs e)
    {
        ArrayList Word_Array_List = new ArrayList();
        try
        {
            Word.Words WordsList = Globals.ThisAddIn.Application.ActiveDocument.Words;

            // Word collections are one-based, hence the i + 1 below.
            for (int i = 0; i < WordsList.Count; i++)
            {
                if (WordsList[i + 1].Text != "\r")
                {
                    if (Word_Array_List.Contains(WordsList[i + 1].Text) == false)
                        Word_Array_List.Add(WordsList[i + 1].Text);
                }
            }

            object[] Word_Array = Word_Array_List.ToArray();

            Form1 form = new Form1(Word_Array);
            form.Enabled = true;
            form.Show();
        }
        catch (Exception ex)
        {
            MessageBox.Show(ex.ToString());
        }
    }
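
    One thing that might be worth trying here, as a rough sketch only: read the document text in a single call and split it into words in managed code, so the loop makes no COM call per word. The class name WordListSketch, the method DistinctSortedWords, and the regular expression are assumptions made for this illustration. The regular expression is also a simplification and will not tokenize exactly like the Words collection, which returns punctuation and paragraph marks as separate items.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    static class WordListSketch
    {
        // Splits plain text into words and returns the distinct ones in ascending order.
        // Assumption: a "word" is a run of letters, digits and apostrophes.
        public static List<string> DistinctSortedWords(string text)
        {
            // SortedSet ignores duplicates and keeps its items ordered.
            var unique = new SortedSet<string>(StringComparer.OrdinalIgnoreCase);
            foreach (Match m in Regex.Matches(text, @"[\p{L}\p{Nd}']+"))
            {
                unique.Add(m.Value);
            }
            return new List<string>(unique);
        }

        static void Main()
        {
            // In an add-in the text would come from the document in one call (assumption):
            // string text = Globals.ThisAddIn.Application.ActiveDocument.Content.Text;
            string text = "the cat and the dog\rThe end";
            foreach (string w in DistinctSortedWords(text))
            {
                Console.WriteLine(w);
            }
        }
    }
    ```

    The sorted set also replaces the ArrayList.Contains check, which scans the whole list for every word.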
    Wednesday, January 26, 2011 6:17 AM
  • I can only provide assistance with VBA.

    -- Hope this helps.

    Doug Robbins - Word MVP, dkr[atsymbol]mvps[dot]org
    Posted via the Community Bridge
    Wednesday, January 26, 2011 9:04 AM
  • Hi Adnan

    If you're working with the Word object model, going through the .NET/COM, Word OLE interface, and given the speed of processing in that interface, I don't think you can get anything faster than what you've got. It's the nature of the thing you're doing.

    Possibly, if the file is in the Office 2007/2010 Open XML file format (docx, for example) you can work with that file as you would any XML file. But we can't help you here with that. You'll find more information on Open XML file format at openXMLDeveloper.org
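
    As a minimal sketch of that idea, and not a complete Open XML solution: a .docx file is a ZIP package, and the body text normally sits in the part word/document.xml inside w:t elements. The class name DocxTextSketch and the method ReadBodyText are made up for this illustration. The sketch assumes .NET 4.5 or later (for System.IO.Compression.ZipFile), assumes the main part is at the conventional path, and ignores headers, footers, footnotes, and nested content such as text boxes.

    ```csharp
    using System;
    using System.IO;
    using System.IO.Compression;
    using System.Linq;
    using System.Xml.Linq;

    static class DocxTextSketch
    {
        // WordprocessingML main namespace used by w:p and w:t elements.
        static readonly XNamespace W =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

        // Returns the body text, one line per paragraph, without starting Word.
        public static string ReadBodyText(string docxPath)
        {
            using (ZipArchive zip = ZipFile.OpenRead(docxPath))
            {
                // Assumption: the main document part is at the conventional location.
                ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
                if (entry == null)
                {
                    throw new InvalidDataException("word/document.xml not found in package.");
                }

                using (Stream stream = entry.Open())
                {
                    XDocument xml = XDocument.Load(stream);
                    var paragraphs = xml.Descendants(W + "p")
                        .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)));
                    return string.Join("\n", paragraphs);
                }
            }
        }

        static void Main(string[] args)
        {
            if (args.Length == 1)
            {
                Console.WriteLine(ReadBodyText(args[0]));
            }
        }
    }
    ```

    The file must be saved to disk first, and the character positions in this text do not correspond to Range.Start and Range.End in the Word object model.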


    Cindy Meister, VSTO/Word MVP
    • Marked as answer by Bessie Zhao Monday, February 7, 2011 10:02 AM
    Wednesday, January 26, 2011 9:17 AM
    Moderator
  • Thanks Cindy

    Sunday, January 30, 2011 10:52 AM
  • Hi Cindy,

    Along these lines, is it possible to get the Words collection using Open XML? I would like to get the .Start and .End of each word. As you mentioned, with more than 1000 words the Application.Content.Words collection takes a long time to execute.

    Please advise on other ways to get Start and End more quickly.

    Thanks, Mohan

    Monday, March 7, 2011 6:54 PM
  • Hi ChandramohanG,

    I solved this problem by using paragraphs. I created two methods: one iterates over all the paragraphs, and the other loads all the words in a single paragraph and is called from the paragraph method.

    This algorithm solved my problem.

    Word.Words WordsList = Globals.ThisAddIn.Application.ActiveDocument.Words;
    Word.Paragraphs pa = Globals.ThisAddIn.Application.ActiveDocument.Paragraphs;

    Method for paragraphs:

    public void ParaLoader(int Index)
    {
        for (int i = Index; i <= pa.Count; i++)
        {
            ITERPARA = i;
            if (flag == true)
            {
                break;
            }
            Loader(1, pa[i].Range.Words.Count, pa[i].Range.Words); // Loads the words in each paragraph
        }
    }

    Hope this helps :).
    Tuesday, March 8, 2011 9:48 AM
  • Hello all,

    This was helpful, and thanks for your immediate reply, Adnan.

    Performance-wise, it still takes the same amount of time (15-20 seconds). I tried 4500 words, and even the paragraph collection takes a long time.

    Any thoughts?

    I even tried Parallel.For:

    Microsoft.Office.Interop.Word.Paragraphs pa = oApp.Application.ActiveDocument.Paragraphs;

    m_wordParas.Clear();

    // Parallel.For excludes the upper bound, so pass pa.Count + 1 to reach the last paragraph.
    // m_wordParas must be safe for concurrent Add calls; a plain Dictionary is not.
    Parallel.For(1, pa.Count + 1, (i) =>
    {
        m_wordParas.Add(i, pa[i].Range.Words.Count);
    });

     

    I would also like to find the index of a word.

    Paragraph.Range.Words gives the position (index) of a word relative to its paragraph, but I would like to find it relative to the whole document.

    For example, the first word in paragraph 2 has an index of 1 within that paragraph, but I would like its index with respect to the whole document, which will be greater than 1.
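
    One possible way to do that conversion, sketched under the assumption that the per-paragraph word counts (pa[i].Range.Words.Count) add up to the document's Words.Count: keep a running total of the words in the preceding paragraphs and add it to the index within the paragraph. The names WordIndexMap, BuildOffsets, and ToDocumentIndex are hypothetical.

    ```csharp
    using System;
    using System.Collections.Generic;

    static class WordIndexMap
    {
        // offsets[p] = number of words in all paragraphs before paragraph p (p is zero-based here).
        public static int[] BuildOffsets(IList<int> wordsPerParagraph)
        {
            var offsets = new int[wordsPerParagraph.Count];
            int running = 0;
            for (int p = 0; p < wordsPerParagraph.Count; p++)
            {
                offsets[p] = running;
                running += wordsPerParagraph[p];
            }
            return offsets;
        }

        // paragraphNumber and localIndex are one-based, like Word's collections.
        public static int ToDocumentIndex(int[] offsets, int paragraphNumber, int localIndex)
        {
            return offsets[paragraphNumber - 1] + localIndex;
        }

        static void Main()
        {
            // Hypothetical counts for a three-paragraph document.
            int[] offsets = BuildOffsets(new[] { 5, 3, 4 });

            // First word of paragraph 2: 5 words precede it, so its document index is 6.
            Console.WriteLine(ToDocumentIndex(offsets, 2, 1));
        }
    }
    ```

    The counts themselves still have to be read through interop once per paragraph, so this only helps with the index arithmetic, not with the collection access time.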

     

    Thanks.

    Regards, Mohan

     

    Tuesday, March 8, 2011 4:56 PM
  • Hi Mohan,

    As far as I know, if you want to get the index of a specific word in the whole document, you must have a list of all the words:

    Word_Array_List // list of all words

    Then you search for your word in the list and get its index:

    int i = Array.IndexOf(Word_Array_List.ToArray(), "SomeWord") + 1; // index of a specific word; +1 because Word.Words is one-based

    Thursday, March 10, 2011 6:38 AM
  • Thanks Adnan.

    I am trying to populate the Start/End values of words in a custom collection from .NET code.

    Globals.ThisAddIn.Application.ActiveDocument.Words[10000].Start takes a long time to return the value.

    Accessing the Words collection from .NET code takes a long time because of .NET/COM interop. Is there a better way to loop through the whole Words collection efficiently?

    Performance is very important. I also have a template file with macros.

    Does enabling "Make COM visible" for the project help? This is a Word add-in project, by the way.

    Is there a way to solve this problem?

    Thanks for your time.

    Regards, Mohan

    Thursday, March 10, 2011 3:43 PM
  • Hi Mohan,

    What I told you is all I know so far.

    I hope you find a solution, and I will be happy if you share it here :)

    Sunday, March 13, 2011 6:51 AM
  • I am settling on one solution. I called a macro from .NET code and loaded the array of words in the document there. This was faster because the macro runs close to the Word object model and avoids a .NET/COM interop call for every word.

    With this, my performance problem was solved.

    Thanks, everyone, for the pointers.

    Here is the code I used to call a macro from .NET, in case anyone is interested:

    Application.Run ("MacroToCall", ref missing, ref missing, ref missing, ref missing, ref missing, ref missing,ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing,ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing,ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing);
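
    For anyone who wants the result back in managed code, here is a rough sketch of one way it could look. It assumes the macro is a VBA Function that returns an array of strings, which is not shown in this thread, and it uses late binding through dynamic (C# 4 or later) so the unused optional arguments can be left out. MacroBridge and GetWordsViaMacro are hypothetical names, and wordApp stands for the running Word Application object, for example Globals.ThisAddIn.Application in an add-in.

    ```csharp
    using System;
    using System.Collections.Generic;

    static class MacroBridge
    {
        // Calls a macro by name and copies whatever array it returns into a managed list.
        // Assumption: the macro is a Function returning an array; a Sub yields null here.
        public static List<string> GetWordsViaMacro(dynamic wordApp, string macroName)
        {
            // One interop call; the per-word work happens inside VBA.
            object result = wordApp.Run(macroName);

            var words = new List<string>();
            var array = result as Array;
            if (array != null)
            {
                // foreach works regardless of the array's lower bound.
                foreach (object item in array)
                {
                    if (item != null)
                    {
                        words.Add(item.ToString());
                    }
                }
            }
            return words;
        }
    }
    ```

    A call would then look like GetWordsViaMacro(Globals.ThisAddIn.Application, "MacroToCall"), with the whole word list crossing the interop boundary once.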

    Regards,

    Chandramohan Ganesan

    Monday, March 14, 2011 4:10 PM
  • Hi Doug,

    This is great code and it solves part of my problem. I want to use Word to count the words in Thai, Bahasa, and other languages.

    What changes should I make to your code so that what it does for English also works for Thai and Bahasa?

    Thursday, January 7, 2016 7:30 AM