interop.word extract sections between titles from word

  • Question

  • Hi guys,

    I am a beginner with C#. I have a project where I need to extract some text from a Word document. As I understand it, the easiest way to do this is to use interop.word, and I am perfectly fine with that. I have managed to extract the whole document.

    However, I cannot "recognize" the titles and extract only the sections between them. I would also like to extract only the bold text; is that possible? Some pieces of code or examples would be very useful...

    By the way, I have not managed to find clear documentation for the interop.word reference... If you know where I can get it, do not hesitate!!

     

    Thanks in advance

    • Moved by Leo Liu - MSFT Monday, December 5, 2011 7:04 AM Moved for better support. (From:Visual C# General)
    Friday, December 2, 2011 6:05 PM

Answers

  • Hi David

    Try this

      // Stands in for optional arguments that are not set; declared here in case the enclosing class does not already provide it
      object missing = System.Type.Missing;
      Word.Document doc = Globals.ThisAddIn.Application.ActiveDocument;
      Word.Range rng = doc.Content;
      object oH3 = Word.WdBuiltinStyle.wdStyleHeading3;
      object oH4 = Word.WdBuiltinStyle.wdStyleHeading4;
      object findText = "";
      object oTrue = true;
      object oWrapFind = Word.WdFindWrap.wdFindContinue;
      object oReplaceAll = Word.WdReplace.wdReplaceAll;

      // Step 1: apply Heading 3 to all bold text (empty search text, Format = true)
      rng.Find.ClearFormatting();
      rng.Find.Font.Bold = -1;
      rng.Find.Replacement.set_Style(ref oH3);
      bool s = rng.Find.Execute(ref findText, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref oTrue, ref missing, ref oReplaceAll, ref missing, ref missing, ref missing, ref missing);

      // Step 2: wildcard search restricted to Heading 3 text, replacing the style with Heading 4.
      // As written, the pattern matches a word whose seventh character is a digit;
      // the Word wildcard [!0-9] would match a non-digit in that position instead.
      rng.Find.ClearFormatting();
      findText = "<??????[0-9]*>";
      rng.Find.set_Style(ref oH3);
      rng.Find.Replacement.set_Style(ref oH4);
      rng.Find.Execute(ref findText, ref missing, ref missing, ref oTrue, ref missing, ref missing, ref missing, ref oWrapFind, ref oTrue, ref missing, ref oReplaceAll, ref missing, ref missing, ref missing, ref missing);

     


    Cindy Meister, VSTO/Word MVP
    • Marked as answer by Bruce Song Tuesday, December 20, 2011 2:44 AM
    Tuesday, December 6, 2011 10:53 AM
    Moderator

All replies

  • Hi, 

    Welcome to the MSDN Forum. I hope this will be helpful for you to understand Word interoperability with .NET.

    -Renjith 


    Friday, December 2, 2011 6:39 PM
  • Hi davidoudou,

    I am moving your thread into the Word for Developers Forum for dedicated support.
    Have a nice day,

    Leo Liu [MSFT]
    Monday, December 5, 2011 7:05 AM
  • Thanks, and sorry I did not put it in the right place.
    Monday, December 5, 2011 7:17 AM
  • Hi David

    Could you please tell us the version of Word you're targeting? The options available depend very much on whether these are the old *.doc files or the newer Open XML files (*.docx, for example).


    Cindy Meister, VSTO/Word MVP
    Monday, December 5, 2011 7:32 AM
    Moderator
  • Hi Cindy,

    I have Word 2010 and I want to do this for both .doc and .docx. Maybe I can start with .docx and then modify my code to recognize the extension and change the options according to the file extension.

     

    David

    Monday, December 5, 2011 8:00 AM
  • Hi David

    I think that would involve a lot more than simply "modifying your code". The file formats and the ways to access them differ at a very basic level. My recommendation, in this case, would be to stick with the "interop".

    Documentation: pretty much everything beyond the basic object model reference is in VB(A). That's because VBA has been the native programming language since 1997 and C# is comparatively new :-) In order to really profit from the wealth of information available you will need to learn to read VB(A). IMO it's much simpler to go from C# to VB than the reverse :-) Some basic things you need to know:

    1. Declarations are "reversed": Dim theVariable As dataType
    2. Working with arrays/collections uses () instead of []
    3. null = Nothing
    4. Capitalization is not "special", so tHis == This == this
    5. If a property takes parameters, the PIA will translate this in C# to a method that will start with get_, respectively set_ so that you can pass the parameters
    6. You'll need to expand any enumerations to fully qualify them (instead of wdCollapseEnd it would be Word.WdCollapseDirection.wdCollapseEnd, for example)
    7. For optional parameters you need to declare an object, assign the required value to it, then pass it with ref. If you don't want to pass a value (which is recommended if you aren't interested in that parameter) you should pass an object set to System.Type.Missing.
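
    As a rough illustration of points 5 to 7, here is a minimal C# sketch. It assumes a project reference to the Word PIA (Microsoft.Office.Interop.Word) and a Word.Range obtained elsewhere; the class and method names are made up for the example, and the style name assumes an English-language Word.

```csharp
using System;
using Word = Microsoft.Office.Interop.Word;

// Hypothetical helper class, for illustration only.
static class VbaToCSharpSketch
{
    public static void Demo(Word.Range rng)
    {
        // Point 5: the VBA assignment  rng.Style = "Heading 3"
        // becomes a set_ method call; get_Style() reads the value back.
        object styleName = "Heading 3"; // assumption: English style names
        rng.set_Style(ref styleName);
        object currentStyle = rng.get_Style();

        // Point 6: VBA's bare wdCollapseEnd has to be fully qualified.
        object direction = Word.WdCollapseDirection.wdCollapseEnd;
        rng.Collapse(ref direction);

        // Point 7: optional parameters are passed as "ref object";
        // Type.Missing stands in for arguments you do not want to set.
        object missing = Type.Missing;
        object unit = Word.WdUnits.wdParagraph;
        rng.MoveEnd(ref unit, ref missing);
    }
}
```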

    <<I would like also extract only the bold text>>

    Yes. To see this, open such a document in Word and press Ctrl+F to use the "Find" functionality. Since this is Word 2010, a task pane will open. Click the arrow next to the drop down and choose the "Advanced" option to open the dialog box.

    In that, click "More", then Format and choose Font. Select "Bold", then OK. You'll see that "Bold" is listed under the box where you'd type the text you want to search. Leave that box empty. Click "Find Next". You should see that this picks up all the bold text. So far, so good?

    Close the dialog box, go to the "View" tab*, from the "Macros" button list, select "Record Macro". Enter a name, then repeat the steps above. When you've "found" an instance of Bold text, go back to the button list and choose "Stop macro".

    Press Alt+F11 to open the VBA IDE. In the Project window, look for "Normal". If the project is collapsed, click the + sign until you see the folder "Modules". Expand this and double-click on the entry "NewMacros" to display the code window. You should see your macro there, and it should contain the basic syntax required for the task. With me so far?

    Copy/paste that code into your reply and we can show you how this can be "translated" to C#.

    Note: Yes, I know this is a lot more work than my simply giving you the code. But it's the basic way to find out how Word does something and get the object model syntax, so you need to know how.
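
    For reference, the same "find bold formatting with an empty search box" idea might look roughly like this in C#. This is only a sketch: it assumes a reference to the Word PIA, the class and method names are hypothetical, and it relies on the usual behaviour that a successful Range.Find.Execute redefines the range to the text that was found. The guard against a non-advancing search is a precaution, not something taken from this thread.

```csharp
using System;
using System.Collections.Generic;
using Word = Microsoft.Office.Interop.Word;

// Hypothetical helper class, for illustration only.
static class BoldTextSketch
{
    public static List<string> GetBoldRuns(Word.Document doc)
    {
        var runs = new List<string>();
        object missing = Type.Missing;
        object findText = "";                      // empty: search by formatting only
        object forward = true;
        object wrap = Word.WdFindWrap.wdFindStop;  // stop at the end, do not wrap around
        object format = true;

        Word.Range rng = doc.Content;
        rng.Find.ClearFormatting();
        rng.Find.Font.Bold = -1;                   // -1 means "true" for this property

        int lastEnd = -1;
        while (rng.Find.Execute(ref findText, ref missing, ref missing, ref missing,
                                ref missing, ref missing, ref forward, ref wrap,
                                ref format, ref missing, ref missing, ref missing,
                                ref missing, ref missing, ref missing))
        {
            // After a successful Execute, rng covers the bold text that was found.
            if (rng.End <= lastEnd) break;         // precaution against a search that stops advancing
            runs.Add(rng.Text);
            lastEnd = rng.End;
        }
        return runs;
    }
}
```

    Each entry in the returned list is one contiguous run of bold text, which may span more than one paragraph.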

     


    Cindy Meister, VSTO/Word MVP
    Monday, December 5, 2011 8:33 AM
    Moderator
  • Thanks for the trick!

    Now I do have the VBA macro. Basically this macro selects all the Arial Bold 11.5 point text and applies the "Heading 3" style. Here is the macro code:

    Sub Header3()
    '
    ' Header3 Macro
    '
    '
        Selection.Find.ClearFormatting
        With Selection.Find.Font
            .Size = 11.5
            .Bold = True
        End With
        Selection.Find.Replacement.ClearFormatting
        Selection.Find.Replacement.Style = ActiveDocument.Styles("Heading 3")
        With Selection.Find
            .Text = ""
            .Replacement.Text = ""
            .Forward = True
            .Wrap = wdFindContinue
            .Format = True
            .MatchCase = False
            .MatchWholeWord = False
            .MatchWildcards = False
            .MatchSoundsLike = False
            .MatchAllWordForms = False
        End With
        Selection.Find.Execute Replace:=wdReplaceAll
    End Sub

    1/ How do I convert this into C#?

    2/ I would like to check whether the seventh character of each piece of text with Heading 3 formatting is a number; if it is not a number, I would like to change the formatting to another style (Heading 4, for instance).

    Thanks for your help

     

    david

     

    Monday, December 5, 2011 1:01 PM
  • Hi David

    I've been away today, and it's the end of my day. Tomorrow is busy, but I'll try to get to this tomorrow. If you're in a hurry, or just feel like stretching your brain muscles :-), you'll find any number of C# Find/Replace examples in this forum and the VSTO forum (http://social.msdn.microsoft.com/Forums/en-US/vsto/threads) that could give you a head start on the equivalent C# syntax.

    Hints:
    1. Replace the Selection object with a Range object (Word.Range rng = theDocumentObject.Content) so that you're working with Range.Find.

    2. Instead of With object just repeat the basic object qualification for the properties and methods.

    3. C# can't "do" DocumentObject.Styles["index"], you'll need DocumentObject.get_Style("index")

    4. If this isn't .NET 4.0 you'll need to supply all the parameters for Find.Execute. In this case, some of the properties set prior to this point in the code could be set with those parameters (save some lines of code and avoid duplication).
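
    To make hints 1, 2 and 4 concrete, here is one possible translation of the Header3 macro for the .NET 4.0 case. It is a sketch only: it assumes C# 4 with a reference to the Word PIA, where named arguments can be used and "ref" can be omitted on COM interop calls, so only the parameters of interest are supplied to Find.Execute. The class and method names are hypothetical, and the built-in style enumeration is used in place of the style name.

```csharp
using System;
using Word = Microsoft.Office.Interop.Word;

// Hypothetical helper class, for illustration only.
static class Header3MacroSketch
{
    public static bool Apply(Word.Document doc)
    {
        // Hint 1: a Range takes the place of the VBA Selection object.
        Word.Range rng = doc.Content;

        // Hint 2: no "With" block, so the qualification is repeated.
        rng.Find.ClearFormatting();
        rng.Find.Font.Size = 11.5f;
        rng.Find.Font.Bold = -1; // -1 means "true" for this property

        rng.Find.Replacement.ClearFormatting();
        object heading3 = Word.WdBuiltinStyle.wdStyleHeading3;
        rng.Find.Replacement.set_Style(ref heading3);

        // Hint 4: with C# 4, unneeded optional parameters can simply be left out.
        return rng.Find.Execute(
            FindText: "",
            Forward: true,
            Wrap: Word.WdFindWrap.wdFindContinue,
            Format: true,
            ReplaceWith: "",
            Replace: Word.WdReplace.wdReplaceAll);
    }
}
```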

    Your request (2) will mean putting the Find/Replace into a loop, where you do only one replacement at a time. You'll also find examples for this, but it might be a bit tricky to get it working. Concentrate for the moment on (1) so that you get a feel for the basic syntax.
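
    One way to do the per-item check for request (2) is sketched below. Note that this is a different technique from the Find/Replace loop described above: it simply walks the document's paragraphs, which keeps the example short. It assumes a reference to the Word PIA and English style names ("Heading 3"); the class and method names are hypothetical.

```csharp
using System;
using Word = Microsoft.Office.Interop.Word;

// Hypothetical helper class, for illustration only.
static class Heading3CheckSketch
{
    // True when the seventh character (index 6) exists and is a digit.
    public static bool SeventhCharIsDigit(string text)
    {
        return text != null && text.Length >= 7 && char.IsDigit(text[6]);
    }

    // Changes Heading 3 paragraphs to Heading 4 when their
    // seventh character is not a digit.
    public static void DemoteHeadingsWithoutNumber(Word.Document doc)
    {
        object heading4 = Word.WdBuiltinStyle.wdStyleHeading4;

        foreach (Word.Paragraph para in doc.Paragraphs)
        {
            var style = (Word.Style)para.get_Style();
            // Assumption: English style names; localized versions of Word differ.
            if (style.NameLocal != "Heading 3") continue;

            if (!SeventhCharIsDigit(para.Range.Text))
            {
                para.set_Style(ref heading4);
            }
        }
    }
}
```

    Walking every paragraph can be slow on long documents, which is one reason a Find-based loop may still be preferable there.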


    Cindy Meister, VSTO/Word MVP
    Monday, December 5, 2011 5:12 PM
    Moderator