Algorithm for Extracting Data from Ill-formed Word Document

  • Question

  • Hi,

    This could be a design pattern issue or a .NET issue or a language issue. This thread follows from a suggestion in another forum to repost in this forum.

    I want to use Word Interop to go through a document and pull information out into an XML document. The document was created by a regular user, so there is no easy way to extract the data (no controls, no XML). I would like to develop some way of detecting keywords and storing the associated values in an object. If I cannot extract the odd document or two because its format is too divergent from the rest, that's okay. I plan to use specific words as the keywords, and they generally follow a particular order in the document. The data of interest is either the word, the paragraph or the paragraphs after the keyword.

    I need some ideas on how to plough through the document looking for keywords, in particular:

    1. the best way to use the keyword to find the value location;
    2. how to assign the right method of extracting the value based on the keyword (nextWord or nextParagraph);
    3. how to associate the location in the xml document with the keyword.
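
    One way to picture the three points above is a lookup table that maps each keyword to an extraction function and an XML element name. The following is only a minimal sketch, written in Python for brevity rather than VB.NET. It assumes the paragraph text has already been read out of the Word document into a list of strings, and every name in it (next_word, next_paragraph, RULES, the element names, the sample text) is hypothetical.

```python
import xml.etree.ElementTree as ET

def next_word(paragraphs, index, keyword):
    """Return the first word after the keyword within the same paragraph."""
    text = paragraphs[index]
    pos = text.lower().find(keyword.lower())
    rest = text[pos + len(keyword):].strip(" :\t")
    return rest.split()[0] if rest else ""

def next_paragraph(paragraphs, index, keyword):
    """Return the paragraph immediately after the keyword paragraph."""
    return paragraphs[index + 1] if index + 1 < len(paragraphs) else ""

# keyword -> (extraction function, XML element name); all entries are made up
RULES = {
    "course code": (next_word, "CourseCode"),
    "objectives": (next_paragraph, "Objectives"),
}

def extract(paragraphs):
    """Build an XML tree holding one element per keyword that was found."""
    root = ET.Element("Document")
    for keyword, (extractor, element_name) in RULES.items():
        for i, text in enumerate(paragraphs):
            if keyword.lower() in text.lower():
                value = extractor(paragraphs, i, keyword)
                ET.SubElement(root, element_name).text = value
                break  # use the first match only
    return root

sample = [
    "Course Code: ABC123 Semester 1",
    "Course Objectives",
    "Learn to extract data.",
]
root = extract(sample)
print(ET.tostring(root, encoding="unicode"))
```

    In .NET the same idea could probably be expressed with a dictionary whose values hold a delegate and an element name, so adding a keyword means adding one entry rather than another branch of code.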

    So it seems to be about the algorithm, but the solution might benefit from a better knowledge of .NET's capabilities than I have. I use VB.NET, but I don't think this is language specific, is it? I wonder if generics or delegates might help in this situation. For example, I recently discovered that when dealing with strings that represent a file path on a drive in the Windows OS, the System.IO Path class makes it easy to pull out part of the path (e.g. the folder name), rather than writing your own text manipulation.

    The best I can think of is to first develop a list of words or phrases that are keywords. Then, for each keyword, nominate whether I want the sentence, the list or a number of paragraphs after the keyword paragraph. I might also nominate a style that is used for that keyword. Then, I would process a list of documents, one by one, extracting the data and perhaps reporting whether all keywords were found.


    Some complications I expect:

    1. Each keyword may appear alongside other words in some documents. For example, the keyword "objectives" might be found under the heading, "Course Objectives."
    2. Some keywords might be absent from some documents, while others are mandatory.
    3. There may be a paragraph before the information that I need.
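
    These three cases could be folded into the keyword list itself. The sketch below, again in Python and again assuming the paragraphs are already available as a list of strings, gives each keyword a mandatory flag and a count of paragraphs to skip, matches the keyword anywhere inside a paragraph, and reports the mandatory keywords that were not found. KeywordRule, find_values and the sample text are hypothetical names used only for illustration.

```python
from dataclasses import dataclass

@dataclass
class KeywordRule:
    keyword: str
    mandatory: bool = True  # report the keyword if it is never found
    skip: int = 0           # paragraphs to skip after the keyword paragraph
    take: int = 1           # number of paragraphs to collect

def find_values(paragraphs, rules):
    """Return (values found per keyword, mandatory keywords not found)."""
    found, missing = {}, []
    for rule in rules:
        for i, text in enumerate(paragraphs):
            # Substring match, so "objectives" also matches "Course Objectives"
            if rule.keyword.lower() in text.lower():
                start = i + 1 + rule.skip
                found[rule.keyword] = paragraphs[start:start + rule.take]
                break
        else:
            # No paragraph matched this keyword
            if rule.mandatory:
                missing.append(rule.keyword)
    return found, missing

sample = [
    "Course Objectives",
    "On completion you will be able to:",
    "Extract data from documents.",
    "Assessment",
    "Two assignments.",
]
rules = [
    KeywordRule("objectives", skip=1),
    KeywordRule("assessment"),
    KeywordRule("prerequisites", mandatory=False),
    KeywordRule("textbook"),
]
found, missing = find_values(sample, rules)
print(found)
print(missing)
```

    A plain substring match can also hit the keyword in ordinary body text, so restricting the search to paragraphs with a nominated heading style would likely cut down on false matches.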

    As I said earlier, the aim with this kind of data gathering is not perfection, but to reduce the manual workload to just a few documents.

    Thanks for any ideas, references or suggestions.

    Tuesday, April 10, 2012 2:33 AM


All replies