Algorithm for Extracting Data from Ill-formed Word Document
-
Tuesday, April 10, 2012 2:33
Hi,
This could be a design pattern issue or a .NET issue or a language issue. This thread follows from a suggestion in another forum to repost in this forum.
I want to use Word Interop to go through a document and pull information out into an XML document. The document was created by a regular user, so there is no easy way to extract the data (no controls, no XML). I need to develop some way of detecting the keywords and storing the associated values in an object. If I cannot extract the odd document or two because its format diverges too far from the rest, that's okay. I plan to use specific words as the keywords, and they generally follow a particular order in the document. The data of interest is either the word, the paragraph or the paragraphs after the keyword.
I need some ideas on how to plough through the document looking for keywords, in particular:
- the best way to use the keyword to find the value location;
- how to assign the right method of extracting the value based on the keyword (nextWord or nextParagraph);
- how to associate the location in the xml document with the keyword.
So it seems to be about the algorithm, but the solution might be helped by knowing the capabilities of .NET better than I do. I use VB.NET, but I don't think this is language specific, is it? I wonder if generics or delegates might help in this situation. For example, I recently discovered that when dealing with strings that represent a file path on a Windows drive, the System.IO.Path class makes it easy to pull out part of the path (e.g. the folder name), rather than writing your own text manipulation.
The best I can think of is to first develop a list of words or phrases that are keywords. Then, for each keyword, nominate whether I want the sentence, the list or a number of paragraphs after the keyword paragraph. I might also nominate a style that is used for that keyword. Then, I would process a list of documents, one by one, extracting the data and perhaps reporting whether all keywords were found.
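A minimal sketch of this plan, written in Python for brevity (in VB.NET the same shape could be built from a list of rule objects holding a delegate). It assumes the document has already been reduced to a list of paragraph strings, and every name here (`Rule`, `next_word`, `next_paragraphs`, `extract_to_xml`) is hypothetical rather than part of any library. Each rule ties a keyword to an extraction method and to the XML element that receives the value:

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable, List, Tuple

# An extractor receives all paragraphs, the index of the paragraph
# containing the keyword, and the keyword itself.
Extractor = Callable[[List[str], int, str], str]


def next_word(paragraphs: List[str], index: int, keyword: str) -> str:
    """Return the first word after the keyword within the same paragraph."""
    text = paragraphs[index]
    start = text.lower().find(keyword.lower())
    rest = text[start + len(keyword):].lstrip(" :-\t").split()
    return rest[0].rstrip(".,;") if rest else ""


def next_paragraphs(count: int) -> Extractor:
    """Build an extractor that returns `count` paragraphs after the keyword."""
    def extract(paragraphs: List[str], index: int, keyword: str) -> str:
        return "\n".join(paragraphs[index + 1:index + 1 + count])
    return extract


@dataclass
class Rule:
    keyword: str        # word or phrase to look for
    element: str        # XML element name that receives the value
    extract: Extractor  # how to pull the value out once the keyword is found


def extract_to_xml(paragraphs: List[str],
                   rules: List[Rule]) -> Tuple[ET.Element, List[str]]:
    """Apply every rule; return the XML root and the keywords not found."""
    root = ET.Element("document")
    missing = []
    for rule in rules:
        index = next((i for i, p in enumerate(paragraphs)
                      if rule.keyword.lower() in p.lower()), None)
        if index is None:
            missing.append(rule.keyword)
            continue
        value = rule.extract(paragraphs, index, rule.keyword)
        ET.SubElement(root, rule.element).text = value
    return root, missing
```

Keeping the keyword, the extraction method and the target element together in one rule means a new keyword is a new table entry rather than new control flow. The returned list of missing keywords can feed the per-document report.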
Problems:
- Each keyword may exist with other words in some documents. For example, the keyword "objectives" might be found under the heading, "Course Objectives."
- Some keywords are optional and might not exist in every document, while others are mandatory.
- There may be a paragraph before the information that I need.
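A rough sketch of how these three problems might be handled, again in Python and again assuming the document is already a list of paragraph strings. The names `KeywordSpec`, `find_keyword` and `scan` are made up for illustration. The keyword is matched as a whole word anywhere in a paragraph (so "objectives" matches "Course Objectives"), each keyword carries a mandatory flag, and a `skip` count steps over an introductory paragraph before the value:

```python
import re
from typing import Dict, List, NamedTuple, Optional, Tuple


class KeywordSpec(NamedTuple):
    keyword: str
    mandatory: bool = True
    skip: int = 0  # paragraphs to step over between the keyword and the value


def find_keyword(paragraphs: List[str], keyword: str) -> Optional[int]:
    """Index of the first paragraph containing the keyword as a whole word."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for i, text in enumerate(paragraphs):
        if pattern.search(text):
            return i
    return None


def scan(paragraphs: List[str],
         specs: List[KeywordSpec]) -> Tuple[Dict[str, str], List[str]]:
    """Return the values found and the mandatory keywords that were not."""
    values: Dict[str, str] = {}
    missing_mandatory: List[str] = []
    for spec in specs:
        found = find_keyword(paragraphs, spec.keyword)
        value_index = None if found is None else found + 1 + spec.skip
        if value_index is None or value_index >= len(paragraphs):
            # Only mandatory keywords are reported; optional ones are ignored.
            if spec.mandatory:
                missing_mandatory.append(spec.keyword)
            continue
        values[spec.keyword] = paragraphs[value_index]
    return values, missing_mandatory
```

A document with a non-empty `missing_mandatory` list is a candidate for manual handling, which fits the aim of reducing the workload to a few documents.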
As I said earlier, the aim with this kind of data gathering is not perfection, but to reduce the manual workload to just a few documents.
Thanks for any ideas, references or suggestions.
All replies
-
Thursday, April 12, 2012 8:11 Moderator
Hi Em-squared,
Thank you for posting.
I will try to involve others who can help you. There might be some delay in the response. I appreciate your patience.
Best Regards,
Bruce Song [MSFT]
MSDN Community Support | Feedback to us
-
Monday, April 16, 2012 3:25
Hi,
I'd suggest looking at the OpenXML SDK for this requirement. The OpenXML SDK is capable of document manipulation.
Word, Excel and PowerPoint formats follow the Office Open XML File Format specification, in which XML is used to store the data. Refer to the samples at: http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=17985
I'm only trying to suggest a direction here, not to address the exact problems you run into.
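To illustrate what "XML is used to store the data" means in practice, here is a small sketch that reads a .docx without Word or the SDK, using only the Python standard library. It is not the OpenXML SDK API. It assumes the main document part sits at the conventional location `word/document.xml` inside the ZIP package (strictly, the location is given by the package relationships), and `read_paragraphs` is a hypothetical helper name:

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by elements such as w:p and w:t.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def read_paragraphs(docx):
    """Return a (style_id, text) pair for each paragraph in the document.

    `docx` is a path or a binary file-like object. style_id is None when
    the paragraph has no explicit style.
    """
    with zipfile.ZipFile(docx) as package:
        # Assumption: the main part is at the conventional path.
        root = ET.fromstring(package.read("word/document.xml"))

    result = []
    for p in root.iter(f"{{{W_NS}}}p"):
        style = p.find("w:pPr/w:pStyle", NS)
        style_id = style.get(f"{{{W_NS}}}val") if style is not None else None
        # A paragraph's text is split across runs; join every w:t in it.
        text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        result.append((style_id, text))
    return result
```

Called as `read_paragraphs("outline.docx")` (a made-up file name), this yields one pair per paragraph. Built-in headings usually carry a style id such as `Heading1` in English-language documents, which gives a keyword search something to filter on.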
thanks,
Dylan
-
Monday, April 16, 2012 7:17
For some code to find key terms in a file, using another file as the source, see my post of April 06, 2012 5:24 AM at: http://social.technet.microsoft.com/Forums/en-US/word/thread/228d49ed-53a4-487f-9829-316f76abbe13. The key terms can be single words, phrases, etc. The Do While .Find.Found loop can be used for extending/moving the 'found' range to encompass whatever related content you're interested in. As your specs are couched in very general terms, more specific advice can't be given at this time.
Cheers
Paul Edstein
[MS MVP - Word]
-
Tuesday, April 17, 2012 11:41
Thanks for the reply, macropod.
That's kind of the idea that I ended up going with. I thought it was brute force, but maybe it's the only way. I used VB.NET instead of VBA, and I had to limit the search to headings because otherwise I could end up with the following situation:
Requirements
The requirements for this unit of study are....
"Requirements" shows up twice, but I only want to find the heading. Not a big problem, really, but just thought there might be an easier way to match up the headings.
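The heading restriction can be sketched as follows, in Python rather than VB.NET and not the code actually used. It assumes each paragraph is available as a (style name, text) pair and that heading styles have names starting with "Heading"; `is_heading` and `section_after_heading` are hypothetical names:

```python
from typing import List, Optional, Tuple

Paragraph = Tuple[Optional[str], str]  # (style name, paragraph text)


def is_heading(style: Optional[str]) -> bool:
    """Assumption: heading styles are named 'Heading 1', 'Heading2', etc."""
    return style is not None and style.lower().startswith("heading")


def section_after_heading(paragraphs: List[Paragraph],
                          keyword: str) -> Optional[List[str]]:
    """Body text under the first heading containing the keyword.

    Collects paragraphs up to the next heading. Returns None if no
    heading matches, so body text that mentions the keyword is ignored.
    """
    body = None
    for style, text in paragraphs:
        if is_heading(style):
            if body is not None:
                break  # reached the next heading: the section is complete
            if keyword.lower() in text.lower():
                body = []  # start collecting after the matching heading
        elif body is not None:
            body.append(text)
    return body
```

With the example above, only the "Requirements" heading starts a section; the sentence beginning "The requirements for this unit of study" is collected as its content rather than matched as a second hit.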
-
Tuesday, April 17, 2012 11:50
Thanks, Dylan:
I'm thinking you might have something there. I'm tied up on another project right now, but are you thinking maybe if I had an OpenXML view of the document, then I might be able to use XSLT to find my sections and pull relevant nodes out after? I haven't used OpenXML, except a cursory read of it a couple years back. If that's what you had in mind and might think it will work, I'll give it a try, but it might take a couple weeks for me to get a bit of time. (Really bogged down now :-| )
-
Wednesday, April 18, 2012 15:47
Yes, that's what I had in mind. You can get familiar with OpenXML programming through this series of how-to articles:
How to: Accept All Revisions in a Word Processing Document
How to: Add Tables to Word Processing Documents
How to: Apply a Style to a Paragraph in a Word Processing Document
How to: Change the Print Orientation of a Word Processing Document
How to: Change Text in a Table in a Word Processing Document
How to: Convert a Word Processing Document from the DOCM to the DOCX File Format
How to: Create and Add a Character Style to a Word Processing Document
How to: Create and Add a Paragraph Style to a Word Processing Document
How to: Create a Word Processing Document by Providing a Filename
How to: Delete Comments By All or a Specific Author in a Word Processing Document
How to: Extract Styles from a Word Processing Document
How to: Insert a Comment into a Word Processing Document
How to: Insert a Picture into a Word Processing Document
How to: Insert a Table into a Word Processing Document
How to: Open and Add Text to a Word Processing Document
How to: Open a Word Processing Document for Read-only Access
How to: Open a Word Processing Document from a Stream
How to: Remove Hidden Text from a Word Processing Document
How to: Remove the Headers and Footers from a Word Processing Document
How to: Replace the Header in a Word Processing Document
How to: Replace the Styles Parts in a Word Processing Document
How to: Retrieve Comments from a Word Processing Document
How to: Retrieve Property Values from a Word Processing Document
How to: Set a Custom Property in a Word Processing Document
- Marked as answer by cjatms (Moderator), Thursday, April 19, 2012 19:35

