reading a simple table cell from a word document
-
Sunday, September 09, 2012 3:00 PM
hi,
i have a really basic question that i don't seem to understand:
i am trying to read a table from a word document which has 3 cells per row, everything works fine except one thing:
on cells that have more than one line of text, i don't know why but sometimes "word" saves the text in 2 paragraphs, sometimes in 2 different runs, sometimes in 2 different texts with a "br" in the middle and sometimes a combination of them all...
can anyone help me =[
i need a simple way of reading a cell content
thank you very much in advance
jony
All Replies
-
Monday, September 10, 2012 6:24 AM Moderator
Hi jony,
Thanks for posting in the MSDN Forum.
I hope the following snippet can help you.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using log4net;
using System.Windows.Forms;
using DocumentFormat.OpenXml.Packaging;
using System.IO.Packaging;
using System.IO;
using DocumentFormat.OpenXml.Wordprocessing;

namespace ConsoleApplication3
{
    class Program
    {
        [STAThread]
        static void Main(string[] args)
        {
            ILog log = log4net.LogManager.GetLogger(typeof(Program));

            // Let the user pick one or more Word documents.
            OpenFileDialog OFD = new OpenFileDialog();
            OFD.Filter = "Document|*.docx;*.docm";
            OFD.Multiselect = true;
            OFD.ShowDialog();
            string[] Paths = OFD.FileNames;

            foreach (string Path in Paths)
            {
                log.Info("Try open " + Path);
                try
                {
                    // Open read-only.
                    using (WordprocessingDocument wpd = WordprocessingDocument.Open(Path, false))
                    {
                        MainDocumentPart mdp = wpd.MainDocumentPart;
                        if (mdp != null)
                        {
                            log.Info("Can get Main Part");
                            Body body = mdp.Document.Body;
                            foreach (Table tb in mdp.Document.Descendants<Table>().ToList())
                            {
                                log.Info("Begin to retrieve table");
                                foreach (TableCell tc in tb.Descendants<TableCell>().ToList())
                                {
                                    // Log every text element (w:t) found in the cell.
                                    log.Info("===========Cell Content========");
                                    foreach (Text t in tc.Descendants<Text>().ToList())
                                        log.Info(t.Text);
                                    log.Info("===============================");
                                }
                                log.Info("End to retrieve table");
                            }
                        }
                    }
                }
                catch (Exception ex)
                {
                    log.Fatal(ex);
                }
            }
            Console.ReadKey();
        }
    }
}
Have a good day,
Tom
Tom Xu [MSFT]
MSDN Community Support | Feedback to us
-
Monday, September 10, 2012 8:05 AM
thank you very much for the reply, it really helped me [=
i do have a little problem though, how do i know when to break the line?
i tried to do something like this:
private string ExtractTextFromCell(TableCell cell)
{
    string content = string.Empty;
    foreach (Text text in cell.Descendants<Text>().ToList())
    {
        content += text.Text + "\n";
    }
    return content;
}
but for a line that looks like this in the table:
Hello. My name is Nitsana Bellehsen.
Today is August 11th, 2011.
i get something like this:
Hello. My name is
Nitsana
Bellehsen
.
Today is August 11th, 2011.
how do i format it?
thanks again for the help! [=
jony
-
Tuesday, September 11, 2012 9:05 AM
anyone?
I'm stuck =[
-
Wednesday, September 12, 2012 2:14 AM Moderator
Hi jony,
The snippet I showed does not take styles into account. It only iterates over the plain text content of the table cell. The style is set in the paragraph's ParagraphProperties node, in its ParagraphStyleId or ParagraphMarkRunProperties child node. If ParagraphStyleId is used, the style itself is defined in the StyleDefinitionsPart.
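As a rough illustration of where that style reference lives, here is a minimal sketch, assuming the Open XML SDK classes from DocumentFormat.OpenXml.Wordprocessing. The helper names GetStyleId and FindStyle are made up for this example and are not part of the SDK.

```csharp
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class StyleLookup
{
    // Hypothetical helper: returns the style id a paragraph refers to,
    // or null when the paragraph has no explicit ParagraphStyleId.
    public static string GetStyleId(Paragraph para)
    {
        return para.ParagraphProperties?.ParagraphStyleId?.Val?.Value;
    }

    // Hypothetical helper: finds the matching style definition in the
    // StyleDefinitionsPart, or null when there is none.
    public static Style FindStyle(MainDocumentPart mainPart, string styleId)
    {
        if (styleId == null || mainPart.StyleDefinitionsPart?.Styles == null)
            return null;

        return mainPart.StyleDefinitionsPart.Styles
            .Elements<Style>()
            .FirstOrDefault(s => s.StyleId?.Value == styleId);
    }
}
```

Note that a style describes formatting only; it does not say where the lines inside a cell break.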
Have a good day,
Tom
Tom Xu [MSFT]
-
Wednesday, September 12, 2012 9:10 AM
so it's a problem of styles?
the style determines how the line will be broken?
if so where can i get more info on this matter?
thank you again for the reply [=
jony
- Edited by jony feldman Wednesday, September 12, 2012 9:11 AM spelling mistake
-
Thursday, September 13, 2012 12:13 PM Moderator
so it's a problem of styles?
the style determines how the line will be broken?
Hi jony
The code in your sample extracts only the text elements from the WordOpenXML, with none of the elements that instruct Word how to lay the text out on the page.
I recommend you look at the "InnerXML" of such a TableCell. In it, you probably see not only the text, but other elements that define how the text should be formatted. If, for example, the words Nitsana and Bellehsen are formatted differently than the surrounding text, they won't be in the same "string" as the preceding and following text - they come in separately.
Your code would have to first pick up the paragraphs in the TableCell, then get the Text for each paragraph in order to distinguish things "line-by-line".
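To look at the markup of each cell as suggested above, something like the following minimal sketch could be used. It assumes the Open XML SDK; the helper name GetCellXml is made up for this example, and OuterXml is used instead of InnerXml so that the cell's own <w:tc> element is included.

```csharp
using System.Collections.Generic;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class CellXmlDump
{
    // Hypothetical helper: returns the raw WordprocessingML of every
    // table cell in the main document, in document order.
    public static List<string> GetCellXml(WordprocessingDocument doc)
    {
        return doc.MainDocumentPart.Document
            .Descendants<TableCell>()
            .Select(cell => cell.OuterXml)
            .ToList();
    }
}
```

Opening the file with WordprocessingDocument.Open(path, false) and writing each returned string to the console should show which paragraphs (w:p), runs (w:r), texts (w:t) and breaks (w:br) a given cell really contains.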
Cindy Meister, VSTO/Word MVP, my blog
-
Sunday, September 16, 2012 8:00 AM
is there an example somewhere?
or some place i can read more about this?
thanks a lot!
jony
-
Monday, September 17, 2012 7:07 AM Moderator
Hi jony,
I think Cindy is recommending that you use the PIA to retrieve the table cell information, like this:
foreach (Word.Table table in Application.ActiveDocument.Tables)
{
    foreach (Word.Row row in table.Rows)
    {
        foreach (Word.Cell cell in row.Cells)
        {
            foreach (Word.Paragraph p in cell.Range.Paragraphs)
            {
                Debug.Print(p.Range.XML);
            }
        }
    }
}
Is that right, Cindy?
Have a good day,
Tom
Tom Xu [MSFT]
-
Monday, September 17, 2012 8:06 AM Moderator
Hi Tom
No, I wasn't recommending that, really, as this is an Open XML forum :-)
What I was saying is that
1. Jony should take a moment to look at a table in the UI, then at the corresponding WordOpenXML and compare my comments with what he sees.
2. If he really wants just the text from the table cells, then he should do something like the following, which extracts the text from each paragraph of every cell in the document.
string cellText = "";
using (WordprocessingDocument docFile = WordprocessingDocument.Open(filename, false))
{
    Document docMainStory = docFile.MainDocumentPart.Document;
    IEnumerable<TableCell> cels = docMainStory.Descendants<TableCell>();
    foreach (TableCell cel in cels)
    {
        IEnumerable<Paragraph> paras = cel.Descendants<Paragraph>();
        foreach (Paragraph para in paras)
        {
            // Concatenate the text of all runs in this paragraph
            IEnumerable<Run> paraContent = para.Descendants<Run>();
            foreach (Run r in paraContent)
            {
                cellText += r.InnerText;
            }
            // One line per paragraph
            cellText += "\r\n";
        }
        //Show each cell's content
        MessageBox.Show(cellText);
        cellText = String.Empty;
    }
}
Cindy Meister, VSTO/Word MVP, my blog
- Edited by Cindy Meister MVP, Moderator Monday, September 17, 2012 8:07 AM edited code
- Marked As Answer by Cindy Meister MVP, Moderator Tuesday, September 25, 2012 4:42 AM
-
Thursday, September 20, 2012 10:46 AM
the problem is i get the file from the customer,
so i can't assume much about the styles, all i know is that he sees 2 lines of text,
i don't really care about the style either, i just need to separate the 2 lines =\
thank you again for the help!
jony [=
-
Monday, September 24, 2012 6:08 AM Moderator
Hi jony,
i just need to separate the 2 lines
What do you mean by "2 lines"? Do you mean two paragraphs? As far as I know, Word automatically wraps a paragraph onto the next line if the paragraph is longer than the table cell is wide. However, we can only see the paragraph in the WordML; we don't know how many lines the paragraph is divided into.
I suggest you pay more attention to Cindy's suggestion. It really makes sense.
@Cindy,
Thanks for your great work.
Have a good day,
Tom
Tom Xu [MSFT]
-
Monday, September 24, 2012 8:07 AM
i tried what cindy said,
i don't understand how "word" saves the info
as i said in the first post, it sometimes divides the lines into paragraphs,
sometimes it divides a line of text into 2 runs or 2 texts inside a run,
sometimes it divides the lines into 3 runs, where the middle run is a break,
and sometimes it divides the text into 2 runs in which the first run contains the text + break,
i just don't understand the why,
the code i had to write to deal with this seems really messy to me,
what am i missing here?
thank you again for all the help [=
jony
-
Monday, September 24, 2012 8:59 AM Moderator
Hi jony
There are a number of factors involved with whether runs (w:r) divide up text (w:t) or not. If, for example, formatting has been applied to some of the text, but not other parts, that will generate separate runs for different formatting.
If ENTER or SHIFT+ENTER was pressed, that will also divide the text up: SHIFT+ENTER inserts a line break (<w:br/>) inside a run, while ENTER starts a new paragraph (<w:p>). In the second case the runs must be "closed" first, as Word Open XML may NOT do this:
<w:p><w:r><w:t>some text here<w:p>more text here</w:t></w:r></w:p>
Since the above is NOT allowed, it must be like this:
<w:p><w:r><w:t>some text here</w:t></w:r></w:p><w:p><w:r><w:t>more text here</w:t></w:r></w:p>
The <w:p> (in Word, a paragraph and what I believe you mean with "break") is not part of the text. In Word it does not equate to "\r", even though C# can generate a paragraph with that symbol. This is because Word stores formatting information as part of the paragraph, just as it also stores formatting information as part of a run. This is how Word was designed to work more than twenty years ago, before the days of XML. The Word Open XML reflects Word's internal architecture and structures.
These are the rules for Word Open XML. We cannot answer why the rules were made like this - they just were.
When you work with Word Open XML using the Open XML SDK it helps a lot to familiarize yourself with the underlying XML. The Open XML SDK will not make a lot of sense without understanding the XML it's manipulating. And, at the next level, a basic understanding of how the Word application works helps to understand why the Word Open XML is the way it is.
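One way to keep the "check all possibilities" code small is to ignore how the text is split into runs and simply walk each paragraph in document order. The following is only a minimal sketch under stated assumptions: it uses the Open XML SDK, the helper name GetCellText is made up for this example, and it treats every <w:br/> as a line break regardless of its type.

```csharp
using System.Text;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Wordprocessing;

public static class CellTextReader
{
    // Hypothetical helper: returns the cell's text with one line per
    // paragraph and per manual line break.
    public static string GetCellText(TableCell cell)
    {
        var sb = new StringBuilder();
        bool firstParagraph = true;

        // Elements<Paragraph>() takes only the cell's direct child
        // paragraphs; paragraphs of a nested table are not visited.
        foreach (Paragraph para in cell.Elements<Paragraph>())
        {
            // ENTER: each new paragraph starts a new line.
            if (!firstParagraph) sb.Append('\n');
            firstParagraph = false;

            // Walk everything below the paragraph in document order, so
            // the number of runs the text is split into does not matter.
            foreach (OpenXmlElement el in para.Descendants())
            {
                if (el is Text t) sb.Append(t.Text);        // <w:t>
                else if (el is Break) sb.Append('\n');      // <w:br/>, SHIFT+ENTER
                else if (el is TabChar) sb.Append('\t');    // <w:tab/>
            }
        }
        return sb.ToString();
    }
}
```

With this approach, text that is split across several runs only because of formatting stays on one line, and a new line is started only where the XML contains a paragraph boundary or a break element.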
Cindy Meister, VSTO/Word MVP, my blog
- Marked As Answer by jony feldman Monday, September 24, 2012 9:40 AM
-
Monday, September 24, 2012 9:40 AM
thank you very much for the reply
it's like i thought then, i have no other way than the "messy" check-all-possibilities way,
anyway,
thank you very much for the long and deep explanation, it really helped me understand the OpenXml world much better [=
i'll post more questions as they come ;),
thanks again
jony

