How to search for a word in an RTF string while skipping the RTF tags
-
09 August 2012 11:06
I want to write a regex that searches for a character or a word in an RTF string. The search should be performed only in the text of the RTF and not within its tags.
Below is an example scenario.
Let's say a part of my RTF is as follows:
-----------------
{\rtf1\sste16000\ansi\deflang1033\ftnbj\uc1\deff0
{\fonttbl{\f0 \fmodern \fcharset0 Courier New;}{\f1 \fnil Courier New;}{\f2 \fnil Arial;}{\f3 \fnil \fcharset0 Times New Roman;}{\f4 \fnil \fcharset2 Wingdings;}{\f5 \fnil \fcharset0 Arial;}{\f6 \fswiss \fcharset0 Arial;}{\f7 \fnil Times New Roman;}{\f8 \fnil Wingdings;}{\f9 \fnil Symbol;}}
{\colortbl ;\red255\green255\blue255 ;\red255\green0\blue0 ;\red0\green0\blue0 ;\red0\green0\blue0 ;\red255\green255\blue0;}
{\stylesheet{\f1\fs24 Normal;}{\cs1 Default Paragraph Font;}{\s2\snext2\f2\fs48\b\cf2\fi0\li0\ri0\qc EC Title;}{\s3\snext3\f3\fs24\b\caps\cf2\fi0\li0\ri0 EC Section Heading 2;}{\s4\snext4\f3\fs24\fi-360\li360\ri0 EC Bullet;}{\s5\snext5\f3\fs24\fi-360\li360\ri0\ls2\ilvl0
EC Numbers;}{\s6\snext6\f3\fs24\cf3\fi0\li0\ri0 EC Normal;}}
{\*\revtbl{Unknown;}}
{\*\listtable
{\list\listtemplateid1
{\listlevel\levelnfc1\levelfollow0\levelstartat1\levelindent360{\leveltext \'02\'00.}{\levelnumbers \'01}\fcs1\f2\af2\fcs0\rtlch\f2\af2\ltrch}
{\listlevel\levelnfc3\levelfollow0\levelstartat1\levelindent360{\leveltext \'02\'01.}{\levelnumbers \'01}\fcs1\f2\af2\fcs0\rtlch\f2\af2\ltrch}
{\listlevel\levelnfc0\levelfollow0\levelstartat1\levelindent360{\leveltext \'02\'02.}{\levelnumbers \'01}\fcs1\f2\af2\fcs0\rtlch\f2\af2\ltrch}
{\listlevel\levelnfc4\levelfollow0\levelstartat1\levelindent360{\leveltext \'02\'03)}{\levelnumbers \'01}\fcs1\f2\af2\fcs0\rtlch\f2\af2\ltrch}
{\listlevel\levelnfc2\levelfollow0\levelstartat1\levelindent360{\leveltext \'03(\'04)}{\levelnumbers \'02}\fcs1\f2\af2\fcs0\rtlch\f2\af2\ltrch}
{\listlevel\levelnfc4\levelfollow0\levelstartat1\levelindent360{\leveltext \'03(\'05)}{\levelnumbers \'02}\fcs1\f2\af2\fcs0\rtlch\f2\af2\ltrch}
{\listlevel\levelnfc0\levelfollow0\levelstartat1\levelindent360{\leveltext \'03(\'06)}{\levelnumbers \'02}\fcs1\f2\af2\fcs0\rtlch\f2\af2\ltrch}
{\listlevel\levelnfc0\levelfollow0\levelstartat1\levelindent360{\leveltext \'03(\'07)}{\levelnumbers \'02}\fcs1\f2\af2\fcs0\rtlch\f2\af2\ltrch}
{\listlevel\levelnfc0\levelfollow0\levelstartat1\levelindent360{\leveltext \'03(\'08)}{\levelnumbers \'02}\fcs1\f2\af2\fcs0\rtlch\f2\af2\ltrch}
{\listname ;}\listid1
}
{\list\listtemplateid2\listsimple
{\listlevel\levelnfc0\levelfollow0\levelstartat1\levelindent360{\leveltext \'02\'00.}{\levelnumbers \'01}\f3\fcs1\f3\af3\fcs0\rtlch\f3\af3\ltrch}
{\listname ;}\listid2
}
{\list\listtemplateid4\listsimple
{\listlevel\levelnfc23\levelfollow0\levelstartat1\levelindent360{\leveltext \'02\'d8\'00}{\levelnumbers \'02}\f4\fs20\cf2}
{\listname ;}\listid4
}
}
{\*\listoverridetable
{\listoverride\listid1\listoverridecount0\ls1}
{\listoverride\listid2\listoverridecount0\ls2}
{\listoverride\listid4\listoverridecount0\ls3}
}
\paperw12240\paperh15840\margl1800\margr1800\margt1440\margb1440\headery720\footery720\nogrowautofit\deftab720\formshade\fet4\aendnotes\aftnnrlc\pgbrdrhead\pgbrdrfoot
\sectd\pgwsxn12240\pghsxn15840\marglsxn1800\margrsxn1800\margtsxn1440\margbsxn1440\headery720\footery720\sbkpage\pgncont\pgndec
\plain\plain\f1\fs24\pard\ssparaaux0\s2\ltrpar\qc\plain\f1\fs24\plain\f5\fs48\lang1033\hich\f5\dbch\f5\loch\f5\cf2\fs48\ltrch\bThis is the starting of my Paragraph\par \pard\ssparaaux0\s0\ltrpar\ql\plain\f1\fs24\plain\f3\fs24\lang1033\hich\f3\dbch\f3\loch\f3\cf3\fs24\ltrch\par
{\listtext\pard\plain\f7\fs24 1.\tab}
\pard\ssparaaux0\s5\tx360\ls2\ilvl0\fi-360\li360\ltrpar\ql\plain\f1\fs24\plain\f3\fs24\lang1033\hich\f3\dbch\f3\loch\f3\fs24\ltrch Throw away all cigarettes.\par
{\listtext\pard\plain\f7\fs24 2.\tab}
Prepare a list of your priorities.\par
{\listtext\pard\plain\f7\fs24 3.\tab}
----------------- (truncated here because the RTF is very large)
In the RTF above, if I write a regex to search for the word 'Paragraph' (I have made it bold in the RTF for easy identification), it should match only the occurrence in the RTF text - This is the starting of my Paragraph. It should not match the occurrence inside RTF tags - {\cs1 Default Paragraph Font;}. Likewise, if I search for 'List', it should find it in the line - Prepare a list of your priorities.\par. It should not locate the one that occurs in - {\list\listtemplateid1
In the same way, tags such as - \sectd\pgwsxn12240\pghsxn15840\marglsxn1800 - should not be searched when I search for a word.
The same should apply when I search for a single character in the entire RTF. Only characters that are part of the RTF text or content should be matched, not characters that occur within RTF tags.
Can anyone help with this?
Thanks in advance,
ashutoshx
- Edited by ashutoshx 09 August 2012 11:10
All Replies
-
09 August 2012 14:09
To get just the text of an RTF document, load it into a RichTextBox and read its Text property:
// Requires: using System.IO; using System.Windows.Forms;
RichTextBox rtBox = new RichTextBox();
string example = File.ReadAllText("example.rtf");
rtBox.Rtf = example; // the control parses the RTF markup
string exampleUnformatted = rtBox.Text; // plain text, tags and formatting removed
This removes all the tags and formatting and gives you just the text.
-
10 August 2012 4:56
Thanks for the reply JohnGrove, but in my case I don't want to extract the text from the tags and then find the word.
Basically, my requirement is as follows -
I want to find all the occurrences of a given word in the RTF body (excluding occurrences inside the RTF tags). Once I have found them, I want to put the highlight tag around the located words so that I can highlight them in yellow (as most websites do).
To achieve this, I cannot extract the text part of the RTF first, put the highlight tag around the searched word, and then put the text back into the original RTF string. I need something like a regex that locates the word (or character) only in the RTF body (text) and not within its tags, so that I don't end up putting the highlight tag around words that are located within RTF tags.
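The requirement above can be approached with a small scanner instead of a single regex: walk the RTF once, track group nesting, skip groups such as the font table and style sheet, and remember which slice of the raw RTF produced each body character. What follows is only a rough sketch in Python (the same logic should port to C#). The names `body_text_with_spans` and `find_in_body`, the list of skipped destinations, and the sample string are assumptions made for illustration; `\u` Unicode escapes and `\bin` data are not handled.

```python
import re

# Groups that start with one of these control words hold no body text.
# Illustrative list only (an assumption); real documents may need more entries.
SKIPPED_DESTINATIONS = {
    "fonttbl", "colortbl", "stylesheet", "info", "pict",
    "listtable", "listoverridetable", "revtbl", "listtext",
}
CONTROL_WORD = re.compile(r"\\([a-zA-Z]+)(-?\d+)? ?")  # e.g. \fs24 or \par
HEX_ESCAPE = re.compile(r"\\'([0-9a-fA-F]{2})")        # e.g. \'d8


def body_text_with_spans(rtf):
    """Return (text, spans); spans[i] is the (start, end) slice of `rtf`
    that produced text[i]."""
    text, spans = [], []
    skip = [False]  # one flag per open group: True means "not body text"
    i = 0
    while i < len(rtf):
        ch = rtf[i]
        start = i
        emitted = None
        if ch == "{":
            skip.append(skip[-1])  # a nested group inherits its parent's flag
            i += 1
        elif ch == "}":
            if len(skip) > 1:
                skip.pop()
            i += 1
        elif ch == "\\":
            word = CONTROL_WORD.match(rtf, i)
            hexa = HEX_ESCAPE.match(rtf, i)
            if word:
                name = word.group(1)
                if name in SKIPPED_DESTINATIONS:
                    skip[-1] = True
                elif name in ("par", "line"):
                    emitted = "\n"
                elif name == "tab":
                    emitted = "\t"
                i = word.end()
            elif hexa:
                byte = bytes([int(hexa.group(1), 16)])
                emitted = byte.decode("cp1252", "replace")
                i = hexa.end()
            else:
                symbol = rtf[i + 1:i + 2]
                if symbol == "*":
                    skip[-1] = True          # \* marks an ignorable destination
                elif symbol in ("\\", "{", "}"):
                    emitted = symbol         # escaped literal character
                i += 2
        else:
            if ch not in "\r\n":             # raw line breaks are not text
                emitted = ch
            i += 1
        if emitted is not None and not skip[-1]:
            text.append(emitted)
            spans.append((start, i))
    return "".join(text), spans


def find_in_body(rtf, needle, ignore_case=True):
    """Return the (start, end) slices of `rtf` for body-text matches only."""
    if not needle:
        return []
    text, spans = body_text_with_spans(rtf)
    flags = re.IGNORECASE if ignore_case else 0
    return [
        (spans[m.start()][0], spans[m.end() - 1][1])
        for m in re.finditer(re.escape(needle), text, flags)
    ]


# Hypothetical, shortened stand-in for the RTF in the question.
sample = (
    r"{\rtf1\ansi{\fonttbl{\f0 Arial;}}"
    r"{\stylesheet{\f0\fs24 Normal;}{\cs1 Default Paragraph Font;}}"
    r"{\*\listtable{\list\listtemplateid1}}"
    r"\pard\f0\b This is the starting of my Paragraph\par "
    r"Prepare a list of your priorities.\par}"
)
for start, end in find_in_body(sample, "paragraph"):
    print(start, end, sample[start:end])
```

With this sample, a search for "paragraph" reports only the occurrence in the body and ignores "Default Paragraph Font" in the style sheet. A match that straddles a formatting change maps to a slice that also contains control words, so that case needs extra care before anything is wrapped around it.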
Regards,
Ashutoshx
- Edited by ashutoshx 10 August 2012 4:58
-
10 August 2012 11:28
No, impossible.
Ghost,
Call me ghost for short, Thanks
To get the better answer, it should be a better question.
-
13 August 2012 6:53
So you can try searching the plain text in a RichTextBox to find the location, and then search the RTF file to find the tags.
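Once the positions of the matches in the raw RTF are known, adding the highlight is plain string surgery. The sketch below (Python, illustrative only) assumes the (start, end) offsets have already been computed by some tag-aware search, that the spans do not overlap, and that each span contains balanced braces. `highlight_spans` is a made-up helper name and the tiny RTF string is invented for the example. `\highlightN` refers to entry N of the color table; in the RTF from the question, entry 5 looks like the yellow one (\red255\green255\blue0), counting the empty first entry as 0.

```python
def highlight_spans(rtf, spans, color_index):
    """Wrap each (start, end) slice of `rtf` in a {\\highlightN ...} group."""
    out = rtf
    # Insert from the last span to the first so earlier offsets stay valid.
    for start, end in sorted(spans, reverse=True):
        out = (
            out[:start]
            + "{\\highlight%d " % color_index
            + out[start:end]
            + "}"
            + out[end:]
        )
    return out


# Invented example: color table entry 1 is yellow.
rtf = r"{\rtf1{\colortbl ;\red255\green255\blue0;}\pard a list, another list\par}"

# Offsets assumed to come from a tag-aware search; here they are located by
# hand because "list" occurs only in the body of this tiny sample.
first = rtf.index("list")
second = rtf.index("list", first + 1)
spans = [(first, first + 4), (second, second + 4)]

print(highlight_spans(rtf, spans, 1))
```

Wrapping each match in its own group keeps the highlight from leaking into the text that follows, because the formatting change ends at the closing brace.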
Ghost,
Call me ghost for short, Thanks
To get the better answer, it should be a better question.
-
12 September 2012 14:13
Please take a look at this post:
http://stackoverflow.com/questions/188545/regular-expression-for-extracting-text-from-an-rtf-string
- Proposed as Answer by Mike Feng, Microsoft Contingent Staff, Moderator 17 September 2012 2:36
- Marked as Answer by Mike Feng, Microsoft Contingent Staff, Moderator 24 September 2012 14:51