How to search for a word in an RTF string while skipping the RTF tags

  • Thursday, August 09, 2012 11:06 AM
     
     

    I wish to write a regex for searching for a character or a word in an RTF string. However, it should also ensure that the search is performed only in the text of the RTF and not within its tags.

    Here is an example scenario.

    Let's say part of my RTF is as follows:

    -----------------

    {\rtf1\sste16000\ansi\deflang1033\ftnbj\uc1\deff0
    {\fonttbl{\f0 \fmodern \fcharset0 Courier New;}{\f1 \fnil Courier New;}{\f2 \fnil Arial;}{\f3 \fnil \fcharset0 Times New Roman;}{\f4 \fnil \fcharset2 Wingdings;}{\f5 \fnil \fcharset0 Arial;}{\f6 \fswiss \fcharset0 Arial;}{\f7 \fnil Times New Roman;}{\f8 \fnil Wingdings;}{\f9 \fnil Symbol;}}
    {\colortbl ;\red255\green255\blue255 ;\red255\green0\blue0 ;\red0\green0\blue0 ;\red0\green0\blue0 ;\red255\green255\blue0;}
    {\stylesheet{\f1\fs24 Normal;}{\cs1 Default Paragraph Font;}{\s2\snext2\f2\fs48\b\cf2\fi0\li0\ri0\qc EC Title;}{\s3\snext3\f3\fs24\b\caps\cf2\fi0\li0\ri0 EC Section Heading 2;}{\s4\snext4\f3\fs24\fi-360\li360\ri0 EC Bullet;}{\s5\snext5\f3\fs24\fi-360\li360\ri0\ls2\ilvl0
    EC Numbers;}{\s6\snext6\f3\fs24\cf3\fi0\li0\ri0 EC Normal;}}
    {\*\revtbl{Unknown;}}
    {\*\listtable
    {\list\listtemplateid1
    {\listlevel\levelnfc1\levelfollow0\levelstartat1\levelindent360{\leveltext \'02\'00.}{\levelnumbers \'01}\fcs1\f2\af2\fcs0\rtlch\f2\af2\ltrch}
    {\listlevel\levelnfc3\levelfollow0\levelstartat1\levelindent360{\leveltext \'02\'01.}{\levelnumbers \'01}\fcs1\f2\af2\fcs0\rtlch\f2\af2\ltrch}
    {\listlevel\levelnfc0\levelfollow0\levelstartat1\levelindent360{\leveltext \'02\'02.}{\levelnumbers \'01}\fcs1\f2\af2\fcs0\rtlch\f2\af2\ltrch}
    {\listlevel\levelnfc4\levelfollow0\levelstartat1\levelindent360{\leveltext \'02\'03)}{\levelnumbers \'01}\fcs1\f2\af2\fcs0\rtlch\f2\af2\ltrch}
    {\listlevel\levelnfc2\levelfollow0\levelstartat1\levelindent360{\leveltext \'03(\'04)}{\levelnumbers \'02}\fcs1\f2\af2\fcs0\rtlch\f2\af2\ltrch}
    {\listlevel\levelnfc4\levelfollow0\levelstartat1\levelindent360{\leveltext \'03(\'05)}{\levelnumbers \'02}\fcs1\f2\af2\fcs0\rtlch\f2\af2\ltrch}
    {\listlevel\levelnfc0\levelfollow0\levelstartat1\levelindent360{\leveltext \'03(\'06)}{\levelnumbers \'02}\fcs1\f2\af2\fcs0\rtlch\f2\af2\ltrch}
    {\listlevel\levelnfc0\levelfollow0\levelstartat1\levelindent360{\leveltext \'03(\'07)}{\levelnumbers \'02}\fcs1\f2\af2\fcs0\rtlch\f2\af2\ltrch}
    {\listlevel\levelnfc0\levelfollow0\levelstartat1\levelindent360{\leveltext \'03(\'08)}{\levelnumbers \'02}\fcs1\f2\af2\fcs0\rtlch\f2\af2\ltrch}
    {\listname ;}\listid1
    }
    {\list\listtemplateid2\listsimple
    {\listlevel\levelnfc0\levelfollow0\levelstartat1\levelindent360{\leveltext \'02\'00.}{\levelnumbers \'01}\f3\fcs1\f3\af3\fcs0\rtlch\f3\af3\ltrch}
    {\listname ;}\listid2
    }
    {\list\listtemplateid4\listsimple
    {\listlevel\levelnfc23\levelfollow0\levelstartat1\levelindent360{\leveltext \'02\'d8\'00}{\levelnumbers \'02}\f4\fs20\cf2}
    {\listname ;}\listid4
    }
    }
    {\*\listoverridetable
    {\listoverride\listid1\listoverridecount0\ls1}
    {\listoverride\listid2\listoverridecount0\ls2}
    {\listoverride\listid4\listoverridecount0\ls3}
    }
    \paperw12240\paperh15840\margl1800\margr1800\margt1440\margb1440\headery720\footery720\nogrowautofit\deftab720\formshade\fet4\aendnotes\aftnnrlc\pgbrdrhead\pgbrdrfoot
    \sectd\pgwsxn12240\pghsxn15840\marglsxn1800\margrsxn1800\margtsxn1440\margbsxn1440\headery720\footery720\sbkpage\pgncont\pgndec
    \plain\plain\f1\fs24\pard\ssparaaux0\s2\ltrpar\qc\plain\f1\fs24\plain\f5\fs48\lang1033\hich\f5\dbch\f5\loch\f5\cf2\fs48\ltrch\b

     This is the starting of my Paragraph\par \pard\ssparaaux0\s0\ltrpar\ql\plain\f1\fs24\plain\f3\fs24\lang1033\hich\f3\dbch\f3\loch\f3\cf3\fs24\ltrch\par
    {\listtext\pard\plain\f7\fs24 1.\tab}
    \pard\ssparaaux0\s5\tx360\ls2\ilvl0\fi-360\li360\ltrpar\ql\plain\f1\fs24\plain\f3\fs24\lang1033\hich\f3\dbch\f3\loch\f3\fs24\ltrch Throw away all cigarettes.\par
    {\listtext\pard\plain\f7\fs24 2.\tab}
    Prepare a list of your priorities.\par
    {\listtext\pard\plain\f7\fs24 3.\tab}

    ----------truncated further as the RTF was very big

    In the RTF above, if I write a regex to search for the word 'Paragraph' (I made it bold in the RTF for easy identification), it should match only the occurrence in the RTF text - This is the starting of my Paragraph. It should not match the occurrence inside the RTF tags - {\cs1 Default Paragraph Font;}. Likewise, if I search for 'List', it should match the line - Prepare a list of your priorities.\par. It should not match the occurrence in - {\list\listtemplateid1

    In the same way, tags such as - \sectd\pgwsxn12240\pghsxn15840\marglsxn1800 - should not be searched when I look for a word.

    The same applies when I search for a single character in the entire RTF. Only characters that are part of the RTF text (content) should be matched, not characters that occur within RTF tags.

    Can anyone help with this?

    Thanks in advance,

    ashutoshx


    • Edited by ashutoshx Thursday, August 09, 2012 11:10 AM

All Replies

  • Thursday, August 09, 2012 2:09 PM
     
     

    What you should do in order to get 'just' the text of an rtf is this:

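    using System.IO;
    using System.Windows.Forms;

    // Load the raw RTF markup into a RichTextBox (it does not have to be shown on a form).
    RichTextBox rtBox = new RichTextBox();
    string example = File.ReadAllText("example.rtf");
    rtBox.Rtf = example;

    // The Text property returns the content with the RTF markup stripped.
    string exampleUnformatted = rtBox.Text;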
    This will remove all the tags, all the formatting and give you JUST the text.

    • Edited by JohnGrove Thursday, August 09, 2012 2:11 PM
    • Edited by JohnGrove Thursday, August 09, 2012 2:27 PM
  • Friday, August 10, 2012 4:56 AM
     
     

    Thanks for the reply, JohnGrove, but in my case I don't want to extract the text from the tags and then find the word.

    Basically, my requirement is as follows -

    I wish to find all the occurrences of a given word in the RTF body (excluding any occurrences inside the RTF tags). Once I have found them, I wish to put a highlight tag around each located word so that it is highlighted in yellow (as most websites do).

    To achieve this, I cannot extract the text part of the RTF first, put the highlight tag around the searched word, and then put the text back into the original RTF string. I need something like a regex that locates the word (or character) only in the RTF body (text) and not within its tags, so that I don't end up putting the highlight tag around words that are located within RTF tags.

    Regards,

    Ashutoshx


    • Edited by ashutoshx Friday, August 10, 2012 4:58 AM
  • Friday, August 10, 2012 11:28 AM
     
     
    No, Impossible.

    Ghost,
    Call me ghost for short, Thanks
    To get the better answer, it should be a better question.

  • Monday, August 13, 2012 6:53 AM
     
     
    So you can try searching the plain text in a RichTextBox to find the location of the word, and then search the RTF to find the tags around that location.
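
As a rough illustration of this idea without a RichTextBox, the sketch below walks the RTF once, builds the plain text, and records for every plain-text character its offset in the raw RTF. A match found in the plain text can then be mapped back to the RTF and wrapped in a highlight group. Everything here is an assumption made for illustration: RtfBodySearch, FindInBody and Highlight are hypothetical names, the list of skipped destination groups is incomplete, \'hh escapes and \u characters are ignored, and only matches that are stored contiguously in the RTF source get highlighted. It is not a full RTF parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration; not an existing library type.
public static class RtfBodySearch
{
    // Assumed, incomplete set of destination groups whose text is not body text.
    static readonly HashSet<string> Destinations = new HashSet<string>
        { "fonttbl", "colortbl", "stylesheet", "info", "pict" };

    static bool IsAsciiLetter(char c) { return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'); }
    static bool IsAsciiDigit(char c) { return c >= '0' && c <= '9'; }

    // Returns the offsets in the raw RTF where `word` starts as body text.
    public static List<int> FindInBody(string rtf, string word)
    {
        var text = new StringBuilder();   // body text only
        var map = new List<int>();        // map[k] = RTF offset of text[k]
        var saved = new Stack<bool>();    // skip state of the enclosing groups
        bool skip = false;
        int i = 0;
        while (i < rtf.Length)
        {
            char c = rtf[i];
            if (c == '{') { saved.Push(skip); i++; }
            else if (c == '}') { if (saved.Count > 0) skip = saved.Pop(); i++; }
            else if (c == '\\' && i + 1 < rtf.Length)
            {
                int slash = i++;
                if (IsAsciiLetter(rtf[i]))
                {
                    // Control word: letters, optional signed number, optional one space.
                    int start = i;
                    while (i < rtf.Length && IsAsciiLetter(rtf[i])) i++;
                    string name = rtf.Substring(start, i - start);
                    if (i + 1 < rtf.Length && rtf[i] == '-' && IsAsciiDigit(rtf[i + 1])) i++;
                    while (i < rtf.Length && IsAsciiDigit(rtf[i])) i++;
                    if (i < rtf.Length && rtf[i] == ' ') i++;
                    if (Destinations.Contains(name)) skip = true;
                    else if (!skip && (name == "par" || name == "line" || name == "tab"))
                    { text.Append('\n'); map.Add(slash); }   // keep words apart
                }
                else if (rtf[i] == '*') { skip = true; i++; }   // {\*\... } group
                else if (rtf[i] == '\'') { i += 3; }            // \'hh escape, ignored here
                else
                {
                    // Escaped literal such as \\ \{ \}; other symbols are treated alike.
                    if (!skip) { text.Append(rtf[i]); map.Add(i); }
                    i++;
                }
            }
            else
            {
                // Plain character; raw CR/LF carry no meaning in RTF.
                if (!skip && c != '\r' && c != '\n') { text.Append(c); map.Add(i); }
                i++;
            }
        }

        var hits = new List<int>();
        if (word.Length == 0) return hits;
        string plain = text.ToString();
        int pos = plain.IndexOf(word, StringComparison.OrdinalIgnoreCase);
        while (pos >= 0)
        {
            hits.Add(map[pos]);
            pos = plain.IndexOf(word, pos + word.Length, StringComparison.OrdinalIgnoreCase);
        }
        return hits;
    }

    // Wraps each contiguous body match in {\highlightN ...}; N indexes the \colortbl.
    public static string Highlight(string rtf, string word, int colorIndex)
    {
        List<int> hits = FindInBody(rtf, word);
        var sb = new StringBuilder(rtf);
        for (int k = hits.Count - 1; k >= 0; k--)   // right to left keeps offsets valid
        {
            int off = hits[k];
            if (off + word.Length <= rtf.Length &&
                string.Compare(rtf, off, word, 0, word.Length, StringComparison.OrdinalIgnoreCase) == 0)
            {
                sb.Insert(off + word.Length, "}");
                sb.Insert(off, "{\\highlight" + colorIndex + " ");
            }
        }
        return sb.ToString();
    }
}
```

For the RTF in the question, a call such as RtfBodySearch.Highlight(rtf, "Paragraph", 5) would be the likely usage, assuming colour index 5 (counting the empty first entry of the \colortbl as 0) is the yellow entry \red255\green255\blue0. The stylesheet entry "Default Paragraph Font" is left alone because the whole \stylesheet group is skipped.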


  • Wednesday, September 12, 2012 2:13 PM
     
     Answered