Regular expression for MS Word HTML Markup

  • Question

  • I am trying to clean up the contents of a rich text box by removing MS Word markup, but I want to keep the other style properties that are not MS Office related.

    input: <span style="font-size: 11pt; mso-bidi-font-size: 12.0pt; mso-fareast-font-family: &#34;Times New Roman&#34;; mso-bidi-font-family: Times New Roman&#34;; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA; font-family: Arial&#34;, &#34;sans-serif&#34;">mso-bidi-language: AR-SA; mso-My Text Here; mso-My Text Here2</span>

    I first run this regex to remove the malformed HTML entities (\&\#34\;). Then I remove everything between =" and the closing ".

    regex: (?<=\=\")(.*?)(?=\")
    output: <span font="" style="">mso-bidi-language: AR-SA; mso-My Text Here; mso-My Text Here2</span>

    This removes everything inside the attribute values and leaves the text alone. The following pattern removes the matching mso- declarations, but it also removes the text that I need and the end tags.

    regex: (\s*mso-[^;"]*;?\s*)
    output: <span font="font-size: 11pt;" style="font-family: Arial, sans-serif;">

    When I try to combine the two expressions, the result does not produce the correct output.

    regex: (?<=\=\")(\s*mso-[^;"]*;?\s*)(?=\")  does not work

    Any ideas on how to combine both expressions to give the correct output of
    <span font="font-size: 11pt;" style="font-family: Arial, sans-serif;">mso-bidi-language: AR-SA; mso-My Text Here; mso-My Text Here2</span>

    Thanks.

    Thursday, April 7, 2011 8:23 PM

Answers

  • Hi,

    hope I got that right ;)

    Your last pattern can't work because it requires a " directly before and directly after every mso- declaration, and because &#34; itself contains a ";".
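    The first problem can be reproduced with a small sketch. It uses Python's re module as a stand-in for the .NET engine (an assumption; the asker's pattern happens to be valid in both), and the two sample strings are made up for illustration.

```python
import re

# The asker's combined pattern, copied unchanged from the question.
combined = re.compile(r'(?<=\=\")(\s*mso-[^;"]*;?\s*)(?=\")')

# Hypothetical sample 1: the mso- declaration is the entire attribute value.
only_mso = '<span style="mso-ansi-language: EN-US">text</span>'
# Hypothetical sample 2: other declarations sit between the quotes and the mso- one.
mixed = '<span style="font-size: 11pt; mso-ansi-language: EN-US; color: red">text</span>'

print(combined.findall(only_mso))  # ['mso-ansi-language: EN-US']
print(combined.findall(mixed))     # []
```

    The pattern only matches when the declaration touches both quotes, which is why nothing is removed from a realistic style attribute.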

    To resolve the first problem, use

    (?<==\"[^"]*) and (?=[^"]*\")

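    to test for the presence of ="... before and ..." after the match. Note that the lookbehind has variable length, which the .NET regex engine supports but many other engines do not.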

    The second one can be solved by using

    (?:[^;"&]|&#\d+;)+

    instead of [^;"]*
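    A short sketch of the difference between the two character patterns, again using Python's re module as a stand-in for .NET. The sample style string and the variable names are illustrative assumptions.

```python
import re

# Hypothetical style value: one mso- declaration containing &#34; references.
style = 'mso-fareast-font-family: &#34;Times New Roman&#34;; font-size: 11pt'

# Original character class: stops at the ";" that terminates the first &#34;.
naive = re.compile(r'mso-[^;"]*;?')
# Suggested alternation: consumes a numeric character reference as one unit.
entity_aware = re.compile(r'mso-(?:[^;"&]|&#\d+;)+;?')

print(repr(naive.sub('', style)))         # 'Times New Roman&#34;; font-size: 11pt'
print(repr(entity_aware.sub('', style)))  # ' font-size: 11pt'
```

    With the original class, half of the declaration is left behind; the alternation removes it up to the real terminating ";".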

    I use + instead of * because there has to be at least one character after mso-. I also made some other optimizations (you can improve it further by using atomic groups, (?>...)).

    (?<=="[^"]*)\s*\bmso-(?:&(?:\#\d+|\w+);|[^;"&])+;?\s*(?=[^"]*")

    You don't need to escape " or =. You don't need to escape # either, as long as you don't use RegexOptions.IgnorePatternWhitespace (that's why I always escape # and whitespace).
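    The effect of an unescaped # can be illustrated with Python's re.VERBOSE flag, which, like RegexOptions.IgnorePatternWhitespace in .NET, ignores unescaped whitespace and treats an unescaped # as the start of a comment. This is an analogy in a different engine, not the .NET API itself, and the sample text is made up.

```python
import re

text = 'font-family: &#34;Arial&#34;'

# Without the verbose flag, "#" is an ordinary character.
plain = re.compile(r'&#\d+;')
# With the verbose flag, everything from the unescaped "#" on is a comment,
# so the effective pattern is just "&".
verbose_unescaped = re.compile(r'&#\d+;', re.VERBOSE)
# Escaping "#" keeps the pattern intact in both modes.
verbose_escaped = re.compile(r'&\#\d+;', re.VERBOSE)

print(plain.findall(text))              # ['&#34;', '&#34;']
print(verbose_unescaped.findall(text))  # ['&', '&']
print(verbose_escaped.findall(text))    # ['&#34;', '&#34;']
```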

    Beware of styles like background-image:url() and content:"". They can contain characters that break this regex pattern (e.g. content:'mso-test';).
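    For readers outside .NET, here is a minimal sketch of the same idea in Python. Python's re module only accepts fixed-width lookbehind, so the (?<=="[^"]*) and (?=[^"]*") lookarounds are swapped for a different technique: each double-quoted attribute value is matched first, and the mso- part of the final pattern is applied inside a replacement callback. The names ATTR_VALUE, MSO_DECL and strip_mso and both sample strings are hypothetical.

```python
import re

# Any double-quoted attribute value: ="..."
ATTR_VALUE = re.compile(r'="([^"]*)"')
# The mso- part of the final pattern, without the surrounding lookarounds.
MSO_DECL = re.compile(r'\s*\bmso-(?:&(?:\#\d+|\w+);|[^;"&])+;?\s*')


def strip_mso(html: str) -> str:
    """Remove mso-* declarations inside double-quoted attribute values only."""
    def clean(match):
        # Rebuild the attribute value with the mso-* declarations removed.
        return '="' + MSO_DECL.sub('', match.group(1)) + '"'
    return ATTR_VALUE.sub(clean, html)


sample = ('<span style="font-size: 11pt; mso-bidi-font-size: 12.0pt; '
          'mso-fareast-font-family: &#34;Times New Roman&#34;; '
          'font-family: Arial">mso-bidi-language: AR-SA; mso-My Text Here</span>')
print(strip_mso(sample))

# The caveat above: a quoted string inside a value that starts with mso-
# is treated as a declaration and removed, corrupting the style.
risky = '<p style="content:\'mso-test\'; color: red">x</p>'
print(strip_mso(risky))
```

    In the first sample the text between the tags is left untouched, and the remaining declarations end up joined without a space because the pattern consumes whitespace on both sides. The second sample loses part of its content value, so input of that kind would need a real CSS parser rather than a regex.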

    Greetings,


    Wolfgang Kluge
    gehirnwindung.de
    • Marked as answer by kryptonkal Thursday, April 7, 2011 10:47 PM
    Thursday, April 7, 2011 10:26 PM

All replies

  • Wolfgang, I appreciate your response. Your post was extremely helpful and it clarified many of the issues I was having. Regarding the background image, this is being used within a rich text box that limits which HTML tags can be used, so it won't be an issue. Thanks again.
    Thursday, April 7, 2011 10:37 PM