Regular expression for MS Word HTML Markup

Question
I am trying to clean up the content of a rich text box by removing the MS Word markup, but I want to keep the other attributes that are not MS Office related.
input: <span style="font-size: 11pt; mso-bidi-font-size: 12.0pt; mso-fareast-font-family: "Times New Roman"; mso-bidi-font-family: Times New Roman"; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA; font-family: Arial", "sans-serif"">mso-bidi-language: AR-SA; mso-My Text Here; mso-My Text Here2</span>
I first run this regex to remove the malformed HTML (\&\#34\;). Then I remove everything between =" and the closing ".
regex: (?<=\=\")(.*?)(?=\")
output: <span font="" style="">mso-bidi-language: AR-SA; mso-My Text Here; mso-My Text Here2</span>
Now it removes everything inside the tags but not the text. With the following, it removes the matching pattern, but it also removes the text that I need and the end tags.
regex: (\s*mso-[^;"]*;?\s*)
output: <span font="font-size: 11pt;" style="font-family: Arial, sans-serif;">
When I try to combine the two expressions, I do not get the correct output.
regex: (?<=\=\")(\s*mso-[^;"]*;?\s*)(?=\") does not work
Any ideas on how to combine both expressions to give the correct output of:
<span font="font-size: 11pt;" style="font-family: Arial, sans-serif;">mso-bidi-language: AR-SA; mso-My Text Here; mso-My Text Here2</span>
Thanks.
Thursday, April 7, 2011 8:23 PM
Answers
Hi,
hope I got that right ;)
Your last pattern can't work, because you test for " directly in front of and directly after every mso- style, and because &#34; also contains a ";".
To resolve the first problem, use
(?<==\"[^"]*) and (?=[^"]*\")
to test for presence of ="... before and ..." after the match.
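To see the first problem concretely, here is a minimal sketch in Python (an assumption for illustration; the thread itself targets .NET) that runs the combined pattern from the question against two made-up style attributes. It matches only when a single mso- declaration fills the whole attribute value. Python's built-in re module accepts the fixed-width lookbehind (?<=\=\") but rejects variable-length lookbehind such as (?<==\"[^"]*), so only the failing pattern is shown here.

```python
import re

# Combined pattern from the question: the mso- declaration must sit
# directly between =" and the closing ".
combined = re.compile(r'(?<=\=\")(\s*mso-[^;"]*;?\s*)(?=\")')

# Hypothetical sample attributes, not taken from the thread.
only_mso = '<span style="mso-ansi-language: EN-US">text</span>'
mixed = ('<span style="font-size: 11pt; mso-ansi-language: EN-US; '
         'color: red">text</span>')

print(combined.findall(only_mso))  # ['mso-ansi-language: EN-US']
print(combined.findall(mixed))     # []
```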
The second one can be solved by using
(?:[^;"&]|&#\d+;)+
instead of [^;"]*
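The difference between the two character classes can be checked with a small Python sketch. Python and the sample declaration are assumptions for illustration; the two sub-patterns themselves are the ones quoted above.

```python
import re

# Hypothetical declaration whose value still contains the entity &#34;
declaration = ('mso-fareast-font-family: &#34;Times New Roman&#34;; '
               'font-size: 11pt')

# Original class: stops at the first ';', which is the one inside &#34;
naive = re.compile(r'mso-[^;"]*;?')
# Suggested class: consumes a numeric entity such as &#34; as one unit
entity_aware = re.compile(r'mso-(?:[^;"&]|&#\d+;)+;?')

print(naive.match(declaration).group())
# mso-fareast-font-family: &#34;
print(entity_aware.match(declaration).group())
# mso-fareast-font-family: &#34;Times New Roman&#34;;
```

The naive class cuts the declaration off in the middle of the entity, while the entity-aware alternation runs to the real end of the declaration.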
I use + instead of * because there has to be at least one character after mso-. I also made some other optimizations (you can improve the pattern further by using atomic groups, (?>...)).
(?<=="[^"]*)\s*\bmso-(?:&(?:\#\d+|\w+);|[^;"&])+;?\s*(?=[^"]*")
You don't need to escape " or =. You don't need to escape # either, as long as you don't use RegexOptions.IgnorePatternWhitespace (that's why I always escape # and whitespace).
Beware of styles like background-image:url() and content:"". They can contain characters that break this regex pattern (e.g. content:'mso-test';)...
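For readers who want to try the idea outside .NET, here is a rough Python sketch. Python's built-in re module rejects variable-length lookbehind, so the sketch swaps in a two-pass approach: first find each double-quoted attribute value, then strip the mso- declarations inside it. The function name strip_mso, both sample strings, and the dropped trailing \s* are assumptions made for this illustration, not part of the original pattern.

```python
import re

# Pass 1: a double-quoted attribute value, quotes included.
ATTRIBUTE_VALUE = re.compile(r'="([^"]*)"')
# Pass 2: one mso- declaration. Same core as the pattern above, minus the
# lookarounds and the trailing \s* (a deviation, so that one space stays
# between the remaining declarations).
MSO_DECLARATION = re.compile(r'\s*\bmso-(?:&(?:#\d+|\w+);|[^;"&])+;?')

def strip_mso(html: str) -> str:
    """Remove mso- declarations from double-quoted attribute values only."""
    def clean(match: re.Match) -> str:
        value = MSO_DECLARATION.sub('', match.group(1)).strip()
        return f'="{value}"'
    return ATTRIBUTE_VALUE.sub(clean, html)

# Hypothetical, well-formed sample modelled on the input in the question.
sample = ('<span style="font-size: 11pt; mso-bidi-font-size: 12.0pt; '
          'mso-ansi-language: EN-US; font-family: Arial, sans-serif">'
          'mso-bidi-language: AR-SA; mso-My Text Here</span>')
print(strip_mso(sample))
# <span style="font-size: 11pt; font-family: Arial, sans-serif">mso-bidi-language: AR-SA; mso-My Text Here</span>

# The caveat: a quoted CSS string that merely contains "mso-" is damaged too.
tricky = '<p style="content: \'mso-test\'; color: red">x</p>'
print(strip_mso(tricky))
# <p style="content: ' color: red">x</p>
```

The element text survives because it never sits inside a ="..." pair, while the second sample shows how a value like content:'mso-test'; gets cut apart.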
Greetings,
Wolfgang Kluge
gehirnwindung.de
Marked as answer by kryptonkal Thursday, April 7, 2011 10:47 PM
Thursday, April 7, 2011 10:26 PM
All replies
Wolfgang, I appreciate your response. Your post was extremely helpful and it clarified many of the issues I was having. Regarding the background image, this is being used within a rich text box that limits some of the HTML tags used, so it won't be an issue. Thanks again.
Thursday, April 7, 2011 10:37 PM