How to replace character entities
- I have yanked the HTML tags out of a web page. I need to yank out the character entities as well. For example, &quot; is replaced with a space. I've tried using word boundaries with no luck. It seems like it should be a pattern-matching problem, but I can't get it to work for anything.
I've tried variations of ([\&][\;]) with +, *, -, and . in between but can't get anywhere. Given the number of character entities in HTML, I don't want to write a string.Replace statement for each one. Help!
Answers
- Your pattern, &.{2,5}[\;], could match unintended text, since the dot matches any character (except a newline), including spaces and other semicolons. An HTML entity won't have a space between the ampersand and the semicolon, so ruling that out in the pattern prevents such mismatches. Also, there's no need to escape the semicolon or wrap it in a character class. I suspect your code uses an @ verbatim string, so the backslash reaches the regex engine, where it simply escapes the semicolon: [\;] matches exactly what a plain ; matches. You could have written the semicolon directly in the pattern.
I suggest using this pattern: "&[^ ;]+;". It matches the ampersand, followed by one or more characters that are neither a space nor a semicolon, and ends with a semicolon. Of course, you may want to replace the "+" with a bounded repetition, such as "&[^ ;]{2,5};"
Here's a comparison of both patterns:
    using System;
    using System.Text.RegularExpressions;

    // &amp; and &nbsp; are real entities; "& foo;" and "& \;" are decoys.
    string input = @"foo&amp;bar &nbsp;; bar & foo; more & \; text";
    string yourPattern = @"&.{2,5}[\;]";
    string pattern = "&[^ ;]+;";

    // Original pattern: the dot lets spaces and extra semicolons into the match.
    Console.WriteLine("Your Pattern");
    foreach (Match m in Regex.Matches(input, yourPattern))
    {
        Console.WriteLine("Match: {0}", m.Value);
    }
    Console.WriteLine("Replace: {0}", Regex.Replace(input, yourPattern, ""));

    // Suggested pattern: nothing between & and ; may be a space or a semicolon.
    Console.WriteLine(Environment.NewLine + "Pattern");
    foreach (Match m in Regex.Matches(input, pattern))
    {
        Console.WriteLine("Match: {0}", m.Value);
    }
    Console.WriteLine("Replace: {0}", Regex.Replace(input, pattern, ""));

Output:

    // Your Pattern
    // Match: &amp;
    // Match: &nbsp;;
    // Match: & foo;
    // Match: & \;
    // Replace: foobar  bar  more  text
    //
    // Pattern
    // Match: &amp;
    // Match: &nbsp;
    // Replace: foobar ; bar & foo; more & \; text
Document my code? Why do you think it's called "code"?
- Marked As Answer by CodeButcher Wednesday, October 28, 2009 1:14 PM
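The original question asked for each entity to become a space rather than disappear. A minimal sketch of that, using the bounded pattern suggested above, might look like the following. The class name EntityStripper and the method StripEntities are hypothetical names chosen for illustration, and the {2,5} bound is taken from the answer as-is, so longer entities such as &middot; would not match unless the bound is widened.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EntityStripper
{
    // Bounded pattern from the answer above: '&', then 2 to 5 characters
    // that are neither a space nor a semicolon, then ';'.
    private static readonly Regex EntityPattern = new Regex("&[^ ;]{2,5};");

    // Hypothetical helper: replaces every entity-like token with one space.
    public static string StripEntities(string text)
    {
        return EntityPattern.Replace(text, " ");
    }
}
```

Calling EntityStripper.StripEntities("a&amp;b") should give "a b", while text such as "bar & foo; x" is left alone because a space follows the ampersand.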
- I would change:
string pattern = "&[^ ;]+;";
to:
string pattern = @"&[^\s;]+;";
so that the pattern will not only ignore spaces, but other forms of whitespace (new lines, tabs, etc.) as well. Note the @ prefix, which the \s needs in order to compile. Another thing you could consider would be using System.Web.HttpUtility.HtmlDecode to get rid of these. It will replace &quot; with " and &amp; with &, if that is acceptable.
- Marked As Answer by CodeButcher Wednesday, October 28, 2009 1:15 PM
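To make the two options in this reply concrete, here is a rough sketch that puts them side by side. It assumes System.Net.WebUtility.HtmlDecode as a stand-in for the System.Web.HttpUtility.HtmlDecode call mentioned above, since it avoids a reference to System.Web; the class name EntityCleaner and both method names are hypothetical.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class EntityCleaner
{
    // Whitespace-aware pattern from the reply above; the @ keeps \s intact
    // so the regex engine sees it as the whitespace class.
    private static readonly Regex EntityPattern = new Regex(@"&[^\s;]+;");

    // Option 1 (hypothetical helper): drop entity-like tokens, leaving a space.
    public static string Strip(string text)
    {
        return EntityPattern.Replace(text, " ");
    }

    // Option 2 (hypothetical helper): turn entities back into the characters
    // they stand for. WebUtility.HtmlDecode is an assumed substitute for
    // HttpUtility.HtmlDecode here.
    public static string Decode(string text)
    {
        return WebUtility.HtmlDecode(text);
    }
}
```

Which one fits depends on whether the decoded characters are wanted in the result: Decode keeps the quotes and ampersands as real characters, while Strip discards them.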
All Replies
- I came up with this. It appears to work. Anything bad about it?
&.{2,5}[\;]
- ccbristo: that's a good point :)
- I read that calling the httputility would cause the program to be viewed as calling unmanaged code. I need to read more on unmanaged vs. managed and the implications.
- Thanks, I would have overlooked that (& foo;). I'm new to regex, and I figured that just because I created something that worked didn't mean it was the right thing!
- "I read that calling the httputility would cause the program to be viewed as calling unmanaged code. I need to read more on unmanaged vs. managed and the implications."
That's not correct. Code that is part of the .NET Framework and runs on the Common Language Runtime (CLR) is managed. Brad Abrams has an old blog post that describes this: "What is managed code?" I'm sure you can do a web or Wikipedia search to find out more. To keep things simple, all your .NET code is managed unless you go out of your way to call unmanaged code, access external APIs, etc.
- Excellent, thanks. I'll probably stick with my code unless using the HTTP utility can speed up or improve what I'm doing. But for now, it's off to read the blog on managed code. Thanks for the link!


