How to replace character entities

  • Tuesday, October 27, 2009 6:13 PM, CodeButcher
    I have stripped the HTML tags out of a web page, and I need to strip out the character entities as well. For example, &quot; should be replaced with a space. I've tried using word boundaries with no luck. It seems like a pattern-matching problem, but I can't get anything to work.
    I've tried variations of
    ([\&][\;])
    with + * - . in between, but can't get anywhere. Given the number of character entities in HTML, I don't want to write a string.Replace statement for each one. Help!

Answers

  • Tuesday, October 27, 2009 9:10 PM, Ahmad Mageed
    Your pattern could match unintended text, since the dot matches any character, including spaces and other semicolons. An HTML entity never has a space between the ampersand and the semicolon, so excluding spaces in the pattern prevents such mismatches. Also, there's no need to escape the semicolon or put it in a character class. The regex engine treats \; as a plain semicolon, so [\;] matches exactly what ; matches, and the backslash and brackets add nothing. You can write the semicolon directly in the pattern.

    I suggest using this pattern: "&[^ ;]+;". It matches an ampersand, followed by one or more characters that are neither a space nor a semicolon, followed by a semicolon. You may also want to replace the "+" with a bounded length, such as "&[^ ;]{2,5};", although a bound that tight will skip longer entity names such as &middot;.

    Here's a comparison of both patterns:

    // Requires: using System; using System.Text.RegularExpressions;
    string input = @"foo&amp;bar &nbsp;; bar & foo; more & \; text";
    string yourPattern = @"&.{2,5}[\;]";
    string pattern = "&[^ ;]+;";

    // The original pattern: the dot also matches spaces and semicolons.
    Console.WriteLine("Your Pattern");
    foreach (Match m in Regex.Matches(input, yourPattern))
    {
        Console.WriteLine("Match: {0}", m.Value);
    }
    Console.WriteLine("Replace: {0}", Regex.Replace(input, yourPattern, ""));

    // The suggested pattern: no space or semicolon allowed inside the entity.
    Console.WriteLine(Environment.NewLine + "Pattern");
    foreach (Match m in Regex.Matches(input, pattern))
    {
        Console.WriteLine("Match: {0}", m.Value);
    }
    Console.WriteLine("Replace: {0}", Regex.Replace(input, pattern, ""));

    Output:

    // Your Pattern
    // Match: &amp;
    // Match: &nbsp;;
    // Match: & foo;
    // Match: & \;
    // Replace: foobar  bar  more  text

    // Pattern
    // Match: &amp;
    // Match: &nbsp;
    // Replace: foobar ; bar & foo; more & \; text
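
    For the original goal of swapping each entity for a space, the suggested pattern could be wrapped in a small helper. This is only a sketch: the names EntityStripper and ReplaceEntities are made up for illustration, and the pattern is the one suggested above, so it still treats anything between & and ; that contains no space as an entity.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of any library.
public static class EntityStripper
{
    // "&", then one or more characters that are neither a space nor ";", then ";".
    private static readonly Regex EntityPattern = new Regex("&[^ ;]+;");

    // Replaces every entity-like token in the input with the given replacement text.
    public static string ReplaceEntities(string input, string replacement)
    {
        return EntityPattern.Replace(input, replacement);
    }
}
```

    For example, EntityStripper.ReplaceEntities("x&quot;y", " ") returns "x y", while a loose ampersand as in "a & b; c" is left untouched because of the space after the &.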
    





    Document my code? Why do you think it's called "code"?
    • Marked As Answer by CodeButcher, Wednesday, October 28, 2009 1:14 PM
  • Wednesday, October 28, 2009 12:47 AM, ccbristo
    I would change:
    string pattern = "&[^ ;]+;";
    
    to:
    string pattern = @"&[^\s;]+;";
    
    so that the pattern excludes not only spaces but other forms of whitespace (new lines, tabs, etc.) as well. Note the @ verbatim string: \s is not a valid escape sequence in a regular C# string literal. Another thing you could consider is using System.Web.HttpUtility.HtmlDecode instead of removing the entities. It will replace &quot; with " and &amp; with &, if that is acceptable.
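
    To make the difference between decoding and stripping concrete, here is a rough sketch. It assumes System.Net.WebUtility.HtmlDecode (available from .NET Framework 4.0 onward) as a stand-in for System.Web.HttpUtility.HtmlDecode, so that no reference to System.Web is needed. The class and method names are hypothetical.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are for illustration only.
public static class EntityDemo
{
    // Decoding turns each entity back into the character it stands for.
    public static string Decode(string html)
    {
        return WebUtility.HtmlDecode(html);
    }

    // Stripping removes the entity entirely, using the whitespace-aware pattern.
    public static string Strip(string html)
    {
        return Regex.Replace(html, @"&[^\s;]+;", "");
    }
}
```

    Given the input &quot;Tom &amp; Jerry&quot;, Decode returns "Tom & Jerry" including the quotation marks, while Strip returns Tom  Jerry with two spaces and no quotation marks. Which one is right depends on whether the entity's character should survive in the output.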
    • Marked As Answer by CodeButcher, Wednesday, October 28, 2009 1:15 PM

All Replies

  • Tuesday, October 27, 2009 6:44 PM, CodeButcher
    I came up with this. It appears to work. Is there anything bad about it?
    &.{2,5}[\;]
    
  • Wednesday, October 28, 2009 2:59 AM, Ahmad Mageed
    ccbristo: that's a good point :)

  • Wednesday, October 28, 2009 1:14 PM, CodeButcher
    I read that calling HttpUtility would cause the program to be viewed as calling unmanaged code. I need to read more on managed vs. unmanaged code and the implications.
  • Wednesday, October 28, 2009 1:17 PM, CodeButcher
    Thanks, I would have overlooked that (& foo;). I'm new to regex, and I figured that just because I created something that worked didn't mean it was the right thing!
  • Wednesday, October 28, 2009 3:53 PM, Ahmad Mageed
    CodeButcher wrote: "I read that calling the httputility would cause the program to be viewed as calling unmanaged code. I need to read more on unmanaged vs. managed and the implications."
    That's not correct. Code that is part of the .NET Framework and runs on the Common Language Runtime (CLR) is managed. Brad Abrams has an old blog post that describes this: "What is managed code?" A web or Wikipedia search will turn up more. To keep things simple: all your .NET code is managed unless you go out of your way to call unmanaged code, for example by accessing external native APIs.

  • Wednesday, October 28, 2009 7:17 PM, CodeButcher
    Excellent, thanks. I'll probably stick with my code unless using HttpUtility can speed up or improve what I'm doing. But for now, it's off to read the blog post on managed code. Thanks for the link!