Need a method that removes illegal XML characters from a String

  • Question

  • User-952121411 posted

     The other day I came across the following exception:

    "Response is not well-formed XML  System.Xml.XmlException: '', hexadecimal value 0x13, is an invalid character."

    This was obviously caused by an illegal character in something I was sending to a web service.  I found that the best way for my application to prevent this was to remove any such characters long before I send the data to the web service.

    I found the following link, which seemed like the solution. The only problem is that the code is in Java and I cannot get it converted: http://benjchristensen.com/2008/02/07/how-to-strip-invalid-xml-characters/

    Then I found some links to .NET code that claims to address this issue, but I didn't really like the implementations.  So here is my question: does anyone have a decent method they could post that takes an input String and removes any illegal XML characters?

    Thank you!

    Wednesday, October 21, 2009 9:28 AM
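
For reference, the kind of failure described above can be reproduced in a few lines of C#. This is only a minimal sketch: the class and method names (InvalidCharDemo, FailsToParse) are made up for illustration, and it parses the text locally with XmlDocument rather than calling a web service.

```csharp
using System;
using System.Xml;

public static class InvalidCharDemo
{
    // Hypothetical helper: returns true when parsing the XML text throws XmlException.
    public static bool FailsToParse(string xml)
    {
        try
        {
            new XmlDocument().LoadXml(xml);
            return false;
        }
        catch (XmlException)
        {
            return true;
        }
    }

    public static void Main()
    {
        Console.WriteLine(FailsToParse("<a>ok</a>"));        // False
        Console.WriteLine(FailsToParse("<a>bad\u0013</a>")); // True: 0x13 is not a legal XML character
    }
}
```

The second call fails because 0x13 lies outside the character ranges that XML 1.0 allows, which is why stripping such characters before sending the data avoids the exception.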

All replies

  • User-2052324419 posted

    Here's a C# conversion of the Java code you posted a link to (you'll need to add "using System.Text;" at the top of your C# file for the StringBuilder class):

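        public String stripNonValidXMLCharacters(string textIn)
        {
            StringBuilder textOut = new StringBuilder(); // Used to hold the output.
            char current; // Used to reference the current character.

            if (textIn == null || textIn == string.Empty) return string.Empty; // vacancy test.
            for (int i = 0; i < textIn.Length; i++) {
                current = textIn[i];

                if ((current == 0x9 || current == 0xA || current == 0xD) ||
                    ((current >= 0x20) && (current <= 0xD7FF)) ||
                    ((current >= 0xE000) && (current <= 0xFFFD)) ||
                    ((current >= 0x10000) && (current <= 0x10FFFF))) // never true for a 16-bit char
                {
                    textOut.Append(current);
                }
                else if (char.IsHighSurrogate(current) && i + 1 < textIn.Length &&
                         char.IsLowSurrogate(textIn[i + 1]))
                {
                    // Code points 0x10000-0x10FFFF are stored as a surrogate pair
                    // (two chars), so keep both halves and skip the second one.
                    textOut.Append(current);
                    textOut.Append(textIn[++i]);
                }
            }
            return textOut.ToString();
        }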


     

    • Marked as answer by Anonymous Thursday, October 7, 2021 12:00 AM
    Wednesday, October 21, 2009 11:22 AM
  • User-952121411 posted

        if ((current == 0x9 || current == 0xA || current == 0xD) ||
            ((current >= 0x20) && (current <= 0xD7FF)) ||
            ((current >= 0xE000) && (current <= 0xFFFD)) ||
            ((current >= 0x10000) && (current <= 0x10FFFF)))

    I am having difficulty converting the above condition to VB.NET.  C# and VB.NET handle character-to-integer conversion a little differently, and what I end up with is the following design-time error:

    "Operator '=' is not defined for types 'Char' and 'Integer'"

    The VB.NET code for that condition is as follows:

        If (current = &H9 OrElse current = &HA OrElse current = &HD) OrElse ((current >= &H20) AndAlso (current <= &HD7FF)) OrElse ((current >= &HE000) AndAlso (current <= &HFFFD)) OrElse ((current >= &H10000) AndAlso (current <= &H10FFFF)) Then
            textOut.Append(current)
        End If

    I have been trying some combinations of getting the ASCII value via 'Asc()' or converting values to their hex value via conversion functions, but so far to no avail.

    Any ideas on how to make proper comparisons in the code above?

 

Wednesday, October 21, 2009 12:26 PM
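
As background to the error above: in C# a char widens implicitly to int, which is why the original condition compiles there without any conversion call, whereas VB.NET requires an explicit conversion. A small sketch of the C# side follows; the class and method names (CharCompareDemo, CodeOf) are illustrative only.

```csharp
using System;

public static class CharCompareDemo
{
    // char widens implicitly to int in C#, so no conversion function is needed.
    public static int CodeOf(char c)
    {
        return c;
    }

    public static void Main()
    {
        char tab = '\t';
        Console.WriteLine(tab == 0x9);       // True: the char is compared as an int
        Console.WriteLine(CodeOf('A'));      // 65
        Console.WriteLine(CodeOf('\uFFFD')); // 65533
    }
}
```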
  • User1835330922 posted

    I think you need to use AscW to convert the character to an integer for the comparison and then ChrW to convert back to a character that you append:

        Public Function stripNonValidXMLCharacters(ByVal textIn As String) As [String]
            ' Used to hold the output.
            Dim textOut As New StringBuilder()
            ' Used to reference the current character.
            Dim current As Integer
            ' vacancy test.
            If textIn Is Nothing OrElse textIn = String.Empty Then
                Return String.Empty
            End If
            For i As Integer = 0 To textIn.Length - 1
                current = AscW(textIn(i))

                If (current = &H9 OrElse current = &HA OrElse current = &HD) OrElse ((current >= &H20) AndAlso (current <= &HD7FF)) OrElse ((current >= &HE000) AndAlso (current <= &HFFFD)) OrElse ((current >= &H10000) AndAlso (current <= &H10FFFF)) Then
                    textOut.Append(ChrW(current))
                ElseIf Char.IsSurrogatePair(textIn, i) Then
                    ' AscW of a single Char never exceeds &HFFFF; characters from &H10000 up
                    ' are stored as a surrogate pair (two Chars), so keep both halves.
                    textOut.Append(textIn, i, 2)
                    i += 1
                End If
            Next
            Return textOut.ToString()
        End Function

    Untested!

    • Marked as answer by Anonymous Thursday, October 7, 2021 12:00 AM
    Wednesday, October 21, 2009 1:13 PM
  • User-952121411 posted

    Yes, the last two posts were very helpful. Many thanks to Brent and Martin.  Brent especially, thank you for converting that Java to C#, and Martin, the 'AscW' function was needed, as shown below.

    This link has the full explanation as well:

    http://allen-conway-dotnet.blogspot.com/2009/10/how-to-strip-illegal-xml-characters.html

    Here is the VB.NET working version of removing illegal XML characters from a String:

        Public Shared Function RemoveIllegalXMLCharacters(ByVal Content As String) As String

            'Used to hold the output.
            Dim textOut As New StringBuilder()
            'Used to reference the current character and its numeric (UTF-16) value.
            Dim current As Char
            Dim code As Integer
            'Exit out and return an empty string if nothing was passed in to method
            If Content Is Nothing OrElse Content = String.Empty Then
                Return String.Empty
            End If

            'Loop through the length of the content one character at a time to see if there
            'are any illegal characters to be removed:
            For i As Integer = 0 To Content.Length - 1
                'Reference the current character
                current = Content(i)
                code = AscW(current)
                'Only append valid characters back to the StringBuilder
                If (code = &H9 OrElse code = &HA OrElse code = &HD) _
                   OrElse ((code >= &H20) AndAlso (code <= &HD7FF)) _
                   OrElse ((code >= &HE000) AndAlso (code <= &HFFFD)) Then
                    textOut.Append(current)
                ElseIf Char.IsSurrogatePair(Content, i) Then
                    'Characters &H10000 through &H10FFFF are stored as a surrogate pair
                    '(two Chars), so AscW never returns them; keep both halves.
                    textOut.Append(Content, i, 2)
                    i += 1
                End If
            Next

            'Return the screened content with only valid characters
            Return textOut.ToString()

        End Function


     

    • Marked as answer by Anonymous Thursday, October 7, 2021 12:00 AM
    Wednesday, October 21, 2009 2:24 PM
  • User-83779474 posted

    Any thoughts on how one might save the resulting string back out as an XML file?

    Thursday, December 17, 2009 5:57 PM
  • User-1293249277 posted

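        // Requires "using System.Xml;" at the top of the file.
        string result = "<A/>";
        XmlDocument xDoc = new XmlDocument();
        xDoc.LoadXml(result);       // parse the (already cleaned) XML string
        xDoc.Save("MyNewFile.xml"); // write it out to a file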

    • Marked as answer by Anonymous Thursday, October 7, 2021 12:00 AM
    Friday, December 18, 2009 1:57 AM
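
The snippet above can be wrapped into a small reusable method. The following is a sketch under stated assumptions: SaveXmlString and XmlSaveDemo are hypothetical names, the input is assumed to be a complete XML document that has already had illegal characters removed, and the temp-folder path is just for the demonstration.

```csharp
using System;
using System.IO;
using System.Xml;

public static class XmlSaveDemo
{
    // Hypothetical helper: parses already-cleaned XML text and writes it to 'path'.
    public static void SaveXmlString(string xml, string path)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml); // throws XmlException if the text is still not well-formed
        doc.Save(path);
    }

    public static void Main()
    {
        string path = Path.Combine(Path.GetTempPath(), "MyNewFile.xml");
        SaveXmlString("<A>cleaned text</A>", path);
        Console.WriteLine(File.ReadAllText(path));
    }
}
```

Because LoadXml parses the string first, any remaining illegal character surfaces as an XmlException before anything is written to disk.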
  • User411669167 posted

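    I used a regular expression (see http://stackoverflow.com/questions/730133/invalid-characters-in-xml ). The code is below (it needs "using System.Text.RegularExpressions;").

        public static string CleanInvalidXmlChars(string text)
        {
            // From xml spec valid chars:
            // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
            // any Unicode character, excluding the surrogate blocks, FFFE, and FFFF.
            // .NET regexes match UTF-16 code units and \u takes exactly four hex digits,
            // so #x10000-#x10FFFF cannot be written as a range. Instead, remove control
            // characters, FFFE, FFFF, and unpaired surrogates; valid surrogate pairs stay.
            string re = @"[\x00-\x08\x0B\x0C\x0E-\x1F\uFFFE\uFFFF]" +
                        @"|[\uD800-\uDBFF](?![\uDC00-\uDFFF])" +
                        @"|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]";
            return Regex.Replace(text, re, "");
        }
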
    Tuesday, October 8, 2013 2:44 PM
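
As a further option not mentioned in the thread: .NET Framework 4.0 and later expose XmlConvert.IsXmlChar and XmlConvert.IsXmlSurrogatePair, which can replace the hand-written range checks. The sketch below assumes one of those framework versions is available; the class and method names (XmlCharFilter, RemoveInvalidXmlChars) are hypothetical.

```csharp
using System;
using System.Text;
using System.Xml;

public static class XmlCharFilter
{
    // Hypothetical helper; relies on XmlConvert.IsXmlChar and
    // XmlConvert.IsXmlSurrogatePair (.NET Framework 4.0 or later).
    public static string RemoveInvalidXmlChars(string text)
    {
        if (string.IsNullOrEmpty(text)) return string.Empty;

        StringBuilder result = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (XmlConvert.IsXmlChar(c))
            {
                result.Append(c);
            }
            else if (i + 1 < text.Length && XmlConvert.IsXmlSurrogatePair(text[i + 1], c))
            {
                // Argument order is (lowChar, highChar): low surrogate first.
                result.Append(c).Append(text[i + 1]);
                i++; // skip the low surrogate that was just appended
            }
        }
        return result.ToString();
    }

    public static void Main()
    {
        // Contains an illegal 0x13, a valid surrogate pair, and a stray low surrogate.
        string dirty = "ab\u0013c\uD83D\uDE00\uDE00";
        Console.WriteLine(RemoveInvalidXmlChars(dirty) == "abc\uD83D\uDE00");
    }
}
```

The pair check matters because a character above U+FFFF occupies two chars in a .NET string, so testing each char on its own would discard it.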