XML read fails on an invalid character, such as 0x1F, in an RSS feed

  • Question

  • User-1668014665 posted

    ASP.NET 2.0, VB.NET, Visual Studio 2005

    I have code which reads a lot of RSS XML feeds. And some feeds are not that clean.

    This is how I read the RSS XML into an XML doc:

     Dim MyRssRequest As HttpWebRequest = Nothing
                Dim MyRssResponse As HttpWebResponse = Nothing
                Dim reader As StreamReader = Nothing
    
    
    
                    Dim MyRssDocument As XmlDocument = New XmlDocument()
                    Dim MyRssList As XmlNodeList = Nothing
    
                    Try 'http:'
                        MyRssRequest = TryCast(WebRequest.Create(sURL), HttpWebRequest)
                        MyRssResponse = TryCast(MyRssRequest.GetResponse(), HttpWebResponse)
                        reader = New StreamReader(MyRssResponse.GetResponseStream())
    
                        ' *****NEED TO CLEAN XML HERE OF 0x1F ******
    
                        MyRssDocument.Load(reader)
                    Catch 'https:'
                        MyRssRequest = TryCast(WebRequest.Create(sURL), HttpWebRequest)
                        MyRssRequest.Method = "GET"
                        MyRssRequest.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"
                        MyRssResponse = CType(MyRssRequest.GetResponse(), HttpWebResponse)
                        Dim dataStream As Stream = MyRssResponse.GetResponseStream()
                        reader = New StreamReader(dataStream)
    
                        ' *****NEED TO CLEAN XML HERE OF 0x1F ******
    
                        Dim XMLstr As String = reader.ReadToEnd()
                        If dataStream IsNot Nothing Then
                            dataStream.Close()
                            dataStream = Nothing
                        End If
                        MyRssDocument.LoadXml(XMLstr)
                    End Try

    As you can see I have two methods to load the reader.

    Where I have marked *****NEED TO CLEAN XML HERE OF 0x1F *****, I need to clean the XML before I load it into an XML doc, as the load will fail on bad XML.

    I have seen code like this on the web:

    https://stackoverflow.com/questions/10645559/remove-illegal-0x1f-charector-from-xml
    https://social.msdn.microsoft.com/Forums/vstudio/en-US/17ace5bf-9822-4eac-b5fb-b66a471b87b3/xmltextreader-get-rid-of-quot0x1fquot?forum=vbgeneral

    QUESTION, any ideas how to clean XML at the two different places in my code?

    Thanks

    Monday, April 29, 2019 6:56 PM

Answers

  • User-1668014665 posted

    OK, I finally got my code working and capturing the error correctly.

    The bad URL was this: 
    site: SkepticalScience.com,
    URL: https://www.skepticalscience.com/feed.xml,
    exMess: '', hexadecimal value 0x1F, is an invalid character. Line 245, position 327

    My better code 

                Dim MyRssRequest As HttpWebRequest = Nothing
                Dim MyRssResponse As HttpWebResponse = Nothing
                Dim reader As StreamReader = Nothing
    
                    Dim MyRssDoc As XmlDocument = New XmlDocument()
                    Dim MyRssList As XmlNodeList = Nothing
    
                    Try 'http:'
                        MyRssRequest = TryCast(WebRequest.Create(sURL), HttpWebRequest)
                        MyRssResponse = TryCast(MyRssRequest.GetResponse(), HttpWebResponse)
                        reader = New StreamReader(MyRssResponse.GetResponseStream(), Encoding.GetEncoding("ISO-8859-1"))
                        MyRssDoc.Load(reader)
                    Catch 'https:'
                        Try
                            MyRssRequest = TryCast(WebRequest.Create(sURL), HttpWebRequest)
                            MyRssRequest.Method = "GET"
                            MyRssRequest.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"
                            MyRssResponse = CType(MyRssRequest.GetResponse(), HttpWebResponse)
                            Dim dataStream As Stream = MyRssResponse.GetResponseStream()
                            reader = New StreamReader(dataStream, Encoding.GetEncoding("ISO-8859-1"))
                            Dim XMLstr As String = reader.ReadToEnd()
                            If dataStream IsNot Nothing Then
                                dataStream.Close()
                                dataStream = Nothing
                            End If
                            MyRssDoc.LoadXml(XMLstr)
                        Catch ex As Exception
                            Throw New Exception("[HTTPS:-9009], RSSSource: " & RSSSource & ", URL: " & sURL & ", exMess: " & RSSHelpers.XMLClean(ex.Message.ToString))
                        End Try
                    End Try

    using this function

            Public Shared Function XMLClean(ByVal sLine As String) As String
                Dim res As String = sLine
                If String.IsNullOrEmpty(res) Then
                    Return res
                End If
                'https://ascii.cl/'
                res = res.Replace((ChrW(&H4)).ToString(), "")
                res = res.Replace((ChrW(&H14)).ToString(), "")
                res = res.Replace((ChrW(&H1F)).ToString(), "")
                Return res
            End Function

    Works better now! Thanks
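
    A minimal sketch, not the poster's exact code: the Throw above passes only the exception message through XMLClean, so one hedged way to strip the same characters from the feed text itself is to read the response into a string, clean it, and then call LoadXml. FeedLoader and LoadCleanedXml are hypothetical names introduced for illustration; XMLClean repeats the three replacements from the function above.

```vbnet
Imports System
Imports System.IO
Imports System.Xml
Imports Microsoft.VisualBasic

Public Module FeedLoader
    ' Same three replacements as the XMLClean function in the post above.
    Public Function XMLClean(ByVal sLine As String) As String
        If String.IsNullOrEmpty(sLine) Then Return sLine
        Dim res As String = sLine
        res = res.Replace(ChrW(&H4).ToString(), "")
        res = res.Replace(ChrW(&H14).ToString(), "")
        res = res.Replace(ChrW(&H1F).ToString(), "")
        Return res
    End Function

    ' Hypothetical helper: read everything from the reader, clean the text,
    ' then parse it. Works for any TextReader, including a StreamReader
    ' created over an HTTP or HTTPS response stream.
    Public Function LoadCleanedXml(ByVal reader As TextReader) As XmlDocument
        Dim doc As New XmlDocument()
        doc.LoadXml(XMLClean(reader.ReadToEnd()))
        Return doc
    End Function
End Module
```

    With the variable names used in the thread, both branches could then call MyRssDoc = FeedLoader.LoadCleanedXml(reader) instead of Load(reader) or LoadXml(XMLstr). The whole feed is held in memory as a string, which is usually acceptable for RSS-sized documents.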

    • Marked as answer by Anonymous Thursday, October 7, 2021 12:00 AM
    Wednesday, May 1, 2019 6:47 PM

All replies

  • User-1174608757 posted

    Hi icm63,

    icm63

    QUESTION, any ideas how to clean XML at the two different places in my code?

    According to your description, I think you want to match the XML tag first and then replace the character inside that tag. So I suggest you use a regular expression to match the XML tag you want, and then remove the illegal character from that tag.

    Here is the demo. I hope it could help you.

    Private Sub SurroundingSub()
        Dim pattern As String = "<from_id>(.*?)</from_id>"     'XML tag you want to match
        Dim input As String = "<?xml version=""1.0"" encoding=""utf-8""?><response list=""true""><count>2802</count><post><id>4210</id><from_id>2176594</from_id><to_id>-11423648</to_id><date>1365088358</date><text>dsadsad #ADMIN</text>"
        Dim match As Match = Regex.Match(input, pattern)
        'If it matches, remove the 0x1F character from the input and save the result
        If match.Success Then
            Dim file As String = Regex.Replace(input, "[\x1F]", String.Empty)
            My.Computer.FileSystem.WriteAllText("C:\Documents and Settings\FileList.txt", file, True)
        End If
    End Sub

    Best Regards

    Wei

    Tuesday, April 30, 2019 5:53 AM
  • User-1668014665 posted

    You assume the offending char is within an element. 

    The char in question, 0x1F, can appear in the first line of the XML doc, outside any element or node. Therefore it has to be cleaned before the reader is loaded into the XML doc, not after.

    0x1F (or &H1F) is a unit separator (see the ASCII table):

    http://circuitgizmos.com/documentation/tools-tips-and-tricks/ascii-tables/

    The question is how to clean this char no matter where it is in the XML: in a node, an element, or the header?

    Tuesday, April 30, 2019 8:36 AM
  • User753101303 posted

    Hi,

    Can't you just use String.Replace? If you have not done so already and their feed is invalid, I would also try to contact them so that they can fix it once and for all on their side, rather than having everyone who uses their feed fix it.

    Tuesday, April 30, 2019 8:50 AM
  • User-1668014665 posted

    I can't do that here, as there is no string:

        MyRssRequest = TryCast(WebRequest.Create(sURL), HttpWebRequest)
        MyRssResponse = TryCast(MyRssRequest.GetResponse(), HttpWebResponse)
        reader = New StreamReader(MyRssResponse.GetResponseStream())
        MyRssDocument.Load(reader)

    You wrote: "I would also try to contact them so that they can fix that once and for all on their side, rather than having everyone who uses their feed fix it."

    I download hundreds of RSS feeds, so I do not have the time to do this.

    Tuesday, April 30, 2019 8:54 AM
  • User753101303 posted

    In your first post you had :

    Dim XMLstr As String = reader.ReadToEnd()

    So you should be able to then use something such as XMLstr=XMLstr.Replace(Chr(&H1f),"") (if my VB is not too rusty) to drop those unwanted characters.

    I understand, but then everyone using this feed has to work around this issue again and again, when the malformed XML could be fixed once and for all on the other side. (I always prefer to have things fixed at the source when I find something wrong.)

    Tuesday, April 30, 2019 9:14 AM
  • User-1668014665 posted

    "In your first post you had"

    YES, Replace works with a STRING.

    BUT if you re-read my original post, the XML import is wrapped in a TRY/CATCH.

    First part is TRY for HTTP

    The first part is a STREAM 

    The second part for HTTPS is a string, and replace will work for that.

    NOW what to do with the STREAM (for HTTP)?

    Tuesday, April 30, 2019 9:20 AM
  • User753101303 posted

    You could use a custom StreamReader that would skip (or maybe replace ?) this unwanted character. I would use the same approach on both sides rather than working directly on a stream in one case or going first through a string in the other case.

    See https://stackoverflow.com/questions/14242112/how-to-remove-all-instances-of-a-character-from-a-file-in-c/14242617#14242617

    Edit: I took a closer look, and according to https://stackoverflow.com/questions/6693153/what-is-character-0x1f/6693203 the root cause seems to be an XML 1.0 vs XML 1.1 issue. Digging further, it seems .NET doesn't support XML 1.1.
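
    To make the XML 1.0 point concrete, here is a rough sketch of a check based on the XML 1.0 "Char" production, which allows tab, line feed, and carriage return but no other character below 0x20. XmlCharCheck, IsLegalXml10Char, and IndexOfIllegalChar are hypothetical names, and the check works per UTF-16 code unit without validating surrogate pairs.

```vbnet
Imports System
Imports Microsoft.VisualBasic

Public Module XmlCharCheck
    ' XML 1.0 "Char" production, applied per UTF-16 code unit.
    ' Surrogate code units are accepted here without any pairing check.
    Public Function IsLegalXml10Char(ByVal c As Char) As Boolean
        Dim code As Integer = AscW(c)
        If code = &H9 OrElse code = &HA OrElse code = &HD Then Return True
        If code < &H20 Then Return False
        Return code <> &HFFFE AndAlso code <> &HFFFF
    End Function

    ' Returns the zero-based index of the first illegal character,
    ' or -1 when the text contains none.
    Public Function IndexOfIllegalChar(ByVal text As String) As Integer
        If text Is Nothing Then Return -1
        For i As Integer = 0 To text.Length - 1
            If Not IsLegalXml10Char(text(i)) Then Return i
        Next
        Return -1
    End Function
End Module
```

    Running IndexOfIllegalChar over a downloaded feed before parsing gives a quick way to tell whether a feed contains such characters and where the first one sits.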

    Tuesday, April 30, 2019 4:36 PM
  • User-1668014665 posted

            Public Class SanitizedStreamReader
                Inherits StreamReader
    
                Public Sub New(ByVal filename As String)
                    MyBase.New(filename)
                End Sub
    
                Public Overrides Function ReadLine() As String
                    Return Sanitize(MyBase.ReadLine())
                End Function
    
                Private Shared Function Sanitize(ByVal unclean As String) As String
                    Dim res As String = unclean
                    If String.IsNullOrEmpty(unclean) Then
                        Return res
                    End If
                    res = res.Replace((ChrW(4)).ToString(), "")
                    res = res.Replace((ChrW(&H14)).ToString(), "")
                    res = res.Replace((ChrW(&H1F)).ToString(), "")
    
    
                    Return res
                End Function
            End Class

    From https://stackoverflow.com/questions/14242112/how-to-remove-all-instances-of-a-character-from-a-file-in-c/14242617#14242617

    Is this conversion to VB OK? Should I use Chr or ChrW?

    https://ascii.cl/

    Tuesday, April 30, 2019 9:02 PM
  • User475983607 posted

    You are replacing a char not a string.

    https://docs.microsoft.com/en-us/dotnet/api/system.string.replace?view=netframework-4.8

    res.Replace(ChrW(&H1F), ChrW(&H9))

    Or 

    res.Replace(ChrW(&H1F), ChrW(&H20))

    0x09 is a tab and 0x20 is a space. 0x1F is a unit separator so you might want a tab over a space.

    http://www.asciitable.com/

    Tuesday, April 30, 2019 9:21 PM
  • User-1668014665 posted

    My original code post

                Dim MyRssResponse As HttpWebResponse = Nothing
                Dim reader As StreamReader = Nothing
    
    
    
                    Dim MyRssDocument As XmlDocument = New XmlDocument()
                    Dim MyRssList As XmlNodeList = Nothing
    
                    Try 'http:'
                        MyRssRequest = TryCast(WebRequest.Create(sURL), HttpWebRequest)
                        MyRssResponse = TryCast(MyRssRequest.GetResponse(), HttpWebResponse)
                        reader = New StreamReader(MyRssResponse.GetResponseStream())
    
                        ' *****NEED TO CLEAN XML HERE OF 0x1F ******
    
                        MyRssDocument.Load(reader)
                    Catch 'https:'
                        MyRssRequest = TryCast(WebRequest.Create(sURL), HttpWebRequest)
                        MyRssRequest.Method = "GET"
                        MyRssRequest.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"
                        MyRssResponse = CType(MyRssRequest.GetResponse(), HttpWebResponse)
                        Dim dataStream As Stream = MyRssResponse.GetResponseStream()
                        reader = New StreamReader(dataStream)
    
                        ' *****NEED TO CLEAN XML HERE OF 0x1F ******
    
                        Dim XMLstr As String = reader.ReadToEnd()
                        If dataStream IsNot Nothing Then
                            dataStream.Close()
                            dataStream = Nothing
                        End If
                        MyRssDocument.LoadXml(XMLstr)
                    End Try

    Using the class posted earlier (converted from C#) requires a FILE NAME. My code has no file name; it only has a 'reader' created from a Stream.

    How can I use the class 'SanitizedStreamReader' in the above code, which has no file name and only the stream 'reader'?

    I believe I need to clean the XML before it gets into an XML doc, as that is where the bad char throws an error.

            Public Class SanitizedStreamReader
                Inherits StreamReader
    
                Public Sub New(ByVal filename As String)
                    MyBase.New(filename)
                End Sub
    
                Public Overrides Function ReadLine() As String
                    Return Sanitize(MyBase.ReadLine())
                End Function
    
                Private Shared Function Sanitize(ByVal unclean As String) As String
                    Dim res As String = unclean
                    If String.IsNullOrEmpty(unclean) Then
                        Return res
                    End If
                    res = res.Replace((ChrW(4)).ToString(), "")
                    res = res.Replace((ChrW(&H14)).ToString(), "")
                    res = res.Replace((ChrW(&H1F)).ToString(), "")
    
    
                    Return res
                End Function
            End Class
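
    One possible way around the file-name requirement, sketched under assumptions rather than taken from the thread: instead of subclassing StreamReader, wrap the existing reader in a TextReader that drops unwanted control characters as they are read. SanitizingTextReader is a hypothetical name. The sketch filters through the Read overloads rather than ReadLine, on the assumption that the XML loader pulls its data through Read(Char(), Integer, Integer).

```vbnet
Imports System
Imports System.IO
Imports Microsoft.VisualBasic

' Hypothetical wrapper: filters any TextReader, so no file name is needed.
Public Class SanitizingTextReader
    Inherits TextReader

    Private ReadOnly _inner As TextReader

    Public Sub New(ByVal inner As TextReader)
        If inner Is Nothing Then Throw New ArgumentNullException("inner")
        _inner = inner
    End Sub

    ' Control characters below 0x20 other than tab, line feed and carriage return.
    Private Shared Function IsUnwanted(ByVal code As Integer) As Boolean
        Return code >= 0 AndAlso code < &H20 AndAlso
               code <> &H9 AndAlso code <> &HA AndAlso code <> &HD
    End Function

    Public Overrides Function Peek() As Integer
        Dim c As Integer = _inner.Peek()
        While IsUnwanted(c)
            _inner.Read()              ' discard the unwanted character
            c = _inner.Peek()
        End While
        Return c
    End Function

    Public Overrides Function Read() As Integer
        Dim c As Integer = _inner.Read()
        While IsUnwanted(c)
            c = _inner.Read()
        End While
        Return c
    End Function

    Public Overrides Function Read(ByVal buffer() As Char, ByVal index As Integer, ByVal count As Integer) As Integer
        Dim kept As Integer = 0
        While kept = 0
            Dim n As Integer = _inner.Read(buffer, index, count)
            If n <= 0 Then Return 0    ' end of input
            ' Compact the wanted characters toward the start of the slice.
            For i As Integer = index To index + n - 1
                If Not IsUnwanted(AscW(buffer(i))) Then
                    buffer(index + kept) = buffer(i)
                    kept += 1
                End If
            Next
        End While
        Return kept
    End Function

    Protected Overrides Sub Dispose(ByVal disposing As Boolean)
        If disposing Then _inner.Dispose()
        MyBase.Dispose(disposing)
    End Sub
End Class
```

    With the variable names from the code above, the HTTP branch could call MyRssDocument.Load(New SanitizingTextReader(reader)), and the HTTPS branch could call ReadToEnd() on the same wrapper, so both paths share one cleaning step.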

    Tuesday, April 30, 2019 9:28 PM
  • User-1174608757 posted

    Hi icm63,

     Dim XMLstr As String = reader.ReadToEnd()

    According to your description, you already have the response string from Dim XMLstr As String = reader.ReadToEnd(). Since you have the original text, you can clean this string with a regex, and your XmlDocument can then load the XML successfully. If you still have a problem, please post the URL you request the XML from so that we can test it for you.

    Best Regards

    Wei.

    Wednesday, May 1, 2019 8:21 AM
  • User-1668014665 posted

    Dude, I wish you would read the original code.

    There are two approaches in the original code

    One for HTTP, another for HTTPS

    Your solution works for the HTTPS code, not for HTTP.

    Please re-read the original post.

    Wednesday, May 1, 2019 5:43 PM
  • User-1668014665 posted

    I have added this code to my solution:

    Encoding.GetEncoding("ISO-8859-1")

    REF: https://stackoverflow.com/questions/8275825/how-to-prevent-system-xml-xmlexception-invalid-character-in-the-given-encoding

    I will see how it goes, and post back!

                    Dim MyRssDoc As XmlDocument = New XmlDocument()
                    Dim MyRssList As XmlNodeList = Nothing
    
                    Try 'http:'
                        MyRssRequest = TryCast(WebRequest.Create(sURL), HttpWebRequest)
                        MyRssResponse = TryCast(MyRssRequest.GetResponse(), HttpWebResponse)
                        reader = New StreamReader(MyRssResponse.GetResponseStream(), Encoding.GetEncoding("ISO-8859-1"))
                        MyRssDoc.Load(reader)
                    Catch 'https:'
                        MyRssRequest = TryCast(WebRequest.Create(sURL), HttpWebRequest)
                        MyRssRequest.Method = "GET"
                        MyRssRequest.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"
                        MyRssResponse = CType(MyRssRequest.GetResponse(), HttpWebResponse)
                        Dim dataStream As Stream = MyRssResponse.GetResponseStream()
                        reader = New StreamReader(dataStream, Encoding.GetEncoding("ISO-8859-1"))
                        Dim XMLstr As String = reader.ReadToEnd()
                        If dataStream IsNot Nothing Then
                            dataStream.Close()
                            dataStream = Nothing
                        End If
                        MyRssDoc.LoadXml(XMLstr)
                    End Try

    Wednesday, May 1, 2019 6:09 PM