Getting rid of invalid character in XML.

  • Question

  • When I parse an XML document, it occasionally contains an invalid character. I would like to remove it programmatically, without having to go into the document myself. This is the error I get: "System.Xml.XmlException: '_', hexadecimal value 0x02, is an invalid character." It is raised for the following line of XML: <command user="" name="" command="" />

    Thanks in advance for the help.

    Wednesday, November 22, 2017 2:31 PM

All replies

  •  So what do you do to fix it when you do it manually?

     Apparently an invalid character is being added to the document somewhere along its way to you, or maybe you are creating the file yourself and inserting this character without knowing it.  That would be the best place to fix the problem, so the character never gets into the xml file to begin with.

     However, maybe you have no control over the creation of the file.  In that case, an xml file is basically just a glorified text file: you can open it, remove all the Chr(2) (the same as Chr(&H02)) characters using standard text file methods, then save it back to the hard drive.  You could set up something like this to look for invalid characters and remove them, if needed, before opening and reading the file with the xml methods....

    Imports System.Linq
    Imports System.Xml

    Public Class Form1
        Private Sub Button2_Click(sender As Object, e As EventArgs) Handles Button2.Click
            Dim xmlPath As String = "C:\TestFolder\MyFile.xml"
            'read the whole file as text, using the encoding the file was saved with (UTF8 here)
            Dim fileChars() As Char = IO.File.ReadAllText(xmlPath, System.Text.Encoding.UTF8).ToCharArray
            'only rewrite the file if at least one character is not a legal xml character
            'note: IsXmlChar tests one Char at a time, so the halves of a surrogate pair are removed too
            If fileChars.Any(Function(x) Not XmlConvert.IsXmlChar(x)) Then
                fileChars = fileChars.Where(Function(x) XmlConvert.IsXmlChar(x)).ToArray
                IO.File.WriteAllText(xmlPath, New String(fileChars), System.Text.Encoding.UTF8)
            End If

            'now open and read the xml file...
        End Sub
    End Class
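
    As far as I can tell, XmlConvert.IsXmlChar tests a single Char, so a per-character filter like the one above also drops both halves of a surrogate pair (characters above U+FFFF, such as emoji).  Here is a minimal sketch of a variant that keeps valid pairs and cleans the text in memory instead of rewriting the file.  StripInvalidXmlChars, the XmlCleaner module and the sample string are made up for illustration.

    ```vb
    Imports System
    Imports System.Text
    Imports System.Xml
    Imports System.Xml.Linq

    Module XmlCleaner
        'hypothetical helper: returns a copy of xmlText without characters that are illegal in XML 1.0
        Public Function StripInvalidXmlChars(xmlText As String) As String
            Dim sb As New StringBuilder(xmlText.Length)
            Dim i As Integer = 0
            While i < xmlText.Length
                Dim c As Char = xmlText(i)
                If XmlConvert.IsXmlChar(c) Then
                    sb.Append(c)
                ElseIf i + 1 < xmlText.Length AndAlso XmlConvert.IsXmlSurrogatePair(xmlText(i + 1), c) Then
                    'c is a high surrogate followed by a matching low surrogate: keep the pair together
                    sb.Append(c).Append(xmlText(i + 1))
                    i += 1
                End If
                'anything else (for example Chr(2)) is skipped
                i += 1
            End While
            Return sb.ToString()
        End Function

        Sub Main()
            'made-up sample: the command element from the question with a Chr(2) inside an attribute
            Dim dirty As String = "<command user=""" & ChrW(2) & """ name="""" command="""" />"
            Dim clean As String = StripInvalidXmlChars(dirty)
            'parse from memory, the original text is left untouched
            Dim doc As XDocument = XDocument.Parse(clean)
            Console.WriteLine(doc.Root.Name)
        End Sub
    End Module
    ```

    Cleaning in memory avoids changing the file on disk, which may matter if something else still needs the original.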


    If you say it can't be done then I'll try it

    • Edited by IronRazerz Thursday, November 23, 2017 2:45 PM Updated Code
    • Proposed as answer by Frank L. Smith Thursday, November 23, 2017 3:03 PM
    Wednesday, November 22, 2017 8:32 PM
  • Have a look at this on SO:

    LINK

    The accepted answer is in C# but it doesn't look like it would be hard to change to VB. I've seen similar utilities on the net for "cleaning the XML", but I've always wondered how the bad characters got there to start with.

    If you created the XML to start with then let's talk about what you've got there - that will be the ultimate best solution.


    "A problem well stated is a problem half solved.” - Charles F. Kettering



    • Edited by Frank L. Smith Thursday, November 23, 2017 12:16 PM
    • Proposed as answer by IronRazerz Thursday, November 23, 2017 2:46 PM
    Thursday, November 23, 2017 12:14 PM
  •  Frank, I knew you would get in on this one, since you seem to answer most of the xml questions around here.  I guess I don't use xml enough, because I had never run into the XmlConvert.IsXmlChar method before.  Quite handy for a situation like this, if you have no control over the creation of the xml file.

     After testing it, I noticed that the document being saved back to the hard drive in my prior example was not the same size as the original.  It turned out I was not specifying UTF8 encoding when reading and writing the file; adding that fixed the size problem for my saved xml file, which uses UTF8 encoding.

     Anyway, I am updating my prior example to use both of these fixes, but I wanted to ask you whether there are xml files that use other encodings, like utf7, utf32, or even plain ascii?  It seems I have only ever seen UTF8 in all that I have messed around with.



    Thursday, November 23, 2017 2:44 PM
  • I'd still like to know how the blemish got there to start with -- that's the best way to deal with it; prevention. ;-)

    Thursday, November 23, 2017 3:03 PM
  •  I agree.  I mentioned that too, but I figured I would also offer a way to fix the file, just in case Gixxerluke has no control over how it is created.  8)

    Thursday, November 23, 2017 3:25 PM
  • If it has a bunch of odd anomalies in it, can you really count on the XML itself being valid, though?

    Anyway, I hope he gets to the bottom of it all. :)

    Thursday, November 23, 2017 3:30 PM
  • How are you parsing the XML document?  Do you know how the document was corrupted?

    "Those who use Application.DoEvents() have no idea what it does and those who know what it does never use it" - MSDN User JohnWein

    Friday, November 24, 2017 12:40 PM
  • Ray, XML is a markup language like HTML, but with a focus on data instead of presentation.

    Plain ASCII is a 7-bit code from the days of teleprinters and paper tape. Like paper tape, it is no longer the best way to encode text (it even contains a lot of codes for printer handling).

    HTML and XML are just text files. Which character encoding is used is not fixed: an XML file can name its encoding in the declaration on its first line, for example <?xml version="1.0" encoding="utf-8"?>, and a parser assumes UTF-8 or UTF-16 when none is declared.
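
    To see which encoding a particular XML document claims to use, you can read it from the declaration. A minimal sketch, assuming LINQ to XML is available; the sample string and the EncodingDemo module are made up for illustration:

    ```vb
    Imports System
    Imports System.Xml.Linq

    Module EncodingDemo
        Sub Main()
            'made-up sample document that declares its own encoding
            Dim xml As String = "<?xml version=""1.0"" encoding=""utf-16""?><root />"
            Dim doc As XDocument = XDocument.Parse(xml)

            'Declaration is Nothing when the document has no <?xml ...?> line
            If doc.Declaration IsNot Nothing Then
                Console.WriteLine(doc.Declaration.Encoding)
            End If
        End Sub
    End Module
    ```

    The declared name only tells you what the file says about itself; when cleaning a file as text, that is the encoding to pass to the read and write calls instead of always assuming UTF8.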


    Success
    Cor



    Friday, November 24, 2017 1:16 PM