Replacing degree char for XML

  • Tuesday, July 24, 2012 2:38 PM
     
     

    Hi Forum, this problem has been bothering me for some time and I have yet to find a simple solution.  My code parses txt files into data objects containing strings that eventually get serialized to XML.  Since the source files contain characters which must be converted to escape sequences within XML, the plan is to always do this ahead of time.  The simple .NET C# function I've written is below...

            private string ReplaceDegreeChars(string s)
            {
                const string repstr = "&#176;";
                //const char degree = (char)176;

                string degreestr = "°";
                //string degreestr = degree.ToString();
                string newstr = s.Replace(degreestr, repstr);
     
                return newstr;
            }

    As you might suspect, the code in any form above fails miserably.  Does anyone know why?  I'm assuming that the TextReader I'm using and the Split method are handling Unicode as expected. But even s.Contains(degree) on the parameter above never returns true.

    A copy and paste of a single line from my test file is below.  I use Split with a comma delimiter, trim off the quotation marks, and then call ReplaceDegreeChars for each item on the line.

    "EVNT","0F000002","APU Oil Temp High - Oil Temp >135°C",1

    Thanks for any hints as to why this code is very very bad (other than its performance of course).

    Wil

All Replies

  • Tuesday, July 24, 2012 4:04 PM
     

    Sorry, I don't understand what the problem is, or rather I think that a misconception about XML is leading you to do something that is not necessary. XML builds on and fully supports Unicode, so a degree character is nothing special and can be put as content without problems in any XML document encoded with a Unicode encoding like UTF-8.

    If you want to construct well-formed XML then simply use the .NET framework's APIs that assist you in that task: there is XmlWriter and there is LINQ to XML, and they will do any necessary escaping (such as escaping '<' as '&lt;') for you.

    Here is an example using XmlWriter

                string data = "a < b, foo & bar, a < b && b < c";
    
                using (XmlWriter xw = XmlWriter.Create(Console.Out, new XmlWriterSettings() { Indent = true }))
                {
                    xw.WriteStartDocument();
                    xw.WriteStartElement("root");
                    xw.WriteStartElement("items");
                    foreach (string item in data.Split(','))
                    {
                        xw.WriteElementString("item", item);
                    }
                    xw.WriteEndDocument();
                }

    That outputs

    <root>
      <items>
        <item>a &lt; b</item>
        <item> foo &amp; bar</item>
        <item> a &lt; b &amp;&amp; b &lt; c</item>
      </items>
    </root>
    

    Here is an example using LINQ to XML:

                XDocument doc = new XDocument(
                    new XElement("root",
                        new XElement("items",
                            from item in data.Split(',')
                            select new XElement("item", item))));

    Again the output has anything escaped that needs to be escaped:

    <root>
      <items>
        <item>a &lt; b</item>
        <item> foo &amp; bar</item>
        <item> a &lt; b &amp;&amp; b &lt; c</item>
      </items>
    </root>
    

    For a quick demonstration I wrote the XmlWriter output, and saved the XDocument, to the console, but of course you can write to a file, a stream or a TextWriter as well.


    MVP Data Platform Development My blog

  • Tuesday, July 24, 2012 5:29 PM
     

    To add to my previous answer: if you really want to create an XML document in an encoding that does not have the degree character, like US-ASCII, then XmlWriter or XDocument (with the help of XmlWriter) will encode the sign as a numeric character reference where needed, so the sample

                string data = "a < b,foo & bar,a < b && b < c,temperature < 35°";
    
                
                using (XmlWriter xw = XmlWriter.Create("../../output1.xml", new XmlWriterSettings() { Indent = true, Encoding = Encoding.ASCII }))
                {
                    xw.WriteStartDocument();
                    xw.WriteStartElement("root");
                    xw.WriteStartElement("items");
                    foreach (string item in data.Split(','))
                    {
                        xw.WriteElementString("item", item);
                    }
                    xw.WriteEndDocument();
                }

    creates a file with

    <?xml version="1.0" encoding="us-ascii"?>
    <root>
      <items>
        <item>a &lt; b</item>
        <item>foo &amp; bar</item>
        <item>a &lt; b &amp;&amp; b &lt; c</item>
        <item>temperature &lt; 35&#xB0;</item>
      </items>
    </root>



  • Wednesday, July 25, 2012 12:29 PM
     
     

    Hi Martin:

    Thanks for your response.  Actually I'm trying to do what you are also describing, using .NET's String Replace method.  Perhaps I should have posted there, but I figured this issue must come up often when creating well-formed XML from text files.  The 0x00B0 Unicode character found in the text must be replaced with "&#176;" within the XML. Unfortunately the Replace method in my example code fails to perform this for me.  It does work when searching for and replacing chars other than 0x00B0, though.

    Wil 

  • Wednesday, July 25, 2012 1:12 PM
     


    My main point is that the degree character is nothing special in the Unicode world which XML uses and supports, so you don't need to escape it at all in most cases. If you do need to escape it (because you want to create an XML file in US-ASCII encoding), or if you want to construct well-formed XML in the .NET framework in general, then you shouldn't mess with string replacements. Simply use the XML APIs the Microsoft .NET framework offers, like XmlWriter or XDocument; those classes create well-formed XML and do the character escaping for you.

    As for the mere string replacement not working, I can't reproduce that here. My sample code is

                const string repstr = "&#176;";
    
                string degreestr = "°";
    
                string input = "APU Oil Temp High - Oil Temp >135°C";
                string replacement = input.Replace(degreestr, repstr);
    
                Console.WriteLine("|{0}| replaced by |{1}|", input, replacement);

    that outputs

    |APU Oil Temp High - Oil Temp >135°C| replaced by |APU Oil Temp High - Oil Temp >135&#176;C|
    But as I said, using the string replacement functions is not the way to go to create well-formed XML; use the API in System.Xml to construct your XML.




  • Wednesday, August 01, 2012 5:08 PM
     
     

    Hi Martin;

    First my apologies for being away.  Business goes on as usual.  

    I understand that XmlWriter should be doing the replacements for me since I'm calling XmlSerializer to do the work. However, for some reason it wasn't working for me, so I came up with the Replace solution in my code, which still didn't work.  Unfortunately my quick assumption that this was a serialization issue was where I went wrong.

    Digging a little deeper into my code I rewrote my function as follows and stepped through it using the debugger:

            private string ReplaceDegreeChars(string s)
            {
                StringBuilder result = new StringBuilder();
                for (int i = 0; i < s.Length; i++)
                {
                    char c = s[i];
                    // Compare the char itself; a (byte) cast would truncate code points above 0xFF.
                    if (c == '\u00B0')
                        result.Append("&#176;");
                    else
                        result.Append(c);
                }
                return result.ToString();
            }

    What I missed was that the degree char was never actually making it into my string value.  It was actually U+FFFD, the Unicode replacement character.  So my problem occurs when reading the text file and not during serialization.

    When I searched the text files for the degree character using a text editor there was no problem finding several occurrences.  However, when reading in the text files I was using

    TextReader reader = new StreamReader(defFileName);

    String text;

    while ((text = reader.ReadLine()) != null)
     { ...

    Turns out TextReader was replacing the degree characters from my "not-so-7-bit" text file with U+FFFD within the text string which eventually made it to my data class.  My ReplaceDegreeChars function never saw the degree chars to replace, and the serializer just left things as they were.
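
    The substitution described above can be reproduced without a file. The following is a minimal sketch, assuming the degree sign is stored as the single byte 0xB0 (which is what the hex editor reports later in the thread) and relying on the fact that a StreamReader created without an encoding argument decodes as UTF-8. The class and member names are hypothetical and only serve the illustration.

```csharp
using System;
using System.Text;

public static class DecodeDemo
{
    // "35°C" as stored in a single-byte file: the degree sign is the lone byte 0xB0.
    public static readonly byte[] Raw = { 0x33, 0x35, 0xB0, 0x43 };

    // Decode the same bytes under whichever encoding the caller picks.
    public static string Decode(Encoding encoding)
    {
        return encoding.GetString(Raw);
    }

    public static void Main()
    {
        string asUtf8 = Decode(Encoding.UTF8);
        string asLatin1 = Decode(Encoding.GetEncoding("iso-8859-1"));

        // A lone 0xB0 is not a valid UTF-8 sequence, so the decoder substitutes U+FFFD.
        Console.WriteLine("UTF-8:      U+{0:X4}", (int)asUtf8[2]);
        // ISO-8859-1 maps byte 0xB0 straight to U+00B0, the degree sign.
        Console.WriteLine("ISO-8859-1: U+{0:X4}", (int)asLatin1[2]);
    }
}
```

    A text editor that guesses a single-byte encoding shows the degree sign correctly, which would explain why searching the file in an editor finds it while the decoded string does not contain it.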

    Now I need to learn more about the TextReader or replace it with a binary reader since it's mangling my input.  Is this just another reason to simplify tasks by writing one's own code?  Or is there a way to get TextReader to behave?

    Wil

  • Wednesday, August 01, 2012 5:17 PM
     

    Well, there are lots of possible encodings and you haven't even told us which encoding your text file uses. So you need to find out that encoding and then use http://msdn.microsoft.com/en-us/library/ms143456.aspx

      using (TextReader reader = new StreamReader(defFileName, Encoding.GetEncoding(encodingNameOrCodePageNumberGoesHere)))

      { ... }
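
    As a fuller sketch of the same idea: the example below assumes, purely for illustration, that the file is single-byte ISO-8859-1 text (the thread never states the actual encoding; Windows-1252 would be another candidate, since both store the degree sign as byte 0xB0). The helper name ReadLines and the temporary test file are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class ReadWithEncoding
{
    // Read every line of a file, decoding with the encoding the caller names.
    public static List<string> ReadLines(string path, Encoding encoding)
    {
        var lines = new List<string>();
        using (TextReader reader = new StreamReader(path, encoding))
        {
            string text;
            while ((text = reader.ReadLine()) != null)
                lines.Add(text);
        }
        return lines;
    }

    public static void Main()
    {
        // Assumption: the source file is single-byte ISO-8859-1 text.
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        string path = Path.GetTempFileName();
        File.WriteAllBytes(path, latin1.GetBytes("Oil Temp >135\u00B0C"));

        foreach (string line in ReadLines(path, latin1))
        {
            // U+FFFD in the decoded text is a sign that the encoding guess was wrong.
            bool mangled = line.IndexOf('\uFFFD') >= 0;
            Console.WriteLine("{0} (mangled: {1})", line, mangled);
        }
        File.Delete(path);
    }
}
```

    Checking the decoded text for U+FFFD is a cheap way to notice a wrong encoding guess early, before the strings reach the serializer.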



    • Marked As Answer by wilcode Monday, August 06, 2012 7:16 PM
  • Wednesday, August 01, 2012 7:44 PM
     
     

    :-) Here in the good ol' USA on my DELL Keyboard and Windows XP I've tried ASCIIEncoding.ASCII and ASCIIEncoding.UTF8 to no avail.  As I mentioned earlier TextPad, NotePad, and my WinHex editors have no problems finding the 0x00B0 8 bit characters in the file(s).  What encoding or codepage would you suggest?  I appreciate your help and tolerance.

    Wil

  • Wednesday, August 01, 2012 8:33 PM
     
     

    Phew - Trial and error found the solution for my particular text file to be:

    TextReader reader = new StreamReader(defFileName, ASCIIEncoding.UTF7);

    And as Martin correctly pointed out, the XmlSerializer will perform the replacements as required when writing XML.

    Thanks!
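
    For completeness, here is a minimal sketch of the serializer leaving the escaping to XmlWriter. The EventRecord class is a hypothetical stand-in for the data objects mentioned at the start of the thread, and the US-ASCII output encoding is chosen only so that the degree sign has to be written as a character reference, as in the us-ascii XmlWriter sample earlier.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical data class standing in for the poster's parsed objects.
public class EventRecord
{
    public string Description { get; set; }
}

public static class SerializeDemo
{
    // Serialize a record to XML text using a US-ASCII encoded XmlWriter.
    public static string ToAsciiXml(EventRecord record)
    {
        var settings = new XmlWriterSettings { Indent = true, Encoding = Encoding.ASCII };
        using (var stream = new MemoryStream())
        {
            using (XmlWriter writer = XmlWriter.Create(stream, settings))
            {
                new XmlSerializer(typeof(EventRecord)).Serialize(writer, record);
            }
            return Encoding.ASCII.GetString(stream.ToArray());
        }
    }

    public static void Main()
    {
        // The string holds a real U+00B0; no manual Replace call is involved.
        var record = new EventRecord { Description = "Oil Temp >135\u00B0C" };
        Console.WriteLine(ToAsciiXml(record));
    }
}
```

    With a Unicode output encoding such as UTF-8 the degree sign would simply be written as itself, which is equally valid XML.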