.NET 2.0: A0 being converted to EF BF BD

    General discussion

  • It appears that some Microsoft auto-update within this month (November 2007) has caused a .NET 2.0 application I wrote in C# to convert the byte A0 to EF BF BD when running the following code:

     

    newString = newString.Replace(".htm", ".php");

    tmpFile.WriteLine(newString);

     

    Prior to Nov 3, 2007, this application did not convert A0 and all was well.

     

    A0 happens to be the non-breaking space (&nbsp;) in ISO 8859-1. Some web searching turned up that illegal characters are converted to EF BF BD, but A0 is a legal character in that character set. Therefore, my .NET app should not be performing this conversion.
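    To illustrate what is being described (a minimal sketch, not from the original post): feeding a lone 0xA0 byte to the UTF-8 decoder produces the replacement character U+FFFD, which re-encodes as the byte sequence EF BF BD:

    ```csharp
    using System;
    using System.Text;

    class NbspDemo
    {
        static void Main()
        {
            // A lone 0xA0 byte (the ISO 8859-1 non-breaking space) is not
            // valid on its own in UTF-8, so the decoder substitutes U+FFFD.
            byte[] latin1Bytes = { 0xA0 };
            string decoded = Encoding.UTF8.GetString(latin1Bytes);   // "\uFFFD"
            byte[] reEncoded = Encoding.UTF8.GetBytes(decoded);      // EF BF BD
            Console.WriteLine(BitConverter.ToString(reEncoded));     // prints "EF-BF-BD"
        }
    }
    ```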

     

    Has anyone else run into this problem?

     

     

    Saturday, November 24, 2007 8:23 PM

All replies

  • Could you please post the repro steps? The sample code above didn't seem to provide enough information.
    Tuesday, November 27, 2007 7:50 AM
    Moderator
  • private void Process_Files()
    {
        const bool OVERWRITE = true;
        const bool APPEND = true;
        ArrayList file_array = new ArrayList();
        string newString;
        bool found = false;

        foreach (string fileName in fileList)
        {
            Console.WriteLine("Processing: " + fileName + ".htm");

            // Add PHP authentication info to the top of the file.
            FileInfo srcFile = new FileInfo("PHP-Auth.txt");
            srcFile.CopyTo(HTM_FILE_PATH + "temp.php", OVERWRITE);

            StreamReader htmFile = new StreamReader(HTM_FILE_PATH + fileName + ".htm");
            StreamWriter tmpFile = new StreamWriter(HTM_FILE_PATH + "temp.php", APPEND);

            // Read the entire .htm file into memory.
            while (!htmFile.EndOfStream)
            {
                file_array.Add(htmFile.ReadLine());
            }
            htmFile.Close();

            // Append two blank lines to the end of the temp.php file.
            tmpFile.WriteLine();
            tmpFile.WriteLine();

            // Process each line of the .htm file.
            foreach (string line in file_array)
            {
                // Compare each name in the fileList to each line of this
                // particular .htm file.
                foreach (string name in fileList)
                {
                    // If this line of the file contains one of the filenames in
                    // "fileList" with an extension of .htm, replace it with an
                    // extension of .php, with the exception of a line that
                    // contains "family.htm".
                    if (line.Contains(name + ".htm") && !line.Contains("family.htm"))
                    {
                        found = true;
                        break;
                    }
                }

                if (found)
                {
                    newString = line;
                    newString = newString.Replace(".htm", ".php");
                    tmpFile.WriteLine(newString);
                    found = false;
                }
                else
                {
                    // Use the existing line.
                    tmpFile.WriteLine(line);
                }
            }
            tmpFile.Close();
            file_array.Clear();

            // Rename the temp file to its final .php name.
            File.Move(HTM_FILE_PATH + "temp.php", HTM_FILE_PATH + fileName + ".php");

            // Delete the old .htm file.
            File.Delete(HTM_FILE_PATH + fileName + ".htm");
        }
    }

    Sunday, February 17, 2008 12:18 AM
  • I have resolved the problem.

     

    The default character encoding that StreamWriter uses is not ISO 8859-1 (Western European), and it needs to be when processing a file that was encoded with ISO 8859-1 (as is the case with some web pages).

    Problem resolution:

    Encoding isoWesternEuropean = Encoding.GetEncoding(28591);
    StreamReader htmFile = new StreamReader(HTM_FILE_PATH + fileName + ".htm", isoWesternEuropean);
    StreamWriter tmpFile = new StreamWriter(HTM_FILE_PATH + "temp.php", APPEND, isoWesternEuropean);

    To find all of the character sets that .NET supports, see the MSDN documentation for the Encoding class. It shows that code page 28591 is the .NET identifier for ISO 8859-1.
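    As a sanity check (my own sketch, not part of the original fix), the troublesome byte round-trips cleanly once ISO 8859-1 is used on both sides:

    ```csharp
    using System;
    using System.Text;

    class Latin1RoundTrip
    {
        static void Main()
        {
            // Round-trip a 0xA0 byte through ISO 8859-1: nothing is lost.
            Encoding latin1 = Encoding.GetEncoding(28591);
            byte[] original = { 0xA0 };
            string s = latin1.GetString(original);        // "\u00A0" (non-breaking space)
            byte[] roundTripped = latin1.GetBytes(s);     // { 0xA0 } again
            Console.WriteLine(roundTripped[0] == 0xA0);   // prints "True"
        }
    }
    ```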

    Thursday, February 21, 2008 1:39 AM
You are great! I had the same problem: the encoding failed even though I am sure it had been running fine for several months before.

    In my case I had a text file in DOS format, and some special characters (German characters) were translated into EF BF BD.

    I do not understand this conversion, because the .NET documentation says that unknown characters get converted into a "?" character.

    So the solution for me was using Encoding.GetEncoding(850) (the DOS Latin 1 code page).
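    For example (a hypothetical sketch; the file name and sample byte are made up, and on modern .NET code page 850 additionally requires registering CodePagesEncodingProvider):

    ```csharp
    using System;
    using System.IO;
    using System.Text;

    class Cp850Demo
    {
        static void Main()
        {
            // "legacy.txt" stands in for a DOS-era file; 0x84 is 'ä' in code page 850.
            File.WriteAllBytes("legacy.txt", new byte[] { 0x84 });

            Encoding cp850 = Encoding.GetEncoding(850);
            using (StreamReader reader = new StreamReader("legacy.txt", cp850))
            {
                string text = reader.ReadToEnd();     // "ä", not "\uFFFD"
                Console.WriteLine(text == "\u00E4");  // prints "True"
            }
        }
    }
    ```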

    Thank you for posting your code.
    Tuesday, January 13, 2009 4:24 PM
  • So I ran into this problem today in a different context. In my case, I am given text files from unknown sources that could be in any format, and I perform various operations on them.

    The problem arose when the program had to handle an ANSI text file (i.e., 8 bits per character) and I tried to use the StreamReader constructor's encoding detection. The file contained a 0xA0 character, and the StreamReader, which did not find a BOM in the document, defaulted to UTF-8.

    Unfortunately, the UTF-8 spec says that any byte value above 0x7F (127) can only appear as part of a multi-byte sequence, so the UTF-8 decoder, behaving correctly, encountered the 0xA0 byte, assumed it must be part of a multi-byte sequence, realized it couldn't resolve a valid UTF-8 byte sequence, and replaced the character with the UTF-8 replacement-character byte sequence 0xEF 0xBF 0xBD (the encoding of U+FFFD)...

    In other words, the Encoding was behaving correctly, if the file were UTF-8 encoded. The real problem is that the file was not UTF-8 encoded text; it was ANSI text. If there is a bug in the framework, it is that StreamReader defaults to UTF-8 when it does not detect a BOM.
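    A minimal repro of that default behavior might look like this (the file name and bytes are hypothetical, not from the original report):

    ```csharp
    using System;
    using System.IO;
    using System.Text;

    class BomDefaultDemo
    {
        static void Main()
        {
            // "A", 0xA0, "B" written as raw ANSI bytes, with no BOM.
            File.WriteAllBytes("ansi.txt", new byte[] { 0x41, 0xA0, 0x42 });

            // detectEncodingFromByteOrderMarks is true; no BOM is found,
            // so the reader falls back to UTF-8 and 0xA0 becomes U+FFFD.
            using (StreamReader reader = new StreamReader("ansi.txt", Encoding.UTF8, true))
            {
                string text = reader.ReadToEnd();       // "A\uFFFDB"
                Console.WriteLine(text == "A\uFFFDB");  // prints "True"
            }
        }
    }
    ```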

    In order to work around this problem, and still do BOM detection, I had to write my own BOM detection routines. They look like this: 

        public static class EncodingDetector
        {
            public static Encoding DetectEncoding(string filename)
            {
                using (var fs = new FileStream(filename, FileMode.Open, FileAccess.Read))
                    return DetectEncoding(fs);
            }
    
            public static Encoding DetectEncoding(Stream inputStream)
            {
                // Read may return fewer bytes than requested; any unfilled
                // buffer positions remain zero and will not match a preamble.
                byte[] bomBuffer = new byte[_maxPreambleLength];
                inputStream.Read(bomBuffer, 0, bomBuffer.Length);
    
                foreach (var spec in _encodingPreambles)
                    if (CheckBytesForBom(bomBuffer, spec.Preamble)) return spec.Encoding;
    
                return Encoding.Default;
            }
    
            static EncodingDetector()
            {
                _maxPreambleLength = GetMaxPreambleLength(_encodingPreambles);
            }
    
            private static readonly int _maxPreambleLength;
    
            private struct EncodingPreambleSpec
            {
                public byte[] Preamble;
                public Encoding Encoding;
            }
    
            private static readonly List<EncodingPreambleSpec> _encodingPreambles = new List<EncodingPreambleSpec>
                { 
                    new EncodingPreambleSpec
                    {
                        Encoding = Encoding.UTF8, 
                        Preamble = Encoding.UTF8.GetPreamble()
                    },
                    new EncodingPreambleSpec
                    {
                        Encoding = Encoding.Unicode, 
                        Preamble = Encoding.Unicode.GetPreamble()
                    },
                    new EncodingPreambleSpec
                    {
                        Encoding = Encoding.BigEndianUnicode, 
                        Preamble = Encoding.BigEndianUnicode.GetPreamble()
                    }
                };
            
    
            private static int GetMaxPreambleLength(IEnumerable<EncodingPreambleSpec> preambleSpecs)
            {
                int retval = 0;
                foreach (var spec in preambleSpecs)
                    retval = Math.Max(spec.Preamble.Length, retval);
                
                return retval;
            }
    
            private static bool CheckBytesForBom(byte[] bomBuffer, byte[] preamble)
            {
                if (bomBuffer.Length < preamble.Length) return false;
    
                for (int i = 0; i < preamble.Length; i++)
                    if (bomBuffer[i] != preamble[i]) return false;
                
    
                return true;
            }
            
        }
    and an example usage is:

            private void DoSomethingWithATextFile(string filename)
            {
                Encoding encoding = EncodingDetector.DetectEncoding(filename);
                using (StreamReader reader = new StreamReader(filename, encoding))
                {
                    // ...
                }
            }

    This resolved the problem for me. However, there is a notable exception that will slip by this: documents without BOMs will not be detected. UTF-8 does not require a BOM, and in fact the presence of a BOM is problematic on some systems, so many programs avoid writing one. The same is true for UTF-16, except that UTF-16 with no BOM is rare.
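    One possible mitigation for BOM-less files (my own sketch, not something the thread proposes) is to attempt a strict UTF-8 decode and fall back to ANSI when it fails. Note the limitation: ANSI text that happens to be valid UTF-8 will still be classified as UTF-8.

    ```csharp
    using System;
    using System.Text;

    class Utf8Sniffer
    {
        public static Encoding GuessWhenNoBom(byte[] sample)
        {
            try
            {
                // throwOnInvalidBytes: true, so invalid sequences raise
                // DecoderFallbackException instead of becoming U+FFFD.
                new UTF8Encoding(false, true).GetString(sample);
                return new UTF8Encoding(false);    // decoded cleanly: treat as UTF-8
            }
            catch (DecoderFallbackException)
            {
                return Encoding.Default;           // not valid UTF-8: assume ANSI
            }
        }

        static void Main()
        {
            // A stray 0xA0 is invalid UTF-8, so this sample falls back to ANSI.
            Console.WriteLine(GuessWhenNoBom(new byte[] { 0x41, 0xA0 }) == Encoding.Default); // prints "True"
        }
    }
    ```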

    Hope that helps,
    Troy


    If at first you don't succeed, you must be a programmer.
    Thursday, August 20, 2009 11:51 PM