Conversion of text files from ANSI to UTF-8

    Question

  •  I currently read and write text files in ANSI format and
    write HTML files in charset ISO-8859-1 (Western Europe).

    For reading I currently use StreamReader(file_name,
    System.Text.Encoding.Default)
    For writing I currently use StreamWriter(file_name,
    System.Text.Encoding.Default)

    To allow other charsets and a mix of Western and Eastern European
    and other characters, I would like to change to UTF-8.

    Using VB 2008 Express, I have the following questions:

    1. When reading a text file with a StreamReader, how can I find out
    whether the file is ANSI or UTF-8 encoded?
    2. If it is ANSI, how can I convert the text to UTF-8?
    3. Is there an easy way to convert special characters to HTML entities
    for the HTML output file?

    Any hint would be helpful.

    Best regards
    Diedrich
    Friday, October 31, 2008 8:38 AM

Answers

  • I don't have all the information you need about the conversion, but I can share a conversion technique. Here is an example of how to convert:

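    Imports System.Text

    ' The statements below belong inside a method, for example a button click handler.
    ' Source and target encodings
    Dim ansi As Encoding = Encoding.Default
    Dim utf8 As Encoding = Encoding.UTF8

    Dim stringtoconvert As String = "Blah Blah Blah"

    ' Encode the string as ANSI bytes
    Dim ansiBytes As Byte() = ansi.GetBytes(stringtoconvert)

    ' Conversion: re-encode the ANSI bytes as UTF-8 bytes
    Dim utf8Bytes As Byte() = Encoding.Convert(ansi, utf8, ansiBytes)

    ' Decode the UTF-8 bytes back into a .NET string for display
    Dim uniChars(utf8.GetCharCount(utf8Bytes, 0, utf8Bytes.Length) - 1) As Char
    utf8.GetChars(utf8Bytes, 0, utf8Bytes.Length, uniChars, 0)
    Dim utfString As New String(uniChars)
    MessageBox.Show(utfString)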

    Arjun Paudel
    • Marked as answer by d_hesmer Monday, November 03, 2008 1:55 PM
    Friday, October 31, 2008 8:11 PM
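
    The snippet above converts a string in memory. For whole files, a minimal sketch (not part of the original answer) could decode the file with its source encoding and write the text back as UTF-8. The names AnsiToUtf8Sketch and ConvertFileToUtf8 are hypothetical, chosen only for illustration, and the sketch assumes the file fits in memory.

```vbnet
Imports System.IO
Imports System.Text

Module AnsiToUtf8Sketch
    ' Hypothetical helper: reads sourcePath using sourceEncoding and
    ' writes the same text to targetPath as UTF-8 with a BOM.
    Sub ConvertFileToUtf8(ByVal sourcePath As String, _
                          ByVal targetPath As String, _
                          ByVal sourceEncoding As Encoding)
        Dim content As String

        ' Decoding happens here; .NET strings are UTF-16 in memory.
        Using reader As New StreamReader(sourcePath, sourceEncoding)
            content = reader.ReadToEnd()
        End Using

        ' UTF8Encoding(True) emits a BOM so the file can be recognized later.
        Using writer As New StreamWriter(targetPath, False, New UTF8Encoding(True))
            writer.Write(content)
        End Using
    End Sub
End Module
```

    It could be called as ConvertFileToUtf8("in.txt", "out.txt", Encoding.Default); on the .NET Framework, Encoding.Default is the system ANSI code page, as used in the question.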
  • 1) A UTF-8 file can identify itself with a BOM (byte order mark) at its start.  However, not all UTF-8 files contain one.  Without one, you cannot find out for certain.
    2) No conversion is necessary for characters from the ASCII set; their encoding is identical in UTF-8.  Only characters above 127 are encoded differently.
    3) HTML pages have an encoding too, specified in the HTTP Content-Type header or in a meta tag.  A class like HtmlDocument converts automatically.  Or use System.Text.Encoding.GetBytes().
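
    As a rough sketch of point 1 (not a definitive test), one could check for the UTF-8 BOM and, when it is missing, see whether the bytes decode as strict UTF-8. This is only a heuristic: pure ASCII text passes as UTF-8, and ANSI text that happens to form valid UTF-8 sequences would be misjudged. The names EncodingGuessSketch, HasUtf8Bom, IsValidUtf8 and GuessEncoding are hypothetical.

```vbnet
Imports System
Imports System.Text

Module EncodingGuessSketch
    ' True if the data starts with the UTF-8 BOM bytes EF BB BF.
    Function HasUtf8Bom(ByVal data As Byte()) As Boolean
        Return data.Length >= 3 AndAlso data(0) = &HEF _
            AndAlso data(1) = &HBB AndAlso data(2) = &HBF
    End Function

    ' True if the data decodes as UTF-8 without any invalid byte sequence.
    Function IsValidUtf8(ByVal data As Byte()) As Boolean
        ' Second argument: throw on invalid bytes instead of substituting them.
        Dim strictUtf8 As New UTF8Encoding(False, True)
        Try
            strictUtf8.GetCharCount(data)
            Return True
        Catch ex As ArgumentException
            ' DecoderFallbackException derives from ArgumentException.
            Return False
        End Try
    End Function

    ' Heuristic guess: UTF-8 if a BOM is present or the bytes are valid
    ' UTF-8, otherwise the caller-supplied fallback (for example ANSI).
    Function GuessEncoding(ByVal data As Byte(), ByVal fallback As Encoding) As Encoding
        If HasUtf8Bom(data) OrElse IsValidUtf8(data) Then Return Encoding.UTF8
        Return fallback
    End Function
End Module
```

    A caller might read the file with System.IO.File.ReadAllBytes, pass the bytes to GuessEncoding together with Encoding.Default, and then open the StreamReader with the returned encoding.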

    Hans Passant.
    • Marked as answer by d_hesmer Monday, November 03, 2008 1:55 PM
    Sunday, November 02, 2008 7:58 PM
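
    For question 3, a minimal hand-written sketch could escape the HTML markup characters and write every non-ASCII character as a numeric character reference. The names HtmlEntitySketch and ToHtmlEntities are hypothetical; the .NET Framework also ships HttpUtility.HtmlEncode in System.Web, which may be the simpler choice where that assembly is available.

```vbnet
Imports System
Imports System.Text

Module HtmlEntitySketch
    ' Escapes & < > " and writes every character above ASCII as a
    ' numeric character reference such as &#228;
    Function ToHtmlEntities(ByVal source As String) As String
        Dim sb As New StringBuilder(source.Length)
        Dim i As Integer = 0
        While i < source.Length
            Dim c As Char = source(i)
            If c = "&"c Then
                sb.Append("&amp;")
            ElseIf c = "<"c Then
                sb.Append("&lt;")
            ElseIf c = ">"c Then
                sb.Append("&gt;")
            ElseIf c = """"c Then
                sb.Append("&quot;")
            ElseIf AscW(c) < 128 Then
                ' Plain ASCII is copied unchanged.
                sb.Append(c)
            ElseIf Char.IsHighSurrogate(c) AndAlso i + 1 < source.Length _
                   AndAlso Char.IsLowSurrogate(source(i + 1)) Then
                ' A surrogate pair is one code point outside the BMP.
                sb.Append("&#").Append(Char.ConvertToUtf32(c, source(i + 1))).Append(";")
                i += 1
            Else
                ' Any other character is written by its UTF-16 code value.
                sb.Append("&#").Append(AscW(c)).Append(";")
            End If
            i += 1
        End While
        Return sb.ToString()
    End Function
End Module
```

    If the HTML file itself is written as UTF-8 and declares that charset, only the markup characters strictly need escaping; the numeric references simply keep the output readable under ISO-8859-1 as well.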

All replies

  • Thanks to Arjun and Hans for their helpful answers.

    Diedrich

    Monday, November 03, 2008 1:57 PM