[Encode] Why do I get the wrong encoding back?


  • I have a web page that does NOT set an encoding. In a browser it shows some Chinese characters, but if I use the following C# code to copy it to a text file, the output has the wrong encoding, like "#&33330#".

    Is my code wrong?

    I have already set the encoding to UTF-8.

    Thanks very much.

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Threading.Tasks;
    using HtmlAgilityPack;
    using System.IO;
    using System.Globalization;

    namespace GetVol
    {
        class Program
        {
            static void Main(string[] args)
            {
                int counter = 0;
                string line;
                string dayfile = ("C:\\stocks\\ra_horse.txt");
                // Read the file of URLs line by line.
                StreamReader file =
                    new System.IO.StreamReader("c:\\stocks\\symbols.txt");
                while ((line = file.ReadLine()) != null)
                {
                    try
                    {
                        Encoding unicode = Encoding.Unicode;
                        HtmlWeb webClient = new HtmlWeb();
                        webClient.OverrideEncoding = Encoding.GetEncoding("utf-8");
                        HtmlAgilityPack.HtmlDocument doc = webClient.Load($"{line}");
                        string mystring = (doc.DocumentNode.InnerText);
                        File.AppendAllText(dayfile, mystring + Environment.NewLine, Encoding.UTF8);
                    }
                    catch (Exception ex)
                    {
                        //do nothing
                        //or put break point to investigate ex
                        string exceptionInformation = ex.Message;
                    }
                }
            }
        }
    }

    Friday, April 21, 2017 8:24 AM

All replies

  • It looks correct to me. The ability to display foreign characters has nothing to do with the string content. What you're seeing is likely the Unicode character code for the values, written as an HTML character reference (although the # on each end is odd; the usual form is "&#33330;"). Unless you have installed the language pack for the language in question and you have switched your app over to use the specific culture, you won't see the foreign characters. You'll just see the Unicode equivalent.

    However, this is HTML, so you'll still likely just see the character codes in the raw string. It isn't until you render them in an HTML viewer (e.g. a web browser) that the codes are translated into characters. IIRC pages commonly write non-ASCII characters as their encoded equivalents so that the text survives whatever encoding the page is served in. The browser converts each one to the correct character (when possible) at rendering.
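
    To illustrate the translation step a browser performs, here is a minimal sketch that decodes a character reference in plain C#. It assumes the saved text really contains a reference of the form "&#33330;"; the class name EntityDemo and the helper Decode are made up for this example. WebUtility.HtmlDecode is part of System.Net in the .NET Framework.

```csharp
using System;
using System.Net;

static class EntityDemo
{
    // Turns HTML character references such as "&#33330;" into real characters.
    public static string Decode(string rawHtmlText)
    {
        return WebUtility.HtmlDecode(rawHtmlText);
    }

    static void Main()
    {
        string raw = "&#33330;";        // assumed shape of what ends up in the text file
        string decoded = Decode(raw);   // one character with code point 33330 (U+8232)

        // Print numbers instead of the character itself, because the console
        // may not be able to display it even though the string is correct.
        Console.WriteLine(decoded.Length);     // prints 1
        Console.WriteLine((int)decoded[0]);    // prints 33330
    }
}
```

    The decoded value is an ordinary .NET string, so whether it displays correctly afterwards depends only on the viewer's font and the encoding used when writing it out.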

    Michael Taylor

    Friday, April 21, 2017 1:51 PM
  • Thanks Michael,

    I tried a web browser and it works, but I cannot find any way to store these characters.

    String - Failed

    Byte[] - Failed

    So I can only save directly to a text file.

    I hope to find something that can store these characters, so that I can edit them (replace, remove, etc.).

    Any hints for this problem?



    Sunday, April 23, 2017 12:22 AM
  • Since the text is encoded, it will come across as a string. Find that character sequence in the actual web page and identify what it maps to. Unfortunately I cannot tell what actual page you're loading, so if you could post the relevant section of the HTML and what you're seeing, it might help. I also notice you're using the HTML Agility Pack. Have you tried just reading the data directly using WebClient and then sending it through HAP? HAP may be doing some extra translating that you might not want.
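
    A rough sketch of that suggestion, for reference: download the raw bytes with WebClient, decode them, hand the HTML to HAP, and then decode any character references left in the text. It assumes the page is UTF-8 and that InnerText returns references such as "&#33330;" undecoded, as the question suggests. The class name RawFetchDemo, the helper ExtractText, and taking the URL from the command line are all made up for this example. HtmlEntity.DeEntitize is HAP's helper for decoding references.

```csharp
using System;
using System.Net;
using System.Text;
using HtmlAgilityPack;

static class RawFetchDemo
{
    // Parses HTML with HAP and decodes character references in its text.
    public static string ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }

    static void Main(string[] args)
    {
        string url = args[0];   // hypothetical: URL passed on the command line

        using (var client = new WebClient())
        {
            byte[] bytes = client.DownloadData(url);
            string html = Encoding.UTF8.GetString(bytes);   // assumption: page is UTF-8

            // The result is a normal string, so it can be edited before saving.
            string text = ExtractText(html).Replace("\r", "").Trim();
            Console.WriteLine(text.Length);
        }
    }
}
```

    Keeping the parsing in ExtractText separate from the download makes it easy to try the decoding on a small HTML snippet first, before pointing it at the real page.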
    Sunday, April 23, 2017 2:23 AM