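C#: fetching a site's HTML source and converting it to a string for further data extraction returns garbled text for some sites. Is the fetching method wrong, or is there another cause?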

  • Question

  • // Method that fetches the full HTML source of a page
    private static string GetHtmlSource(string url, Encoding charset)
    {
        string _html = string.Empty;
        try
        {
            HttpWebRequest _request = (HttpWebRequest)WebRequest.Create(url);
            HttpWebResponse _response = (HttpWebResponse)_request.GetResponse();
            using (Stream _stream = _response.GetResponseStream())
            {
                using (StreamReader _reader = new StreamReader(_stream, charset))
                {
                    _html = _reader.ReadToEnd();
                }
            }
        }
        catch
        {
            _html = null;
        }
        return _html;
    }

    // Call site
    GetHtmlSource(url, Encoding.UTF8);


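    I ran two fetch tests and got different results: one returned almost entirely garbled text, while the other looked normal.

    Test 1: fetching the page source of http://news.sohu.com/ displayed garbled text.

    Test 2: fetching the JD.com home page, https://www.jd.com/?cu=true&utm_source=baidu-pinzhuan&utm_medium=cpc&utm_campaign=t_288551095_baidupinzhuan&utm_term=0f3d30c8dba7459bb52f2eb5eba8ac7d_0_2036dbe175a14099a7072259e0709b35, returned page source that displayed normally, with no garbling.

    The same method is used in both cases, yet the source of some URLs comes back garbled. What is going on? Is something wrong inside GetHtmlSource, is it some other problem, or am I running into anti-scraping measures?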
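
Two common causes of this symptom, neither confirmed for the sites above, are a compressed response body being decoded as if it were text, and a page charset that differs from the one passed to StreamReader. As a minimal sketch for checking the first cause, assuming the raw response bytes are available, the hypothetical helper `LooksLikeGzip` below tests for the gzip magic number (0x1F 0x8B). `BodyDiagnostics` and `GzipUtf8` are illustrative names that do not appear in the original post.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class BodyDiagnostics
{
    // Hypothetical helper: true when the bytes start with the gzip magic number 0x1F 0x8B.
    public static bool LooksLikeGzip(byte[] body)
    {
        return body != null && body.Length >= 2 && body[0] == 0x1F && body[1] == 0x8B;
    }

    // Compresses text the way a server sending "Content-Encoding: gzip" would,
    // so the check can be demonstrated without a network call.
    public static byte[] GzipUtf8(string text)
    {
        using (var buffer = new MemoryStream())
        {
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress))
            {
                byte[] raw = Encoding.UTF8.GetBytes(text);
                gzip.Write(raw, 0, raw.Length);
            }
            // ToArray still works after the stream has been closed.
            return buffer.ToArray();
        }
    }

    public static void Main()
    {
        byte[] plain = Encoding.UTF8.GetBytes("<html><title>news</title></html>");
        byte[] compressed = GzipUtf8("<html><title>news</title></html>");

        Console.WriteLine(LooksLikeGzip(plain));      // False
        Console.WriteLine(LooksLikeGzip(compressed)); // True
    }
}
```

If the check returns true for a real response, the body needs to be decompressed before it is handed to a StreamReader. If it returns false, a charset mismatch is the more likely suspect.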


    May 7, 2018, 5:12

Answer

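  • Hello,

    You can try the method below. It reads the charset from the Content-Type header of the response and uses that encoding to decode the fetched content, which should resolve the garbled text.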

    // Requires: using System; using System.IO; using System.IO.Compression;
    //           using System.Net; using System.Text; using System.Text.RegularExpressions;
    public static string GetHtmlSource(string url)
    {
        HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(url);
        webRequest.Timeout = 30000;
        webRequest.Method = "GET";
        webRequest.UserAgent = "Mozilla/4.0";
        // Advertise only gzip, because that is the only compression decoded below
        webRequest.Headers.Add("Accept-Encoding", "gzip");

        using (HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse())
        {
            // Get the target site's encoding from Content-Type; fall back to UTF-8
            string contentType = webResponse.Headers["Content-Type"] ?? string.Empty;
            Match match = Regex.Match(contentType, "charset\\s*=\\s*[\\W]?\\s*([\\w-]+)", RegexOptions.IgnoreCase);
            Encoding encoding = Encoding.UTF8;
            if (match.Success)
            {
                try { encoding = Encoding.GetEncoding(match.Groups[1].Value.Trim()); }
                catch (ArgumentException) { /* unknown charset name: keep UTF-8 */ }
            }

            using (Stream streamReceive = webResponse.GetResponseStream())
            {
                // If the body is gzip-compressed, decompress it before decoding
                bool isGzip = string.Equals(webResponse.ContentEncoding, "gzip", StringComparison.OrdinalIgnoreCase);
                Stream bodyStream = isGzip
                    ? new GZipStream(streamReceive, CompressionMode.Decompress)
                    : streamReceive;

                // The detected encoding is applied to both compressed and uncompressed bodies
                using (StreamReader sr = new StreamReader(bodyStream, encoding))
                {
                    return sr.ReadToEnd();
                }
            }
        }
    }
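
The charset-detection step in the method above can be isolated and exercised without a network call. The following is a minimal sketch under stated assumptions, not a definitive implementation: `CharsetResolver` and `Resolve` are hypothetical names, and the fallback encoding is whatever the caller passes in. On .NET Core and .NET 5+, legacy code pages such as GBK are only available after calling `Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)`. Without that, `Encoding.GetEncoding` throws for them, and this sketch falls back.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class CharsetResolver
{
    // Captures the charset name from a Content-Type value, with or without quotes.
    private static readonly Regex CharsetPattern =
        new Regex("charset\\s*=\\s*[\"']?\\s*([\\w-]+)", RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the encoding named in contentType,
    // or fallback when the header is missing, has no charset, or names an unknown one.
    public static Encoding Resolve(string contentType, Encoding fallback)
    {
        if (string.IsNullOrEmpty(contentType))
        {
            return fallback;
        }

        Match match = CharsetPattern.Match(contentType);
        if (!match.Success)
        {
            return fallback;
        }

        try
        {
            return Encoding.GetEncoding(match.Groups[1].Value);
        }
        catch (ArgumentException)
        {
            // Encoding.GetEncoding throws ArgumentException for unsupported names.
            return fallback;
        }
    }

    public static void Main()
    {
        Console.WriteLine(Resolve("text/html; charset=utf-8", Encoding.ASCII).WebName);
        Console.WriteLine(Resolve("text/html", Encoding.UTF8).WebName);
        Console.WriteLine(Resolve("text/html; charset=no-such-charset", Encoding.UTF8).WebName);
    }
}
```

Keeping the parsing in a separate function makes the fallback behavior explicit: a missing header, a header without a charset, and an unrecognized charset name all yield the caller's fallback rather than an exception.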

    Best regards,

    Zhanglong



    May 8, 2018, 3:13
    Moderator