C# How to download web pages in parallel, fast yet correctly

  • Question

  • Hello:

I have to log in to one web site and browse many similar web pages; those pages have URLs like this: https://www.myweb.com/page1/ https://www.myweb.com/page100/

The number of pages varies from time to time: sometimes there is only one page, but sometimes there are up to 400+ pages.

I want to use HttpClient to download all the pages, and to save time I want to use a Parallel loop. However, since each HttpClient has to carry the cookies obtained from the login page, I have to create each HttpClient with the necessary cookies.
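For reference, the cookies do not force one HttpClient per page: a single HttpClient whose handler carries a shared CookieContainer is safe to use from concurrent requests. A minimal sketch under that assumption (the host is the placeholder from the question and will not actually resolve; the cookie value is illustrative):

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

class SharedClientSketch
{
    static async Task Main()
    {
        // One CookieContainer shared by every request issued through this client.
        var cookies = new CookieContainer();
        var baseUri = new Uri("https://www.myweb.com/");  // placeholder host from the question
        cookies.Add(baseUri, new Cookie("csrftoken", "H6EyIoa9njatwHUPH2PdPRIlULApxZDQ4mie6"));

        var handler = new HttpClientHandler { CookieContainer = cookies };

        // HttpClient instance methods such as GetStringAsync are thread-safe,
        // so this one instance can serve all the parallel downloads.
        using (var client = new HttpClient(handler))
        {
            Task<string> p1 = client.GetStringAsync(new Uri(baseUri, "page1"));
            Task<string> p2 = client.GetStringAsync(new Uri(baseUri, "page2"));
            string[] bodies = await Task.WhenAll(p1, p2);
            Console.WriteLine(bodies[0].Length);
        }
    }
}
```

This also avoids paying the handler/connection setup cost once per page.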

I set up a counter to record each page's length, so I can compare their differences.

The following is my C# (.NET Core) code:

    public const string _userAgent =
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3844.0 Safari/537.36";
    public static ParallelOptions Para_Option =
        new ParallelOptions() { MaxDegreeOfParallelism = Environment.ProcessorCount };
    public static ConcurrentDictionary<string, int> Dpage_Counter =
        new ConcurrentDictionary<string, int>();
    public static string login_cookies = "csrftoken=H6EyIoa9njatwHUPH2PdPRIlULApxZDQ4mie6; _gid=GA1.2.657661824.1565778096";

    public static async Task<HttpClient> Create_HttpClient()
    {
        ServicePointManager.UseNagleAlgorithm = true;
        ServicePointManager.Expect100Continue = true;
        ServicePointManager.DefaultConnectionLimit = int.MaxValue;
        ServicePointManager.EnableDnsRoundRobin = true;
        ServicePointManager.ReusePort = true;
        HttpClientHandler clientHandler = new HttpClientHandler()
        {
            AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip
        };
        HttpClient client1 = new HttpClient(clientHandler);
        client1.DefaultRequestHeaders.Accept.Clear();
        client1.DefaultRequestHeaders.Accept.Add(new MediaTypeWithQualityHeaderValue("*/*"));
        client1.DefaultRequestHeaders.Add("Accept-Encoding", "gzip, deflate");
        client1.DefaultRequestHeaders.AcceptLanguage.Add(new StringWithQualityHeaderValue("en-US"));
        client1.DefaultRequestHeaders.Add("User-Agent", _userAgent);
        client1.DefaultRequestHeaders.TryAddWithoutValidation("Cookie", login_cookies);
        await Task.Delay(1);
        return (client1);
    }

    public static async Task Http_Page_Length(string url1)
    {
        using (HttpClient client1 = await Create_HttpClient())
        {
            using (HttpResponseMessage http_reply1 = await client1.GetAsync(url1))
            {
                string html_content1 = await http_reply1.Content.ReadAsStringAsync();
                Dpage_Counter.AddOrUpdate(url1, html_content1.Length, (k, v) => v);
            }
        }
    }

    static async Task Main()
    {
        int max_page_num = 20;
        HashSet<string> Page_URLs = new HashSet<string>();
        for (int i = 1; i <= max_page_num; i++)
        {
            string page_url1 = "https://www.myweb.com/page" + i.ToString();
            Page_URLs.Add(page_url1);
        }
        Parallel.ForEach(Page_URLs, Para_Option, async (page_url1) =>
        {
            await Http_Page_Length(page_url1).ConfigureAwait(false);
        });
    }
    }


When I run my code, I found an issue: if the total number of web pages is less than 10, it always works. But if there are more than 10 pages, I have a problem: the counter ends up with only 10 pages' length data.

I searched around, and someone said the default connection limit for HttpClient is 10 or 2; but when I create the HttpClient, I set the default connection limit to a huge number:

    ServicePointManager.DefaultConnectionLimit = int.MaxValue;
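One caveat worth noting here: on .NET Core, HttpClient is backed by SocketsHttpHandler, and the ServicePointManager settings largely do not apply to it; the per-server connection limit lives on the handler itself. A hedged sketch of the .NET Core-style equivalent (the limit value is an arbitrary example):

```csharp
using System.Net.Http;

class HandlerLimitSketch
{
    static void Main()
    {
        // On .NET Core, configure the connection limit on the handler,
        // not via ServicePointManager.
        var handler = new SocketsHttpHandler
        {
            MaxConnectionsPerServer = 50  // arbitrary example value
        };
        var client = new HttpClient(handler);
    }
}
```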

But I still got the error: the number of pages and the number of entries in the page-length counter did not match.

Please advise: what went wrong with my code?

By the way, if I change my code from Parallel to ordinary sequential execution, then it works without the error.
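The sequential version working is consistent with the usual suspect in code like this: Parallel.ForEach has no overload that understands async lambdas, so the `async (page_url1) => ...` delegate compiles to async void, Parallel.ForEach returns as soon as each lambda reaches its first await, and Main can exit before the downloads finish. A hedged sketch of the Task-based alternative, with a stub standing in for the question's Http_Page_Length:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

class ParallelDownloadSketch
{
    // Stub standing in for the question's Http_Page_Length.
    static async Task Http_Page_Length(string url)
    {
        await Task.Delay(10);  // simulate a download
    }

    static async Task Main()
    {
        IEnumerable<string> pageUrls = Enumerable.Range(1, 20)
            .Select(i => "https://www.myweb.com/page" + i);

        // Start every download, then await them all:
        // Main does not continue until each task has completed.
        List<Task> downloads = pageUrls.Select(url => Http_Page_Length(url)).ToList();
        await Task.WhenAll(downloads);
    }
}
```

If 400+ simultaneous requests are too many for the server, a SemaphoreSlim around the download call is a common way to throttle them.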

Finally, I am using Visual Studio 2019 (Version 16.2.2) on Windows 10.

    Wednesday, August 14, 2019 11:20 AM

All replies