Apply parallelism in C# to my code

  • Question

  • User932968695 posted

    Hi all, 

    Sorry to bother you all. I'm having a hard time adding multithreading to my scraping app. Could anyone tell me where to add the threads?

    I've tried adding a thread before reading each page, but it's not working.

    //First Process (Get Products Url) 
    string newer1 = Server.UrlEncode(FirstUrl); 
    string NewUrl = GetUrl("https://api.proxycrawl.com/?token=KWexa-72p7YRyW2Pq7W2Jg&url=" + newer1); 
    
    //Second Process (Get the HTML Parsed) 
    var doc = GetHtmlDoc("https://api.proxycrawl.com/?token=KWexa-72p7YRyW2Pq7W2Jg&url=" + NewUrl); 
    
    //Get the number of pages to be navigated 
    num = GetNumberofPages(doc); 
    
    //Get all the Url's on a product page and the Next link href 
    Tuple<List<string>, string> myVal = GetAllhrefs(doc); 
    string NPage = myVal.Item2; 
    LinkProduct = myVal.Item1;
    
    
    for (int i = 0; i < num; i++)
    {
        //If its first page then print the process
        if (i == 0)
        {
            Thread hiloNuevo = new Thread(() => EnterAllUrl(LinkProduct));
            hiloNuevo.Start();
        }
        else //Second time, get new HTML information
        {
            i++;
            var docw = GetHtmlDoc(newPage);
            Tuple<List<string>, string> test = GetAllhrefs(docw);
            string NPage2 = test.Item2;
            LinkProduct = test.Item1;

            Thread hiloNuevo2 = new Thread(() => EnterAllUrl(LinkProduct));
            hiloNuevo2.Start();

            NPage = NPage2;
        }
    }
    
    
    //METHOD THAT PRINTS EVERY URL
            private void EnterAllUrl(List<string> LinkProduct)
            {
                for (int j = 0; j < LinkProduct.Count; j++)
                {
                    string Linking = LinkProduct[j];
                    string proxlin = "https://api.proxycrawl.com/?token=KWexa-72p7YRyW2Pq7W2Jg&url=" + Linking;
                    var DataInfo = GetHtmlDoc(proxlin);
                    showData(DataInfo);
                }
            }


    I'd be grateful to know where to insert the threading: with the threads added as above it doesn't work, but without them the scrape takes too long.

    Tuesday, May 1, 2018 8:25 AM

All replies

  • User283571144 posted

    Hi Phanthom241,

    I've tried adding a thread before reading each page, but it's not working.

    From your code and description, I couldn't clearly understand your issue.

    Do you mean you want to read all the pages using multiple threads?

    I found some errors in your code.

    for (int i = 0; i < num; i++)
    {
        //If its first page then print the process
        if (i == 0)
        {
            Thread hiloNuevo = new Thread(() => EnterAllUrl(LinkProduct));
            hiloNuevo.Start();
        }
        else //Second time, get new HTML information
        {
            i++;
            var docw = GetHtmlDoc(newPage);
            Tuple<List<string>, string> test = GetAllhrefs(docw);
            string NPage2 = test.Item2;
            LinkProduct = test.Item1;

    Firstly, you call i++ inside the for loop, but the loop header already increments i.

    If you increment it twice, you will only read half of the pages.

    Besides, you reassign LinkProduct directly inside the for loop.

    Since you have already started a new thread to read the page, you change the LinkProduct value before that thread has finished reading.

    This will cause errors, because a thread that has not started running yet may pick up the replaced list instead of its own.

    I suggest you create a new list variable for each page and pass that into the second thread.
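
    A minimal sketch of that suggestion, under stated assumptions: the page lists are hard-coded, and this EnterAllUrl is a hypothetical stand-in that only records the links it receives rather than calling GetHtmlDoc or showData. The point is that each thread captures a variable declared inside the loop body, so a later iteration cannot replace the list an earlier thread is using. The class and variable names here are illustrative and are not part of the original code.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Threading;

    public static class PerThreadListDemo
    {
        // Hypothetical stand-in for EnterAllUrl: records each link it is given.
        private static void EnterAllUrl(List<string> links, List<string> seen)
        {
            foreach (string link in links)
            {
                // List<T> is not thread-safe, so guard the shared list.
                lock (seen)
                {
                    seen.Add(link);
                }
            }
        }

        public static List<string> Run()
        {
            // Assumed sample data: two pages with two product links each.
            var pages = new List<List<string>>
            {
                new List<string> { "page1/a", "page1/b" },
                new List<string> { "page2/a", "page2/b" }
            };

            var seen = new List<string>();
            var workers = new List<Thread>();

            foreach (List<string> page in pages)
            {
                // Declared inside the loop: every thread gets its own list.
                List<string> linksForThisThread = new List<string>(page);
                var worker = new Thread(() => EnterAllUrl(linksForThisThread, seen));
                workers.Add(worker);
                worker.Start();
            }

            // Wait for every thread before reading the results.
            foreach (Thread started in workers)
            {
                started.Join();
            }

            return seen;
        }

        public static void Main()
        {
            Console.WriteLine(Run().Count);
        }
    }
    ```

    Joining the threads before using the results matters as much as the separate lists: without the Join calls, the caller can finish before any thread has produced output.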

    Best Regards,

    Brando

    Wednesday, May 2, 2018 9:20 AM
  • User475983607 posted

    Also, use async when making HTTP requests.  Parallel threads are for CPU intensive processes.  The code actually creates blocking threads. 
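
    A rough sketch of the async approach, not a drop-in replacement for the code above: it assumes the downloads could go through HttpClient.GetStringAsync instead of the blocking GetHtmlDoc helper, whose implementation is not shown in this thread. FetchAllAsync and the example.com URLs are hypothetical names used only for illustration. All requests are started first and then awaited together, so no thread sits blocked while waiting on the network.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Net.Http;
    using System.Threading.Tasks;

    public static class AsyncFetchDemo
    {
        // One shared HttpClient for all requests.
        private static readonly HttpClient Client = new HttpClient();

        // Starts one download per URL, then awaits them all at once.
        // The fetch delegate is a parameter so the download step can be swapped out.
        public static async Task<string[]> FetchAllAsync(
            IEnumerable<string> urls,
            Func<string, Task<string>> fetch)
        {
            // ToList forces every request to start before the first await.
            List<Task<string>> downloads = urls.Select(fetch).ToList();

            // Results come back in the same order as the input URLs.
            return await Task.WhenAll(downloads);
        }

        public static async Task Main()
        {
            // Placeholder URLs, assumed for this example only.
            var urls = new[] { "https://example.com/", "https://example.org/" };

            string[] pages = await FetchAllAsync(urls, url => Client.GetStringAsync(url));

            foreach (string html in pages)
            {
                Console.WriteLine(html.Length);
            }
        }
    }
    ```

    Each returned string could then be loaded into an HtmlDocument and passed to showData on the calling thread, which also keeps the label updates off the worker threads.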

    Wednesday, May 2, 2018 10:04 AM
  • User932968695 posted

    Brando ZWZ

    Firstly, you call i++ inside the for loop, but the loop header already increments i.

    You are right, I placed that there because I wanted to change the page, but the loop already does that. Thanks.


    Secondly, I understand; that must be the reason why I get no results: the value is being changed while the thread is executing.

    Brando ZWZ

    Besides, you reassign LinkProduct directly inside the for loop.



    I'm testing.

    I only get one product from the first page with those changes, and I don't understand why.

    //First Process (Get Products Url) 
    string newer1 = Server.UrlEncode(FirstUrl); 
    string NewUrl = GetUrl("https://api.proxycrawl.com/?token=KWexa-72p7YRyW2Pq7W2Jg&url=" + newer1); 
    
    //Second Process (Get the HTML Parsed) 
    var doc = GetHtmlDoc("https://api.proxycrawl.com/?token=KWexa-72p7YRyW2Pq7W2Jg&url=" + NewUrl); 
    
    //Get the number of pages to be navigated 
    num = GetNumberofPages(doc); 
    
    //Get all the Url's on a product page and the Next link href 
    Tuple<List<string>, string> myVal = GetAllhrefs(doc); 
    string NPage = myVal.Item2; 
    LinkProduct = myVal.Item1;
    
    
    for (int i = 0; i < num; i++)
    {
        //If its first page then print the process
        if (i == 0)
        {
            Thread hiloNuevo = new Thread(() => EnterAllUrl(LinkProduct));
            hiloNuevo.Start();
        }
        else //Second time, get new HTML information
        {
            var docw = GetHtmlDoc(newPage);
            Tuple<List<string>, string> test = GetAllhrefs(docw);
            string NPage2 = test.Item2;
            LinkProduct2 = test.Item1;

            Thread hiloNuevo2 = new Thread(() => EnterAllUrl(LinkProduct2));
            hiloNuevo2.Start();

            NPage = NPage2;
        }
    }
    
            private void EnterAllUrl(List<string> LinkProduct)
            {
                for (int j = 0; j < LinkProduct.Count; j++)
                {
                    string Linking = LinkProduct[j];
                    string proxlin = "https://api.proxycrawl.com/?token=KWexa-72p7YRyW2Pq7W2Jg&url=" + Linking;
                    var DataInfo = GetHtmlDoc(proxlin);

                    showData(DataInfo);
                }
            }
    
            private void showData(HtmlDocument DataInfo)
            {
                var HeaderNames = DataInfo.DocumentNode.SelectNodes("//h1[@id='title'] | //*[@id='priceblock_ourprice'] | //*[@id='productDetails_detailBullets_sections1']/tr[1]/td | //*[@id='productDetails_detailBullets_sections1']/tr[3]/td/span/span[1]");
                if (HeaderNames != null)
                {
                    var nodeList = HeaderNames.ToList();
                    foreach (var item in nodeList)
                    {
                        OutputLabel.Text += item.InnerText + "\t" + "<br />";
                    }
                }
            }
    
    

    Wednesday, May 2, 2018 9:57 PM