Apply parallelism in C# to my code

Question
-
User932968695 posted
Hi all,
Sorry to bother you all. I'm having a hard time adding multithreading to my scraping app. Could anyone tell me where to add the threads?
I've tried adding them before reading each page, but it is not working.
    //First Process (Get Products Url)
    string newer1 = Server.UrlEncode(FirstUrl);
    string NewUrl = GetUrl("https://api.proxycrawl.com/?token=KWexa-72p7YRyW2Pq7W2Jg&url=" + newer1);

    //Second Process (Get the HTML Parsed)
    var doc = GetHtmlDoc("https://api.proxycrawl.com/?token=KWexa-72p7YRyW2Pq7W2Jg&url=" + NewUrl);

    //Get the number of pages to be navigated
    num = GetNumberofPages(doc);

    //Get all the Url's on a product page and the Next link href
    Tuple<List<string>, string> myVal = GetAllhrefs(doc);
    string NPage = myVal.Item2;
    LinkProduct = myVal.Item1;

    for (int i = 0; i < num; i++)
    {
        //If its first page then print the process
        if (i == 0)
        {
            Thread hiloNuevo = new Thread(() => EnterAllUrl(LinkProduct));
            hiloNuevo.Start();
        }
        else //Second time, get new HTML information
        {
            i++;
            var docw = GetHtmlDoc(newPage);
            Tuple<List<string>, string> test = GetAllhrefs(docw);
            string NPage2 = test.Item2;
            LinkProduct = test.Item1;
            Thread hiloNuevo2 = new Thread(() => EnterAllUrl(LinkProduct));
            hiloNuevo2.Start();
            NPage = NPage2;
        }
    }

    //METHOD THAT PRINTS EVERY URL
    private void EnterAllUrl(List<string> LinkProduct)
    {
        for (int j = 0; j < LinkProduct.Count; j++)
        {
            string Linking = LinkProduct[j];
            string proxlin = "https://api.proxycrawl.com/?token=KWexa-72p7YRyW2Pq7W2Jg&url=" + Linking;
            var DataInfo = GetHtmlDoc(proxlin);
            showData(DataInfo);
        }
    }
I'd be grateful to know where to insert the threading: with the threads added as above it does not work, but without them the scraping takes too much time.
Tuesday, May 1, 2018 8:25 AM
All replies
-
User283571144 posted
Hi Phanthom241,
"I've tried adding them before reading each page, but it is not working."
From your code and description, I couldn't clearly understand your issue.
Do you mean you want to read all the pages by using multiple threads?
I found some errors in your code.
    for (int i = 0; i < num; i++)
    {
        //If its first page then print the process
        if (i == 0)
        {
            Thread hiloNuevo = new Thread(() => EnterAllUrl(LinkProduct));
            hiloNuevo.Start();
        }
        else //Second time, get new HTML information
        {
            i++;
            var docw = GetHtmlDoc(newPage);
            Tuple<List<string>, string> test = GetAllhrefs(docw);
            string NPage2 = test.Item2;
            LinkProduct = test.Item1;
First, you call i++ inside the for loop, but the loop header already increments i.
If you increment it twice, you will read only about half of the pages.
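To make the effect of the extra increment concrete, here is a minimal sketch. The class and method names (LoopIncrementDemo, VisitedIndices) are made up for illustration and are not part of the original code; the loop only mirrors the shape of the one in the question.

```csharp
using System;
using System.Collections.Generic;

public static class LoopIncrementDemo
{
    // Returns the values of i that the loop body actually starts with.
    // extraIncrement mimics the extra i++ in the else branch of the question.
    public static List<int> VisitedIndices(int num, bool extraIncrement)
    {
        var visited = new List<int>();
        for (int i = 0; i < num; i++)
        {
            visited.Add(i);
            if (extraIncrement && i != 0)
            {
                i++; // second increment, on top of the one in the loop header
            }
        }
        return visited;
    }
}
```

With num = 6, VisitedIndices(6, true) yields 0, 1, 3, 5 (four iterations), while VisitedIndices(6, false) yields all six indices.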
Besides, you replace the value of LinkProduct directly inside the for loop.
Since you have already started a new thread to read the page, you change LinkProduct before that thread has finished reading it.
The thread can then end up working on the wrong list, which will cause errors.
I suggest you create a new LinkProduct list for each page and pass that into the second thread.
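The suggestion above can be sketched as follows. This is only an illustration under assumptions: CaptureDemo, sharedLinks, SharedVariable and LocalCopies are hypothetical names, and the string lists stand in for the product URLs. The first method shows that a lambda reads a shared variable when it runs, not when it is created; the second gives each thread its own local list.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class CaptureDemo
{
    // Stand-in for the shared LinkProduct variable from the question.
    private static List<string> sharedLinks;

    // The lambda reads sharedLinks only when it finally runs,
    // so it sees whatever the loop assigned in the meantime.
    public static string SharedVariable()
    {
        sharedLinks = new List<string> { "page1-a", "page1-b" };
        Func<string> work = () => string.Join(",", sharedLinks);
        sharedLinks = new List<string> { "page2-a" }; // the loop moves on first
        return work(); // page 1's links are never processed
    }

    // Each thread captures its own local list, which nobody reassigns.
    public static string[] LocalCopies()
    {
        var pages = new[]
        {
            new List<string> { "page1-a", "page1-b" },
            new List<string> { "page2-a" }
        };
        var seen = new string[pages.Length];
        var threads = new List<Thread>();
        for (int i = 0; i < pages.Length; i++)
        {
            int slot = i;                      // fresh local per iteration
            List<string> pageLinks = pages[i]; // fresh local per iteration
            var worker = new Thread(() => seen[slot] = string.Join(",", pageLinks));
            threads.Add(worker);
            worker.Start();
        }
        foreach (Thread started in threads)
        {
            started.Join(); // wait for every thread before reading the results
        }
        return seen;
    }
}
```

SharedVariable returns only the second page's link, while LocalCopies returns one entry per page, each built from the list that thread was given.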
Best Regards,
Brando
Wednesday, May 2, 2018 9:20 AM -
User475983607 posted
Also, use async when making HTTP requests. Parallel threads are meant for CPU-intensive work. Here, each thread just blocks while it waits for the HTTP response.
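A rough sketch of what the async approach could look like, assuming HttpClient is acceptable in place of whatever GetHtmlDoc uses internally (the thread does not show that method). AsyncFetchSketch and FetchAllAsync are hypothetical names. The fetch function is passed in as a parameter so the same code can be tried without network access.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

public static class AsyncFetchSketch
{
    // One shared HttpClient for all requests.
    private static readonly HttpClient Client = new HttpClient();

    // Starts every download, then awaits them all together.
    // No thread sits blocked while the responses are pending.
    public static async Task<string[]> FetchAllAsync(
        IEnumerable<string> urls, Func<string, Task<string>> fetch)
    {
        Task<string>[] downloads = urls.Select(fetch).ToArray();
        return await Task.WhenAll(downloads); // results keep the input order
    }

    // Convenience overload that downloads each URL as a string.
    public static Task<string[]> FetchAllAsync(IEnumerable<string> urls)
    {
        return FetchAllAsync(urls, url => Client.GetStringAsync(url));
    }
}
```

Note that this starts all requests at once. If the proxy service limits concurrent requests, the number of simultaneous downloads would need to be capped, which this sketch does not do.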
Wednesday, May 2, 2018 10:04 AM -
User932968695 posted
Brando ZWZ
"First, you call i++ inside the for loop, but the loop header already increments i."
You are right. I placed it there because I wanted to move to the next page, but the loop already does that. Thanks.
Secondly, I understand. That must be the reason why I get no results: the value is being changed while the thread is executing.
Brando ZWZ
"Besides, you replace the value of LinkProduct directly inside the for loop."
I'm testing it now. With those changes I get only one product from the first page, and I don't understand why.
    //First Process (Get Products Url)
    string newer1 = Server.UrlEncode(FirstUrl);
    string NewUrl = GetUrl("https://api.proxycrawl.com/?token=KWexa-72p7YRyW2Pq7W2Jg&url=" + newer1);

    //Second Process (Get the HTML Parsed)
    var doc = GetHtmlDoc("https://api.proxycrawl.com/?token=KWexa-72p7YRyW2Pq7W2Jg&url=" + NewUrl);

    //Get the number of pages to be navigated
    num = GetNumberofPages(doc);

    //Get all the Url's on a product page and the Next link href
    Tuple<List<string>, string> myVal = GetAllhrefs(doc);
    string NPage = myVal.Item2;
    LinkProduct = myVal.Item1;

    for (int i = 0; i < num; i++)
    {
        //If its first page then print the process
        if (i == 0)
        {
            Thread hiloNuevo = new Thread(() => EnterAllUrl(LinkProduct));
            hiloNuevo.Start();
        }
        else //Second time, get new HTML information
        {
            var docw = GetHtmlDoc(newPage);
            Tuple<List<string>, string> test = GetAllhrefs(docw);
            string NPage2 = test.Item2;
            LinkProduct2 = test.Item1;
            Thread hiloNuevo2 = new Thread(() => EnterAllUrl(LinkProduct2));
            hiloNuevo2.Start();
            NPage = NPage2;
        }
    }

    private void EnterAllUrl(List<string> LinkProduct)
    {
        for (int j = 0; j < LinkProduct.Count ; j++)
        {
            string Linking = LinkProduct[j];
            string proxlin = + Linking;
            var DataInfo = GetHtmlDoc(proxlin);
            showData(DataInfo);
        }
    }

    private void showData(HtmlDocument DataInfo)
    {
        var HeaderNames = DataInfo.DocumentNode.SelectNodes("//h1[@id='title'] | //*[@id='priceblock_ourprice'] | //*[@id='productDetails_detailBullets_sections1']/tr[1]/td | //*[@id='productDetails_detailBullets_sections1']/tr[3]/td/span/span[1]");
        if (HeaderNames != null)
        {
            var nodeList = HeaderNames.ToList();
            foreach (var item in nodeList)
            {
                OutputLabel.Text += item.InnerText + "\t" + "<br />";
            }
        }
    }
Wednesday, May 2, 2018 9:57 PM
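The thread records no answer to this last follow-up. One possible cause, which is an assumption and not confirmed by any poster: nothing waits for the worker threads, so the page may finish rendering before they have appended anything, and several threads update OutputLabel.Text at the same time. A minimal sketch of one way around both, with hypothetical names (CollectThenRender, Render, fetchText): workers only add to a thread-safe queue, the caller joins every thread, and the output string is built once at the end.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

public static class CollectThenRender
{
    // pages: one list of product links per result page.
    // fetchText: stand-in for "download the page and extract its text".
    public static string Render(
        IList<List<string>> pages, Func<string, string> fetchText)
    {
        var lines = new ConcurrentQueue<string>(); // safe for concurrent writers
        var threads = new List<Thread>();
        foreach (List<string> page in pages)
        {
            List<string> pageLinks = page; // each thread gets its own list
            var worker = new Thread(() =>
            {
                foreach (string link in pageLinks)
                {
                    lines.Enqueue(fetchText(link));
                }
            });
            threads.Add(worker);
            worker.Start();
        }
        foreach (Thread started in threads)
        {
            started.Join(); // do not build the output until all work is done
        }
        // Sorted only so the output does not depend on thread timing.
        return string.Join("<br />", lines.OrderBy(l => l, StringComparer.Ordinal));
    }
}
```

In the page code, the returned string would be assigned to OutputLabel.Text once, on the request thread, instead of appending to it from inside each worker.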