Sorry for such a beginner question, but I have never done web scraping before.
My question is: how can I scrape all href attributes whose parent div container has a class attribute of 'persona-name'?
Here is the site I'm trying to scrape from:
http://www.battlefieldheroes.com/en/player/2305114994
All I want to gather are the links inside the 'persona-name' divs (highlighted in green in my original post):
<ul class="user-personas">
  <li class="user-persona clearfix">
    <img src="http://cdn.battlefieldheroes.com/static/20101215090912/bulk-images/hero-headshot-icons-32/1-6-2-0-130.png" alt="avatar" class="persona-avatar" />
    <div class="persona-name"><a href="/en/heroes/276720132">[gg]ROAST~BEEF</a></div>
    <div class="persona-faction faction-National" title="National"> </div>
    <div class="persona-class class-gunner" title="Gunner"> </div>
    <div class="persona-level level-16" title="Level 16"> </div>
  </li>
  <li class="user-persona clearfix">
    <img src="http://cdn.battlefieldheroes.com/static/20101215090912/bulk-images/hero-headshot-icons-32/2-4-2-85-107.png" alt="avatar" class="persona-avatar" />
    <div class="persona-name"><a href="/en/heroes/235328126">[gg]SLOPPY~JOE</a></div>
    <div class="persona-faction faction-Royal" title="Royal"> </div>
    <div class="persona-class class-commando" title="Commando"> </div>
    <div class="persona-level level-12" title="Level 12"> </div>
  </li>
  <li class="user-persona clearfix">
    <img src="http://cdn.battlefieldheroes.com/static/20101215090912/bulk-images/hero-headshot-icons-32/2-4-4-0-107.png" alt="avatar" class="persona-avatar" />
    <div class="persona-name"><a href="/en/heroes/233563772">[gg]HOOF~ARTED</a></div>
    <div class="persona-faction faction-Royal" title="Royal"> </div>
    <div class="persona-class class-soldier" title="Soldier"> </div>
    <div class="persona-level level-30" title="Level 30"> </div>
  </li>
  <li class="user-persona clearfix">
    <img src="http://cdn.battlefieldheroes.com/static/20101215090912/bulk-images/hero-headshot-icons-32/2-6-2-85-107.png" alt="avatar" class="persona-avatar" />
    <div class="persona-name"><a href="/en/heroes/220683351">[gg]PORK~CHOP</a></div>
    <div class="persona-faction faction-Royal" title="Royal"> </div>
    <div class="persona-class class-gunner" title="Gunner"> </div>
    <div class="persona-level level-27" title="Level 27"> </div>
  </li>
</ul>
I have tried the following code and cannot figure out why it is not working:
Dim content As String = ""
Dim web As New HtmlAgilityPack.HtmlWeb
Dim doc As HtmlAgilityPack.HtmlDocument = web.Load("http://www.battlefieldheroes.com/en/player/2305114994")
Dim hnc As HtmlAgilityPack.HtmlNodeCollection = doc.DocumentNode.SelectNodes("//div[@class='persona-name']//a")
For Each link As HtmlAgilityPack.HtmlNode In hnc
    Dim replaceUnwanted As String = ""
    replaceUnwanted = link.GetAttributeValue("href", String.Empty)
    replaceUnwanted = replaceUnwanted.Replace("'", "'")
    content &= replaceUnwanted & vbNewLine
Next
HTMLText.Text = content
I get the following error:
Object reference not set to an instance of an object.
...
Line 8: Dim doc As HtmlAgilityPack.HtmlDocument = web.Load("http://www.battlefieldheroes.com/en/player/2305114994")
Line 9: Dim hnc As HtmlAgilityPack.HtmlNodeCollection = doc.DocumentNode.SelectNodes("//div[@class='persona-name']")
Line 10: For Each link As HtmlAgilityPack.HtmlNode In hnc
Line 11: Dim replaceUnwanted As String = ""
Line 12:
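A possible explanation, offered as a guess rather than a confirmed diagnosis: as far as I can tell, HtmlAgilityPack's SelectNodes returns Nothing (not an empty collection) when the XPath matches no nodes, so the For Each would fail with exactly this error if the downloaded page does not contain the markup shown above. The sketch below guards against that case. It parses an inline copy of one list item with LoadHtml instead of fetching the live page, and the names PersonaLinks and ExtractPersonaLinks are made up for illustration.

```vbnet
Imports System
Imports System.Collections.Generic
Imports HtmlAgilityPack

Module PersonaLinks
    ' Hypothetical helper: returns the href of every <a> that sits inside
    ' a <div class="persona-name">. Returns an empty list if none are found.
    Function ExtractPersonaLinks(html As String) As List(Of String)
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        Dim links As New List(Of String)
        Dim nodes As HtmlNodeCollection =
            doc.DocumentNode.SelectNodes("//div[@class='persona-name']//a[@href]")

        ' SelectNodes yields Nothing when nothing matches, so check before
        ' enumerating to avoid the "Object reference not set" exception.
        If nodes Is Nothing Then Return links

        For Each link As HtmlNode In nodes
            links.Add(link.GetAttributeValue("href", String.Empty))
        Next
        Return links
    End Function

    Sub Main()
        ' Inline sample taken from the markup in the question, so no network
        ' access is needed to try the XPath.
        Dim sample As String =
            "<ul class=""user-personas""><li class=""user-persona clearfix"">" &
            "<div class=""persona-name""><a href=""/en/heroes/276720132"">[gg]ROAST~BEEF</a></div>" &
            "</li></ul>"

        For Each href As String In ExtractPersonaLinks(sample)
            Console.WriteLine(href)
        Next
    End Sub
End Module
```

If the helper returns an empty list for the live page, the guard only hides the exception. The remaining question would be why the HTML that web.Load receives differs from what the browser shows, which I have not been able to verify.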
Thanks for any suggestions!