Web Scraping With HTML Agility Pack

  • Question

  • User1196439756 posted

    Sorry guys for being such a noob, but I have never done web scraping before.

    My question is: how can I scrape all href attributes that have a parent div container with a class attribute of 'persona-name'?

    Here is the site I'm trying to scrape from:

     http://www.battlefieldheroes.com/en/player/2305114994

    All I want to gather are the links colored in green:

    <ul class="user-personas">
        <li class="user-persona clearfix">
            <img src="http://cdn.battlefieldheroes.com/static/20101215090912/bulk-images/hero-headshot-icons-32/1-6-2-0-130.png" alt="avatar" class="persona-avatar" />
            <div class="persona-name"><a href="/en/heroes/276720132">[gg]ROAST~BEEF</a></div>
            <div class="persona-faction faction-National" title="National">&nbsp;</div>
            <div class="persona-class class-gunner" title="Gunner">&nbsp;</div>
            <div class="persona-level level-16" title="Level 16">&nbsp;</div>
        </li>
        <li class="user-persona clearfix">
            <img src="http://cdn.battlefieldheroes.com/static/20101215090912/bulk-images/hero-headshot-icons-32/2-4-2-85-107.png" alt="avatar" class="persona-avatar" />
            <div class="persona-name"><a href="/en/heroes/235328126">[gg]SLOPPY~JOE</a></div>
            <div class="persona-faction faction-Royal" title="Royal">&nbsp;</div>
            <div class="persona-class class-commando" title="Commando">&nbsp;</div>
            <div class="persona-level level-12" title="Level 12">&nbsp;</div>
        </li>
        <li class="user-persona clearfix">
            <img src="http://cdn.battlefieldheroes.com/static/20101215090912/bulk-images/hero-headshot-icons-32/2-4-4-0-107.png" alt="avatar" class="persona-avatar" />
            <div class="persona-name"><a href="/en/heroes/233563772">[gg]HOOF~ARTED</a></div>
            <div class="persona-faction faction-Royal" title="Royal">&nbsp;</div>
            <div class="persona-class class-soldier" title="Soldier">&nbsp;</div>
            <div class="persona-level level-30" title="Level 30">&nbsp;</div>
        </li>
        <li class="user-persona clearfix">
            <img src="http://cdn.battlefieldheroes.com/static/20101215090912/bulk-images/hero-headshot-icons-32/2-6-2-85-107.png" alt="avatar" class="persona-avatar" />
            <div class="persona-name"><a href="/en/heroes/220683351">[gg]PORK~CHOP</a></div>
            <div class="persona-faction faction-Royal" title="Royal">&nbsp;</div>
            <div class="persona-class class-gunner" title="Gunner">&nbsp;</div>
            <div class="persona-level level-27" title="Level 27">&nbsp;</div>
        </li>
    </ul>
    I have tried the following code and cannot figure out why it is not working:
            Dim content As String = ""
            Dim web As New HtmlAgilityPack.HtmlWeb
            Dim doc As HtmlAgilityPack.HtmlDocument = web.Load("http://www.battlefieldheroes.com/en/player/2305114994")
            Dim hnc As HtmlAgilityPack.HtmlNodeCollection = doc.DocumentNode.SelectNodes("//div[@class='persona-name']//a")
            For Each link As HtmlAgilityPack.HtmlNode In hnc
                Dim replaceUnwanted As String = ""
    
                replaceUnwanted = link.GetAttributeValue("href", String.Empty)
                replaceUnwanted = replaceUnwanted.Replace("&#39;", "'")
    
                content &= replaceUnwanted & vbNewLine
            Next
    
            HTMLText.Text = content
    
    I get the following error:

    Object reference not set to an instance of an object.

    ...

    Line 8:          Dim doc As HtmlAgilityPack.HtmlDocument = web.Load("http://www.battlefieldheroes.com/en/player/2305114994")
    Line 9:          Dim hnc As HtmlAgilityPack.HtmlNodeCollection = doc.DocumentNode.SelectNodes("//div[@class='persona-name']")
    Line 10:         For Each link As HtmlAgilityPack.HtmlNode In hnc
    Line 11:             Dim replaceUnwanted As String = ""
    Line 12: 

     

    Thanks for any suggestions!

    Friday, December 17, 2010 5:05 PM

Answers

  • User1196439756 posted

    Never mind. The 'persona-name' div is not rendered until an account is logged in, so the page fetched without a login does not contain it. I now understand why the node collection was NULL.
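
    For anyone hitting the same error: in Html Agility Pack, SelectNodes returns Nothing (rather than an empty collection) when the XPath matches no nodes, so the For Each throws. Below is a minimal sketch of a guarded version. The module name, the ExtractPersonaLinks helper, and the two sample HTML strings are made up for illustration; it parses a string with LoadHtml instead of fetching the live page, and it assumes the HtmlAgilityPack package is referenced.

    ```vbnet
    Imports System
    Imports System.Collections.Generic

    Module PersonaLinkDemo
        ' Hypothetical helper: returns the href of every <a> inside
        ' <div class="persona-name">, or an empty list when nothing matches.
        Function ExtractPersonaLinks(html As String) As List(Of String)
            Dim links As New List(Of String)
            Dim doc As New HtmlAgilityPack.HtmlDocument()
            doc.LoadHtml(html)

            Dim nodes As HtmlAgilityPack.HtmlNodeCollection =
                doc.DocumentNode.SelectNodes("//div[@class='persona-name']//a")

            ' Guard: SelectNodes yields Nothing when the XPath matches no node.
            If nodes Is Nothing Then Return links

            For Each link As HtmlAgilityPack.HtmlNode In nodes
                Dim href As String = link.GetAttributeValue("href", String.Empty)
                ' Decode HTML entities such as &#39; in the attribute value.
                links.Add(HtmlAgilityPack.HtmlEntity.DeEntitize(href))
            Next
            Return links
        End Function

        Sub Main()
            ' Sample markup standing in for the logged-in and logged-out pages.
            Dim loggedIn As String =
                "<ul class=""user-personas""><li><div class=""persona-name"">" &
                "<a href=""/en/heroes/276720132"">[gg]ROAST~BEEF</a></div></li></ul>"
            Dim loggedOut As String = "<ul class=""user-personas""></ul>"

            Console.WriteLine(ExtractPersonaLinks(loggedIn).Count)   ' prints 1
            Console.WriteLine(ExtractPersonaLinks(loggedOut).Count)  ' prints 0
        End Sub
    End Module
    ```

    With the guard in place, a page that lacks the div simply produces no links instead of an exception. Scraping the real list would still require fetching the page as a logged-in user.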

    • Marked as answer by Anonymous Thursday, October 7, 2021 12:00 AM
    Friday, December 17, 2010 5:30 PM