Recursive web crawling using VBA

  • Question

  • Is it possible to crawl a web page recursively? With two or three requests I can produce lots of links, but that is not what I want. I was thinking of doing it myself, but I don't know how to feed each newly found link back into a function so that it keeps running until every link on a page reaches a dead end. Below is what I have written so far to extract the links from a single page. I hope somebody can give me an idea of how to follow those links recursively. Thanks in advance.

    Sub ConditionalLink()
    Const url = ""
    Const page = ""
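    Dim html As Object, Links As Object, Link As Object
    Dim x As Long   'row counter for the output sheet

    With CreateObject("MSXML2.serverXMLHTTP")
        .Open "GET", url, False
        .setRequestHeader "Content-Type", "text/xml"
        .send   'the request must be sent before responseText is available
        Set html = CreateObject("htmlfile")
        html.body.innerHTML = .responseText
    End With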
    Set Links = html.getElementsByTagName("a")
        For Each Link In Links
            If InStr(Link.href, "about:/wiki/") > 0 Then
                x = x + 1
                Cells(x, 1) = page & Replace(Link.href, "about:", "")
            End If
        Next Link
    Set Links = Nothing
    End Sub

    • Edited by ShahinIqbal Monday, April 3, 2017 9:48 PM correction
    Monday, March 27, 2017 5:52 AM

All replies

  • Here is something that uses SeleniumBasic. 

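    Sub GetURLs()
      'Requires a reference to the Selenium Type Library (SeleniumBasic)
      Dim drv As IEDriver
      Dim ele As WebElement
      Dim urls As String
      Dim eles As WebElements
      Set drv = New IEDriver
      drv.Get ""
      Set eles = drv.FindElementsByXPath("//a")
      'Print the href of every anchor on the page
      For Each ele In eles
        urls = ele.Attribute("href")
        Debug.Print urls
      Next ele
      drv.Quit
    End Sub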
    I reread your post.  You want to search recursively for all links.  I did that a long time ago looking for broken links.  Let me see if I can find it.

    • Edited by mogulman52 Tuesday, March 28, 2017 11:01 AM
    Tuesday, March 28, 2017 1:43 AM
  • Try it like this.

    Sub scrapeHyperlinksWebsite()
    'Requires references to Microsoft Internet Controls and Microsoft HTML Object Library
    'We refer to an active copy of Internet Explorer
    Dim ie As InternetExplorer
    'code to refer to the HTML document returned
    Dim html As HTMLDocument
    Dim ElementCol As Object
    Dim Link As Object
    Dim erow As Long
    Application.ScreenUpdating = False
    'open Internet Explorer and go to website
    Set ie = New InternetExplorer
    ie.Visible = False
    ie.navigate ""
    'Wait until IE is done loading page
    Do While ie.readyState <> READYSTATE_COMPLETE
        Application.StatusBar = "Trying to go to website . . ."
        DoEvents
    Loop
    Set html = ie.document
    'Display text of HTML document returned in a cell
    'Range("A1") = html.DocumentElement.innerHTML
    Set ElementCol = html.getElementsByTagName("a")
    'Write each link into the next free row of column A
    For Each Link In ElementCol
        erow = Worksheets("Sheet1").Cells(Rows.Count, 1).End(xlUp).Offset(1, 0).Row
        Cells(erow, 1).Value = Link.href
        Cells(erow, 1).Columns.AutoFit
    Next Link
    'close down IE, reset status bar & turn on screenupdating
    ie.Quit
    Set ie = Nothing
    Application.StatusBar = ""
    Application.ScreenUpdating = True
    End Sub

    That script writes the href of every link on the page into column A of Sheet1.


    Monday, April 3, 2017 1:48 PM
  • I think he wants to get all the links in a website, not just the links on one page.  Crawling Wikipedia that way would result in billions of links.
    Monday, April 3, 2017 7:07 PM
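
One way the recursion asked about in the question could be sketched, building on the same MSXML2 and htmlfile approach. This is only an illustration under stated assumptions: the names CrawlDemo, Crawl, MAX_DEPTH and MAX_PAGES and the start URL https://example.com are hypothetical and do not come from any post above. The sketch keeps a Scripting.Dictionary of visited URLs so no page is fetched twice, and it caps both depth and page count, since (as noted in the last reply) an unbounded crawl of a large site would never finish.

```vba
Option Explicit

' Hypothetical names and limits, chosen for illustration only
Private Const MAX_DEPTH As Long = 2     ' how many link levels to follow
Private Const MAX_PAGES As Long = 200   ' hard stop so the crawl always ends

Private visited As Object               ' Scripting.Dictionary used as a set

Sub CrawlDemo()
    Set visited = CreateObject("Scripting.Dictionary")
    ' Assumed start page and site root (no trailing slash on the root)
    Crawl "https://example.com/", "https://example.com", 0
    Debug.Print visited.Count & " pages visited"
End Sub

Private Sub Crawl(ByVal url As String, ByVal site As String, ByVal depth As Long)
    Dim html As Object, link As Object
    Dim href As String, body As String
    Dim status As Long

    ' Stop at the depth limit, the page limit, or a page already seen
    If depth > MAX_DEPTH Or visited.Count >= MAX_PAGES Then Exit Sub
    If visited.Exists(url) Then Exit Sub
    visited.Add url, depth
    Cells(visited.Count, 1).Value = url   ' log the page on the active sheet

    ' Fetch the page; a failed request simply leaves body empty
    On Error Resume Next
    With CreateObject("MSXML2.ServerXMLHTTP")
        .Open "GET", url, False
        .send
        status = .Status
        If status = 200 Then body = .responseText
    End With
    On Error GoTo 0
    If Len(body) = 0 Then Exit Sub

    Set html = CreateObject("htmlfile")
    html.body.innerHTML = body

    For Each link In html.getElementsByTagName("a")
        href = link.href
        ' htmlfile has no base URL, so root-relative links come back as
        ' "about:/path" (as in the question); other relative forms are skipped
        If Left$(href, 7) = "about:/" And Left$(href, 8) <> "about://" Then
            href = site & Mid$(href, 7)
        End If
        ' Drop any #fragment so the same page is not visited twice
        If InStr(href, "#") > 0 Then href = Left$(href, InStr(href, "#") - 1)
        ' Recurse only into links that stay on the starting site
        If Left$(href, Len(site) + 1) = site & "/" Then
            Crawl href, site, depth + 1
        End If
    Next link
End Sub
```

Running CrawlDemo lists each visited URL in column A of the active sheet. Raising MAX_DEPTH grows the number of requests very quickly, so it is probably wise to start small; for deep crawls an explicit queue instead of recursion would avoid a deep call stack.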