none
Web-scraping macro (not working with IE9 - but works fine with IE8) RRS feed

  • Question

  • Hi,

    I have a web-scraping excel macro that pulls data from website to excel. Although this works perfectly fine with IE8, this does't work with IE9 environment. A quick look revealed that the issue is with the execution of pattern within HTML body. I am providing a small example snippet below to help you understand my problem. What it looks like is IE8 condenses the patterns into a single string while, IE9 is unable to do so. Any alternative way to make it working in IE9 please? Thanks for your time.

    PS:You may simply copy the below code into Excel VBE and have a look.

    Sub testSample()
    Dim IEApp As InternetExplorer
    Dim ColList As MSHTML.IHTMLElementCollection
    
    Set IEApp = CreateObject("InternetExplorer.Application")
    IEApp.Visible = True 'I have put in here just to see the IE navigation
    IEApp.Navigate2 ("http://www.msha.gov/drs/drshome.htm")
    
    Set RegEx = CreateObject("VBScript.RegExp")
    RegEx.IgnoreCase = True
    RegEx.Global = True
    RegEx.MultiLine = True
    
    Set ColList = IEApp.Document.getElementsByTagName("input")
    
    For Each element In ColList
        If element.Name = "MineId" Then
            element.Value = "0503672"
            Exit For
        End If
    Next
    
    For Each element In ColList
        If element.Value = "Search" Then
            element.Click
            Exit For
        End If
    Next
    
    getHTML = IEApp.Document.body.innerHTML
    
    RegEx.Pattern = "<b>Current Controller:</b></font></td>\r\n<td width=""40%"">(.*?)</b>"
    Set ColMtch = RegEx.Execute(getHTML) 'This line doesn't seem to work in IE9 (as opposed to IE8)
    For Each Mtch In ColMtch
       curCtrlr = stripHTML(Mtch.subMatches(0))
    Exit For
    Next
    
    RegEx.Pattern = "<td width=""20%"">(.*?)</td>\r\n<td width=""40%"">(.*?)</td>"
    Set ColMtch = RegEx.Execute(getHTML) 'Again this line doesn't seem to work in IE9 (as opposed to IE8)
    For Each Mtch In ColMtch
    If InStr(1, Mtch.subMatches(0), "Mine Status:") <> 0 Then
     mineStat = stripHTML(Mtch.subMatches(1))
     Exit For
    End If
    Next
    
    End Sub
    
    Function stripHTML(strHTML)
      Set objRegExp = CreateObject("VBScript.RegExp")
      objRegExp.IgnoreCase = True
      objRegExp.Global = True
      objRegExp.Pattern = "<(.|\n)+?>"
      stripHTML = objRegExp.Replace(strHTML, "")
    End Function
    

    Wednesday, March 20, 2013 4:34 PM

All replies

  • Not sure what is happening here, but as I examined the source code of the data you extract data, I noticed that td node attributes don't have double quotes. So, instead of:

    "<b>Current Controller:</b></font></td>\r\n<td width=""40""%>(.*?)</b>"

    It should be:

    "<b>Current Controller:</b></font></td>\r\n<td width=40%>(.*?)</b>"

    Also, I suggest you declaring all variables you're using in your procedures and add a code line like the one below after navigating to the desired URL innitially and also after clicking in the web command button:

    Do While IEApp.Busy: Loop

    This will prevent VBA trying to get an element of the body of the document that hasn't been loaded yet.



    Felipe Costa Gualberto - http://www.ambienteoffice.com.br

    Wednesday, March 20, 2013 9:27 PM
  • Hi, I think that I have solved the problem. Yes, the one that I had used works just fine. I have to tweak the pattern by breaking this into two parts - one containing the searchable element and another containing the target element and finally trimming the HTML tags. Anyway, thanks for your response.

    Wednesday, March 20, 2013 10:11 PM