How to load an HTML page without loading pictures

    Question

  • Hi

     

    I need to crawl a web site, starting from a root page and following all of its local links. This task might involve hundreds of pages. I don't need to download pictures; I just need the HTML code of each page itself.

     

    So! How do I EASILY load an HTML page into an HTML document without any graphics?

     

    I tried using the WebBrowser control, but it downloads the graphics.

     

    I have tried to intercept the browser's 'FileDownload' event, but I cannot identify the file name in that event, nor can I stop or cancel the download of that particular file.

     

    I cannot find a way to retrieve the HTML document without downloading any additional elements.

     

    I don't want to use the disk either. There must be a simple solution.

     

    Something like: myTextBox.Text = fromURL("http://www.mysite.com/index.html")

     

    Thanx

    Jerry Cic

    Monday, April 30, 2007 2:10 AM

Answers

  • Hi.

     

    The code below should do it, although if you're going through a proxy and/or need to authenticate, you'll need to adapt it. This code downloads the page to a string and then runs a regular expression to remove all <img> tags from the string.

    Code Snippet

    Imports System.IO
    Imports System.Net
    Imports System.Text.RegularExpressions

    Module Module1
        Sub Main()
            ' Request only the HTML document. Nothing parses or renders the page,
            ' so the images it references are never fetched.
            Dim request As WebRequest = WebRequest.Create("http://www.mysite.com/index.html")
            request.Method = WebRequestMethods.Http.Get

            ' Read the whole response body into a string and release the connection.
            Dim content As String
            Using response As WebResponse = request.GetResponse()
                Using reader As New StreamReader(response.GetResponseStream())
                    content = reader.ReadToEnd()
                End Using
            End Using

            ' Strip <img> tags, including self-closing ones, in any letter case.
            Console.WriteLine(Regex.Replace(content, "<img\b[^>]*>", "", RegexOptions.IgnoreCase))
            Console.ReadLine()
        End Sub
    End Module
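
    For the proxy and authentication case mentioned above, the request can be adapted roughly as follows. This is only a sketch: the FetchHtml helper, its parameter names, and the FetchWithProxy module are hypothetical and not part of the original answer, and the caller supplies the proxy address and credentials.

    ```vbnet
    Imports System.IO
    Imports System.Net

    Module FetchWithProxy
        ' Hypothetical helper: downloads the HTML at the given URL as a string.
        ' proxyAddress and credentials are optional; pass Nothing to skip either.
        Function FetchHtml(ByVal url As String, _
                           ByVal proxyAddress As String, _
                           ByVal credentials As NetworkCredential) As String
            Dim request As WebRequest = WebRequest.Create(url)

            ' Route the request through the proxy only when one is given.
            If Not String.IsNullOrEmpty(proxyAddress) Then
                request.Proxy = New WebProxy(proxyAddress)
            End If

            ' Attach credentials only when the site requires authentication.
            If credentials IsNot Nothing Then
                request.Credentials = credentials
            End If

            Using response As WebResponse = request.GetResponse()
                Using reader As New StreamReader(response.GetResponseStream())
                    Return reader.ReadToEnd()
                End Using
            End Using
        End Function
    End Module
    ```

    A call such as FetchHtml("http://www.mysite.com/index.html", Nothing, Nothing) would behave like the snippet above, without the <img> stripping.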

     

    Monday, April 30, 2007 2:37 PM

All replies

  • Thank you!

     

    This will work for me, but it will take extra work.

     

    What I really want is to use the WebBrowser control to load a page but have it skip any images. There is no problem with your approach, but the WebBrowser control fixes all href links and converts them from relative to absolute. The browser also tells me the type (html, jpg, txt, etc.), and it parses all the links into a collection. These functions save me the hassle of having to write them myself. If I turn off the "Show Pictures" checkbox on the Advanced tab of Internet Options, this does what I want. Since the user may or may not have permission to change this setting globally, I just want to turn it off in my app.
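
    For comparison, the link handling described above can be approximated on a downloaded HTML string without the browser control. The sketch below is illustrative only: the LinkExtractor module and ExtractLinks function are hypothetical names, and the pattern is deliberately simplified (it only sees quoted href attributes on <a> tags). The page type is available separately from the ContentType property of the WebResponse.

    ```vbnet
    Imports System
    Imports System.Collections.Generic
    Imports System.Text.RegularExpressions

    Module LinkExtractor
        ' Hypothetical helper: returns every href found in the HTML as an
        ' absolute URL, resolved against the address of the page it came from.
        Function ExtractLinks(ByVal html As String, ByVal pageUrl As String) As List(Of String)
            Dim links As New List(Of String)
            Dim baseUri As New Uri(pageUrl)

            ' Simplified pattern: quoted href attributes on <a> tags only.
            Dim pattern As String = "<a\b[^>]*?\bhref\s*=\s*[""']([^""']*)[""']"

            For Each m As Match In Regex.Matches(html, pattern, RegexOptions.IgnoreCase)
                Dim absolute As Uri = Nothing
                ' Uri.TryCreate resolves relative hrefs against the base address
                ' and leaves hrefs that are already absolute unchanged.
                If Uri.TryCreate(baseUri, m.Groups(1).Value, absolute) Then
                    links.Add(absolute.AbsoluteUri)
                End If
            Next

            Return links
        End Function
    End Module
    ```

    To follow only local links, each result could be compared against the host of the root page before it is queued.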

     

    Also, I still think that I could intercept the WebBrowser control's FileDownload event and cancel any file downloads. But the event doesn't seem to tell me which file is about to be downloaded, and it doesn't seem to give me the option to cancel it.

     

    Anyway, I will keep searching!!   

     

    Thanx

    Wednesday, May 02, 2007 2:13 AM
  • Hi,

     

    I had a look, and the setting used by Internet Explorer is stored in the registry under the following key.

     

    HKCU\SOFTWARE\Microsoft\Internet Explorer\Main\Display Inline Images

     

    It has a value of 'no' or 'yes'

     

    I haven't checked, but if the WebBrowser control uses the options set in IE and your users have permission to edit the registry, then your application could set this value. You could set it just before requesting the page and restore it as soon as the page has been downloaded.
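
    A minimal sketch of that registry approach, assuming the WebBrowser control honours the setting (which, as noted, is unverified). The ImageSetting module and SetDisplayImages function are hypothetical names; the key path and value name are the ones given above.

    ```vbnet
    Imports Microsoft.Win32

    Module ImageSetting
        Private Const MainKeyPath As String = "Software\Microsoft\Internet Explorer\Main"
        Private Const ValueName As String = "Display Inline Images"

        ' Hypothetical helper: sets the value to "yes" or "no" and returns the
        ' previous value so the caller can restore it afterwards.
        ' Returns Nothing if the key or the value was not present.
        Function SetDisplayImages(ByVal newValue As String) As String
            ' Open the key under HKEY_CURRENT_USER with write access.
            Using key As RegistryKey = Registry.CurrentUser.OpenSubKey(MainKeyPath, True)
                If key Is Nothing Then Return Nothing
                Dim previous As String = TryCast(key.GetValue(ValueName), String)
                key.SetValue(ValueName, newValue, RegistryValueKind.String)
                Return previous
            End Using
        End Function
    End Module
    ```

    The application would call SetDisplayImages("no") before navigating and pass the returned value back in once the page has loaded. Until it is restored, the change also affects the user's own Internet Explorer sessions.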

     

    Another option, and maybe the preferred one, would be to wait until the web page has been downloaded into the WebBrowser control and then remove all the images through the Document.Images collection, like this (I'm not sure whether OuterHtml or InnerHtml is better):

     

    Code Snippet

    Dim WithEvents browser As New System.Windows.Forms.WebBrowser

    Sub Main()
        ' Call DocumentDownloaded whenever the browser finishes loading a page.
        AddHandler browser.DocumentCompleted, AddressOf DocumentDownloaded
    End Sub

    ' Runs after the page has loaded, so this removes the <img> elements from
    ' the document; it does not stop the browser from requesting the images.
    Public Sub DocumentDownloaded(ByVal sender As Object, ByVal e As System.Windows.Forms.WebBrowserDocumentCompletedEventArgs)
        For Each element As System.Windows.Forms.HtmlElement In browser.Document.Images
            element.OuterHtml = ""
        Next
    End Sub

     

     

     

    Thursday, May 03, 2007 8:30 AM