Monday, April 30, 2007 2:10 AM
I need to navigate a web site, starting from some root page and following all its local links. This task might include hundreds of pages. I don't need to download pictures; I just need the HTML code of each page itself.
So! How do I EASILY load an HTML page into an HTML document without any graphics?
I tried using the WebBrowser control, but it downloads the graphics.
I have tried to intercept the browser's 'FileDownload' event, but I cannot identify the file name in that event, nor can I stop or cancel that particular file download.
I cannot find out how to retrieve the HTML document without downloading any additional elements.
I don't want to use the disk either. There must be a simple solution.
Something like: myTextBox.Text = fromURL("http://www.mysite.com/index.html")
Monday, April 30, 2007 2:37 PM
The code below should do it, although if you're going through a proxy and/or need to authenticate then you'll need to adapt it. This code downloads the page to a string and then runs a regular expression to remove all <img> tags from the string.
Imports System.IO
Imports System.Net
Imports System.Net.Mime

Module Module1
    Sub Main()
        Dim request As WebRequest = WebRequest.Create("http://www.mysite.com/index.html")
        request.ContentType = MediaTypeNames.Text.Html
        request.Method = WebRequestMethods.Http.Get
        ' Download the page body into a string
        Dim response As WebResponse = request.GetResponse()
        Dim reader As New StreamReader(response.GetResponseStream())
        Dim content As String = reader.ReadToEnd()
        reader.Close()
        response.Close()
    End Sub
End Module
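The regular-expression step mentioned above might look something like the sketch below. The helper name RemoveImageTags and the pattern are assumptions for illustration, not the original poster's code, and a pattern this simple will trip over a ">" character inside an attribute value.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module StripImagesSketch
    ' Hypothetical helper: removes every <img ...> tag from an HTML string.
    Function RemoveImageTags(ByVal html As String) As String
        ' Match "<img", then any characters up to the next ">", ignoring case.
        Return Regex.Replace(html, "<img\b[^>]*>", String.Empty, RegexOptions.IgnoreCase)
    End Function

    Sub Main()
        Dim sample As String = "<p>Hello <IMG src=""a.gif"" alt=""x""> world</p>"
        Console.WriteLine(RemoveImageTags(sample))
    End Sub
End Module
```

Note that WebRequest only fetches the single URL it is given, so no images are downloaded either way; stripping the tags just tidies the string.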
Wednesday, May 02, 2007 2:13 AM
This will work for me, but it will take extra work.
What I really want is to use the WebBrowser control to load a page but have it skip any images. There is no problem with your approach, but the WebBrowser control fixes up all href links, converting them from relative to absolute. The browser also tells me the content type (html, jpg, txt, etc.) and parses all the links into a collection. These functions save me the hassle of having to write them myself. If I turn off the "Show Pictures" checkbox on the Advanced tab of Internet Options, this does what I want. Since the user may or may not have permission to change this setting globally, I just want to turn it off in my app.
Also, I still think I can intercept the WebBrowser control's FileDownload event and cancel any file downloads. But the event seems stupid: it doesn't seem to tell me which file is about to be downloaded, and it doesn't seem to give me the option to cancel it.
Anyway, I will keep searching!!
Thursday, May 03, 2007 8:30 AM
I had a look, and the setting used by Internet Explorer is stored in the registry as the following value:
HKCU\SOFTWARE\Microsoft\Internet Explorer\Main\Display Inline Images
It is set to either 'no' or 'yes'.
I haven't checked, but if the WebBrowser control uses the options set in I.E. and your users have permission to edit the registry, then your application could set this value. You could set it just before you load the page and reset it as soon as the page has downloaded.
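A rough sketch of that set-and-restore idea, using the Microsoft.Win32 registry classes. The module and function names are made up for illustration. It is untested whether the WebBrowser control picks up the change while the application is running, and the change also affects Internet Explorer itself for the current user until it is restored.

```vbnet
Imports Microsoft.Win32

Module InlineImagesSketch
    Private Const MainKeyPath As String = "Software\Microsoft\Internet Explorer\Main"
    Private Const ValueName As String = "Display Inline Images"

    ' Hypothetical helper: writes "yes" or "no" and returns the previous
    ' value (Nothing if the key or value was missing) so it can be restored.
    Function SetInlineImages(ByVal newValue As String) As String
        Using key As RegistryKey = Registry.CurrentUser.OpenSubKey(MainKeyPath, True)
            If key Is Nothing Then Return Nothing
            Dim previous As String = TryCast(key.GetValue(ValueName), String)
            key.SetValue(ValueName, newValue, RegistryValueKind.String)
            Return previous
        End Using
    End Function

    Sub Main()
        Dim previous As String = SetInlineImages("no")
        Try
            ' ... navigate the WebBrowser control and wait for DocumentCompleted here ...
        Finally
            ' Put the user's original setting back, even if loading failed.
            If previous IsNot Nothing Then SetInlineImages(previous)
        End Try
    End Sub
End Module
```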
Another option, and maybe the preferred one, would be to wait until the web page has been downloaded into the web browser and then remove all the images through the Document.Images collection. Like this (not sure whether OuterHtml or InnerHtml is better):
Dim WithEvents browser As New System.Windows.Forms.WebBrowser

' The method header was lost from the post; "AttachHandler" is a placeholder name.
Private Sub AttachHandler()
    AddHandler browser.DocumentCompleted, _
        New System.Windows.Forms.WebBrowserDocumentCompletedEventHandler(AddressOf DocumentDownloaded)
End Sub

Public Sub DocumentDownloaded(ByVal sender As Object, ByVal e As System.Windows.Forms.WebBrowserDocumentCompletedEventArgs)
    ' Blank out every <img> element once the page has finished loading
    For Each element As System.Windows.Forms.HtmlElement In browser.Document.Images
        element.OuterHtml = ""
    Next
End Sub