How can I download a URL regardless of whether it starts with http or https (assuming I don't know which it is)?

  • Question

  • Suppose a user types a URL into my program, such as:

    www.townhall.com/trump_is_super.htm

    Given that URL, I don't know if the site uses SSL and therefore starts with https, or whether it starts with http.

    Is there a way to download the webpage without knowing this? (Actually, even if I knew the domain used SSL, I'm not sure I could always download the page, so a code sample would be much appreciated.)

    Tuesday, November 6, 2018 6:41 PM

All replies

  • Hello,

    To start with, please post the code you have tried that is not working for you.


    Please remember to mark replies as answers if they help and unmark them if they do not; this helps others who are looking for solutions to the same or a similar problem. You can contact me via Twitter (Karen Payne) or Facebook (Karen Payne) through my MSDN profile, but I will not answer coding questions on either.
    VB Forums - moderator

    Tuesday, November 6, 2018 8:29 PM
    Moderator
  •  Public Shared Function tryDownloadThreeWays(ByRef strURLargument As String, ByRef errorStructCheck As classAlert,
                                                    ByRef hadError As Boolean,
                                                    ByRef hadWarning As Boolean, ByRef responsefromserver As String,
                                                     ByRef alreadysaved As Boolean, ByRef httpschange As enumhttpschange,
                                                    ByVal shouldLimitPageSize As Boolean) As Boolean
            Dim errMsg As String = String.Empty
            Dim strURLssl As String
            Dim arglower As String = strURLargument.ToLower
            Dim alreadySSL As Boolean = False
            Dim strURLplain As String
            Dim hadError2 As Boolean = False
            Dim hadWarning2 As Boolean = False
            Dim alreadysaved2 As Boolean = False
            Dim responsefromserver2 As String = String.Empty
            Dim errMsg2 As String = String.Empty
            Dim errorstructCheck2 As New classAlert
            Dim hadError3 As Boolean = False
            Dim hadWarning3 As Boolean = False
            Dim alreadysaved3 As Boolean = False
            Dim responsefromserver3 As String = String.Empty
            Dim errMsg3 As String = String.Empty
            Dim errorstructCheck3 As New classAlert
            Dim ftpWorkedButStillMightNeedSSL As Boolean = False
    
            httpschange = enumhttpschange.none
            responsefromserver = String.Empty
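            If arglower.StartsWith("https://") Then
                alreadySSL = True
                strURLssl = strURLargument
                strURLplain = "http" & strURLargument.Substring(5)
            ElseIf arglower.StartsWith("http://") Then
                alreadySSL = False
                strURLssl = "https" & strURLargument.Substring(4)
                strURLplain = strURLargument
            Else
                ' No scheme was given (e.g. "www.example.com/page.htm"):
                ' build both candidates so every downloader receives an absolute URL.
                strURLssl = "https://" & strURLargument
                strURLplain = "http://" & strURLargument
                strURLargument = strURLplain
            End If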
    
    
            If Not alreadySSL Then
                If downloadWithBlock(strURLplain, responsefromserver, errorStructCheck, hadError, hadWarning, shouldLimitPageSize) Then
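                    ' Heuristic only: this matches "301" anywhere in the page text, not the HTTP status code
                    ' (HttpWebRequest follows redirects automatically by default).
                    If responsefromserver.Contains("301") Then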
                        ftpWorkedButStillMightNeedSSL = True
                    Else
                        httpschange = enumhttpschange.none
                        Return True
                    End If
                End If
                If httpschange <> enumhttpschange.addhttps Then
                    clearErrorIndicators(hadError, hadWarning, errorStructCheck, alreadysaved)
                    If downloadURLwithFTPtoText(strURLplain, responsefromserver, errorStructCheck, hadError, hadWarning, alreadysaved) Then
                        If responsefromserver.Contains("301") Then
                            ftpWorkedButStillMightNeedSSL = True
                        Else
                            httpschange = enumhttpschange.none
                            Return True
                        End If
                    End If
                End If
    
    
                If downloadURLContentsSSL(strURLssl, errMsg2, errorstructCheck2, hadError2, hadWarning2, responsefromserver2) Then
                    If ftpWorkedButStillMightNeedSSL Then
                        If responsefromserver2.Length > responsefromserver.Length Then
                            ' this is not foolproof test
                            errMsg = errMsg2
                            errorStructCheck = errorstructCheck2
                            hadError = hadError2
                            hadWarning = hadWarning2
                            responsefromserver = responsefromserver2
                            httpschange = enumhttpschange.addhttps
                            strURLargument = strURLssl
                            Return True
                        Else
                            httpschange = enumhttpschange.none
                            Return True
                        End If
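                    Else
                        ' Only the SSL download succeeded: hand its results back to the caller.
                        errMsg = errMsg2
                        errorStructCheck = errorstructCheck2
                        hadError = hadError2
                        hadWarning = hadWarning2
                        responsefromserver = responsefromserver2
                        httpschange = enumhttpschange.addhttps
                        strURLargument = strURLssl
                        Return True
                    End If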
    
                Else
                    'ssl failed
                    If ftpWorkedButStillMightNeedSSL Then
                        httpschange = enumhttpschange.none
                        Return True
                    Else
                        httpschange = enumhttpschange.none
                        Return False
                    End If
                End If
    
            Else
                ' alreadySSL
                If downloadURLContentsSSL(strURLssl, errMsg, errorStructCheck, hadError, hadWarning, responsefromserver) Then
                    httpschange = enumhttpschange.none
                    Return True
                End If
                clearErrorIndicators(hadError, hadWarning, errorStructCheck, alreadysaved)
                If downloadURLwithFTPtoText(strURLplain, responsefromserver, errorStructCheck, hadError, hadWarning, alreadysaved) Then
                    httpschange = enumhttpschange.removehttps
                    strURLargument = strURLplain
                    Return True
                End If
    
                clearErrorIndicators(hadError, hadWarning, errorStructCheck, alreadysaved)
                If downloadWithBlock(strURLplain, responsefromserver, errorStructCheck, hadError, hadWarning, shouldLimitPageSize) Then
                    httpschange = enumhttpschange.removehttps
                    strURLargument = strURLplain
                    Return True
                End If
    
            End If
    
            Return False
        End Function
    
     Public Shared Function downloadWithBlock(ByVal strURL As String, ByRef responseFromServer As String,
                                                ByRef errorStructCheck As classAlert,
                                                 ByRef hadError As Boolean, ByRef hadWarning As Boolean, ByVal shouldLimitPageSize As Boolean) As Boolean
            ' errmsg = ""
            Dim retval As Boolean
            Dim strURLlower As String = strURL.ToLower
            Try
                ' Create a request for the URL. 		
                Dim request As WebRequest = WebRequest.Create(strURL)
                ' If required by the server, set the credentials.
                request.Credentials = CredentialCache.DefaultCredentials
                ' Get the response.
                Dim response As HttpWebResponse = CType(request.GetResponse(), HttpWebResponse)
                ' Display the status.
                ' Get the stream containing content returned by the server.
                Dim dataStream As Stream = response.GetResponseStream()
                ' Open the stream using a StreamReader for easy access.
                Dim reader As New StreamReader(dataStream)
                ' Read the content.
                If shouldLimitPageSize Then
                    ' Read at most pageSizeLimit characters and keep only what was actually read.
                    Dim charbuf(ClassGlobalVariables.pageSizeLimit - 1) As Char
                    Dim charsRead As Integer = reader.ReadBlock(charbuf, 0, charbuf.Length)
                    responseFromServer = New String(charbuf, 0, charsRead)
    
                Else
                    responseFromServer = reader.ReadToEnd()
                End If
    
                ' Display the content.
                reader.Close()
                dataStream.Close()
                response.Close()
                If strURLlower.Contains("americanthinker.com") Then
                    callsFromFinal.getRidofPrintRegion(responseFromServer)
                End If
                retval = True
            Catch ex As Exception
                retval = False
                'errmsg = "Unable to download [" & strURL & "] because: " & ex.Message
                'If ex.HResult = System.Net.WebExceptionStatus.SecureChannelFailure Then
                '    ' nothing for now
                'End If
                errorStructCheck.fillErrorStruct(classAlert.enumError.downloadFailed,
                           strURL & ": " & ex.Message,
                                               classAlert.enumWarning.fatalindividual,
                                               ClassGlobalVariables.pCurrentlyWorkingOnURL,
                                               hadError, hadWarning)
            End Try
            Return retval
        End Function 
    
     Public Shared Function downloadURLwithFTPtoText(ByVal URL As String, ByRef responsefromserver As String,
                                                        ByRef errorStructCheck As classAlert,
                                           ByRef hadError As Boolean, ByRef hadWarning As Boolean,
                                                        ByRef alreadySaved As Boolean) As Boolean
            Dim tempname As String = "tempftpdown.txt"
            Dim returnstring As String = ""
            Try
                If File.Exists(tempname) Then
                    My.Computer.FileSystem.DeleteFile(tempname)
                End If
                My.Computer.Network.DownloadFile(URL, tempname)
                responsefromserver = My.Computer.FileSystem.ReadAllText(tempname)
            Catch ex As Exception
                errorStructCheck.fillErrorStruct(classAlert.enumError.downloadFailed,
                           URL & ": " & ex.Message,
                                               classAlert.enumWarning.fatalindividual,
                                               ClassGlobalVariables.pCurrentlyWorkingOnURL,
                                               hadError, hadWarning)
            End Try
            If hadError Then
                Return False
            End If
            If hadWarning Then
                Return False
            End If
            Return True
        End Function
    
     Public Shared Function downloadURLContentsSSL(ByVal strURL As String, ByRef errmsg As String, ByRef errorStructCheck As classAlert, ByRef hadError As Boolean,
                                                    ByRef hadWarning As Boolean, ByRef strcontents As String) As Boolean
    
            Dim retval As Boolean = True
            If DownloadFileUsingAgent(strURL, "horses.txt", errmsg) Then
                strcontents = My.Computer.FileSystem.ReadAllText("horses.txt")
            Else
                retval = False
                strcontents = ""
                errorStructCheck.fillErrorStruct(classAlert.enumError.downloadFailed,
                        "file " & strURL & " with error: " & errmsg,
                                                classAlert.enumWarning.fatalindividual,
                                                ClassGlobalVariables.pCurrentlyWorkingOnURL, hadError, hadWarning)
            End If
            Return retval
    
        End Function
    
      Public Shared Function DownloadFileUsingAgent(ByVal strURL As String, ByVal fullname As String, ByRef errMessage As String) As Boolean
            Dim retval As Boolean = True
            Dim sslproblem As Boolean = False
    
            If strURL.ToLower.Contains("https://arxiv.org") Then
                'Dear Gid,
    
                'For crawling please use http://export.arxiv.org/ And a 2 sec crawl delay. Please do Not crawl https://arxiv.org as we have fairly strict firewall blocks. For bulk data access please see https://arxiv.org/help/bulk_data
    
                'Regards,
                'Jim
                'arXiv admin
                strURL = strURL.ToLower
                Dim pos1 As Integer = strURL.IndexOf("http")
                Dim pos2 As Integer = strURL.IndexOf(".org")
                Dim strprefix As String = ""
                If pos1 > 0 Then
                    ' Keep everything before "http" (Substring's second argument is a length).
                    strprefix = strURL.Substring(0, pos1)
                End If
                If pos2 > -1 Then
                    strURL = strprefix & "http://export.arxiv.org" & strURL.Substring(pos2 + 4)
                End If
    
            End If
    
            ' this way may be better, it might work with arxiv, for instance.
            Using wcx As WebClient = New WebClient()
                wcx.Headers.Add("User-Agent: BrowseAndDownload")
                ServicePointManager.Expect100Continue = True
                ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12
    
                Try
                    wcx.DownloadFile(strURL, fullname)
                Catch ex As Exception
                    retval = False
                    errMessage = ex.Message ' THIS MESSAGE WAS NOT the message that appeared in showdiag!!!!
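                    ' The failure reason is in WebException.Status, not in HResult.
                    Dim webEx As WebException = TryCast(ex, WebException)
                    If webEx IsNot Nothing AndAlso webEx.Status = WebExceptionStatus.SecureChannelFailure Then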
                        sslproblem = True
                    End If
                End Try
    
            End Using
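            If Not retval AndAlso sslproblem Then
    
                ' SSL 3.0 is deprecated and insecure, so fall back to the older TLS versions instead.
                ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls Or SecurityProtocolType.Tls11 Or SecurityProtocolType.Tls12
                Dim webreq As HttpWebRequest = DirectCast(WebRequest.Create(strURL), HttpWebRequest)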
                Dim cookieContainer As CookieContainer = New CookieContainer
                webreq.CookieContainer = cookieContainer
                Try
                    Using webresp As System.Net.WebResponse = webreq.GetResponse
                        Using respStream As IO.Stream = webresp.GetResponseStream
                            Using fs As New IO.FileStream(fullname, FileMode.Create, FileAccess.Write)
                                Dim buffer(2047) As Byte
                                Dim nRead As Integer
                                Do
                                    nRead = respStream.Read(buffer, 0, buffer.Length)
                                    fs.Write(buffer, 0, nRead)
                                Loop Until nRead = 0
                                ' The Using blocks flush and close the streams and the response.
                            End Using
                        End Using
                    End Using
                    retval = True ' the retry succeeded, so report success to the caller
                Catch ex As Exception
                    retval = False
                    errMessage = ex.Message
                End Try
    
            End If
            If retval Then
                If strURL.Contains("arxiv.org") Then
                    Threading.Thread.Sleep(2000) ' the arXiv admin asked for a 2 second crawl delay
                Else
                    Threading.Thread.Sleep(100)
                End If
            End If
    
            Return retval
        End Function

    Above is the code you asked for - it will download files OK, but only if you give the correct prefix, and in the case of SSL, I'm not even sure it always works.
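
    One way the missing-prefix case could be handled is to build both an https:// and an http:// version of whatever the user typed and try them in order. The following is only a minimal sketch under stated assumptions, not a drop-in replacement for the functions above: the names SchemeProbe, BuildCandidates and TryDownload are made up for illustration, HTTPS is tried first by choice, and only WebException is treated as "try the next scheme".

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Net

Module SchemeProbe

    ' Hypothetical helper: returns the absolute URLs to try, HTTPS first, then HTTP.
    ' Any scheme the user typed is stripped so both variants are always built.
    Function BuildCandidates(ByVal userInput As String) As List(Of String)
        Dim rest As String = userInput.Trim()

        If rest.StartsWith("https://", StringComparison.OrdinalIgnoreCase) Then
            rest = rest.Substring("https://".Length)
        ElseIf rest.StartsWith("http://", StringComparison.OrdinalIgnoreCase) Then
            rest = rest.Substring("http://".Length)
        End If

        Dim candidates As New List(Of String)
        candidates.Add("https://" & rest)
        candidates.Add("http://" & rest)
        Return candidates
    End Function

    ' Hypothetical helper: tries each candidate in turn. On success, usedUrl holds
    ' the URL that worked and content holds the downloaded page text.
    Function TryDownload(ByVal userInput As String, ByRef usedUrl As String,
                         ByRef content As String) As Boolean
        ' Allow TLS 1.2 in addition to the defaults (needs .NET Framework 4.5 or later).
        ServicePointManager.SecurityProtocol =
            ServicePointManager.SecurityProtocol Or SecurityProtocolType.Tls12

        For Each candidate As String In BuildCandidates(userInput)
            Try
                Using client As New WebClient()
                    client.Headers.Add("User-Agent", "BrowseAndDownload")
                    content = client.DownloadString(candidate)
                End Using
                usedUrl = candidate
                Return True
            Catch ex As WebException
                ' This scheme failed (connection, TLS or HTTP error): try the next one.
            End Try
        Next

        usedUrl = Nothing
        content = Nothing
        Return False
    End Function

    Sub Main()
        Dim usedUrl As String = Nothing
        Dim content As String = Nothing
        If TryDownload("www.townhall.com/trump_is_super.htm", usedUrl, content) Then
            Console.WriteLine("Downloaded " & content.Length.ToString() & " characters from " & usedUrl)
        Else
            Console.WriteLine("Neither https:// nor http:// worked.")
        End If
    End Sub

End Module
```

    Trying HTTPS first means a secure connection is used whenever the server offers one; the caller can compare usedUrl with the original input to see which prefix ended up working.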

    Thanks.

    Wednesday, November 7, 2018 10:07 AM