What is Screen Scraping and How to do it?

  • Question

  • User1159546359 posted

    In very simple terms, screen scraping is just making an HTTP request to a web page and reading the HTML that comes back. The simplest way of making an HTTP request is to use the WebClient class in System.Net, but it has its own drawbacks; in my experience it can refuse to work when it's behind a proxy.

    Then comes the HttpWebRequest class, which has many advanced features and handles proxies quite well too.
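
    For readers who sit behind a proxy, here is a minimal sketch of pointing an HttpWebRequest at one explicitly. The helper name CreateRequest and the proxy address used in the usage note are placeholders made up for illustration, and the sketch assumes the proxy accepts the credentials of the account running the code.

```csharp
using System;
using System.Net;

public static class ProxyExample
{
    // Hypothetical helper: builds a request that is routed through the given proxy.
    // Nothing is sent over the network until GetResponse() is called on the result.
    public static HttpWebRequest CreateRequest(string url, string proxyAddress)
    {
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);

        // WebProxy(string) takes the URI of the proxy server.
        WebProxy proxy = new WebProxy(proxyAddress);

        // Assumption: the proxy authenticates with the current account's credentials.
        proxy.Credentials = CredentialCache.DefaultCredentials;

        req.Proxy = proxy;
        return req;
    }
}
```

    Usage would be something like ProxyExample.CreateRequest("http://www.google.com", "http://proxy.example.local:8080"), where the proxy address is whatever your network actually uses.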

    Let me explain this alongside coding as well.

    First create an instance of HttpWebRequest class -

    HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);

    Then let's create an HttpWebResponse object, which will hold the response returned by the GetResponse() method on our request object. GetResponse() is declared to return the base WebResponse type, so a cast is needed -

    HttpWebResponse res = (HttpWebResponse)req.GetResponse();

    The HttpWebResponse class has a method called "GetResponseStream", which gives us the response body as a stream -

    StreamReader sr = new StreamReader(res.GetResponseStream());
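
    One thing to watch for: a StreamReader created this way decodes the bytes as UTF-8, so a page served in another character set can come out garbled. A rough sketch of choosing the encoding from a charset name follows; the helper name ReadAll is made up for illustration, and falling back to UTF-8 for an unknown name is my assumption, not a rule.

```csharp
using System;
using System.IO;
using System.Text;

public static class EncodingExample
{
    // Hypothetical helper: reads the whole stream as text using the named charset.
    public static string ReadAll(Stream body, string charset)
    {
        Encoding encoding = Encoding.UTF8;

        if (!string.IsNullOrEmpty(charset))
        {
            try
            {
                encoding = Encoding.GetEncoding(charset);
            }
            catch (ArgumentException)
            {
                // Unknown or unsupported charset name: keep the UTF-8 fallback.
            }
        }

        // Disposing the reader also closes the underlying stream.
        using (StreamReader sr = new StreamReader(body, encoding))
        {
            return sr.ReadToEnd();
        }
    }
}
```

    With the objects above, the call would be EncodingExample.ReadAll(res.GetResponseStream(), res.CharacterSet).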

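    The whole method will look like this -

    // Requires: using System.IO; using System.Net;
    public void Test_Scraping()
    {
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create("http://www.google.com");

        // The using blocks close the response and the reader even if an exception is thrown.
        using (HttpWebResponse res = (HttpWebResponse)req.GetResponse())
        using (StreamReader sr = new StreamReader(res.GetResponseStream()))
        {
            // Response is the ASP.NET page's HttpResponse, so this method belongs in a page's code-behind.
            Response.Write(sr.ReadToEnd());
        }
    }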

    * In this case I'm just reading the response stream into a string and writing it to the page response.

    Just run the simple code above and you will see the content of the Google home page. As simple as that.

    Now let's see one more feature of this object. Consider the following line - req.Timeout = 2000;

    Here req is the HttpWebRequest object.

    In this case we are setting a timeout for the request. If the remote server does not respond within 2 seconds (2000 milliseconds), GetResponse() throws a WebException. The default timeout is 100 seconds, so it is usually better to set one explicitly and to catch WebException, because many things can go wrong when contacting a remote server; in the worst case the server may not exist at all.
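
    To show what handling that exception might look like, here is a minimal sketch. The names TryFetch and Describe are made up for illustration, and returning null on failure is just one possible design; other exceptions (for example a malformed URL) are deliberately not caught here.

```csharp
using System;
using System.IO;
using System.Net;

public static class TimeoutExample
{
    // Hypothetical helper: turns a WebExceptionStatus into a short message.
    public static string Describe(WebExceptionStatus status)
    {
        switch (status)
        {
            case WebExceptionStatus.Timeout:
                return "timed out";
            case WebExceptionStatus.NameResolutionFailure:
                return "host name could not be resolved";
            case WebExceptionStatus.ConnectFailure:
                return "could not connect to the server";
            default:
                return "request failed (" + status + ")";
        }
    }

    // Hypothetical helper: returns the page body, or null if the request fails.
    public static string TryFetch(string url, int timeoutMs)
    {
        try
        {
            HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
            req.Timeout = timeoutMs; // milliseconds

            using (HttpWebResponse res = (HttpWebResponse)req.GetResponse())
            using (StreamReader sr = new StreamReader(res.GetResponseStream()))
            {
                return sr.ReadToEnd();
            }
        }
        catch (WebException ex)
        {
            // Timeouts, DNS failures and refused connections all arrive here.
            Console.Error.WriteLine(url + ": " + Describe(ex.Status));
            return null;
        }
    }
}
```

    A caller could then write string html = TimeoutExample.TryFetch("http://www.google.com", 2000); and check for null before using the result.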

     

    Hope this helps!

    Thursday, December 20, 2007 6:11 AM

All replies

  • User-1614457691 posted

    That is a nice post above and here is another (similar) sample, FWIW, which was found somewhere on the web (I forget where) and modified to suit my needs. HTH.

    public const int DefaultUrlLengthMin = 16;
    public const int DefaultStartTokenLengthMin = 1;
    public const int DefaultEndTokenLengthMin = 1;
    public const string DefaultUrl = @"http://www.Google.com";
    public const string DefaultStartToken = @"<!-- TEST_START -->";
    public const string DefaultEndToken = @"<!-- TEST_END -->";
    
    private string ScreenScrapeNow(string targetUrl, string startToken, string endToken)
    {
        string myReturnValue = "";
    
        //Validate URL.
        
        targetUrl = targetUrl + "";
        targetUrl = targetUrl.Trim();
    
        if (targetUrl.Length >= DefaultUrlLengthMin)
        {
            //Continue.
        }
        else
        {
            throw new System.NotSupportedException("The URL is not valid.");
        }
    
        //Validate start.
    
        startToken = startToken + "";
        startToken = startToken.Trim();
    
        if (startToken.Length >= DefaultStartTokenLengthMin)
        {
            //Continue.
        }
        else
        {
            throw new System.NotSupportedException("The start-token is not valid.");
        }
    
        //Validate end.
    
        endToken = endToken + "";
        endToken = endToken.Trim();
    
        if (endToken.Length >= DefaultEndTokenLengthMin)
        {
            //Continue.
        }
        else
        {
            throw new System.NotSupportedException("The end-token is not valid.");
        }
    
        //The WebRequest object fetches the URL.
        WebRequest myWebRequest = WebRequest.Create(targetUrl);
    
        string myContent;
    
        //The WebResponse object holds the response (the HTML); the using blocks release the connection and the stream.
        using (WebResponse myWebResponse = myWebRequest.GetResponse())
        using (StreamReader myStreamReader = new StreamReader(myWebResponse.GetResponseStream()))
        {
            //StreamReader decodes as UTF-8 by default, so pages in other encodings may come out garbled.
            //Dump the StreamReader into a string.
            myContent = myStreamReader.ReadToEnd();
        }
    
        //Build the RegEx. Regex.Escape makes the tokens match literally, and Singleline lets "." span line breaks.
        Regex myRegEx = new Regex(Regex.Escape(startToken) + "(.*?)" + Regex.Escape(endToken), RegexOptions.IgnoreCase | RegexOptions.Singleline);
    
        //Here we apply our regular expression to our string using the Match object. 
        Match myMatch = myRegEx.Match(myContent);
    
        //Bam! We return the value from our Match, and we're in business. 
        myReturnValue = myMatch.Value;
    
        return myReturnValue;
    }
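
    The matching step can be tried on its own, without any network call. Below is a small sketch of it; the name ExtractBetween is made up for illustration. It escapes the tokens so that characters such as [ or ? are matched literally, and it returns only the text between the tokens (Match.Value, by contrast, includes the tokens themselves).

```csharp
using System;
using System.Text.RegularExpressions;

public static class TokenExample
{
    // Hypothetical helper: returns the text between the first startToken and the
    // nearest following endToken, or an empty string when there is no match.
    public static string ExtractBetween(string content, string startToken, string endToken)
    {
        // Regex.Escape makes the tokens literal; (.*?) is a lazy capture of what lies between.
        string pattern = Regex.Escape(startToken) + "(.*?)" + Regex.Escape(endToken);

        // Singleline lets "." match line breaks, so the capture can span several lines.
        Match m = Regex.Match(content, pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);

        return m.Success ? m.Groups[1].Value : "";
    }
}
```

    For example, ExtractBetween(html, "<!-- TEST_START -->", "<!-- TEST_END -->") would return whatever the page holds between those two comments.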
    
    Thursday, December 20, 2007 8:39 AM
  • User1159546359 posted

    Thanks for the nice code. I meant to keep it simple, so that a novice reader can understand.

    By the way, do you know of any way to bypass the server proxy while doing this and still get to the required URL? I don't mean setting the flag that bypasses the proxy for local addresses.

    Friday, December 21, 2007 1:20 AM
  • User-1614457691 posted

    By the way, do you know of any way to bypass the server proxy while doing this and still get to the required URL? I don't mean setting the flag that bypasses the proxy for local addresses.

    I am sorry; but, I have no idea.

    I have not faced that as an issue.

    If I do, and if I find a solution, then I will plan to post it here.

    Thank you.

    -- Mark Kamoski

    Friday, December 21, 2007 8:55 AM
  • User2093584515 posted

    Mark...Take a quick look at your http://adam.weblogicarts.com/ site, it's crashing with a webhost4life MDF error.

    Friday, December 21, 2007 9:15 AM
  • User-1614457691 posted

    Mark...Take a quick look at your http://adam.weblogicarts.com/ site, it's crashing with a webhost4life MDF error.

    Yes, I know.

    It has been that way for months and may remain that way too, for the foreseeable future.

    As it turns out, spending time with my child, "Adam", does not leave much time for taking care of the "Adam fan site".

    [:D]

    Given the choice of the former or the latter, I choose the former, of course!

    Thank you.

    -- Mark Kamoski

    Friday, December 21, 2007 12:37 PM