Downloading a dynamically-generated file using C#

  • Question

  • I was working on a web scraping project, and was able to use ScrapySharp to log into a site and grab data from thousands of users. My boss was happy enough that he wanted me to use it to automate getting a CSV report from the same site. I just had to modify a few form values and roll. I was able to log in, find and load the form, and change the pertinent values, but the site runs some kind of script to generate the CSV report I need, and I'm not able to access it via the scraper.

    So I changed gears and tried it using HttpWebRequest/HttpWebResponse objects. I'm pretty sure I'm authenticating successfully, but I can't be sure. I get a 200 status code, but the HTML returned isn't the HTML of the form page I need to get. So I tried first sending a request to the login URL, and then posting the form data to the form page's URL. When I get the response stream from the second request, I'm still getting the login page's HTML. I encountered something similar in ScrapySharp. ScrapySharp has a browser emulator, and a means to submit a page's form. The Submit method returns a WebPage object, and when I submit the login form via ScrapySharp, the returned WebPage yields the login page's HTML. However, if I then use ScrapySharp's browser to navigate to the required URL (which requires authentication), I get the correct page's HTML. I thought I could emulate this with HttpWebRequest by posting a second request to the required page with the form data it needed. That isn't working, or at least I can't seem to get the generated CSV file when I submit. I'm worried I might need to copy over headers, access tokens, whatever... but I'm not sure what I'm missing. Here's my code:

    public void DownloadCSV()
    {
        var cookieContainer = new CookieContainer();

        var request = WebRequest.Create(_loginUri) as HttpWebRequest;
        request.Credentials = GenerateCredentials();
        request.PreAuthenticate = true;
        request.CookieContainer = cookieContainer;
        request.KeepAlive = true;
        request.Method = WebRequestMethods.Http.Post;
        request.ContentType = "application/x-www-form-urlencoded";

        var loginResponse = request.GetResponse() as HttpWebResponse;

        using (var loginStream = loginResponse.GetResponseStream())
        using (var output = File.Create(_loginResponseSavePath))
        {
            loginStream.CopyTo(output);
        }

        // Logged in, now submit form.
        var postData = "huge-string-of-post-data";
        var postBytes = Encoding.UTF8.GetBytes(postData);

        request = WebRequest.Create(_csvFormUri) as HttpWebRequest;
        request.Credentials = GenerateCredentials();
        request.ContentLength = postBytes.Length;
        request.CookieContainer = cookieContainer;
        request.KeepAlive = true;
        request.Method = WebRequestMethods.Http.Post;
        request.ContentType = "application/x-www-form-urlencoded";

        using (Stream postStream = request.GetRequestStream())
        {
            postStream.Write(postBytes, 0, postBytes.Length);
        }

        var formResponse = request.GetResponse() as HttpWebResponse;

        using (var stream = formResponse.GetResponseStream())
        using (var output = File.Create(_csvSavePath))
        {
            stream.CopyTo(output);
        }
    }

    And here's the code that generates the credentials:

    private CredentialCache GenerateCredentials()
    {
        var username = _configuration.GetValue<string>("LoginCreds:username");
        var password = _configuration.GetValue<string>("LoginCreds:password");

        var credentialCache = new CredentialCache();
        credentialCache.Add(_loginUri, "Basic", new NetworkCredential(username, password));

        return credentialCache;
    }

    I'm thinking that maybe the first request is authenticating as required, but I'm re-creating the request and possibly blowing that out. I re-generate the credentials, but nothing is working thus far. The actual site has a form, but some kind of report builder script runs and the browser automatically downloads the CSV file as an Excel spreadsheet. It's super-easy to grab the report manually, but the automation is driving me nuts. 


    Friday, April 19, 2019 8:32 PM

All replies

  • There are a lot of ways this could go wrong.

    If you browse to the login page manually, do you actually see a login form, or do you see a username/password dialog from the browser?  You're sending your credentials through HTTP Basic authentication, which is what the browser dialog uses, but usually places that have a separate "login" page are doing their own authentication, where you have to send the username and password as POST data to the page, not through HTTP Basic authentication.
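
    If the site does use its own login form, the login request would carry the credentials in the POST body rather than in `Credentials`. Here is a minimal sketch of that idea, not a drop-in fix: the class and method names are made up for illustration, and the field names `username` and `password` are assumptions. The real names come from the `name` attributes of the inputs on the login form, and the form may also contain hidden fields that have to be sent along.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Net;
    using System.Text;

    // Hypothetical helper, not part of the original code.
    public static class FormLoginSketch
    {
        // URL-encodes name/value pairs into a form body ("a=1&b=2").
        public static string BuildFormBody(IEnumerable<KeyValuePair<string, string>> fields)
        {
            return string.Join("&", fields.Select(f =>
                Uri.EscapeDataString(f.Key) + "=" + Uri.EscapeDataString(f.Value)));
        }

        // POSTs a form body to the given URI. Cookies set by the server
        // (for example a session cookie) are stored in the shared container.
        public static HttpWebResponse PostForm(Uri uri, string body, CookieContainer cookies)
        {
            var bytes = Encoding.UTF8.GetBytes(body);

            var request = (HttpWebRequest)WebRequest.Create(uri);
            request.Method = WebRequestMethods.Http.Post;
            request.ContentType = "application/x-www-form-urlencoded";
            request.CookieContainer = cookies;
            request.ContentLength = bytes.Length;

            using (Stream stream = request.GetRequestStream())
            {
                stream.Write(bytes, 0, bytes.Length);
            }

            return (HttpWebResponse)request.GetResponse();
        }
    }
    ```

    With something like this, the login step becomes `PostForm(_loginUri, BuildFormBody(fields), cookieContainer)`, where `fields` holds the assumed `username` and `password` pairs, and the later request to `_csvFormUri` reuses the same `cookieContainer` so the session cookie travels with it.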

    However, before you go much further, this may not do what you need at all.  If their web page is generating the CSV data using JavaScript in the returned HTML, then you're still screwed.  What you get back will just be JavaScript code that needs to be executed.  You need to fetch this web page using an actual browser that has a JavaScript interpreter built in.
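
    One rough way to tell which case you are in is to look at the response headers before saving the body. The sketch below is only a heuristic under stated assumptions: the type and method names are hypothetical, and it assumes the server labels a real report with a `Content-Disposition: attachment` header or a CSV/Excel content type, which not every site does.

    ```csharp
    using System;

    // Hypothetical names, introduced for illustration only.
    public enum ResponseKind { FileDownload, HtmlPage, Other }

    public static class ResponseCheckSketch
    {
        // Guesses what came back from the Content-Type and Content-Disposition headers.
        public static ResponseKind Classify(string contentType, string contentDisposition)
        {
            // An "attachment" disposition is what normally triggers a browser download.
            if (!string.IsNullOrEmpty(contentDisposition) &&
                contentDisposition.TrimStart().StartsWith("attachment", StringComparison.OrdinalIgnoreCase))
            {
                return ResponseKind.FileDownload;
            }

            // Drop parameters such as "; charset=utf-8" before comparing.
            var type = (contentType ?? "").Split(';')[0].Trim();

            if (type.Equals("text/csv", StringComparison.OrdinalIgnoreCase) ||
                type.Equals("application/vnd.ms-excel", StringComparison.OrdinalIgnoreCase))
            {
                return ResponseKind.FileDownload;
            }

            if (type.Equals("text/html", StringComparison.OrdinalIgnoreCase))
            {
                // Either a login page or a page whose script builds the report.
                return ResponseKind.HtmlPage;
            }

            return ResponseKind.Other;
        }
    }
    ```

    In the posted code this could be called as `Classify(formResponse.ContentType, formResponse.Headers["Content-Disposition"])`. If it keeps reporting `HtmlPage`, the body is not the CSV.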


    Tim Roberts | Driver MVP Emeritus | Providenza & Boekelheide, Inc.

    Friday, April 19, 2019 11:15 PM