How to programmatically download xml files from a website that requires password

  • Question

  • Hi, can someone tell me if it is possible to download XML files from a webpage so I can read them with my C# application, please?

    Or is this more of a web scraping exercise?

    The page is basically HTML 4.01 and the XML files are href links in a table. All links have:

    <a class='xmlLink' href="http://webAddress.com/document.xml">xml</a><a class='xmlLink'

    If I can download the contents of each XML link from, say, a C# WinForms application, what libraries would I use?

    If it's a web scraping task, what libraries are there for that?

    I have the credentials to access the webpage.

    Thanks for the advice.


    CuriousCoder


    Friday, September 28, 2018 8:48 AM

Answers

  • Hi, thank you for your answers. It seems everyone is focussing on the 'with password' part of the question rather than the 'How to download xml files from a website' part. This is probably my fault; I should have phrased the question better.

    I was asking whether retrieving XML files from a website is a web scraping task or whether it could be achieved some other way.

    I have since found that it is a web scraping task and that I can access the page using HtmlAgilityPack. Because the page requires a password, I also found that I need to create a cookie holder and mimic a web request to capture the cookie. A solution to both can be found here:

    https://stackoverflow.com/questions/52558311/how-to-pass-a-password-when-using-htmlagilitypack/52558985#52558985
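
    For anyone who wants the general shape without following the link, here is a minimal sketch of the idea. It uses HttpClient with a CookieContainer, which is not necessarily what the linked answer does, and it assumes the site uses a plain form login. The login URL, the page URL and the form field names "username" and "password" are placeholders I made up; check the real login form for the actual ones.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack; // third-party NuGet package

public static class XmlLinkDownloader
{
    // Collects the href of every <a class='xmlLink'> anchor in the page markup.
    public static List<string> ExtractXmlLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@class='xmlLink'][@href]");
        if (anchors == null)
            return new List<string>();

        return anchors.Select(a => a.GetAttributeValue("href", string.Empty)).ToList();
    }

    // Logs in, fetches the page with the table, and downloads each linked XML file.
    // loginUrl, pageUrl and the form field names are assumptions, not the real site's.
    public static async Task<Dictionary<string, string>> DownloadAllAsync(
        string loginUrl, string pageUrl, string userName, string password)
    {
        var results = new Dictionary<string, string>();
        var cookies = new CookieContainer(); // keeps the session cookie set at login

        using (var client = new HttpClient(new HttpClientHandler { CookieContainer = cookies }))
        {
            var form = new FormUrlEncodedContent(new Dictionary<string, string>
            {
                { "username", userName }, // hypothetical field name
                { "password", password }  // hypothetical field name
            });
            var login = await client.PostAsync(loginUrl, form);
            login.EnsureSuccessStatusCode();

            // The same client sends the captured cookie on every later request.
            string html = await client.GetStringAsync(pageUrl);
            foreach (string href in ExtractXmlLinks(html))
            {
                var absolute = new Uri(new Uri(pageUrl), href); // resolves relative links
                results[absolute.ToString()] = await client.GetStringAsync(absolute);
            }
        }
        return results;
    }
}
```

    Each downloaded string can then be loaded with XDocument.Parse or XmlDocument.LoadXml. If the site's login involves hidden form fields or tokens, those would have to be read from the login page first and posted along with the credentials.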

    Hope this helps someone else too.

    Many Thanks


    CuriousCoder

    Monday, October 1, 2018 8:48 AM

All replies

  • Hello,

    If you read the URL specification, it describes where and how a username and password can be specified.

    If the server doesn't support credentials encoded in the URL, you will have to make several requests to the server to get the data.


    Sincerely, Highly skilled coding monkey.

    Friday, September 28, 2018 8:54 AM
  • So, are you saying this is a web scraping task?

    CuriousCoder

    Friday, September 28, 2018 9:25 AM
  • Have you studied the URL specification?

    Have you checked whether a properly encoded URL with username and password works on the server?


    Sincerely, Highly skilled coding monkey.

    Friday, September 28, 2018 9:47 AM
  • Can you explain your concrete use case?

    Whether it is web scraping or not cannot be told from the description you have given. We also cannot recommend libraries without knowing that.

    Assuming the usual use case in such scenarios:

    First of all, ask the owner/operator whether this is an allowed use of the website. Then ask for a description of how to do it, because sometimes they have optimized ways to access documents or files on their system.

    Otherwise, when the URL is known, you only need to examine the website's authentication mechanism and replay it using the normal WebClient.
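
    As a rough illustration of the simplest case, the sketch below assumes the server uses HTTP authentication (Basic, Digest or Windows) rather than a login form. The class and method names are made up for the example. If the site uses a form login with cookies instead, setting Credentials alone will not be enough.

```csharp
using System.Net;
using System.Xml.Linq;

public static class AuthenticatedXmlFetcher
{
    // Builds a WebClient that answers the server's HTTP authentication challenge.
    public static WebClient CreateClient(string userName, string password)
    {
        return new WebClient { Credentials = new NetworkCredential(userName, password) };
    }

    // Downloads one XML document from a known URL and parses it.
    public static XDocument Fetch(string url, string userName, string password)
    {
        using (WebClient client = CreateClient(userName, password))
        {
            string xml = client.DownloadString(url);
            return XDocument.Parse(xml);
        }
    }
}
```

    A call would look like AuthenticatedXmlFetcher.Fetch("http://webAddress.com/document.xml", user, pass), with the user name and password supplied by the application.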

    Friday, September 28, 2018 10:54 AM