Read files from Azure storage in map function

  • Question

  • Hi,

    My PySpark application processes JSON data stored in Azure blob containers (the files live in different containers). To process the JSON files in one container, I use the following code:

    sc = SparkContext(conf=conf)  # construct the spark context
    file_in_container = "wasbs://<containername>@<accountname>.blob.core.windows.net/<file.path>/*.json"
    files = sc.textFile(file_in_container)  # build RDD

    I then have a `map` function which parses each JSON string and computes statistics for aggregation. This method works, but I found it too slow. I think one possible reason is that when I run "files = sc.textFile(file_in_container)", the head node reads all the JSON files in the container and transfers them to each executor.

    I think a better way is to build an RDD from the file names and let each executor read the JSON files itself. But the problem is: how can an executor access the data in the Azure container? I think "wasbs://" cannot work here, since "sc" seems to be forbidden inside a map function, so I cannot get the JSON files via "sc.textFile()". How can I achieve this?
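
    For illustration, here is a minimal sketch of the approach I have in mind, assuming the azure-storage-blob SDK is installed on every executor; the connection string, container name, and blob names below are placeholders:

    import json

    from azure.storage.blob import ContainerClient  # assumed: SDK available on every executor

    def read_and_parse(blob_names, conn_str, container_name):
        # Runs on the executor: it creates its own client, so no SparkContext is needed.
        container = ContainerClient.from_connection_string(conn_str, container_name)
        for name in blob_names:
            raw = container.download_blob(name).readall().decode("utf-8")
            yield json.loads(raw)

    conn_str = "<connection-string>"            # placeholder
    container_name = "<containername>"          # placeholder
    blob_names = ["file1.json", "file2.json"]   # e.g. listed on the driver beforehand

    names_rdd = sc.parallelize(blob_names, numSlices=8)
    parsed = names_rdd.mapPartitions(
        lambda names: read_and_parse(names, conn_str, container_name))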


    Saturday, April 18, 2020 12:11 AM

All replies

  • Hello,

    Could you please add the complete code which you are trying, and also share the complete stack trace of the error message which you're experiencing?

    Meanwhile, kindly go through the difference between loading text files and JSON files.

    Text files are very simple and convenient to load from and save to in Spark applications. When we load a single text file as an RDD, each input line becomes an element of the RDD. Spark can also load multiple whole text files at the same time into a pair RDD, where the key is the file name and the value is the contents of that file.
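
    For instance, a minimal sketch using sc.wholeTextFiles(), where the container and account names are placeholders:

    # Each element is a (file_path, file_contents) pair.
    pairs = sc.wholeTextFiles("wasbs://<containername>@<accountname>.blob.core.windows.net/<file.path>")
    name, contents = pairs.first()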

    Loading the text files: Loading a single text file is as simple as calling the textFile() function on our SparkContext with the path to the file, as shown below:

    input = sc.textFile("wasbs://<containername>@<accountname>.blob.core.windows.net/<file.path>/README.md")

    JSON is a lightweight data-interchange format. It is text only, which means it can easily be sent to and received from a server.

    Loading the JSON files: For all supported languages, the approach of loading the data as text and then parsing the JSON can be adopted. Here, if a file contains multiple JSON records, the developer will have to read the entire file and parse the records one by one.
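
    For example, a minimal sketch of this text-and-parse approach, assuming each line of every matched file is one complete JSON record (the path reuses the placeholder names from above):

    import json

    # Each line becomes one RDD element; parse it as a JSON record.
    lines = sc.textFile("wasbs://<containername>@<accountname>.blob.core.windows.net/<file.path>/*.json")
    records = lines.map(json.loads)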

    Alternatively, Spark SQL can read JSON files directly into a DataFrame:

    # spark is from the previous example.
    sc = spark.sparkContext

    # A JSON dataset is pointed to by path.
    # The path can be either a single text file or a directory storing text files.
    path = "wasbs://<containername>@<accountname>.blob.core.windows.net/<file.path>/*.json"
    peopleDF = spark.read.json(path)

    There are several ways you can access the files in Blob/Data Lake Storage from an HDInsight cluster. The URI scheme provides unencrypted access (with the wasb: prefix) and TLS-encrypted access (with wasbs:). We recommend using wasbs wherever possible, even when accessing data that lives in the same Azure region.
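
    For example, the two URI forms look like this (the container and account names are placeholders):

    # TLS-encrypted access (recommended)
    secure_path = "wasbs://<containername>@<accountname>.blob.core.windows.net/<file.path>"

    # Unencrypted access
    plain_path = "wasb://<containername>@<accountname>.blob.core.windows.net/<file.path>"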

    Reference: Use Azure storage with Azure HDInsight clusters

    Hope this helps. Do let us know if you have any further queries.

    ----------------------------------------------------------------------------------------

    Do click on "Mark as Answer" and Upvote on the post that helps you, this can be beneficial to other community members.

    Wednesday, April 22, 2020 3:38 AM
  • Hello,

    Just checking in to see if the above answer helped. If this answers your query, do click “Mark as Answer” and Up-Vote for the same. And, if you have any further query do let us know.

    Thursday, April 23, 2020 10:40 AM
  • Hello,

    Following up to see if the above suggestion was helpful. And, if you have any further query do let us know.

    Friday, April 24, 2020 11:27 AM