Big JSON file analysis in Azure HDInsight Spark cluster

  • Question


    I would like to analyze a large dataset (0.9 TB after unzipping) on a cluster with 14 nodes and 39 cores (Azure HDInsight/Spark), but it is very slow. Here is what I do:


    1. Data is downloaded from here.

    2. val data = spark.read.json(path) crashes. The data is stored in HDFS.

    3. val rdd = sc.textFile(path) followed by rdd.count() also crashes.

    4. rdd.take(10) and similar small actions work fine.

    5. It was not possible to unzip the file, so I read it directly as data.json.gz.

    Any suggestions? How can I read it with the JSON reader?
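    A likely cause, sketched below in Spark/Scala (the path, app name, and partition count are placeholders, not values from this thread): a single .json.gz file is not splittable, so Spark decompresses it in one task on one core regardless of cluster size. That also explains why rdd.take(10) succeeds, since it only decompresses the head of the stream, while count() must pull all 0.9 TB through a single task.

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("BigJsonRead").getOrCreate()

    // Hypothetical HDFS location of the gzipped file.
    val path = "hdfs:///data/data.json.gz"

    val data = spark.read
      .json(path)          // gzip is not splittable: this runs as ONE task
      .repartition(39 * 4) // spreads only the *later* stages over the 39 cores

    // Still bottlenecked by the single-task read; decompressing the file
    // first (or using a splittable codec) is what restores parallelism.
    data.count()
    ```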


    • Edited by Maryam_Lewen Wednesday, November 27, 2019 8:14 AM

All replies

  • Hello Maryam_Lewen and thank you for your question.  I saw your related thread on unzipping.

    Were there any error messages or logs from the crash you can share?

    Thursday, November 28, 2019 2:33 AM
  • Thanks Martin. The problem is that the file cannot be unzipped, so the computation is not parallelized and I get a timeout error.
    Thursday, November 28, 2019 9:38 AM
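  • One workaround worth sketching here, since the blocker is the non-splittable gzip: decompress the file once inside HDFS, then point spark.read.json at the plain-text copy so Spark can split it into many tasks. The paths below are placeholders, and this is a sketch using the Hadoop FileSystem API rather than a confirmed fix for this cluster.

    ```scala
    import java.util.zip.GZIPInputStream
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.io.IOUtils

    // One-time streaming decompression inside HDFS (no local copy needed).
    val fs  = FileSystem.get(new Configuration())
    val in  = new GZIPInputStream(fs.open(new Path("/data/data.json.gz")))
    val out = fs.create(new Path("/data/data.json"))

    // Copy with a 4 KB buffer; 'true' closes both streams when done.
    IOUtils.copyBytes(in, out, 4096, true)
    ```

    After this one sequential pass, spark.read.json("hdfs:///data/data.json") can be split across all executors. Recompressing with a splittable codec such as bzip2 is another option if storage is a concern.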
  • Since I am helping you with unzipping in the other thread, may I tentatively close this one?  If the unzip fails, we can come back here.
    Monday, December 2, 2019 10:58 PM