Big JSON file analysis in Azure HDInsight Spark Cluster

  • Question


    I would like to analyze a large dataset (0.9 TB after unzipping) on a cluster with 14 nodes and 39 cores (Azure HDInsight/Spark), but it is very slow. Here is what I do:


    1. Data is downloaded from here.

    2. val data = ---- crashes. The data is stored in HDFS.

    3. val rdd = sc.textFile(path) ... followed by rdd.count() ... also crashes.

    4. rdd.take(10), ... these work fine.

    5. It was not possible to unzip the file, so I read it directly as data.json.gz.

    Any suggestions? How can I read it with the JSON reader?
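
    For reference, steps 3 and 5 together with the JSON reader might look roughly like the sketch below. It is only an illustration under stated assumptions: the path and the partition count of 200 are hypothetical, and the file is assumed to contain one JSON object per line. A single gzip file is not splittable, so Spark reads it in one task; that may be related to why rdd.take(10) returns quickly while rdd.count() struggles, although this has not been confirmed here.

```scala
import org.apache.spark.sql.SparkSession

object ReadGzJson {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ReadGzJson").getOrCreate()

    // Hypothetical location; replace with the real HDFS path.
    val path = "hdfs:///data/data.json.gz"

    // Read the compressed file as plain text lines (Dataset[String]).
    // A single .gz file is not splittable, so this yields one partition.
    val lines = spark.read.textFile(path)
    println(s"partitions after read: ${lines.rdd.getNumPartitions}")

    // Spread the lines over more partitions so that later stages can use
    // all cores. 200 is an assumed value, not a tuned one. The initial
    // decompression still runs in a single task.
    val spread = lines.repartition(200)

    // Parse the lines with the JSON reader (assumes one object per line).
    val df = spark.read.json(spread)
    df.printSchema()
    println(s"rows: ${df.count()}")

    spark.stop()
  }
}
```

    If the data can be re-compressed, splitting it into many smaller .gz files or using a splittable codec such as bzip2 would let the read itself run in parallel.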


    • Edited by Maryam_Lewen Wednesday, November 27, 2019 8:14 AM
    Wednesday, November 27, 2019 8:14 AM

All replies