Query tables created in Hive using Spark in HDInsight

  • Question

  • I have created a Hadoop cluster and loaded some tables into Hive. Now I have created a Spark cluster and would like to see whether I can query those tables from there. Could you please let me know if any custom configuration is needed to make the Spark cluster work with the Hadoop/Hive cluster? Any insights are helpful, thank you.
    Monday, September 9, 2019 5:37 PM

All replies

  • Hello,

    For the clusters to interact with each other, you should deploy the Hadoop and Spark clusters using the same Hive metastore.

    A custom metastore lets you attach multiple clusters and cluster types to the same metastore. For example, a single metastore can be shared across Interactive Query, Hive, and Spark clusters in HDInsight.

    For more details, refer to “Use external metadata stores in Azure HDInsight”.

    In detail:

    Step 1: Create an Azure SQL Database to use as the metastore.

    Note: Start with an S2 tier, which provides 50 DTU and 250 GB of storage. If you see a bottleneck, you can scale the database up.

    Step 2: Create an HDInsight Hadoop cluster named “cheprahive”, configuring its metastore settings to use the Azure SQL Database.

    Step 3: Create an HDInsight Spark cluster named “chepraspark”, configuring its metastore settings to use the same Azure SQL Database.
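    As a sketch, the steps above could be scripted with the Azure CLI. The resource group, storage account, SQL server/database names, and passwords below are all placeholders, and this assumes the resource group, storage account, and Azure SQL database already exist:

    ```shell
    # Assumption: resource group, storage account, and the Azure SQL
    # server/database for the metastore have already been created.

    # Hadoop cluster pointing at the shared Hive metastore
    az hdinsight create \
      --name cheprahive \
      --resource-group myResourceGroup \
      --type hadoop \
      --http-user admin \
      --http-password 'ClusterPassword123!' \
      --storage-account mystorageaccount \
      --metastore-server-name mysqlserver.database.windows.net \
      --metastore-database-name hivemetastoredb \
      --metastore-user-name sqladmin \
      --metastore-password 'SqlPassword123!'

    # Spark cluster pointing at the SAME metastore database
    az hdinsight create \
      --name chepraspark \
      --resource-group myResourceGroup \
      --type spark \
      --http-user admin \
      --http-password 'ClusterPassword123!' \
      --storage-account mystorageaccount \
      --metastore-server-name mysqlserver.database.windows.net \
      --metastore-database-name hivemetastoredb \
      --metastore-user-name sqladmin \
      --metastore-password 'SqlPassword123!'
    ```

    The key point is that both `az hdinsight create` invocations pass the same `--metastore-*` values, so both clusters register their Hive tables in one database.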

    To show how to query tables created in Hive from the Spark cluster:

    I created a table named “errorlogs” in the HDInsight Hadoop cluster named “cheprahive”.

    Now you can query the tables created in Hive from the Spark HDInsight cluster named “chepraspark”.
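    For instance, once both clusters share the metastore, a spark-shell session on “chepraspark” should see the Hive table directly. This is a sketch using the “errorlogs” table from the example above; the database name `default` is an assumption:

    ```scala
    // In spark-shell on the Spark cluster; `spark` is the pre-built
    // SparkSession, which HDInsight creates with Hive support enabled.
    spark.sql("SHOW DATABASES").show()          // the Hive databases should be listed
    spark.sql("SHOW TABLES IN default").show()  // "errorlogs" should appear here

    // Query the Hive table as a DataFrame
    val errors = spark.table("default.errorlogs")
    errors.show(10)
    ```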

    Hope this helps.      

    ----------------------------------------------------------------------------------------

    Do click on "Mark as Answer" and Upvote on the post that helps you, this can be beneficial to other community members.

    Tuesday, September 10, 2019 7:08 AM
    Moderator
  • Appreciate the response.

    I indeed followed the same steps earlier: created a Hadoop cluster and a Spark cluster leveraging the same external metastore (Azure SQL DB) and the same storage account (ADLS Gen2, just different file systems).

    However, I was trying to connect to the Spark command line using Azure PowerShell (bash), which I did, and I was able to launch the Spark shell. When I try to query the table through a DataFrame and show the result, it says it doesn't recognize the database and/or table that I created from Hive.

    Are we sure it connects seamlessly as long as the clusters share the same external metastore? The link below says some custom settings need to be done. If they can seamlessly connect with a common external metastore, any thoughts on why Spark doesn't recognize the Hive database and/or table? Thanks.

    https://docs.microsoft.com/en-us/azure/hdinsight/interactive-query/apache-hive-warehouse-connector
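    One quick check from the spark-shell is whether the session is actually backed by the Hive metastore at all. These are standard Spark APIs, not HDInsight-specific, so this is a diagnostic sketch rather than a fix:

    ```scala
    // Should print "hive"; if it prints "in-memory", the session is not
    // talking to an external Hive metastore.
    println(spark.conf.get("spark.sql.catalogImplementation"))

    // List what the catalog can see; the Hive databases and tables
    // created from the Hadoop cluster should show up here.
    spark.catalog.listDatabases().show()
    spark.catalog.listTables("default").show()
    ```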

    Tuesday, September 10, 2019 8:34 PM
  • Hello,

    In order to understand the issue, I would request you to provide the steps you are trying, along with a screenshot of the error message.

    Wednesday, September 11, 2019 7:19 AM
    Moderator
  • Hello,

    Just checking in to see if you have had a chance to look at the previous response. We need the above requested information to understand/investigate this issue further.

    Friday, September 13, 2019 9:49 AM
    Moderator