When does an Azure HDInsight HBase cluster need to be connected to more than one Azure storage account?

  • Question

  • Hi,

    I am new to HDInsight and Azure Storage.

    I noticed that HDInsight uses Azure Storage as its default HDFS layer, leveraging the replication, scalability and many other features that Azure Storage provides. We are now running a full-table scan on HBase, but query performance is slow. I suspect this is caused by the network I/O between the region servers and Azure Storage.

    I am therefore confused about how HDInsight manages storage.

    1. Should I put effort into defining a storage management strategy, for example by adding more storage accounts, or more containers within one storage account, so that HBase spreads our data across multiple stores and performance improves?

    2. When do I need to add more storage accounts or storage containers? If I add storage accounts to my HDInsight HBase cluster by following "Add additional storage accounts to HDInsight", how can I tell whether the data is persisted across all of the storage accounts, and what is the persistence strategy: for example, one storage account per copy of the HBase data, or something else? Can this approach improve HBase query performance?

    Looking forward to your kind answers.

    Many Thanks & Best Regards.
    Friday, August 4, 2017 5:26 AM


  • HDInsight HBase is offered as a managed cluster that is integrated into the Azure environment. The clusters are configured to store data directly in Azure Storage or Azure Data Lake Store, which provides low latency and increased elasticity in performance and cost choices.

    HBase is a high-end NoSQL big data store that gives you many options for getting great performance; there is no shortage of levers you can tweak to optimize it further.

    Below is a general list of the most impactful considerations for HBase performance in HDInsight:

    · Pick appropriate VMs

    · Incorrect row key design can really hurt

    · Drastically improve your write throughput by implementing batching

    · Improve your read performance by enabling bucket caching

    · Avoid major compaction at all costs

    · Pre-split regions for great performance from the start

    · Do not use the HBase storage account for anything else

    · Avoid using the HBase cluster for other Hadoop applications

    · Disable or flush HBase tables before you delete the cluster
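
    To make the row key and pre-split items above more concrete, here is a minimal sketch of one common approach, key salting, written in plain Python with no HBase client. It is only an illustration under assumptions: the bucket count, the `|` separator, and the function names `salt_prefix`, `salted_row_key` and `split_points` are all hypothetical and do not come from the linked article.

```python
import hashlib
from typing import List

NUM_BUCKETS = 8  # hypothetical: number of regions the table is pre-split into


def salt_prefix(key: str, buckets: int = NUM_BUCKETS) -> str:
    """Return a stable two-digit bucket prefix derived from a hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return f"{digest[0] % buckets:02d}"


def salted_row_key(key: str, buckets: int = NUM_BUCKETS) -> str:
    """Prefix the natural key with its bucket so sequential keys spread across regions."""
    return f"{salt_prefix(key, buckets)}|{key}"


def split_points(buckets: int = NUM_BUCKETS) -> List[str]:
    """Region boundaries to supply when creating the table: one per bucket after the first."""
    return [f"{b:02d}" for b in range(1, buckets)]


if __name__ == "__main__":
    print(split_points())
    print(salted_row_key("2017-08-04T05:26:00|sensor-17"))
```

    The split points would be passed to whatever tool creates the table, so that each bucket prefix lands in its own region from the first write. Keep in mind the trade-off: salting spreads write and read load, but a range scan over the natural key then has to be issued once per bucket.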

    For more details, refer to “HDInsight HBase: 9 things you must do to get great HBase performance”.



    Friday, August 4, 2017 6:54 AM