HDInsight with Datalake Gen2 storage

  • Question

  • I created an HDInsight Hadoop cluster with Blob storage as the underlying storage account and successfully proved out a few concepts, such as creating tables and loading data from blob.

    However, I have now created a new Data Lake Gen2 storage account. My idea is to build the HDInsight Hadoop cluster with Data Lake Gen2 as the underlying storage account. When I try to create the cluster, it forces me to provide a user-assigned managed identity, and I have no clue how that works. This was not required when Blob was selected as storage, so why is it required only for Data Lake Gen2?

    I even tried creating a managed identity, but it does not appear in the drop-down. Can you please help me move forward and complete the cluster creation with ADLS Gen2? Your help is much appreciated.

    Tuesday, August 27, 2019 5:00 PM

All replies

  • Hello azdevad1 and thank you for your inquiry.  I ran into the same issue myself and here is what I found:

    There are three steps:

    1. Create the user-assigned managed identity
    2. In the storage account, go to 'Access control (IAM)' and assign a role to the managed identity (I used the Contributor role).
    3. Create the HDInsight cluster.

    I found that it is necessary to do steps 1 and 2 before opening the HDInsight creation wizard.  When I opened the wizard first and then created the identity and assigned it a role, the drop-down did not update until I reloaded the portal page, and reloading the page reset all progress made in the wizard.
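
    The three steps above can also be sketched with the Azure CLI instead of the portal. The Python snippet below is only a rough sketch: it assembles and prints the commands and does not run anything. The resource names are hypothetical, the angle-bracket placeholders must be filled in from the output of the earlier commands, and the flag names should be checked against `az identity create --help`, `az role assignment create --help`, and `az hdinsight create --help`. The role defaults to Storage Blob Data Owner, which is the role Microsoft's HDInsight documentation names for ADLS Gen2; the reply above used Contributor, so that value can be passed instead.

```python
import shlex


def build_commands(resource_group, identity_name, storage_account, cluster_name,
                   role="Storage Blob Data Owner"):
    """Return the three Azure CLI commands as argument lists (nothing is executed)."""
    # Step 1: create the user-assigned managed identity.
    create_identity = [
        "az", "identity", "create",
        "--resource-group", resource_group,
        "--name", identity_name,
    ]
    # Step 2: give the identity a role on the storage account.
    # Both placeholders come from the output of earlier commands.
    assign_role = [
        "az", "role", "assignment", "create",
        "--assignee", "<identity-principal-id>",
        "--role", role,
        "--scope", "<storage-account-resource-id>",
    ]
    # Step 3: create the cluster, pointing it at the identity.
    # Other options (credentials, node sizes, and so on) are left out of this sketch.
    create_cluster = [
        "az", "hdinsight", "create",
        "--resource-group", resource_group,
        "--name", cluster_name,
        "--type", "hadoop",
        "--storage-account", storage_account,
        "--storage-account-managed-identity", "<identity-resource-id>",
    ]
    return [create_identity, assign_role, create_cluster]


# Hypothetical names, for illustration only.
commands = build_commands("my-rg", "my-hdi-identity", "mygen2account", "my-cluster")
for cmd in commands:
    print(shlex.join(cmd))
```

    Keeping the identity and role assignment ahead of the cluster command mirrors the ordering that worked in the portal.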

    Tuesday, August 27, 2019 9:05 PM
  • Thanks for the insights. Appreciate it.

    When selecting the storage account, there is a Primary Storage type field. We can select either ADLS Gen2 or Azure Storage from the drop-down. However, it allows the Gen2 storage account to be selected with either option. The first option prompts for a managed identity, while the second does not. What is the difference between selecting ADLS Gen2 and Azure Storage as the primary storage type, given that the Gen2 account can be selected either way? Would it make a difference?

    Wednesday, August 28, 2019 4:01 PM
  • I noticed that, but didn't get around to testing all the combinations.  I do have an off-the-cuff hypothesis.

    First let me explain the difference.  During storage account creation, there is the optional feature 'hierarchical namespace'.  This feature transforms the blob storage into Azure Data Lake Storage Gen2.  Due to the structural differences between blob and ADLS Gen2, there are different protocols and endpoints for each.  If you go to your storage account and look inside the properties blade, you can see the endpoint URIs for each.

    While ADLS Gen2 is built on top of blob storage, at this time mixing protocols (using the blob interface on ADLS Gen2, or using the ADLS Gen2 interface on blob storage) is not supported.  When you select the Primary Storage type, you are really specifying which protocol / interface to use.  This means that selecting Primary Storage type ADLS Gen2 but choosing a storage account that is blob only (or vice versa) will cause errors during cluster deployment.  I believe I ran into this issue myself once, but I haven't tested every combination.

    What adds to the confusion is that the portal just lists 'generic' storage accounts, regardless of whether they have hierarchical namespace enabled, that is, whether they are blob or ADLS Gen2.
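
    To make the two interfaces concrete, here is a minimal Python sketch of how the endpoint and Hadoop URIs differ. The account, container, and path names are hypothetical. The host suffixes (`blob.core.windows.net`, `dfs.core.windows.net`) and the `wasbs` / `abfss` schemes are the standard public-cloud ones; sovereign clouds use different suffixes.

```python
def storage_endpoints(account):
    """Return the HTTPS endpoints for the blob and ADLS Gen2 (dfs) interfaces."""
    return {
        "blob": f"https://{account}.blob.core.windows.net",
        "dfs": f"https://{account}.dfs.core.windows.net",
    }


def hadoop_uri(account, container, path, hierarchical_namespace):
    """Build the URI a Hadoop cluster would use for a path in the account.

    With hierarchical namespace (ADLS Gen2) the ABFS driver and the dfs
    endpoint are used; otherwise the WASB driver and the blob endpoint.
    """
    relative = path.lstrip("/")
    if hierarchical_namespace:
        return f"abfss://{container}@{account}.dfs.core.windows.net/{relative}"
    return f"wasbs://{container}@{account}.blob.core.windows.net/{relative}"


# Hypothetical names, for illustration only.
endpoints = storage_endpoints("mygen2account")
gen2_uri = hadoop_uri("mygen2account", "data", "/tables/sales.csv", True)
blob_uri = hadoop_uri("myblobaccount", "data", "/tables/sales.csv", False)
print(endpoints["dfs"])
print(gen2_uri)
print(blob_uri)
```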

    As to why only one of them requires managed identity, I will have to ask an expert.

    Wednesday, August 28, 2019 4:42 PM
  • @azdevad1, did this help you? (If so, please mark as answered.)  If not, please let me know what else I can do to assist.
    Friday, August 30, 2019 5:29 PM