HDInsight vs. Virtualized Hadoop Cluster on Azure

  • Question

  • I'm investigating two alternatives for using a Hadoop cluster: the first is using HDInsight (with either Blob or HDFS storage), and the second is deploying a powerful Windows Server VM on Microsoft Azure and running HDP on it. The second alternative gives me more flexibility; however, what I'm interested in investigating is the overhead each alternative has. Any ideas on that?

    • Edited by HESSAMZAK Thursday, February 5, 2015 12:44 AM
    Wednesday, February 4, 2015 11:10 PM


All replies

  • Hi,

    You are correct, the second alternative, i.e. HDP on Windows Azure VMs, gives you more flexibility and control over your Hadoop clusters. It will be easier for you to install the latest versions of HDP (and hence Hadoop), modify your configuration files on the fly, troubleshoot and debug issues you may hit while running your jobs, and so on. The overhead with this approach, however, is that you have to maintain the infrastructure yourself: OS patching, security updates, setting up the network and authentication between your VMs, and things like that.

    HDInsight, on the other hand, makes life easy in terms of setting up the cluster with a few mouse clicks. The OS is automatically patched and all security updates are taken care of. However, you have to use the build of HDP (and Hadoop) that HDInsight ships with, and you cannot take advantage of the latest HDP builds. In case of a bug that is fixed in a newer version of a component (e.g. a Hive 0.13 bug that is fixed in Hive 0.14), you cannot upgrade that component on demand; rather, you have to wait until a new version of the HDInsight service ships with it. Moreover, configuration changes for your jobs are not straightforward: you have to recreate your cluster every time you want to push a configuration change, because the HDInsight VMs are periodically refreshed to their initial state (this cannot be controlled) and any configuration changes made on the fly are lost.

    So, to summarize, as a bulleted set of points:

    The benefits of the HDInsight service (PaaS) are the following:
    •  Ease of deployment: 0 to cluster in about 10 minutes.
    •  A supported Hadoop offering: the customer can call Microsoft and get help with Hadoop. In IaaS, the customer will need to acquire their own support; Microsoft’s support stops at the VM.
    •  First-class support for “transient clusters” that let you use compute only when you need it, by externalizing data to Blob storage and metadata to a database. You could do this on IaaS, but with more complexity.
    •  We run the cluster for you; on IaaS, you will need to run the cluster yourself.
    •  An SLA on cluster availability. IaaS will get you an SLA on VM availability only.
    •  The service handles OS patching, Azure machine lifecycle events, etc.
    •  We’ve integrated into the Azure hosted service model to provide a robust security model. You could mimic this on IaaS, but it’s left as an exercise for you.

    The benefits of running HDP on Azure VMs (IaaS) are the following:

    •  Full customization of the machines: if you want to install some random piece of software X on every node, you can go ahead. We constrain your ability to do that in the service, as it makes it much tougher for us to offer an SLA.
    •  Support for the entire HDP stack; for instance, we don’t have Hue in the HDInsight service yet. You can install the latest version of HDP with all its components, whereas in HDInsight you may have to wait for the next service release.
    •  If you have an on-prem solution and want to use Azure as a development environment, you can have parity with your on-prem distribution.
    •  If you have data that requires a BAA to cover HIPAA requirements: HDInsight is not covered by the BAA, but network, storage and VMs are covered, and those are the core ingredients for IaaS.

    Personally, I don’t have a solid preference for HDP on Azure VMs over HDInsight or vice versa; it comes down to the scenario and exact requirements. From a cost perspective the two are similar: HDInsight is charged per core at the hosted service price, while on IaaS you pay a little less per core (and pricing differs for Windows vs. Linux).

    Hope this helps.

    Regards.


    DebarchanS - MSFT


    Thursday, February 5, 2015 10:36 AM
  • Thanks for your elaborate answer. What I'm mostly concerned about is the performance difference between these two alternatives, particularly the overhead (or degradation) that using Blob storage might impose.

    The other thing is that I'm going to run some machine learning algorithms on my data, and I'd like to know how optimized the Machine Learning framework in Azure is compared to the Mahout package on Hadoop.

    Thanks.


    • Edited by HESSAMZAK Thursday, February 5, 2015 5:39 PM
    Thursday, February 5, 2015 5:37 PM
  • Hi,

    Your concern is valid, and it's a common question: the network is often the bottleneck, and making it performant can be expensive. Yet the practice for HDInsight on Azure is to place the data in Azure Blob Storage; these storage nodes are separate from the compute nodes that Hadoop uses to perform its calculations. This seems to conflict with the idea of moving compute to the data; after all, Hadoop is all about moving compute to data rather than the traditional approach of moving data to compute.

    The typical HDInsight infrastructure is that HDInsight runs on the compute nodes while the data resides in Azure Blob Storage. To ensure that the transfer of data from storage to compute is fast, Azure recently deployed Azure Flat Network Storage (also known as Quantum 10, or the Q10 network), a flat mesh network that allows very high-bandwidth connectivity for storage clients. See more details about FNS here - http://blogs.msdn.com/b/hanuk/archive/2012/11/04/windows-azure-s-flat-network-storage-to-enable-higher-scalability-targets.aspx

    Suffice it to say, the performance of HDFS on local disk and HDFS backed by Blob storage is comparable, and in some cases we have seen jobs run faster on Blob storage due to the fast performance of the Q10 network.

    When HDInsight is performing its task, it streams data from the storage nodes to the compute nodes. But many of the map, sort, shuffle, and reduce tasks that Hadoop performs are done on the local disks of the compute nodes themselves. The map, reduce, and sort tasks typically run on the compute nodes with minimal network load, while the shuffle tasks use some network to move data from the mapper nodes to the (fewer) reducer nodes. The final step of storing the data back to storage typically involves a much smaller dataset (e.g. a query result or report). In the end, the network is most heavily utilized during the initial and final streaming phases, while most of the other tasks are performed intra-node (i.e. with minimal network utilization).
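
    To make that concrete, here is a rough back-of-the-envelope model in Python. The input size and phase ratios are illustrative assumptions, not measurements:

        # Rough model of where network bandwidth goes in a Hadoop job whose
        # input lives in Azure Blob storage. Ratios are assumed, not measured.
        GB = 1024 ** 3

        input_bytes = 500 * GB      # streamed from Blob storage into the mappers
        shuffle_ratio = 0.3         # assumed fraction of input moved mapper -> reducer
        output_ratio = 0.05         # assumed: final result (e.g. a report) is small

        shuffle_bytes = input_bytes * shuffle_ratio   # intra-cluster traffic
        output_bytes = input_bytes * output_ratio     # streamed back to Blob storage

        print(f"read from storage : {input_bytes / GB:8.1f} GB")
        print(f"shuffle (cluster) : {shuffle_bytes / GB:8.1f} GB")
        print(f"write to storage  : {output_bytes / GB:8.1f} GB")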

    So, a quick summary on the performance aspect would be:
         •  Azure Blob storage provides near-identical HDFS access characteristics for reading (performance and task splitting) into map tasks.
         •  Azure Blob provides faster write access for Hadoop HDFS, allowing jobs to complete faster when writing data out from reduce tasks.

    Some additional info:

    MapReduce uses HDFS, which itself is really just a file system abstraction. There are two implementations of the HDFS file system when running Hadoop in Azure: one is the local file system, the other is Azure Blob storage. Both are still HDFS; the code path for MapReduce against local-file-system HDFS or the Azure Blob file system is identical. You can specify the file split size (minimum 64 MB, maximum 100 GB, default 5 GB), so a single file will be split and read by different mappers, just like local-disk HDFS.
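
    As an illustration of the split arithmetic, a minimal Python sketch using the limits quoted above (the file sizes are made up):

        import math

        MB, GB = 1024 ** 2, 1024 ** 3

        def map_tasks(file_size, split_size=5 * GB):
            """Number of mappers that will read one file in parallel."""
            assert 64 * MB <= split_size <= 100 * GB, "split size out of quoted range"
            return math.ceil(file_size / split_size)

        print(map_tasks(1024 * GB))             # 1 TB file, default 5 GB splits -> 205 mappers
        print(map_tasks(1024 * GB, 256 * MB))   # smaller splits -> 4096 mappers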

    We have re-architected our networking infrastructure with Q10 in our datacenters to accommodate the Hadoop scenario. All up, we have an incredibly low oversubscription ratio for networking, so we can have a lot of throughput between Hadoop and Blob storage. The worker nodes (Medium VMs) will each read up to 800 Mbps from Azure Blob storage (which is running remotely); this is equivalent to how fast the VM can read off of disk. With the right storage account placement and settings, we can achieve disk speed for approximately 50 worker nodes. It’s screaming fast today, and there is some mind-bendingly fast networking coming down the pipe in the next year that will likely triple that number.
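
    Using those figures, the aggregate read bandwidth works out as follows (an idealized calculation that ignores protocol overhead):

        # Figures quoted above: 800 Mbps per Medium worker VM, and roughly
        # 50 workers before a single storage account stops keeping up.
        per_node_mbps = 800
        workers = 50

        aggregate_gbps = per_node_mbps * workers / 1000
        print(f"aggregate read bandwidth: {aggregate_gbps:.0f} Gbps")   # 40 Gbps

        # Idealized time to stream a 10 TB input set at that rate:
        input_tb = 10
        seconds = input_tb * 1e12 * 8 / (aggregate_gbps * 1e9)
        print(f"~{seconds / 60:.0f} min to stream {input_tb} TB")       # ~33 min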

    Regarding Machine Learning:
    Microsoft Azure Machine Learning (MAML) is still evolving, while Mahout has been around for some time. I don't have a personal preference between the two, to be honest; the ease of use of MAML is certainly alluring, it accommodates most of the common algorithms and packages (R and so on), and it is improving every day. You can also configure Mahout to run on top of HDInsight today in a couple of ways (links below). Official inclusion of Mahout in HDInsight is on the roadmap, but there are no firm dates or timelines as of now.
    Mahout On HDInsight - Team blog
    Mahout on HDInsight - Video Tutorial

    Hope this helps,

    Regards.


    DebarchanS - MSFT

    Friday, February 6, 2015 7:59 AM
  • Thanks, it helped a ton.

    My last question is: I recently noticed that an HDP VM is another option for using Hadoop on Azure. I'd like to know where the data is stored for this VM: on the local disk of the machine, or in Blob storage? And is the HDP VM running on Linux or Windows?


    • Edited by HESSAMZAK Friday, February 6, 2015 10:37 PM
    Friday, February 6, 2015 10:35 PM
  • Hi,

    For the HDP VM on Azure, the storage is HDFS, which is local to the nodes. The virtual machines run on Linux (CentOS, as far as I remember).

    Regards.


    DebarchanS - MSFT ( This posting is provided AS IS with no warranties, and confers no rights.)

    Saturday, February 7, 2015 8:17 AM
  • Hi,

    Do you have any more questions?

    Regards.


    Debarchan Sarkar - MSFT ( This posting is provided AS IS with no warranties, and confers no rights.)

    Thursday, February 12, 2015 5:19 AM
  • Thanks.

    So, let's say I run three Windows Servers and attach some disks to each. Then I install HDP on them and get a 3-node HDP cluster (one name node and two data nodes) with its own local HDFS. If I copy something to the local HDFS, does it remain there as long as I keep the servers up and running? Is there any limit on the amount of storage I can attach to each node?

    Is there also an option to not attach a disk to the server and have it read/write to Blob storage instead?

    Tuesday, March 3, 2015 7:12 PM
  • Hi,

    If you run an HDP cluster on Azure VMs, you retain your HDFS data as long as the VMs are up and running. In fact, you can even retain the data (the .vhd) in your storage account after you delete a VM; later, you can spin up a new VM and attach the .vhd from your storage account.

    There is a way to associate your Blob storage with HDP clusters on Azure VMs, as described in this blog - https://alexeikh.wordpress.com/2014/01/14/expanding-hdp-hadoop-file-system-to-azure-blob-storage/
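
    At a high level, the blog's approach amounts to adding your storage account key to core-site.xml so that HDP's Azure Blob file system driver can resolve wasb:// (on older builds, asv://) URIs. Here is a minimal Python sketch that emits such a fragment; the account name, container, and key are placeholders, and the exact property name and URI scheme should be verified against your HDP version:

        # Sketch: emit the core-site.xml property needed so HDP on Azure VMs
        # can address Blob storage directly. Placeholder values only; verify
        # the property name against your HDP/WASB driver version.
        account = "mystorageaccount"     # hypothetical storage account
        key = "<base64-account-key>"     # never hard-code a real key

        name = f"fs.azure.account.key.{account}.blob.core.windows.net"
        print("  <property>")
        print(f"    <name>{name}</name>")
        print(f"    <value>{key}</value>")
        print("  </property>")

        # Data would then be referenced with URIs like:
        #   wasb://mycontainer@mystorageaccount.blob.core.windows.net/data/input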

    Note that this is not a supported scenario, and you will not get official support from either Microsoft or Hortonworks. So, be careful and use it at your own risk.

    Regards.


    Debarchan Sarkar - MSFT ( This posting is provided AS IS with no warranties, and confers no rights.)

    Wednesday, March 4, 2015 9:48 AM
  • Is there any limitation on the amount of storage attached to Azure VMs? What is the pricing for this type of storage?
    Wednesday, March 4, 2015 9:21 PM
  • As far as I know, there is no limit as such on total storage; the maximum number of disks that can be attached to a virtual machine varies with the size of the virtual machine. For example, you can only attach 4 disks to a Standard A2, but you can attach 32 disks to a Standard D14 and 64 disks to a Standard G5.

    Also, the pricing for retaining them in your storage account is the same as for any other type of data you store; no change.

    If you have further questions on VHD size and pricing, I suggest you open a separate thread in the Azure forum for Virtual Machines; that will be a more appropriate place.

    Regards.


    Debarchan Sarkar - MSFT ( This posting is provided AS IS with no warranties, and confers no rights.)

    Thursday, March 5, 2015 11:58 AM
  • Hi,

    I just found out that:

    Data disks in Azure VMs are limited to the maximum size of a page blob.

    The current maximum size for a page blob is 1 TB, and there is no current plan to increase that limit in the near future. There is, of course, the ability to create disk arrays of up to 32 TB with up to 50,000 IOPS.
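
    Putting that together with the per-size disk limits mentioned earlier (4 for a Standard A2, 32 for a D14, 64 for a G5), a quick sizing sketch; the target capacity is illustrative:

        import math

        MAX_DISK_TB = 1      # page blob ceiling per data disk
        MAX_DISKS = {"Standard A2": 4, "Standard D14": 32, "Standard G5": 64}

        def disks_needed(target_tb):
            """Disks required to reach a target raw capacity per node."""
            return math.ceil(target_tb / MAX_DISK_TB)

        target_tb = 20       # illustrative raw capacity per data node
        n = disks_needed(target_tb)
        for size, limit in MAX_DISKS.items():
            verdict = "fits" if n <= limit else "too small"
            print(f"{size}: need {n} disks, can attach {limit} -> {verdict}")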

    Regards.


    Debarchan Sarkar - MSFT ( This posting is provided AS IS with no warranties, and confers no rights.)


    Friday, March 6, 2015 5:49 AM