Amazon EMR and Azure HDInsight: A Comparison

By the end of this year, spending on IT infrastructure products to be deployed in cloud environments will be $37.1 billion, International Data Corporation predicts.


According to IDC’s Worldwide Quarterly Cloud IT Infrastructure Tracker, released this month, enterprise spending on cloud will increase about 15 percent over last year. Public cloud infrastructure makes up the largest piece of the pie, with spending in that arena expected to be $23.3 billion, a nearly 19 percent increase.

One of the main reasons behind the rapid growth of the cloud market is the association of cloud with Hadoop. Running Hadoop in cloud environments has been hotly discussed: while Amazon continued to enhance the base platform of Elastic MapReduce, Azure also introduced improvements to HDInsight. Increasing awareness of Hadoop in the cloud has also brought attention to projects such as Apache Whirr and Netflix’s Genie. Recent announcements at the HadoopWorld + StrataConf summit prompted analysts to claim that Hadoop is taking over the cloud.

Why does this association (Cloud + Hadoop) rock?

Vendors providing cloud services benefit greatly from the union of cloud and Hadoop, and customers get rapidly scalable processing power and storage. The pay-per-use model also lowers the cost of innovation: businesses pay only for the storage or analytics they need, without making an upfront investment or paying to maintain a system when it is not being used. Hadoop also helps manage workloads that generate volumes of data large enough to strain your storage resources.

Yelp, a local business directory service and review site with social networking features, and an AWS customer, is using Hadoop in-house and deploying large RAID storage resources to handle the increase in its log file production. According to Yelp, it was producing up to 100GB of log files every day.

Comparison Between Amazon EMR and Azure HDInsight

AWS and Azure have both made Hadoop available in the cloud, through Elastic MapReduce and HDInsight respectively. These web services make it easy to process vast amounts of data quickly and cost-effectively. Let us look at some of the prime features of both:

Features of AWS EMR

  1. Amazon EMR provides a managed Hadoop framework that simplifies big data processing.
  2. Other popular distributed frameworks such as Apache Spark and Presto can also be run in Amazon EMR.
  3. Pricing of Amazon EMR is simple and predictable: you pay an hourly rate. A 10-node Hadoop cluster can be launched for as little as $0.15 per hour. Because Amazon EMR has native support for Amazon EC2 Spot and Reserved Instances, you can also save 50-80% on the cost of the underlying instances.
  4. It is also popular because it is easy to use. When you launch a cluster on Amazon EMR, the web service allocates the virtual server instances and configures them with the needed software for you. Within minutes you can have a cluster configured and ready to run your Hadoop application.
  5. It is resizable: the number of virtual servers in a cluster can easily be expanded or contracted depending on processing needs.
  6. Amazon EMR integrates with popular business intelligence (BI) tools such as Tableau, MicroStrategy, and Datameer. For more information, see Use Business Intelligence Tools with Amazon EMR.
  7. You can run Amazon EMR in an Amazon VPC in which you configure networking and security rules. Amazon EMR also supports IAM users and roles, which you can use to control access to your cluster and to set permissions that restrict what others can do on the cluster. For more information, see Configure Access to the Cluster.
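
To make the hourly pricing model in point 3 concrete, here is a rough sketch of how a cluster bill could be estimated. It assumes the bill has two parts, an EMR charge and an EC2 charge for the underlying instances, and that the Spot or Reserved Instance saving applies only to the EC2 part. The function name, its parameters, and the $1.00 EC2 rate in the usage note are hypothetical and chosen for illustration; only the $0.15 per hour figure and the 50-80% saving come from the text above.

```python
def estimate_cluster_cost(hours, emr_rate_per_hour, ec2_rate_per_hour, ec2_discount=0.0):
    """Estimate the cost of running a cluster for a number of hours.

    hours             -- how long the cluster runs
    emr_rate_per_hour -- EMR charge for the whole cluster, per hour
    ec2_rate_per_hour -- on-demand charge for the underlying instances, per hour
                         (hypothetical input; look up the real rate for your instances)
    ec2_discount      -- fraction saved on the instances, e.g. 0.5 for a 50% saving
    """
    if hours < 0:
        raise ValueError("hours must not be negative")
    if not 0.0 <= ec2_discount <= 1.0:
        raise ValueError("ec2_discount must be between 0 and 1")

    emr_cost = hours * emr_rate_per_hour
    # The discount is applied to the underlying instances only.
    ec2_cost = hours * ec2_rate_per_hour * (1.0 - ec2_discount)
    return emr_cost + ec2_cost
```

For example, calling `estimate_cluster_cost(8, 0.15, 1.00, ec2_discount=0.5)` models an eight-hour run at the $0.15 EMR rate with an assumed $1.00 per hour of on-demand instance cost and a 50% saving. Because you pay per hour, a cluster that is shut down when idle costs nothing for the hours it is not running.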

Features of Azure HDInsight

  1. Azure HDInsight is a service that provisions Apache Hadoop in the Azure cloud, providing a software framework designed to manage, analyze and report on big data.
  2. HDInsight clusters are configured to store data directly in Azure Blob storage, which provides low latency and increased elasticity in performance and cost choices.
  3. Unlike the first edition of HDInsight, it is now delivered on Linux, as Hadoop should be, which means access to HDP features. The cluster can be accessed via Ambari in the web browser, or directly via SSH.
  4. HDInsight has always been an elastic platform for data processing. In today’s platform, it’s even more scalable. Not only can nodes be added to and removed from a running cluster, but individual node size can be controlled, which means the cluster can be highly optimized to run the specific jobs that are scheduled.
  5. In its initial form, there were not many options for developing HDInsight processing jobs. Today, however, there are really great options available that enable developers to build data processing applications in whatever environment they prefer. For Windows developers, HDInsight has a rich plugin for Visual Studio that supports the creation of Hive, Pig, and Storm applications. For Linux or Windows developers, HDInsight has plugins for both IntelliJ IDEA and Eclipse, two very popular open-source Java IDE platforms. HDInsight also supports PowerShell, Bash, and Windows command inputs to allow for scripting of job workflows.
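
Point 2 says that HDInsight clusters store data directly in Azure Blob storage. In practice, Hadoop jobs on HDInsight usually address that data with a `wasb://` URI (or `wasbs://` for the encrypted variant) that names the container and the storage account. The helper below is a small sketch of that addressing scheme; the function name and the container, account, and path values in the usage note are hypothetical.

```python
def wasb_uri(container, account, path, secure=False):
    """Build a WASB URI pointing at a blob path in an Azure storage account.

    container -- name of the blob container (hypothetical example value below)
    account   -- name of the storage account (hypothetical example value below)
    path      -- path of the file or folder inside the container
    secure    -- use the wasbs:// scheme instead of wasb://
    """
    scheme = "wasbs" if secure else "wasb"
    # Strip any leading slash so the URI does not contain a double slash.
    clean_path = path.lstrip("/")
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{clean_path}"
```

Calling `wasb_uri("logs", "mystorage", "/2015/11/app.log")` gives a URI that a Hive or Pig job could use as an input location. Because the data lives in Blob storage and not on the cluster nodes, it remains available when the cluster is resized or deleted.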

Effective service at the best price always wins the deal, and AWS has been leading the cost wars for the last few years against all competitors, not just Microsoft. One reason behind its success is a free tier of business application operation, which includes EMR implementation and lasts for one year from sign-up. That allows you to grow your application, understand its long-term scope including spikes and dips, and then budget accordingly.

The one place where Azure HDInsight may pull ahead is in end-user tools. If your Big Data analytics team is using Excel as its front-end analysis tool, then Azure delivers a Hive ODBC driver and a Hive add-on for Excel. That’s a smart move on Azure’s part, but it can be duplicated on EMR with some front-end planning.

Whether Azure, SQL Server and Excel can really compete in BI against competition like that is definitely still up in the air. Azure needs to prove itself at the service level, not just as the infrastructure as a service (IaaS) and cloud storage provider it’s been so far.

Conclusion

Whichever service you select will depend on your specific needs and the type of workloads you have to manage. It may even be the case that different services suit the requirements of different divisions within one enterprise.

Hopefully, this comparative guide will assist you in making the right choice.