Amazon EMR: Overview, Compatible Services and Use Cases
This is part 2 of 3 in our series on Amazon Big Data Tools. See also Part 1 on Amazon Athena and Part 3 on Amazon Redshift.
What Is Amazon EMR?
Amazon EMR (Amazon Elastic MapReduce) is a service that lets you run and scale big data (data lake) workloads on demand in the Amazon cloud.
EMR is designed to replace (or augment) on-premises big data server provisioning and processing. It runs open-source products like Hadoop, Hive, Spark, Flink, Presto, and TensorFlow, working with S3 for storage (or alternatively HBase, Amazon RDS, DynamoDB, Redshift, Glacier, or HDFS for Hadoop) and EC2 instances for compute power.
It’s a fully managed data lake service that decouples data storage from compute resources. Compute clusters are scalable and available on demand, and multiple clusters can access the same datasets at once.
If you’re actively looking for a way to process big data without the on-site configuration management and provisioning overheads, and feel like you aren’t making the most of your server resources (or want to do away with them altogether and migrate to the cloud), then EMR could be the product you’re looking for.
Benefits of Using Amazon EMR
Reduce the Cost of Physical Infrastructure
As with all cloud-based compute services, EMR removes the need to purchase, maintain, run and house the physical server infrastructure that would otherwise perform big data computation on-site. You can use the same tools and services in the cloud that you currently use on-site, and pay per second for the compute cluster resources you use.
Pricing is based on the EC2 instance type and purchasing option (On-Demand, Reserved, Dedicated, or Spot EC2 instances), the number of EC2 instances deployed, and the region. This can be particularly cost-effective if you take advantage of EC2 Spot Instances, a variable-priced option for which you can set the maximum price you are willing to pay. For further information, view the slides from Lower Costs on Amazon EMR: Auto Scaling, Spot Pricing, & Expert Strategies (ANT385) at AWS re:Invent 2018.
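To make the per-second billing model concrete, here is a minimal sketch that estimates the cost of one short-lived cluster run. The hourly rates are made-up placeholders rather than real AWS prices, and the 60-second minimum charge is an assumption to check against the current EMR pricing page. It also assumes each instance incurs an EMR charge on top of its EC2 charge.

```python
def estimate_cluster_cost(instance_count, runtime_seconds,
                          ec2_rate_per_hour, emr_rate_per_hour,
                          minimum_seconds=60):
    """Estimate the cost of one cluster run that is billed per second.

    All rates are hypothetical inputs; look up the real ones for your
    instance type, purchasing option, and region.
    """
    # Assumption: a minimum billable duration applies to every instance.
    billable_seconds = max(runtime_seconds, minimum_seconds)
    # Each instance incurs the EC2 charge plus the EMR charge on top.
    rate_per_second = (ec2_rate_per_hour + emr_rate_per_hour) / 3600
    return instance_count * billable_seconds * rate_per_second


# Hypothetical example: 10 instances running for 45 minutes,
# once at an assumed On-Demand rate and once at an assumed Spot rate.
on_demand = estimate_cluster_cost(10, 45 * 60,
                                  ec2_rate_per_hour=0.20,
                                  emr_rate_per_hour=0.05)
spot = estimate_cluster_cost(10, 45 * 60,
                             ec2_rate_per_hour=0.06,
                             emr_rate_per_hour=0.05)
print(f"On-Demand: ${on_demand:.3f}, Spot: ${spot:.3f}")
```

Because the cluster is billed only while it runs, the estimate scales with runtime rather than with calendar time, which is where the saving over always-on hardware comes from.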
Save System Admin Time
One of the key benefits of using EMR is the saving in system administrators’ time that would otherwise be spent configuring and provisioning on-site servers for big data computational tasks. While you may have self-built scripts to aid in the process, these need to be tweaked for every change that’s made or new service that’s added. With EMR handling these operational details for you, less time is spent on manual admin tasks.
IDC’s white paper The Economic Benefits of Migrating Apache Spark and Hadoop to Amazon EMR surveyed a number of organisations about their EMR migration. It found that using EMR gave IT staff 49% more free time compared with managing in-house infrastructure.
Resources Are Only Used When Necessary and Are Completely Scalable
The key to EMR’s cost-saving benefits lies in the fact that storage and compute can be decoupled, which means that you can spin up and scale EC2 instances and clusters when needed, then release resources once you’re done. Inbuilt elasticity with AWS Auto Scaling means you only pay for what you need. Note that to get the same performance as in-house processing, your storage will need to be kept local.
Other Benefits
EMR includes 24/7 customer support as standard with subscription (at far less than other Spark and Hadoop vendors charge) and fast spin-up times for the instances that run your services. EMR can also be run in an Amazon VPC (Virtual Private Cloud) for increased corporate data security.
It’s worth mentioning, of course, that the benefits of EMR extend far beyond time and cost savings. EMR is used to deliver real business value through processing big data.
Complementary EMR services
You will need storage and EC2 instances to run EMR, alongside supported processing instructions for the job you want to run. To run as securely as possible, it’s recommended to run these services in an Amazon VPC.
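As a rough illustration of how these pieces fit together, the sketch below builds the request for a cluster that runs a single Spark step inside a VPC subnet and shuts down when the step finishes. The request keys follow boto3's EMR run_job_flow call. The function name, bucket paths, subnet ID, instance type, and release label are all hypothetical placeholders, and the default IAM role names are assumed to exist in your account.

```python
def build_cluster_request(name, subnet_id, script_s3_uri, log_s3_uri,
                          core_count=2, core_market="ON_DEMAND",
                          instance_type="m5.xlarge",
                          release_label="emr-5.20.0"):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    Nothing is sent to AWS here; the function only assembles the request.
    """
    return {
        "Name": name,
        "ReleaseLabel": release_label,      # assumed EMR release
        "LogUri": log_s3_uri,               # storage: logs go to S3
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            # Compute: one master node plus a group of core nodes.
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER",
                 "Market": "ON_DEMAND", "InstanceType": instance_type,
                 "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "Market": core_market, "InstanceType": instance_type,
                 "InstanceCount": core_count},
            ],
            # Security: launch the cluster inside a VPC subnet.
            "Ec2SubnetId": subnet_id,
            # Release the cluster as soon as the steps are done.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        # Processing instructions: a single spark-submit step.
        "Steps": [{
            "Name": "Run Spark job",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster",
                         script_s3_uri],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",   # assumed default roles
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_cluster_request(
    name="nightly-clickstream",                        # hypothetical
    subnet_id="subnet-0123example",                    # hypothetical
    script_s3_uri="s3://example-bucket/jobs/job.py",   # hypothetical
    log_s3_uri="s3://example-bucket/logs/",            # hypothetical
)
# With AWS credentials configured, the request could be submitted with:
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```

Keeping the request builder separate from the AWS call makes it easy to review or unit test the cluster definition before anything is launched.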
Amazon SageMaker
SageMaker allows data scientists to build and train ML algorithms with Jupyter notebooks, then create a model endpoint for production use. When configured correctly with EMR Spark clusters (particularly when used in combination with an AWS Glue Data Catalog), this can speed up and automate this particular ML pipeline. SageMaker works with popular ML frameworks such as TensorFlow, Apache MXNet and PyTorch, and has a fairly extensive library of common ML algorithms built in for speed.
Amazon CloudWatch
Keep tabs on resource allocation, efficiency, and operation by using Amazon CloudWatch with EMR. When combining CloudWatch with AWS Lambda, you can more effectively (and automatically) manage EMR running costs.
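One way this combination could work is a scheduled Lambda function that reads EMR's IsIdle metric from CloudWatch and terminates a cluster that has sat idle. The sketch below is illustrative only: the event field cluster_id, the 30-minute window, and the six-datapoint threshold are assumptions, and the boto3 calls need AWS credentials and suitable permissions to run.

```python
from datetime import datetime, timedelta, timezone


def is_cluster_idle(datapoints, min_idle_periods=6):
    """Return True if the newest `min_idle_periods` datapoints all report idle.

    `datapoints` follows the shape CloudWatch returns from
    get_metric_statistics: dicts with "Timestamp" and "Average" keys.
    """
    ordered = sorted(datapoints, key=lambda d: d["Timestamp"])
    recent = ordered[-min_idle_periods:]
    # Too few datapoints means we cannot be sure, so treat it as busy.
    return (len(recent) >= min_idle_periods
            and all(d["Average"] >= 1.0 for d in recent))


def lambda_handler(event, context):
    """Terminate the cluster named in the event if it has been idle."""
    import boto3  # imported here so the helper above works without boto3

    cluster_id = event["cluster_id"]  # hypothetical event field
    now = datetime.now(timezone.utc)
    stats = boto3.client("cloudwatch").get_metric_statistics(
        Namespace="AWS/ElasticMapReduce",
        MetricName="IsIdle",
        Dimensions=[{"Name": "JobFlowId", "Value": cluster_id}],
        StartTime=now - timedelta(minutes=30),
        EndTime=now,
        Period=300,                 # five-minute buckets
        Statistics=["Average"],
    )
    if is_cluster_idle(stats["Datapoints"]):
        boto3.client("emr").terminate_job_flows(JobFlowIds=[cluster_id])
        return {"terminated": cluster_id}
    return {"terminated": None}
```

The idle check is kept as a pure function so the decision rule can be tested locally, while the handler only wires it to CloudWatch and EMR.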
Amazon QuickSight
QuickSight is Amazon’s Business Intelligence tool that lets you visualise datasets from an intuitive dashboard. If you’re working with Presto or Spark with EMR, there is now native connector support for these products in QuickSight.
Use Cases for EMR
When you have rigid in-house cluster infrastructure
Running the same complete cluster infrastructure any time you need to analyse big data, no matter what the terms of analysis are, is a waste of resources. Unless you’ve managed to configure elastic clustering on your own machines, EMR is more resource-efficient.
Here’s why (and how) AOL moved from in-house clickstream data processing over to EMR.
When managing Hadoop is becoming a hassle
Many businesses currently rely on on-premises Hadoop for their big data processing needs. With Hadoop fully managed on EMR, you remove the time and complexity involved in in-house Hadoop management, such as upgrades, maintenance, node failures, and operational costs.
When you have long wait times for data processing
If running processing in-house is taking far too long, leaving people waiting around for results or other teams waiting to add to the queue, then you have the option to either a) add new physical hardware to speed up the process or b) choose a completely scalable EMR solution to complete tasks in a fraction of the time. In the IDC white paper we referenced earlier, switching to EMR reduced the average time to run queries by 40% and increased the number of queries run each day by 33%.
When you want to outsource physical server infrastructure
It’s no secret that big server rooms are being converted into meeting rooms and nap pods. Removing physical hardware infrastructure frees up space in the workplace, the staff time required to configure physical servers, and the budget needed to purchase and maintain them. If removing unnecessary physical IT infrastructure is a business goal, EMR helps achieve it.
For more on Amazon EMR, including blog posts like ‘Exploring data warehouse tables with machine learning and Amazon SageMaker notebooks’ and videos like ‘AWS re:Invent 2018: A Deep Dive into What's New with Amazon EMR’, head over to the EMR Resources page at AWS.