Open source big data platform E-MapReduce
Open source big data platform E-MapReduce (EMR) is a cloud-native open source big data platform. It provides customers with easy-to-integrate open source big data computing and storage engines such as Hadoop, Hive, Spark, StarRocks, Flink, Presto, and ClickHouse. EMR computing resources can be scaled elastically and controlled flexibly. EMR offers three deployment modes: on ECS, on ACK, and serverless.

Product line

Choose a product type according to your business scenario

Product advantages

New generation open source big data platform
Stable, reliable and easy to use
Supports node fault tolerance and compensation; expands capacity by 100 nodes in under 2 minutes; comprehensive service inspection and event notification; EMR Studio provides one-stop development and scheduling
Storage-compute separation architecture
Archive on demand to save 20%-40% of storage costs; OSS-HDFS storage requires no operations and maintenance; DLF lake management provides lifecycle management for lake data
Significant cost savings
Time-based elastic scaling, with preemptible instances to further reduce costs; supports cost-effective instance types such as DPCA/AMD; EMR on ACK mode supports mixed resources
Leading open source ecosystem
Deeply optimized Spark with a 100% performance improvement; provides Hadoop, Spark, Hive, Kafka, HBase, Presto, Impala, Hudi, StarRocks and other open source components

Product Functions

Cluster management: convenient cluster management with fast cluster creation and capacity expansion
Cluster creation: Use the console or OpenAPI to quickly create multiple types of clusters, such as Hadoop, Dataflow, Datascience, Druid, and ZooKeeper, without managing the underlying hardware and software deployment.
Cluster expansion: Increase or reduce the number of nodes in an existing cluster through the console or OpenAPI.
Service configuration: Quickly add services provided by EMR, monitor service status, and configure and operate service components.
Elastic expansion: Add the required components through the console, then configure and operate them.
Dynamic capacity expansion: Set multiple elastic scaling strategies to scale cluster computing resources automatically and reduce TCO.

Operation and Maintenance Center: comprehensive operation and maintenance tools for quickly discovering and locating cluster problems
Cluster monitoring: Displays a rich set of service and host monitoring metrics, so service and host exceptions can be located quickly through visualization.
Event Center: EMR provides a wide range of event types, including service events, managed service events, and host events, so cluster problems can be detected faster and more precisely and traced back to their source.
Job List: Collects statistics on the running status of cluster jobs, makes abnormal jobs quick to compare, and supports job and cluster performance tuning.
Diagnostic analysis: Provides HDFS hot and cold data analysis and small file analysis as a basis for service performance optimization.

Rich components: a wide range of components that you can select according to your needs
DataLake: A more flexible, reliable and efficient big data computing cluster.
Spark: A new-generation, memory-based, distributed open source big data framework that supports offline and real-time computing, as well as SQL syntax and machine learning.
Hive: A Hadoop-based offline data processing system that provides structured table management on HDFS and SQL-like query syntax for data analysis and processing.
Kafka: A high-throughput distributed publish-subscribe messaging system with excellent performance and reliability.
Flink: A distributed processing engine for streaming and batch data. EMR provides an enterprise-level big data computing platform built on Ververica Platform, a commercial product based on Apache Flink, to provide real-time computing services.
Presto: An open source distributed SQL query engine, suitable for interactive query analysis.
ClickHouse: An open source OLAP analysis engine. Main features: column storage, MPP architecture, SQL support, real-time data updates, index support, and more.
Hudi: A data lake storage format that supports updating and deleting data and consuming changed data.
StarRocks: An open source MPP OLAP analysis engine that supports sub-second data queries and multi-table joins.

Comprehensive cloud ecosystem support: deep integration with the Alibaba Cloud product environment
Support DataWorks: Provides customers with a professional, efficient, secure and reliable one-stop big data development and governance platform.
Support MaxCompute: Supports reading and writing data in Alibaba Cloud MaxCompute.
Support Elasticsearch: The ES-Hadoop plug-in is built into Hadoop, so Elasticsearch operations are supported directly.
Support Data Lake Formation (DLF): By default, EMR uses DLF for metadata management, which simplifies metadata management in data lake scenarios.
Support object storage OSS: All computing engines in EMR support OSS as storage, which can be used like HDFS. JindoFS accelerates OSS data reads and writes.
Support cloud monitoring: Monitoring of EMR services and jobs can be configured in cloud monitoring so that problems raise alarms quickly.
Support SLS: SLS is supported as a real-time data input source, and an SDK is provided for direct operations.
Support Alibaba Cloud messaging products: Supports reading from and writing to message queues, message services, and similar products, with SDK wrappers provided for convenience.
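
The small file analysis listed under diagnostic analysis can be pictured with a short sketch. The code below is not EMR's implementation. It is a minimal illustration that assumes a file listing is already available as (path, size) pairs, and the names `FileEntry`, `small_file_report`, and the 16 MiB threshold are hypothetical choices made for this example.

```python
from dataclasses import dataclass

# Assumed cutoff for what counts as a "small" file (hypothetical, 16 MiB).
SMALL_FILE_THRESHOLD = 16 * 1024 * 1024


@dataclass(frozen=True)
class FileEntry:
    """One file from a storage listing (hypothetical structure)."""
    path: str
    size_bytes: int


def small_file_report(entries, threshold=SMALL_FILE_THRESHOLD):
    """Count files and small files per parent directory."""
    counts = {}
    for entry in entries:
        # Parent directory is everything before the last "/".
        directory = entry.path.rsplit("/", 1)[0] or "/"
        total, small = counts.get(directory, (0, 0))
        is_small = 1 if entry.size_bytes < threshold else 0
        counts[directory] = (total + 1, small + is_small)
    return {
        directory: {
            "files": total,
            "small_files": small,
            "small_ratio": small / total,
        }
        for directory, (total, small) in counts.items()
    }


if __name__ == "__main__":
    listing = [
        FileEntry("/warehouse/t1/part-0", 1024),
        FileEntry("/warehouse/t1/part-1", 256 * 1024 * 1024),
        FileEntry("/warehouse/t2/part-0", 10),
    ]
    for directory, stats in sorted(small_file_report(listing).items()):
        print(directory, stats)
```

Directories with a high small-file ratio are the usual candidates for compaction, which is the kind of optimization basis the diagnostic analysis function is described as providing.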

Application scenarios

Big data migration
Cloud native data lake
Intelligent recommendation
Interactive analysis
Big data migration
Continue the open source technology stack and link the Alibaba Cloud ecosystem with the open source big data ecosystem
Big data migration faces the following challenges: the big data technology stack is complex, and the data volume and number of tasks are large; open source community versions iterate quickly, and compatibility issues between open source components, along with community bugs, can disrupt work and business continuity. With EMR, a migration can keep the open source technology stack while linking the Alibaba Cloud ecosystem with the open source big data ecosystem.
Capabilities provided
Adopts community open source software
High scenario coverage; the existing technology stack and organizational structure carry over, keeping migration risk and cost low
Mature and stable
Components use the latest stable community versions and are made more stable and reliable through component stability and compatibility verification tests
Ecosystem integration with Alibaba Cloud
Integrates flexibly with the Alibaba Cloud ecosystem according to business requirements and technical direction, for example DataWorks+EMR for data development, PAI+EMR for machine learning, and MaxCompute+Data Lake Formation+EMR for lakehouse integration
Multiple migration schemes
Depending on data size and budget, you can migrate to the cloud efficiently and on schedule through Lightning Cube, dedicated lines, or the public network
Recommended combination
Cloud native data lake
Reduce costs, eliminate idle resources, and serve multiple data analysis scenarios
As enterprises' accumulated data grows rapidly, data analysis and use run into several problems: the cost of expanding data scale; idle resources caused by coupling computing and storage; and, because of varied analysis scenarios such as offline computing, stream computing, interactive analysis, and machine learning, multiple engines frequently moving data, which leads to data inconsistency and extra cost. EMR and its supporting cloud-native data solutions can effectively solve these problems.
Capabilities provided
Storage-compute separation
Data is stored in OSS object storage, and data lake access is accelerated through EMR JindoFS or Alluxio. This decouples computing from storage, maintains computing efficiency, and avoids idle resources
Tiered data storage
Jindo Table builds on the tiered storage capabilities of OSS, ties big data workloads to the underlying storage capabilities, and matches OSS storage classes to hot, warm, and cold data to maximize cost savings
Connects to multiple computing engines
The EMR data lake solution can connect to real-time computing, PAI, MaxCompute, Elasticsearch and other computing engines to avoid repeated data movement
Unified metadata management and control across engines
EMR+Data Lake Formation provides unified metadata management, and DLF uniformly controls the permissions of different EMR computing engines
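
The idea behind tiered data storage, matching a storage class to how hot the data is, can be sketched as follows. This is an illustration only, not Jindo Table's actual policy: the age thresholds (30 and 90 days), the tier-to-class mapping, and the function name `choose_tier` are assumptions made for this example. The storage class names follow the OSS classes Standard, IA (Infrequent Access), and Archive.

```python
from datetime import datetime

# Hypothetical rules: (maximum age in days, tier label, OSS storage class).
TIER_RULES = [
    (30, "hot", "Standard"),
    (90, "warm", "IA"),
]
# Anything older than the last rule falls back to the cold tier (assumed).
COLD_TIER = ("cold", "Archive")


def choose_tier(last_access: datetime, now: datetime):
    """Return (tier, storage_class) based on days since last access."""
    age_days = (now - last_access).days
    for max_age_days, tier, storage_class in TIER_RULES:
        if age_days < max_age_days:
            return tier, storage_class
    return COLD_TIER


if __name__ == "__main__":
    now = datetime(2024, 1, 31)
    for last_access in (datetime(2024, 1, 30), datetime(2023, 12, 1), datetime(2023, 1, 1)):
        print(last_access.date(), choose_tier(last_access, now))
```

In practice the thresholds would be tuned against access patterns and the retrieval cost of each storage class; the sketch only shows the shape of the decision.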
Recommended combination
Intelligent recommendation
Build a machine learning and algorithm platform with EMR to accelerate model training
Collect user behavior data; build a machine learning and algorithm platform with EMR; build the machine learning feature library, model library and algorithm library with Hive/Spark; train models with TensorFlow/PyTorch on an EMR Data Science cluster; and serve models online through PAI EAS inference services
Capabilities provided
Stable and reliable
A recommendation system solution verified at production scale in the industry, which has significantly increased click-through rate (CTR)
Flexible and controllable
Applies to both offline and real-time recommendation scenarios. Users can flexibly select open source components according to their requirements and technology stack
Good integration
Integrates quickly with PAI EAS, PAI Studio and similar services, and lets you flexibly select a suitable ECS GPU instance type
Recommended combination
Interactive analysis
Fully compatible with open source version features, and quickly integrated with other EMR components
Collect user behavior data from the app, process and analyze it on the EMR platform, and write it into ClickHouse to support flexible, fast analysis for upper-layer business and improve the efficiency of business decisions
Capabilities supported
Second-level queries
ClickHouse returns query results within seconds, supporting fast calls from the application layer and manual analysis
Flexible query
Complete SQL statement support and flexible business logic analysis
Easy operation and maintenance
Semi-managed clusters provide cluster management, monitoring, capacity expansion and other operation and maintenance capabilities, freeing more technical staff for business development
Recommended combination

Customer Stories

Why E-MapReduce?
Shuhe Technology
Driven by big data technology, Shuhe Technology provides intelligent financial solutions for financial institutions. As the company's business expanded, the large volume of data requirements from business teams strained the existing cluster. To relieve that pressure, Shuhe used Alibaba Cloud EMR to build a data lake suited to its current business. The data lake stores structured and unstructured data at any scale and uses different types of engines for analysis, giving business development a better basis for decision-making.
Kaishu Storytelling
Kaishu Storytelling is a well-known children's content and education brand in China. Initially, Kaishu Storytelling used a third-party SaaS platform for operations support. Turnaround cycles were long, reporting was rigid, and customization was very limited, so the platform struggled to support the team's fine-grained operations needs. After adopting Alibaba Cloud's E-MapReduce big data platform, the business team achieved precise user reach, real-time feedback and proactive service. After the system went online, business grew significantly.
Yeahmobi
Yeahmobi is a technology-driven global intelligent marketing service company. Its main services include performance marketing, brand services, and integrated marketing solutions across categories. Yeahmobi's reporting is built on Alibaba Cloud OSS+E-MapReduce: all data is stored in OSS, computing resources are adjusted dynamically, and E-MapReduce supports offline analysis. This meets the requirements of its business scenarios and reduces overall TCO by 30%.
Liulishuo
Liulishuo is a technology-driven education company. Most data sources for its offline computing tasks come from business databases. As data volume grew, the existing setup could no longer meet near-real-time query requirements. Liulishuo reports that after choosing Alibaba Cloud E-MapReduce and adopting CDC+Delta Lake, costs fell by nearly 80%. The time spent on early-morning database ingestion dropped sharply, so all database ingestion without special requirements now completes within one hour, greatly improving efficiency.

Comparison between open source big data platform E-MapReduce and a self-built Hadoop cluster

Comparison dimension: Cost
Alibaba Cloud E-MapReduce: Pay-as-you-go resources, flexible adjustment of cluster resources, tiered data storage, and high resource utilization. No additional software license fees.
Self-built Hadoop cluster: Resources are estimated in advance and relatively fixed, so resource utilization is low. If a Hadoop distribution is adopted, additional license fees must be paid.

Comparison dimension: Performance
Alibaba Cloud E-MapReduce: Performance is significantly better than the open source version. For example, EMR SparkSQL delivers six times the performance of the open source version.
Self-built Hadoop cluster: Uses the open source community version, and performance must be optimized in-house.

Comparison dimension: Ease of use
Alibaba Cloud E-MapReduce: A Hadoop cluster is launched within minutes, so business needs are met quickly.
Self-built Hadoop cluster: Servers must be purchased and Hadoop ecosystem components deployed, which takes several weeks.

Comparison dimension: Elasticity
Alibaba Cloud E-MapReduce: Clusters can be started and destroyed on demand for individual jobs. Cluster resources can be adjusted automatically according to a time schedule or cluster load. With the JindoFS-based storage-compute separation architecture, computing and storage resources can be expanded independently.
Self-built Hadoop cluster: Computing and storage are coupled. Resources are relatively fixed and cannot be adjusted flexibly.

Comparison dimension: Security
Alibaba Cloud E-MapReduce: Supports enterprise-level multi-tenant resource management, table-, column-, and row-level permission control, log audit, and data encryption.
Self-built Hadoop cluster: Multi-tenant management must be configured in-house. The capability is incomplete and cannot meet enterprise-level requirements.

Comparison dimension: Reliability
Alibaba Cloud E-MapReduce: Verified in large-scale enterprise environments, upgraded in step with open source releases, and validated by professional compatibility tests, providing a better experience than the community version.
Self-built Hadoop cluster: You must upgrade the open source version yourself, verify the compatibility of each component version, and fix community bugs yourself.

Comparison dimension: Service
Alibaba Cloud E-MapReduce: A professional team of senior big data experts provides after-sales support.
Self-built Hadoop cluster: The community version has no service support. For a Hadoop distribution, you must pay additional license and service fees.
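
The elasticity comparison above mentions adjusting cluster resources according to cluster load. A minimal sketch of such a load-based rule is shown below. It is not EMR's scaling engine or API: the `ScalingRule` fields, the utilization metric, and the `target_nodes` function are hypothetical names introduced only to illustrate how a threshold rule with node limits behaves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingRule:
    """Hypothetical load-based scaling rule."""
    scale_out_above: float  # add nodes when utilization exceeds this
    scale_in_below: float   # remove nodes when utilization drops below this
    step: int               # nodes added or removed per adjustment
    min_nodes: int          # lower bound on cluster size
    max_nodes: int          # upper bound on cluster size


def target_nodes(current: int, utilization: float, rule: ScalingRule) -> int:
    """Return the node count the rule asks for, clamped to the limits."""
    if utilization > rule.scale_out_above:
        desired = current + rule.step
    elif utilization < rule.scale_in_below:
        desired = current - rule.step
    else:
        desired = current
    return max(rule.min_nodes, min(rule.max_nodes, desired))


if __name__ == "__main__":
    rule = ScalingRule(scale_out_above=0.8, scale_in_below=0.3,
                       step=2, min_nodes=2, max_nodes=10)
    for nodes, load in [(4, 0.9), (4, 0.5), (4, 0.1), (10, 0.95)]:
        print(nodes, load, "->", target_nodes(nodes, load, rule))
```

A self-built cluster with coupled computing and storage cannot shrink this way without moving data, which is why the comparison pairs elastic scaling with storage-compute separation.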

Product Dynamics

2017-01-18 New products
EMR supports exclusive package
2017-01-18 New features/specifications
EMR supports Spark 2.0
2017-02-23 New features/specifications
Unified Hive table metadata management supported
2017-04-26 New region/New availability zone
E-MapReduce launched in the North China 3 region
2017-05-03 New features/specifications
Execution plan scheduling enhanced
2017-05-10 New features/specifications
Jobs support retry
2017-06-15 New features/specifications
Cluster configuration management system released
2017-07-29 Price adjustment
E-MapReduce prices on the International Site lowered across the board
2017-08-05 New region/New availability zone
E-MapReduce launched in the Germany region
2017-08-08 New features/specifications
EMR overall plan for big data instance types released
2017-11-23 New features/specifications
Gateway function launched
2018-01-03 New region/New availability zone
E-MapReduce launched in the Hong Kong and Hohhot regions
2018-03-01 New features/specifications
Fine-grained permission control component Ranger released
2018-03-03 New region/New availability zone
E-MapReduce launched in Mumbai, India
2018-03-20 Function optimization
E-MapReduce supports instance type upgrades
2018-04-18 New features/specifications
E-MapReduce supports switching clusters from pay-as-you-go to monthly subscription
2018-07-05 New features/specifications
Hadoop elastic scaling released
2018-09-06 New features/specifications
E-MapReduce performance greatly optimized
2018-09-22 New features/specifications
EMR TensorFlow released
2018-11-01 Function optimization
One-click expansion of EMR cloud data disks
2018-11-01 New features/specifications
EMR supports preemptible instances
2018-12-07 New features/specifications
EMR APM function released
2019-01-21 New features/specifications
EMR upgraded to Hadoop 2.8.5
2019-03-15 New features/specifications
EMR Knox supports Flink and adapts to YARN Timeline Service
2019-06-08 New region/New availability zone
E-MapReduce launched in the Chengdu region
2019-07-09 New features/specifications
New EMR workflow supports streaming job types
2019-07-28 New features/specifications
Latest EMR version EMR-3.22.0 released
2019-07-28 New features/specifications
EMR adds the Kudu component
2019-08-01 New features/specifications
EMR releases JindoFS, a self-developed big data storage service customized for cloud storage
2019-11-18 New features/specifications
E-MapReduce 3.24.0 released
2019-11-18 New features/specifications
EMR supports TensorFlow on Spark
New functions/specifications from November 2011
E-MapReduce 3.23.0 released
2019-11-21 New features/specifications
EMR China/International Site launches 6th-generation ECS enterprise instances
2020-06-30 New features/specifications
E-MapReduce supports the new-generation ECS big data instance D2S
2020-07-31 New features/specifications
Alibaba Cloud E-MapReduce adds the ECS big data instance type D2C
2021-01-05 New features/specifications
Alibaba Cloud E-MapReduce adds Remote Shuffle Service
2021-02-28 New region/New availability zone
Alibaba Cloud E-MapReduce officially launched in North China 6 (Ulanqab)
2021-04-01 New features/specifications
Alibaba Cloud E-MapReduce releases the ClickHouse cluster type
2021-05-01 New features/specifications
Alibaba Cloud E-MapReduce launches the latest-generation local SSD instances
2021-07-31 New features/specifications
E-MapReduce semi-managed ClickHouse cluster released
2021-09-30 New features/specifications
E-MapReduce on ACK newly released
2022-01-26 Function optimization
New E-MapReduce console released
2022-03-28 New features/specifications
StarRocks launched on the new console, dedicated to building a fast, unified analysis experience
2022-04-15 New features/specifications
JindoData released with support for the OSS-HDFS service
2022-04-22 New features/specifications
StarRocks upgraded to 2.1.1, greatly improving query performance
2022-06-16 New features/specifications
Data lake cluster launched
2022-07-15 New features/specifications
DataWorks supports EMR DataLake clusters
2022-07-22 New features/specifications
Doctor launched
2022-08-04 New features/specifications
Data Service released
2022-08-16 New features/specifications
The new console supports more advanced features
2022-09-02 New features/specifications
Elastic scaling rules added
2022-09-07 New features/specifications
Automatic compensation can be enabled
2022-09-09 New features/specifications
Cluster cloning
2022-10-17 New features/specifications
Custom cluster launched
2022-11-17 New features/specifications
OSS-HDFS supports hot and cold tiered storage
2022-11-25 New features/specifications
DataWorks supports EMR custom clusters
2022-12-20 New features/specifications
EMR Doctor real-time risk detection
2022-12-28 New features/specifications
EMR Doctor cluster daily report
2023-02-14 New features/specifications
Access link and port function upgraded
2023-02-24 New features/specifications
Data disk encryption supported
2023-03-02 New features/specifications
New configuration parameters for elastic scaling rules
2023-03-08 New features/specifications
New application configuration export function
2023-03-15 New features/specifications
New system events in Event Center
2023-03-23 New features/specifications
Storage-compute separated clusters can be created by default
2023-04-10 New features/specifications
Serverless StarRocks free public beta released
2023-04-23 New features/specifications
Visual management of YARN partitions supported on the console
2023-05-15 New features/specifications
View cluster daily reports and analysis
2023-05-23 New features/specifications
Serverless StarRocks commercially released
2023-05-26 New features/specifications
Support for Yitian cloud servers (under test)
2023-06-21 New features/specifications
Operate StarRocks instances through SQL Editor
2023-07-04 New features/specifications
EMR Workflow public beta
2023-07-14 New features/specifications
Stateless clusters supported
2023-07-14 New features/specifications
EMR on ACK supports Data Science clusters
2023-08-09 New features/specifications
New elastic scaling management module
2023-08-17 New features/specifications
YARN partition and queue association supported
2023-08-29 New features/specifications
New cluster template function
2023-09-12 New features/specifications
StarRocks supports storage-compute separation
2023-10-24 New features/specifications
Yitian cloud servers supported
2023-11-21 New features/specifications
New alarm management function
2023-11-24 New features/specifications
New node health status
2023-12-05 New features/specifications
Connect to StarRocks instances through DMS
2023-12-08 New features/specifications
Connect to StarRocks instances through Quick BI
2023-12-21 New features/specifications
Workflow adds workspace management
2023-12-25 New features/specifications
Workflow supports submission to cluster templates for execution
2024-01-10 New features/specifications
Workflow commercially released

Introduction and Practice

EMR Open Source Big Data Migration Zone
Best practices for migrating HDFS, Hive, and Kafka to EMR
EMR elastic computing practice
Best practices for elastic, low-cost offline big data analysis with EMR
Real-time statistics on incremental data
Real-time statistics on incremental data with Serverless StarRocks
Minute-level near-real-time analysis
Minute-level near-real-time analysis with Serverless StarRocks

Documentation and Tools