Open source big data platform E-MapReduce

E-MapReduce (EMR) is a cloud-native open source big data platform. It provides easy-to-integrate open source compute and storage engines such as Hadoop, Hive, Spark, StarRocks, Flink, Presto, and ClickHouse. EMR compute resources can be scaled elastically. EMR supports three deployment modes: on ECS, on ACK, and serverless.

EMR Serverless StarRocks Starter Edition: 59 yuan for the first month

EMR Serverless Spark

Vector Retrieval Milvus Edition

Recent updates

Latest release: EMR Workflow commercialization announcement

Major release: Full-link data lake development and governance solution 2.0, a major upgrade

Latest release: Alibaba Cloud Smart Data Lake selected as one of the "Top Ten Hard-Core Technologies" at the 6th Digital China Construction Summit

Latest release: Alibaba Cloud EMR 2.0, redefining the new generation of open source big data platforms

Major feature: EMR launches EMR Doctor, an intelligent operations and diagnostics system for open source big data platforms

Product cases

Ximalaya: How Ximalaya unlocks the value of its data behind the rapid growth of the "ear economy"

Yuanfudao: The evolution of OLAP based on EMR StarRocks

Shuidichou: Hands-on experience with Alibaba Cloud EMR StarRocks

Qutoutiao: A powerful tool for cutting costs and raising efficiency, Spark Remote Shuffle Service best practices

Shuhe Technology: Shuhe's cloud data lake best practices

Product line

Choose the product type that matches your business scenario

EMR on ECS

EMR on ECS means running EMR on Elastic Compute Service (ECS). It combines EMR's big data processing capabilities with the elastic deployment advantages of ECS, so you can configure and manage EMR clusters more flexibly and adapt to complex data processing and analysis scenarios. With EMR on ECS, you can quickly create, manage, and operate EMR clusters and use compute and storage resources more efficiently.

EMR on ACK

EMR on ACK offers a new way to build a big data platform. You deploy open source big data services on Alibaba Cloud Container Service for Kubernetes (ACK) and rely on ACK's strengths in service deployment and containerized application management. This reduces the effort spent operating the underlying cluster resources, so you can focus on your big data jobs.

EMR Serverless StarRocks

E-MapReduce Serverless StarRocks is a fully managed, serverless StarRocks service provided by Alibaba Cloud. It offers a high-performance, unified data analysis experience across all scenarios, with full-lifecycle capabilities such as out-of-the-box use, elastic scaling, monitoring and management, and slow SQL diagnosis and analysis. Its kernel is 100% compatible with open source StarRocks, and its performance is 3 to 5 times that of traditional OLAP engines, helping enterprises build big data applications efficiently.

Product advantages

New generation open source big data platform

Stable, reliable and easy to use

Node fault tolerance and automatic compensation for failed nodes; scale-out of 100 nodes in under 2 minutes; comprehensive service inspection and event notification; EMR Studio provides a one-stop development and scheduling service

Storage-compute separation architecture

Archive on demand to save 20% to 40% of storage costs; OSS-HDFS storage requires no operations and maintenance; DLF lake management provides lifecycle management for lake data

Significant cost savings

Time-based elastic scaling and preemptible instances further reduce costs; cost-effective instance types such as DPCA/AMD are supported; the on ACK mode supports mixed resources

Leading open source ecosystem

Deeply optimized Spark with a 100% performance improvement; provides Hadoop, Spark, Hive, Kafka, HBase, Presto, Impala, Hudi, StarRocks, and other open source components

Product Functions

Cluster management: Convenient cluster management with fast cluster creation and scaling

Cluster creation: Through the console or OpenAPI, you can quickly create multiple types of clusters, such as Hadoop, Dataflow, Data Science, Druid, and ZooKeeper, without worrying about the underlying hardware and software deployment

Cluster scaling: You can easily increase or reduce the number of nodes in an existing cluster through the console or OpenAPI

Service configuration: You can quickly add services provided by EMR, monitor service status, and configure and operate service components

Elastic expansion: Through the console, you can easily add the components you need, and configure and operate them

Dynamic scaling: You can set multiple elastic scaling strategies to scale cluster compute resources automatically and reduce TCO
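
As a rough illustration of how a load-based elastic scaling strategy can decide on a node count, consider the sketch below. The `ScalingRule` type, its field names, and the thresholds are hypothetical and chosen only for illustration. They are not the rule schema used by the EMR console or OpenAPI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingRule:
    """Hypothetical load-based rule; field names are illustrative only."""
    scale_out_above: float  # utilization (0..1) above which nodes are added
    scale_in_below: float   # utilization (0..1) below which nodes are removed
    step: int               # nodes added or removed per adjustment
    min_nodes: int          # lower bound on cluster size
    max_nodes: int          # upper bound on cluster size


def target_node_count(current: int, utilization: float, rule: ScalingRule) -> int:
    """Return the node count the cluster should move to for one utilization reading."""
    if utilization > rule.scale_out_above:
        desired = current + rule.step
    elif utilization < rule.scale_in_below:
        desired = current - rule.step
    else:
        desired = current
    # Clamp to the configured bounds so the cluster never leaves its allowed range.
    return max(rule.min_nodes, min(rule.max_nodes, desired))


# Assumed example values, not recommended settings.
rule = ScalingRule(scale_out_above=0.8, scale_in_below=0.3, step=2,
                   min_nodes=3, max_nodes=10)
print(target_node_count(4, 0.9, rule))    # busy cluster grows by one step
print(target_node_count(4, 0.1, rule))    # idle cluster shrinks, bounded by min_nodes
print(target_node_count(10, 0.95, rule))  # cluster already at max_nodes stays put
```

Clamping to a minimum and maximum keeps a noisy metric from shrinking the cluster below what core services need or growing it past a cost ceiling. A time-based strategy could reuse the same clamp with a schedule instead of a utilization reading.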

Operations and Maintenance Center: Comprehensive O&M tools that help you quickly discover and locate cluster problems

Cluster monitoring: Displays a rich set of service and host monitoring metrics, so you can quickly locate service and host exceptions through visualization

Event Center: EMR provides a wide range of event types, including service events, managed service events, and host events, so you can detect cluster problems faster and more precisely and trace them to their source

Job list: Collects statistics on the running status of cluster jobs, so you can quickly compare abnormal jobs and tune job and cluster performance

Diagnostic analysis: Provides HDFS hot and cold data analysis and small-file analysis as a basis for service performance optimization

Rich components: A wide range of components to choose from according to your needs

DataLake: A more flexible, reliable, and efficient big data computing cluster

Spark: A new-generation, memory-based, distributed open source big data framework that supports offline and real-time computing, SQL syntax, and machine learning

Hive: An offline data processing system built on Hadoop that provides structured table management on HDFS and SQL-like query syntax for data analysis and processing

Kafka: A high-throughput distributed publish-subscribe messaging system with excellent performance and reliability

Flink: A distributed processing engine for streaming and batch data. EMR provides an enterprise-grade big data computing platform built on Ververica Platform, a commercial product based on Apache Flink, to deliver real-time computing services

Presto: An open source distributed SQL query engine suited to interactive query analysis

ClickHouse: An open source OLAP analysis engine. Main features: columnar storage, MPP architecture, SQL support, real-time data updates, and index support

Hudi: A data lake storage format that supports updating and deleting data and consuming change data

StarRocks: An open source MPP OLAP analysis engine that supports sub-second queries and multi-table joins

Comprehensive cloud ecosystem support: Deep integration with other Alibaba Cloud products

DataWorks support: Provides a professional, efficient, secure, and reliable one-stop big data development and governance platform

MaxCompute support: Supports reading from and writing to Alibaba Cloud MaxCompute

Elasticsearch support: The ES-Hadoop plug-in is built into Hadoop, so Elasticsearch operations are supported directly

Data Lake Formation (DLF) support: By default, EMR uses DLF for metadata management, which simplifies metadata management in data lake scenarios

Object Storage Service (OSS) support: All compute engines in EMR support OSS as storage and can use it like HDFS. JindoFS accelerates OSS reads and writes

CloudMonitor support: You can configure monitoring of EMR services and jobs in CloudMonitor to get alerted to problems quickly

SLS support: Simple Log Service (SLS) is supported as a real-time data input source, and direct operation through the SDK is provided

Alibaba Cloud messaging products support: Supports reading from and writing to message queues, message services, and similar products, and provides SDK wrappers for convenience

Application scenarios

Big data migration

Cloud native data lake

Intelligent recommendation

Interactive analysis

Big data migration: Keep your open source technology stack and connect the Alibaba Cloud ecosystem with the open source big data ecosystem

Big data migration faces the following challenges: the big data technology stack is complex, and data volumes and job counts are large; open source community versions evolve quickly, and compatibility issues between open source components, as well as community bugs, can disrupt work and business continuity. With EMR, a migration can keep the open source technology stack while connecting the Alibaba Cloud ecosystem with the open source big data ecosystem

Capabilities

Community open source software

High scenario coverage and continuity with your existing technology stack and organizational structure, which keeps migration risk and cost low

Mature and stable

Components use the latest stable community versions and pass stability and compatibility verification tests, making them more stable and reliable

Integration with the Alibaba Cloud ecosystem

You can integrate flexibly with the Alibaba Cloud ecosystem according to your business requirements and technical roadmap, for example DataWorks + EMR for data development, PAI + EMR for machine learning, and MaxCompute + Data Lake Formation + EMR for an integrated lakehouse

Multiple migration options

Depending on data size and budget, you can migrate to the cloud efficiently and on schedule through Lightning Cube, dedicated lines, or the public network

Recommended combination

ECS

Object Storage OSS

Cloud native data lake: Reduce costs, eliminate idle resources, and serve multiple data analysis scenarios

As the volume of data an enterprise accumulates grows rapidly, data analysis and use run into three problems: the cost of growing data volumes; idle resources caused by coupling compute and storage; and, because there are many analysis scenarios (offline computing, stream computing, interactive analysis, machine learning, and so on), multiple engines that frequently move data, which leads to data inconsistency and extra cost. EMR and its supporting cloud native data solutions address these problems effectively

Capabilities

Compute-storage separation

Data is stored in OSS object storage, and the data lake is accelerated through EMR JindoFS or Alluxio. This decouples compute from storage, maintains and improves compute efficiency, and avoids idle resources

Tiered data storage

Jindo Table builds on the tiered storage capabilities of OSS, linking big data workloads to the underlying storage. It matches data to different OSS storage classes according to whether the data is hot, warm, or cold, to maximize cost savings
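
The idea of matching hot, warm, and cold data to different OSS storage classes can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the 30-day and 180-day cut-offs are hypothetical, and the sketch does not reflect Jindo Table's actual policy or API. Standard, Infrequent Access (IA), and Archive are existing OSS storage classes.

```python
from datetime import date
from enum import Enum


class StorageClass(Enum):
    STANDARD = "Standard"      # hot data, read frequently
    INFREQUENT_ACCESS = "IA"   # warm data, read occasionally
    ARCHIVE = "Archive"        # cold data, rarely read


# Hypothetical cut-offs, in days since a partition was last read.
WARM_AFTER_DAYS = 30
COLD_AFTER_DAYS = 180


def choose_storage_class(last_access: date, today: date) -> StorageClass:
    """Pick a storage class from how long a partition has been idle."""
    idle_days = (today - last_access).days
    if idle_days >= COLD_AFTER_DAYS:
        return StorageClass.ARCHIVE
    if idle_days >= WARM_AFTER_DAYS:
        return StorageClass.INFREQUENT_ACCESS
    return StorageClass.STANDARD


# Made-up partitions and access dates, for illustration only.
today = date(2024, 1, 31)
partitions = {
    "dt=2024-01-30": date(2024, 1, 30),
    "dt=2023-12-01": date(2023, 12, 1),
    "dt=2023-01-01": date(2023, 1, 1),
}
for name, last_access in partitions.items():
    print(name, choose_storage_class(last_access, today).value)
```

In practice the idle time would come from access statistics such as the hot and cold data analysis mentioned under diagnostic analysis, and the cut-offs would be tuned against retrieval cost and latency for each storage class.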

Integration with multiple compute engines

The EMR data lake solution can connect to real-time computing, PAI, MaxCompute, Elasticsearch, and other compute engines, avoiding repeated data movement

Unified metadata management and access control across engines

EMR + Data Lake Formation provides unified metadata management, and DLF can uniformly control permissions across different EMR compute engines

Recommended combination

Object Storage OSS

Data Lake Formation (DLF)

Intelligent recommendation: Build a machine learning and algorithm platform on EMR to accelerate model training

Collect user behavior data and build a machine learning and algorithm platform on EMR: build the feature, model, and algorithm libraries with Hive/Spark, train models with TensorFlow/PyTorch on an EMR Data Science cluster, and serve online model inference through PAI EAS

Capabilities

Stable and reliable

A recommendation system solution proven in large-scale production across the industry, with a significant increase in click-through rate (CTR)

Flexible and controllable

Suitable for both offline and real-time recommendation scenarios. You can flexibly choose open source components according to your requirements and technology stack

Good integration

Integrates quickly with PAI EAS, PAI Studio, and related services, and lets you flexibly choose a suitable ECS GPU instance type

Recommended combination

Machine learning platform PAI

Interactive analysis: Fully compatible with open source features and quick to integrate with other EMR components

Collect user behavior data from the app, process and analyze it on the EMR platform, and write it into ClickHouse to support flexible, fast analysis by business teams and improve the efficiency of business decisions

Capabilities

Queries in seconds

ClickHouse returns query results within seconds, supporting fast calls from the application layer and ad hoc analysis by analysts

Flexible queries

Complete SQL support for flexible analysis of business logic

Easy operations and maintenance

Semi-managed clusters provide cluster management, monitoring, scaling, and other O&M capabilities, so more engineers can focus on business development

Recommended combination

ECS

Customer Stories

Why E-MapReduce?

Shuhe Technology

Driven by big data technology, Shuhe Technology provides intelligent financial solutions for financial institutions. As the company's business expanded, the large volume of data requests from business teams strained its existing cluster. To relieve that pressure, Shuhe used Alibaba Cloud EMR to build a data lake suited to its current business. The data lake stores structured and unstructured data at any scale and supports analysis with different types of engines, giving the business a better basis for decisions.

Kaishu Storytelling

Kaishu Storytelling is a well-known children's content and education brand in China. Initially, Kaishu Storytelling relied on a third-party SaaS platform for operations support. Turnaround times were long, presentation was rigid, and room for custom development was very limited, which made it hard to support the team's fine-grained operations. After adopting Alibaba Cloud E-MapReduce as its big data platform, the business team achieved precise user reach, real-time feedback, and proactive service. Business grew significantly after the system went live.

Yeahmobi report

Yeahmobi is a technology-driven global intelligent marketing services company. Its main services include performance marketing, brand services, and comprehensive marketing solutions across categories. The Yeahmobi report system is built on Alibaba Cloud OSS + E-MapReduce: all data is stored centrally in OSS, compute resources are adjusted dynamically, and E-MapReduce supports offline analysis. This meets the requirements of its business scenarios and reduced overall TCO by 30%.

Liulishuo

Liulishuo is a technology-driven education company. Most data sources for its offline computing jobs are business databases, and as data volume grew, the existing approach could no longer meet near-real-time query requirements. After choosing Alibaba Cloud E-MapReduce and adopting CDC + Delta Lake, Liulishuo cut costs by nearly 80%. The time spent on early-morning database ingestion dropped sharply: all database ingestion without special requirements now completes within one hour, greatly improving efficiency.

Comparison between E-MapReduce and a self-built Hadoop cluster

Comparison dimension: Cost
Alibaba Cloud E-MapReduce: Pay-as-you-go resources, flexible adjustment of cluster resources, tiered data storage, and high resource utilization. No additional software license fees.
Self-built Hadoop cluster: Resources are estimated in advance and relatively fixed, so utilization is low. If a commercial Hadoop distribution is used, additional license fees apply.

Comparison dimension: Performance
Alibaba Cloud E-MapReduce: Performance is significantly better than the open source version. For example, EMR SparkSQL performs six times as fast as the open source version.
Self-built Hadoop cluster: The open source community version is used, and you must tune performance yourself.

Comparison dimension: Ease of use
Alibaba Cloud E-MapReduce: A Hadoop cluster launches within minutes, so you can respond to business needs quickly.
Self-built Hadoop cluster: You purchase servers and deploy Hadoop ecosystem components, which takes several weeks.

Comparison dimension: Elasticity
Alibaba Cloud E-MapReduce: Clusters can be created and destroyed on demand for individual jobs. Cluster resources can be adjusted automatically by schedule or by cluster load. With the JindoFS compute-storage separation architecture, compute and storage can easily be scaled independently.
Self-built Hadoop cluster: Compute and storage are coupled. Resources are relatively fixed and cannot be adjusted flexibly.

Comparison dimension: Security
Alibaba Cloud E-MapReduce: Enterprise-grade multi-tenant resource management; table-, column-, and row-level permission control; log auditing; and data encryption.
Self-built Hadoop cluster: Multi-tenant management must be configured yourself. The capability is incomplete and cannot meet enterprise requirements.

Comparison dimension: Reliability
Alibaba Cloud E-MapReduce: Proven in large-scale enterprise environments, upgraded in step with open source releases, and validated by professional compatibility tests, providing a better experience than the community version.
Self-built Hadoop cluster: You must upgrade the open source version yourself, verify compatibility across component versions, and fix community bugs yourself.

Comparison dimension: Service
Alibaba Cloud E-MapReduce: A team of senior big data experts provides professional after-sales support.
Self-built Hadoop cluster: The community version has no service support. A commercial Hadoop distribution requires additional license and service fees.

Product updates

2017-01-18 New product: EMR supports exclusive packages
2017-01-18 New feature/specification: EMR supports Spark 2.0
2017-02-23 New feature/specification: Unified Hive table metadata management
2017-04-26 New region/zone: E-MapReduce launched in North China 3
2017-05-03 New feature/specification: Execution plan scheduling enhancement
2017-05-10 New feature/specification: Jobs support retry
2017-06-15 New feature/specification: Cluster configuration management system released
2017-07-29 Price adjustment: E-MapReduce prices on the international site lowered across the board
2017-08-05 New region/zone: E-MapReduce launched in Germany
2017-08-08 New feature/specification: EMR big data instance type plan released
2017-11-23 New feature/specification: Gateway feature launched
2018-01-03 New region/zone: E-MapReduce launched in the Hong Kong and Hohhot regions
2018-03-01 New feature/specification: Fine-grained permission control component Ranger released
2018-03-03 New region/zone: E-MapReduce launched in Mumbai, India
2018-03-20 Feature optimization: E-MapReduce supports instance type upgrades
2018-04-18 New feature/specification: E-MapReduce supports switching clusters from pay-as-you-go to monthly subscription
2018-07-05 New feature/specification: Hadoop elastic scaling released
2018-09-06 New feature/specification: E-MapReduce performance greatly optimized
2018-09-22 New feature/specification: EMR TensorFlow released
2018-11-01 Feature optimization: One-click expansion of EMR cloud data disks
2018-11-01 New feature/specification: EMR supports preemptible instances
2018-12-07 New feature/specification: EMR APM feature released
2019-01-21 New feature/specification: EMR upgraded to Hadoop 2.8.5
2019-03-15 New feature/specification: EMR Knox supports Flink and adapts to YARN Timeline Service
2019-06-08 New region/zone: E-MapReduce launched in the Chengdu region
2019-07-09 New feature/specification: New EMR workflow supports streaming job types
2019-07-28 New feature/specification: EMR-3.22.0 released
2019-07-28 New feature/specification: EMR adds the Kudu component
2019-08-01 New feature/specification: EMR releases JindoFS, a self-developed big data storage service built for cloud storage
2019-11-18 New feature/specification: E-MapReduce 3.24.0 released
2019-11-18 New feature/specification: EMR supports TensorFlow on Spark
November 2011 New feature/specification: E-MapReduce 3.23.0 released
2019-11-21 New feature/specification: EMR China and international sites launch 6th-generation ECS enterprise instances
2020-06-30 New feature/specification: E-MapReduce supports the new-generation ECS big data instance type D2S
2020-07-31 New feature/specification: E-MapReduce adds the ECS big data instance type D2C
2021-01-05 New feature/specification: E-MapReduce adds Remote Shuffle Service
2021-02-28 New region/zone: E-MapReduce officially launched in North China 6 (Ulanqab)
2021-04-01 New feature/specification: E-MapReduce releases the ClickHouse cluster type
2021-05-01 New feature/specification: E-MapReduce launches the latest-generation local SSD instances
2021-07-31 New feature/specification: E-MapReduce semi-managed ClickHouse cluster released
2021-09-30 New feature/specification: E-MapReduce on ACK released
2022-01-26 Feature optimization: New E-MapReduce console released
2022-03-28 New feature/specification: StarRocks launched on the new console for a fast, unified analysis experience
2022-04-15 New feature/specification: JindoData released with support for the OSS-HDFS service
2022-04-22 New feature/specification: StarRocks upgraded to 2.1.1, greatly improving query performance
2022-06-16 New feature/specification: DataLake cluster launched
2022-07-15 New feature/specification: DataWorks supports EMR DataLake clusters
2022-07-22 New feature/specification: EMR Doctor launched
2022-08-04 New feature/specification: Data Service released
2022-08-16 New feature/specification: New console supports more advanced features
2022-09-02 New feature/specification: Add elastic scaling rules
2022-09-07 New feature/specification: Enable automatic compensation
2022-09-09 New feature/specification: Clone cluster
2022-10-17 New feature/specification: Custom cluster launched
2022-11-17 New feature/specification: OSS-HDFS supports hot and cold tiered storage
2022-11-25 New feature/specification: DataWorks supports EMR custom clusters
2022-12-20 New feature/specification: EMR Doctor real-time risk detection
2022-12-28 New feature/specification: EMR Doctor cluster daily report
2023-02-14 New feature/specification: Access link and port feature upgrade
2023-02-24 New feature/specification: Data disk encryption supported
2023-03-02 New feature/specification: New configuration parameters for elastic scaling rules
2023-03-08 New feature/specification: New application configuration export feature
2023-03-15 New feature/specification: New system events in Event Center
2023-03-23 New feature/specification: Storage-compute separated clusters created by default
2023-04-10 New feature/specification: Serverless StarRocks free public beta released
2023-04-23 New feature/specification: Visual management of YARN partitions on the console
2023-05-15 New feature/specification: View cluster daily report and analysis
2023-05-23 New feature/specification: Serverless StarRocks commercial release
2023-05-26 New feature/specification: Support for Yitian cloud servers (in testing)
2023-06-21 New feature/specification: Operate StarRocks instances through SQL Editor
2023-07-04 New feature/specification: EMR Workflow public beta
2023-07-14 New feature/specification: Stateless clusters supported
2023-07-14 New feature/specification: EMR on ACK supports Data Science clusters
2023-08-09 New feature/specification: New elastic scaling management module
2023-08-17 New feature/specification: YARN partition and queue association supported
2023-08-29 New feature/specification: New cluster template feature
2023-09-12 New feature/specification: StarRocks supports storage-compute separation
2023-10-24 New feature/specification: Support for Yitian cloud servers
2023-11-21 New feature/specification: New alert management feature
2023-11-24 New feature/specification: New node health status
2023-12-05 New feature/specification: Connect to StarRocks instances through DMS
2023-12-08 New feature/specification: Connect to StarRocks instances through Quick BI
2023-12-21 New feature/specification: Workflow adds workspace management
2023-12-25 New feature/specification: Workflow supports submitting to a cluster template for execution
2024-01-10 New feature/specification: Workflow commercial release

Introduction and Practice

EMR open source big data migration

Best practices for migrating HDFS, Hive, and Kafka to EMR

EMR elastic computing in practice

Best practices for flexible, low-cost offline big data analysis with EMR

Real-time statistics on incremental data

Compute real-time statistics on incremental data with Serverless StarRocks

Minute-level near-real-time analysis

Minute-level near-real-time analysis with Serverless StarRocks

Documentation and Tools

Product documentation

How to get started, use, and develop with EMR

Quick start

Quickly create clusters and run jobs

Cluster types

Cluster selection and planning for different scenarios

FAQ

Summary of common errors and problems