E-MapReduce (EMR) is a cloud-native open source big data platform. It provides easy-to-use, integrated open source computing and storage engines such as Hadoop, Hive, Spark, StarRocks, Flink, Presto, and ClickHouse. EMR computing resources can be scaled elastically, and EMR supports three deployment modes: on ECS, on ACK, and serverless.
Choose a product type according to your business scenario
EMR on ECS
EMR on ECS means running EMR on Elastic Compute Service (ECS) instances. It combines the big data processing capabilities of EMR with the flexible deployment of ECS, so you can configure and manage EMR clusters more flexibly and adapt to complex data processing and analysis scenarios. With EMR on ECS, you can quickly create, manage, operate, and maintain EMR clusters, and use computing and storage resources more efficiently.
EMR on ACK
EMR on ACK provides a new way to build a big data platform. You can deploy open source big data services on Alibaba Cloud Container Service for Kubernetes (ACK) and rely on ACK for service deployment and container application management. This reduces the effort spent operating and maintaining the underlying cluster resources, so you can focus on big data tasks.
EMR Serverless StarRocks
E-MapReduce Serverless StarRocks is the fully managed serverless StarRocks service provided by Alibaba Cloud. It provides a high-performance, all-scenario, extremely fast, and unified data analysis experience, with full-lifecycle capabilities such as out-of-the-box use, elastic scaling, monitoring and management, and slow SQL diagnosis and analysis. The kernel is 100% compatible with StarRocks, and its performance is 3 to 5 times higher than that of traditional OLAP engines, helping enterprises build big data applications efficiently.
Product advantages
New generation open source big data platform
Stable, reliable and easy to use
Supports node fault tolerance and compensation; expands capacity by 100 nodes in under 2 minutes; provides comprehensive service inspection and event notification; EMR Studio provides a one-stop development and scheduling service
Storage-compute separation architecture
Archive on demand to save 20% to 40% of storage costs; OSS-HDFS storage requires no operation and maintenance; DLF lake management provides lifecycle management for data in the lake
Significant cost savings
Time-based elastic scaling, and preemptible instances can further reduce costs; supports cost-effective instance types such as DPCA/AMD; EMR on ACK mode supports mixed resources
Leading open source ecosystem
Deeply optimized Spark with a 100% performance improvement; provides Hadoop, Spark, Hive, Kafka, HBase, Presto, Impala, Hudi, StarRocks, and other open source components
Product Functions
Cluster management: Convenient cluster management, with fast cluster creation and capacity expansion
Cluster creation: Through the console or OpenAPI, you can quickly create multiple types of clusters, such as Hadoop, Dataflow, Datascience, Druid, ZooKeeper, and other open source big data frameworks, without caring about the underlying hardware and software deployment
Cluster expansion: The number of nodes in an existing cluster can be easily increased or reduced through the console or OpenAPI
Service configuration: You can quickly add services provided by EMR, monitor the status of services, and configure and operate service components
Elastic expansion: Through the console, you can easily add the required components, and configure and operate the components
Dynamic scaling: Multiple elastic scaling strategies can be set to scale cluster computing resources automatically and reduce TCO
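The elastic scaling strategies listed under cluster management can be illustrated with a small sketch. This is not the EMR scaling API: the `TimeRule` and `LoadRule` classes, the `target_nodes` function, and all thresholds are hypothetical names and values chosen only to show how a time-based rule and a load-based rule might be combined into one target node count.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class TimeRule:
    """Hypothetical time-based rule: hold `nodes` task nodes inside a daily window."""
    start: time
    end: time
    nodes: int


@dataclass
class LoadRule:
    """Hypothetical load-based rule driven by a utilization metric in [0.0, 1.0]."""
    scale_out_above: float
    scale_in_below: float
    step: int


def target_nodes(current, now, utilization, time_rules, load_rule, min_nodes, max_nodes):
    """Return the desired task-node count, clamped to [min_nodes, max_nodes]."""
    desired = current
    # Assumption of this sketch: time-based rules take priority, first match wins.
    for rule in time_rules:
        if rule.start <= now < rule.end:
            desired = rule.nodes
            break
    else:
        # No time window matched, so fall back to the load-based rule.
        if utilization > load_rule.scale_out_above:
            desired = current + load_rule.step
        elif utilization < load_rule.scale_in_below:
            desired = current - load_rule.step
    return max(min_nodes, min(max_nodes, desired))


if __name__ == "__main__":
    nightly_batch = [TimeRule(time(1, 0), time(6, 0), nodes=20)]
    load = LoadRule(scale_out_above=0.8, scale_in_below=0.3, step=2)
    print(target_nodes(4, time(12, 0), 0.9, nightly_batch, load, 2, 16))
```

Clamping the result to a minimum and maximum keeps a misconfigured rule from shrinking the cluster to zero or growing it without bound, which is where most of the cost control comes from.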
Operation and maintenance center: Comprehensive operation and maintenance tools that help you quickly discover and locate cluster problems
Cluster monitoring: Provides a rich display of service and host monitoring metrics, so you can quickly locate service and host exceptions through visualization
Event center: EMR provides a wide range of event types, including service events, managed service events, and host events, so you can detect cluster problems more quickly and precisely and trace them to their source
Job list: Collects statistics on the running status of cluster jobs, helps you quickly identify abnormal jobs, and facilitates job and cluster performance tuning
Diagnostic analysis: Provides HDFS hot and cold data analysis and small file analysis as a basis for service performance optimization
Rich components: A wide range of supported components that you can select according to your needs
DataLake: A more flexible, reliable, and efficient big data computing cluster
Spark: A new-generation, memory-based distributed open source big data framework that supports offline and real-time computing, as well as SQL syntax and machine learning
Hive: An offline data processing system based on Hadoop that provides structured table management on HDFS and SQL-like query syntax for data analysis and processing
Kafka: A high-throughput distributed publish-subscribe messaging system with excellent performance and reliability
Flink: A distributed processing engine for streaming and batch data. EMR provides an enterprise-level big data computing platform built on Ververica Platform, a commercial product based on Apache Flink, to provide real-time computing services
ClickHouse: An open source OLAP analysis engine. Main features: columnar storage, MPP architecture, SQL support, real-time data updates, index support, and more
Hudi: A data lake storage format that supports updating and deleting data and consuming changed data
StarRocks: An open source MPP-architecture OLAP analysis engine that supports sub-second data queries and multi-table joins
Comprehensive cloud ecosystem support: Deeply integrated with the product environment on Alibaba Cloud
Supports DataWorks: Provides customers with a professional, efficient, secure, and reliable one-stop big data development and governance platform
Supports MaxCompute: Supports reading data from and writing data to Alibaba Cloud MaxCompute
Supports Elasticsearch: The ES-Hadoop plug-in is built into Hadoop, so Elasticsearch operations are supported directly
Supports Data Lake Formation (DLF): By default, EMR uses DLF for metadata management, which simplifies metadata management in data lake scenarios
Supports Object Storage Service (OSS): All computing engines in EMR support OSS as storage, and OSS can be used in the same way as HDFS. JindoFS is used to speed up OSS reads and writes
Supports CloudMonitor: Monitoring of EMR services and jobs can be configured in CloudMonitor so that alerts are raised quickly when problems occur
Supports SLS: SLS is supported as a real-time data input source, and an SDK is provided for direct operation
Supports Alibaba Cloud messaging products: Supports reading from and writing to message queues, message services, and similar products, and provides SDK wrappers for convenience
Application scenarios
Big data migration
Cloud native data lake
Intelligent recommendation
Interactive analysis
Keep your open source technology stack and connect the Alibaba Cloud ecosystem with the open source big data ecosystem
Big data migration faces the following challenges: the big data technology stack is complex, and the data scale and number of tasks are large; open source community versions iterate quickly, and compatibility issues between open source components and community bugs affect the continuity of work and business. With EMR, a migration can keep the existing open source technology stack while connecting the Alibaba Cloud ecosystem and the open source big data ecosystem
Capabilities provided
Adopts community open source software
High scenario coverage, continuity of the existing technology stack and organizational structure, and low migration risk and cost
Mature and stable
Components use the latest stable community versions and are made more stable and reliable through component stability and compatibility verification tests
Ecosystem integration with Alibaba Cloud
EMR can be flexibly integrated with the Alibaba Cloud ecosystem according to business requirements and technical direction, such as DataWorks + EMR for data development, PAI + EMR for machine learning, and MaxCompute + Data Lake Formation + EMR for lakehouse integration
Multiple migration schemes
Depending on data size and budget, you can migrate to the cloud efficiently and on schedule through Lightning Cube, leased lines, or the public network
Reduce costs, eliminate idle resources, and serve multiple data analysis scenarios
As the data accumulated by enterprises grows rapidly, data analysis and use run into several problems: the cost of an expanding data scale; idle resources caused by the coupling of computing and storage; and, because there are many analysis scenarios such as offline computing, stream computing, interactive analysis, and machine learning, multiple engines frequently move data, which leads to data inconsistency and extra cost. These problems can be addressed effectively with EMR and the accompanying cloud native data solutions
Capabilities provided
Computing-storage separation
Data is stored in OSS object storage, and the data lake is accelerated through EMR JindoFS or Alluxio. This decouples computing from storage, maintains computing efficiency, and avoids idle resources
Data tiered storage
Jindo Table builds on the tiered storage capabilities of OSS, connecting big data workloads with the underlying storage capabilities and matching hot, warm, and cold data to different OSS storage classes to maximize cost savings
Integration with multiple computing engines
The EMR data lake solution can connect to real-time computing, PAI, MaxCompute, Elasticsearch, and other computing engines to avoid repeated data movement
Unified metadata management and control across engines
EMR + Data Lake Formation provides unified metadata management, and DLF can uniformly control the permissions of different EMR computing engines
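The data tiered storage capability in the list above can be illustrated with a minimal sketch. It is not Jindo Table's actual interface: the `storage_class` and `plan_tiers` functions, the partition names, and the 30-day and 180-day thresholds are hypothetical. The sketch only shows the idea of mapping hot, warm, and cold data to the OSS Standard, Infrequent Access (IA), and Archive storage classes by last access date.

```python
from datetime import date

# Hypothetical thresholds in days since last access; tune them to real access patterns.
WARM_AFTER_DAYS = 30
COLD_AFTER_DAYS = 180


def storage_class(last_access: date, today: date) -> str:
    """Map a partition's last access date to an OSS storage class name."""
    age_days = (today - last_access).days
    if age_days >= COLD_AFTER_DAYS:
        return "Archive"   # cold data
    if age_days >= WARM_AFTER_DAYS:
        return "IA"        # warm data (Infrequent Access)
    return "Standard"      # hot data


def plan_tiers(partitions: dict, today: date) -> dict:
    """Return the suggested storage class for each partition."""
    return {name: storage_class(seen, today) for name, seen in partitions.items()}


if __name__ == "__main__":
    today = date(2024, 7, 1)
    partitions = {  # hypothetical table partitions and their last access dates
        "dt=2024-06-30": date(2024, 6, 30),
        "dt=2024-04-01": date(2024, 5, 1),
        "dt=2023-01-01": date(2023, 2, 1),
    }
    for name, tier in plan_tiers(partitions, today).items():
        print(name, tier)
```

In practice the last access information would come from table or file access statistics, and moving data out of an archive class usually involves a restore delay, so thresholds should be conservative for data that is still queried occasionally.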
Build a machine learning and algorithm platform with EMR to accelerate model training
Collect user behavior data, build a machine learning and algorithm platform with EMR, build the feature, model, and algorithm libraries with Hive/Spark, train models with TensorFlow/PyTorch on an EMR Data Science cluster, and serve models online through PAI EAS
Capabilities provided
Stable and reliable
A recommendation system solution that has been proven in large-scale production in the industry and has significantly increased the click-through rate (CTR)
Flexible and controllable
Applicable to both offline and real-time recommendation scenarios. Users can flexibly select open source components according to their requirements and technology stack
Good integration
Integrates quickly with PAI EAS, PAI Studio, and other services, and you can flexibly select the appropriate ECS GPU instance type
Fully compatible with the features of the open source version and quickly integrated with other EMR components
Collect user behavior data from the app, process and analyze it on the EMR platform, and write it into ClickHouse to support flexible, fast analysis by upper-layer business and improve the efficiency of business decisions
Capabilities provided
Second-level queries
ClickHouse returns query results within seconds, which supports fast calls from the application layer as well as manual analysis
Flexible queries
Full SQL support and flexible analysis of business logic
Easy operation and maintenance
Semi-managed clusters provide cluster management, monitoring, capacity expansion, and other operation and maintenance capabilities, so that more technical staff can focus on business development
Driven by big data technology, Shuhe Technology provides intelligent financial solutions for financial institutions. As the company's business expanded, the large volume of data requirements raised by business teams tested the capacity of the existing cluster. To relieve that pressure, Shuhe used Alibaba Cloud EMR to build a data lake suited to its current business. The data lake stores structured and unstructured data at any scale and is analyzed with different types of engines, providing a better basis for business decisions.
Kaishu Storytelling
Kaishu Storytelling is a well-known children's content and education brand in China. Initially, Kaishu Storytelling relied on a third-party SaaS platform for operations support. Delivery cycles were long, reports were rigid, and customization was very limited, which made it difficult to support the team's fine-grained operations. After adopting Alibaba Cloud's E-MapReduce big data platform, the business team achieved precise user reach, real-time feedback, and proactive service, and business grew significantly after the system went online.
Yeahmobi
Yeahmobi is a technology-driven global intelligent marketing service company. Its main services include performance marketing, brand services, and comprehensive marketing solutions across categories. Yeahmobi's reporting is built on Alibaba Cloud OSS + E-MapReduce: all data is stored in OSS, computing resources are adjusted dynamically, and E-MapReduce supports offline analysis. This meets the requirements of its business scenarios and reduces overall TCO by 30%.
Liulishuo
Liulishuo is a technology-driven education company. In its offline computing tasks, most data sources come from business databases. As data volume grew, the existing approach could no longer meet near-real-time query requirements. According to Liulishuo, after choosing Alibaba Cloud E-MapReduce and adopting CDC + Delta Lake, costs were reduced by nearly 80%. The time required for the early-morning database ingestion was greatly reduced, so that all database ingestion without special requirements completes within one hour, greatly improving efficiency.
Comparison between E-MapReduce and a self-built Hadoop cluster

Cost
Alibaba Cloud E-MapReduce: Pay-as-you-go resources, flexible adjustment of cluster resources, tiered data storage, and high resource utilization. No additional software license fees.
Self-built Hadoop cluster: Resources are estimated in advance and relatively fixed, so resource utilization is low. If a Hadoop distribution is adopted, additional license fees must be paid.

Performance
Alibaba Cloud E-MapReduce: Performance is significantly better than the open source version. For example, EMR SparkSQL delivers six times the performance of the open source version.
Self-built Hadoop cluster: The open source community version is used, and performance has to be tuned in-house.

Ease of use
Alibaba Cloud E-MapReduce: A Hadoop cluster can be launched within minutes to respond quickly to business needs.
Self-built Hadoop cluster: Servers must be purchased and Hadoop ecosystem components deployed, which takes several weeks.

Elasticity
Alibaba Cloud E-MapReduce: Clusters can be started and released temporarily for individual jobs. Cluster resources can be adjusted automatically by time period or cluster load. With the JindoFS-based computing-storage separation architecture, computing and storage resources can be expanded separately.
Self-built Hadoop cluster: Computing and storage are coupled. Resources are relatively fixed and cannot be adjusted flexibly.

Security
Alibaba Cloud E-MapReduce: Supports enterprise-level multi-tenant resource management, permission control at the table, column, and row level, log auditing, and data encryption.
Self-built Hadoop cluster: Multi-tenant management must be configured in-house. The capability is incomplete and cannot meet enterprise-level requirements.

Reliability
Alibaba Cloud E-MapReduce: Verified in large-scale, enterprise-level environments, upgraded in step with open source versions, and validated through professional compatibility verification tests to provide a better experience than the community version.
Self-built Hadoop cluster: You need to upgrade open source versions yourself, verify the compatibility of each component version, and fix community bugs yourself.

Service
Alibaba Cloud E-MapReduce: A professional, senior team of big data experts provides after-sales support.
Self-built Hadoop cluster: The community version comes with no service support. For a Hadoop distribution, additional license and service fees must be paid.
Product updates
2017-01-18 New products: EMR supports exclusive package
2017-01-18 New features/specifications: EMR supports Spark 2.0
2017-02-23 New features/specifications: Unified Hive table metadata management supported
2017-04-26 New region/availability zone: E-MapReduce launched in China North 3
2017-05-03 New features/specifications: Execution plan scheduling enhanced
2017-05-10 New features/specifications: Retry support added for jobs
2017-06-15 New features/specifications: Cluster configuration management system released
2017-07-29 Price adjustment: E-MapReduce prices on the international site reduced across the board
2017-08-05 New region/availability zone: E-MapReduce launched in the Germany region
2017-08-08 New features/specifications: EMR big data instance types released
2017-11-23 New features/specifications: Gateway feature launched
2018-01-03 New region/availability zone: E-MapReduce launched in the Hong Kong and Hohhot regions
2018-03-01 New features/specifications: Fine-grained permission control component Ranger released
2018-03-03 New region/availability zone: E-MapReduce launched in Mumbai, India
2018-03-20 Function optimization: E-MapReduce supports instance type upgrades
2018-04-18 New features/specifications: E-MapReduce supports switching clusters from pay-as-you-go to monthly subscription
2018-07-05 New features/specifications: Hadoop elastic scaling released
2018-09-06 New features/specifications: E-MapReduce performance greatly optimized
2018-09-22 New features/specifications: EMR TensorFlow released
2018-11-01 Function optimization: One-click expansion of EMR cloud data disks
2018-11-01 New features/specifications: EMR supports preemptible instances
2018-12-07 New features/specifications: EMR APM feature released
2019-01-21 New features/specifications: EMR upgraded to Hadoop 2.8.5
2019-03-15 New features/specifications: EMR Knox supports Flink and adapts to YARN Timeline Service
2019-06-08 New region/availability zone: E-MapReduce launched in the Chengdu region
2019-07-09 New features/specifications: New EMR workflow supports streaming job types
2019-07-28 New features/specifications: Latest EMR version EMR-3.22.0 released
2019-07-28 New features/specifications: Kudu component added to EMR
2019-08-01 New features/specifications: EMR released JindoFS, a self-developed big data storage service customized for cloud storage
2019-11-18 New features/specifications: E-MapReduce 3.24.0 released
2019-11-18 New features/specifications: EMR supports TensorFlow on Spark
November 2011 New features/specifications: E-MapReduce 3.23.0 released
2019-11-21 New features/specifications: EMR on the China and international sites launched 6th-generation ECS enterprise instances
2020-06-30 New features/specifications: E-MapReduce supports the new-generation ECS big data instance D2S
2020-07-31 New features/specifications: Alibaba Cloud E-MapReduce adds the ECS big data instance type D2C
2021-01-05 New features/specifications: Alibaba Cloud E-MapReduce adds Remote Shuffle Service
2021-02-28 New region/availability zone: Alibaba Cloud E-MapReduce officially launched in China North 6 (Ulanqab)
2021-04-01 New features/specifications: Alibaba Cloud E-MapReduce releases the ClickHouse cluster type
2021-05-01 New features/specifications: Alibaba Cloud E-MapReduce launched the latest-generation local SSD instances