EMR Serverless Spark

    EMR Serverless Spark is a high-performance lakehouse product for Data+AI. It provides enterprises with one-stop data platform services, including task development, debugging, scheduling, and operations, greatly simplifying the end-to-end process of data processing and model training. It is also 100% compatible with the open-source Spark ecosystem and integrates seamlessly into existing data platforms. With EMR Serverless Spark, enterprises can focus on data processing, analysis, and model training and tuning to improve productivity.


    Product advantages

    Cloud native fast computing engine

    The built-in Spark Native Engine delivers three times the performance of the open-source version, and the built-in enterprise-grade Celeborn (Remote Shuffle Service) supports PB-scale shuffle data while reducing total computing resource costs by up to 30%.

    Flexible resource management

    Resource scheduling offers second-level elasticity. Resources can be allocated on demand with a minimum granularity of 1 core and metered precisely at the task or queue level, maximizing the flexibility and efficiency of resource usage.

    Data and AI

    It provides a development and runtime environment fully compatible with PySpark/Python, supports Python machine-learning libraries as well as Spark MLlib, and offers managed handling of third-party Python dependency libraries.

    Ecological compatibility

    It has strong compatibility and integration capabilities: it supports DLF and Hive Metastore data catalogs; is compatible with Paimon, Iceberg, Hudi, Delta, and other mainstream lake formats; interfaces with mainstream schedulers such as Airflow and DolphinScheduler; supports Kerberos/LDAP authentication and Ranger authorization; and also supports task submission from DataWorks and dbt, covering user needs in all aspects.
    Product Functions
    SQL Editor
    The SQL Editor provides an integrated development environment for writing, debugging, and executing Spark SQL code, enabling efficient data analysis that yields key insights and decision support.
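    For illustration, a query run in the SQL Editor might look like the following; the table and column names are hypothetical, not part of the product:

```sql
-- Hypothetical table and columns, for illustration only:
-- daily order counts and totals from a lakehouse table
SELECT order_date,
       COUNT(*)    AS order_cnt,
       SUM(amount) AS total_amount
FROM   sales.orders
WHERE  order_date >= DATE '2024-01-01'
GROUP  BY order_date
ORDER  BY order_date;
```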
    Notebook
    Notebook provides an interactive working environment for data analysts, data scientists, and data engineers, with support for PySpark and Markdown development. You can write code, run queries, visualize data, and view results in real time.
    Workflow
    Workflow orchestrates and runs different types of tasks (such as PySpark, SQL, Notebook, and Spark JAR) within a workspace, making it easy to build data pipelines. Both grid and topology (dependency) views are provided for convenient workflow management.
    Resource Management
    In resource management, you can add queues to isolate and manage resources, creating separate production and development queues for different business teams to run their tasks.
    Custom Environment
    When submitting PySpark tasks or running notebooks, you can use a custom environment to manage third-party Python libraries. Serverless Spark automatically installs and deploys the dependencies for you, simplifying environment preparation.
    Task History
    It provides rich task-instance metrics to help you understand task execution, including the cost metric CU, resource metrics such as MB-seconds and vCore-seconds, and the Spark UI and log files for each task.
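    To make the resource metrics concrete, here is a minimal sketch of the generic arithmetic behind vCore-seconds and MB-seconds (resource multiplied by elapsed time). This is an illustration only, not the product's official metering logic:

```python
# Sketch: generic resource-metric arithmetic for a finished task instance.
# These are the standard "resource x elapsed time" definitions, not the
# official EMR Serverless Spark metering formulas.

def vcore_seconds(num_executors: int, vcores_per_executor: int, seconds: int) -> int:
    """Total vCore-seconds consumed across all executors."""
    return num_executors * vcores_per_executor * seconds

def mb_seconds(num_executors: int, mb_per_executor: int, seconds: int) -> int:
    """Total memory-time footprint in MB-seconds."""
    return num_executors * mb_per_executor * seconds

# A task with 4 executors, each with 2 vCores and 8192 MB, running for 600 s:
print(vcore_seconds(4, 2, 600))   # 4800
print(mb_seconds(4, 8192, 600))   # 19660800
```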
    Product selection
    Getting Started and Trying Out
    Free trial
    Get started quickly

    Quickly create a workspace on the EMR Serverless Spark page

    A workspace is the basic unit of EMR Serverless Spark and is used to manage tasks, members, roles, and permissions. All task development takes place within a specific workspace, so you need to create one before starting development.

    Technical solutions
    • General data lake construction and analytics
    • Integrated data and AI applications
    • Real-time monitoring of intelligent industrial equipment
    Product pricing

    EMR Serverless Spark charges are based primarily on computing resources, that is, the resources actually used for computation, which are converted into CU fees.

    Billing method

    Pay-as-you-go, monthly subscription, and resource deduction package billing methods are supported. You can choose the billing method that suits your needs.

    • Pay-as-you-go (postpaid)

      Pay-as-you-go is a use-first, pay-later billing method. You do not need to purchase a large amount of resources in advance; the system settles charges based on the actual resource usage of your workspace. Fees are calculated hourly (in UTC+8), after which a new billing cycle begins.
      It is applicable to scenarios where business usage often changes.
      View details
    • Monthly subscription (prepaid)

      Monthly subscription is a prepaid billing method: you pay up front for the duration you select, and EMR Serverless Spark prices each billing cycle strictly according to the purchased duration.
      It is applicable to scenarios with long-term stable use or clear budget planning.
      View details
    • Resource deduction package

      You purchase discounted resource packages of different capacities in advance; at settlement, consumption is deducted from the resource package first, and any usage beyond the package quota is billed pay-as-you-go.
      It is applicable to scenarios where business usage is relatively stable.
      View details
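    As a rough sketch of how these billing methods interact, the following estimates a pay-as-you-go bill after deducting a prepaid resource package first. The unit price is a hypothetical placeholder; consult the official pricing page for actual per-CU-hour rates in your region:

```python
# Sketch: estimating a pay-as-you-go bill from CU-hours.
# UNIT_PRICE_PER_CU_HOUR is a hypothetical placeholder, NOT an official rate.

UNIT_PRICE_PER_CU_HOUR = 0.35  # hypothetical, in your account currency

def payg_cost(cu_hours: float, package_quota_cu_hours: float = 0.0) -> float:
    """Pay-as-you-go cost after deducting a prepaid resource package first.

    Usage covered by the resource package is deducted at settlement time;
    only the excess is billed pay-as-you-go.
    """
    billable = max(cu_hours - package_quota_cu_hours, 0.0)
    return round(billable * UNIT_PRICE_PER_CU_HOUR, 2)

print(payg_cost(100.0))        # 35.0 (no resource package)
print(payg_cost(100.0, 80.0))  # 7.0  (80 CU-hours covered by the package)
```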
    Security Compliance

    Permission management

    Workspace permissions: RAM users can be added to a workspace and assigned workspace roles according to their responsibilities, controlling the operations each user can perform in the workspace.

    RAM Policy: RAM Policy is a user-based authorization policy. You can use RAM Policy to control a user's operation permissions on a workspace.

    Network security

    It provides customers with a virtualized, secure network environment. You can configure VPC security-group rules to access data sources and servers in your VPC, or to call other services within the VPC.

    Operational audit

    Through the Alibaba Cloud ActionTrail console, OpenAPI, or developer tools, you can query instance operation event logs from the past 90 days and retrieve the corresponding log details.

     

    Customer Stories
    01
    Micro finance technology
    Micro finance technology built its data platform on EMR Serverless Spark, with a dedicated resource pool for model training to avoid resource contention. This also resolved the shuffle stability and performance challenges posed by its storage-compute-separated architecture.
    Learn more
    02
    Eagle Horn Network
    Choosing EMR Serverless Spark as its offline computing engine significantly reduced operations and maintenance costs and improved system stability and reliability. Its Celeborn capability solves the disk limitations of large shuffle tasks, and strong consistency between task status and the scheduling tool removes the need for secondary confirmation, further streamlining data processing.
    03
    Midea Building
    Midea Building Technology built its lakehouse data platform on EMR Serverless Spark, effectively integrating data and AI technologies and ultimately achieving an overall performance improvement of more than 50% across scenarios while reducing total costs by 30%.
    Learn more
    FAQ
    Q: What is EMR Serverless Spark? What are the product's advantages?
    A: EMR Serverless Spark is a cloud-native, fully managed serverless product designed for large-scale data processing and analysis. It provides enterprises with one-stop data platform services, including task development, debugging, scheduling, and operations, which greatly simplifies data ..... View details
    Q: What are the application scenarios of EMR Serverless Spark?
    A: EMR Serverless Spark can meet the various data processing and analysis needs of enterprise users, such as building data platforms and running data query and analysis workloads. View details
    Q: What is Fusion?
    A: The Fusion engine is a high-performance vectorized SQL execution engine built into EMR Serverless Spark. Compared with open-source Spark, it delivers three times the performance on the TPC-DS benchmark. The Fusion engine is fully compatible with open-source Spark, so no changes to existing code are required. View details
    Q: What billing modes and items does the product support?
    A: This article introduces the resource estimation policy, billing items, calculation methods, and the unit price of the supported regions of EMR Serverless Spark. View details
    Q: How to use Paimon in EMR Serverless Spark?
    A: This article introduces how to implement the read and write operations of Paimon tables in EMR Serverless Spark. View details
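    As a sketch of what such Paimon usage typically looks like in open-source Spark SQL (the catalog name and warehouse path are placeholder assumptions; the product may preconfigure parts of this for you, so see the linked article for the product-specific steps):

```sql
-- Session/task configuration (placeholders), registering a Paimon catalog:
--   spark.sql.catalog.paimon            org.apache.paimon.spark.SparkCatalog
--   spark.sql.catalog.paimon.warehouse  oss://<your-bucket>/warehouse

-- Create, write, and read a Paimon table through that catalog:
CREATE TABLE paimon.default.users (id INT, name STRING);
INSERT INTO paimon.default.users VALUES (1, 'alice'), (2, 'bob');
SELECT * FROM paimon.default.users;
```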
    Q: How to connect external Hive Metastore in EMR Serverless Spark?
    A: EMR Serverless Spark supports the connection of external Hive Metastore services, so you can easily access the data stored in Hive Metastore. This article describes how to configure and connect external Hive Metastore services in EMR Serverless Spark, so that ..... View details
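    In open-source Spark, an external Hive Metastore is typically addressed through the standard configuration below (the host is a placeholder; refer to the linked article for the product-specific connection steps):

```
spark.hadoop.hive.metastore.uris   thrift://<metastore-host>:9083
```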
    Q: How to submit tasks to EMR Serverless Spark through Airflow?
    A: This article shows you how to automatically submit tasks to EMR Serverless Spark through Apache Airflow to automate job scheduling and execution and help you manage data processing tasks more effectively. View details
    Q: How to submit tasks to EMR Serverless Spark through DolphinScheduler?
    A: DolphinScheduler is a distributed, easily extensible, open-source visual DAG workflow task scheduling system that can efficiently execute and manage big data pipelines. This article shows you how to easily create, edit, and schedule Spark jobs through the DolphinScheduler web interface. View details
    Q: How to interact with EMR Serverless Spark through Jupyter Notebook?
    A: Jupyter Notebook is a powerful interactive development tool: you can write and execute code in the web interface and view results in real time, without precompiling or running scripts separately. This article introduces how to build an efficient connection between Jupyter Notebook and Serverless Spark ..... View details
    Free trial

    Want to experience more product features?

    Buy Now: purchase the EMR Serverless Spark product

    Want to learn more about Alibaba Cloud products?

    Explore Alibaba Cloud products and learn more in the product introductions

    Need help when encountering difficulties?

    Contact us to consult the Alibaba Cloud service team