EMR Serverless Spark

    EMR Serverless Spark is a high-performance lakehouse product for Data+AI. It provides enterprises with one-stop data platform services, including task development, debugging, scheduling, and operations, greatly simplifying the end-to-end process of data processing and model training. It is also 100% compatible with the open-source Spark ecosystem and integrates seamlessly into existing data platforms. With EMR Serverless Spark, enterprises can focus on data processing, analysis, and model training and tuning to improve productivity.


    Product advantages

    Cloud native fast computing engine

    The built-in Spark Native Engine delivers three times the performance of the open-source version, and the built-in enterprise-grade Celeborn (Remote Shuffle Service) supports PB-scale shuffle data while reducing total computing resource costs by up to 30%.

    Flexible resource management

    Resource scheduling offers second-level elasticity. Resources can be allocated on demand with a minimum granularity of 1 core and metered precisely at the task or queue level, maximizing the flexibility and efficiency of resource usage.

    Data and AI

    It provides a development and runtime environment fully compatible with PySpark/Python, supports Python machine-learning libraries as well as Spark MLlib, and offers managed handling of third-party Python dependency libraries.

    Ecological compatibility

    It has strong compatibility and integration capabilities: it supports DLF and Hive Metastore data catalogs; is compatible with Paimon, Iceberg, Hudi, Delta, and other mainstream lake formats; interfaces with mainstream schedulers such as Airflow and DolphinScheduler; supports Kerberos/LDAP authentication and Ranger authorization; and also supports task submission from DataWorks and dbt, covering user needs in all aspects.
    Product Functions
    SQL Editor
    The SQL Editor provides an integrated development environment for writing, debugging, and executing Spark SQL code, enabling efficient data analysis that yields key insights and decision support.
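    For illustration, a query run in the SQL Editor might look like the following; the table and column names are hypothetical, not part of the product:

```sql
-- Hypothetical table and columns, for illustration only:
-- daily order counts and totals from a lakehouse table
SELECT order_date,
       COUNT(*)    AS order_cnt,
       SUM(amount) AS total_amount
FROM   sales.orders
WHERE  order_date >= DATE '2024-01-01'
GROUP  BY order_date
ORDER  BY order_date;
```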
    Notebook
    Notebook provides an interactive working environment for data analysts, data scientists, and data engineers, with support for PySpark and Markdown development. You can write code, run queries, visualize data, and view results in real time.
    Workflow
    Workflow orchestrates and runs different types of tasks (such as PySpark, SQL, Notebook, and Spark JAR) within a workspace, making it easy to build data pipelines. Both grid and topology (dependency) views are provided for convenient workflow management.
    Resource Management
    In resource management, you can add queues to isolate and manage resources, creating separate production and development queues for different business teams to run their tasks.
    Custom Environment
    When submitting PySpark tasks or running notebooks, you can use a custom environment to manage third-party Python libraries. Serverless Spark automatically installs and deploys the dependencies for you, simplifying environment preparation.
    Task History
    It provides rich task-instance metrics to help you understand task execution, including the cost metric CU, resource metrics such as MB-seconds and vCore-seconds, and the Spark UI and log files for each task.
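    To make the resource metrics concrete, here is a minimal sketch of the generic arithmetic behind vCore-seconds and MB-seconds (resource multiplied by elapsed time). This is an illustration only, not the product's official metering logic:

```python
# Sketch: generic resource-metric arithmetic for a finished task instance.
# These are the standard "resource x elapsed time" definitions, not the
# official EMR Serverless Spark metering formulas.

def vcore_seconds(num_executors: int, vcores_per_executor: int, seconds: int) -> int:
    """Total vCore-seconds consumed across all executors."""
    return num_executors * vcores_per_executor * seconds

def mb_seconds(num_executors: int, mb_per_executor: int, seconds: int) -> int:
    """Total memory-time footprint in MB-seconds."""
    return num_executors * mb_per_executor * seconds

# A task with 4 executors, each with 2 vCores and 8192 MB, running for 600 s:
print(vcore_seconds(4, 2, 600))   # 4800
print(mb_seconds(4, 8192, 600))   # 19660800
```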
    Product selection
    Getting Started and Trying Out
    Free trial
    Get started quickly

    Quickly create a workspace on the EMR Serverless Spark page

    A workspace is the basic unit of EMR Serverless Spark and is used to manage tasks, members, roles, and permissions. All task development takes place within a specific workspace, so you need to create one before starting development.

    Technical solutions
    • General data lake construction and analytics
    • Integrated data and AI applications
    • Real-time monitoring of intelligent industrial equipment
    Product pricing

    EMR Serverless Spark charges are based primarily on computing resources, that is, the resources actually used for computation, which are converted into CU fees.

    Billing method

    Pay-as-you-go, monthly subscription, and resource deduction package billing methods are supported. You can choose the billing method that suits your needs.

    • Pay-as-you-go (postpaid)

      Pay-as-you-go is a use-first, pay-later billing method. You do not need to purchase a large amount of resources in advance; the system settles charges based on the actual resource usage of your workspace. Fees are calculated hourly (in UTC+8), after which a new billing cycle begins.
      It is applicable to scenarios where business usage often changes.
      View details
    • Monthly subscription (prepaid)

      Monthly subscription is a prepaid billing method: you pay up front for the duration you select, and EMR Serverless Spark prices each billing cycle strictly according to the purchased duration.
      It is applicable to scenarios with long-term stable use or clear budget planning.
      View details
    • Resource deduction package

      You purchase discounted resource packages of different capacities in advance; at settlement, consumption is deducted from the resource package first, and any usage beyond the package quota is billed pay-as-you-go.
      It is applicable to scenarios where business usage is relatively stable.
      View details
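    As a rough sketch of how these billing methods interact, the following estimates a pay-as-you-go bill after deducting a prepaid resource package first. The unit price is a hypothetical placeholder; consult the official pricing page for actual per-CU-hour rates in your region:

```python
# Sketch: estimating a pay-as-you-go bill from CU-hours.
# UNIT_PRICE_PER_CU_HOUR is a hypothetical placeholder, NOT an official rate.

UNIT_PRICE_PER_CU_HOUR = 0.35  # hypothetical, in your account currency

def payg_cost(cu_hours: float, package_quota_cu_hours: float = 0.0) -> float:
    """Pay-as-you-go cost after deducting a prepaid resource package first.

    Usage covered by the resource package is deducted at settlement time;
    only the excess is billed pay-as-you-go.
    """
    billable = max(cu_hours - package_quota_cu_hours, 0.0)
    return round(billable * UNIT_PRICE_PER_CU_HOUR, 2)

print(payg_cost(100.0))        # 35.0 (no resource package)
print(payg_cost(100.0, 80.0))  # 7.0  (80 CU-hours covered by the package)
```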
    Security Compliance

    Permission management

    Workspace permissions: RAM users can be added to a workspace and assigned workspace roles according to their responsibilities, controlling the operations each user can perform in the workspace.

    RAM Policy: RAM Policy is a user-based authorization policy. You can use RAM Policy to control a user's operation permissions on a workspace.

    Network security

    It provides customers with a virtualized, secure network environment. You can configure VPC security-group rules to access data sources and servers in your VPC, or to call other services within the VPC.

    Operational audit

    Through the Alibaba Cloud ActionTrail console, OpenAPI, or developer tools, you can query instance operation event logs from the past 90 days and retrieve the corresponding log details.

     

    Customer Stories
    01
    Micro finance technology
    Micro finance technology built its data platform on EMR Serverless Spark, with a dedicated resource pool for model training to avoid resource contention. This also resolved the shuffle stability and performance challenges posed by its storage-compute-separated architecture.
    Learn more
    02
    Eagle Horn Network
    Choosing EMR Serverless Spark as its offline computing engine significantly reduced operations and maintenance costs and improved system stability and reliability. Its Celeborn capability solves the disk limitations of large shuffle tasks, and strong consistency between task status and the scheduling tool removes the need for secondary confirmation, further streamlining data processing.
    03
    Midea Building
    Midea Building Technology built its lakehouse data platform on EMR Serverless Spark, effectively integrating data and AI technologies and ultimately achieving an overall performance improvement of more than 50% across scenarios while reducing total costs by 30%.
    Learn more
    FAQ
    Q: What is EMR Serverless Spark? What are the product's advantages?
    A: EMR Serverless Spark is a cloud-native, fully managed serverless product designed for large-scale data processing and analysis. It provides enterprises with one-stop data platform services, including task development, debugging, scheduling, and operations, which greatly simplifies data ..... View details
    Q: What are the application scenarios of EMR Serverless Spark?
    A: EMR Serverless Spark can meet the various data processing and analysis needs of enterprise users, such as building data platforms and running data query and analysis workloads. View details
    Q: What is Fusion?
    A: The Fusion engine is a high-performance vectorized SQL execution engine built into EMR Serverless Spark. Compared with open-source Spark, it delivers three times the performance on the TPC-DS benchmark. The Fusion engine is fully compatible with open-source Spark, so no changes to existing code are required. View details
    Q: What billing modes and items does the product support?
    A: This article introduces the resource estimation policy, billing items, calculation methods, and the unit price of the supported regions of EMR Serverless Spark. View details
    Q: How to use Paimon in EMR Serverless Spark?
    A: This article introduces how to implement the read and write operations of Paimon tables in EMR Serverless Spark. View details
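    As a sketch of what such Paimon usage typically looks like in open-source Spark SQL (the catalog name and warehouse path are placeholder assumptions; the product may preconfigure parts of this for you, so see the linked article for the product-specific steps):

```sql
-- Session/task configuration (placeholders), registering a Paimon catalog:
--   spark.sql.catalog.paimon            org.apache.paimon.spark.SparkCatalog
--   spark.sql.catalog.paimon.warehouse  oss://<your-bucket>/warehouse

-- Create, write, and read a Paimon table through that catalog:
CREATE TABLE paimon.default.users (id INT, name STRING);
INSERT INTO paimon.default.users VALUES (1, 'alice'), (2, 'bob');
SELECT * FROM paimon.default.users;
```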
    Q: How to connect external Hive Metastore in EMR Serverless Spark?
    A: EMR Serverless Spark supports the connection of external Hive Metastore services, so you can easily access the data stored in Hive Metastore. This article describes how to configure and connect external Hive Metastore services in EMR Serverless Spark, so that ..... View details
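    In open-source Spark, an external Hive Metastore is typically addressed through the standard configuration below (the host is a placeholder; refer to the linked article for the product-specific connection steps):

```
spark.hadoop.hive.metastore.uris   thrift://<metastore-host>:9083
```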
    Q: How to submit tasks to EMR Serverless Spark through Airflow?
    A: This article shows you how to automatically submit tasks to EMR Serverless Spark through Apache Airflow to automate job scheduling and execution and help you manage data processing tasks more effectively. View details
    Q: How to submit tasks to EMR Serverless Spark through DolphinScheduler?
    A: DolphinScheduler is a distributed, easily extensible, open-source visual DAG workflow task scheduling system that can efficiently execute and manage big data pipelines. This article shows you how to easily create, edit, and schedule Spark jobs through the DolphinScheduler web interface. View details
    Q: How to interact with EMR Serverless Spark through Jupyter Notebook?
    A: Jupyter Notebook is a powerful interactive development tool: you can write and execute code in the web interface and view results in real time, without precompiling or running scripts separately. This article introduces how to build an efficient connection between Jupyter Notebook and Serverless Spark ..... View details
    Free trial

    Want to experience more product features?

    Buy Now: purchase the EMR Serverless Spark product

    Want to learn more about Alibaba Cloud products?

    Explore Alibaba Cloud products and learn more in the product introductions

    Need help when encountering difficulties?

    Contact us to consult the Alibaba Cloud service team