
NVMe Protocol Introduction

NVMe (Non-Volatile Memory Express) is a host controller interface specification for non-volatile memory. Like AHCI, it defines a logical device interface and the protocol used to transfer data over the bus. This article introduces the basic concepts and usage scenarios of the NVMe protocol.

Basic concepts

Each concept below is followed by its explanation and then its advantages.

NVMe

NVMe defines a rich set of commands and functions for PCIe-based SSDs. The goal is to improve performance and efficiency while enabling a wide range of enterprise-class and client systems to interoperate.

NVMe is designed specifically for SSDs. It uses a high-speed interface for data communication between the CPU and the SSD. Compared with traditional drive protocols such as SCSI and virtio-blk, NVMe is faster and provides higher transmission bandwidth. NVMe is becoming a new industry standard for data center servers and client devices. Alibaba Cloud ESSD NVMe cloud disks offer the high performance and enterprise features of the NVMe protocol. Currently, an NVMe cloud disk can be attached to multiple ECS instances that support the NVMe protocol at the same time to share data.

Multiple mounts

A single NVMe cloud disk can be attached to multiple ECS instances in the same zone at the same time, so that all of these instances can read from and write to the same disk concurrently.

Sharing one copy of the data among multiple instances reduces storage costs, lets the business scale out without moving data, and improves recovery in failure scenarios. Multiple mounts are widely used in scenarios such as database high availability, write once read many, distributed caching, and accelerated machine learning.

Persistent Reservation (PR)

PR is part of the NVMe protocol, and PR commands control each client's access to the cloud disk. The main PR commands are register, acquire, release, and report, which are used for permission registration, permission preemption, permission release, and permission query, respectively. Configuring permissions per cloud disk and per client improves data reliability and security. For more information, see NVMe PR protocol.

In the multi-mount scenario, several clients writing to a cloud disk at the same time may corrupt the data. PR precisely controls the read and write permissions on a cloud disk, which ensures that the compute side writes data as expected. For example, in a failover scenario, PR can ensure that the failed node no longer writes data, so that the data written after the new node goes online is correct.
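To make the four PR operations concrete, here is a minimal in-memory sketch. It is not the NVMe command set and not the author's implementation: the class `ReservedDisk`, its method names, and the host names are all hypothetical, and it models only a single write-exclusive reservation in which preemption also removes the previous holder's registration.

```python
class ReservationError(Exception):
    """Raised when a host is not allowed to perform an operation."""


class ReservedDisk:
    """Toy stand-in for a multi-attached disk (hypothetical, in-memory only)."""

    def __init__(self):
        self.registrants = {}  # host name -> reservation key
        self.holder = None     # host that currently holds the reservation
        self.blocks = {}       # block number -> data

    def register(self, host, key):
        """Permission registration: the host announces its key."""
        self.registrants[host] = key

    def acquire(self, host, preempt=False):
        """Permission preemption: take the reservation, optionally evicting the holder."""
        if host not in self.registrants:
            raise ReservationError(f"{host} is not registered")
        if self.holder not in (None, host):
            if not preempt:
                raise ReservationError(f"reservation is held by {self.holder}")
            # Assumption of this sketch: preempting drops the old holder's registration.
            self.registrants.pop(self.holder, None)
        self.holder = host

    def release(self, host):
        """Permission release: only the current holder can give up the reservation."""
        if self.holder != host:
            raise ReservationError(f"{host} does not hold the reservation")
        self.holder = None

    def report(self):
        """Permission query: who is registered and who holds the reservation."""
        return {"holder": self.holder, "registrants": sorted(self.registrants)}

    def write(self, host, block, data):
        """Writes are rejected for every host except the reservation holder."""
        if self.holder is not None and host != self.holder:
            raise ReservationError(f"{host} may not write while {self.holder} holds the reservation")
        self.blocks[block] = data


def demo():
    disk = ReservedDisk()
    disk.register("node-a", 0xA)
    disk.register("node-b", 0xB)
    disk.acquire("node-a")
    disk.write("node-a", 0, b"from a")
    # node-a fails; node-b takes over and fences node-a out.
    disk.acquire("node-b", preempt=True)
    disk.write("node-b", 0, b"from b")
    return disk.report()


if __name__ == "__main__":
    print(demo())
```

After the preemption in `demo()`, any further `write` from `node-a` raises `ReservationError`, which is the fencing behavior the failover example relies on.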

NVMe shared disk

An NVMe shared disk is a cloud disk based on the NVMe protocol that supports multiple mounts and PR. It can be attached to up to 16 ECS instances at the same time.

NVMe shared disks are valuable in scenarios such as database high availability and write once read many. They also make it practical to move highly available services that were built on traditional SAN to the cloud, such as Oracle RAC, SAP HANA, and cloud-native databases.
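The attachment rules stated so far (same zone, NVMe-capable instance, at most 16 attachments) can be written down as a small validation sketch. The names `Instance`, `SharedDiskAttachments`, and the zone strings are hypothetical and are not part of any Alibaba Cloud API; the sketch only mirrors the constraints described in this article.

```python
from dataclasses import dataclass, field

MAX_ATTACHMENTS = 16  # limit stated in this article for an NVMe shared disk


@dataclass
class Instance:
    name: str
    zone: str
    supports_nvme: bool = True


@dataclass
class SharedDiskAttachments:
    """Hypothetical bookkeeping for one NVMe shared disk."""
    zone: str
    attached: list = field(default_factory=list)

    def attach(self, instance):
        # The disk and the instance must be in the same zone.
        if instance.zone != self.zone:
            raise ValueError(f"{instance.name} is in {instance.zone}, disk is in {self.zone}")
        # Only instances that support the NVMe protocol can attach.
        if not instance.supports_nvme:
            raise ValueError(f"{instance.name} does not support the NVMe protocol")
        if instance.name in self.attached:
            raise ValueError(f"{instance.name} is already attached")
        if len(self.attached) >= MAX_ATTACHMENTS:
            raise ValueError(f"disk already has {MAX_ATTACHMENTS} attachments")
        self.attached.append(instance.name)


if __name__ == "__main__":
    disk = SharedDiskAttachments(zone="zone-a")
    disk.attach(Instance("ecs-1", "zone-a"))
    disk.attach(Instance("ecs-2", "zone-a"))
    print(disk.attached)
```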

Cluster File System

In the multi-mount scenario, every mount node must see the same file system contents. A cluster file system ensures that written data, newly created files, and modified metadata are synchronized to all mount nodes in real time, which keeps data consistent at the file system layer.

Traditional file systems such as ext3 and ext4 cache data and metadata to speed up access. As a result, data written, files created, and disk space allocated on one node stay in that node's local cache and are not seen by other nodes in real time. Cluster file systems exist to solve this problem. Common cluster file systems include OCFS2 and DBFS.
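The stale-cache problem described above can be reproduced with a toy model. Everything here is hypothetical (`Disk`, `CachingNode`, `CoherentNode`), and the invalidate-on-write scheme is only one simple way to keep caches coherent; real cluster file systems coordinate nodes through more elaborate mechanisms such as distributed locking, which this sketch does not model.

```python
class Disk:
    """Shared storage visible to every node."""

    def __init__(self):
        self.blocks = {}


class CachingNode:
    """Node with a private cache, like a single-node file system."""

    def __init__(self, disk):
        self.disk = disk
        self.cache = {}

    def write(self, key, value):
        self.cache[key] = value
        self.disk.blocks[key] = value

    def read(self, key):
        # Served from the local cache once the key has been seen.
        if key not in self.cache:
            self.cache[key] = self.disk.blocks.get(key)
        return self.cache[key]


class CoherentNode(CachingNode):
    """Cluster-aware node: a write invalidates the key in every peer's cache."""

    def __init__(self, disk, peers):
        super().__init__(disk)
        self.peers = peers
        peers.append(self)

    def write(self, key, value):
        super().write(key, value)
        for peer in self.peers:
            if peer is not self:
                peer.cache.pop(key, None)


def stale_read():
    disk = Disk()
    a, b = CachingNode(disk), CachingNode(disk)
    a.write("file", "v1")
    b.read("file")          # b caches "v1"
    a.write("file", "v2")
    return b.read("file")   # b still answers from its cache


def coherent_read():
    disk, peers = Disk(), []
    a, b = CoherentNode(disk, peers), CoherentNode(disk, peers)
    a.write("file", "v1")
    b.read("file")
    a.write("file", "v2")   # invalidates b's cached copy
    return b.read("file")


if __name__ == "__main__":
    print(stale_read(), coherent_read())
```

With plain caching nodes the second reader keeps returning the old value; with invalidation it re-reads the shared disk and sees the update.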

Usage scenarios

Common usage scenarios of the NVMe protocol include NVMe cloud disks and NVMe shared disks.

NVMe cloud disk

The NVMe protocol is gradually becoming the new industry standard, and more and more applications are built on NVMe SSDs. An ESSD cloud disk that supports the NVMe protocol is called an NVMe cloud disk. An NVMe cloud disk has the same read/write interface as an NVMe SSD. It lets applications built on traditional NVMe SSDs move to the cloud seamlessly and take full advantage of cloud features such as elasticity, maintenance-free operation, snapshots, and high performance. For more information, see Overview of NVMe Cloud Disk.

NVMe shared disk

When you create an ESSD cloud disk, you can enable the multi-mount function for it. An NVMe cloud disk with the multi-mount function enabled is called an NVMe shared disk. For more information, see Enable multi mount function.

NVMe shared disks help applications achieve high availability, high concurrency, and scalability, and help traditional SAN-based businesses move to the cloud seamlessly. Common application scenarios of shared disks include data sharing, high availability failover, distributed cache acceleration, and machine learning model training.

  • Data sharing

    The simplest application scenario of NVMe shared disks is data sharing. Once data is written to the cloud disk, other nodes can access it, which saves costs and improves read and write performance. For example, in the cloud container image scenario, instances of the same system usually use similar images, so a single image can be read and loaded by many different instances.

  • High availability failover

    High availability is one of the most common application scenarios for shared disks. Examples include traditional SAN-based databases such as Oracle RAC and SAP HANA, and cloud-native high-availability databases. Single-node failures are common in production, and keeping the business running when a failure occurs is the core capability of a highly available system. Storage and networking on the cloud have extremely high availability, whereas compute nodes are more often affected by power failures, crashes, and hardware faults. Businesses therefore usually deploy an active/standby pair to make the compute layer highly available.

    For example, in the database scenario, when the primary database fails, the service quickly switches to the standby database. After the switch, you can revoke the write permission of the old instance with an NVMe PR command, which ensures that the old instance can no longer write data and break data consistency. The failover process, shown in the figure, is as follows:

    1. Primary database instance 1 goes down, and the business stops.

    2. An NVMe PR command is issued to prohibit database instance 1 from writing further data and to allow database instance 2 to write data.

    3. Database instance 2 is restored to the same state as database instance 1 through log replay.

    4. Database instance 2 is promoted to primary and continues to provide external services.

  • Distributed cache acceleration

    NVMe shared disks offer high IOPS and throughput and can accelerate slower storage systems. For example, a data lake is usually built on OSS and can be accessed by multiple clients at the same time. It offers high sequential-read and append-write throughput, but its latency and random read/write performance are poor. Placing NVMe shared disks between compute and storage as a cache layer can greatly improve access performance for data lakes and similar scenarios.

  • Machine learning

    Machine learning is another typical application scenario of shared disks. After the samples are labeled and written, the data is split across multiple nodes for distributed neural network computation. Especially in high-performance machine learning where GPUs are the computing resource, slow storage easily becomes the bottleneck of the entire system. A high-performance NVMe shared disk can effectively speed up the whole model training process.
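The four failover steps in the high availability scenario above can be sketched as a small simulation. This is an illustration under stated assumptions, not a real database or NVMe driver: `FencedDisk.grant_write` stands in for the NVMe PR command, the shared disk holds only an append-only log, and all class and function names are hypothetical.

```python
class FencedDisk:
    """Shared disk holding an append-only log; only one writer is allowed."""

    def __init__(self):
        self.writer = None
        self.log = []

    def grant_write(self, instance_name):
        # Stand-in for the PR command: the new writer replaces the old one.
        self.writer = instance_name

    def append(self, instance_name, record):
        if instance_name != self.writer:
            raise PermissionError(f"{instance_name} has no write permission")
        self.log.append(record)


class DatabaseInstance:
    """Key-value state rebuilt from the log on the shared disk."""

    def __init__(self, name, disk):
        self.name = name
        self.disk = disk
        self.state = {}
        self.applied = 0  # number of log records already applied

    def replay(self):
        """Apply every log record this instance has not seen yet."""
        for key, value in self.disk.log[self.applied:]:
            self.state[key] = value
        self.applied = len(self.disk.log)

    def put(self, key, value):
        self.disk.append(self.name, (key, value))
        self.replay()


def fail_over(disk, standby):
    """Steps 2 to 4: fence the old primary, replay the log, promote the standby."""
    disk.grant_write(standby.name)  # step 2: old primary can no longer write
    standby.replay()                # step 3: catch up through log replay
    return standby                  # step 4: standby is the new primary


if __name__ == "__main__":
    disk = FencedDisk()
    primary = DatabaseInstance("instance-1", disk)
    standby = DatabaseInstance("instance-2", disk)
    disk.grant_write(primary.name)
    primary.put("a", 1)
    primary.put("b", 2)
    # Step 1: instance-1 goes down; the standby takes over.
    new_primary = fail_over(disk, standby)
    new_primary.put("c", 3)
    print(new_primary.state)
```

If the old primary comes back and tries to write, `append` raises `PermissionError`, so it cannot interfere with the data written by the new primary.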
