Cloud-native enterprise data lake solution

The data lake is a unified storage pool that supports multiple data ingestion methods. You can store structured, semi-structured, and unstructured data of any size. The data lake connects seamlessly to a variety of computing and analysis platforms, so data can be processed and analyzed in place, which breaks down data silos and surfaces business value. The data lake also supports transitions between hot and cold storage tiers, covering the entire data life cycle.

Solution architecture

Data Lake Storage

Object Storage Service (OSS) is designed for 12 nines (99.9999999999%) of data durability. It can store data of any size, supports hot and cold tiering, and can interface with business applications and various computing and analysis platforms. This makes OSS well suited as the foundation for an enterprise data lake.

Why build a data lake based on OSS

Massive elasticity: compute and storage are separated, and storage capacity scales elastically

Open ecosystem: compatible with the Hadoop ecosystem and seamlessly connected to Alibaba Cloud computing platforms

Cost-effective: a unified storage pool avoids duplicate copies, and multiple hot and cold storage tiers are available

Easier management: unified management of encryption, authorization, lifecycle, cross-region replication, and more
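The hot and cold tiering mentioned above can be illustrated with a small policy function. This is only a sketch: the age thresholds are hypothetical, the storage class names (Standard, IA, Archive, ColdArchive) are used here as illustrative labels, and in practice the transition would be configured as a lifecycle rule on the bucket rather than computed in application code.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds: minimum age in days -> target storage class.
# Ordered from coldest to hottest so the first match wins.
TIER_RULES = [
    (365, "ColdArchive"),
    (180, "Archive"),
    (30, "IA"),
]


def choose_storage_class(last_modified: datetime, now: datetime) -> str:
    """Pick a storage class for an object from its age since last modification."""
    age_days = (now - last_modified).days
    for min_age_days, storage_class in TIER_RULES:
        if age_days >= min_age_days:
            return storage_class
    # Recently modified data stays in the hot tier.
    return "Standard"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    for days in (5, 45, 200, 400):
        print(days, choose_storage_class(now - timedelta(days=days), now))
```

Keeping the rules in a single ordered table makes the policy easy to review and to adjust as access patterns change.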

Challenges Resolved

Inelasticity: self-built HDFS wastes resources, and coupled compute and storage make capacity expansion difficult

High cost: self-built HDFS is expensive and has no hot and cold data tiering scheme

Lack of service: compared with Alibaba Cloud EMR, self-built big data clusters lack expert support

Difficult to manage: data is scattered across multiple clusters, with no unified data management

Application scenarios

Building a data lake on the open source ecosystem

Building a fully managed massive data warehouse

Hot and cold tiered storage for big data

Interactive query of massive data

Building machine learning capabilities on the data lake

Application scenarios

• Customers build data processing and analysis on the Hadoop ecosystem
• Widely used in the internet, finance, manufacturing, transportation, and other industries

User pain points

• Data volume grows rapidly, storage and compute resources expand at mismatched rates, and customers need to optimize costs
• Data sources are diverse: the storage system needs to interface with many different sources, including application data

Why Alibaba Cloud

• OSS supports exabyte (EB) scale data lakes and multiple data channels, comprehensively covering data sources such as logs, messages, databases, and HDFS
• OSS interfaces seamlessly with EMR Hive, Spark, Presto, Impala, and other big data processing engines to eliminate data silos
• Alibaba Cloud EMR provides big data expert service support
• Alibaba Cloud Data Lake Formation provides data lake metadata management, data lake acceleration, and other services

Application Practice

Practice of online education data lake

Practice of online game data lake

Practice of interactive entertainment and new media data lake

Practice of internet advertising data lake

An online education platform with more than 100 million users

Customer requirements
• Courseware materials, application logs, learning samples, and other data must be stored centrally
• Courseware playback, offline analysis, and machine learning must be available for the different types of data, to meet the needs of different online education scenarios
Customer value
• OSS supports centralized storage of various types of data, such as audio, video, pictures, and logs, connects seamlessly with big data processing, and distributes teaching courseware on demand

Industry Scenario Best Practices

Data Lake Solution - Game Industry Best Practices

Mining the value of data and improving the game experience through refined, data-driven operations in the cloud

Practical explanation

Lesson 1: Efficiently migrate massive HDFS files to OSS; Lesson 2: Worry-free data: use checksums when migrating HDFS data to OSS

Lesson 3: How to archive HDFS data to OSS; Lesson 4: How to archive Hive data to OSS by partition

Lesson 5: The fastest way to access object storage such as OSS: JindoFS SDK; Lesson 6: Accelerating Hadoop/Spark access to OSS
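The checksum idea in Lesson 2 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method used by the lessons or by any migration tool: two local directories stand in for the HDFS source and the OSS target, SHA-256 is an arbitrary choice of digest, and the function names `file_checksum` and `verify_migration` are hypothetical.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so large files never load fully into memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_migration(source_dir, target_dir):
    """Return relative paths of source files that are missing or differ in the target."""
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    mismatches = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        dst = target_dir / rel
        # A file counts as migrated only if it exists and its digest matches.
        if not dst.is_file() or file_checksum(src) != file_checksum(dst):
            mismatches.append(rel.as_posix())
    return mismatches
```

An empty result means every source file has a byte-identical copy in the target; any returned path is a candidate for re-copying.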

Customer Stories

Liulishuo Data Lake Practice

Yidiantianxia Data Lake Practice

Practice of Jiahe Science and Technology Data Lake

Customer video: Liulishuo

Alibaba Cloud's data lake solution, tailored for Liulishuo, provides unified storage for all the types of data its applications produce and helped Liulishuo build a "Chinese English voice database" with a data scale in the hundreds of billions. The data lake takes full advantage of an architecture that decouples compute from storage. Combined with Alibaba Cloud ECS elastic instances and Kubernetes (K8s), compute resources scale out and in dynamically with actual business demand. Compute resources no longer need to stay resident to cover peak load, which keeps costs as low as possible.

Product recommendation