Quickly create a DataLake cluster

This article describes how to log in to the E-MapReduce (EMR) console with an Alibaba Cloud account, quickly create a DataLake cluster, and run a job.

Prerequisites

An Alibaba Cloud account is registered and real-name verification is complete. For details, see Account registration (PC side).
Note
According to Alibaba Cloud ECS rules, when you purchase pay-as-you-go instances, the available balance (including cash, vouchers, and coupons) in your Alibaba Cloud account must be at least 100 yuan.
The default EMR and ECS role permissions are granted to the E-MapReduce service account. For details, see Alibaba Cloud account role authorization.

Procedure

Step 1: Create a cluster
On the EMR console, quickly create a DataLake cluster.
Step 2: Create and execute jobs
After the cluster is created, you can create and run Spark jobs.
Step 3: View the job running record
After submitting the job, you can view the job running record through the YARN UI.
(Optional) Step 4: Release the cluster
If the cluster is no longer used, you can release the cluster to save costs.

Step 1: Create a cluster

Go to the Create Cluster page.
1. Log in to the EMR on ECS console.
2. In the top menu bar, select a region and a resource group as needed.
  - Region: The cluster is created in the selected region. The region cannot be changed after the cluster is created.
  - Resource Group: By default, all resources of the account are displayed.
3. Click Create Cluster at the top of the page.

On the Create Cluster page, complete the cluster configuration.

Configuration area	Configuration item	Example	Description
Software configuration	Region	East China 1 (Hangzhou)	The physical location of the ECS instances of the cluster nodes. Important: The region cannot be changed after the cluster is created. Select it carefully.
	Business scenario	Data Lake	Select a business scenario that suits your needs. When you create a cluster, Alibaba Cloud EMR automatically configures default components, services, and resources to simplify cluster configuration and provide a cluster environment that meets the requirements of the selected business scenario.
	Product version	EMR-5.14.0	The latest software version at the time of writing.
	Service high availability	Disabled	Disabled by default. After you turn on the Service high availability switch, EMR distributes the master nodes across different underlying hardware to reduce the risk of failure.
	Optional services	HADOOP-COMMON, OSS-HDFS, YARN, Hive, Spark3, Tez, Knox, and OpenLDAP.	Select components as needed. The service processes of the selected components are started by default. Note: In addition to the default cluster services, you must also select the Knox and OpenLDAP services.
	Allow collection of service running logs	Enabled	Enables or disables log collection for all services with one click. Enabled by default, in which case your service running logs are collected. These logs are used only for cluster diagnostics. After the cluster is created, you can change the Service running log collection status on the Basic information page. Important: When log collection is disabled, EMR health checks and technical support are limited, but other functions work normally. For details, see How to stop collecting service logs?
	Metadata	DLF unified metadata	Metadata is stored in Data Lake Formation (DLF). The system selects a default DLF data directory. To use different data directories for different clusters, click Create Data Directory. Note: To use this mode, the Alibaba Cloud Data Lake Formation service must be activated.
	Cluster storage root path	1366993922******	Required when the OSS-HDFS service is selected under Optional services. Not required if you select the HDFS service. Note: Before you choose the OSS-HDFS service, make sure that the selected region supports it. Otherwise, change the region or use the HDFS service instead of OSS-HDFS. For the regions that currently support OSS-HDFS, see Activate and authorize access to the OSS HDFS service. The OSS-HDFS service can be selected for DataLake, DataFlow, DataServing, and Custom clusters of EMR-5.12.1 and later versions and of EMR-3.46.1 and later versions.
Hardware configuration	Payment type	Pay-as-you-go	Pay-as-you-go is recommended for testing. After the tests pass, you can release the cluster and create a new subscription cluster for production use.
	Zone	Zone I	The zone cannot be changed directly after the cluster is created. Select it carefully.
	VPC	vpc_Hangzhou/vpc-bp1f4epmkvncimpgs****	Select a VPC in the corresponding region. If none exists, click Create VPC to create one. After the VPC is created, click Refresh to select it.
	vSwitch	vsw_i/vsw-bp1e2f5fhaplp0g6p****	Select a vSwitch in the selected zone under the corresponding VPC. If no vSwitch is available in the zone, create one.
	Default security group	sg_seurity/sg-bp1ddw7sm2risw****	Important: Enterprise security groups created in ECS are not supported. If you already have a security group in use, you can select it directly. You can also create a new security group.
	Node group	Turn on the Attach public network switch and use the default values for the rest.	Configure the Master, Core, or Task node groups according to your business needs. For details, see Model selection and configuration description.
Basic configuration	Cluster name	Emr-DataLake	The cluster name must be 1 to 64 characters long and can contain only Chinese characters, letters, digits, hyphens (-), and underscores (_).
	Identity credential	Password	Used to remotely log in to the master node of the cluster.
	Login Password and Confirm Password	Custom password	Record the password. You need it when you log in to the cluster.

Select Service Agreement and click Confirm Order.
On the EMR on ECS page, the cluster has been created successfully when its status is shown as Running. For more information about cluster parameters, see Create cluster.

Step 2: Create and execute jobs

After the cluster is created successfully, you can create and execute jobs in the cluster.

Connect to the cluster through SSH. For details, see Log in to the cluster.
Run the following command on the command line to submit and run the job.
This article uses Spark 3.1.1 as an example. The command is as follows.
```
 spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client --driver-memory 512m --num-executors 1 --executor-memory 1g --executor-cores 2 /opt/apps/SPARK3/spark-current/examples/jars/spark-examples_2.12-3.1.1.jar 10
```
Note
spark-examples_2.12-3.1.1.jar is the name of the JAR package in this example. To find the corresponding name in your cluster, log in to the cluster and check the /opt/apps/SPARK3/spark-current/examples/jars directory.
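
Because the JAR file name depends on the Spark version installed in your cluster, the path can be resolved at run time instead of being hard-coded. The following is a minimal sketch, assuming the examples directory mentioned in this article. The function name find_examples_jar is a hypothetical helper introduced for illustration and is not part of EMR or Spark.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the path of the first spark-examples JAR found
# in the given directory (defaults to the EMR path used in this article).
find_examples_jar() {
  local dir="${1:-/opt/apps/SPARK3/spark-current/examples/jars}"
  local jar
  for jar in "$dir"/spark-examples_*.jar; do
    # If the glob matched nothing, "$jar" is the literal pattern and -f fails.
    if [ -f "$jar" ]; then
      printf '%s\n' "$jar"
      return 0
    fi
  done
  echo "no spark-examples JAR found in $dir" >&2
  return 1
}

# Usage on the master node (commented out so that sourcing this file has no side effects):
# jar="$(find_examples_jar)" && spark-submit --class org.apache.spark.examples.SparkPi \
#   --master yarn --deploy-mode client "$jar" 10
```

The helper only locates the file. The remaining spark-submit options are the same as in the command above.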

Step 3: View the job running record

After submitting the job, you can view the job running record through the YARN UI.

Open port 8443. For details, see Manage Security Groups.
Add a user. For details, see Manage Users.
Accessing the YARN UI with a Knox account requires the user name and password of the Knox account.
On the EMR on ECS page, click Cluster service.
Click the Access links and ports tab.
Click the public network link in the YARN UI row.
Log in with the user credentials created in user management to open the YARN UI page.
On the All Applications page, click the ID of the target job to view the details of the job run.
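
As an alternative to the web UI, the same job record can be queried on the master node with the standard yarn command-line tool. The sketch below assumes that the spark-submit output was saved to a log file and that the log contains the YARN application ID in its usual application_<timestamp>_<sequence> form. The function name extract_app_id and the file name submit.log are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first YARN application ID found in a log file.
# Prints nothing if the file contains no application ID.
extract_app_id() {
  grep -o 'application_[0-9][0-9]*_[0-9][0-9]*' "$1" | head -n 1
}

# Usage on the master node (commented out so that sourcing this file has no side effects):
# spark-submit <options> 2>&1 | tee submit.log
# app_id="$(extract_app_id submit.log)"
# yarn application -status "$app_id"    # state and final status of the job
# yarn logs -applicationId "$app_id"    # container logs, if log aggregation is enabled
```

Whether the application ID appears in the spark-submit output depends on the configured log level, so check the result of extract_app_id before passing it to yarn.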

(Optional) Step 4: Release the cluster

If the cluster you created is no longer in use, you can release it to save costs. After you confirm the release, the system processes the cluster as follows:

Forcibly terminates all jobs on the cluster.
Terminates and releases all ECS instances.

The time required depends on the size of the cluster: the smaller the cluster, the faster the release. The release usually completes within a few seconds and takes no more than 5 minutes.

Important

A pay-as-you-go cluster can be released at any time. A subscription cluster can be released only after it expires.
Before releasing the cluster, make sure that the cluster status is Initializing, Running, or Idle.

On the EMR on ECS page, open the actions menu in the row of the target cluster and select Release.
Alternatively, click the name of the target cluster, and on the Basic information page, select All operations > Release.
In the dialog box that appears, click OK.

FAQ

For frequently asked questions about using Alibaba Cloud E-MapReduce, see FAQ.
