Go to the **Training tasks** list page and click **Create Task**.
Basic information
Fill in the basic information of the training task.
| Parameter | Description |
| --- | --- |
| Task name | Fill in the task name. Lowercase letters, numbers, and hyphens (-) are supported; the name must start with a lowercase letter, end with a lowercase letter or number, and be 1-50 characters long (see the validation sketch after this table). |
| Resource pool | Select the Baige resource pool where the task will be deployed. |
| Queue | Select the queue under the resource pool in which the task will be deployed. |
| Priority | Select the task priority: high, medium, or low. |
| Training framework | PyTorch and MPI training frameworks are currently supported. |
| Fault tolerance | When fault tolerance is enabled, if a training task fails because of a node failure, the faulty node is blocked and the training task is rescheduled. See: Training fault tolerance. |
| Log persistence | When log persistence is enabled, your task logs are persisted to the log service (BLS). Log storage, reads, writes, and indexing incur fees. For details, see Price details. |
| Task creation method | Choose **Custom creation** or **Create based on AIAK acceleration template**. Custom creation: for scenarios that use a custom training image and parameters. Create based on AIAK acceleration template: for scenarios that directly use the AIAK-Training acceleration image; if this option is selected, you also need to fill in the training mode, training method, and AIAK training template. |
| Training mode (optional) | Two modes are supported: Post-Pretrain and SFT. This field is required if **Create based on AIAK acceleration template** is selected. |
| Training method (optional) | Full-parameter update and LoRA are supported. Full-parameter update: all parameters of the large model are updated during training. LoRA: the parameters of the pre-trained model are frozen and the original weight matrices in the self-attention modules are kept; LoRA applies a low-rank decomposition to these weight matrices and updates only the low-rank parameters during training (see the LoRA sketch after this table). This field is required if **Create based on AIAK acceleration template** is selected. |
| AIAK training template (optional) | Select a training template. Acceleration templates for mainstream open-source large models are provided, covering common open models such as Llama 2, Qwen, and Baichuan 2. This field is required if **Create based on AIAK acceleration template** is selected. |
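The naming rule above can be expressed as a simple pattern check. Below is a minimal, illustrative Python sketch (not part of the platform) that validates a task name against the stated constraints:

```python
import re

# Lowercase letters, digits, and hyphens; must start with a lowercase letter,
# end with a lowercase letter or digit, and be 1-50 characters long.
TASK_NAME_PATTERN = re.compile(r"^[a-z]([a-z0-9-]{0,48}[a-z0-9])?$")

def is_valid_task_name(name: str) -> bool:
    return TASK_NAME_PATTERN.fullmatch(name) is not None

assert is_valid_task_name("llama2-sft-run1")
assert not is_valid_task_name("2-starts-with-digit")
assert not is_valid_task_name("ends-with-hyphen-")
```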
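For reference, the LoRA idea described above (freeze the pre-trained weight matrix and learn only a low-rank update) can be sketched in a few lines of PyTorch. This is an illustrative sketch, not the AIAK implementation; the class name and default rank are assumptions:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight plus a trainable low-rank update B @ A of rank r."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # keep the pre-trained weights fixed
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Only lora_A and lora_B receive gradients during training.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
```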
Environment configuration
Fill in the training environment configuration. The required information depends on the task creation method.
**Custom creation**

| Parameter | Description |
| --- | --- |
| Image address | Fill in the training image. You can enter the image address directly or click to select an image. CCR enterprise images and Baige preset images are currently supported. For more information, see Container image service CCR. |
| Execute command | Specify the command that runs your code (see the sketch after this table). |
| Environment variables | Add environment variables; multiple entries are supported. |
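As an illustration of how the execute command and environment variables fit together, the sketch below shows a hypothetical entry script (launched, for example, with an execute command such as `python train.py`) that reads variables configured in the console. The script name, variable names, and paths are assumptions, not platform defaults:

```python
# train.py -- hypothetical entry point for a custom training task.
import os

# Values injected through the "Add environment variable" section of the console.
data_dir = os.environ.get("DATA_DIR", "/mnt/pfs/datasets")  # assumed variable and path
epochs = int(os.environ.get("EPOCHS", "1"))                 # assumed variable

def main() -> None:
    print(f"training for {epochs} epoch(s) on data under {data_dir}")
    # ... build the model, dataloaders, and optimizer here ...

if __name__ == "__main__":
    main()
```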
**Create based on AIAK acceleration template**

| Parameter | Description |
| --- | --- |
| Image address | The AIAK training acceleration template presets a default training image, which cannot be modified. |
| Execute command | The AIAK training acceleration template provides default parameters. You need to replace the dataset, checkpoint, tokenizer, and Tensorboard paths with your own paths (see the sketch after this section). |
| Environment variables | The AIAK training acceleration template provides default environment variables; modifying them is not recommended. |
The parameters to be replaced in the execution command are the dataset, checkpoint, tokenizer, and Tensorboard paths mentioned above. If you need to further adjust model parameters, edit them directly in the execution command before submitting.
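The sketch below illustrates this kind of path substitution. The placeholder names and paths are hypothetical examples, not the actual tokens in the AIAK template; check the default command shown in the console for the real ones:

```python
# Illustrative only: substitute user paths into a template command string.
# All placeholder names and paths below are made-up examples.
template_cmd = (
    "bash /workspace/train.sh "
    "--data-path {DATASET_PATH} "
    "--checkpoint-path {CHECKPOINT_PATH} "
    "--tokenizer-path {TOKENIZER_PATH} "
    "--tensorboard-dir {TENSORBOARD_PATH}"
)

user_paths = {
    "DATASET_PATH": "/mnt/pfs/datasets/my-corpus",
    "CHECKPOINT_PATH": "/mnt/pfs/checkpoints/llama2-7b",
    "TOKENIZER_PATH": "/mnt/pfs/tokenizers/llama2-7b",
    "TENSORBOARD_PATH": "/mnt/pfs/tensorboard/my-task",
}

print(template_cmd.format(**user_paths))
```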
Resource allocation
| Parameter | Description |
| --- | --- |
| Number of replicas | Set the number of replicas for the training task. |
| GPU type | Select the GPU type. |
| GPUs per replica | Set the number of GPU cards per replica according to the number of GPU cards currently available. |
| CPU/memory | The CPU and memory requested by the workload are unlimited by default; the remaining free resources on the node can be used. |
| Shared memory | Shared memory is used on Linux for data exchange and sharing between processes, improving application performance and efficiency. On the Baige platform, shared memory defaults to 10Gi and can be increased if the workload needs more (see the sketch after this table). |
| RDMA | When enabled, the system automatically schedules the task to nodes that support RDMA. |
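Inside the training container, shared memory is typically exposed as a tmpfs at /dev/shm. A minimal sketch to confirm the size you configured (10Gi is the platform default mentioned above):

```python
import shutil

# Shared memory is normally mounted in the container at /dev/shm.
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm total: {total / 1024**3:.1f} GiB, free: {free / 1024**3:.1f} GiB")

# If DataLoader workers fail with shared-memory or "bus error" messages,
# increasing the shared memory size in the resource configuration is a common fix.
```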
Set up data sources
| Parameter | Description |
| --- | --- |
| Storage type | Select "Local disk" or "PFS". |
| Associated file system | Associate the PFS instance to use, or use the specified local disk path. |
| Mount path | Specify the PFS mount path or the local disk mount path (see the sketch after this table). |
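From inside the training container, the mount path configured here is where the data source becomes visible. A minimal sketch to verify the mount before training starts (the path is an assumed example):

```python
import os

mount_path = "/mnt/pfs"  # assumed example; use the mount path configured for this task

if os.path.isdir(mount_path):
    print(f"{mount_path} is mounted; top-level entries: {os.listdir(mount_path)[:5]}")
else:
    raise FileNotFoundError(f"{mount_path} is not visible inside the container")
```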
Advanced Configuration
| Parameter | Description |
| --- | --- |
| Tensorboard | Enable Tensorboard for this task. After enabling it, you need to specify the log reading path, and this path must match the Tensorboard log path used in your code, otherwise Tensorboard cannot obtain any data (see the sketch after this table). See: Training effect monitoring board. |
| Alarm | The Baige platform provides an alarm notification mechanism, with SMS/email notification, for training task status and training loss metrics. See: Configure message notification for tasks. |
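To keep the two paths consistent, the training code should write its Tensorboard events to the same directory that you configure as the log reading path. A minimal PyTorch sketch, assuming an example path of /mnt/pfs/tensorboard/my-task:

```python
from torch.utils.tensorboard import SummaryWriter

# Must match the "log reading path" configured for Tensorboard in the console;
# the path below is an assumed example.
writer = SummaryWriter(log_dir="/mnt/pfs/tensorboard/my-task")

for step in range(100):
    loss = 1.0 / (step + 1)  # placeholder for the real training loss
    writer.add_scalar("train/loss", loss, step)

writer.close()
```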
Submit Task
Confirm the parameters and click **Submit** to complete task creation.