Go to the **Training tasks** list page and click **Create Task**.
Basic information
Fill in the basic information of the training task.
| Parameter | Description |
| --- | --- |
| Task name | Fill in the task name. Lowercase letters, numbers, and hyphens (-) are supported; the name must start with a lowercase letter, end with a lowercase letter or number, and be 1-50 characters long (see the validation sketch after this table). |
| Resource pool | Select the Baige resource pool where the task will be deployed. |
| Queue | Select the queue under the resource pool in which the task will be deployed. |
| Priority | Select the task priority: high, medium, or low. |
| Training framework | PyTorch and MPI training frameworks are currently supported. |
| Fault tolerance | When fault tolerance is enabled, if a training task fails because of a node failure, the faulty node is blocked and the training task is rescheduled. See: Training fault tolerance. |
| Log persistence | When log persistence is enabled, your task logs are persisted to the log service (BLS). Log storage, reads, writes, and indexing incur fees. For details, see Price details. |
| Task creation method | Choose **Custom creation** or **Create based on AIAK acceleration template**. Custom creation: for scenarios that use a custom training image and parameters. Create based on AIAK acceleration template: for scenarios that directly use the AIAK-Training acceleration image; if this option is selected, you also need to fill in the training mode, training method, and AIAK training template. |
| Training mode (optional) | Two modes are supported: Post-Pretrain and SFT. This field is required if **Create based on AIAK acceleration template** is selected. |
| Training method (optional) | Full-parameter update and LoRA are supported. Full-parameter update: all parameters of the large model are updated during training. LoRA: the parameters of the pre-trained model are frozen and the original weight matrices in the self-attention modules are kept; LoRA applies a low-rank decomposition to these weight matrices and updates only the low-rank parameters during training (see the LoRA sketch after this table). This field is required if **Create based on AIAK acceleration template** is selected. |
| AIAK training template (optional) | Select a training template. Acceleration templates for mainstream open-source large models are provided, covering common open models such as Llama 2, Qwen, and Baichuan 2. This field is required if **Create based on AIAK acceleration template** is selected. |
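The naming rule above can be expressed as a simple pattern check. Below is a minimal, illustrative Python sketch (not part of the platform) that validates a task name against the stated constraints:

```python
import re

# Lowercase letters, digits, and hyphens; must start with a lowercase letter,
# end with a lowercase letter or digit, and be 1-50 characters long.
TASK_NAME_PATTERN = re.compile(r"^[a-z]([a-z0-9-]{0,48}[a-z0-9])?$")

def is_valid_task_name(name: str) -> bool:
    return TASK_NAME_PATTERN.fullmatch(name) is not None

assert is_valid_task_name("llama2-sft-run1")
assert not is_valid_task_name("2-starts-with-digit")
assert not is_valid_task_name("ends-with-hyphen-")
```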
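For reference, the LoRA idea described above (freeze the pre-trained weight matrix and learn only a low-rank update) can be sketched in a few lines of PyTorch. This is an illustrative sketch, not the AIAK implementation; the class name and default rank are assumptions:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight plus a trainable low-rank update B @ A of rank r."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # keep the pre-trained weights fixed
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Only lora_A and lora_B receive gradients during training.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
```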
Environment configuration
Fill in the training environment configuration. The required information depends on the task creation method.
**Custom creation**

| Parameter | Description |
| --- | --- |
| Image address | Fill in the training image. You can enter the image address directly or click to select an image. CCR enterprise images and Baige preset images are currently supported. For more information, see Container image service CCR. |
| Execute command | Specify the command that runs your code (see the sketch after this table). |
| Environment variables | Add environment variables; multiple entries are supported. |
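As an illustration of how the execute command and environment variables fit together, the sketch below shows a hypothetical entry script (launched, for example, with an execute command such as `python train.py`) that reads variables configured in the console. The script name, variable names, and paths are assumptions, not platform defaults:

```python
# train.py -- hypothetical entry point for a custom training task.
import os

# Values injected through the "Add environment variable" section of the console.
data_dir = os.environ.get("DATA_DIR", "/mnt/pfs/datasets")  # assumed variable and path
epochs = int(os.environ.get("EPOCHS", "1"))                 # assumed variable

def main() -> None:
    print(f"training for {epochs} epoch(s) on data under {data_dir}")
    # ... build the model, dataloaders, and optimizer here ...

if __name__ == "__main__":
    main()
```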
**Create based on AIAK acceleration template**

| Parameter | Description |
| --- | --- |
| Image address | The AIAK training acceleration template presets a default training image, which cannot be modified. |
| Execute command | The AIAK training acceleration template provides default parameters. You need to replace the dataset, checkpoint, tokenizer, and Tensorboard paths with your own paths (see the sketch after this section). |
| Environment variables | The AIAK training acceleration template provides default environment variables; modifying them is not recommended. |
The parameters to be replaced in the execution command are the dataset, checkpoint, tokenizer, and Tensorboard paths mentioned above. If you need to further adjust model parameters, edit them directly in the execution command before submitting.
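The sketch below illustrates this kind of path substitution. The placeholder names and paths are hypothetical examples, not the actual tokens in the AIAK template; check the default command shown in the console for the real ones:

```python
# Illustrative only: substitute user paths into a template command string.
# All placeholder names and paths below are made-up examples.
template_cmd = (
    "bash /workspace/train.sh "
    "--data-path {DATASET_PATH} "
    "--checkpoint-path {CHECKPOINT_PATH} "
    "--tokenizer-path {TOKENIZER_PATH} "
    "--tensorboard-dir {TENSORBOARD_PATH}"
)

user_paths = {
    "DATASET_PATH": "/mnt/pfs/datasets/my-corpus",
    "CHECKPOINT_PATH": "/mnt/pfs/checkpoints/llama2-7b",
    "TOKENIZER_PATH": "/mnt/pfs/tokenizers/llama2-7b",
    "TENSORBOARD_PATH": "/mnt/pfs/tensorboard/my-task",
}

print(template_cmd.format(**user_paths))
```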
Resource allocation
| Parameter | Description |
| --- | --- |
| Number of replicas | Set the number of replicas for the training task. |
| GPU type | Select the GPU type. |
| GPUs per replica | Set the number of GPU cards per replica according to the number of GPU cards currently available. |
| CPU/memory | The CPU and memory requested by the workload are unlimited by default; the remaining free resources on the node can be used. |
| Shared memory | Shared memory is used on Linux for data exchange and sharing between processes, improving application performance and efficiency. On the Baige platform, shared memory defaults to 10Gi and can be increased if the workload needs more (see the sketch after this table). |
| RDMA | When enabled, the system automatically schedules the task to nodes that support RDMA. |
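Inside the training container, shared memory is typically exposed as a tmpfs at /dev/shm. A minimal sketch to confirm the size you configured (10Gi is the platform default mentioned above):

```python
import shutil

# Shared memory is normally mounted in the container at /dev/shm.
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm total: {total / 1024**3:.1f} GiB, free: {free / 1024**3:.1f} GiB")

# If DataLoader workers fail with shared-memory or "bus error" messages,
# increasing the shared memory size in the resource configuration is a common fix.
```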
Set up data sources
| Parameter | Description |
| --- | --- |
| Storage type | Select "Local disk" or "PFS". |
| Associated file system | Associate the PFS instance to use, or use the specified local disk path. |
| Mount path | Specify the PFS mount path or the local disk mount path (see the sketch after this table). |
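From inside the training container, the mount path configured here is where the data source becomes visible. A minimal sketch to verify the mount before training starts (the path is an assumed example):

```python
import os

mount_path = "/mnt/pfs"  # assumed example; use the mount path configured for this task

if os.path.isdir(mount_path):
    print(f"{mount_path} is mounted; top-level entries: {os.listdir(mount_path)[:5]}")
else:
    raise FileNotFoundError(f"{mount_path} is not visible inside the container")
```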
Advanced Configuration
| Parameter | Description |
| --- | --- |
| Tensorboard | Enable Tensorboard for this task. After enabling it, you need to specify the log reading path, and this path must match the Tensorboard log path used in your code, otherwise Tensorboard cannot obtain any data (see the sketch after this table). See: Training effect monitoring board. |
| Alarm | The Baige platform provides an alarm notification mechanism, with SMS/email notification, for training task status and training loss metrics. See: Configure message notification for tasks. |
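To keep the two paths consistent, the training code should write its Tensorboard events to the same directory that you configure as the log reading path. A minimal PyTorch sketch, assuming an example path of /mnt/pfs/tensorboard/my-task:

```python
from torch.utils.tensorboard import SummaryWriter

# Must match the "log reading path" configured for Tensorboard in the console;
# the path below is an assumed example.
writer = SummaryWriter(log_dir="/mnt/pfs/tensorboard/my-task")

for step in range(100):
    loss = 1.0 / (step + 1)  # placeholder for the real training loss
    writer.add_scalar("train/loss", loss, step)

writer.close()
```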
Submit Task
Confirm the parameters and click **Submit** to complete task creation.