DeepSpeed

Overview

DeepSpeed is a deep learning optimization library.

How the deepspeed entry point launches DeepSpeed distributed training

DeepSpeed recommends using the native entry point deepspeed to launch distributed training.

E.g. deepspeed --hostfile myhostfile train.py --deepspeed --deepspeed_config ds_config.json

The entry point does the following:

Resolves the distributed world info from the command-line arguments or from the hostfile;
Uses PDSH to run deepspeed.launcher.launch on all nodes;
On each node, the launcher script sets the environment variables that describe the world info, then starts one process per GPU;
Each process runs the user script train.py, which performs the training.

As a result, the core logic of the deepspeed entry point is to generate the environment variables that the user script needs.

Note: The variables required by DeepSpeed include MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK, and LOCAL_RANK.
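
The per-process variables described above can be sketched as follows. This is a minimal illustration, not DeepSpeed's launcher code: the function name build_process_envs, the address, and the port are hypothetical, and it assumes every node has the same number of GPUs.

```python
def build_process_envs(master_addr, master_port, node_rank, num_nodes, gpus_per_node):
    """Return one dict of distributed-training variables per GPU process on a node.

    Hypothetical helper for illustration; assumes the same GPU count on every node.
    """
    world_size = num_nodes * gpus_per_node
    envs = []
    for local_rank in range(gpus_per_node):
        envs.append({
            "MASTER_ADDR": master_addr,
            "MASTER_PORT": str(master_port),
            "WORLD_SIZE": str(world_size),
            # Global rank: processes are numbered node by node.
            "RANK": str(node_rank * gpus_per_node + local_rank),
            "LOCAL_RANK": str(local_rank),
        })
    return envs


# Second node (node_rank=1) of a 2-node job with 2 GPUs per node.
# The address and port are made-up example values.
envs = build_process_envs("10.0.0.4", 29500, node_rank=1, num_nodes=2, gpus_per_node=2)
for env in envs:
    print(env["RANK"], env["LOCAL_RANK"], env["WORLD_SIZE"])
```

A real launcher would merge each dict into a copy of its own environment and start the user script as a subprocess with it.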

How to launch DeepSpeed distributed training with ComponentSDK

Currently, AzureML doesn’t support the native deepspeed entry point as the DeepSpeed launcher.

As an alternative, a DeepSpeed job can be run in either of the following ways:

Use the MPI launcher: the user script is called by mpirun on each node. DeepSpeed implements an automatic MPI discovery mechanism, which sets the DeepSpeed required variables from the MPI variables when the user script calls deepspeed.initialize. This is also a recommended way to run DeepSpeed in AzureML.
Use the torch.distributed launcher: by the time the user script is called, the environment variables required by torch.distributed have already been set, and they are the same as the DeepSpeed required variables.

As a consequence, both the MPI and the torch.distributed launchers do almost the same work as the native deepspeed entry point. We can therefore create a DistributedComponent that trains the model with DeepSpeed using either launcher.
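
To make the idea of MPI discovery concrete, here is a minimal sketch that maps Open MPI's per-process environment variables onto the DeepSpeed required variables. It is an illustration only: DeepSpeed's own discovery may obtain these values differently, the helper name dist_env_from_open_mpi is hypothetical, and the master address is passed in by hand rather than discovered.

```python
def dist_env_from_open_mpi(environ, master_addr, master_port=29500):
    """Translate Open MPI process variables into the DeepSpeed required variables.

    Hypothetical helper for illustration; master_addr and master_port are
    supplied by the caller because this sketch does no address discovery.
    """
    return {
        "RANK": environ["OMPI_COMM_WORLD_RANK"],
        "WORLD_SIZE": environ["OMPI_COMM_WORLD_SIZE"],
        "LOCAL_RANK": environ["OMPI_COMM_WORLD_LOCAL_RANK"],
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
    }


# Simulated environment of one process started by mpirun (example values).
fake_mpi_env = {
    "OMPI_COMM_WORLD_RANK": "3",
    "OMPI_COMM_WORLD_SIZE": "4",
    "OMPI_COMM_WORLD_LOCAL_RANK": "1",
}
dist_env = dist_env_from_open_mpi(fake_mpi_env, master_addr="10.0.0.4")
print(dist_env)
```

In a real process you would pass os.environ instead of the simulated dict and write the result back into os.environ before initializing the distributed backend.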

How to write a DeepSpeed distributed component yaml spec with the MPI or torch.distributed launcher

To create a component that trains the model with DeepSpeed, use the distributed component spec to describe the component. You can find a sample yaml here. The main points when preparing a DeepSpeed distributed component yaml are:

Specify the component type as “DistributedComponent”
Specify the launcher type as “mpi” or “torch.distributed”
Specify the component running environment

A predefined DeepSpeed environment is recommended. You can also build your own DeepSpeed image on top of the AzureML OpenMPI base images by following the installation instructions.

Example yaml:

$schema: https://componentsdk.azureedge.net/jsonschema/DistributedComponent.json
name: samples.deepspeed_cifar_mpi
version: 0.0.1
display_name: Deepspeed Train CIFAR
type: DistributedComponent
tags: {}
inputs:
  epochs:
    type: integer
    description: Number of epochs for the training
    default: 30
  batch_size:
    type: integer
    description: mini-batch size
    default: 32
  with_aml_log:
    type: boolean
    default: True
outputs: {}
environment:
  name: AzureML-DeepSpeed-0.3-GPU
launcher:
  type: mpi
  additional_arguments: >-
    python train.py --batch_size {inputs.batch_size} --epochs {inputs.epochs}
    --with_aml_log {inputs.with_aml_log} --deepspeed --deepspeed_config ds_config.json --deepspeed_mpi
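
On the script side, train.py has to accept the arguments that the launcher passes in additional_arguments. The following is a minimal sketch of such a parser, not the sample's actual train.py. It assumes the boolean input arrives as the text "True" or "False", and it declares the three DeepSpeed flags by hand so that the sketch runs without DeepSpeed installed; in a real script they are normally registered with deepspeed.add_config_arguments(parser). The helper names str2bool and build_parser are hypothetical.

```python
import argparse


def str2bool(value):
    """Parse the text form of a boolean input (assumed to be 'True'/'False')."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    parser = argparse.ArgumentParser(description="CIFAR training (argument sketch)")
    parser.add_argument("--epochs", type=int, default=30)
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--with_aml_log", type=str2bool, default=True)
    # Stand-ins for the flags DeepSpeed would normally register itself.
    parser.add_argument("--deepspeed", action="store_true")
    parser.add_argument("--deepspeed_config", type=str, default=None)
    parser.add_argument("--deepspeed_mpi", action="store_true")
    return parser


# The same argument list the yaml above would produce with its default inputs.
args = build_parser().parse_args([
    "--batch_size", "32", "--epochs", "30", "--with_aml_log", "True",
    "--deepspeed", "--deepspeed_config", "ds_config.json", "--deepspeed_mpi",
])
print(args)
```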

How to consume a DeepSpeed distributed component

Once the DeepSpeed distributed component is prepared, you can submit a pipeline that runs it with the Component SDK. Here is a sample notebook that submits a DeepSpeed component run. You can configure the distributed settings for the job with the Component SDK, for example by specifying the node count and the process count per node through RunSettings.

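# Assumes ws (the AzureML workspace) and cluster_name are already defined,
# and that Component, dsl and Pipeline have been imported from the Component SDK.
# Load the DeepSpeed component created from the component yaml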
deepspeed_train_func = Component.load(ws, name='samples.deepspeed_cifar_mpi', version='0.0.1')

# Generate a pipeline with one DeepSpeed distributed component
@dsl.pipeline(default_compute_target=cluster_name)
def deepspeed_pipeline(epochs=30, batch_size=32) -> Pipeline:
    deepspeed_train_component = deepspeed_train_func(epochs=epochs, batch_size=batch_size)
    # Note that the process count per node should not be greater than the GPU count of one instance.
    deepspeed_train_component.runsettings.resource_layout.configure(instance_count=2, process_count_per_node=2)
    return deepspeed_train_component.outputs

# submit the pipeline
pipeline = deepspeed_pipeline()
pipeline.validate()
run = pipeline.submit()

How to modify the DeepSpeed Configuration in a DeepSpeed job

DeepSpeed uses a DeepSpeed Configuration file to configure its features. Changing it is straightforward when you run a DeepSpeed job from the command line.

However, changing a config file is less convenient for a component job. The recommended ways to change the DeepSpeed Configuration are:

Declare the DeepSpeed Configuration as an input path: --deepspeed_config {inputs.ds_config_dir}/ds_config.json;
Declare a default config file, then modify the config in the user script. Since the configuration is read in deepspeed.initialize, you only need to write the updated config file before calling that function. There are two ways to do that:
- Pass individual settings (e.g. lr, train_batch_size) as input parameters, then update those values;
- Pass the whole config JSON string as an environment variable, then replace the whole config.
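
The second approach can be sketched as below: load the default config, or take the whole config from an environment variable if one is set, apply the overrides, and write the result before deepspeed.initialize is called. This is a minimal sketch under stated assumptions: the helper names, the DS_CONFIG_JSON variable name, and the file names are hypothetical, and the example config is reduced to a few keys.

```python
import json
import os
import tempfile


def deep_update(base, overrides):
    """Recursively merge overrides into base so nested keys such as lr survive."""
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            deep_update(base[key], value)
        else:
            base[key] = value
    return base


def write_updated_ds_config(default_path, output_path, overrides=None,
                            env_var="DS_CONFIG_JSON"):
    """Write the config that should be passed to DeepSpeed and return it.

    env_var is a hypothetical variable name: if it is set, its JSON content
    replaces the default config file entirely.
    """
    env_json = os.environ.get(env_var)
    if env_json:
        config = json.loads(env_json)
    else:
        with open(default_path) as f:
            config = json.load(f)
    deep_update(config, overrides or {})
    with open(output_path, "w") as f:
        json.dump(config, f, indent=2)
    return config


# Demo with a throwaway default config.
with tempfile.TemporaryDirectory() as tmp:
    default_path = os.path.join(tmp, "ds_config.json")
    updated_path = os.path.join(tmp, "ds_config_updated.json")
    with open(default_path, "w") as f:
        json.dump({"train_batch_size": 32,
                   "optimizer": {"type": "Adam", "params": {"lr": 0.001}}}, f)
    updated = write_updated_ds_config(
        default_path, updated_path,
        overrides={"train_batch_size": 64, "optimizer": {"params": {"lr": 0.0005}}},
    )
    print(updated)
```

The user script would then point its config argument at the updated file before calling deepspeed.initialize.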

Reference

DeepSpeed training example

CIFAR-10 Tutorial