Distributed Component
Overview
DistributedComponent is a component type that supports distributed training scenarios.
How to write a DistributedComponent YAML spec
A distributed component is defined by adding a launcher section to its spec.
For the full list of fields, please refer to the DistributedComponent spec doc and the DistributedComponent schema.
Example YAML:
$schema: https://componentsdk.azureedge.net/jsonschema/DistributedComponent.json
name: microsoft.com.azureml.samples.mpi_example
version: 0.0.1
display_name: MPI Example
type: DistributedComponent
inputs:
  input_path:
    type: path
    description: The directory contains input data.
    optional: false
  string_parameter:
    type: String
    description: A parameter accepts a string value.
    optional: true
outputs:
  output_path:
    type: path
    description: The directory contains output data.
launcher:
  type: mpi
  additional_arguments: >-
    python train.py --input-path {inputs.input_path} [--string-parameter {inputs.string_parameter}]
    --output-path {outputs.output_path}
environment:
  name: AzureML-Minimal
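The command in additional_arguments passes three flags to train.py. The source does not show train.py itself, so the following is only a minimal sketch of how such a script might parse those flags. The function name parse_args and the help strings are hypothetical. The sketch assumes that the square brackets around --string-parameter mean the flag is omitted when the optional input is not provided.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags named in the spec's additional_arguments (hypothetical train.py)."""
    parser = argparse.ArgumentParser(description="Sketch of a train.py entry point")
    # Required: the spec declares input_path with optional: false
    parser.add_argument("--input-path", required=True, help="Directory with input data")
    # Optional: the spec declares string_parameter with optional: true,
    # so the flag may be missing from the command line (assumption)
    parser.add_argument("--string-parameter", default=None, help="Optional string value")
    # Required: the output directory
    parser.add_argument("--output-path", required=True, help="Directory for output data")
    return parser.parse_args(argv)
```

Calling parse_args(["--input-path", "in", "--output-path", "out"]) leaves string_parameter as None, so the training code can fall back to a default when the optional input is not set.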
Launcher
We assume readers already understand the basic concepts of distributed training, such as _data parallelism, distributed data parallelism, and model parallelism_. This section aims to help readers understand the launcher concept and run existing distributed training code on AzureML.
In distributed training, the workload of training a model is split up and shared among multiple processors. These processors work in parallel to speed up model training.
Users rarely start all of the distributed processes manually; they usually rely on a launcher utility to do it.
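To make the launcher idea concrete, here is a toy sketch of what a launcher utility does on a single machine: it starts several copies of the same command and tells each copy which rank it has. This is not how AzureML or any real launcher is implemented. The function names are hypothetical, and the RANK and WORLD_SIZE variable names are borrowed from PyTorch's convention purely for illustration.

```python
import os
import subprocess
import sys


def build_worker_envs(world_size, base_env=None):
    """Return one environment dict per worker, each tagged with its rank."""
    base = dict(os.environ if base_env is None else base_env)
    envs = []
    for rank in range(world_size):
        env = dict(base)
        env["RANK"] = str(rank)              # this worker's index
        env["WORLD_SIZE"] = str(world_size)  # total number of workers
        envs.append(env)
    return envs


def launch(command, world_size):
    """Start world_size copies of command, wait for all, and return their exit codes."""
    procs = [subprocess.Popen(command, env=env) for env in build_worker_envs(world_size)]
    return [proc.wait() for proc in procs]


def demo():
    """Launch two local workers that each report their rank."""
    script = "import os; print('rank', os.environ['RANK'], 'of', os.environ['WORLD_SIZE'])"
    return launch([sys.executable, "-c", script], world_size=2)
```

A real launcher additionally handles multiple nodes, communication setup between workers, and failure handling, which is why users delegate this step.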
Supported Launchers
Based on how distributed training jobs are launched in frameworks such as PyTorch and TensorFlow, we define several launcher types. When you submit a distributed component run, AzureML allocates the nodes, sets the necessary environment variables, and calls the right command for you according to the launcher type. Below is the list of launcher types we aim to support in the Component SDK:
torch.distributed: supports native distributed PyTorch.
Tensorflow.distributed: intended for native distributed TensorFlow; it is not currently supported.
Deepspeed: DeepSpeed is a lightweight wrapper on PyTorch. The native DeepSpeed launcher is not currently supported, but DeepSpeed can be used via MPI or torch.distributed; see here for details.
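Because the same training code may run under either MPI or torch.distributed, it can help to read the process rank in a launcher-agnostic way. The source does not list the exact environment variables AzureML sets, so this sketch assumes the conventional names: RANK and WORLD_SIZE as used by PyTorch's launch utilities, and OMPI_COMM_WORLD_RANK and OMPI_COMM_WORLD_SIZE as set by Open MPI. The helper name get_rank_and_world_size is hypothetical.

```python
import os


def get_rank_and_world_size(environ=None):
    """Return (rank, world_size) from whichever launcher's variables are present."""
    env = os.environ if environ is None else environ
    # Names used by PyTorch's launch utilities (assumed to apply here)
    if "RANK" in env and "WORLD_SIZE" in env:
        return int(env["RANK"]), int(env["WORLD_SIZE"])
    # Names set by Open MPI's mpirun (assumed MPI implementation)
    if "OMPI_COMM_WORLD_RANK" in env and "OMPI_COMM_WORLD_SIZE" in env:
        return int(env["OMPI_COMM_WORLD_RANK"]), int(env["OMPI_COMM_WORLD_SIZE"])
    # No launcher variables found: behave as a single process
    return 0, 1
```

Falling back to (0, 1) lets the same script run unchanged on a local machine without any launcher.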
See the detailed doc for each launcher type: