Distributed Component

Overview

DistributedComponent is a component type that supports distributed training scenarios.

How to write a DistributedComponent YAML spec

A distributed component is defined by adding a launcher section to its spec.

For details, please refer to the DistributedComponent spec doc and the DistributedComponent Schema.

Example yaml:

$schema: https://componentsdk.azureedge.net/jsonschema/DistributedComponent.json
name: microsoft.com.azureml.samples.mpi_example
version: 0.0.1
display_name: MPI Example
type: DistributedComponent
inputs:
  input_path:
    type: path
    description: The directory that contains the input data.
    optional: false
  string_parameter:
    type: String
    description: A parameter that accepts a string value.
    optional: true
outputs:
  output_path:
    type: path
    description: The directory that contains the output data.
launcher:
  type: mpi
  additional_arguments: >-
    python train.py --input-path {inputs.input_path} [--string-parameter {inputs.string_parameter}]
    --output-path {outputs.output_path}
environment:
  name: AzureML-Minimal
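
The launcher command in the example passes three arguments to train.py. The source does not show train.py itself, so the following is only a minimal sketch of how such a script might parse them; the function name parse_args is hypothetical. The square brackets around --string-parameter in the spec appear to mark an optional argument (the input is declared with optional: true), so the sketch gives it a default of None.

```python
import argparse


def parse_args(argv=None):
    """Parse the arguments that the example launcher command passes to train.py."""
    parser = argparse.ArgumentParser(description="Hypothetical train.py entry point")
    # Required: matches {inputs.input_path} in the spec.
    parser.add_argument("--input-path", required=True, help="Directory with input data")
    # Optional: the spec wraps this argument in [ ], so it may be absent.
    parser.add_argument("--string-parameter", default=None, help="Optional string value")
    # Required: matches {outputs.output_path} in the spec.
    parser.add_argument("--output-path", required=True, help="Directory for output data")
    return parser.parse_args(argv)


# Demo with an explicit argument list instead of sys.argv.
args = parse_args(["--input-path", "data/in", "--output-path", "data/out"])
print(args.input_path, args.output_path, args.string_parameter)
```

In a real script, calling parse_args() with no argument reads the values from the command line that the launcher builds.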

Launcher

We assume readers already understand the basic concepts of distributed training, such as _data parallelism, distributed data parallelism, and model parallelism_. This section aims to help readers understand the launcher concept and run existing distributed training code on AzureML.

In distributed training, the workload of training a model is split up and shared among multiple processors. These processors work in parallel to speed up model training.
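
As an illustration of splitting a workload, the sketch below shows one common data-parallel convention: each process takes every world_size-th item, starting at its own rank. The names shard, rank, and world_size are illustrative only and are not part of the Component SDK.

```python
def shard(items, rank, world_size):
    """Return the subset of items handled by the process with the given rank."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    # Strided slicing: rank 0 gets items 0, N, 2N, ...; rank 1 gets 1, N+1, ...
    return items[rank::world_size]


files = [f"part-{i}.csv" for i in range(10)]
for rank in range(4):
    print(rank, shard(files, rank, 4))
```

Every item is assigned to exactly one process, and shard sizes differ by at most one.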

Users rarely start every distributed process by hand; they typically rely on a launcher utility to do so.

Supported Launchers

Frameworks such as PyTorch and TensorFlow launch distributed training jobs in different ways, so we define a separate launcher type for each approach. When you submit a distributed Component run, AzureML allocates the nodes, sets the necessary environment variables, and calls the right command for you according to the launcher type. Below is the list of launcher types we aim to support in the Component SDK:

  • MPI: supports the MPI distributed training model, which can be used with most frameworks.
    • Horovod: Horovod works well with the MPI launcher, so there is no separate Horovod launcher type. See here for details.

  • torch.distributed: supports native distributed PyTorch.

  • Tensorflow.distributed: intended for native distributed TensorFlow; it is currently not supported.

  • DeepSpeed: DeepSpeed is a lightweight wrapper on PyTorch. A native DeepSpeed launcher is currently not supported, but DeepSpeed can be used via MPI or torch.distributed; see here for details.
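
The launcher types above differ in how each process learns its position in the job. The sketch below reads the rank and world size from environment variables: RANK and WORLD_SIZE are the names used by torch.distributed launchers, and OMPI_COMM_WORLD_RANK and OMPI_COMM_WORLD_SIZE are the names set by Open MPI. Which variables are actually present in an AzureML run depends on the launcher type and the MPI implementation, so treat this list as an assumption and verify it in your own environment. The helper get_rank_and_world_size is hypothetical.

```python
import os

# Candidate (rank, world size) variable pairs, checked in order.
# Assumption: torch.distributed-style names first, then Open MPI names.
_ENV_PAIRS = [
    ("RANK", "WORLD_SIZE"),
    ("OMPI_COMM_WORLD_RANK", "OMPI_COMM_WORLD_SIZE"),
]


def get_rank_and_world_size(environ=None):
    """Return (rank, world_size), falling back to (0, 1) for a single process."""
    env = os.environ if environ is None else environ
    for rank_key, size_key in _ENV_PAIRS:
        if rank_key in env and size_key in env:
            return int(env[rank_key]), int(env[size_key])
    # No known variables found: behave as a non-distributed run.
    return 0, 1


rank, world_size = get_rank_and_world_size()
print(f"process {rank} of {world_size}")
```

Falling back to (0, 1) lets the same training script run unchanged on a single machine without any launcher.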

See the detailed doc for each launcher type: