Horovod
Overview
Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
The python script using Horovod can be launched by two commands, horovodrun and mpirun.
horovodrun
horovodrunis a convenient, Open MPI-based wrapper for running Horovod script. It supports some unique Horovod knobs, such as--autotune,--fusion-threshold-mb, etc. Below is an example:horovodrun -np 4 -H server1:2,server2:2 --autotune python train.py
mpirun
Horovod jobs can also be launched by the
mpiruncommand directly, and this is what we do today in Azure ML.Most of the
horovodruncommands have the equivalentmpiruncommands. And for thehorovodrunsupports knobs, many of them can be used withmpirunthrough the use of environment variables. Below is an example equivalent with the abovehorovodrunexample:mpirun -x HOROVOD_AUTOTUNE=1 -np 4 -H server1:2,server2:2 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python train.py
For a Horovod knob and environment variable name lookup, please check the table in the reference.
Note:
Component SDK currently only supports mpirun approach.
How to write horovod distributed component yaml spec
To create a component that trains the model by Horovod, you need to use the distributed component spec to describe this component. You can find a sample yaml here. And the main points to prepare a horovod distributed component yaml are:
Specify the component type as “DistributedComponent”
Specify the launcher type as “mpi”
Specify the component running environment
The environment should meet the requirements of the Horovod. You can provide one OpenMPI image as well as the Horovod package as a pip requirement.
Example yaml:
$schema: https://componentsdk.azureedge.net/jsonschema/DistributedComponent.json
name: samples.train_sentiment_classification
version: 0.0.1
display_name: Train Sentiment Classification
type: DistributedComponent
description: A dummy component to show how to distributedly train sentiment classification with Horovod by
custom component.
tags: {}
inputs:
input_data:
type: path
max_tokens:
type: integer
default: 10000
sequence_length:
type: integer
default: 260
embedding_dimension:
type: integer
default: 16
epochs:
type: integer
default: 10
batch_size:
type: integer
default: 32
learning_rate:
type: float
default: 0.001
outputs:
trained_model:
type: ModelDirectory
environment:
docker:
image: mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.1-cudnn7-ubuntu18.04
conda:
conda_dependencies:
name: project_environment
channels:
- defaults
dependencies:
- python=3.6.8
- pip=20.2
- pip:
- tensorflow-gpu==2.3.0
- horovod==0.19.5
- pandas==1.0.4
os: Linux
launcher:
type: mpi
additional_arguments: >-
python train_sentiment_classification.py --input-data {inputs.input_data} --trained-model {outputs.trained_model}
--max-tokens {inputs.max_tokens} --sequence-length {inputs.sequence_length}
--embedding-dimension {inputs.embedding_dimension} --epochs {inputs.epochs} --batch-size {inputs.batch_size}
--learning-rate {inputs.learning_rate}
How to consume horovod distributed component
After the horovod distributed component prepared, you can submit a pipeline to run this component by Component SDK. Here is a sample notebook to submit a horovod component run. You can configure the Horovod job with Component API. For example, specify node count, process count per node and horovod knobs by RunSettings.
# Load the horovod component you create by the component yaml
horovod_component_func = Component.load(ws, name='horovod_sample_component', version='0.0.1')
# Generate a pipeline with one horovod component
@dsl.pipeline(name='Horovod_component_example')
def generated_pipeline() -> Pipeline:
horovod_component = horovod_component_func(script_arg_0=arg_0, script_arg_1=arg_1)
# below runsettings give the equivalent command as:
# horovodrun -np 4 -H server1:2,server2:2 --timeline-filename ./outputs/timeline.json python component_entry.py \
# --script-arg-0 arg_0 --script-arg-1 arg_1
horovod_component.runsettings.environment_variables = {'HOROVOD_TIMELINE': './outputs/timeline.json'} # specify the Horovod --timeline-filename
horovod_component.runsettings.resource_layout.configure(
node_count=2, # use 2 nodes to train the model
process_count_per_node=2) # use 2 processes per node
# submit the pipeline
pipeline = generated_pipeline()
pipeline.submit(experiment_name='horovod-sample-experiment')
Reference
Prepare Horovod script
Horovod arguments and environment variable name lookup
| Category | Argument | Environment variables support | Description |
|---|---|---|---|
| Optional arguments | -np | Total number of training processes. | |
| --disable-cache | Disable horovod initialization check | ||
| --start-timeout | Has to perform all the checks and init before the timeout | ||
| --network-interface | Network interface for communication | ||
| --output-filename | For Gloo, writes stdout / stderr of all processes to a filename | ||
| --config-file | Path to YAML file containing runtime parameter configuration for Horovod. | ||
| SSH arguments | -p | SSH port | |
| -I | Private key file | ||
| Tuneable parameter | --fusion-threshold-mb | HOROVOD_FUSION_THRESHOLD | Fusion buffer size |
| --cycle-time-ms | HOROVOD_CYCLE_TIME | Delay between each tensor fusion cycle | |
| --cache-capacity | OROVOD_CACHE_CAPACITY | Maximum number of tensor names that will be cached | |
| --hierarchical-allreduce | HOROVOD_HIERARCHICAL_ALLREDUCE | Perform hierarchical allreduce between workers instead of ring allreduce. | |
| --no-hierarchical-allreduce | Disable hierarchical allreduce | ||
| --hierarchical-allgather | HOROVOD_HIERARCHICAL_ALLGATHER | Perform hierarchical allgather between workers instead of ring allgather | |
| --no-hierarchical-allgather | Disable hierarchical allgather | ||
| Autotune | --autotune | HOROVOD_AUTOTUNE | Auto select params to maximize the throughput |
| --autotune-log-file | HOROVOD_AUTOTUNE_LOG | Comma-separated log of trials containing each hyperparameter and the score of the trial | |
| --autotune-warmup-samples | HOROVOD_AUTOTUNE_WARMUP_SAMPLES | Number of samples to discard before beginning the optimization process during autotuning. | |
| --autotune-steps-per-sample | HOROVOD_AUTOTUNE_STEPS_PER_SAMPLE | Number of steps (approximate) to record before observing a sample. | |
| --autotune-bayes-opt-max-samples | HOROVOD_AUTOTUNE_BAYES_OPT_MAX_SAMPLES | Maximum number of samples to collect for each Bayesian optimization process. | |
| --autotune-gaussian-process-noise | HOROVOD_AUTOTUNE_GAUSSIAN_PROCESS_NOISE | Regularization value [0, 1] applied to account for noise in samples | |
| Elastic | --min-np | Minimum number of processes running for training to continue | |
| --max-np | Maximum number of training processes, beyond which no additional processes will be created. | ||
| --slots-per-host | Number of slots for processes per host. Normally 1 slot per GPU per host. | ||
| --elastic-timeout | Timeout for elastic initialisation after re-scaling the cluster | ||
| --reset-limit | Maximum number of times that the training job can scale up or down | ||
| Timeline | --timeline-filename | HOROVOD_TIMELINE | JSON file containing timeline of Horovod events used for debugging performance. |
| --timeline-mark-cycles | HOROVOD_TIMELINE_MARK_CYCLES | Mark cycles on the timeline. | |
| Stall check | --no-stall-check | HOROVOD_STALL_CHECK_DISABLE | The stall check will log a warning when workers have stalled waiting |
| --stall-check-warning-time-seconds | HOROVOD_STALL_CHECK_TIME_SECONDS | Seconds until the stall warning is logged to stderr. | |
| --stall-check-shutdown-time-seconds | HOROVOD_STALL_SHUTDOWN_TIME_SECONDS | Seconds until Horovod is shutdown due to stall. | |
| Library | --mpi-threads-disable | HOROVOD_MPI_THREADS_DISABLE | Disable MPI threading support. |
| --mpi-args | Extra MPI arguments to pass to mpirun | ||
| --tcp | NCCL_IB_DISABLE | If this flag is set, only TCP is used for communication. | |
| --binding-args | Process binding arguments. | ||
| --num-nccl-streams | HOROVOD_NUM_NCCL_STREAMS | Number of NCCL streams. | |
| --thread-affinity | HOROVOD_THREAD_AFFINITY | Horovod background thread affinity. | |
| --gloo-timeout-seconds | HOROVOD_GLOO_TIMEOUT_SECONDS | Timeout in seconds for Gloo operations to complete. | |
| Logging | --log-level | HOROVOD_LOG_LEVEL | Minimum level to log to stderr from the Horovod backend. |
| --log-without-timestamp | HOROVOD_LOG_HIDE_TIME | Hide the timestamp from Horovod internal log messages. | |
| --prefix-output-with-timestamp | Timestamp each line of output to stdout, stderr, and stddiag. | ||
| Host | --hosts | List of host names and the number of available slots for running processes on each. | |
| --hostfile | Path to a host file containing the list of host names and the number of available slots | ||
| --host-discovery-script | Used for elastic training (autoscaling and fault tolerance). | ||
| Controller | --gloo | Run Horovod using the Gloo controller. | |
| --mpi | Run Horovod using the MPI controller. | ||
| --jsrun | Launch Horovod processes with jsrun and use the MPI controller |