Horovod

Overview

Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. The python script using Horovod can be launched by two commands, horovodrun and mpirun.

  • horovodrun

    horovodrun is a convenient, Open MPI-based wrapper for running Horovod script. It supports some unique Horovod knobs, such as --autotune, --fusion-threshold-mb, etc. Below is an example:

    horovodrun -np 4 -H server1:2,server2:2 --autotune python train.py
    
  • mpirun

    Horovod jobs can also be launched by the mpirun command directly, and this is what we do today in Azure ML.

    Most of the horovodrun commands have the equivalent mpirun commands. And for the horovodrun supports knobs, many of them can be used with mpirun through the use of environment variables. Below is an example equivalent with the above horovodrun example:

    mpirun -x HOROVOD_AUTOTUNE=1 -np 4 -H server1:2,server2:2 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python train.py
    

    For a Horovod knob and environment variable name lookup, please check the table in the reference.

Note:

Component SDK currently only supports mpirun approach.

How to write horovod distributed component yaml spec

To create a component that trains the model by Horovod, you need to use the distributed component spec to describe this component. You can find a sample yaml here. And the main points to prepare a horovod distributed component yaml are:

  • Specify the component type as “DistributedComponent”

  • Specify the launcher type as “mpi”

  • Specify the component running environment

    The environment should meet the requirements of the Horovod. You can provide one OpenMPI image as well as the Horovod package as a pip requirement.

Example yaml:

$schema: https://componentsdk.azureedge.net/jsonschema/DistributedComponent.json
name: samples.train_sentiment_classification
version: 0.0.1
display_name: Train Sentiment Classification
type: DistributedComponent
description: A dummy component to show how to distributedly train sentiment classification with Horovod by
  custom component.
tags: {}
inputs:
  input_data:
    type: path
  max_tokens:
    type: integer
    default: 10000
  sequence_length:
    type: integer
    default: 260
  embedding_dimension:
    type: integer
    default: 16
  epochs:
    type: integer
    default: 10
  batch_size:
    type: integer
    default: 32
  learning_rate:
    type: float
    default: 0.001
outputs:
  trained_model:
    type: ModelDirectory
environment:
  docker:
    image: mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.1-cudnn7-ubuntu18.04
  conda:
    conda_dependencies:
      name: project_environment
      channels:
        - defaults
      dependencies:
        - python=3.6.8
        - pip=20.2
        - pip:
            - tensorflow-gpu==2.3.0
            - horovod==0.19.5
            - pandas==1.0.4
  os: Linux
launcher:
  type: mpi
  additional_arguments: >-
    python train_sentiment_classification.py --input-data {inputs.input_data} --trained-model {outputs.trained_model}
    --max-tokens {inputs.max_tokens} --sequence-length {inputs.sequence_length}
    --embedding-dimension {inputs.embedding_dimension} --epochs {inputs.epochs} --batch-size {inputs.batch_size}
    --learning-rate {inputs.learning_rate}

How to consume horovod distributed component

After the horovod distributed component prepared, you can submit a pipeline to run this component by Component SDK. Here is a sample notebook to submit a horovod component run. You can configure the Horovod job with Component API. For example, specify node count, process count per node and horovod knobs by RunSettings.

# Load the horovod component you create by the component yaml
horovod_component_func = Component.load(ws, name='horovod_sample_component', version='0.0.1')

# Generate a pipeline with one horovod component
@dsl.pipeline(name='Horovod_component_example')
def generated_pipeline() -> Pipeline:
    horovod_component = horovod_component_func(script_arg_0=arg_0, script_arg_1=arg_1)
	
    # below runsettings give the equivalent command as:
    # horovodrun -np 4 -H server1:2,server2:2 --timeline-filename ./outputs/timeline.json python component_entry.py \
    # --script-arg-0 arg_0 --script-arg-1 arg_1
    horovod_component.runsettings.environment_variables = {'HOROVOD_TIMELINE': './outputs/timeline.json'}  # specify the Horovod --timeline-filename
    horovod_component.runsettings.resource_layout.configure(
        node_count=2,  # use 2 nodes to train the model
        process_count_per_node=2)  # use 2 processes per node

# submit the pipeline
pipeline = generated_pipeline()
pipeline.submit(experiment_name='horovod-sample-experiment')

Reference

Prepare Horovod script

  1. Horovod examples

  2. Distributed model training using Horovod

Horovod arguments and environment variable name lookup

Category Argument Environment variables support Description
Optional arguments -np Total number of training processes.
--disable-cache Disable horovod initialization check
--start-timeout Has to perform all the checks and init before the timeout
--network-interface Network interface for communication
--output-filename For Gloo, writes stdout / stderr of all processes to a filename
--config-file Path to YAML file containing runtime parameter configuration for Horovod.
SSH arguments -p SSH port
-I Private key file
Tuneable parameter --fusion-threshold-mb HOROVOD_FUSION_THRESHOLD Fusion buffer size
--cycle-time-ms HOROVOD_CYCLE_TIME Delay between each tensor fusion cycle
--cache-capacity OROVOD_CACHE_CAPACITY Maximum number of tensor names that will be cached
--hierarchical-allreduce HOROVOD_HIERARCHICAL_ALLREDUCE Perform hierarchical allreduce between workers instead of ring allreduce.
--no-hierarchical-allreduce Disable hierarchical allreduce
--hierarchical-allgather HOROVOD_HIERARCHICAL_ALLGATHER Perform hierarchical allgather between workers instead of ring allgather
--no-hierarchical-allgather Disable hierarchical allgather
Autotune --autotune HOROVOD_AUTOTUNE Auto select params to maximize the throughput
--autotune-log-file HOROVOD_AUTOTUNE_LOG Comma-separated log of trials containing each hyperparameter and the score of the trial
--autotune-warmup-samples HOROVOD_AUTOTUNE_WARMUP_SAMPLES Number of samples to discard before beginning the optimization process during autotuning.
--autotune-steps-per-sample HOROVOD_AUTOTUNE_STEPS_PER_SAMPLE Number of steps (approximate) to record before observing a sample.
--autotune-bayes-opt-max-samples HOROVOD_AUTOTUNE_BAYES_OPT_MAX_SAMPLES Maximum number of samples to collect for each Bayesian optimization process.
--autotune-gaussian-process-noise HOROVOD_AUTOTUNE_GAUSSIAN_PROCESS_NOISE Regularization value [0, 1] applied to account for noise in samples
Elastic --min-np Minimum number of processes running for training to continue
--max-np Maximum number of training processes, beyond which no additional processes will be created.
--slots-per-host Number of slots for processes per host. Normally 1 slot per GPU per host.
--elastic-timeout Timeout for elastic initialisation after re-scaling the cluster
--reset-limit Maximum number of times that the training job can scale up or down
Timeline --timeline-filename HOROVOD_TIMELINE JSON file containing timeline of Horovod events used for debugging performance.
--timeline-mark-cycles HOROVOD_TIMELINE_MARK_CYCLES Mark cycles on the timeline.
Stall check --no-stall-check HOROVOD_STALL_CHECK_DISABLE The stall check will log a warning when workers have stalled waiting
--stall-check-warning-time-seconds HOROVOD_STALL_CHECK_TIME_SECONDS Seconds until the stall warning is logged to stderr.
--stall-check-shutdown-time-seconds HOROVOD_STALL_SHUTDOWN_TIME_SECONDS Seconds until Horovod is shutdown due to stall.
Library --mpi-threads-disable HOROVOD_MPI_THREADS_DISABLE Disable MPI threading support.
--mpi-args Extra MPI arguments to pass to mpirun
--tcp NCCL_IB_DISABLE If this flag is set, only TCP is used for communication.
--binding-args Process binding arguments.
--num-nccl-streams HOROVOD_NUM_NCCL_STREAMS Number of NCCL streams.
--thread-affinity HOROVOD_THREAD_AFFINITY Horovod background thread affinity.
--gloo-timeout-seconds HOROVOD_GLOO_TIMEOUT_SECONDS Timeout in seconds for Gloo operations to complete.
Logging --log-level HOROVOD_LOG_LEVEL Minimum level to log to stderr from the Horovod backend.
--log-without-timestamp HOROVOD_LOG_HIDE_TIME Hide the timestamp from Horovod internal log messages.
--prefix-output-with-timestamp Timestamp each line of output to stdout, stderr, and stddiag.
Host --hosts List of host names and the number of available slots for running processes on each.
--hostfile Path to a host file containing the list of host names and the number of available slots
--host-discovery-script Used for elastic training (autoscaling and fault tolerance).
Controller --gloo Run Horovod using the Gloo controller.
--mpi Run Horovod using the MPI controller.
--jsrun Launch Horovod processes with jsrun and use the MPI controller