Process Group and Communication Backend

The backbone of any distributed training is based on a group of processes that knows each other and can communicate with each other using a backend. For PyTorch, the process group is created by calling torch.distributed.init_process_group in all distributed processes to collectively form a process group.

torch.distributed.init_process_group(backend='nccl', init_method='env://', ...)

The most common backend used are mpi, nccl and gloo. For GPU based training nccl is strongly preferred and should be used whenever possible. Init_method specifies how each processes can discover each other and initialize as well as verify the process group using the communication backend. By default, PyTorch will look for environment variables. The following is a list of key environment variables, documented here.

MASTER_PORT - required; has to be a free port on machine with rank 0
MASTER_ADDR - required; address of rank 0 node
WORLD_SIZE - required; can be set either here, or in a call to init method. The total number of processes. This should be equal to the number of devices (GPU) used for distributed training.
RANK - required; can be set either here, or in a call to init method. The rank (0 to world_size - 1) of the current process in the process group.

Beyond these, many application will need the following

LOCAL_RANK - the relative rank within the node. This information is useful for example many operations such as data preparation only should be performed once per node — usually on local_rank = 0.
NODE_RANK - the relative rank for the node among all the nodes.

Note:

RANK can be inferred by NODE_RANK and LOCAL_RANK. NODE_RANK is often used by utility launcher script (such as torch.distributed.launch) that can created multiple processes on the same node.

LOCAL_RANK and RANK are both process level environment variables which are not set for the node but for the process.

Distributed Training Launcher

Users rarely launch all distributed processes manually and often rely on a utility launcher. There are 3 types of utility launchers.

Per-process-launcher: the system will launch all distributed processes for users, with all the relevant information (e.g. environment variables) to set up process groups.
Per-node-launcher: the utility launcher will launch processes on a given node. User is responsible to run the launcher from multiple nodes and provide global information such as WORLD_SIZE and MASTER_ADDR, MASTER_PORT. Locally within each node, RANK and LOCAL_RANK is set up by the launcher, with user provided NODE_RANK. torch.distributed.launch belongs to this category, as well as pytorch-lightning Trainer using ddp accelerator.
Head-node-launcher: User run launcher at the head provide information about the cluster (e.g. a hostfile) and launcher arguments, the training script and arguments for the training script. Three examples of head node launcher are mpirun, DeepSpeed launcher and Horovodrun.

This three categories of launchers are named with respect to user experience. Per-process-launcher means user does not need to do extra launching effort, per-node-launcher means user need to be able to run launcher script on every node, and head-node-launcher requires user to get on a headnode with cluster information usually in a hostfile. There are no fundamental differences between the three types of launchers and eventually what matters is the process group getting initiated with the proper backend. Behind the scene a head-node-launcher is often used on behalf of the user by the system so user are exposed to a per-process-launcher experience. Head-node-launcher is often implemented as a wrapper of per-node-launcher.