Gradient AllReduce
The Gradient AllReduce algorithm is a popular synchronous data-parallel distributed training algorithm. It is the algorithm implemented in most existing solutions, such as PyTorch DistributedDataParallel, Horovod, and TensorFlow MirroredStrategy.
With this algorithm, each worker performs the following steps in each iteration (see the sketch after the list):
- Compute the gradient using a minibatch.
- Compute the mean of the gradients on all workers by using the AllReduce collective.
- Update the model with the averaged gradient.
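To make these three steps concrete, here is a minimal sketch written with plain PyTorch collectives rather than Bagua's internals. It assumes torch.distributed has already been initialized and that every worker holds a replica of the same model:

import torch
import torch.distributed as dist

def train_step(model, optimizer, loss_fn, inputs, targets):
    # Step 1: compute the gradient using a minibatch.
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # Step 2: average the gradients across all workers with the AllReduce collective.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

    # Step 3: update the model with the averaged gradient.
    optimizer.step()
    return loss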
In Bagua, this algorithm is supported via the GradientAllReduce algorithm class. The performance of Bagua's GradientAllReduce implementation with default settings should be on par with PyTorch DDP and faster than Horovod in most cases. Bagua supports additional optimizations, such as hierarchical communication, that can be configured when instantiating the GradientAllReduce class. These optimizations can make Bagua faster than other implementations in certain scenarios, for example when the inter-machine network is a bottleneck.
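For example, hierarchical communication aggregates gradients within each machine first and only then communicates across machines. The snippet below is a sketch of how this could be enabled, assuming the GradientAllReduceAlgorithm constructor accepts a hierarchical flag; check the API documentation of your Bagua version for the exact parameter name:

from bagua.torch_api.algorithms import gradient_allreduce

# Intra-machine aggregation first, then inter-machine communication.
# The `hierarchical` argument is assumed here; verify it against the API docs.
algorithm = gradient_allreduce.GradientAllReduceAlgorithm(hierarchical=True)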
Example usage
A complete example of running Gradient AllReduce can be found in the Bagua examples with the --algorithm gradient_allreduce command line argument.
You need to initialize the Bagua algorithm as follows (see the API documentation for the parameters you can customize):
from bagua.torch_api.algorithms import gradient_allreduce
algorithm = gradient_allreduce.GradientAllReduceAlgorithm()
Then decorate your model with:
model = model.with_bagua([optimizer], algorithm)
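Putting the pieces together, a minimal end-to-end script might look like the following sketch. MyModel and train_loader are placeholders for your own module and DataLoader, and the script is assumed to be started with the Bagua launcher so that bagua.init_process_group() can set up one process per GPU; see the Bagua examples for a complete, tested version:

import torch
import bagua.torch_api as bagua
from bagua.torch_api.algorithms import gradient_allreduce

# One process per GPU; the launcher provides the environment used here.
torch.cuda.set_device(bagua.get_local_rank())
bagua.init_process_group()

model = MyModel().cuda()                      # placeholder for your own model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

algorithm = gradient_allreduce.GradientAllReduceAlgorithm()
model = model.with_bagua([optimizer], algorithm)

for inputs, targets in train_loader:          # placeholder DataLoader
    optimizer.zero_grad()
    output = model(inputs.cuda())
    loss = torch.nn.functional.cross_entropy(output, targets.cuda())
    loss.backward()
    optimizer.step()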