Gradient AllReduce
The Gradient AllReduce algorithm is a popular synchronous data-parallel distributed training algorithm. It is the algorithm implemented in most existing solutions, such as PyTorch DistributedDataParallel, Horovod, and TensorFlow MirroredStrategy.
With this algorithm, each worker performs the following steps in each iteration (see the sketch after the list):
- Compute the gradient using a minibatch.
- Compute the mean of the gradients on all workers by using the AllReduce collective.
- Update the model with the averaged gradient.
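To make these three steps concrete, here is a minimal sketch written with plain PyTorch collectives rather than Bagua's internals. It assumes torch.distributed has already been initialized and that every worker holds a replica of the same model:

import torch
import torch.distributed as dist

def train_step(model, optimizer, loss_fn, inputs, targets):
    # Step 1: compute the gradient using a minibatch.
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # Step 2: average the gradients across all workers with the AllReduce collective.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

    # Step 3: update the model with the averaged gradient.
    optimizer.step()
    return loss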
In Bagua, this algorithm is supported via the GradientAllReduce algorithm class. The performance of Bagua's GradientAllReduce implementation with default settings should be on par with PyTorch DDP and faster than Horovod in most cases. Bagua supports additional optimizations, such as hierarchical communication, that can be configured when instantiating the GradientAllReduce class. These optimizations can make Bagua faster than other implementations in certain scenarios, for example when the inter-machine network is a bottleneck.
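For example, hierarchical communication aggregates gradients within each machine first and only then communicates across machines. The snippet below is a sketch of how this could be enabled, assuming the GradientAllReduceAlgorithm constructor accepts a hierarchical flag; check the API documentation of your Bagua version for the exact parameter name:

from bagua.torch_api.algorithms import gradient_allreduce

# Intra-machine aggregation first, then inter-machine communication.
# The `hierarchical` argument is assumed here; verify it against the API docs.
algorithm = gradient_allreduce.GradientAllReduceAlgorithm(hierarchical=True)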
Example usage
A complete example of running Gradient AllReduce can be found in the Bagua examples with the --algorithm gradient_allreduce command line argument.
You need to initialize the Bagua algorithm as follows (see the API documentation for the parameters you can customize):
from bagua.torch_api.algorithms import gradient_allreduce
algorithm = gradient_allreduce.GradientAllReduceAlgorithm()
Then decorate your model with:
model = model.with_bagua([optimizer], algorithm)
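Putting the pieces together, a minimal end-to-end script might look like the following sketch. MyModel and train_loader are placeholders for your own module and DataLoader, and the script is assumed to be started with the Bagua launcher so that bagua.init_process_group() can set up one process per GPU; see the Bagua examples for a complete, tested version:

import torch
import bagua.torch_api as bagua
from bagua.torch_api.algorithms import gradient_allreduce

# One process per GPU; the launcher provides the environment used here.
torch.cuda.set_device(bagua.get_local_rank())
bagua.init_process_group()

model = MyModel().cuda()                      # placeholder for your own model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

algorithm = gradient_allreduce.GradientAllReduceAlgorithm()
model = model.with_bagua([optimizer], algorithm)

for inputs, targets in train_loader:          # placeholder DataLoader
    optimizer.zero_grad()
    output = model(inputs.cuda())
    loss = torch.nn.functional.cross_entropy(output, targets.cuda())
    loss.backward()
    optimizer.step()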