More Optimizations
Besides communication algorithms, Bagua provides several convenient tools to further accelerate your training workloads. Currently we support:
- Generic fused optimizer, which fuses the optimizer step operations of multiple layers into fewer, larger updates. It is generic because it can be applied to arbitrary PyTorch optimizers, in contrast to NVIDIA Apex's approach, where only a few specific optimizers are reimplemented. See the first sketch after this list for typical usage.
- Load balanced data loader, which accelerates workloads such as NLP and speech recognition where training samples vary in length. This data loader distributes training samples so that each worker receives samples of similar length, letting all workers finish a batch in roughly the same time and mitigating the straggler problem in distributed setups. A usage sketch follows after this list.
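
As a quick illustration of the fused optimizer, the sketch below wraps a stock PyTorch optimizer. It assumes the `fuse_optimizer` helper in `bagua.torch_api.contrib` and its `fuse_step()` method as described in Bagua's documentation; the exact import path, arguments, and the `train_loader` variable are assumptions to verify against the current API reference.

```python
import torch
from bagua.torch_api.contrib import fuse_optimizer

model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).cuda()

# Any stock PyTorch optimizer can be fused; no Apex-style reimplementation is needed.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Wrap the optimizer so that parameters (and their optimizer states) of many
# layers are updated together in fewer, larger operations.
optimizer = fuse_optimizer(optimizer)  # exact signature: see Bagua's API reference

for inputs, targets in train_loader:   # train_loader: your existing DataLoader
    loss = torch.nn.functional.cross_entropy(model(inputs.cuda()), targets.cuda())
    loss.backward()
    optimizer.fuse_step()               # fused replacement for optimizer.step()
    optimizer.zero_grad()
```

Because fusing only reorganizes how parameter and state tensors are updated, the update rule itself stays whatever the wrapped optimizer implements.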
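
Similarly, the sketch below shows how the load balanced data loader might be used with a variable-length text dataset. The sampler name `LoadBalancingDistributedSampler` and its `complexity_fn` argument are taken from Bagua's contrib module as we understand it; treat them as assumptions and check the current API reference. The snippet is meant to run inside an already-initialized distributed training process (e.g., started with Bagua's launcher).

```python
import torch
from torch.utils.data import DataLoader, Dataset
from bagua.torch_api.contrib import LoadBalancingDistributedSampler

class VariableLengthTextDataset(Dataset):
    """Toy dataset of variable-length token sequences (for illustration only)."""
    def __init__(self, sequences):
        self.sequences = sequences

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        return self.sequences[idx]

# Sequences of very different lengths, as in NLP or speech workloads.
dataset = VariableLengthTextDataset(
    [torch.randint(0, 1000, (n,)) for n in (12, 480, 35, 300, 77, 256, 19, 410)]
)

# Assumes the distributed process group is already initialized on each worker.
# The sampler assigns samples of similar "complexity" (here: token count) to
# each worker, so all ranks finish a batch in roughly the same time.
sampler = LoadBalancingDistributedSampler(
    dataset,
    complexity_fn=lambda sample: len(sample),  # assumed argument name, see API reference
    shuffle=True,
)

loader = DataLoader(
    dataset,
    batch_size=2,
    sampler=sampler,
    collate_fn=lambda batch: torch.nn.utils.rnn.pad_sequence(batch, batch_first=True),
)

for epoch in range(3):
    sampler.set_epoch(epoch)  # reshuffle per epoch, like DistributedSampler
    for batch in loader:
        ...                   # forward/backward/step as usual
```

Here the complexity of a sample is simply its token count; any cost estimate (for example, audio duration for speech) works, as long as workers end up with comparable per-batch cost.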
We welcome more contributions!