ServiceNow Research

Breadth-First Pipeline Parallelism

Abstract

We introduce Breadth-First Pipeline Parallelism, a novel training schedule that optimizes the combination of pipeline and data parallelism. Breadth-First Pipeline Parallelism lowers training time, cost, and memory usage by combining high GPU utilization with a small per-GPU batch size, and by making use of fully sharded data parallelism. Experimentally, we observed increases of up to 53% in training speed.
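To illustrate the scheduling idea at a high level, the toy sketch below contrasts a conventional depth-first micro-batch ordering with a breadth-first one on a single device that holds several pipeline stages (a "looping" placement). This is only a simplified illustration, not the paper's implementation: the function names are my own, and the real schedule also interleaves forward and backward passes across devices.

```python
# Toy single-device view of micro-batch ordering (assumed simplification,
# not the paper's actual schedule). Each tuple is (micro_batch, local_stage).

def depth_first_order(local_stages, micro_batches):
    """Each micro-batch runs through all local stages before the next starts."""
    return [(mb, st) for mb in range(micro_batches) for st in range(local_stages)]

def breadth_first_order(local_stages, micro_batches):
    """All micro-batches pass through a stage before advancing to the next,
    which keeps per-GPU batch slices small while work stays available."""
    return [(mb, st) for st in range(local_stages) for mb in range(micro_batches)]

print(depth_first_order(2, 2))    # [(0, 0), (0, 1), (1, 0), (1, 1)]
print(breadth_first_order(2, 2))  # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

In the breadth-first ordering, every micro-batch visits the first local stage before any visits the second, so a device with a small per-micro-batch workload still has a steady stream of work, which is the intuition behind the higher GPU utilization claimed in the abstract.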

Publication
Workshop at the Conference on Neural Information Processing Systems (NeurIPS)
Joel Lamy Poirier
Applied Research Scientist

Applied Research Scientist, AI Research Deployment, Montreal, QC, Canada.