ServiceNow Research

Breadth-First Pipeline Parallelism

Abstract

We introduce Breadth-First Pipeline Parallelism, a novel training schedule that optimizes the combination of pipeline and data parallelism. Breadth-First Pipeline Parallelism lowers training time, cost, and memory usage by combining high GPU utilization with a small batch size per GPU, and by making use of fully sharded data parallelism. Experimentally, we observed training speed increases of up to 53%.
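To give an intuition for the schedule, below is a minimal Python sketch (not the paper's implementation; all names and parameters are illustrative assumptions) contrasting the order in which forward passes are issued on one GPU that holds several pipeline stages. A depth-first (interleaved) schedule pushes each micro-batch through all of the GPU's stages before starting the next micro-batch, whereas a breadth-first schedule runs every micro-batch through one stage before advancing to the next, which keeps the pipeline busy at a small per-GPU batch size and leaves room to overlap data-parallel communication.

    # Conceptual sketch only: ordering of (local_stage, micro_batch) forward
    # passes on a single GPU holding multiple pipeline stages.

    def depth_first_order(num_local_stages, num_microbatches):
        """Depth-first (interleaved): each micro-batch traverses all local
        stages before the next micro-batch starts."""
        return [(stage, mb)
                for mb in range(num_microbatches)
                for stage in range(num_local_stages)]

    def breadth_first_order(num_local_stages, num_microbatches):
        """Breadth-first: every micro-batch passes through one local stage
        before any micro-batch advances to the next stage."""
        return [(stage, mb)
                for stage in range(num_local_stages)
                for mb in range(num_microbatches)]

    if __name__ == "__main__":
        print(depth_first_order(2, 3))    # [(0,0),(1,0),(0,1),(1,1),(0,2),(1,2)]
        print(breadth_first_order(2, 3))  # [(0,0),(0,1),(0,2),(1,0),(1,1),(1,2)]

In the breadth-first ordering, each stage's weights are needed for a long contiguous run of micro-batches, which is what makes it a natural fit for fully sharded data parallelism: weight gathers and gradient reductions can be overlapped with the long runs of compute.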

Publication
Workshop at the Conference on Neural Information Processing Systems (NeurIPS)
Joel Lamy Poirier
Applied Research Scientist

Applied Research Scientist in the Large Language Models Lab, Montreal, QC, Canada.