Optimizing large, memory-intensive neural networks requires distributing their layers across multiple GPUs (referred to as model parallelism). We develop a framework that decomposes a neural network layer-wise and trains it by optimizing layer-wise local losses in parallel. Combining this framework with GPipe [12], an effective pipelining strategy for model parallelism, we propose the Surrogate Minimization (SM) algorithm. SM allows multiple parallel updates to the layer-wise parameters of a distributed neural network and consequently improves the GPU utilization of GPipe. Our framework ensures that the sum of the local losses is a global upper bound on the neural network loss and can be minimized efficiently. Under mild technical assumptions, we prove that SM requires O(1/ϵ) iterations to guarantee convergence to an ϵ-neighbourhood of a stationary point of the neural network loss. Finally, our experimental results on MLPs demonstrate that SM converges faster than competitive baselines.
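The structural property underlying SM, written here only schematically and with illustrative notation rather than the paper's exact construction, is that per-layer surrogate losses ℓ_k built at the current iterate θ^t jointly upper-bound the end-to-end loss f of a network with layer-wise parameters θ = (θ_1, ..., θ_K), so each parameter block can be updated in parallel by minimizing its own surrogate:

\[
  f(\theta) \;\le\; \sum_{k=1}^{K} \ell_k\!\left(\theta_k;\,\theta^{t}\right),
  \qquad
  \theta_k^{t+1} \in \operatorname*{arg\,min}_{\theta_k} \ell_k\!\left(\theta_k;\,\theta^{t}\right)
  \quad \text{for } k = 1, \dots, K.
\]

Because the surrogates sum to an upper bound on f, decreasing each local loss in parallel also decreases (a bound on) the global loss, which is what allows the layer-wise updates to be pipelined across GPUs.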