How to train an LLM, fast

Training a large language model (LLM) can be a significant challenge, requiring a considerable investment in hardware, datasets, labor, etc.

As more and more institutions and researchers attempt to train their own models, the AI research and development community needs to ensure that LLMs are trained responsibly and at substantially lower operating and environmental costs.

In other words, training software should be optimized to reduce training time and cost, and it should be easy to use so that researchers can iterate quickly and spend their time on other aspects of their work.

Training optimization challenges

Training optimization varies according to the size of the language models being developed. For small and medium models, training throughput is driven by kernel efficiency, which is typically limited by graphics processing unit (GPU) memory bandwidth.

As models grow to midsize configurations of around 5 billion parameters, GPU memory capacity becomes a limiting factor, and memory optimizations become crucial to avoid falling back on expensive memory management methods such as activation recomputation and model parallelism.
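
A rough back-of-envelope calculation shows the pressure: with a standard Adam optimizer in mixed precision, each parameter typically carries around 16 bytes of training state (16-bit weights and gradients plus 32-bit master weights and two optimizer moments), before any activations are counted. The sketch below is illustrative only; exact numbers depend on the precision recipe and the implementation.

# Back-of-envelope training-state memory for a dense ~5B-parameter model trained
# with Adam in mixed precision. Illustrative only; actual usage depends on the setup.
params = 5e9
bytes_per_param = 2 + 2 + 4 + 4 + 4   # bf16 weights + bf16 grads + fp32 master + Adam m and v
state_gb = params * bytes_per_param / 1e9
print(f"Training state alone: ~{state_gb:.0f} GB")   # ~80 GB, the full capacity of one H100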

For very large models of more than 100 billion parameters, the sheer computational power required becomes the dominant constraint, making massive parallelization the foremost challenge to address.

Introducing Fast-LLM

We introduce Fast-LLM, an innovative open-source library developed by ServiceNow Research's Foundation Models Lab. It prioritizes speed, flexibility, and convenience to accelerate LLM training.

Fast-LLM was engineered to meet the rigorous demands of professional AI researchers, AI/machine learning engineers, academic and industrial research institutions, and enterprise product development teams pushing the limits of generative AI.

The open-source library reduces the time needed to train a model so that researchers can complete more experiments and product engineering teams can get specialized models to market faster.

With Fast-LLM, we aim to simplify the training of common LLMs while supporting a wide variety of model architectures, training schemes, and data processing options—all accessible through easy configuration and a user-friendly workflow. When these standard components aren’t enough, Fast-LLM allows for custom extensions, helping to ensure nearly every researcher’s needs are met.

On top of that, Fast-LLM integrates seamlessly with other aspects of language model development, helping researchers access and use its powerful features with minimal effort.

This open-source library is now available to model developers for royalty-free commercial use under the Apache 2.0 license.

Fast-LLM’s kernel efficiency

Fast-LLM was designed from the ground up around a minimal set of highly efficient kernels. These include custom kernels written in OpenAI Triton and optimized third-party kernels such as Flash Attention.

The library also replaces stock PyTorch constructs—such as AMP, (parts of) autograd, Fully Sharded Data Parallel (FSDP), torch.optim, and torch.compile—with carefully tuned implementations. By doing so, it optimizes not only the forward pass, but also the entirety of the training step, from copying input tokens to updating weights.
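
To give a concrete feel for this style of kernel, here is a minimal Triton RMSNorm forward pass (the rms_norm normalization also appears in the configuration example further below). This is a generic sketch rather than Fast-LLM's actual kernel, and all function and argument names are ours.

import torch
import triton
import triton.language as tl

@triton.jit
def rmsnorm_fwd(x_ptr, w_ptr, out_ptr, n_cols, eps, BLOCK_SIZE: tl.constexpr):
    # One program instance normalizes one row of a contiguous (n_rows, n_cols) input.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=0.0).to(tl.float32)
    rms = tl.sqrt(tl.sum(x * x, axis=0) / n_cols + eps)
    w = tl.load(w_ptr + cols, mask=mask, other=0.0).to(tl.float32)
    tl.store(out_ptr + row * n_cols + cols, x / rms * w, mask=mask)

def rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # x: 2-D, contiguous, on the GPU; weight: 1-D of size x.shape[1].
    n_rows, n_cols = x.shape
    out = torch.empty_like(x)
    rmsnorm_fwd[(n_rows,)](x, weight, out, n_cols, eps,
                           BLOCK_SIZE=triton.next_power_of_2(n_cols))
    return out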

With these optimizations, Fast-LLM trains Mistral-7B at a whopping 11,200 tokens per second per GPU¹ on H100s. This corresponds to 52% GPU utilization, which we believe is a record for a model of that size.
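
That utilization figure can be sanity-checked with the usual back-of-envelope model of roughly 6 FLOPs per parameter per token for a combined forward and backward pass, together with the H100's advertised dense BF16 peak of about 989 TFLOPS. The estimate below ignores attention-score FLOPs and measurement details, so it only needs to land in the same ballpark as the reported number.

# Rough model-FLOPs-utilization estimate from the reported throughput (footnote 1).
tokens_per_sec = 11_200
params = 7.2e9                               # approximate Mistral-7B parameter count
achieved = tokens_per_sec * 6 * params       # ~4.8e14 FLOP/s, ignoring attention scores
peak = 989e12                                # H100 dense BF16 peak, approximate
print(f"~{achieved / peak:.0%} of peak")     # ~49%, in line with the reported 52%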

Additionally, Fast-LLM features a streamlined dropless mixture-of-experts (MoE) implementation that outperforms MegaBlocks in speed.
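
Here, "dropless" means that no token is discarded for exceeding an expert's capacity: every token is processed by each expert it is routed to, regardless of load imbalance. The toy PyTorch module below illustrates only the routing semantics; it is not Fast-LLM's implementation, which relies on fused kernels rather than a per-expert Python loop.

import torch
import torch.nn.functional as F
from torch import nn

class ToyDroplessMoE(nn.Module):
    """Top-2 mixture of experts with no capacity limit (illustration only)."""

    def __init__(self, hidden: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(hidden, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.SiLU(), nn.Linear(4 * hidden, hidden))
            for _ in range(num_experts)
        ])
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, hidden)
        weights, chosen = torch.topk(self.router(x), self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Every routed token is processed; nothing is dropped for capacity reasons.
            token_idx, slot = torch.where(chosen == e)
            if token_idx.numel():
                out[token_idx] += weights[token_idx, slot, None] * expert(x[token_idx])
        return out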

Memory efficiency

Fast-LLM carefully optimizes memory usage from all sources, including the training state, activations, and gradients, and it nearly eliminates memory fragmentation.

This includes an enhanced allocation of activations, pre-allocation of buffers for the training state, and a custom implementation of ZeRO-2/3 (also known as FSDP). With these optimizations, Fast-LLM trains Mixtral-8x7B with data parallelism alone, achieving a state-of-the-art throughput of 4,000 tokens per second per GPU² on H100s.
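
A quick estimate shows why sharding the training state makes plain data parallelism viable at this scale: Mixtral-8x7B has roughly 47 billion parameters, and with Adam in mixed precision the training state is on the order of 16 bytes per parameter. Sharded ZeRO-3-style across the 128 GPUs of the benchmark setup, that leaves plenty of room for activations on an 80 GB H100. The numbers below are illustrative only.

# Why sharding the training state lets Mixtral-8x7B train with data parallelism alone.
params = 47e9                    # approximate total parameter count of Mixtral-8x7B
bytes_per_param = 16             # bf16 weights/grads + fp32 master weights + Adam moments
gpus = 128                       # the 16-node setup from footnote 2
total_state_gb = params * bytes_per_param / 1e9
print(f"{total_state_gb:.0f} GB total -> ~{total_state_gb / gpus:.1f} GB per GPU when fully sharded")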

Additionally, Fast-LLM retains its efficiency with long contexts by distributing the context along data, tensor, and pipeline-parallel dimensions.

Large-scale training

Fast-LLM ships with all the 3D parallelism needed to train the largest models, which require vast numbers of GPUs.

In addition to state-of-the-art tensor parallelism, Fast-LLM provides the first open-source implementation of Breadth-First Pipeline Parallelism, which ensures that ZeRO-2/3 is always available and efficient (see Figure 1).
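
As a rough intuition for the scheduling difference, and only as a toy sketch rather than Fast-LLM's actual scheduler: a depth-first order walks each micro-batch through every local stage before starting the next one, so every stage's parameters are needed throughout the whole accumulation step; a breadth-first order pushes all micro-batches through one stage before moving on, so each parameter shard is needed in a single contiguous window and its ZeRO/FSDP gathers and gradient reductions can overlap with other stages' compute.

# Toy illustration of the ordering difference (forward pass only, not Fast-LLM's scheduler).
# Each tuple is (micro_batch, pipeline_stage) in execution order on one device
# that holds several pipeline stages.

def depth_first(micro_batches: int, stages: int):
    return [(mb, s) for mb in range(micro_batches) for s in range(stages)]

def breadth_first(micro_batches: int, stages: int):
    return [(mb, s) for s in range(stages) for mb in range(micro_batches)]

print(depth_first(2, 3))    # [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
print(breadth_first(2, 3))  # [(0, 0), (1, 0), (0, 1), (1, 1), (0, 2), (1, 2)]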

Furthermore, Fast-LLM facilitates fast distributed checkpointing, allowing for frequent checkpoints and rapid resumption in cases of failure. This comprehensive approach ensures that even the largest models can be trained effectively.

Figure 1: With typical depth-first gradient accumulation (a), as in pipeline parallelism, data-parallel network communication is inefficient, and FSDP makes the problem much worse (b). The breadth-first schedule optimizes communication, both without (c) and with (d) FSDP. Source: https://arxiv.org/abs/2211.05953

Configurations = streamlined development

With Fast-LLM, there’s only one transformer, augmented with extensive configuration options. Thus, simply by adjusting configuration parameters, one has the freedom to recover a variety of common architectures, such as Llama, Mistral, Mixtral, and StarCoder, or create an entirely new architecture. No matter which architecture is chosen, it consistently comes with the same set of optimizations, features, and options.

This one transformer reflects Fast-LLM’s configuration-first approach, which also applies to data processing, training schemes, optimizers, and more. This approach facilitates seamless interoperability, empowering researchers to experiment with new model variations through simple configuration changes, reducing the need for additional code and streamlining the development process.

run:
  experiment_dir: /mnt/workspace/test_experiment
model:
  base_model:
    transformer:
      normalization:
        type: rms_norm
      num_layers: 24
      hidden_size: 4096
      num_attention_heads: 32
      head_groups: 4
      add_linear_biases: false
      rotary:
        type: default
        theta: 10000 
      gated: true
      activation_type: silu
    vocab_size: 32000 
    tie_word_embeddings: false
  distributed:
    training_dtype: bfloat16
training:
  logs:
    interval: 10
  train_iters: 100000
  export:
    interval: 10000
  batch:
    batch_size: 1
    sequence_length: 4096
data:
  path: [/mnt/data/dataset]
optimizer:
  learning_rate:
    base: 0.0001

Figure 2: Configuration file example_mistral.yaml for training a Mistral-7B model. Run fast-llm train gpt --config example_mistral.yaml to use it.

Extending Fast-LLM through Transformers

Fast-LLM is deeply integrated with the Hugging Face Transformers ecosystem, so it works seamlessly with tools that are already compatible with Transformers. All Fast-LLM models can be wrapped as Transformers models, enabling out-of-the-box token generation and direct model interaction.

Certain models also support full checkpoint conversion, enabling the loading of pretrained checkpoints from external sources and further conversions to and from other formats.

fast-llm convert gpt \
 input.format=distributed \
 input.path=/mnt/workspace/test_experiment/export/10000 \
 output.type=mistral \
 output.path=/mnt/workspace/converted_checkpoint

Figure 3: Convert checkpoints between Fast-LLM and Transformers with a simple command.
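
Once a checkpoint has been converted to the Mistral format, it can be loaded like any other Transformers model. The sketch below assumes the converted directory follows the standard Hugging Face layout and pairs it with the base Mistral tokenizer from the Hub; adjust both to your setup.

from transformers import AutoModelForCausalLM, AutoTokenizer

# The checkpoint path is the output directory from Figure 3; the tokenizer choice is an assumption.
model = AutoModelForCausalLM.from_pretrained("/mnt/workspace/converted_checkpoint")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

inputs = tokenizer("The quick brown fox", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))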

Fast-LLM case study: StarCoder2-3B

In late 2023, as part of the BigCode project, we began training the StarCoder2 models on The Stack V2, piloting Fast-LLM for pretraining. This approach improved GPU utilization and reduced training time by more than 15% compared with traditional mainstream training frameworks for all three model sizes (3B, 7B, and 15B).

Key features, such as grouped-query attention and fill-in-the-middle training, enhanced StarCoder2’s ability to comprehend and predict extensive code contexts.
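
Fill-in-the-middle training rearranges a document so the model learns to predict a missing span from the code on both sides of it. A simplified sketch of the standard prefix-suffix-middle transformation is shown below; the sentinel token names follow the StarCoder convention and are used here purely for illustration.

import random

def make_fim_sample(code: str, rng: random.Random) -> str:
    """Rearrange a document into prefix-suffix-middle order (illustration only)."""
    a, b = sorted(rng.sample(range(len(code) + 1), 2))
    prefix, middle, suffix = code[:a], code[a:b], code[b:]
    # The model sees the prefix and suffix first, then learns to generate the middle.
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"

print(make_fim_sample("def add(a, b):\n    return a + b\n", random.Random(0)))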

Ultimately, we selected Fast-LLM for training the 3-billion-parameter version of StarCoder2, which demonstrated Fast-LLM's capabilities in LLM training and set a new standard for efficiency and performance in coding model development.

Roadmap

Looking ahead, we have an exciting roadmap that includes more models, features, and training schemes. This will encompass multimodal models, such as those integrating vision capabilities, and the implementation of staged training processes.

Additionally, we plan to improve support for user-defined components, enabling fully custom data loading and preprocessing, tailored training schemes, and bespoke models.

We’re also committed to further optimizations, such as additional Triton kernels, advancements in 3D parallel optimizations, and asynchronous checkpointing. And we’re looking into multi-hardware support and are open to collaboration on the matter.

Getting started

To train a model with Fast-LLM, simply define a configuration and launch it through the `fast-llm` command wrapped with `torchrun`. Follow our Quick Start guide for a fully functional training example that also demonstrates Fast-LLM’s data preparation and fine-tuning capabilities. Whether you aim to run it in a Docker container, on Slurm, on Kubernetes with Kubeflow, or on bare metal, our guide has you covered.

Find out more about ServiceNow Research.

¹ Measured on 4 DGX nodes (32 GPUs) with a batch size of 128 and a sequence length of 4096.

² Measured on 16 DGX nodes (128 GPUs) with a batch size of 256 and a sequence length of 4096.