Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration

Hyperparameter tuning can dramatically affect the training stability and final performance of large-scale models. Recent works on neural network parameterisations, such as μP, have enabled the transfer of optimal global hyperparameters across model sizes. These works propose an empirical practice: search for the optimal global base hyperparameters at a small model size, then transfer them to a large one. We extend this line of work in two key ways. First, to handle scaling along the most important axes, we propose the Complete(d) Parameterisation, which unifies scaling in width and depth (using an adaptation of CompleteP) with scaling in batch size and training duration. Second, with our parameterisation, we investigate per-module hyperparameter optimisation and transfer. We characterise the empirical challenges of navigating the high-dimensional hyperparameter landscape and propose practical guidelines for tackling this optimisation problem. We show that, with the right parameterisation, hyperparameter transfer holds even in the per-module regime. Our study covers an extensive range of optimisation hyperparameters of modern models: learning rates, AdamW parameters, weight decay, initialisation scales, and residual block multipliers. Our experiments demonstrate significant training-speed improvements in Large Language Models with the transferred per-module hyperparameters.
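To make the transfer recipe concrete, the sketch below shows how per-module base hyperparameters tuned on a small proxy model might be rescaled to a larger target. It is a minimal illustration assuming commonly cited μP/CompleteP-style rules (Adam learning rates of hidden matrices scaled as 1/width, initialisation scales as 1/sqrt(width), residual block multipliers as 1/depth); the function name, module grouping, and exact exponents are our assumptions rather than the paper's prescription, and the paper's batch-size and training-duration rules are not reproduced here.

    import math

    # Illustrative sketch only: rescale per-module base hyperparameters tuned on a
    # small proxy model (base_width, base_depth) to a larger target (width, depth),
    # following commonly used muP/CompleteP-style rules. The exact rules in the
    # paper may differ; this only shows the shape of the transfer step.
    def scale_hyperparameters(base, base_width, base_depth, width, depth):
        m_width = width / base_width    # width multiplier
        m_depth = depth / base_depth    # depth multiplier
        return {
            # Adam learning rate of hidden (matrix-like) modules ~ 1/width
            "hidden_lr": base["hidden_lr"] / m_width,
            # embedding/unembedding learning rates are width-independent under muP
            "embedding_lr": base["embedding_lr"],
            # initialisation std of hidden modules ~ 1/sqrt(fan_in)
            "hidden_init_std": base["hidden_init_std"] / math.sqrt(m_width),
            # residual block multiplier ~ 1/depth (CompleteP-style depth scaling)
            "residual_mult": base["residual_mult"] / m_depth,
            # weight decay often kept fixed across sizes; see the paper for its rule
            "weight_decay": base["weight_decay"],
        }

    # Example: hyperparameters found on a 256-wide, 4-block proxy, transferred to 4096/32.
    base_hparams = {
        "hidden_lr": 1e-2, "embedding_lr": 1e-2,
        "hidden_init_std": 0.02, "residual_mult": 1.0, "weight_decay": 0.1,
    }
    print(scale_hyperparameters(base_hparams, base_width=256, base_depth=4, width=4096, depth=32))

In such a workflow, the dictionary of base values would come from a per-module search on the small proxy, and only this deterministic rescaling step changes as the target model grows.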
