MixAtlas: Uncertainty-aware Data Mixture Optimization for Multimodal LLM Midtraining

Editor


This paper was accepted at the Workshop on Navigating and Addressing Data Problems for Foundation Models (NADPFM) at ICLR 2026.

Principled domain reweighting can substantially improve sample efficiency and downstream generalization; however, data-mixture optimization for multimodal pretraining remains underexplored. Current multimodal training recipes tune mixtures along only a single axis, such as data format or task type. We introduce MixAtlas, a principled framework for compute-efficient multimodal mixture optimization via systematic domain decomposition and smaller proxy models. MixAtlas factorizes the training data along two interpretable axes, image concepts and task supervision, enabling fine-grained mixture control and attribution of downstream performance to specific domains within each axis. Using small proxy models and a Gaussian-process surrogate, we explore the mixture space at 1/100th the cost of full-scale training. The resulting mixtures yield substantial improvements: up to 3× faster convergence and consistent gains of 2–5% across diverse benchmarks over existing approaches, with especially strong boosts on text-rich benchmarks such as ChartQA (+10%) and TextVQA (+13%). Importantly, we show that mixtures found with smaller proxy models transfer to larger-scale model training, preserving both the efficiency and accuracy gains. Overall, MixAtlas makes multimodal mixture optimization practical and interpretable, providing concrete, compute-efficient recipes for training next-generation MLLMs.
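To make the method concrete, here is a minimal sketch of the kind of uncertainty-aware surrogate loop the abstract describes: proxy-model runs at a few domain mixtures are fit with a Gaussian-process regressor, and the next mixture is chosen on the probability simplex via an upper-confidence-bound acquisition. All function names, the RBF kernel, and the hyperparameters below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np


def rbf_kernel(A, B, length_scale=0.3):
    # Squared-exponential kernel between rows of A (n, d) and B (m, d).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale**2)


def gp_posterior(X, y, Xs, noise=1e-4):
    # Standard GP regression: posterior mean and variance at test points Xs,
    # given observed mixtures X with proxy-model scores y.
    K = rbf_kernel(X, X) + noise * np.eye(len(X))
    Ks = rbf_kernel(X, Xs)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.diag(rbf_kernel(Xs, Xs) - v.T @ v)
    return mu, np.maximum(var, 0.0)


def propose_mixture(X_tried, scores, n_candidates=2000, beta=1.0, seed=0):
    # Sample candidate mixtures uniformly on the simplex (Dirichlet(1,...,1))
    # and pick the one maximizing an uncertainty-aware UCB acquisition.
    rng = np.random.default_rng(seed)
    cands = rng.dirichlet(np.ones(X_tried.shape[1]), size=n_candidates)
    mu, var = gp_posterior(X_tried, scores, cands)
    ucb = mu + beta * np.sqrt(var)
    return cands[np.argmax(ucb)]


# Toy usage: three hypothetical domains, four proxy runs with made-up scores.
X = np.array([
    [0.6, 0.2, 0.2],
    [0.2, 0.6, 0.2],
    [0.2, 0.2, 0.6],
    [1/3, 1/3, 1/3],
])
scores = np.array([0.50, 0.55, 0.62, 0.58])
next_mix = propose_mixture(X, scores)  # next mixture to evaluate with a proxy model
```

In this sketch the acquisition trades off predicted score against surrogate uncertainty (the `beta` term), which is what lets cheap proxy runs explore the mixture space efficiently before committing full-scale compute.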
