Federated Learning for Speech Recognition: Revisiting Current Trends Towards Large-Scale ASR

This paper was accepted at the Federated Learning in the Age of Foundation Models workshop at NeurIPS 2023.

While automatic speech recognition (ASR) has witnessed remarkable achievements in recent years, it has not garnered widespread attention within the federated learning (FL) and differential privacy (DP) communities. Yet ASR is a well-suited benchmark for FL and DP because (i) there is a natural data split across users via speaker information; (ii) data is heterogeneous across speakers, close to practical settings; (iii) there is an interplay between acoustic and language modeling; and (iv) it is a sequence-to-sequence task. Recent production-ready state-of-the-art ASR models include large conformer and transformer models, whose optimization is known to pose challenges even for central training. While the main trends and benchmarks in FL and DP focus on small models, we show the necessity of disentangling optimization and model size: the behaviour of FL and DP for large models is different from that for small models. We speculate that FL and DP are harder for small models due to a harder optimization problem, even in central training. In this paper, we analyze the key FL parameters (optimizers, training from scratch or from a seed model pre-trained centrally, cohort size, data heterogeneity) and propose the first benchmark of FL with DP in the context of large models in ASR. We examine the applicability of prior results and present an overview of observed departures from the trends in prior works and from training different ASR models. Through this work, we provide researchers and practitioners in the fields of FL and DP with valuable insights into the fundamental differences that may arise when applying FL and DP research to large-scale ASR training.
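To make the setup concrete, the sketch below illustrates the general shape of federated averaging with central differential privacy and the knobs the abstract mentions (cohort size, server optimizer, seed model vs. training from scratch, heterogeneous per-speaker data). It is a minimal toy example, not the paper's implementation: the model is a flat parameter vector, the per-client loss is a synthetic quadratic, and all constants (cohort_size, clip_norm, noise_multiplier, etc.) are illustrative assumptions rather than values from the paper.

```python
# Toy sketch of federated averaging with central DP (assumed setup, not the paper's code).
import numpy as np

rng = np.random.default_rng(0)

DIM = 16                # toy model size (a real ASR conformer has 100M+ parameters)
NUM_CLIENTS = 100       # simulated speakers; one client per speaker
COHORT_SIZE = 10        # clients sampled per federated round
ROUNDS = 50
LOCAL_STEPS = 5
LOCAL_LR = 0.1
SERVER_LR = 1.0
CLIP_NORM = 1.0         # per-client update clipping bound for DP
NOISE_MULTIPLIER = 0.5  # Gaussian noise scale relative to CLIP_NORM

# Heterogeneous synthetic data: each "speaker" pulls the model toward a different optimum.
client_targets = rng.normal(size=(NUM_CLIENTS, DIM))

def local_update(global_params, target):
    """Run a few local SGD steps on a quadratic toy loss; return the parameter delta."""
    params = global_params.copy()
    for _ in range(LOCAL_STEPS):
        grad = params - target          # gradient of 0.5 * ||params - target||^2
        params -= LOCAL_LR * grad
    return params - global_params

# Start from zeros ("from scratch"); a centrally pre-trained seed model
# would be loaded here instead.
global_params = np.zeros(DIM)

for rnd in range(ROUNDS):
    cohort = rng.choice(NUM_CLIENTS, size=COHORT_SIZE, replace=False)
    clipped_sum = np.zeros(DIM)
    for c in cohort:
        delta = local_update(global_params, client_targets[c])
        # Clip each client's update to bound its contribution (sensitivity).
        norm = np.linalg.norm(delta)
        clipped_sum += delta * min(1.0, CLIP_NORM / max(norm, 1e-12))
    # Add Gaussian noise calibrated to the clipping norm (central DP).
    noise = rng.normal(scale=NOISE_MULTIPLIER * CLIP_NORM, size=DIM)
    avg_update = (clipped_sum + noise) / COHORT_SIZE
    # Plain SGD on the server; a study of server optimizers could swap in
    # an Adam-style update here.
    global_params += SERVER_LR * avg_update

print("final distance to mean target:",
      np.linalg.norm(global_params - client_targets.mean(axis=0)))
```

In this simplified picture, cohort size trades off noise per round against client sampling, the server update rule stands in for the optimizer choice studied in the paper, and the per-speaker targets mimic data heterogeneity; for a real ASR model the local step would be gradient descent on an acoustic/sequence-to-sequence loss instead of a quadratic.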
