With the increasing integration of speech front-ends and large language models (LLMs),
there is a need to explore architectures that integrate these modalities.
While end-to-end models have been explored extensively, cascaded models that stream outputs from an LLM to a TTS system remain surprisingly under-explored, even though they are potentially much simpler.
Using traditional text-to-speech systems to convert LLM outputs to audio, however, poses a technical problem, because such systems require the entire utterance in advance to generate stylistically consistent audio.
In this paper we present a ‘streaming’ TTS model that can generate audio from streaming text using a novel decoder-only architecture that interleaves text and speech.
The model is trained using next-step prediction on interleaved data that is generated from forced alignment of text transcripts to speech.
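To make the training setup concrete, the following is a minimal sketch, not the authors' code, of how such an interleaved sequence might be built from a word-level forced alignment; the speech-token frame rate, data layout, and function names are all assumptions for illustration.

```python
# Hypothetical sketch: interleave text tokens with their time-aligned
# speech tokens to form one training sequence for next-step prediction.

SPEECH_TOKENS_PER_SEC = 50  # assumed discrete speech-token frame rate


def interleave(words, speech_tokens):
    """words: time-ordered list of (text_token_ids, start_sec, end_sec)
    from a forced alignment; speech_tokens: discrete speech token ids
    covering the whole utterance."""
    seq, cursor = [], 0
    for text_ids, start, end in words:
        seq.extend(text_ids)  # emit the word's text tokens first
        frame_end = int(end * SPEECH_TOKENS_PER_SEC)
        seq.extend(speech_tokens[cursor:frame_end])  # then its aligned speech tokens
        cursor = frame_end
    seq.extend(speech_tokens[cursor:])  # trailing audio (e.g. final silence)
    return seq
```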
During inference our system processes text incrementally while generating consistent speech output, making it suitable for real-time applications such as conversational AI agents, where an LLM streams text to a TTS system.
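An inference loop for such a model might look roughly like the sketch below, which consumes an LLM's token stream and yields audio as it is generated; `model.step`, the control token `NEXT_TEXT`, and the vocoder interface are hypothetical placeholders, not an API described in the paper.

```python
# Hypothetical streaming-inference loop: text tokens are appended to the
# context as they arrive, and the decoder-only model autoregressively
# emits speech tokens in between, switching back when it needs more text.

def stream_tts(model, llm_text_stream, vocoder, chunk=20):
    """Consume text tokens incrementally and yield audio chunks."""
    context = [model.BOS]
    speech_buf = []
    for text_token in llm_text_stream:        # text arrives incrementally
        context.append(text_token)
        while True:                           # generate speech for this text
            tok = model.step(context)         # next-step prediction
            if tok == model.NEXT_TEXT:        # model requests more text
                break
            context.append(tok)
            speech_buf.append(tok)
            if len(speech_buf) >= chunk:      # vocode in small chunks
                yield vocoder.decode(speech_buf)
                speech_buf = []
    if speech_buf:                            # flush any remaining speech
        yield vocoder.decode(speech_buf)
```

Buffering a small chunk of speech tokens before vocoding, as above, is one plausible way to trade a little latency for smoother audio; the exact hand-off between text and speech generation would depend on the model's learned interleaving.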
Results demonstrate that our approach matches the quality of batch TTS systems while enabling streaming capabilities.