Despite all that has been accomplished with large language models (LLMs), the underlying concept that powers these models is simple — we just need to accurately predict the next token! Though some may (reasonably) argue that recent research on LLMs goes beyond this basic idea, next token prediction still underlies the pre-training, fine-tuning (depending on the variant), and inference processes of all causal language models, making it a fundamental concept for any LLM practitioner to understand.
“It is perhaps surprising that underlying all this progress is still the original autoregressive mechanism for generating text, which makes token-level decisions one by one and in a left-to-right fashion.” — from [10]
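To make this idea concrete, below is a minimal sketch (in PyTorch, which we also use later in this overview) of greedy autoregressive decoding. The `model` argument is a placeholder for any causal language model that maps a sequence of token ids to next-token logits; it is not tied to a specific implementation.

```python
import torch

@torch.no_grad()
def generate(model, input_ids, max_new_tokens=20):
    """Greedy autoregressive decoding: predict the next token, append it to
    the sequence, and repeat -- one token at a time, left to right.

    Assumes `model` maps a [batch, seq_len] tensor of token ids to
    [batch, seq_len, vocab_size] logits, as causal language models do.
    """
    for _ in range(max_new_tokens):
        logits = model(input_ids)              # [batch, seq_len, vocab_size]
        next_token_logits = logits[:, -1, :]   # predicted distribution for the next token
        next_token = next_token_logits.argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
    return input_ids
```

Each pass through the loop makes a single token-level decision; swapping the `argmax` for temperature or nucleus sampling changes how the next token is chosen from the predicted distribution, but not the left-to-right structure of generation.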
Within this overview, we will take a deep and practical dive into the concept of next token prediction to understand how language models use it during both training and inference. First, we will explore these ideas at a conceptual level. Then, we will walk through an actual implementation (in PyTorch) of the language model pre-training and inference processes to make the idea of next token prediction more concrete.
Prior to diving into the topic of this overview, there are a few fundamental ideas that we need to understand. Within this section, we will briefly review these concepts and provide links to further reading for each.
The transformer architecture. First, we need to have a working understanding of the transformer architecture [5], especially the decoder-only variant. Luckily, we have covered these ideas extensively in the past:
More fundamentally, we also need to understand the idea of self-attention and the role that it plays in the transformer architecture. In particular, large causal language models — the kind that we will study in this overview — use a variant of self-attention called multi-headed causal self-attention.
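As a rough preview of this operation, the sketch below shows how the causal mask restricts self-attention so that each position can only attend to earlier positions. It is a simplified single-head illustration (the multi-headed variant runs several such operations in parallel, each with its own projections, and concatenates their outputs), not a full implementation from any particular library.

```python
import math
import torch
import torch.nn.functional as F

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention over a [batch, seq_len, d_model] input.

    The causal mask ensures that position i only attends to positions <= i,
    which is exactly what allows the model to predict each next token using
    only the tokens that come before it.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # [batch, seq, seq]

    # mask out future positions (strict upper triangle) before the softmax
    seq_len = x.size(1)
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))

    weights = F.softmax(scores, dim=-1)  # each row is a distribution over non-future positions
    return weights @ v
```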