Due to the surge of interest in large language models (LLMs), AI practitioners are commonly asked questions such as: How can we train a specialized LLM over our own data? However, answering this question is far from simple. Recent advances in generative AI are powered by massive models with many parameters, and training such an LLM requires expensive hardware (i.e., many expensive GPUs with a lot of memory) and fancy training techniques (e.g., fully-sharded data parallel training). Luckily, these models are usually trained in two phases — pretraining and finetuning — where the former phase is (much) more expensive. Given that high-quality pretrained LLMs are readily available online, most AI practitioners can simply download a pretrained model and focus upon adapting this model (via finetuning) to their desired task.
“Fine-tuning enormous language models is prohibitively expensive in terms of the hardware required and the storage/switching cost for hosting independent instances for different tasks.” — from [1]
Nonetheless, the size of the model does not change during finetuning! As a result, finetuning an LLM — though cheaper than pretraining — is not easy. We still need training techniques and hardware that can handle such a model. Plus, every finetuning run creates an entirely separate “copy” of the LLM that we must store, maintain, and deploy — this can quickly become both complicated and expensive!
How do we fix this? Within this overview, we will learn about a popular solution to the issues outlined above — parameter-efficient finetuning. Instead of training the full model end-to-end, parameter-efficient finetuning leaves pretrained model weights fixed and only adapts a small number of task-specific parameters during finetuning. Such an approach drastically reduces memory overhead, simplifies the storage/deployment process, and allows us to finetune LLMs with more accessible hardware. Although the overview will cover many techniques (e.g., prefix tuning and adapter layers), our focus will be upon Low-Rank Adaptation (LoRA) [1], a simple and widely-used approach for efficiently finetuning LLMs.
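To make the core idea concrete, here is a minimal numpy sketch of LoRA applied to a single linear layer: the pretrained weight stays frozen, and only two small low-rank factors would be trained. The dimensions and rank below are illustrative assumptions, not values from [1].

```python
import numpy as np

# Minimal sketch of Low-Rank Adaptation (LoRA) for one linear layer.
# The pretrained weight W is frozen; only the low-rank factors A and B
# (the task-specific parameters) would be updated during finetuning.
# Shapes and rank r = 4 are illustrative choices for this example.

rng = np.random.default_rng(0)
d_in, d_out, r = 512, 512, 4

W = rng.standard_normal((d_in, d_out))     # frozen pretrained weight
A = rng.standard_normal((d_in, r)) * 0.01  # trainable low-rank factor
B = np.zeros((r, d_out))                   # trainable; zero init makes the
                                           # low-rank update start at zero

def forward(x):
    # Adapted layer output: x W + x (A B), computed without forming A B
    return x @ W + (x @ A) @ B

x = rng.standard_normal((2, d_in))
y = forward(x)

full_params = W.size           # parameters updated by full finetuning
lora_params = A.size + B.size  # parameters updated by LoRA
print(y.shape, lora_params, full_params)  # (2, 512) 4096 262144
```

Notice that LoRA trains 4,096 parameters here instead of 262,144 — a 64x reduction for this layer — and, because B starts at zero, the adapted model initially behaves exactly like the pretrained one.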