There are various methods to align LLMs with human preferences. Reinforcement learning from human feedback (RLHF) is often seen as too resource-intensive to apply consistently to every newly fine-tuned model, which makes Direct Preference Optimization (DPO) one of the most popular alternatives for LLM alignment.
Although DPO is significantly more cost-effective than RLHF, it still requires a reference model in addition to the “policy” model (i.e., the model being actively trained). This means both models must be loaded into GPU memory simultaneously, which can be challenging for single-GPU configurations, especially with large models.
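As a reminder, the DPO objective (Rafailov et al., 2023) makes this dependence on the reference model explicit:

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
$$

where $\pi_\theta$ is the policy being trained, $\pi_{\text{ref}}$ is the frozen reference model, $(y_w, y_l)$ are the chosen and rejected completions for prompt $x$, and $\beta$ is a temperature hyperparameter. Since every training pair requires log-probabilities from both $\pi_\theta$ and $\pi_{\text{ref}}$, both models must be available in memory throughout training.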
A more memory-efficient approach would be to use LoRA for DPO training. Instead of training the entire model, we freeze its parameters and train a small adapter. This method becomes even more efficient if both the policy and reference models share the same base model; in that case, we load the base model once, then load a frozen adapter for the reference model and a trainable adapter for the policy model, significantly reducing memory requirements.
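To make this concrete, here is a minimal sketch of that setup with Hugging Face TRL and PEFT: when a LoRA `peft_config` is provided and `ref_model` is left as `None`, `DPOTrainer` reuses the frozen base model (with the adapter disabled) as the reference, so only one copy of the base weights is loaded. The model name, LoRA hyperparameters, and the toy dataset below are illustrative assumptions, not recommendations.

```python
# Sketch: DPO training with a single base model plus a LoRA adapter (TRL + PEFT).
import torch
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # assumed small model, for illustration only
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Toy preference dataset with the prompt/chosen/rejected columns DPO expects.
train_dataset = Dataset.from_dict({
    "prompt": ["What is the capital of France?"],
    "chosen": ["The capital of France is Paris."],
    "rejected": ["France does not have a capital."],
})

# Trainable LoRA adapter; the base model's parameters stay frozen.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

training_args = DPOConfig(
    output_dir="dpo-lora-sketch",
    per_device_train_batch_size=1,
    learning_rate=1e-5,
    beta=0.1,  # DPO temperature
    num_train_epochs=1,
    logging_steps=1,
)

# ref_model=None + peft_config: the trainer computes reference log-probabilities
# by disabling the adapter, so the base model is loaded only once.
trainer = DPOTrainer(
    model=model,
    ref_model=None,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # `tokenizer=...` in older TRL versions
    peft_config=peft_config,
)
trainer.train()
```

The key design point is the `ref_model=None` line: instead of holding a second full copy of the model for the reference, the trainer only pays for the small adapter on top of the shared frozen base.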
However, in my opinion, the effect of LoRA on DPO's performance is still understudied. While LoRA can closely approximate full training, its performance…