HuggingFace can feel complex if you don't know where to start learning it. Two good entry points into the HuggingFace repository are the run_mlm.py and run_clm.py scripts.
In this post, we will go through the run_mlm.py script. This script picks a masked language model from HuggingFace and fine-tunes it on a dataset (or trains it from scratch). If you are a beginner with very little exposure to HuggingFace code, this post will help you understand the basics.
We will pick a masked language model, load a dataset from HuggingFace, and fine-tune the model on that dataset. At the end, we will evaluate the model. This is all for the sake of understanding the code structure, so our focus is not on any specific use case.
Let’s get started.
Fine-tuning is a common technique in deep learning: you take a pre-trained neural network model and tweak it to better suit a new dataset or task.
Fine-tuning works well when your dataset is not large enough to train a deep model from scratch, so you start from an already trained base model.
In fine-tuning, you take a model pre-trained on a large data source (e.g. ImageNet for images or BooksCorpus for NLP), then continue training it on your dataset to adapt the model to your task. This requires much less additional data and far fewer epochs than training from random weights.
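To make this concrete in HuggingFace terms, here is a rough sketch of the difference between the two starting points (fine-tuning vs. training from scratch); `bert-base-uncased` is just an illustrative checkpoint:

```python
from transformers import AutoConfig, AutoModelForMaskedLM

# Fine-tuning: start from weights already learned on a large corpus.
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Training from scratch: same architecture, but randomly initialized weights.
config = AutoConfig.from_pretrained("bert-base-uncased")
model_from_scratch = AutoModelForMaskedLM.from_config(config)
```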
HuggingFace (HF) has a lot of built-in functions that allow us to fine-tune a pre-trained model in a few lines of code. The major steps are as follows:
- load the pre-trained model
- load the pre-trained tokenizer
- load the dataset you want to use for fine-tuning
- tokenize the above dataset using the tokenizer
- use the Trainer object to train the pre-trained model on the tokenized dataset
Let’s see each step in code. We will intentionally leave out many details to give an overview of how the overall structure looks.
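As a minimal end-to-end sketch of those five steps (the `bert-base-uncased` checkpoint and the `wikitext` dataset here are illustrative choices, not necessarily what run_mlm.py uses by default):

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# 1. Load the pre-trained model.
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# 2. Load the matching pre-trained tokenizer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# 3. Load the dataset to fine-tune on.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")

# 4. Tokenize the dataset with the tokenizer.
def tokenize(examples):
    return tokenizer(examples["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# The collator randomly masks tokens in each batch so the model
# learns to predict them (this is the "masked" in masked language model).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# 5. Train (and evaluate) with the Trainer object.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm-finetuned"),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=collator,
)
trainer.train()
trainer.evaluate()
```

Note that the masking happens in the data collator at batch time, not during tokenization; that is why step 4 produces plain token IDs with no labels. Each of these calls hides a lot of machinery, which we will unpack step by step.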