Since its first appearance, BERT has shown phenomenal results in a variety of NLP tasks, including sentiment analysis, text similarity and question answering. Since then, researchers have persistently tried to make BERT even more performant by modifying its architecture, augmenting the training data, increasing the vocabulary size or changing the hidden size of its layers.
Beyond the creation of other powerful BERT-based models like RoBERTa, researchers found another efficient way to boost BERT's performance, which is discussed in this article. It led to the development of a new model called StructBERT, which confidently surpasses BERT on top benchmarks.
The StructBERT idea is relatively simple and focuses on slightly modifying BERT’s pretraining objective.
In this article, we will go through the main details of the StructBERT paper and examine its modified pretraining objectives.
For the most part, StructBERT follows the same architectural principles as BERT. Nevertheless, StructBERT introduces two new pretraining objectives to expand BERT's linguistic knowledge. The model is trained on these objectives alongside masked language modeling. Let us look at the two objectives below.
Experiments showed that the masked language modeling (MLM) task plays a crucial role in helping BERT obtain vast linguistic knowledge. After pretraining, BERT can correctly guess masked words with high accuracy. However, it is not capable of correctly reconstructing a sentence whose words have been shuffled. To achieve this goal, the StructBERT developers modified the MLM objective by partially shuffling input tokens.
As in the original BERT, an input sequence is tokenised, masked and then mapped to token, positional and segment embeddings. All of these embeddings are then summed up to produce combined embeddings which are fed to BERT.
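To make this step concrete, here is a minimal sketch of how the three embedding tables could be summed; the tensor sizes below are the usual BERT-base values and are used only for illustration, not taken from the paper's code.

```python
import torch
import torch.nn as nn

# Assumed BERT-base sizes, used here purely for illustration
vocab_size, max_len, segment_types, hidden_size = 30522, 512, 2, 768
token_emb = nn.Embedding(vocab_size, hidden_size)
position_emb = nn.Embedding(max_len, hidden_size)
segment_emb = nn.Embedding(segment_types, hidden_size)

input_ids = torch.randint(0, vocab_size, (1, 16))   # a toy tokenised sequence
positions = torch.arange(16).unsqueeze(0)           # positions 0, 1, ..., 15
segments = torch.zeros(1, 16, dtype=torch.long)     # single-segment input

# the combined embeddings (shape: 1 x 16 x 768) are what gets fed to the encoder
combined = token_emb(input_ids) + position_emb(positions) + segment_emb(segments)
```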
During masking, 15% of randomly chosen tokens are masked and then used for language modeling, as in BERT. But right after masking, StructBERT randomly selects 5% of the subsequences of K consecutive unmasked tokens and shuffles the tokens within each selected subsequence. By default, StructBERT operates on trigrams (K = 3). A minimal sketch of this shuffling step is shown below.
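The following Python sketch illustrates the idea under simple assumptions: the function name, the way candidate spans are sampled and the (ignored) possibility of overlapping spans are mine, not the paper's exact implementation.

```python
import random

def shuffle_trigrams(tokens, masked_positions, shuffle_ratio=0.05, k=3):
    """Shuffle ~5% of K-token spans of unmasked tokens (illustrative sketch)."""
    tokens = list(tokens)
    # candidate start positions whose K-token window contains no masked token
    candidates = [
        i for i in range(len(tokens) - k + 1)
        if not any(p in masked_positions for p in range(i, i + k))
    ]
    n_spans = max(1, int(shuffle_ratio * len(candidates)))
    shuffled_positions = []
    # note: in this simplified version sampled spans may overlap
    for start in random.sample(candidates, min(n_spans, len(candidates))):
        window = tokens[start:start + k]
        random.shuffle(window)                    # permute the tokens inside the trigram
        tokens[start:start + k] = window
        shuffled_positions.extend(range(start, start + k))
    # the model must restore the original tokens at `shuffled_positions`
    return tokens, shuffled_positions

shuffled_tokens, positions = shuffle_trigrams(
    ["the", "quick", "brown", "fox", "[MASK]", "over", "the", "lazy", "dog"],
    masked_positions={4},
)
```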
When the last hidden layer is computed, the output embeddings of masked and shuffled tokens are used to predict the original tokens, taking into account their initial positions.
Ultimately, the word structural objective is combined with the MLM objective with equal weights.
Next sentence prediction, another BERT pretraining task, is considered relatively simple. Mastering it does not lead to a significant boost in BERT's performance on most downstream tasks. That is why the StructBERT researchers increased the difficulty of this objective by making the model predict the sentence order.
Taking a pair of consecutive sentences S₁ and S₂ from a document, StructBERT constructs a training example in one of three possible ways, each occurring with an equal probability of 1 / 3:
- S₂ is followed by S₁ (label 1);
- S₁ is followed by S₂ (label 2);
- Another sentence S₃ from a random document is sampled and is followed by S₁ (label 0).
Each of these three procedures results in an ordered pair of sentences, which are then concatenated; a sketch of this construction is shown below. The token [CLS] is added before the beginning of the first sentence and [SEP] tokens are used to mark the end of each sentence. BERT takes this sequence as input and outputs a set of embeddings on the last hidden layer.
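Here is a short, hypothetical sketch of the construction; the function name and inputs are placeholders, and the label numbering simply follows the list above.

```python
import random

def make_sentence_order_example(s1, s2, random_sentence):
    """Build one sentence-order training example (illustrative sketch).

    s1 and s2 are consecutive sentences (token lists); random_sentence is a
    sentence sampled from another document.
    """
    choice = random.random()
    if choice < 1 / 3:
        pair, label = (s2, s1), 1                  # reversed order
    elif choice < 2 / 3:
        pair, label = (s1, s2), 2                  # original order
    else:
        pair, label = (random_sentence, s1), 0     # random sentence followed by S1
    tokens = ["[CLS]"] + pair[0] + ["[SEP]"] + pair[1] + ["[SEP]"]
    return tokens, label
```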
The output [CLS] embedding, which was originally used in BERT for the next sentence prediction task, is now used in StructBERT to identify which of the three possible labels corresponds to the way the input sequence was built.
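A minimal sketch of such a three-way classification head, assuming a single linear layer over the [CLS] output and the BERT-base hidden size, could look like this:

```python
import torch
import torch.nn as nn

hidden_size = 768                                # assumed BERT-base hidden size
order_head = nn.Linear(hidden_size, 3)           # maps [CLS] output to 3 labels

cls_output = torch.randn(8, hidden_size)         # [CLS] embeddings for a batch of 8 pairs
logits = order_head(cls_output)                  # shape: (8, 3)
labels = torch.randint(0, 3, (8,))               # labels produced during example construction
sentence_order_loss = nn.CrossEntropyLoss()(logits, labels)
```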
The final objective consists of a linear combination of word and sentence structural objectives.
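As a toy illustration of how these losses might be combined (only the equal weighting of MLM and shuffled-token prediction is stated above; the 1:1 weighting of the two structural objectives is an assumption here):

```python
import torch

# Dummy per-objective losses; in practice they come from the MLM head,
# the shuffled-token head and the [CLS] sentence-order head described above.
mlm_loss = torch.tensor(2.3)
shuffled_token_loss = torch.tensor(1.7)
sentence_order_loss = torch.tensor(1.1)

word_structural_loss = 0.5 * mlm_loss + 0.5 * shuffled_token_loss  # equal weights, as stated above
total_loss = word_structural_loss + sentence_order_loss            # assumed 1:1 combination
```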
All of the main pretraining details are the same in BERT and StructBERT:
- StructBERT uses the same pretraining corpus as BERT: English Wikipedia (2500M words) and BookCorpus (800M words). Tokenization is done with the WordPiece tokenizer.
- Optimizer: Adam (learning rate = 1e-4, weight decay L₂ = 0.01, β₁ = 0.9, β₂ = 0.999).
- Learning rate warmup is performed over the first 10% of total steps, after which the rate is decayed linearly (see the sketch after this list).
- Dropout with probability 0.1 is applied on all layers.
- Activation function: GELU.
- The pretraining procedure is run for 40 epochs.
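The optimizer and schedule from the list above can be sketched as follows; `model` and `total_steps` are placeholders, and the LambdaLR formulation is my own restatement of "10% linear warmup, then linear decay", not the paper's code.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(768, 768)        # stand-in for the real encoder
total_steps = 1_000_000                  # placeholder number of training steps
warmup_steps = int(0.1 * total_steps)    # warmup over the first 10% of steps

optimizer = AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.999), weight_decay=0.01)

def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)                                   # linear warmup
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))     # linear decay to zero

scheduler = LambdaLR(optimizer, lr_lambda)
```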
Like the original BERT, StructBERT comes in base and large versions. All the main settings, such as the number of layers, attention heads, hidden size and the number of parameters, correspond exactly to the base and large versions of BERT, respectively.
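For reference, the commonly cited BERT settings that StructBERT reuses can be written out as a plain dictionary (the parameter counts are the usual approximate figures):

```python
# Commonly cited BERT base / large settings, which StructBERT reuses.
structbert_configs = {
    "base":  {"layers": 12, "hidden_size": 768,  "attention_heads": 12, "parameters": "~110M"},
    "large": {"layers": 24, "hidden_size": 1024, "attention_heads": 16, "parameters": "~340M"},
}
```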
By introducing a new pair of pretraining objectives, StructBERT reaches new limits in NLP, consistently outperforming BERT on various downstream tasks. It was demonstrated that both objectives play an indispensable role in the StructBERT setting. While the word structural objective mostly enhances the model's performance on single-sentence problems by teaching StructBERT to reconstruct word order, the sentence structural objective improves its ability to understand inter-sentence relations, which is particularly important for sentence-pair tasks.
All images unless otherwise noted are by the author