In this post, we look at SentenceTransformer [1], which was published in 2019. SentenceTransformer uses a bi-encoder architecture and adapts BERT to produce efficient sentence embeddings.
BERT (Bidirectional Encoder Representations from Transformers) is built on the idea that all NLP tasks rely on the meaning of tokens/words. BERT is trained in two phases: 1) a pre-training phase, where BERT learns the general meaning of the language, and 2) a fine-tuning phase, where BERT is trained on specific downstream tasks.
BERT is very good at learning the meaning of words/tokens, but it is not good at learning the meaning of sentences. As a result, it performs poorly on tasks such as sentence classification and pairwise sentence similarity.
Since BERT produces token embeddings, one way to get a sentence embedding out of BERT is to average the embeddings of all tokens. The SentenceTransformer paper [1] showed that this produces low-quality sentence embeddings, on par with or even worse than averaged GloVe embeddings. These embeddings do not capture the meaning of sentences.
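To make the averaging baseline concrete, here is a minimal sketch of mean pooling over BERT's token embeddings using the Hugging Face transformers library. The checkpoint name and sentences are illustrative, not the paper's exact setup:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative checkpoint; any BERT model works the same way.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["The cat sits on the mat.", "A dog plays in the yard."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model(**batch).last_hidden_state  # (batch, seq_len, hidden)

# Average token embeddings, ignoring padding positions via the attention mask.
mask = batch["attention_mask"].unsqueeze(-1).float()
sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embeddings.shape)  # (2, 768)
```

As the paper shows, embeddings obtained this way transfer poorly to sentence-level tasks, which is what motivates the training objectives below.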
In order to create meaningful sentence embeddings from BERT, SentenceTransformer trains BERT on a few sentence-level tasks, such as:
- NLI (natural language inference): This task receives two input sentences and outputs “entailment”, “contradiction”, or “neutral”. “Entailment” means sentence 1 entails sentence 2, “contradiction” means sentence 1 contradicts sentence 2, and “neutral” means the two sentences are unrelated.
- STS (semantic textual similarity): This task receives two sentences and scores how similar they are. Similarity is often computed as the cosine similarity between the two sentence embeddings (see the sketch after this list).
- Triplet datasets: Each training example consists of an anchor sentence, a positive sentence, and a negative sentence, and the model learns to place the anchor closer to the positive than to the negative.
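Once a model is trained on these tasks, the STS-style similarity mentioned above reduces to a cosine similarity between two sentence embeddings. A minimal sketch with the sentence-transformers library (the checkpoint name is illustrative):

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative pretrained checkpoint; any SBERT-style model can be used here.
model = SentenceTransformer("all-MiniLM-L6-v2")

emb1 = model.encode("A man is playing a guitar.", convert_to_tensor=True)
emb2 = model.encode("Someone is strumming an instrument.", convert_to_tensor=True)

# STS-style similarity: cosine similarity between the two sentence embeddings.
score = util.cos_sim(emb1, emb2)
print(float(score))  # value in [-1, 1]; higher means more similar
```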
SentenceTransformer trains BERT on the NLI task using a Siamese network. Siamese means twins, and the network consists of…
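For the NLI (classification) objective, the paper concatenates the two pooled sentence embeddings u and v with their element-wise difference |u − v| and feeds the result to a softmax classifier over the three labels. Below is a minimal PyTorch sketch of that classification head only; the shared BERT encoder and pooling are abstracted away, and the class and variable names are illustrative:

```python
import torch
import torch.nn as nn

class SiameseNLIHead(nn.Module):
    """Classification head of the SBERT NLI objective: softmax over (u, v, |u - v|)."""

    def __init__(self, hidden_dim: int = 768, num_labels: int = 3):
        super().__init__()
        self.classifier = nn.Linear(3 * hidden_dim, num_labels)

    def forward(self, u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # u and v are pooled sentence embeddings from the same (shared-weight) BERT encoder.
        features = torch.cat([u, v, torch.abs(u - v)], dim=-1)
        return self.classifier(features)  # logits for entailment / contradiction / neutral

# Example with random tensors standing in for pooled BERT outputs.
u, v = torch.randn(4, 768), torch.randn(4, 768)
logits = SiameseNLIHead()(u, v)
print(logits.shape)  # (4, 3)
```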