Data scarcity is a serious problem for many data scientists.
That might sound ridiculous (“isn’t this the age of Big Data?”), but in many domains there simply isn’t enough labelled data to train performant models with traditional ML approaches.
In classification tasks, the lazy approach to this problem is to “throw AI at it”: take an off-the-shelf pre-trained LLM, add a clever prompt, and Bob’s your uncle.
But LLMs aren’t always the best tool for the job. At scale, LLM pipelines can be slow, expensive, and unreliable.
An alternative is to use a fine-tuning technique designed specifically for few-shot scenarios, where labelled training data is scarce.
In this article, I’ll introduce you to a favourite technique of mine: SetFit, a fine-tuning framework that can help you build highly performant NLP classifiers with as few as 8 labelled samples per class.