Can Large Language Models (LLMs) be used to label data? | by Maja Pavlovic | Apr, 2024



Prompting: Zero vs. Few-shot

Obtaining meaningful responses from LLMs can be a bit of a challenge. How, then, do you best prompt an LLM to label your data? As Table 1 shows, the above studies explored zero-shot prompting, few-shot prompting, or both. Zero-shot prompting expects an answer from the LLM without it having seen any examples in the prompt, whereas few-shot prompting includes multiple examples in the prompt itself so that the LLM knows what a desired response looks like:

Zero Vs Few-Shot Prompting | source of example (amitsangani) | image by author

The studies differ in their views on which approach returns better results. Some rely on few-shot prompting for their tasks, others on zero-shot prompting. So you might want to explore what works best for your particular use case and model.
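As a concrete sketch, here is how zero- and few-shot prompts for a labelling task might be constructed. The sentiment task, labels, and example texts are hypothetical, not taken from the studies in Table 1:

```python
def zero_shot_prompt(text: str) -> str:
    """Ask for a label without showing any examples."""
    return (
        "Classify the sentiment of the following text as positive or negative.\n"
        f"Text: {text}\n"
        "Sentiment:"
    )

def few_shot_prompt(text: str, examples: list[tuple[str, str]]) -> str:
    """Prepend labelled examples so the model sees what a desired response looks like."""
    shots = "\n".join(f"Text: {t}\nSentiment: {label}" for t, label in examples)
    return (
        "Classify the sentiment of the following text as positive or negative.\n"
        f"{shots}\n"
        f"Text: {text}\n"
        "Sentiment:"
    )

examples = [("I loved this film.", "positive"), ("Utterly boring.", "negative")]
print(few_shot_prompt("Not bad at all!", examples))
```

The only difference between the two is the block of labelled examples; everything else about the request stays the same.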

If you are wondering how to get started with good prompting, Sander Schulhoff & Shyamal H Anadkat have created LearnPrompting, which covers the basics as well as more advanced techniques.

Prompting: Sensitivity

LLMs are sensitive to minor modifications in the prompt: changing a single word can affect the response. If you want to account for this variability to some degree, you could follow the approach of study [3]. First, they let a task expert provide the initial prompt. Then, using GPT, they generate four more prompts with similar meaning and average the results over all five. Alternatively, you could move away from hand-written prompts altogether and replace them with signatures, leaving it to DSPy to optimise the prompt for you, as shown in Leonie Monigatti’s blog post.
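A minimal sketch of that averaging idea: label each item with several prompt paraphrases and keep the majority label. The `ask_llm` function below is a stand-in stub, not a real model call, and the prompt variants are illustrative:

```python
from collections import Counter

def ask_llm(prompt: str, text: str) -> str:
    # Placeholder: a real implementation would send `prompt` + `text` to an LLM API.
    return "positive" if "love" in text else "negative"

def label_with_prompt_variants(text: str, prompt_variants: list[str]) -> str:
    """Collect one label per prompt phrasing, then take the majority vote."""
    labels = [ask_llm(p, text) for p in prompt_variants]
    return Counter(labels).most_common(1)[0][0]

variants = [
    "Classify the sentiment:",        # expert-written prompt
    "What is the sentiment here?",    # GPT-generated paraphrase
    "Label this text's sentiment:",   # GPT-generated paraphrase
]
print(label_with_prompt_variants("I love this!", variants))
```

With a real model, each variant may produce a different label; the majority vote smooths over that prompt sensitivity.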

Model Choice

Which model should you choose for labelling your dataset? There are a few factors to consider. Let’s briefly touch on some key considerations:

  • Open Source vs. Closed Source: Do you go for the latest best-performing model, or is open-source customisation more important to you? You’ll need to think about things such as your budget, performance requirements, customisation and ownership preferences, security needs, and community support requirements.
  • Guardrails: LLMs have guardrails in place to prevent them from responding with undesirable or harmful content. If your task involves sensitive content, models might refuse to label your data. Also, LLMs vary in the strength of their safeguards, so you should explore and compare them to find the most suitable one for your task.
  • Model Size: LLMs come in different sizes, and bigger models might perform better, but they also require more compute resources. If you prefer to use open-source LLMs and have limited compute, you could consider quantisation. In the case of closed-source models, the larger models currently have higher costs per prompt associated with them. But is bigger always better?
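To give a feel for why quantisation helps under limited compute, here is a back-of-the-envelope estimate of weight memory: roughly parameter count × bits per weight ÷ 8. The 7B-parameter model below is a hypothetical example, and this ignores activation and overhead memory:

```python
def weight_memory_gb(n_params: float, bits: int) -> float:
    """Approximate memory needed just for the model weights, in GB."""
    return n_params * bits / 8 / 1e9

n_params = 7e9  # a hypothetical 7B-parameter model
for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{weight_memory_gb(n_params, bits):.1f} GB")
# 16-bit: ~14.0 GB, 8-bit: ~7.0 GB, 4-bit: ~3.5 GB
```

Quantising from 16-bit to 4-bit weights cuts the weight footprint roughly fourfold, which is what can make a large open-source model fit on a single consumer GPU.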

Model Bias

According to study [3], larger, instruction-tuned³ models show superior labelling performance. However, the study doesn’t evaluate bias in its results. Another research effort shows that bias tends to increase with both scale and ambiguous contexts. Several studies also warn about left-leaning tendencies and a limited capability to accurately represent the opinions of minority groups (e.g. older individuals or underrepresented religions). All in all, current LLMs show considerable cultural biases and respond with stereotyped views of minority individuals. Depending on your task and its aims, these are things to consider at every stage of your project.

“By default, LLM responses tend to be more similar to the opinions of certain populations, such as those from the USA, and some European and South American countries” — quote from study [2]

Model Parameter: Temperature

A commonly mentioned parameter across most studies in Table 1 is the temperature parameter, which adjusts the “creativity” of the LLM’s outputs. Studies [5] and [6] experiment with both higher and lower temperatures, and find that LLMs respond more consistently at lower temperatures without sacrificing accuracy; they therefore recommend lower temperature values for annotation tasks.
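One simple way to quantify the consistency the studies measure: sample the same prompt several times at a given temperature and compute how often the modal label appears. The sampled labels below are made up for illustration, not taken from studies [5] or [6]:

```python
from collections import Counter

def consistency(labels: list[str]) -> float:
    """Fraction of responses that agree with the most common label."""
    modal_count = Counter(labels).most_common(1)[0][1]
    return modal_count / len(labels)

# Hypothetical labels from sampling the same prompt 10 times:
low_temp_labels  = ["positive"] * 9 + ["negative"]      # e.g. temperature ≈ 0.2
high_temp_labels = ["positive"] * 6 + ["negative"] * 4  # e.g. temperature ≈ 1.0

print(consistency(low_temp_labels))   # 0.9
print(consistency(high_temp_labels))  # 0.6
```

Running a check like this on a small sample of your own data is a cheap way to pick a temperature before labelling the full dataset.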

Language Limitations

As we can see in Table 1, most of the studies measure the LLMs’ labelling performance on English datasets. Study [7] explores French, Dutch and English tasks and sees a considerable decline in performance for the non-English languages. Currently, LLMs perform better in English, but initiatives are underway to extend their benefits to non-English users. Two such initiatives are YugoGPT (for Serbian, Croatian, Bosnian, Montenegrin) by Aleksa Gordić and Aya (101 different languages) by Cohere for AI.

Human Reasoning & Behaviour (Natural Language Explanations)

Apart from simply requesting a label from the LLM, we can also ask it to provide an explanation for the chosen label. One of the studies [10] finds that GPT returns explanations that are comparable to, if not clearer than, those produced by humans. However, researchers from Carnegie Mellon & Google highlight that LLMs are not yet capable of simulating human decision making and don’t show human-like behaviour in their choices. They find that instruction-tuned models show even less human-like behaviour and argue that LLMs should not be used to substitute humans in the annotation pipeline. I would also caution against relying on natural language explanations at this point in time.
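In practice, requesting an explanation alongside the label just means asking for a structured reply and parsing it. The response format and the example reply below are assumptions for illustration, not the format used in study [10]:

```python
PROMPT_TEMPLATE = (
    "Classify the sentiment of the text as positive or negative and explain why.\n"
    "Reply in exactly this format:\n"
    "Label: <label>\n"
    "Explanation: <one sentence>\n"
    "Text: {text}"
)

def parse_reply(reply: str) -> tuple[str, str]:
    """Split a 'Label: ... / Explanation: ...' reply into its two parts."""
    fields = dict(
        line.split(": ", 1) for line in reply.strip().splitlines() if ": " in line
    )
    return fields["Label"], fields["Explanation"]

# A hypothetical model reply:
reply = "Label: positive\nExplanation: The text praises the film enthusiastically."
label, explanation = parse_reply(reply)
print(label)        # positive
print(explanation)  # The text praises the film enthusiastically.
```

Keeping the label and explanation as separate fields lets you use the labels while auditing (or discarding) the explanations, which seems prudent given the caveats above.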

“Substitution undermines three values: the representation of participants’ interests; participants’ inclusion and empowerment in the development process” — quote from Agnew (2023)
