Self-Instruct Framework, Explained | by Tsiu-zhen-tsin Dmitrii

Contents

Step 4— Finetuning the LM to Follow Instructions Quantity Diversity Quality Costs Stanford Alpaca³ Self-Rewarding Language Models⁴

That’s the main idea behind Self-Intsruct!

Step 4— Finetuning the LM to Follow Instructions

After completing all previous steps, we can take a pre-trained LM and instruction-tune it on the generated dataset to achieve better metrics.

At the beginning of the article, I covered some challenges that “instruction-tuned” LLMs face; let’s see how Self-Instruct enables overcoming them.

Quantity

With the help of only 175 initial human-written tasks, 52K instructions and 82K instances were generated:

Source: Self-Instruct: Aligning Language Models with Self-Generated Instructions

Diversity

To investigate how diverse the generated dataset is, authors of Self-Instruct used Berkley Neural Parser to parse instructions and then extract the closest verb to the root and its first direct noun object. 26K out of 52K instructions have a verb-noun format, but the other 26K instructions have more complex structure (e.g., “Classify whether this tweet contains political content or not.”) or are framed as questions (e.g., “Which of these statements are true?”).

The top 20 most common root verbs (inner circle) and their top 4 direct noun objects (outer circle) in the generated instructions | Source: Self-Instruct: Aligning Language Models with Self-Generated Instructions

Quality

To prove that Self-Instruct can generate high-quality tasks, it was randomly selected 200 generated instructions and sampled 1 instance per instruction, and then the author of the framework assessed them, obtaining the following results:

As we can see, 92% of all tasks describe a valid task, and 54% — have all valid fields (given that we generated 52K tasks, at least 26K will represent high-quality data, which is fantastic!)

Costs

The Self-Instruct framework also introduces significant cost advantages as well. The initial phases of task generation (Steps 1-3 ) amount to a mere $600, while the last step of fine-tuning using the GPT-3 model incurs a cost of $338. It’s vital to keep in mind when we look at results!

How Self-Instruct can enhance the ROUGE-L metric on the SuperNI (Super-Natural Instructions) dataset? For that, we can compare the results of 1) off-the-shelf pre-trained LMs without any instruction fine-tuning (Vanilla LMs), 2) Instruction-tuned models (Instruction-tuned w/o SuperNI), and 3) Instruction-tuned models trained on SuperNI (Instruction-tuned w/ SuperNI):

Evaluation results on ***unseen*** tasks from SuperNI | Source: Self-Instruct: Aligning Language Models with Self-Generated Instructions

As we can see, using Self-Instruct demonstrates a 33% absolute improvement over the original model on the dataset (1); simultaneously, it shows that using the framework can also slightly improve metrics after fine-tuning the SuperNI dataset (3).

Moreover, if we create a new (=unseen) dataset of 252 instructions and 1 instance per instruction and evaluate a selection of instruction-tuned variants, we can see the following results:

Performance of GPT3 model and its instruction-tuned variants, evaluated by human experts on our 252 user-oriented instructions | Source: Self-Instruct: Aligning Language Models with Self-Generated Instructions

GPT3 + Self-Instruct shows impressive results compared to other instruction-tuned variants, but there is still a place for improvement compared to InstructGPT (previously available LLMs by OpenAI) variants.

The idea behind Self-Instruct is straightforward, but at the same time, it is compelling, so let’s look at how we can use it in different cases.

Stanford Alpaca³

In 2023, Alpaca LLM from Stanford gained colossal interest due to affordability, accessibility, and the fact that it was developed for less than $600, and at the same time, it combined LLaMA and Self-Instruct ideas.

High-level overview of Alpaca | Source: Alpaca: A Strong, Replicable Instruction-Following Model

Alpaca’s version of Self-Instruct were slightly modified:

Step 1 (instruction generation): more aggressive batch decoding was applied, i.e., generating 20 instructions at once
Step 2 (classification task): this step was wholly excluded
Step 3 (instance generation): only one instance is generated per instruction

In the end, researchers from Stanford could achieve significant improvements in comparison to the initial set-up in Self-Instruct and based on performed a blind pairwise comparison between text-davinci-003 (InstructGPT-003) and Alpaca 7B: Alpaca wins 90 versus 89 comparisons against text-davinci-003.