Machine learning models learn only at training time, right? Can they learn anything during inference? It turns out that attention-based models such as GPT can.
Table of Contents
- What is in-context learning?
- What is zero-shot prompting/zero-shot inference in GPT?
- What is one-shot prompting in GPT?
- What is few-shot prompting in GPT?
- Do you need to provide correct examples?
- How does in-context learning work?
- When is in-context learning better than finetuning?
- Research papers on in-context learning
What is in-context learning?
The ability to learn at inference time is called in-context learning.
When we use a GPT model, we can observe a curious behavior. If we type a prompt and the model cannot produce a useful result, we can often improve the outcome by adding one or several examples to the prompt. With those examples, the model can handle a task it couldn’t handle without them. In short, GPT learned from the examples provided in the context of our prompt. Hence the name: in-context learning.
Of course, whatever the model learns is available only during that single inference call. The model’s weights don’t change, so it doesn’t remember the examples afterwards.
We can distinguish between prompts without examples, with one example, and with more than one example. We call them zero-shot prompting, one-shot prompting, and few-shot prompting, respectively.
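The three prompt styles differ only in how many examples are placed between the instructions and the input. A minimal sketch of that idea follows; `build_prompt` and `shot_type` are hypothetical helper names invented for this illustration, and the `Example:` / `###` layout simply mirrors the prompts used in this article rather than any required format.

```python
def build_prompt(instructions, examples, query):
    """Assemble a zero-, one-, or few-shot prompt.

    `examples` is a list of (input, output) pairs. An empty list
    produces a zero-shot prompt.
    """
    parts = [instructions.strip()]
    for number, (example_input, example_output) in enumerate(examples, start=1):
        # Number the examples only when there is more than one.
        label = "Example:" if len(examples) == 1 else f"Example {number}:"
        parts.append(f"{label}\nInput: {example_input}\nOutput: {example_output}")
    parts.append("###")  # separator between the instructions/examples and the input
    parts.append(query.strip())
    return "\n".join(parts)


def shot_type(examples):
    """Name the prompting style by the number of examples."""
    if not examples:
        return "zero-shot"
    return "one-shot" if len(examples) == 1 else "few-shot"


instructions = (
    "1. Split the given sentence into words.\n"
    "2. Every odd word is a key of a JSON object.\n"
    "3. Every even word is the value of the key defined before that word.\n"
    "4. Return a JSON object with all words in the sentence."
)
examples = [("Every dog fetches balls.", '{"Every": "dog", "fetches": "balls"}')]

print(shot_type(examples))
print(build_prompt(instructions, examples, "Cats sleep in boxes"))
```

Passing an empty list for `examples` yields the zero-shot version of the same prompt, so switching between the styles is a one-argument change.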
What is zero-shot prompting/zero-shot inference in GPT?
In the case of zero-shot prompting/zero-shot inference, we provide our instructions and don’t give any hints. If GPT can handle the task, great. If not, we are out of luck.
For example, when I send the following prompt:
1. Split the given sentence into words.
2. Every odd word is a key of a JSON object.
3. Every even word is the value of the key defined before that word.
4. Return a JSON object with all words in the sentence.
###
Cats sleep in boxes
GPT-3 returns:
Answer: {Cats: "sleep", "in": "boxes"}
Almost correct: the first key, Cats, is missing its quotation marks, so the output is not valid JSON.
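To make the target of this task concrete, here is a small reference sketch of the four instructions in plain Python, together with a check of the answer GPT-3 gave. The function names are hypothetical. Two details are assumptions, because the prompt does not specify them: punctuation around words is stripped (as the one-shot example in the next section suggests), and an unpaired last word is silently dropped.

```python
import json
import string


def words_to_json(sentence):
    """Pair up words: odd-positioned words are keys, even-positioned words are values."""
    words = [word.strip(string.punctuation) for word in sentence.split()]
    words = [word for word in words if word]
    # Assumption: if the sentence has an odd number of words, zip drops the last one.
    pairs = zip(words[0::2], words[1::2])
    return json.dumps(dict(pairs))


def is_valid_json(text):
    """Return True when the text parses as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


print(words_to_json("Cats sleep in boxes"))
# {"Cats": "sleep", "in": "boxes"}

print(is_valid_json('{Cats: "sleep", "in": "boxes"}'))
# False
```

The zero-shot answer fails the validity check only because of the unquoted first key; the pairing of words is right.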
What is one-shot prompting in GPT?
To help GPT, we can provide some examples. If we give only one, our request is called one-shot prompting. For example:
1. Split the given sentence into words.
2. Every odd word is a key of a JSON object.
3. Every even word is the value of the key defined before that word.
4. Return a JSON object with all words in the sentence.
Example:
Input: Every dog fetches balls.
Output: {"Every": "dog", "fetches": "balls"}
###
Cats sleep in boxes
This time, everything worked fine:
Output: {"Cats": "sleep", "in": "boxes"}
What is few-shot prompting in GPT?
Of course, we can increase the number of examples. The only upper limit is the model’s context window (the maximum number of tokens), and any prompt with more than one example is called few-shot prompting.
GPT-3 handled our previous prompt perfectly with only one example, so we will need something more difficult this time. First, let’s see what happens when we don’t provide examples.
Every line contains a number and one word.
Write a sentence containing the word. The sentence length (in words) should be equal to the given number.
###
cat 4
dog 3
banana 4
And the GPT-3 answer:
My cat loves to play. Doggo just wants cuddles. Banana is a yummy treat!
Every sentence is one word longer than requested. Now, let’s add some examples (split into two sets) and see what happens:
Every line contains a number and one word.
Write a sentence containing the word. The sentence length (in words) should be equal to the given number.
Example 1:
meetings 3
office 7
tests 4
We hate meetings.
Never go to the office ever again.
Programmers must write tests.
Example 2:
tea 4
oranges 3
goat 5
I need a tea
Oranges are orange.
A goat ate my homework
###
cat 4
dog 3
banana 4
Now, GPT knows what it should do and easily handles the task:
My cat is fluffy.
Dog barks loudly.
Bananas are yellow fruits.
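The difference between the two answers is easy to verify by counting words. The sketch below checks only the sentence lengths, not whether each sentence contains the requested word; `sentence_lengths` is a hypothetical helper, and the two strings are the GPT-3 answers quoted in this section.

```python
import re


def sentence_lengths(text):
    """Split text into sentences and count the words in each one."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [len(re.findall(r"[A-Za-z']+", sentence)) for sentence in sentences]


requested = [4, 3, 4]  # cat 4, dog 3, banana 4

zero_shot = "My cat loves to play. Doggo just wants cuddles. Banana is a yummy treat!"
few_shot = "My cat is fluffy.\nDog barks loudly.\nBananas are yellow fruits."

print(sentence_lengths(zero_shot))  # [5, 4, 5]
print(sentence_lengths(few_shot))   # [4, 3, 4]
```

Without examples, every sentence is one word too long; with the two example sets, all three lengths match the request.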
Do you need to provide correct examples?
Interestingly, the correctness of the examples matters less than you might expect. I can provide an example with the wrong answer as long as its input and output have the same structure as a correct one:
1. Split the given sentence into words.
2. Every odd word is a key of a JSON object.
3. Every even word is the value of the key defined before that word.
4. Return a JSON object with all words in the sentence.
Example:
Input: Every dog fetches balls.
Output: {"Bring": "me", "more": "pizza"}
###
Cats sleep in boxes and chase every toy
GPT-3 can handle the task even when the example is incorrect:
Output: {"Cats": "sleep", "in": "boxes", "and": "chase", "every": "toy"}
How does GPT do it? In the article “How does in-context learning work? A framework for understanding the differences from traditional supervised learning,” Sang Michael Xie and Sewon Min explained why only the distribution of input and output matters, not the actual values.
If our prompt offers a useless example like the one below, GPT-3 will produce a nonsense result.
1. Split the given sentence into words.
2. Every odd word is a key of a JSON object.
3. Every even word is the value of the key defined before that word.
4. Return a JSON object with all words in the sentence.
Example:
Input: Every dog fetches balls.
Output: [1, {}, 3, "goat"]
###
Cats sleep in boxes and chase every toy
GPT-3 tried to do its best, but the instructions were confusing, and here is the result:
Output: [1, {}, 3, "sleep", 5, "in", 7, "boxes", 9, "and", 11, "chase", 13, "every", 15, "toy"]
Therefore, we don’t need to worry much about example correctness, but we must make sure the examples show the proper input and output distribution.
- Do you want to classify text? Give examples of the texts you want to classify as the input and all (or at least as many as practical) expected output labels.
- Do you generate output in a strictly specified format? Providing several examples may be a more effective way of instructing GPT than writing a detailed description.
- Occasionally, you can skip the input example and provide only example outputs. Try this first if you don’t want to use too many tokens or type long prompts. It may be enough.
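One way to read the two experiments above: an example can be wrong about the content and still be right about the format. The sketch below separates those two properties for the example outputs used in this section. `is_word_mapping` is a hypothetical helper written for this illustration, and "JSON object mapping strings to strings" is my own approximation of "the proper output structure" for this particular task.

```python
import json


def is_word_mapping(output):
    """True when the output parses as a JSON object whose values are all strings."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and all(
        isinstance(value, str) for value in parsed.values()
    )


correct_example = '{"Every": "dog", "fetches": "balls"}'
wrong_but_well_formed = '{"Bring": "me", "more": "pizza"}'
useless_example = '[1, {}, 3, "goat"]'

for example in (correct_example, wrong_but_well_formed, useless_example):
    print(example, is_word_mapping(example))
```

The first two examples pass the structural check even though only one of them is a correct answer; the third fails it, and that was the example that led GPT-3 astray.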
How does in-context learning work?
In December 2022, Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Zhifang Sui, and Furu Wei published the paper “Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers” (arXiv:2212.10559 [cs.CL]). In the paper, they argue, with supporting experiments, that in-context learning “behaves similarly to explicit finetuning at the prediction level, the representation level, and the attention behavior level.”
How is it possible? When we train the model, the learning happens by updating the model parameters with back-propagated gradients. There is no back-propagation at inference time, so it seems that in-context learning shouldn’t work. However, the forward computation creates “meta-gradients” that are applied to the model through attention. In-context learning works like implicit finetuning at inference time.
According to the authors, both processes perform gradient descent; “the only difference is that ICL produces meta-gradients by forward computation while finetuning acquires real gradients by back-propagation.”
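The core of that argument can be illustrated numerically. The sketch below is a simplified illustration, not the paper's code. It assumes a softmax-free ("linear") attention, which, as far as I understand the paper, is the relaxation the authors use for their analysis; real GPT attention includes softmax, so the equivalence there is only approximate. All names are hypothetical, and the key and value vectors are random stand-ins for already-projected tokens. The point: attending over the demonstration tokens gives the same result as applying an extra weight matrix built from outer products, which has the same form as a gradient-descent update to a linear layer.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linear_attention(values, keys, query):
    """Softmax-free attention: sum over tokens of value_i * (key_i . query)."""
    output = [0.0] * len(values[0])
    for value, key in zip(values, keys):
        weight = dot(key, query)
        output = [o + weight * v for o, v in zip(output, value)]
    return output


def outer_product_sum(values, keys):
    """Weight matrix sum_i value_i key_i^T built from a set of tokens."""
    rows, cols = len(values[0]), len(keys[0])
    matrix = [[0.0] * cols for _ in range(rows)]
    for value, key in zip(values, keys):
        for r in range(rows):
            for c in range(cols):
                matrix[r][c] += value[r] * key[c]
    return matrix


def apply(matrix, vector):
    return [dot(row, vector) for row in matrix]


random.seed(0)
dim = 4


def random_vectors(count):
    return [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(count)]


demo_keys, demo_values = random_vectors(3), random_vectors(3)  # in-context examples
text_keys, text_values = random_vectors(2), random_vectors(2)  # the actual input
query = random_vectors(1)[0]

# Forward pass: attend over demonstrations and input together.
attended = linear_attention(demo_values + text_values, demo_keys + text_keys, query)

# Same result written as "zero-shot weights plus an update from the demonstrations".
w_zero_shot = outer_product_sum(text_values, text_keys)
delta_w_icl = outer_product_sum(demo_values, demo_keys)
updated = [a + b for a, b in zip(apply(w_zero_shot, query), apply(delta_w_icl, query))]

max_gap = max(abs(a - b) for a, b in zip(attended, updated))
print(max_gap < 1e-9)
```

Here `delta_w_icl` plays the role of the weight update contributed by the examples: it is computed in the forward pass only and disappears once the examples leave the context.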
When is in-context learning better than finetuning?
Can in-context learning ever be better than finetuning? Surprisingly, yes. In the paper mentioned in the previous section, we can find an observation that in-context learning works better than finetuning in few-shot scenarios.
Why does it matter? If you have only a handful of examples, don’t spend time finetuning the model. Finetuning on so few examples is unlikely to improve it much. Instead, pass the examples in your prompt.