
Systematic Learning with the [Prompt Engineering Guide]: LLM Prompt Engineering #3


Overview

This is a (personal) note to systematically learn Prompt Engineering techniques, referring to the publicly available Prompt Engineering Guide.

In this series, only Meta Prompting remains.
Meta Prompting is a type of soft prompting, a concept distinct from hard prompting, in which "prompts" are added as natural-language text.
I would like to explain Meta Prompting while organizing its theoretical background.

Please check the previous articles at the links below.

https://zenn.dev/akitek/articles/36b8bf5ba4af4b
https://zenn.dev/akitek/articles/37dde9668ebfe8

For Busy People

(*) The content of this article has been summarized using ChatGPT.

| Method Name | Classification | Features | Representative Paper / Source | Remarks |
| --- | --- | --- | --- | --- |
| Hard Prompting | Hard Prompt | Directly use manually designed text prompts | Numerous existing studies, the GPT-3 paper, etc. | Easy to implement but often suboptimal |
| Soft Prompting | Soft Prompt | Use learnable embedding vectors (numerical) as prompts | — | Unreadable/uneditable; low interpretability but effective |
| Prefix Tuning | Soft Prompt | Add embeddings to the model's input (affects all layers) | Li & Liang (2021) "Prefix-Tuning: Optimizing Continuous Prompts for Generation" | Strong on text generation tasks |
| Prompt Tuning | Soft Prompt | Attach embeddings only to the input; Transformer layers are fixed | Lester et al. (2021) "The Power of Scale for Parameter-Efficient Prompt Tuning" | Small scope of influence on model output, but fast and efficient |
| AutoPrompt | Hard Prompt | Optimize discrete token sequences using gradient-based methods | Shin et al. (2020) "AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts" | Intermediate method (discrete tokens), neither purely soft nor hard |
| Meta Prompting (MAML) | Soft Prompt + Meta | Meta-learning for task generalization; optimizing initial prompt values across multiple tasks | Hou et al. (2022) "MetaPrompting: Learning to Learn Better Prompts" (arXiv:2209.11486) | Rapid adaptation to unseen tasks in just a few steps via the MAML framework |

Knowledge Background Leading to Meta Prompting

Toward Automatic Prompt Generation

In the first article of this series, I introduced techniques for manually designing good prompts to obtain quality responses from LLMs. For example, there was Few-shot Prompting, which includes examples in the prompt, and Chain-of-thought Prompting, which incorporates the reasoning process itself.

It became clear that even with pre-trained (frozen) LLMs, it is possible to apply the same model to various tasks just by changing the prompt.

While we realized that desired answers can be obtained simply by inputting natural language prompts into LLMs, challenges have emerged in the design of those prompts.

The paper on AutoPrompt, which I will introduce later, identifies the following issues:

Unfortunately, prompt engineering requires manual text input into the language model. This process is time-consuming and far from intuitive. More importantly, language models are quite sensitive to the prompt; inappropriate prompts can degrade model performance.

Designing prompts properly requires the effort of tailoring them to both the specific task and the language model being used.

Automatic Prompt Engineer (APE), introduced in the first article, was a method motivated by the idea of automatically generating the prompt itself.

Consequently, the importance of versatile prompts that do not depend on specific language models or tasks is increasing.

AutoPrompt

In particular, to maximize the capabilities of a language model, one must have a good grasp of its internal knowledge (or rather, its internal mechanics).

Attempts to explore the internal knowledge of language models have been approached through methods such as shallow probes (classifiers) and attention visualization. However, there were concerns about false positives and misunderstandings of causality.

A direct way to extract knowledge from a language model is through prompting. In other words, if we try asking the language model various questions (Input) and collect the answers (Output), can't we extract its knowledge? That is the idea. Of course, typing in random prompts would be meaningless.

AutoPrompt is a method that automatically creates prompts to draw out the task performance of the language model being used, without adding additional parameters or fine-tuning the model.

As shown in the figure, AutoPrompt is a method for discovering effective prompts for a task by automatically generating a sequence of trigger words (Trigger Tokens) for mask positions ([MASK]). These Trigger Tokens are common to all inputs.

The process searches for which tokens should fill the Trigger Token positions to obtain good answers. Here, tokens are selected via a gradient-guided search to maximize the likelihood of the correct output given the input.[1]

AutoPrompt is a means of giving task instructions "without parameters," but by optimizing Trigger Tokens for the specific language model used, the internal knowledge of the MLM (Masked LM) can be powerfully activated, resulting in better answers.

https://arxiv.org/abs/2010.15980
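To make the gradient-guided search concrete, here is a minimal NumPy sketch of the first-order candidate-scoring idea (HotFlip-style, which AutoPrompt builds on). The toy embedding matrix and gradient are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

V, d = 10, 4                     # toy vocabulary size and embedding dimension
E = rng.normal(size=(V, d))      # toy token embedding matrix

# Suppose backprop has given us the gradient of the task loss with respect
# to the embedding currently sitting at one trigger position.
grad = rng.normal(size=d)

def best_trigger_token(E, grad):
    """First-order candidate scoring: swapping the trigger embedding to
    token w changes the loss by roughly grad . (E[w] - e_current), so the
    most promising replacement maximizes -grad . E[w]."""
    scores = -E @ grad
    return int(np.argmax(scores))

w = best_trigger_token(E, grad)
print(w)  # index of the most promising replacement token
```

In the actual method this scoring only shortlists candidates; each candidate is then re-evaluated with a forward pass before the swap is accepted.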

Integrating Prompts into the Model

Methods like APE and AutoPrompt have been proposed to automatically generate prompts suitable for language models. Creating these prompts in a way that humans can also understand as natural language is called Hard Prompting. Entering a sentence like "What's the weather today?" into ChatGPT is a type of Hard Prompting.

It is known that with Hard Prompting you can get good answers from LLMs if you design the prompts well. However, there is a technique that aims even higher: Soft Prompting.

Soft prompts are created during the prompt tuning process.
Unlike hard prompts, soft prompts cannot be viewed or edited as text.
Prompts are composed of embeddings (sequences of numbers), which draw out knowledge from large-scale models.
Therefore, a clear disadvantage is that soft prompts lack interpretability. The AI automatically finds prompts relevant to a specific task, but it cannot explain why it chose those embeddings. Like deep learning models, soft prompts are black boxes.

https://cobusgreyling.medium.com/prompt-tuning-hard-prompts-soft-prompts-49740de6c64c

Prefix Tuning

A famous Soft Prompting method is Prefix Tuning, proposed by Li & Liang (2021).

https://arxiv.org/abs/2101.00190

To summarize the method:
Prefix Tuning is a technique that enables task adaptation by learning only a small number of prefix parameters while keeping the parameters of the large-scale LM fixed. It achieves generation performance equal to or better than fine-tuning, while also offering storage efficiency, generalization, and robustness with small amounts of data, making it a very promising lightweight method for adapting to generation tasks.

In this method as well, newly added tokens are automatically learned so that the language model can generate optimal answers.

However, unlike AutoPrompt, in Prefix Tuning, these tokens are embedded into the model itself.
And while the tokens in AutoPrompt were discrete symbols (words), the tokens in Prefix Tuning are continuous values, allowing richer expressive power to be embedded in the tokens.

The original motivation for the research was the desire to handle various language tasks with lighter-weight adjustments rather than computationally expensive fine-tuning.
The authors mention that prompting (Hard Prompting) served as a hint for this—namely, the simplicity of being able to apply the same model to different tasks just by slightly changing the input.


In the proposed method, by attaching a Prefix (the equivalent of tokens in AutoPrompt) before the input (and output), they have devised a way to learn only a small sequence of continuous vectors (the prefix) for each task while freezing the parameters of the model body (Transformer).

This prefix corresponds to each layer of the Transformer and acts like virtual tokens. The model then refers to this prefix when generating the next token, enabling it to output responses corresponding to various tasks.

The selling point is the ease of use: you can share a common base language model (LM) and handle tasks simply by swapping out this prefix for each task, without having to perform full fine-tuning.

In fact, it achieves task performance while updating only about 0.1% of the parameters.
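As a rough back-of-the-envelope check of that figure, the sketch below counts prefix parameters for assumed GPT-2-medium-like dimensions (the prefix length, layer count, hidden size, and LM size are illustrative choices, not numbers from the paper):

```python
# Illustrative parameter count for Prefix Tuning (all numbers are rough
# assumptions): an LM with 24 layers and hidden size 1024, prefix length 10.
hidden, layers, prefix_len = 1024, 24, 10
lm_params = 345_000_000          # approximate size of the frozen LM

# one key vector and one value vector per prefix position, per layer
prefix_params = prefix_len * layers * 2 * hidden
print(prefix_params, prefix_params / lm_params)  # ~0.49M, ~0.14% of the LM
```

Even with these crude numbers, the trainable portion lands in the same ballpark as the ~0.1% reported in the paper.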

Prompt Tuning

Let's introduce another Soft Prompting method.
It is Prompt Tuning, proposed in Lester et al. (2021).
The direction of the research motivation is similar to the Prefix Tuning mentioned earlier.
Both Prompt Tuning and Prefix Tuning are "parameter-efficient fine-tuning (PEFT) methods to adapt large language models (LLMs) to new tasks without updating the LLM's weights at all."
However, there are differences in where the token sequences are added internally and how they are manipulated.

In short:

  1. In Prefix Tuning, the prefix (token sequence) is placed in each layer within the model.
  2. In Prompt Tuning, the token sequence is placed at the very beginning of the model, i.e., only in the input embedding layer.

Therefore, Prompt Tuning is very efficient because the token sequence to be learned is limited. However, the accuracy may be lower than Prefix Tuning, which embeds token sequences into each layer of the model.
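A minimal sketch of the Prompt Tuning insertion point, with toy NumPy arrays standing in for the frozen model's input embeddings (all names and sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

d, prompt_len, seq_len = 8, 4, 6          # toy embedding dim and lengths

# Frozen piece: the model's token embeddings for the actual input text.
input_embeds = rng.normal(size=(seq_len, d))

# The only trainable parameters in Prompt Tuning: a short sequence of
# continuous vectors prepended at the input embedding layer.
soft_prompt = rng.normal(size=(prompt_len, d))

def with_soft_prompt(soft_prompt, input_embeds):
    """The frozen Transformer consumes this longer sequence as if the
    prompt vectors were ordinary (virtual) tokens."""
    return np.concatenate([soft_prompt, input_embeds], axis=0)

x = with_soft_prompt(soft_prompt, input_embeds)
print(x.shape)  # (prompt_len + seq_len, d) = (10, 8)
```

During training, gradients flow back only into `soft_prompt`; everything downstream of the embedding layer stays untouched, which is exactly why the method is so cheap to store and serve.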

Below is a simple comparison table (summarized by ChatGPT 😇)

| Comparison Item | Prompt Tuning | Prefix Tuning |
| --- | --- | --- |
| Insertion location | Soft prompt added before the input tokens | Prefix vectors added to the keys/values (KV) in each layer |
| Parameters changed | Soft prompt in the input embedding only | Prefix vectors passed to every Transformer block |
| Model body | Completely frozen | Completely frozen |
| Implementation difficulty | Relatively easy (embedding preprocessing only) | Medium (requires inserting hooks into each layer) |
| Computational cost | Low | Slightly higher (prefix added to all layers) |
| Performance (general trend) | Weaker in small models, but strong in large models | Generally more stable than Prompt Tuning |

| Scenario | Recommended Method | Reason |
| --- | --- | --- |
| Small to medium models (<1B) | Prefix Tuning | Prompt Tuning performance tends to be unstable |
| Large models (>1B, especially >10B) | Prompt Tuning | The model's own linguistic knowledge is strong, so a shallow soft prompt suffices |
| High performance + low storage | Prompt Tuning | Only the soft prompt is stored; lightweight (a few KB) |
| Stability / low-data robustness | Prefix Tuning | Higher generalization thanks to intervention in deeper layers |

https://arxiv.org/abs/2104.08691

The method of adjusting separately prepared token sequences without re-learning the entire model seems to meet practical needs. Below is a comparison diagram of Fine Tuning, Soft Prompting (Prompt Tuning), and Hard Prompting (Prompt Design) (referenced from the blog below).

https://research.google/blog/guiding-frozen-language-models-with-learned-soft-prompts/

Please also refer to the blog by the Google researchers who proposed it.


Summary So Far

| Method | Type | Prompt Format | Interpretability | Model Update | Features |
| --- | --- | --- | --- | --- | --- |
| Fine-Tuning | — | N/A (updates parameters of the entire model) | — | ✅ Updates body | High performance but high cost |
| Manual Prompting | Hard | Natural-language strings | ✅ High | ❌ None | Requires manual design |
| AutoPrompt (Shin et al., 2020) | Hard | Strings consisting of vocabulary tokens | ✅ High | ❌ None | Automatically generated but still readable |
| Prefix-Tuning (Li & Liang, 2021) | Soft | Continuous vectors injected into each layer | ❌ Low | ❌ Body frozen | Learns task-specific virtual tokens |
| Prompt Tuning (Lester et al., 2021) | Soft | Learned embeddings attached just before the input | ❌ Low | ❌ Body frozen | Simpler and more lightweight than Prefix-Tuning |

Meta Prompting

Now, the final technique in this article is Meta Prompting.
Before explaining this technique, let's categorize prompting into two types and look at the problems that Meta Prompting aimed to solve.

  1. Limits of hard prompts (manual prompts)
    • As we have seen, manual prompts are methods that add "some kind of instruction (natural language)" to the prompt given to the LLM.
    • Few-shot tasks using fixed natural language inputs (e.g., "The movie was [MASK]") are highly difficult to design manually and require task-specific adjustments.
  2. Dependency of soft prompts
    • In Soft Prompting, the goal is to improve model accuracy for individual tasks by attaching prompts to the input and learning the prompts as parameters (the difference from the former is that they are not passed as natural language).
    • While "soft prompts" (attaching continuous vectors as prompts) can be expected to improve performance, they are strongly dependent on their initialization method, and fine-tuning does not work well without a good initial value setting.
    • Current methods require a deep understanding of the internal structure of each task and manual design of initialization for each task, which hinders practicality and scalability.

Meta Prompting is a method that addresses the latter—the dependency of soft prompts—and was proposed as a general method for soft prompts that had previously been designed in a task-specific manner.

https://arxiv.org/abs/2209.11486

In other words, Meta Prompting was proposed to solve this "initialization problem" of soft prompts.
Specifically, it uses MAML (Model-Agnostic Meta-Learning) to acquire initial prompt parameters \theta_{meta} from meta-tasks spanning multiple domains.
Then, for a new task, using this common initial value as a starting point, it becomes possible to learn stably and with high accuracy using only a few shots.

The figure shows an image of the Meta Prompting process (quoted from the paper).

Here is a brief explanation of the procedure:

Let the token embedding sequence of the soft prompt be \theta.

  1. Sampling meta-tasks
    • Randomly sample multiple tasks (e.g., T_{1}, T_{2}, ..., T_{n}) from a set of tasks.
  2. Inner-loop update for each task
    • For each task T_{i}, perform learning on a support set (a small amount of training data) of T_{i}, starting from the current common prompt parameters \theta:
\theta^{'}_{i} = \theta - \alpha \nabla_{\theta}L^{support}_{T_{i}}(\theta)
    • This \theta^{'}_{i} is the prompt parameter \theta slightly adapted to T_{i}.
  3. Meta update (outer loop)
    • Using each \theta^{'}_{i} obtained above, calculate the loss on the query set (test set) of the corresponding task:
L^{query}_{T_{i}}(\theta^{'}_{i})
    • Average the losses over all tasks and update the original \theta:
\theta \leftarrow \theta - \beta \nabla_{\theta}\sum_{i} L^{query}_{T_{i}}(\theta^{'}_{i})

The final goal is to acquire a token sequence that enables general-purpose, stable learning for each task.
The overall picture is that the inner loop first creates token sequences suited to each task individually, and the meta update then combines them into a token sequence that is effective across tasks.
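The two loops above can be sketched on toy quadratic task losses, where gradients are available in closed form. Note this uses the first-order (FOMAML-style) approximation of the outer gradient to keep the code short, which is a simplification of the full MAML update; the task set and step sizes are illustrative assumptions:

```python
import numpy as np

# Toy meta-learning of a soft-prompt initialization. Each "task" i is a
# quadratic loss L_i(theta) = 0.5 * ||theta - c_i||^2, whose gradient is
# simply (theta - c_i); support and query sets share the same optimum here.
centers = [np.array([1.0, 0.0]), np.array([-1.0, 0.0]), np.array([0.0, 2.0])]

def grad(theta, c):
    return theta - c

theta = np.zeros(2)          # shared initial prompt parameters
alpha, beta = 0.1, 0.5       # inner- and outer-loop step sizes

for _ in range(100):         # outer loop (meta updates)
    outer_grad = np.zeros(2)
    for c in centers:        # inner loop: adapt to each task's support set
        theta_i = theta - alpha * grad(theta, c)
        # first-order approximation: evaluate the query gradient at theta_i
        outer_grad += grad(theta_i, c)
    theta = theta - beta * outer_grad / len(centers)

print(theta)  # converges toward the mean of the task optima, (0, 2/3)
```

For these symmetric quadratics the meta-learned initialization settles at the average of the task optima, i.e., a starting point from which every task can be reached in a few inner-loop steps, which is exactly the intuition behind using MAML for prompt initialization.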

Summary

In this article, I explained Meta Prompting, the remaining technique in this series, referring to the publicly available Prompt Engineering Guide.
I also touched on the other soft prompting methods that lead up to Meta Prompting.

This content goes beyond what the official Prompt Engineering Guide covers in depth. I hope it proves useful.

References

https://webbigdata.jp/post-12756/

Japanese article about Soft Prompt

https://zenn.dev/elith/articles/a4b17e072d4870

Explains Prompt Tuning and Prefix Tuning within Soft Prompting

Footnotes
  1. I thought this part was similar to APE. ↩︎
