Learning LLM Prompt Engineering Systematically with the Prompt Engineering Guide ②


Overview

Continuing from the previous post, these are my personal notes for systematically learning Prompt Engineering techniques based on the public Prompt Engineering Guide.
https://github.com/dair-ai/Prompt-Engineering-Guide

In this post, I will organize the remaining techniques, excluding Meta prompting. Please refer to my previous article for the earlier content.
https://zenn.dev/akitek/articles/36b8bf5ba4af4b

For those in a hurry

For those who want a quick understanding, I have summarized the techniques introduced this time in a table.
(The following table was organized by ChatGPT based on the main text.)

| Technique | Key Features | References / Links |
| --- | --- | --- |
| Automatic Prompt Engineer (APE) | Automatically generates and scores instructions with an LLM for optimization; searches for instructions as a black-box optimization problem; rephrases top candidates while preserving meaning via Monte Carlo search | Zhou et al., 2022 |
| Active-Prompt | Focuses on selecting effective CoT examples; determines annotation targets by estimating question uncertainty; aims to improve accuracy with only a small number of effective examples | Diao et al., 2023 |
| Directional Stimulus Prompting (DSP) | A small LM generates hints that augment the input to the LLM; the hint-generation model (policy LM) is trained via SFT + RL; usable without modifying the core LLM | Li et al., 2023 |
| PAL (Program-Aided Language Models) | Describes intermediate reasoning in code rather than natural language; solves problems via executable code such as Python; enables programming-based reasoning | Gao et al., 2022; GitHub: PAL |
| ReAct | Integrates reasoning (Thought) and action (Action); deepens thinking while interacting with external APIs and tools; good balance of visibility, flexibility, and performance | Yao et al., 2022; Qiita article |
| Reflexion | ReAct + self-reflection + memory; composed of three parts (Actor, Evaluator, Self-reflection); evaluates trajectories and applies lessons to future reasoning; highly interpretable and versatile | Shinn et al., 2023 |
| Multimodal CoT | Chain-of-thought integrating text and images; two-stage process (rationale generation → answer inference); performance comparable to large LLMs even with small models | Zhang et al., 2023; arXiv:2302.14045 |
| Graph Prompting | Introduces the concept of prompting to GNNs; consistent formulation via a similarity-prediction task; simplifies reuse of pre-trained knowledge for downstream tasks | Liu et al., 2023; sequel: Generalized Graph Prompt |

Prompt Engineering

Automatic Prompt Engineer

Numerous studies have shown that including appropriate instructions in a prompt is effective for obtaining better answers from an LLM. However, these "appropriate instructions" often rely on the knowledge of the human operator and are manually created.

Zhou et al., 2022 proposed the Automatic Prompt Engineer (APE), a framework for automatic instruction generation and selection.
https://arxiv.org/abs/2211.01910

The paper frames this automatic generation of instructions as "natural language program synthesis" and treats it as a black-box optimization problem, carried out through the following steps.

Following the diagram:

  1. First, the LLM is used as an inference model to list candidate instructions from a few demonstrations. (In diagram (A), input-output pairs are given to solve the task of inferring the work instructions given by a professor. First, candidate instructions that might have existed are listed by referring to the input-output pairs. (①))
  2. Next, the LLM is used to score the "quality" of each candidate instruction. (Here, each instruction is incorporated into the professor's original statement (②), an output is generated from the demo input, and the "likelihood" of each candidate instruction is calculated as a log probability by comparing the predicted sentence with the ground truth output. (③))
  3. Finally, following the Monte Carlo Method, the process of taking the instruction with the best score from step 2, increasing its variations without changing its meaning, and re-scoring them is repeated. This mechanism eventually leads to the acquisition of the best instruction. (In the example, sentences with the same meaning as "write the antonym of the word," which had the best score in step 2, such as "write the opposite of the word given," are being scored.)
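The three APE steps above can be sketched as follows. `llm_generate` and `llm_log_prob` are placeholder functions standing in for real LLM calls (my own illustrative stubs with canned outputs, not the paper's code), so only the generate → score → paraphrase-and-rescore loop is shown:

```python
import random

def llm_generate(prompt, n=1):
    # Placeholder for an LLM call; here it just returns canned candidates.
    candidates = [
        "Write the antonym of the word.",
        "Write the opposite of the word given.",
        "Reverse the input word.",
    ]
    return random.sample(candidates, min(n, len(candidates)))

def llm_log_prob(instruction, demo_input, demo_output):
    # Placeholder: score how likely the LLM is to produce demo_output when
    # `instruction` is applied to demo_input. Here: a toy heuristic.
    return -1.0 if "antonym" in instruction or "opposite" in instruction else -5.0

def ape(demos, n_candidates=3, n_rounds=2):
    # Step 1: propose candidate instructions from input-output demonstrations.
    candidates = llm_generate("Propose instructions for: " + str(demos), n=n_candidates)
    for _ in range(n_rounds):
        # Step 2: score each candidate by log-probability of the demo outputs.
        scored = [(sum(llm_log_prob(c, x, y) for x, y in demos), c) for c in candidates]
        scored.sort(reverse=True)
        best = scored[0][1]
        # Step 3 (Monte Carlo): paraphrase the best instruction and re-score.
        candidates = [best] + llm_generate("Paraphrase: " + best, n=n_candidates - 1)
    return best

demos = [("hot", "cold"), ("big", "small")]
print(ape(demos))
```

With a real LLM behind the two stubs, the loop converges toward instructions whose score survives rephrasing, which is exactly the "essence-capturing" property discussed below.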

In other words, this approach takes the view that a good instruction is one that captures the essence of the task: its advantage should hold up even when the wording changes.

In a previous article, I introduced research showing that simply adding the sentence "Let's think step by step" to a prompt allows it to solve advanced reasoning tasks.

APE has discovered a Zero-Shot CoT prompt that is even better than this one: ("Let's work this out in a step by step way to be sure we have the right answer").

It can be said that this framework has proven effective in generating instructions that yield more efficient and accurate answers than those found by hand.

Active-Prompt

Continuing on, this is a method that addresses the question: What constitutes a prompting technique with good examples?
The Chain-of-thought (CoT) method is a technique that enables dealing with advanced reasoning processes by creating prompts that explicitly show several intermediate reasoning steps.

However, as pointed out previously, it depends on human-annotated example sets, and there is a possibility that these may not necessarily be good example sets when dealing with different tasks.

Diao et al., 2023 proposed Active-Prompt as a way to adapt LLMs to different tasks while using task-specific example prompts.

https://arxiv.org/abs/2302.12246

The main objective is to address the challenge of determining which questions among a pre-conceived set are important and beneficial to annotate. In other words, in Chain-of-thought (CoT) Prompting, where intermediate reasoning processes are incorporated manually, the motivation is to automatically select problem examples that improve the response accuracy from the LLM, even with just a few examples.
Ultimately, the goal is to save the effort required for creating examples.

The approach is roughly as follows:

  1. Scoring questions in the pool (unlabeled)
    • For each question, measure the uncertainty of multiple CoT reasonings generated by the LLM.
    • Questions with higher uncertainty are considered more difficult for the model.
  2. Example selection (active selection)
    • Select questions with the highest uncertainty and manually annotate the CoT steps and answers.
  3. Example set construction and inference
    • Use the selected example set as a few-shot prompt.
    • Further improve performance by combining it with self-consistency (a type of majority-vote labeling).
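The uncertainty-based selection in steps 1 and 2 can be sketched as below. `llm_answer` is a placeholder for one sampled CoT run of the model (a canned stub here), and disagreement among k sampled answers serves as the uncertainty measure, one of the metrics the paper considers:

```python
def llm_answer(question, seed):
    # Placeholder for one sampled CoT run of the LLM (canned outputs).
    canned = {
        "easy": ["4", "4", "4", "4"],
        "hard": ["7", "9", "7", "12"],
    }
    return canned[question][seed % 4]

def uncertainty(question, k=4):
    # Disagreement: fraction of distinct answers among k sampled runs.
    answers = [llm_answer(question, s) for s in range(k)]
    return len(set(answers)) / k

def select_for_annotation(pool, budget=1):
    # Active selection: pick the most uncertain questions
    # and hand only those to humans for CoT annotation.
    ranked = sorted(pool, key=uncertainty, reverse=True)
    return ranked[:budget]

print(select_for_annotation(["easy", "hard"]))
```

The "hard" question, where the sampled answers disagree, is the one selected for annotation; the "easy" one, answered consistently, is left alone, which is how the annotation budget gets concentrated where the model actually struggles.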

Let's organize the specific flow using the diagram from the paper.
First, assume a collection of questions is prepared in advance (Unlabeled Questions). These are in an untouched state, without any human annotations yet.

The final goal is to extract only the truly beneficial questions from this collection and instruct a person to annotate them for the task the LLM is about to solve.

Next, these Unlabeled Questions and a small number of pre-annotated examples are combined and solved by the LLM. Here, it is solved k times, and the uncertainty of each answer is calculated ((1) Uncertainty Estimation).
Then, questions with a high overall average uncertainty are selected ((2) Selection), and a human is instructed to provide additional annotations for those questions. ((3) Annotation)
Finally, using the selected questions and the set of examples annotated for them, the LLM's response to the actual task is confirmed. ((4) Inference)

Active-Prompt is novel in that it applies the idea of active learning to CoT prompting, and it is a method that can visualize "what the model is struggling with" and dynamically reinforce examples accordingly.
This way of thinking may provide powerful suggestions for future prompt design. Additionally, example selection based on uncertainty is promising in terms of cost efficiency.

Directional Stimulus Prompting

As another approach to the challenge of what instructions to include for better LLM responses, Li et al., 2023 proposed a method that introduces a small tunable policy language model (e.g., T5) to generate a directional stimulus prompt that guides the output for each input.

https://arxiv.org/abs/2302.11520


The first image above shows a directional stimulus prompt in a summary task, where a Hint is included in the prompt to obtain the desired summary.
The second image shows a framework where a separate Policy LM is prepared to automatically create that stimulus prompt.

Key features of DSP (Directional Stimulus Prompting) proposed in this paper include:

  1. Stimulus Generation by Policy LM
    • Generates discrete stimuli, such as keywords or a group of hint words, for the original input text, and augments the input to the LLM based on these.
  2. Training Method
    • Supervised Fine-Tuning (SFT): Initial training of the stimulus generation model with known correct data.
    • Reinforcement Learning (RL): Fine-tuning the policy model using the LLM's final output evaluation (e.g., ROUGE score, human evaluation) as a reward.
  3. End-to-End Design
    • Only the policy model is the target for adjustment, while the LLM itself remains a black box, maintaining versatility.
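A minimal sketch of the DSP inference path follows. `policy_lm` and `llm` are placeholder stubs (the real policy model is a fine-tuned T5, and the SFT + RL training loop with ROUGE rewards is omitted); only the hint-then-augment flow is shown:

```python
def policy_lm(article):
    # Placeholder for the small tunable policy model: extract hint keywords.
    # A trained T5 would generate these; here we just take long words.
    words = [w.strip(".,").lower() for w in article.split()]
    keywords = sorted(set(w for w in words if len(w) > 6))
    return keywords[:3]

def llm(prompt):
    # Placeholder for the frozen black-box LLM: echoes the hint it was given.
    return "summary guided by: " + prompt.split("Hint: ")[1].split("\n")[0]

def dsp_summarize(article):
    # Augment the input with the generated directional stimulus (hint);
    # the LLM itself is never modified.
    hint = "; ".join(policy_lm(article))
    prompt = f"{article}\nHint: {hint}\nSummarize the article."
    return llm(prompt)

article = "Researchers announced a breakthrough in renewable energy storage."
print(dsp_summarize(article))
```

Because only `policy_lm` is trainable and `llm` stays a black box, swapping in a different backbone LLM requires no change to the framework.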

Since it involves preparing a model independent of the LLM to create better instructions, this approach seems applicable in a way that does not depend on the specific features or capabilities of the LLM.

PAL (Program-Aided Language Models)

Gao et al., 2022 proposed PAL (Program-Aided Language Models), which has the LLM describe its intermediate reasoning steps in code rather than natural language, solving problems via executable code such as Python.
https://github.com/reasoning-machines/pal/tree/main
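The PAL idea (Gao et al., 2022; repository linked above) can be sketched as follows: the model emits its reasoning as a program, and the final answer comes from executing that program rather than from free-text arithmetic. `llm_write_code` is a canned stand-in for the model:

```python
def llm_write_code(question):
    # Placeholder: a PAL-style LLM would emit its reasoning as Python code.
    # Here the "generated" program is hard-coded for illustration.
    return (
        "apples_start = 23\n"
        "apples_used = 20\n"
        "apples_bought = 6\n"
        "answer = apples_start - apples_used + apples_bought\n"
    )

def pal_solve(question):
    # Execute the generated program; the interpreter does the arithmetic,
    # so the LLM never has to compute the final number itself.
    namespace = {}
    exec(llm_write_code(question), namespace)
    return namespace["answer"]

q = "The cafeteria had 23 apples. They used 20 and bought 6 more. How many now?"
print(pal_solve(q))  # → 9
```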

ReAct

While LLMs possess both the ability to reason using Chain-of-Thought (CoT) and the ability to output actions, these have often been treated separately until now. Reasoning brings internal thoughts to the surface, whereas actions require coordination with external tools or APIs. However, reasoning alone has limits in knowledge acquisition and error correction, while actions alone make it difficult to maintain a consistent thought process.

To address these challenges, the well-known technique ReAct (Reason + Act) was proposed (Yao et al., 2022).
ReAct is a new method that integrates Reasoning (Thought) and Action (Action) at the prompt level.

A key feature is that linking with external APIs mitigates the hallucinations and reasoning errors that conventional CoT is prone to, allowing the method to respond to tasks dynamically even in interactive environments. In particular, its balance of few-shot operation, high performance, and interpretable reasoning makes it strong in explainability, practicality, and operational efficiency, positioning it as a pillar for future agent design.

As shown in the diagram, ReAct employs a circular process where, in response to an input question, it repeats the visualization of specific reasoning processes (e.g., "Search") and their execution (e.g., searching for information on Google) to improve answer accuracy.
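The Thought → Action → Observation cycle described above can be sketched as below. `llm_step` and `search_tool` are placeholder stubs (a real implementation would call an LLM and a search API), so only the control loop is illustrated:

```python
def search_tool(query):
    # Placeholder for an external tool such as a web-search API.
    kb = {"capital of France": "Paris"}
    return kb.get(query, "no result")

def llm_step(transcript):
    # Placeholder policy: emit a Thought/Action pair, then Finish
    # once an Observation has been appended to the transcript.
    if "Observation:" not in transcript:
        return ("Thought: I should look this up.\n"
                "Action: Search[capital of France]")
    return "Action: Finish[Paris]"

def react(question, max_steps=5):
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        step = llm_step(transcript)
        transcript += "\n" + step
        if "Finish[" in step:                      # final answer reached
            return step.split("Finish[")[1].rstrip("]")
        if "Search[" in step:                      # run the chosen tool
            query = step.split("Search[")[1].rstrip("]")
            transcript += f"\nObservation: {search_tool(query)}"
    return None

print(react("What is the capital of France?"))  # → Paris
```

The transcript accumulating Thoughts, Actions, and Observations is what makes the reasoning both steerable and inspectable after the fact.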

Since it is already a famous method, I will leave the detailed explanation to other articles.
https://qiita.com/kzkymn/items/de4e3a17db6e5363705d
https://zenn.dev/yusu29/articles/azure_react_llm

Reflexion

ReAct was a framework for generating better answers by simultaneously repeating reasoning and action processes.
However, since it does not include a mechanism to refine these processes themselves, computational costs simply continue to rise.
Humans, on the other hand, can reflect on failures, improve plans, and generate more efficient plans and actions.
In other words, it could be said that LLM agents lacked this "self-reflection" capability.

Shinn et al. (2023) proposed Reflexion, which is a framework for enhancing language-based agents through linguistic feedback.

As shown in the diagram, Reflexion is composed of three different models (LMs):

  1. Actor

    • Generates text and actions based on state observations, which in turn generates a trajectory.
    • Chain-of-Thought (CoT) and ReAct are used as Actor models.
    • Additionally, memory components (short-term memory, long-term memory) are added to provide the agent with extra context.
  2. Evaluator

    • Scores the output generated by the Actor. Specifically, it takes the generated trajectory (also referred to as short-term memory) as input and outputs a reward score.
    • Different reward functions are used depending on the task.
  3. Self-reflection

    • Generates linguistic reinforcement cues to assist in self-improvement.
    • This role is fulfilled by an LLM, providing valuable feedback for future attempts.
    • To generate specific and relevant feedback, the self-reflection model utilizes reward signals, the current trajectory, and persistent memory.
    • These experiences (Long-term memory) are leveraged by the agent to quickly improve its decision-making.

It can be said that Reflexion extends the ReAct framework by introducing self-evaluation, self-reflection, and memory components.
The diagram below shows an example of how a reflection agent iteratively optimizes actions to solve various tasks such as decision-making, programming, and reasoning.

For a given Task (a), it first generates a sequence of reasoning and action processes called a Trajectory (b).
The generated trajectory is then evaluated internally (Evaluator) or externally (manually) (c).
Based on the evaluation, it reflects on its own generated trajectory (d). It stores that experience in memory and then generates a new trajectory (e)... repeating this process.
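The (a)–(e) loop above can be sketched as follows. All three roles are placeholder functions standing in for LM calls (canned stubs of my own, not the paper's implementation); the point is how verbal lessons accumulate in memory and condition the next trajectory:

```python
def actor(task, memory):
    # Placeholder Actor: generates a trajectory, conditioned on past lessons.
    if any("step by step" in m for m in memory):
        return "correct trajectory"
    return "flawed trajectory"

def evaluator(trajectory):
    # Placeholder Evaluator: scores the trajectory (1.0 = success).
    return 1.0 if trajectory == "correct trajectory" else 0.0

def self_reflect(trajectory, reward):
    # Placeholder Self-reflection: turn the failure into a verbal lesson.
    return "Next time, reason step by step before acting."

def reflexion(task, max_trials=3):
    memory = []  # long-term memory of verbal lessons
    for trial in range(max_trials):
        trajectory = actor(task, memory)   # (b) generate a trajectory
        reward = evaluator(trajectory)     # (c) evaluate it
        if reward >= 1.0:
            return trajectory, trial + 1
        # (d)/(e) reflect on the failure, store the lesson, and retry
        memory.append(self_reflect(trajectory, reward))
    return None, max_trials

result, trials = reflexion("solve the puzzle")
print(result, trials)  # succeeds on the second trial
```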

Since all models are implemented as LMs (Language Models), it is considered a very versatile framework. It also possesses the following features:

  1. Lightweight and Non-Fine-Tuning: Can be applied quickly without making changes to the LLM's weights.
  2. Interpretability and Transparency: Linguistic reflection clarifies the reasons for failure and suggestions for improvement, making the learning process visible.
  3. Flexible Feedback Formats: Can handle various signals such as numerical evaluations, heuristics, and self-tests.

https://arxiv.org/abs/2303.11366

Multimodal CoT

Zhang et al., 2023 proposed Multimodal CoT, a chain-of-thought framework that integrates text and images. It proceeds in two stages, rationale generation followed by answer inference, and achieves performance comparable to large LLMs even with small models.
https://arxiv.org/abs/2302.00923
https://arxiv.org/abs/2302.14045

Graph Prompting

Graphs are powerful tools capable of providing abstract representations of various data. With the advent of Graph Neural Networks (GNNs), it has become possible to acquire meaningful latent representations from complex graph structures.

In the context of LLMs, there is a growing movement to apply the CoT framework to modalities other than natural language, such as Multimodal CoT.

An approach focused on graphs is GraphPrompt, introduced by Liu et al. (2023). This is a new prompting framework for graphs that improves performance in downstream tasks.

Fundamentally, the drawback it addresses is the gap between general-purpose pre-trained models and their use in specific downstream tasks.

  1. Scarcity of Labeled Data
    • Downstream tasks like node classification or graph classification require a large amount of task-specific labeled data.
    • However, creating labels involves high expertise and cost (e.g., medical and chemical fields require annotation by experts).
  2. Task-specific Model Retraining
    • Current GNNs require retraining for each task by changing the model's output structure or loss function.
    • This creates a structural misalignment where the architecture and training methods differ between pre-training and downstream tasks.

To put this problem in perspective: in natural language processing, prompting has been used to bridge the gap between pre-training and downstream tasks. With a single pre-trained LLM, a wide variety of tasks such as word prediction or summarization can be solved simply by changing the prompt.

What about GNNs? They still cannot be handled with prompting like natural language due to the following points:

  • Structures such as nodes and subgraphs differ from sequential inputs in NLP, making it difficult to design an input format suitable for prompts.
  • Additionally, because it is necessary to change the Readout function (pooling) or classifier for each task, it is difficult to leverage pre-trained knowledge directly for downstream tasks.

As a result, it is difficult to reuse the knowledge (representations) obtained through pre-training.

Based on the above, the research started with the following question:

"Could a unified prompt template be introduced to GNNs, similar to NLP, so that pre-training and downstream tasks can be handled in a consistent way?"

The proposed solutions are the following two points:

  1. Formulating Similarity Prediction Using Subgraph Pairs as a Common Task
    • Allows both pre-training and downstream tasks to be handled in the same format: "Are these two graphs similar?"
  2. Introducing Task-specific Prompts to Control Information Extraction
    • Uses different prompt vectors for each task in the Readout layer to extract the optimal information from pre-trained representations based on the objective.

In other words, GraphPrompt is a new solution that bridges the divide between pre-training and task adaptation in GNNs by introducing a "unified template based on structural similarity" + "learnable prompt vectors."
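The two ideas can be sketched as below: subgraph embeddings are produced by a readout that reweights node embeddings with a learnable, task-specific prompt vector, and every task reduces to comparing two such embeddings for similarity. Plain Python lists stand in for a real GNN's node embeddings, and the prompt vector is fixed here rather than learned:

```python
def prompted_readout(node_embeddings, prompt):
    # Readout: element-wise reweight each node embedding with the task's
    # prompt vector, then sum-pool into one subgraph embedding.
    dim = len(prompt)
    pooled = [0.0] * dim
    for emb in node_embeddings:
        for i in range(dim):
            pooled[i] += emb[i] * prompt[i]
    return pooled

def similarity(a, b):
    # Dot-product similarity between two subgraph embeddings.
    return sum(x * y for x, y in zip(a, b))

# Toy node embeddings from a (hypothetical) pre-trained GNN.
subgraph_1 = [[1.0, 0.0], [0.8, 0.1]]
subgraph_2 = [[0.9, 0.1], [1.0, 0.0]]
subgraph_3 = [[0.0, 1.0], [0.1, 0.9]]

# Task-specific prompt vector (learned per downstream task; fixed here).
prompt = [1.0, 0.5]

z1 = prompted_readout(subgraph_1, prompt)
z2 = prompted_readout(subgraph_2, prompt)
z3 = prompted_readout(subgraph_3, prompt)

# Both pre-training and downstream tasks reduce to "are these two similar?"
print(similarity(z1, z2) > similarity(z1, z3))  # → True
```

Because only the prompt vector changes between tasks, the pre-trained GNN weights and the similarity formulation stay identical from pre-training through every downstream task.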

Organizing the points above, the features are as follows:

  1. Achieving a unified interface by conditioning downstream tasks with prompts.
  2. Flexibly handling multiple tasks by refining the input format (graphs with prompts).

The proposed method has been further extended in the subsequent paper "Generalized Graph Prompt." Please refer to that as well.
https://arxiv.org/abs/2311.15317

Summary

Continuing from the previous post, this article systematically explored Prompt Engineering techniques based on the public Prompt Engineering Guide.

The techniques covered this time were more advanced than those in the previous article, involving applied topics such as "how to generate good prompts" and application to non-natural language data.

While some techniques like ReAct and Reflexion have already become very famous, others like Graph Prompt and Multimodal CoT are likely to attract more attention in the future.

Particularly for the content in the second half, I went back and re-read the papers because there is very little information about them in the original Prompt Engineering Guide.

Now, only Meta Prompting remains. This is a type of Soft Prompting, which is a different concept from so-called Hard Prompting where instructions are added using natural language.

Regarding Meta Prompting, I have decided to explain it in a separate article as I felt it necessary to organize the theoretical background separately.
