LM #5 | Prompt Engineering


Introduction

  • New Tasks without Extensive Training
    • Zero‑Shot Prompting : Describes tasks in natural language without examples—relies purely on the model’s innate knowledge.
    • Few‑Shot Prompting : Includes a handful of input–output demonstrations in the prompt to guide the model’s behavior.
    • Automatic Prompt Engineer (APE) : Rather than relying on human-crafted prompts, an LLM generates a pool of candidate instructions, which are then scored to select the best one.
  • Reasoning and Logic
    • Chain-of-Thought (CoT) : adds intermediate reasoning steps to the prompt, eliciting more structured and thoughtful responses from LLMs than standard prompts.
    • Automatic Chain of Thought Prompting : automatically instructs LLMs with a “Let’s think step-by-step” prompt to generate reasoning chains.
    • Self-Consistency : a decoding strategy that samples several reasoning paths and keeps the most frequent answer, improving reasoning performance over greedy decoding in CoT prompting.
    • Tree of Thoughts : extends CoT prompting by managing a tree structure of intermediate reasoning steps, known as “thoughts”. Each thought represents a coherent language sequence moving toward the final solution.
    • Graph of Thoughts : models the reasoning process as a directed graph, offering a modular architecture with diverse transformation operations.
    • Self Refine Prompting : enhances LLM performance by iteratively refining outputs through a structured three-step process: generating an initial response, prompting the model to critique its own output, and refining the response based on this feedback.
  • Reduce Hallucination
    • Retrieval Augmented Generation (RAG) : RAG analyzes user input, crafts a targeted query, and scours a pre-built knowledge base for relevant resources. Retrieved snippets are incorporated into the original prompt, enriching it with contextual background.
    • ReAct Prompting : enables LLMs to generate reasoning traces and task-specific actions concurrently
    • Chain of Verification : a systematic four-step process in which the model generates a baseline response, plans verification questions to check its work, answers those questions independently, and produces a revised response incorporating the verification results.
  • Others
    • Automatic Prompt Engineer (APE): sheds the limitations of static, hand-designed prompts by dynamically generating and selecting the most impactful prompts for specific tasks


A Systematic Survey of Prompt Engineering in Large Language Models : Techniques and Applications

  • Abstract

    • Prompt engineering extends the abilities of LMs by using task-specific prompts to guide behavior without adjusting model parameters.
    • The survey categorizes and systematically reviews over 29 different prompting methods across tasks like question answering, reasoning, code generation, and common sense inference.
  • Introduction

    • The paper emphasizes that prompt engineering has become essential for leveraging pre-trained LLMs/VLMs without costly retraining.
    • Techniques range from simple (zero-shot and few-shot) to advanced forms like chain-of-thought and code-based prompting.
    • Despite rapid development, there’s a lack of a structured, application-centric overview—hence the need for this thorough survey.
  • Methods:

    • New Tasks without Extensive Training
      • Zero‑Shot Prompting : Describes tasks in natural language without examples—relies purely on the model’s innate knowledge.
      • Few‑Shot Prompting : Includes a handful of input–output demonstrations in the prompt to guide the model’s behavior.
    • Reasoning and Logic
      • Chain-of-Thought (CoT) : adds intermediate reasoning steps to the prompt, eliciting more structured and thoughtful responses from LLMs than standard prompts.
      • Automatic Chain of Thought Prompting : automatically instructs LLMs with a “Let’s think step-by-step” prompt to generate reasoning chains.
      • Self-Consistency : a decoding strategy that samples several reasoning paths and keeps the most frequent answer, improving reasoning performance over greedy decoding in CoT prompting.
      • Tree of Thoughts : extends CoT prompting by managing a tree structure of intermediate reasoning steps, known as “thoughts”. Each thought represents a coherent language sequence moving toward the final solution.
      • Graph of Thoughts : models the reasoning process as a directed graph, offering a modular architecture with diverse transformation operations.
      • Self Refine Prompting : enhances LLM performance by iteratively refining outputs through a structured three-step process: generating an initial response, prompting the model to critique its own output, and refining the response based on this feedback.
    • Reduce Hallucination
      • Retrieval Augmented Generation (RAG) : RAG analyzes user input, crafts a targeted query, and scours a pre-built knowledge base for relevant resources. Retrieved snippets are incorporated into the original prompt, enriching it with contextual background.
      • ReAct Prompting : enables LLMs to generate reasoning traces and task-specific actions concurrently
      • Chain of Verification : a systematic four-step process in which the model generates a baseline response, plans verification questions to check its work, answers those questions independently, and produces a revised response incorporating the verification results.
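
As a rough illustration of the zero-shot and few-shot entries above, the two prompt styles differ only in whether demonstrations are placed before the query. The sketch below is a minimal assumption-laden example: the `Input:`/`Output:` layout, the function names, and the sentiment task are all hypothetical choices, not something prescribed by the survey.

```python
def zero_shot_prompt(task: str, query: str) -> str:
    """Task description only, no demonstrations."""
    return f"{task}\n\nInput: {query}\nOutput:"


def few_shot_prompt(task: str, demos: list[tuple[str, str]], query: str) -> str:
    """Task description followed by a handful of input-output demonstrations."""
    shots = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in demos)
    return f"{task}\n\n{shots}\n\nInput: {query}\nOutput:"


# Hypothetical task and demonstrations, for illustration only.
task = "Classify the sentiment of the review as positive or negative."
demos = [("Great battery life.", "positive"), ("Broke after a day.", "negative")]
zs = zero_shot_prompt(task, "The screen is gorgeous.")
fs = few_shot_prompt(task, demos, "The screen is gorgeous.")
```

Both prompts end with an open `Output:` slot, so the same completion call can be used for either style.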


Chain-of-Thought Prompting Elicits Reasoning

  • Abstract :
    • The paper investigates Chain‑of‑Thought (CoT) prompting, where a prompt includes intermediate reasoning steps to guide LLMs toward answers.
    • Experiments with large models such as PaLM 540B show that CoT substantially improves performance on arithmetic, commonsense, and symbolic reasoning tasks. With only eight CoT exemplars, PaLM 540B achieves state-of-the-art accuracy on the GSM8K math benchmark, surpassing even fine-tuned GPT‑3 with a verifier.
  • Introduction:
    • Scaling LLMs boosts general capabilities, yet challenging reasoning tasks (multi-step math, logic puzzles) often hit performance plateaus.
    • Inspired by human problem-solving, the authors propose inserting few-shot CoT exemplars into the prompt.
    • This enables sufficiently large LLMs to generate their own chains of reasoning, unlocking more complex reasoning ability
  • Methods: Few-shot CoT Prompting :
    • Add multiple examples (where each Q is paired with detailed reasoning steps leading to the answer) into the prompt
  • Conclusion:
    • CoT prompting is a simple yet powerful way to evoke multi-step reasoning in LLMs without additional training data.
    • It’s interpretable (revealing the model’s reasoning path), scalable (works across reasoning domains), and most effective with large-scale models.
    • The authors suggest the approach holds promise for broader applications and call for future work on automating CoT demonstrations
# Prompt for Math Word Problems (additional Q-A pairs with detailed reasoning steps in the prompt)

Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
A: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The answer is 5.
Q: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?
A: Originally, Leah had 32 chocolates. Her sister had 42. So in total they had 32 + 42 = 74. After eating 35, they had 74 - 35 = 39. The answer is 39.
Q: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?
A: Jason started with 20 lollipops. Then he had 12 after giving some to Denny. So he gave Denny 20 - 12 = 8. The answer is 8.
Q: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?
A: Olivia had 23 dollars. 5 bagels for 3 dollars each will be 5 x 3 = 15 dollars. So she has 23 - 15 dollars left. 23 - 15 is 8. The answer is 8.
...
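
One way to use a prompt like the one above programmatically is to prepend the exemplars to a new question and then parse the trailing "The answer is N." sentence from the completion. The sketch below assumes that sentence pattern holds for every completion; `build_cot_prompt` and `extract_answer` are hypothetical helper names, and only two of the exemplars are repeated to keep it short.

```python
import re
from typing import Optional

# (question, worked reasoning) pairs copied from the prompt above.
EXEMPLARS = [
    ("If there are 3 cars in the parking lot and 2 more cars arrive, "
     "how many cars are in the parking lot?",
     "There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The answer is 5."),
    ("Olivia has $23. She bought five bagels for $3 each. "
     "How much money does she have left?",
     "Olivia had 23 dollars. 5 bagels for 3 dollars each will be 5 x 3 = 15 dollars. "
     "So she has 23 - 15 dollars left. 23 - 15 is 8. The answer is 8."),
]


def build_cot_prompt(question: str) -> str:
    """Prepend the worked exemplars, then leave the answer slot open."""
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in EXEMPLARS)
    return f"{shots}\nQ: {question}\nA:"


def extract_answer(completion: str) -> Optional[str]:
    """Pull N out of a 'The answer is N.' sentence, if one is present."""
    match = re.search(r"The answer is \$?(-?\d+(?:\.\d+)?)", completion)
    return match.group(1) if match else None


prompt = build_cot_prompt("Tom has 4 apples and buys 6 more. How many apples does he have?")
```

The model's completion for `prompt` would then be passed to `extract_answer`; completions that do not follow the exemplar format return `None` and need separate handling.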


Self-Consistency Improves Chain of Thought Reasoning in Language Models

  • Introduction :

    • While CoT stylistically guides models to think through problems step by step, it typically uses greedy decoding which may produce suboptimal or inconsistent reasoning.

    • The paper observes that complex questions often admit multiple reasoning paths that yield the same correct answer.

    • Building on that intuition, self‑consistency ensembles by sampling multiple CoT outputs and voting on the most frequent final answer—effectively providing confidence through agreement across reasoning paths.

  • Methods

    1. Prompt the model with chain‑of‑thought exemplars as in standard CoT prompting.

    2. Sample multiple reasoning paths (answer) from the model (via stochastic decoding—e.g. temperature or top‑k sampling).

    3. Aggregate by choosing the answer with majority/plurality among all sampled responses.

      • This approximates marginalizing over reasoning trajectories: the model’s final output is the most consistently generated answer.
  • Conclusion :

    • Self‑consistency consistently enhances LM reasoning, offering large absolute accuracy gains on multiple tasks without extra training or annotation. Improvements range from ~4 % to over 20 % depending on the dataset.
# 1. Prompt the model with CoT (few-shot CoT exemplars would normally precede the question)
prompt = (
    "Kyle bought last year's best-selling book for $19.50. "
    "This is with a 25% discount from the original price. "
    "What was the original price of the book? "
    "Think step by step and end with 'The answer is <number>.'"
)

# 2. Sample multiple reasoning paths (answers) by decoding with a higher temperature
import re
from collections import Counter
import openai  # legacy interface (openai<1.0), kept as in the original notes

responses = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.8,
    n=20,  # generate 20 reasoning paths
)
paths = [choice["message"]["content"] for choice in responses["choices"]]

# 3. From each sampled reasoning path, extract the final answer (usually at the end of the text), and count which answer appears most often
# paths[0] >> "The original price of the book is $19.50. ... The answer is 26."
# paths[1] >> "The discounted price is 0.75x. ... The answer is $26."
answers = [float(m.group(1)) for p in paths if (m := re.search(r"answer is \$?(\d+(?:\.\d+)?)", p))]
most_common_answer = Counter(answers).most_common(1)[0][0]  # e.g. 26.0
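
The aggregation in step 3 can also be written as a formula. The notation below is an assumption for this note: $m$ is the number of sampled paths, $r_i$ the $i$-th reasoning path, and $a_i$ the final answer parsed from it. The second line works the book example by hand, so the expected majority answer can be checked.

```latex
% Unweighted majority vote over the m sampled (reasoning path, answer) pairs (r_i, a_i)
\hat{a} = \arg\max_{a} \sum_{i=1}^{m} \mathbb{1}\left(a_i = a\right)

% Worked check for the book example: a 25% discount leaves 75% of the original price x
0.75\,x = 19.50 \quad\Longrightarrow\quad x = \frac{19.50}{0.75} = 26
```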


Tree of Thoughts: Deliberate Problem Solving with Large Language Models

  • Introduction

    • LLMs have shown success in reasoning tasks when guided by Chain-of-Thought prompting, but often struggle with planning, self-evaluation, and error correction.

    • ToT addresses these limitations by formalizing the reasoning process as search over a tree of intermediate thoughts, combining LMs with classical search algorithms.

    • This framework allows models to look ahead, backtrack, and assess multiple reasoning paths—offering a more systematic and controllable approach than linear generation.

  • Methods: ToT consists of four main components

    1. Thought Decomposition – A task is broken down into a sequence of intermediate thoughts or subproblems, where each thought represents a partial solution or reasoning step. Depending on the problem, a thought could be a couple of words (Crosswords), a line of equation (Game of 24), or a whole paragraph of writing plan (Creative Writing).

    2. Thought Generator – The language model acts as a thought generator, producing multiple candidate thoughts based on the current state.

    3. State Evaluator – To decide which thoughts are promising, each state (sequence of thoughts) is evaluated and scored by using the LM to deliberately reason about states.

    4. Search Algorithm – A search algorithm (e.g., breadth-first or depth-first search with pruning) navigates the tree toward the best solution.
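
The four components above can be sketched as a small breadth-first search. This is only a toy under stated assumptions: `generate` and `evaluate` are plain Python stand-ins for the LM-based thought generator and state evaluator, and the task (pick three numbers that sum to 10) is invented for illustration. The function names and the beam-style pruning are hypothetical, not the paper's code.

```python
from typing import Callable

State = tuple[int, ...]  # a state is the sequence of thoughts chosen so far


def tot_bfs(
    generate: Callable[[State], list[int]],
    evaluate: Callable[[State], float],
    depth: int,
    beam_width: int,
) -> State:
    """Breadth-first ToT: expand every kept state, score, keep the best few."""
    frontier: list[State] = [()]
    for _ in range(depth):
        candidates = [s + (t,) for s in frontier for t in generate(s)]
        candidates.sort(key=evaluate, reverse=True)
        frontier = candidates[:beam_width]  # prune unpromising states
    return frontier[0]


TARGET = 10  # toy goal: three numbers that sum to 10


def generate(state: State) -> list[int]:
    # An LM would propose candidate thoughts conditioned on `state`.
    return [1, 3, 4, 5]


def evaluate(state: State) -> float:
    # An LM would rate how promising `state` is; here: closeness to the target.
    return -abs(TARGET - sum(state))


best = tot_bfs(generate, evaluate, depth=3, beam_width=2)
greedy = tot_bfs(generate, evaluate, depth=3, beam_width=1)
```

With `beam_width=2` the search keeps a second-best partial state that later turns out to be the only one able to hit the target exactly, while `beam_width=1` (a single greedy chain) overshoots. That is the kind of look-ahead benefit the notes attribute to searching over several thoughts.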

  • Conclusion

    • ToT is model-agnostic and can be applied with standard LMs without fine-tuning. The authors demonstrate ToT’s effectiveness on Game of 24, Creative Writing, and Mini Crosswords, where it outperforms CoT and other baselines.

    • The authors suggest that this approach could serve as a foundation for more advanced, deliberative AI systems capable of more human-like problem solving.



Graph of Thoughts: Solving Elaborate Problems with Large Language Models

  • Introduction :

    • The paper identifies the limitations of existing prompting strategies such as Chain of Thought (CoT), which follows a linear path, and Tree of Thoughts (ToT), which allows branching but remains restricted to a tree structure.

    • These methods often fail to capture the rich and iterative nature of human reasoning. To address this, GoT introduces a more generalized and expressive approach by organizing thoughts within a graph.

  • Methods : GoT represents each reasoning step as a node in a graph, while edges denote logical or causal dependencies between them.

    • Transformations of Thoughts

      • Generation of new thoughts from existing ones,

      • Aggregation of multiple thoughts into a unified form,

      • Refinement of thoughts through feedback loops.

    • System Architecture : This modular architecture allows GoT to flexibly explore reasoning paths and improve results through iterative and structured exploration.

      • Prompter : prepares the prompts to be sent to the LLM. This module is responsible for the specifics of encoding the graph structure within the prompt.

      • Parser : interprets LLM outputs back into graph updates

      • Validator : verifies whether a given LLM thought satisfies potential correctness conditions and then assigns it a score

      • Controller : manages the evolution of the graph by selecting which transformations to apply based on the task context.
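
The three transformations can be illustrated on a sorting task by splitting a list, handling the parts, and merging them back. In the sketch below every operation is ordinary Python standing in for an LLM call, and the `Thought` class, the function names, and the mapping of "sort a chunk" onto refinement are assumptions made for illustration rather than the paper's implementation.

```python
from dataclasses import dataclass, field
from heapq import merge


@dataclass
class Thought:
    content: list[int]
    parents: list["Thought"] = field(default_factory=list)  # incoming edges


def generate(parent: Thought, parts: int) -> list[Thought]:
    """Generation: derive several sub-problem thoughts from one thought."""
    size = -(-len(parent.content) // parts)  # ceiling division
    return [Thought(parent.content[i:i + size], [parent])
            for i in range(0, len(parent.content), size)]


def refine(thought: Thought) -> Thought:
    """Refinement: improve a thought (here: sort it; an LLM call in GoT)."""
    return Thought(sorted(thought.content), [thought])


def aggregate(thoughts: list[Thought]) -> Thought:
    """Aggregation: merge several thoughts into one node with many parents."""
    return Thought(list(merge(*(t.content for t in thoughts))), list(thoughts))


def score(thought: Thought) -> int:
    """Validator-style score: number of adjacent pairs that are out of order."""
    c = thought.content
    return sum(a > b for a, b in zip(c, c[1:]))


root = Thought([5, 2, 9, 1, 7, 3])
chunks = [refine(t) for t in generate(root, parts=2)]
result = aggregate(chunks)
```

The aggregated node has two parents, which is exactly what a tree cannot express. A controller would decide when to apply each operation and could use `score` to keep or discard thoughts.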

  • Conclusion

    • By organizing thoughts in a graph structure and enabling dynamic operations on them, GoT supports more robust and context-aware reasoning.

    • The experimental results show that GoT yields higher-quality results than CoT and ToT across various tasks; on sorting, the paper reports 62% better quality than ToT at more than 31% lower cost.



Self-Refine: Iterative Refinement with Self-Feedback

  • Introduction

    • Inspired by human writing processes, the authors propose Self-Refine, where a model first drafts an initial output, then provides feedback (self-critique), and finally revises its response.

    • Unlike methods that require supervised training data, additional training, or reinforcement learning, Self-Refine uses a single LLM as generator, feedback provider, and refiner, making it widely applicable across tasks.

  • Method

    • Initial Generation: The model produces a response given a task prompt $P_{gen}$.
    • Self-Critique: It analyzes its own output and provides structured feedback (e.g., pointing out inconsistencies or missing details).
      • given a task-specific prompt $P_{fb}$ for generating feedback
    • Revision: Using the feedback, the model generates a revised version of the response.
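
The generate, critique, revise cycle above can be sketched as a short loop. The three callables stand in for LLM calls with the prompts $P_{gen}$, $P_{fb}$, and a refinement prompt. The `"LGTM"` stop signal, the iteration cap, and the toy greeting task are assumptions introduced for this example.

```python
from typing import Callable


def self_refine(
    task: str,
    generate: Callable[[str], str],
    feedback: Callable[[str, str], str],
    refine: Callable[[str, str, str], str],
    max_iters: int = 3,
    stop_token: str = "LGTM",  # hypothetical "no more issues" signal
) -> str:
    """Draft once, then alternate critique and revision until the critic is satisfied."""
    output = generate(task)                      # initial generation (P_gen)
    for _ in range(max_iters):
        critique = feedback(task, output)        # self-critique (P_fb)
        if stop_token in critique:
            break
        output = refine(task, output, critique)  # revision using the feedback
    return output


# Toy stand-ins for the three LLM calls.
def toy_generate(task: str) -> str:
    return "hello"


def toy_feedback(task: str, out: str) -> str:
    if not out[0].isupper():
        return "capitalize"
    if not out.endswith("!"):
        return "add exclamation"
    return "LGTM"


def toy_refine(task: str, out: str, critique: str) -> str:
    return out.capitalize() if critique == "capitalize" else out + "!"


final = self_refine("Write an enthusiastic greeting.", toy_generate, toy_feedback, toy_refine)
```

Each pass fixes one issue raised by the critic, and the loop stops as soon as the critic has nothing left to flag or the iteration cap is reached.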
  • Conclusion

    • Self-Refine shows that language models can significantly improve their outputs through structured self-feedback and iteration.

    • The approach is simple, general-purpose, and effective, requiring minimal engineering or supervision.



Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (2020)

  • Summary : RAG (1) retrieves the top-k documents for a given query, (2) augments the generator's input by concatenating the original input x with each retrieved passage z, and (3) generates the final output.
  • Abstract
    • Background. LLMs have been shown to store factual knowledge in their parameters, and achieve SOTA results when fine-tuned on down-stream NLP tasks.
    • Problem Def. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures.
    • Contrib. We explore a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG): models which combine pre-trained parametric and non-parametric memory for language generation.
  • Method : We explore RAG models, which use the input sequence $x$ to retrieve text documents $z$ and use them as additional context when generating the target sequence $y$.
    • Retriever (DPR) $p_\eta(z | x)$ : returns a (top-K truncated) distribution over text passages given a query $x$
      • To train the retriever and generator end-to-end, we treat the retrieved document as a latent variable.
    • Generator $p_\theta(y_i | x, z, y_{1:i-1})$ : generates the current token based on a context of the previous tokens $y_{1:i-1}$, the original input $x$, and a retrieved passage $z$
      • could be modelled using any encoder-decoder.
      • To combine the input $x$ with the retrieved content $z$ when generating from BART, we simply concatenate them.
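
The retrieve-then-concatenate flow can be sketched with a toy retriever. Everything here is a stand-in: word overlap replaces DPR's dense inner product, the three passages are invented, and the `//` separator and function names are arbitrary choices for this sketch. The generator call itself is left out.

```python
from collections import Counter

# A tiny hypothetical knowledge base.
PASSAGES = [
    "The Eiffel Tower is located in Paris.",
    "BART is a sequence-to-sequence model.",
    "Paris is the capital of France.",
]


def score(query: str, passage: str) -> int:
    """Toy relevance: number of shared lowercase tokens (DPR uses dense vectors)."""
    q, p = Counter(query.lower().split()), Counter(passage.lower().split())
    return sum((q & p).values())


def retrieve(query: str, k: int) -> list[str]:
    """Return the k highest-scoring passages z for the input x."""
    return sorted(PASSAGES, key=lambda z: score(query, z), reverse=True)[:k]


def build_generator_input(x: str, z: str) -> str:
    """Concatenate a retrieved passage with the input before generation."""
    return f"{z} // {x}"


x = "What is the capital of France?"
top = retrieve(x, k=2)
inputs = [build_generator_input(x, z) for z in top]
```

Each element of `inputs` would be fed to the generator separately, and the per-passage outputs combined according to the retriever's scores, which is where treating $z$ as a latent variable comes in.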

(Figure: RAG overview)



ReAct: Synergizing Reasoning and Acting in Language Models

  • Introduction :

    • Prior work treated reasoning (via chain‑of‑thought) and acting (task‑specific steps in environments) as separate capabilities.

    • Chain‑of‑thought helps solve problems with internal logic but can hallucinate or propagate errors; acting approaches let LMs execute tasks but without an explicit reasoning process.

    • Human intelligence combines verbal reasoning and task‑oriented action. Motivated by this, the authors propose ReAct to tightly interleave reasoning and action in one agent, improving both performance and interpretability over isolated approaches

  • Method : ReAct treats the LM’s output space as containing two types: L (language reasoning traces) and A (actions).

    • The model alternates these: it reasons internally (without affecting the environment), then takes an action (like a search or navigation), receives observation feedback, and continues reasoning based on updated context.

    • This allows dynamic plan creation, commonsense injection, progress monitoring, and exception handling.

    • The prompting style is flexible—dense reasoning for QA tasks, sparse traces for action‑heavy tasks—with only a few in‑context examples required

  • Conclusion :

    • ReAct demonstrates that intertwining reasoning and acting leads to emergent capabilities beyond either alone.

    • It excels across both language reasoning and embodied decision tasks, offering enhanced accuracy, robustness, and interpretability.

    • Because reasoning traces serve as a transparent intermediate representation, errors become easier to diagnose and correct.

- Question) What is the elevation range for the area that the eastern sector of the Colorado orogeny extends into?

- Thought 1) I need to search Colorado orogeny, find the area that the eastern sector of the Colorado orogeny extends into, then find the elevation range of the area.
- Action 1) Search[Colorado orogeny]
- Observation 1) The Colorado orogeny was an episode of mountain building (an orogeny) in Colorado and surrounding areas.

- Thought 2) It does not mention the eastern sector. So I need to look up eastern sector.
- Action 2) Lookup[eastern sector]
- Observation 2) (Result 1 / 1) The eastern sector extends into the High Plains and is called the Central Plains orogeny.

- Thought 3) The eastern sector of Colorado orogeny extends into the High Plains. So I need to search High Plains and find its elevation range.
- Action 3) Search[High Plains]
- Observation 3) High Plains refers to one of two distinct land regions

- Thought 4) I need to instead search High Plains (United States).
- Action 4) Search[High Plains (United States)]
...
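
A trace like the one above comes from a loop that alternates model output with tool results. The sketch below assumes the model returns a thought followed by a final line of the form `Tool[argument]`, with `Finish[answer]` ending the episode. The scripted `llm`, the one-entry `Search` tool, and the toy question are hypothetical stand-ins and not the paper's environment.

```python
import re
from typing import Callable

ACTION_RE = re.compile(r"^(\w+)\[(.*)\]$")


def react(question: str, llm: Callable[[str], str],
          tools: dict[str, Callable[[str], str]], max_steps: int = 5) -> str:
    """Alternate Thought/Action from the LM with Observations from the tools."""
    context = f"Question: {question}\n"
    for step in range(1, max_steps + 1):
        # Assumed output format: "<thought text>\n<Tool>[<argument>]"
        thought, _, action = llm(context).rpartition("\n")
        context += f"Thought {step}: {thought}\nAction {step}: {action}\n"
        match = ACTION_RE.match(action.strip())
        if match is None:
            return context  # malformed action: stop and expose the trace
        name, arg = match.groups()
        if name == "Finish":
            return arg
        observation = tools[name](arg)  # acting changes what the LM sees next
        context += f"Observation {step}: {observation}\n"
    return context


# Scripted stand-in for the LM and a one-entry lookup table as the tool.
script = iter([
    "I should look this up.\nSearch[capital of France]",
    "The observation says Paris.\nFinish[Paris]",
])
seen_contexts: list[str] = []


def llm(context: str) -> str:
    seen_contexts.append(context)
    return next(script)


tools = {"Search": lambda q: {"capital of France": "Paris"}.get(q, "no result")}
answer = react("What is the capital of France?", llm, tools)
```

The second model call sees the first thought, action, and observation in its context, which is how reasoning and acting feed into each other.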


Chain-of-Verification Reduces Hallucination in Large Language Models

  • Abstract :

    • The paper introduces Chain-of-Verification (CoVe), a method in which the model deliberates on its own draft response in order to reduce hallucination, i.e., plausible but incorrect factual statements.

    • Instead of relying solely on direct answer generation, the same LLM drafts a response, plans verification questions, answers them independently of the draft, and then produces a final verified response.

  • Introduction

    • Step-by-step methods such as chain-of-thought (CoT) improve transparency, but the generated text can still contain factual errors that carry through to the final answer.

    • This paper makes verification an explicit stage: the model fact-checks its own draft through short verification questions, which it tends to answer more accurately than the facts stated in a long-form draft.

  • Method :

    • (1) Generate Baseline Response : Given a query, generate the response using the LLM.

    • (2) Plan Verifications: Given both query and baseline response, generate a list of verification questions that could help to self-analyze if there are any mistakes in the original response.

    • (3) Execute Verifications: Answer each verification question in turn, and hence check the answer against the original response to check for inconsistencies or mistakes.

    • (4) Generate Final Verified Response: Given the discovered inconsistencies (if any), generate a revised response incorporating the verification results.
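
The four steps can be chained around a single text-in, text-out model. In the sketch below the prompt wording, the one-question-per-line format, and the canned `fake_llm` are all assumptions for illustration. The one property it tries to preserve from the description above is that step (3) answers each verification question without seeing the baseline response.

```python
from typing import Callable


def chain_of_verification(query: str, llm: Callable[[str], str]) -> str:
    """Run the four CoVe steps with one model callable."""
    # (1) Generate baseline response
    baseline = llm(f"Answer the question.\nQ: {query}\nA:")
    # (2) Plan verifications: ask for one verification question per line
    plan = llm(
        "List verification questions, one per line, that would check the "
        f"facts in this answer.\nQ: {query}\nA: {baseline}\nQuestions:"
    )
    questions = [q.strip() for q in plan.splitlines() if q.strip()]
    # (3) Execute verifications: each question is answered on its own
    checks = [(q, llm(f"Q: {q}\nA:")) for q in questions]
    # (4) Generate the final verified response
    evidence = "\n".join(f"- {q} {a}" for q, a in checks)
    return llm(
        f"Original question: {query}\nDraft answer: {baseline}\n"
        f"Verification results:\n{evidence}\nRevised answer:"
    )


# Canned stand-in for the model; it records every prompt it receives.
calls: list[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    if prompt.startswith("Answer the question"):
        return "A and B"
    if prompt.startswith("List verification"):
        return "Is A right?\nIs B right?"
    if prompt.startswith("Original question"):
        return "A only"
    return "yes"


final = chain_of_verification("Which options are right?", fake_llm)
```

With two planned questions the model is called five times in total: once for the draft, once for the plan, twice for verification, and once for the revision.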

  • Conclusion

    • CoVe reduces hallucinations across a variety of tasks: list-based questions from Wikidata, closed-book MultiSpanQA, and long-form text generation.

    • It improves factual correctness over the baseline response without additional training, pointing toward more trustworthy LLM outputs.



Automatic Prompt Engineer : Large Language Models Are Human-Level Prompt Engineers

  • Abstract : The authors introduce Automatic Prompt Engineer (APE), which reframes prompt engineering as a form of natural language program synthesis. Rather than relying on human-crafted prompts, an LLM generates a pool of candidate instructions, then another LLM evaluates and selects the best prompt based on defined score functions.

  • Introduction :

    • Prompt quality critically influences LLM task performance, yet crafting effective prompts remains labor-intensive and human-dependent.

    • Inspired by program synthesis techniques, the authors propose treating prompts as “programs” to be automatically generated and optimized.

    • APE aims to reduce reliance on human prompt design, leveraging LLMs themselves to search, propose, and refine prompt candidates

  • Methods : The APE framework unfolds in several stages:

    • Candidate Generation: An LLM produces instruction candidates (directly infers instructions from input–output examples).

    • Scoring: Each candidate prompt is evaluated using metrics such as execution accuracy or the log-probability of correct outputs.

    • Iterative Monte Carlo Search: Based on scores, high-performing prompts are resampled and refined over multiple iterations—though the benefits of additional sampling plateau over time.
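
The scoring and selection stages can be sketched as follows, using execution accuracy as the score. In APE the candidates would be proposed by an LLM from input-output examples and `run` would call the target model. Here the candidate list is hard-coded, `toy_run` is a rule-based stand-in, and the iterative resampling stage is omitted, so this is only a sketch under those assumptions.

```python
from typing import Callable

Example = tuple[str, str]  # (input, expected output)


def execution_accuracy(instruction: str, data: list[Example],
                       run: Callable[[str, str], str]) -> float:
    """Fraction of examples where following the instruction yields the target."""
    return sum(run(instruction, x) == y for x, y in data) / len(data)


def ape_select(candidates: list[str], data: list[Example],
               run: Callable[[str, str], str]) -> tuple[str, float]:
    """Score every candidate instruction and return the best one with its score."""
    scored = [(execution_accuracy(c, data, run), c) for c in candidates]
    best_score, best = max(scored)
    return best, best_score


def toy_run(instruction: str, x: str) -> str:
    """Stand-in for 'ask the target model to follow `instruction` on input x'."""
    text = instruction.lower()
    if "upper" in text:
        return x.upper()
    if "reverse" in text:
        return x[::-1]
    return x


data = [("abc", "ABC"), ("hi", "HI")]
candidates = [
    "Reverse the input.",
    "Write the input in uppercase.",
    "Repeat the input.",
]
best, best_score = ape_select(candidates, data, toy_run)
```

Swapping `execution_accuracy` for a log-probability score changes only the scoring function; the selection step stays the same.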

  • Conclusion : APE reframes prompt engineering as an automated, black-box optimization task guided by LLMs. The findings highlight that LLMs can achieve—or even surpass—human-level prompt design with minimal human involvement. This establishes APE as a powerful tool for making LLM-based systems more efficient and adaptable. The paper lays the groundwork for future exploration in automated prompt generation, including improved scoring strategies, cross-model prompt transferability, and tighter integration with in-context learning methods