SCOPE: Self-Evolving Prompts Transform AI Agents from Static Executors to Adaptive Learners

Paper: SCOPE: Prompt Evolution for Enhancing Agent Effectiveness
Authors: Zehua Pei, Hui-Ling Zhen, Shixiong Kai, Sinno Jialin Pan, Yunhe Wang, Mingxuan Yuan, Bei Yu
Affiliations: The Chinese University of Hong Kong, Huawei Noah's Ark Lab
Code: GitHub Repository


Executive Summary

SCOPE (Self-evolving Context Optimization via Prompt Evolution) introduces a paradigm shift in how AI agents learn from experience. Rather than treating prompts as static instructions, SCOPE treats them as evolvable parameters that adapt during task execution. The framework addresses a critical "capability gap" in LLM agents: while they have access to massive execution contexts, their static prompts fail to adapt to dynamic feedback. Through a dual-memory architecture and perspective-driven exploration, SCOPE achieves remarkable performance gains: improving success rates from 14.23% to 38.64% on the Humanity's Last Exam benchmark (+24.4 percentage points) and from 32.73% to 56.97% on GAIA (+24.2 points).

Problem Statement and Motivation

Modern LLM agents suffer from two systematic failure modes that severely limit their reliability:

The "Alarm" Effect (Corrective Failure)

When an error occurs, agents treat error logs as generic alarms rather than actionable feedback. For example, when an API returns a format error with explicit correction instructions, agents often:

- Retry the same failed approach
- Fall into "error loops," repeating identical mistakes
- Fabricate data to bypass validation rather than fixing the root cause

Missed Optimization (Enhancement Failure)

Even when no error occurs, agents persist with suboptimal strategies. A search agent might use a single keyword "walks" when synonyms like "base on balls" would yield better results. Static prompts lack mechanisms to learn from successful-but-inefficient trajectories.

The Core Insight

The fundamental problem is that static prompts cannot encode the dynamic knowledge required for complex, multi-step tasks. SCOPE's key insight: treat execution trajectories as learning signals to automatically evolve prompts during task execution.

Technical Innovation: The SCOPE Framework

Architecture Overview

SCOPE operates as a meta-learning layer that sits alongside the executing agent, consisting of four specialized components:

    Agent Execution → Trigger Detection → Guideline Synthesis → Dual-Stream Routing → Memory Optimization → Prompt Update
           ↑                                                                                                      │
           └──────────────────────────────────────────── Next Step ←──────────────────────────────────────────────┘
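The loop above can be sketched in a few lines of Python. Everything here is a hypothetical stand-in (`StubAgent`, `StubGenerator`, `StubRouter`, and the way guidelines are rendered into the prompt are illustrative assumptions, not the released package's API); in SCOPE these roles are played by LLM-backed meta-agents.

```python
from dataclasses import dataclass

@dataclass
class Trace:
    done: bool = False
    error: str = ""
    suboptimal: bool = False
    result: str = ""

# Hypothetical stand-ins so the loop is executable end to end.
class StubAgent:
    base_prompt = "You are a research agent."
    def __init__(self):
        self.calls = 0
    def step(self, task, prompt):
        self.calls += 1
        # Fails until the evolved prompt names the correct tool.
        if "final_answer_tool" in prompt:
            return Trace(done=True, result="ok")
        return Trace(error="Unknown tool 'final_answer'")

class StubGenerator:
    def synthesize(self, trace):
        # Returns (guideline text, classifier confidence).
        return ("Use 'final_answer_tool', not 'final_answer'.", 0.9)

class StubRouter:
    def route(self, guideline):
        _, confidence = guideline
        return "strategic" if confidence >= 0.85 else "tactical"

def run_with_scope(task, agent, generator, router, max_steps=5):
    memory = {"tactical": [], "strategic": []}
    prompt = agent.base_prompt
    for _ in range(max_steps):
        trace = agent.step(task, prompt)              # Agent Execution
        if trace.done:
            return trace.result
        if trace.error or trace.suboptimal:           # Trigger Detection
            guideline = generator.synthesize(trace)   # Guideline Synthesis
            memory[router.route(guideline)].append(guideline)  # Dual-Stream Routing
            rules = [text for text, _ in memory["strategic"] + memory["tactical"]]
            prompt = agent.base_prompt + "\nGuidelines:\n" + "\n".join(rules)  # Prompt Update
    return ""
```

In this toy run the agent fails once, a corrective guideline lands in strategic memory, and the second step succeeds under the evolved prompt; memory optimization is omitted here for brevity.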

Component 1: Guideline Synthesis (The Generator)

The Generator analyzes execution traces to synthesize natural language rules using a "Best-of-N" strategy:

Corrective Synthesis: When errors occur, generate rules for recovery.

- Example: "Use 'final_answer_tool' not 'final_answer'"

Enhancement Synthesis: When steps complete successfully but suboptimally, generate optimization rules.

- Example: "If web search fails, try generalizing keywords or use synonyms"

Notably, 61% of synthesized rules are enhancement-based, making SCOPE a proactive optimizer rather than just a debugger.
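The Best-of-N idea reduces to "sample several candidate guidelines, keep the highest-scoring one." In this sketch, `propose` and `score` are hypothetical stand-ins for the LLM calls the Generator would actually make:

```python
def best_of_n_synthesis(trace, propose, score, n=4):
    """Sample n candidate guidelines for a trace and keep the best one."""
    candidates = [propose(trace) for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins: canned proposals of varying quality, and a scorer that
# rewards rules naming the correct tool from the error trace.
trace = {"error": "Unknown tool 'final_answer'"}
proposals = iter([
    "Retry the last step.",
    "Use 'final_answer_tool' instead of 'final_answer'.",
    "Check your spelling.",
    "Validate tool names before calling them.",
])
best = best_of_n_synthesis(
    trace,
    propose=lambda t: next(proposals),
    score=lambda rule: float("final_answer_tool" in rule),
)
print(best)  # → Use 'final_answer_tool' instead of 'final_answer'.
```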

Component 2: Dual-Stream Routing (The Memory System)

A classifier evaluates each synthesized guideline and routes it to one of two memory streams:

Tactical Memory (Task-Specific)

- Stores rules valid only for the current session
- Example: "Clean the 'Amount' column for this specific dataset"
- Confidence threshold: < 0.85

Strategic Memory (Universal Principles)

- Stores high-confidence rules that persist across future tasks
- Example: "If a website is blocked, try Archive.org as fallback"
- Confidence threshold: >= 0.85

This dual-stream architecture prevents prompt pollution while preserving valuable generalizable knowledge.
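The routing rule itself is simple enough to sketch directly. The 0.85 threshold and the two-stream split come from the summary above; the class and method names are assumptions for illustration, not the package's API:

```python
class DualStreamMemory:
    """Tactical rules live for one session; strategic rules persist."""

    def __init__(self, strategic=None, threshold=0.85):
        self.tactical = []                      # cleared at session end
        self.strategic = list(strategic or [])  # carried across tasks
        self.threshold = threshold

    def add(self, guideline: str, confidence: float) -> str:
        stream = "strategic" if confidence >= self.threshold else "tactical"
        getattr(self, stream).append(guideline)
        return stream

    def end_session(self) -> list:
        """Discard tactical rules; return what carries to the next task."""
        self.tactical = []
        return self.strategic

mem = DualStreamMemory()
mem.add("Clean the 'Amount' column for this dataset", 0.62)              # tactical
mem.add("If a website is blocked, try Archive.org as a fallback", 0.91)  # strategic
```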

Component 3: Memory Optimization (The Optimizer)

To prevent prompts from exceeding context limits or becoming contradictory, the Optimizer performs:

  1. Conflict Resolution: Merges contradictory rules into consistent guidelines
  2. Subsumption Pruning: Removes specific rules covered by more general ones
  3. Consolidation: Merges similar rules into comprehensive guidelines

The optimization triggers when domain-specific guidelines exceed 10 rules.
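A toy sketch of the pruning step. The real system uses an LLM to judge subsumption and resolve conflicts; verbatim substring containment below is only a stand-in for that semantic judgment, and the 10-rule trigger is taken from the summary:

```python
MAX_RULES_PER_DOMAIN = 10  # trigger from the summary above

def subsumption_prune(rules):
    """Drop any rule that appears verbatim inside a longer rule
    (a toy proxy for 'specific rule covered by a more general one')."""
    return [
        r for r in rules
        if not any(r != other and r in other for other in rules)
    ]

def maybe_optimize(rules):
    """Optimization runs only once a domain exceeds its rule budget."""
    return subsumption_prune(rules) if len(rules) > MAX_RULES_PER_DOMAIN else rules
```

Conflict resolution and consolidation need semantic comparison between rules, which is why SCOPE delegates those steps to a meta-agent rather than string matching.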

Component 4: Perspective-Driven Exploration

SCOPE initializes parallel agent streams with different "personas" to maximize strategy coverage:

Efficiency Stream: Optimizes for "fail-fast" logic and concise plans

- Prioritizes speed and minimal steps
- Quick hypothesis testing

Thoroughness Stream: Optimizes for exhaustive search and resilience

- Uses fallback strategies (e.g., Archive.org for blocked sites)
- Comprehensive exploration before concluding

The system executes both perspectives and selects the best outcome.
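The exploration step reduces to "run each persona, judge the outcomes, keep the winner." In this sketch the persona texts and the `run_agent`/`judge` callbacks are illustrative assumptions standing in for LLM-backed calls:

```python
from concurrent.futures import ThreadPoolExecutor

PERSONAS = {
    "efficiency": "Prefer fail-fast logic and concise, minimal-step plans.",
    "thoroughness": "Search exhaustively; use fallbacks before concluding.",
}

def explore_perspectives(run_agent, task, judge):
    """Run one agent stream per persona in parallel; keep the best result."""
    with ThreadPoolExecutor(max_workers=len(PERSONAS)) as pool:
        futures = {name: pool.submit(run_agent, task, persona)
                   for name, persona in PERSONAS.items()}
        results = {name: f.result() for name, f in futures.items()}
    return max(results.items(), key=lambda kv: judge(kv[1]))
```

Threads suffice here because each stream is I/O-bound on model calls; the doubled execution cost noted under Limitations comes from running both streams to completion before judging.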

Step-Level Adaptation

Unlike previous methods that update prompts only after task completion, SCOPE updates prompts during execution. This enables:

- Mid-task error recovery
- Real-time strategy adaptation
- Faster convergence to optimal behavior

Experimental Results

Benchmark Performance

SCOPE was evaluated on three challenging benchmarks, significantly outperforming static baselines and existing optimization methods:

| Benchmark | Baseline | SCOPE | Improvement |
|-----------|----------|-------|-------------|
| HLE (Humanity's Last Exam) | 14.23% | 38.64% | +24.4pp |
| GAIA | 32.73% | 56.97% | +24.2pp |
| DeepSearch | 14.00% | 32.00% | +18.0pp |

Domain-Specific Impact

The improvements were most dramatic in knowledge-intensive domains requiring strict protocol adherence:

| Domain | Baseline | SCOPE | Improvement |
|--------|----------|-------|-------------|
| Chemistry | 14.1% | 50.3% | +36.2pp |
| Biology | 14.9% | 43.2% | +28.3pp |
| Physics | ~15% | ~40% | +25pp |

These domains benefit most from accumulated procedural knowledge and error recovery patterns.

Ablation Study Results

Contribution of each component to GAIA accuracy:

| Component | Contribution |
|-----------|--------------|
| Perspective-Driven Exploration | +10.91% (largest) |
| Guideline Synthesis | +6.2% |
| Dual-Stream Routing | +4.1% |
| Memory Optimization | +2.8% |

Model Robustness

SCOPE demonstrates model-agnostic behavior:

- GPT-4.1 as optimizer: 46.67% GAIA accuracy
- Gemini-2.5-Pro as optimizer: 46.06% GAIA accuracy
- Gemini generated 46% more guidelines, but final performance was nearly identical

LLM Configuration Used

Functional Agents:

- Planning Agent: GPT-4.1
- Browser Agent: GPT-4.1
- Web Search Agent: Gemini-2.5-Pro
- Analyzer Agent: Gemini-2.5-Pro

SCOPE Meta-Agents:

- Generator, Selector, Classifier, Optimizer: GPT-4.1

Broader Implications

Language Adoption Evidence

The authors demonstrated that agents actively adopt the phrasing of synthesized guidelines in subsequent outputs. This indicates that evolved prompts directly influence decision-making rather than being ignored.

Per-Agent Specialization

In multi-agent systems, SCOPE evolves unique prompts for each role (Browser vs. Planner) rather than using a shared library. This prevents conflicting instructions between specialized agents.

Analogy for Understanding

Consider SCOPE as a mentor for a new employee (the Agent) given a generic training manual (Static Prompt):

  1. Guideline Synthesis: When the employee struggles, the mentor provides sticky notes: "Next time, click the blue button, not the red one"
  2. Dual-Stream Routing: Task-specific notes go on the monitor (Tactical); career wisdom goes in the permanent notebook (Strategic)
  3. Perspective-Driven: The mentor hires two employees, one fast and one thorough, and keeps whichever produces better work

Limitations and Future Directions

Current Limitations

Computational Overhead: Running parallel exploration streams doubles agent execution costs

Meta-Agent Dependencies: Quality of synthesized guidelines depends on meta-agent capabilities

Cold Start Problem: Strategic memory is initially empty, requiring bootstrap period for new domains

Future Research Opportunities

Hierarchical Memory: Multi-level abstraction for guidelines (tactical → strategic → meta-strategic)

Cross-Task Transfer: Sharing strategic memory across related task families

Efficient Exploration: Dynamic allocation of exploration budget based on task difficulty

Smaller Optimizers: Distilling optimization capabilities into efficient specialized models

Practical Applications

- Autonomous Research Agents
- Code Generation Agents
- Enterprise Automation

Conclusion

SCOPE fundamentally transforms how we think about agent prompts: from static instructions to dynamic, evolving parameters. By introducing step-level adaptation, a dual-memory architecture, and perspective-driven exploration, SCOPE achieves dramatic improvements on challenging benchmarks while maintaining a model-agnostic design.

The framework's open-source release as a lightweight Python package makes it immediately applicable to existing agent systems. For practitioners building autonomous agents, SCOPE offers a proven methodology for dramatically improving reliability without architectural overhaul.

Key Takeaway: The future of AI agents lies not in bigger models with static prompts, but in adaptive systems that learn from every execution step. SCOPE provides the blueprint.

Recommendation: Strong candidate for integration into production agent systems where reliability is paramount. The 24+ percentage point improvements on benchmark tasks demonstrate that prompt evolution is not merely incremental optimization but a fundamental capability enhancement.

Created 2026-01-04T21:36:31-08:00