Paper: SCOPE: Prompt Evolution for Enhancing Agent Effectiveness
Authors: Zehua Pei, Hui-Ling Zhen, Shixiong Kai, Sinno Jialin Pan, Yunhe Wang, Mingxuan Yuan, Bei Yu
Affiliations: The Chinese University of Hong Kong, Huawei Noah's Ark Lab
Code: GitHub Repository
SCOPE (Self-evolving Context Optimization via Prompt Evolution) introduces a paradigm shift in how AI agents learn from experience. Rather than treating prompts as static instructions, SCOPE treats them as evolvable parameters that adapt during task execution. The framework addresses a critical "capability gap" in LLM agents: while they have access to massive execution contexts, their static prompts fail to adapt to dynamic feedback. Through a dual-memory architecture and perspective-driven exploration, SCOPE achieves remarkable performance gains: success rates improve from 14.23% to 38.64% on the Humanity's Last Exam benchmark (+24.4 percentage points) and from 32.73% to 56.97% on GAIA (+24.2 percentage points).
Modern LLM agents suffer from two systematic failure modes that severely limit their reliability:
When an error occurs, agents treat error logs as generic alarms rather than actionable feedback. For example, when an API returns a format error with explicit correction instructions, agents often:
- Retry the same failed approach
- Fall into "error loops," repeating identical mistakes
- Fabricate data to bypass validation rather than fixing the root cause
Even when no error occurs, agents persist with suboptimal strategies. A search agent might use a single keyword "walks" when synonyms like "base on balls" would yield better results. Static prompts lack mechanisms to learn from successful-but-inefficient trajectories.
The fundamental problem is that static prompts cannot encode the dynamic knowledge required for complex, multi-step tasks. SCOPE's key insight: treat execution trajectories as learning signals to automatically evolve prompts during task execution.
SCOPE operates as a meta-learning layer that sits alongside the executing agent, consisting of four specialized components:
Agent Execution → Trigger Detection → Guideline Synthesis → Dual-Stream Routing → Memory Optimization → Prompt Update
       ↑                                                                                                     │
       └───────────────────────────────────────────── Next Step ←────────────────────────────────────────────┘
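The loop above can be sketched in a few lines of Python. Everything here is illustrative: the paper describes the stages (trigger detection, synthesis, routing, optimization, prompt update) but not this interface, so all function and variable names are assumptions.

```python
# Illustrative sketch of SCOPE's step-level evolution loop.
# The stage names follow the diagram above; the callables stand in
# for LLM-backed components the paper does not specify in code.

def scope_step(agent_prompt, trace, synthesize, route, optimize):
    """Run one SCOPE update given the latest execution trace."""
    triggered = trace.get("error") is not None or trace.get("suboptimal", False)
    if not triggered:
        return agent_prompt  # nothing to learn from this step
    guideline = synthesize(trace)   # Guideline Synthesis
    memory = route(guideline)       # Dual-Stream Routing
    memory = optimize(memory)       # Memory Optimization
    # Prompt Update: append consolidated guidelines to the prompt
    return agent_prompt + "\n" + "\n".join(memory)

# Toy plumbing to exercise the loop
prompt = "You are a search agent."
trace = {"error": "unknown tool 'final_answer'"}
new_prompt = scope_step(
    prompt, trace,
    synthesize=lambda t: "Use 'final_answer_tool' not 'final_answer'",
    route=lambda g: [g],
    optimize=lambda m: m,
)
```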
The Generator analyzes execution traces to synthesize natural language rules using a "Best-of-N" strategy:
Corrective Synthesis: When errors occur, generate rules for recovery.
- Example: "Use 'final_answer_tool' not 'final_answer'"
Enhancement Synthesis: When steps complete successfully but suboptimally, generate optimization rules.
- Example: "If web search fails, try generalizing keywords or use synonyms"
Notably, 61% of synthesized rules are enhancement-based, making SCOPE a proactive optimizer rather than just a debugger.
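A Best-of-N selection can be sketched as follows. The `propose` and `score` callables stand in for the LLM calls the paper does not detail, and the length-based scorer is purely a toy stand-in:

```python
# Hypothetical sketch of the "Best-of-N" synthesis strategy: sample N
# candidate guidelines from a generator and keep the highest-scoring one.

def best_of_n(trace, propose, score, n=4):
    candidates = [propose(trace, i) for i in range(n)]
    return max(candidates, key=score)

# Toy example: prefer the most specific (here, longest) candidate
rules = [
    "Retry.",
    "Use synonyms.",
    "If web search fails, try generalizing keywords or use synonyms.",
]
best = best_of_n(
    trace={"status": "suboptimal"},
    propose=lambda t, i: rules[i % len(rules)],
    score=len,
)
```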
A classifier evaluates each synthesized guideline and routes it to one of two memory streams:
Tactical Memory (Task-Specific)
- Stores rules valid only for the current session
- Example: "Clean the 'Amount' column for this specific dataset"
- Confidence threshold: < 0.85
Strategic Memory (Universal Principles)
- Stores high-confidence rules that persist across future tasks
- Example: "If a website is blocked, try Archive.org as fallback"
- Confidence threshold: >= 0.85
This dual-stream architecture prevents prompt pollution while preserving valuable generalizable knowledge.
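The routing rule itself is a simple threshold test. The 0.85 cutoff comes from the paper; the list-based memories below are an illustrative simplification:

```python
# Confidence-based router: guidelines scoring >= 0.85 go to strategic
# (persistent) memory, the rest to tactical (session-local) memory.

THRESHOLD = 0.85

def route(guideline, confidence, tactical, strategic):
    (strategic if confidence >= THRESHOLD else tactical).append(guideline)

tactical, strategic = [], []
route("Clean the 'Amount' column for this dataset", 0.60, tactical, strategic)
route("If a website is blocked, try Archive.org as fallback", 0.92, tactical, strategic)
```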
To prevent prompts from exceeding context limits or becoming contradictory, the Optimizer consolidates the guideline memory. The optimization triggers when domain-specific guidelines exceed 10 rules.
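A minimal consolidation pass might look like the following. The 10-rule trigger is from the paper; the deduplicate-and-keep-highest-confidence policy is an assumption for illustration:

```python
# Hypothetical consolidation pass: when a domain accumulates more than
# MAX_RULES guidelines, deduplicate by text (keeping each rule's highest
# confidence) and retain the top-confidence MAX_RULES entries.

MAX_RULES = 10

def maybe_optimize(rules):
    """rules: list of (text, confidence) pairs for one domain."""
    if len(rules) <= MAX_RULES:
        return rules
    # Sorting ascending first means the dict keeps the highest
    # confidence seen for each duplicated rule text.
    deduped = {text: conf for text, conf in sorted(rules, key=lambda r: r[1])}
    ranked = sorted(deduped.items(), key=lambda r: r[1], reverse=True)
    return ranked[:MAX_RULES]

# 12 distinct rules plus one duplicate with higher confidence
rules = [(f"rule-{i}", i / 20) for i in range(12)] + [("rule-3", 0.9)]
pruned = maybe_optimize(rules)
```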
SCOPE initializes parallel agent streams with different "personas" to maximize strategy coverage:
Efficiency Stream: Optimizes for "fail-fast" logic and concise plans
- Prioritizes speed and minimal steps
- Quick hypothesis testing
Thoroughness Stream: Optimizes for exhaustive search and resilience
- Uses fallback strategies (e.g., Archive.org for blocked sites)
- Comprehensive exploration before concluding
The system executes both perspectives and selects the best outcome.
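Selecting the better of two persona-conditioned runs can be sketched like this. The persona texts, the toy agent, and the scoring function are all illustrative assumptions:

```python
# Sketch of perspective-driven exploration: run two persona-conditioned
# streams on the same task and keep the better outcome.

PERSONAS = {
    "efficiency": "Prefer fail-fast logic and minimal steps.",
    "thoroughness": "Explore exhaustively; use fallbacks before concluding.",
}

def explore(task, run_agent, score):
    results = {name: run_agent(task, persona) for name, persona in PERSONAS.items()}
    best_name = max(results, key=lambda n: score(results[n]))
    return best_name, results[best_name]

# Toy agent: the thoroughness persona finds the answer via fallbacks,
# the efficiency persona gives up early
name, result = explore(
    task="find obscure fact",
    run_agent=lambda t, p: {"answer": "found"} if "fallbacks" in p else {"answer": None},
    score=lambda r: 1 if r["answer"] else 0,
)
```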
Unlike previous methods that update prompts only after task completion, SCOPE updates prompts during execution. This enables:
- Mid-task error recovery
- Real-time strategy adaptation
- Faster convergence to optimal behavior
SCOPE was evaluated on three challenging benchmarks, significantly outperforming static baselines and existing optimization methods:
| Benchmark | Baseline | SCOPE | Improvement |
|-----------|----------|-------|-------------|
| HLE (Humanity's Last Exam) | 14.23% | 38.64% | +24.4pp |
| GAIA | 32.73% | 56.97% | +24.2pp |
| DeepSearch | 14.00% | 32.00% | +18.0pp |
The improvements were most dramatic in knowledge-intensive domains requiring strict protocol adherence:
| Domain | Baseline | SCOPE | Improvement |
|--------|----------|-------|-------------|
| Chemistry | 14.1% | 50.3% | +36.2pp |
| Biology | 14.9% | 43.2% | +28.3pp |
| Physics | ~15% | ~40% | +25pp |
These domains benefit most from accumulated procedural knowledge and error recovery patterns.
Contribution of each component to GAIA accuracy:
| Component | Contribution |
|-----------|--------------|
| Perspective-Driven Exploration | +10.91% (largest) |
| Guideline Synthesis | +6.2% |
| Dual-Stream Routing | +4.1% |
| Memory Optimization | +2.8% |
SCOPE demonstrates model-agnostic behavior:
- GPT-4.1 as optimizer: 46.67% GAIA accuracy
- Gemini-2.5-Pro as optimizer: 46.06% GAIA accuracy
- Gemini generated 46% more guidelines, but final performance was nearly identical
Functional Agents:
- Planning Agent: GPT-4.1
- Browser Agent: GPT-4.1
- Web Search Agent: Gemini-2.5-Pro
- Analyzer Agent: Gemini-2.5-Pro
SCOPE Meta-Agents:
- Generator, Selector, Classifier, Optimizer: GPT-4.1
The authors demonstrated that agents actively adopt the phrasing of synthesized guidelines in subsequent outputs, showing that evolved prompts directly influence decision-making rather than being ignored.
In multi-agent systems, SCOPE evolves unique prompts for each role (Browser vs. Planner) rather than using a shared library. This prevents conflicting instructions between specialized agents.
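Keeping per-role guideline stores is straightforward; the sketch below is an illustrative data structure, not the authors' implementation:

```python
# Per-role guideline memory for a multi-agent system: each role keeps
# its own guideline list instead of sharing one library, so a rule
# evolved for the browser never pollutes the planner's prompt.
from collections import defaultdict

role_memory = defaultdict(list)

def add_guideline(role, guideline):
    role_memory[role].append(guideline)

add_guideline("browser", "If a website is blocked, try Archive.org")
add_guideline("planner", "Decompose tasks before delegating")
```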
Consider SCOPE as a mentor for a new employee (the Agent) who starts with only a generic training manual (the Static Prompt): the mentor watches each task, notes both outright mistakes and missed shortcuts, and annotates the manual in real time so the very next step benefits.
Computational Overhead: Running parallel exploration streams doubles agent execution costs
Meta-Agent Dependencies: Quality of synthesized guidelines depends on meta-agent capabilities
Cold Start Problem: Strategic memory is initially empty, requiring bootstrap period for new domains
Hierarchical Memory: Multi-level abstraction for guidelines (tactical → strategic → meta-strategic)
Cross-Task Transfer: Sharing strategic memory across related task families
Efficient Exploration: Dynamic allocation of exploration budget based on task difficulty
Smaller Optimizers: Distilling optimization capabilities into efficient specialized models
SCOPE fundamentally transforms how we think about agent prompts - from static instructions to dynamic, evolving parameters. By introducing step-level adaptation, dual-memory architecture, and perspective-driven exploration, SCOPE achieves dramatic improvements on challenging benchmarks while maintaining model-agnostic design.
The framework's open-source release as a lightweight Python package makes it immediately applicable to existing agent systems. For practitioners building autonomous agents, SCOPE offers a proven methodology for dramatically improving reliability without architectural overhaul.
Key Takeaway: The future of AI agents lies not in bigger models with static prompts, but in adaptive systems that learn from every execution step. SCOPE provides the blueprint.
Recommendation: Strong candidate for integration into production agent systems where reliability is paramount. The 24+ percentage point improvements on benchmark tasks demonstrate that prompt evolution is not merely incremental optimization but a fundamental capability enhancement.
Created 2026-01-04T21:36:31-08:00