Paper: SCOPE: Prompt Evolution for Enhancing Agent Effectiveness
Authors: Zehua Pei, Hui-Ling Zhen, Shixiong Kai, Sinno Jialin Pan, Yunhe Wang, Mingxuan Yuan, Bei Yu
Affiliations: The Chinese University of Hong Kong, Huawei Noah's Ark Lab
Code: GitHub Repository
SCOPE (Self-evolving Context Optimization via Prompt Evolution) introduces a paradigm shift in how AI agents learn from experience. Rather than treating prompts as static instructions, SCOPE treats them as evolvable parameters that adapt during task execution. The framework addresses a critical "capability gap" in LLM agents: while they have access to massive execution contexts, their static prompts fail to adapt to dynamic feedback. Through a dual-memory architecture and perspective-driven exploration, SCOPE achieves remarkable performance gains: success rates improve from 14.23% to 38.64% on the Humanity's Last Exam benchmark (+24.4 percentage points) and from 32.73% to 56.97% on GAIA (+24.2 percentage points).
Modern LLM agents suffer from two systematic failure modes that severely limit their reliability:
When an error occurs, agents treat error logs as generic alarms rather than actionable feedback. For example, when an API returns a format error with explicit correction instructions, agents often:
- Retry the same failed approach
- Fall into "error loops," repeating identical mistakes
- Fabricate data to bypass validation rather than fixing the root cause
Even when no error occurs, agents persist with suboptimal strategies. A search agent might use a single keyword "walks" when synonyms like "base on balls" would yield better results. Static prompts lack mechanisms to learn from successful-but-inefficient trajectories.
The fundamental problem is that static prompts cannot encode the dynamic knowledge required for complex, multi-step tasks. SCOPE's key insight: treat execution trajectories as learning signals to automatically evolve prompts during task execution.
SCOPE operates as a meta-learning layer that sits alongside the executing agent, consisting of four specialized components:
Agent Execution → Trigger Detection → Guideline Synthesis → Dual-Stream Routing → Memory Optimization → Prompt Update
       ↑                                                                                                     │
       └───────────────────────────────────────────── Next Step ←────────────────────────────────────────────┘
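The loop above can be sketched in a few lines of Python. Everything here is illustrative: the paper describes the stages (trigger detection, synthesis, routing, optimization, prompt update) but not this interface, so all function and variable names are assumptions.

```python
# Illustrative sketch of SCOPE's step-level evolution loop.
# The stage names follow the diagram above; the callables stand in
# for LLM-backed components the paper does not specify in code.

def scope_step(agent_prompt, trace, synthesize, route, optimize):
    """Run one SCOPE update given the latest execution trace."""
    triggered = trace.get("error") is not None or trace.get("suboptimal", False)
    if not triggered:
        return agent_prompt  # nothing to learn from this step
    guideline = synthesize(trace)   # Guideline Synthesis
    memory = route(guideline)       # Dual-Stream Routing
    memory = optimize(memory)       # Memory Optimization
    # Prompt Update: append consolidated guidelines to the prompt
    return agent_prompt + "\n" + "\n".join(memory)

# Toy plumbing to exercise the loop
prompt = "You are a search agent."
trace = {"error": "unknown tool 'final_answer'"}
new_prompt = scope_step(
    prompt, trace,
    synthesize=lambda t: "Use 'final_answer_tool' not 'final_answer'",
    route=lambda g: [g],
    optimize=lambda m: m,
)
```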
The Generator analyzes execution traces to synthesize natural language rules using a "Best-of-N" strategy:
Corrective Synthesis: When errors occur, generate rules for recovery.
- Example: "Use 'final_answer_tool' not 'final_answer'"
Enhancement Synthesis: When steps complete successfully but suboptimally, generate optimization rules.
- Example: "If web search fails, try generalizing keywords or use synonyms"
Notably, 61% of synthesized rules are enhancement-based, making SCOPE a proactive optimizer rather than just a debugger.
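A Best-of-N selection can be sketched as follows. The `propose` and `score` callables stand in for the LLM calls the paper does not detail, and the length-based scorer is purely a toy stand-in:

```python
# Hypothetical sketch of the "Best-of-N" synthesis strategy: sample N
# candidate guidelines from a generator and keep the highest-scoring one.

def best_of_n(trace, propose, score, n=4):
    candidates = [propose(trace, i) for i in range(n)]
    return max(candidates, key=score)

# Toy example: prefer the most specific (here, longest) candidate
rules = [
    "Retry.",
    "Use synonyms.",
    "If web search fails, try generalizing keywords or use synonyms.",
]
best = best_of_n(
    trace={"status": "suboptimal"},
    propose=lambda t, i: rules[i % len(rules)],
    score=len,
)
```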
A classifier evaluates each synthesized guideline and routes it to one of two memory streams:
Tactical Memory (Task-Specific)
- Stores rules valid only for the current session
- Example: "Clean the 'Amount' column for this specific dataset"
- Confidence threshold: < 0.85
Strategic Memory (Universal Principles)
- Stores high-confidence rules that persist across future tasks
- Example: "If a website is blocked, try Archive.org as fallback"
- Confidence threshold: >= 0.85
This dual-stream architecture prevents prompt pollution while preserving valuable generalizable knowledge.
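The routing rule itself is a simple threshold test. The 0.85 cutoff comes from the paper; the list-based memories below are an illustrative simplification:

```python
# Confidence-based router: guidelines scoring >= 0.85 go to strategic
# (persistent) memory, the rest to tactical (session-local) memory.

THRESHOLD = 0.85

def route(guideline, confidence, tactical, strategic):
    (strategic if confidence >= THRESHOLD else tactical).append(guideline)

tactical, strategic = [], []
route("Clean the 'Amount' column for this dataset", 0.60, tactical, strategic)
route("If a website is blocked, try Archive.org as fallback", 0.92, tactical, strategic)
```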
To prevent prompts from exceeding context limits or becoming contradictory, the Optimizer consolidates the guideline memory. The optimization triggers when domain-specific guidelines exceed 10 rules.
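A minimal consolidation pass might look like the following. The 10-rule trigger is from the paper; the deduplicate-and-keep-highest-confidence policy is an assumption for illustration:

```python
# Hypothetical consolidation pass: when a domain accumulates more than
# MAX_RULES guidelines, deduplicate by text (keeping each rule's highest
# confidence) and retain the top-confidence MAX_RULES entries.

MAX_RULES = 10

def maybe_optimize(rules):
    """rules: list of (text, confidence) pairs for one domain."""
    if len(rules) <= MAX_RULES:
        return rules
    # Sorting ascending first means the dict keeps the highest
    # confidence seen for each duplicated rule text.
    deduped = {text: conf for text, conf in sorted(rules, key=lambda r: r[1])}
    ranked = sorted(deduped.items(), key=lambda r: r[1], reverse=True)
    return ranked[:MAX_RULES]

# 12 distinct rules plus one duplicate with higher confidence
rules = [(f"rule-{i}", i / 20) for i in range(12)] + [("rule-3", 0.9)]
pruned = maybe_optimize(rules)
```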
SCOPE initializes parallel agent streams with different "personas" to maximize strategy coverage:
Efficiency Stream: Optimizes for "fail-fast" logic and concise plans
- Prioritizes speed and minimal steps
- Quick hypothesis testing
Thoroughness Stream: Optimizes for exhaustive search and resilience
- Uses fallback strategies (e.g., Archive.org for blocked sites)
- Comprehensive exploration before concluding
The system executes both perspectives and selects the best outcome.
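Selecting the better of two persona-conditioned runs can be sketched like this. The persona texts, the toy agent, and the scoring function are all illustrative assumptions:

```python
# Sketch of perspective-driven exploration: run two persona-conditioned
# streams on the same task and keep the better outcome.

PERSONAS = {
    "efficiency": "Prefer fail-fast logic and minimal steps.",
    "thoroughness": "Explore exhaustively; use fallbacks before concluding.",
}

def explore(task, run_agent, score):
    results = {name: run_agent(task, persona) for name, persona in PERSONAS.items()}
    best_name = max(results, key=lambda n: score(results[n]))
    return best_name, results[best_name]

# Toy agent: the thoroughness persona finds the answer via fallbacks,
# the efficiency persona gives up early
name, result = explore(
    task="find obscure fact",
    run_agent=lambda t, p: {"answer": "found"} if "fallbacks" in p else {"answer": None},
    score=lambda r: 1 if r["answer"] else 0,
)
```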
Unlike previous methods that update prompts only after task completion, SCOPE updates prompts during execution. This enables:
- Mid-task error recovery
- Real-time strategy adaptation
- Faster convergence to optimal behavior
SCOPE was evaluated on three challenging benchmarks, significantly outperforming static baselines and existing optimization methods:
| Benchmark | Baseline | SCOPE | Improvement |
|-----------|----------|-------|-------------|
| HLE (Humanity's Last Exam) | 14.23% | 38.64% | +24.4pp |
| GAIA | 32.73% | 56.97% | +24.2pp |
| DeepSearch | 14.00% | 32.00% | +18.0pp |
The improvements were most dramatic in knowledge-intensive domains requiring strict protocol adherence:
| Domain | Baseline | SCOPE | Improvement |
|--------|----------|-------|-------------|
| Chemistry | 14.1% | 50.3% | +36.2pp |
| Biology | 14.9% | 43.2% | +28.3pp |
| Physics | ~15% | ~40% | +25pp |
These domains benefit most from accumulated procedural knowledge and error recovery patterns.
Contribution of each component to GAIA accuracy:
| Component | Contribution |
|-----------|--------------|
| Perspective-Driven Exploration | +10.91% (largest) |
| Guideline Synthesis | +6.2% |
| Dual-Stream Routing | +4.1% |
| Memory Optimization | +2.8% |
SCOPE demonstrates model-agnostic behavior:
- GPT-4.1 as optimizer: 46.67% GAIA accuracy
- Gemini-2.5-Pro as optimizer: 46.06% GAIA accuracy
- Gemini generated 46% more guidelines, but final performance was nearly identical
Functional Agents:
- Planning Agent: GPT-4.1
- Browser Agent: GPT-4.1
- Web Search Agent: Gemini-2.5-Pro
- Analyzer Agent: Gemini-2.5-Pro
SCOPE Meta-Agents:
- Generator, Selector, Classifier, Optimizer: GPT-4.1
The authors demonstrated that agents actively adopt the phrasing of synthesized guidelines in subsequent outputs, showing that evolved prompts directly influence decision-making rather than being ignored.
In multi-agent systems, SCOPE evolves unique prompts for each role (Browser vs. Planner) rather than using a shared library. This prevents conflicting instructions between specialized agents.
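Keeping per-role guideline stores is straightforward; the sketch below is an illustrative data structure, not the authors' implementation:

```python
# Per-role guideline memory for a multi-agent system: each role keeps
# its own guideline list instead of sharing one library, so a rule
# evolved for the browser never pollutes the planner's prompt.
from collections import defaultdict

role_memory = defaultdict(list)

def add_guideline(role, guideline):
    role_memory[role].append(guideline)

add_guideline("browser", "If a website is blocked, try Archive.org")
add_guideline("planner", "Decompose tasks before delegating")
```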
Consider SCOPE as a mentor for a new employee (the Agent) who starts with only a generic training manual (the Static Prompt): the mentor watches each task, notes both outright mistakes and missed shortcuts, and annotates the manual in real time so the very next step benefits.
Computational Overhead: Running parallel exploration streams doubles agent execution costs
Meta-Agent Dependencies: Quality of synthesized guidelines depends on meta-agent capabilities
Cold Start Problem: Strategic memory is initially empty, requiring bootstrap period for new domains
Hierarchical Memory: Multi-level abstraction for guidelines (tactical → strategic → meta-strategic)
Cross-Task Transfer: Sharing strategic memory across related task families
Efficient Exploration: Dynamic allocation of exploration budget based on task difficulty
Smaller Optimizers: Distilling optimization capabilities into efficient specialized models
SCOPE fundamentally transforms how we think about agent prompts - from static instructions to dynamic, evolving parameters. By introducing step-level adaptation, dual-memory architecture, and perspective-driven exploration, SCOPE achieves dramatic improvements on challenging benchmarks while maintaining model-agnostic design.
The framework's open-source release as a lightweight Python package makes it immediately applicable to existing agent systems. For practitioners building autonomous agents, SCOPE offers a proven methodology for dramatically improving reliability without architectural overhaul.
Key Takeaway: The future of AI agents lies not in bigger models with static prompts, but in adaptive systems that learn from every execution step. SCOPE provides the blueprint.
Recommendation: Strong candidate for integration into production agent systems where reliability is paramount. The 24+ percentage point improvements on benchmark tasks demonstrate that prompt evolution is not merely incremental optimization but a fundamental capability enhancement.
Created 2026-01-04T21:36:31-08:00