Cognitive Architectures for Language Agents

Large language models (LLMs) have achieved impressive results on many natural language tasks. To build truly intelligent agents, however, we need to equip LLMs with additional capabilities: memory, reasoning, learning, and interaction with an environment. The paper "Cognitive Architectures for Language Agents" proposes a framework called CoALA to guide the development of such language agents.

Key Ideas

CoALA draws inspiration from cognitive architectures and production systems in classical AI. It outlines three main components for language agents:

  1. Memory - Agents should have both short-term working memory to track current state, and long-term memory to store knowledge (semantic memory), past experiences (episodic memory), and skills (procedural memory).

  2. Action Space - Agents can perform both internal actions (memory retrieval, reasoning, learning) and external actions (interacting with the physical world, digital interfaces, or humans).

  3. Decision Making - A central control loop allows the agent to use reasoning and memory to plan, then select and execute an action. This loops continuously as the agent operates.

By structuring agents in this modular way, CoALA provides a blueprint for moving beyond using LLMs as simple reasoners toward building sophisticated cognitive agents. The sections below develop the analogy that motivates the framework, then describe its components in detail.

Language Models as Probabilistic Production Systems

Language models and production systems both operate on strings, making them naturally analogous. Production systems define rules to rewrite strings, while language models define probability distributions over possible string completions.

More formally, we can view the task of text completion as a production. Given a prompt X and a completion Y, we have the production rule:

X → X Y

Language models assign probabilities to possible Y completions, which can be interpreted as defining a distribution over productions P(Y|X). Each time the language model is called, it samples from this distribution to generate a completion:

X ∼∼▸ X Y

While traditional production systems use deterministic rules, language models offer a probabilistic formulation. This allows them to capture the inherent uncertainty in language, but also makes them more opaque and harder to control than production systems with handcrafted symbolic rules. However, by pretraining on large datasets, language models learn a very effective prior over likely string completions, allowing them to perform well on many tasks out of the box.
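To make the analogy concrete, here is a minimal sketch in Python. The `llm` function is a hypothetical stand-in for a language model that samples a completion Y from P(Y|X); it is stubbed out with a canned distribution so the example runs:

```python
import random

def llm(x: str) -> str:
    """Hypothetical language model: sample a completion Y from P(Y | X).

    Stubbed with a fixed distribution for illustration; a real agent
    would call an actual model here.
    """
    completions = {" is blue.": 0.6, " was clear.": 0.3, " will darken.": 0.1}
    return random.choices(list(completions), weights=list(completions.values()))[0]

def apply_production(x: str) -> str:
    """One probabilistic production: X ~~> X Y, with Y sampled from the model."""
    return x + llm(x)

print(apply_production("The sky"))  # e.g. "The sky is blue."
```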

Prompt Engineering as Control Flow

The prompts given to a language model are analogous to the control flow of a production system. Different prompt formulations prioritize different completions, guiding the model to perform different tasks.

Early work on prompt engineering manipulated the prompt text to improve language model outputs for a given task. These techniques can be mathematically described as a series of productions, transforming the initial query Q:

Zero-shot: Q ∼∼▸ Q A

The simplest prompting method, zero-shot, directly feeds the query Q into the language model to generate an answer A.

Few-shot: Q → Q1 A1 Q2 A2 Q ∼∼▸ Q1 A1 Q2 A2 Q A

Few-shot prompting augments the query with example input-output pairs (Q1, A1) and (Q2, A2) before generating an answer, leveraging the model's ability to learn from examples.

Retrieval-augmented: Q → Q O ∼∼▸ Q O A

Retrieval-augmented prompts first retrieve relevant information O (e.g., from a knowledge base) based on the query, then feed both the query and retrieved context into the model to generate an answer.
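These three patterns are just string rewrites wrapped around a single model call. A sketch, reusing the hypothetical `llm` sampler from above (the prompt formats and the `retrieve` function are illustrative placeholders, not from the paper):

```python
def zero_shot(q: str) -> str:
    # Q ~~> Q A: feed the query directly to the model.
    return llm(q)

def few_shot(q: str, examples: list[tuple[str, str]]) -> str:
    # Q -> Q1 A1 ... Qk Ak Q: deterministically prepend examples, then sample A.
    context = "".join(f"{qi}\n{ai}\n" for qi, ai in examples)
    return llm(context + q)

def retrieval_augmented(q: str, retrieve) -> str:
    # Q -> Q O: deterministically fetch context, then Q O ~~> Q O A.
    o = retrieve(q)  # e.g. top passages from a knowledge base
    return llm(q + "\n" + o)
```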

Later work introduced more sophisticated prompting methods that use the language model itself to perform multi-step reasoning:

Socratic: Q ∼∼▸ Q O ∼∼▸ Q O A

Socratic prompting first uses the language model to generate relevant information O given the query, then feeds both the query and generated context back into the model to produce an answer.

Self-critique: Q ∼∼▸ Q A ∼∼▸ Q A C ∼∼▸ Q A C A

Self-critique prompting generates an initial answer A, then uses the model to critique that answer (producing C), and finally generates a revised answer based on the query, initial answer, and critique.

Selection-inference: Q → Q1 A1 Q2 A2 ... QN AN ∼∼▸ Q O ∼∼▸ Q O A

Selection-inference prompting first deterministically augments the query with a set of candidate facts or examples, then uses the model to select the most relevant ones as a context O, and finally infers an answer A from the query and the selected context.

By recursively calling the language model on its own outputs, these prompting methods enable complex, multi-step reasoning processes to be implemented. This evolution of prompting techniques mirrors the development of production systems, which began as simple string rewriting rules but grew to incorporate long inference chains and control flow guided by the system's own outputs.
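Two of these multi-step methods, sketched as chained calls on the model's own outputs (same hypothetical `llm` stub as before; in practice each call would use a task-specific prompt template):

```python
def socratic(q: str) -> str:
    o = llm(q)                # Q ~~> Q O: the model generates its own context
    return llm(q + "\n" + o)  # Q O ~~> Q O A: answer given query plus context

def self_critique(q: str) -> str:
    a = llm(q)                           # Q ~~> Q A: initial answer
    c = llm(q + "\n" + a)                # Q A ~~> Q A C: critique the answer
    return llm(q + "\n" + a + "\n" + c)  # Q A C ~~> Q A C A: revised answer
```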

As we'll see next, the similarity between language models and production systems runs deeper than these surface-level parallels: it also motivates the use of cognitive architectures to structure language model-based agents.

CoALA: A Framework for Language Agents

Cognitive Architectures for Language Agents (CoALA) provides a conceptual framework to describe and build language agents. It positions the language model as the core component of a larger cognitive architecture, surrounded by memory modules, an action space, and a decision-making procedure.

Memory Modules

As outlined above, CoALA agents pair short-term working memory, which tracks the current state, with three forms of long-term memory: semantic (knowledge), episodic (experiences), and procedural (skills).

Working memory maintains information for the current decision cycle, such as perceptual inputs, active knowledge from reasoning or memory retrieval, and the agent's current goals. It acts as a central hub, interfacing between the language model, long-term memory, and the environment.

Episodic memory stores the agent's past experiences, which can be retrieved to inform reasoning and decision making. Semantic memory stores general knowledge about the world, which can be retrieved to augment the agent's context. Procedural memory stores the agent's skills, either implicitly in the language model weights or explicitly as code libraries.
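One way to picture this split is as a pair of simple containers. The field names and types below are illustrative choices, not prescribed by CoALA:

```python
from dataclasses import dataclass, field

@dataclass
class LongTermMemory:
    semantic: list[str] = field(default_factory=list)         # general world knowledge
    episodic: list[str] = field(default_factory=list)         # records of past experiences
    procedural: dict[str, str] = field(default_factory=dict)  # named skills, e.g. code snippets

@dataclass
class WorkingMemory:
    goal: str = ""                                      # the agent's current objective
    observation: str = ""                               # latest perceptual input
    retrieved: list[str] = field(default_factory=list)  # knowledge active this cycle
```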

Action Space

CoALA divides the agent's action space into internal cognitive actions (memory retrieval, reasoning, learning) and external environment interactions (physical, dialogue, digital).

Retrieval actions read information from long-term memory into working memory. Reasoning actions use the language model to process working memory and generate new information. Learning actions write experiences, knowledge, or skills to long-term memory.

Grounding actions interface with the external world. This could involve controlling a robot, conversing with a human, interacting with a website API, or even calling a language model that interfaces with APIs (e.g. a search engine).
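Continuing the sketch, each action category can be written as a function over the memories defined above (the substring relevance test and the `env` interface are toy placeholders):

```python
def retrieval_action(wm: WorkingMemory, ltm: LongTermMemory, query: str) -> None:
    # Read from long-term memory into working memory (toy substring relevance).
    wm.retrieved += [fact for fact in ltm.semantic if query in fact]

def reasoning_action(wm: WorkingMemory) -> str:
    # Use the language model to derive new information from working memory.
    return llm(f"Goal: {wm.goal}\nContext: {wm.retrieved}\nObserved: {wm.observation}")

def learning_action(ltm: LongTermMemory, experience: str) -> None:
    # Write an experience to episodic long-term memory.
    ltm.episodic.append(experience)

def grounding_action(env, command: str) -> str:
    # Act on the external world; `env` stands in for a robot, API, or user.
    return env.step(command)
```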

Decision Making

The agent's decision-making procedure runs in a continual loop, alternating between a planning stage and an execution stage.

During planning, the agent uses reasoning and memory retrieval to propose action candidates, evaluate their expected utility, and select the best one. This enables the agent to perform multi-step lookahead, simulating the consequences of actions before committing to one.

The selected action is then executed, either performing an external grounding action or an internal learning action. The environment returns an observation, and the decision cycle begins again.
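Putting the pieces together, the decision procedure can be sketched as a loop over these actions. The candidate scoring here (a length heuristic) is a stand-in for the reasoning- or value-based utility evaluation a real agent would use:

```python
def decision_loop(env, wm: WorkingMemory, ltm: LongTermMemory, max_steps: int = 10):
    for _ in range(max_steps):
        # Planning stage: retrieve, propose candidates, evaluate, select.
        retrieval_action(wm, ltm, wm.goal)
        candidates = [reasoning_action(wm) for _ in range(3)]
        scores = [len(c) for c in candidates]  # placeholder for utility estimates
        action = candidates[scores.index(max(scores))]

        # Execution stage: ground the chosen action, observe, and learn.
        wm.observation = grounding_action(env, action)
        learning_action(ltm, f"{action} -> {wm.observation}")
```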

Case Studies

To make these concepts more concrete, the paper maps several prominent language agents onto the CoALA framework:

  1. SayCan grounds a language model in robotic control: its action space is purely external, and decision making scores candidate skills by combining language model likelihoods with learned value functions.

  2. ReAct interleaves reasoning and grounding actions in digital environments, using reasoning to update working memory between API calls, but has no long-term memory.

  3. Voyager, a Minecraft agent, adds procedural long-term memory: it writes new skills as code and stores them in a library for later retrieval and reuse.

  4. Generative Agents simulate a social community, combining episodic memory of events with semantic reflections distilled from them, retrieved by a mix of recency, importance, and relevance.

  5. Tree of Thoughts is almost entirely internal: it has no long-term memory or grounding, but uses a deliberate decision procedure to propose, evaluate, and search over reasoning steps.

By expressing these diverse agents in common terms, CoALA provides a unified perspective on their underlying structures and a roadmap for future development.

Open Questions

The paper also highlights open conceptual questions raised by the CoALA framework, such as where the boundary between an agent and its environment lies, how much of an agent's intelligence should live in the language model versus the surrounding code, and how agents should learn and adapt over time.

Conclusion

CoALA provides a thought-provoking framework for structuring research on language agents. By treating memory, actions, and decision making as first-class components, it can guide the development of language models into intelligent, interactive agents that solve real-world tasks. While many open questions remain, CoALA is a promising step toward unifying the latest LLM advances with long-standing ideas from cognitive science and AI.

References

Sumers, T. R., Yao, S., Narasimhan, K., & Griffiths, T. L. (2023). Cognitive Architectures for Language Agents. arXiv:2309.02427.
