Voyager: An Open-Ended Embodied Agent with Large Language Models

Introduction

Voyager is the first LLM (Large Language Model)-powered embodied lifelong learning agent: it continuously explores the world, acquires diverse skills, and makes novel discoveries without human intervention. The agent operates in Minecraft, a popular open-ended game that offers a rich set of tasks and interactions.

Key Components

Voyager consists of three key components (a minimal sketch of how they fit together follows the list):

  1. Automatic Curriculum: The automatic curriculum, generated by GPT-4, maximizes exploration by proposing a stream of progressively harder but achievable tasks, enabling the agent to discover new tasks and environments on its own.

  2. Ever-growing Skill Library: The skill library stores and retrieves complex behaviors as executable code. This allows the agent to learn and remember a wide range of skills over time.

  3. Iterative Prompting Mechanism: The iterative prompting mechanism incorporates environment feedback, execution errors, and self-verification for program improvement. This enables the agent to learn from its mistakes and improve its performance over time.
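Taken together, the three components form one loop: the curriculum proposes a task, the skill library recalls relevant prior code, and the iterative prompting mechanism synthesizes and refines new code. Below is a minimal structural sketch in Python; every name in it (env, llm, skill_library, iterative_prompting) is a hypothetical stand-in, not the paper's actual API.

```python
# Structural sketch of Voyager's outer loop. All objects and method names are
# hypothetical stand-ins; they only show how the three components compose.

def lifelong_learning_loop(env, llm, skill_library, iterative_prompting, max_tasks=100):
    """Repeatedly propose a task, retrieve prior skills, synthesize code, and store it."""
    for _ in range(max_tasks):
        observation = env.observe()                           # inventory, nearby blocks, biome, ...
        # 1. Automatic curriculum: GPT-4 proposes the next task from the current state.
        task = llm.propose_next_task(observation, skill_library.completed_tasks)
        # 2. Ever-growing skill library: retrieve previously learned skills relevant to the task.
        reference_skills = skill_library.retrieve(task)
        # 3. Iterative prompting: generate and refine code until self-verification passes.
        code, success = iterative_prompting(llm, env, task, reference_skills)
        if success:
            skill_library.add(task, code)                     # the new skill becomes reusable later
```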

Interaction with GPT-4

Voyager interacts with GPT-4 via black-box text queries, which bypasses the need for model parameter fine-tuning. The skills developed by Voyager are temporally extended, interpretable, and compositional, which compounds the agent's abilities rapidly and alleviates catastrophic forgetting.
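Concretely, a black-box query is just a chat-completion API call: text goes in, text comes out, and no model weights change. A minimal sketch using the OpenAI Python client (the model name and prompts here are placeholders, not the paper's exact setup):

```python
# Minimal black-box query to GPT-4: no fine-tuning, no gradients, only text in and text out.
# Requires the `openai` package and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

def query_gpt4(system_prompt: str, user_prompt: str) -> str:
    """Send a single chat request and return the model's reply as plain text."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0.0,  # low temperature keeps generated code more stable
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content
```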

Exploration in Voyager

In the Voyager paper, the agent's exploration is primarily driven by the automatic curriculum component. The automatic curriculum is designed to encourage the agent to explore new tasks and environments in a self-directed manner, without relying on predefined learning objectives or human guidance.

The automatic curriculum in Voyager is driven by prompting rather than by a hand-designed reward function. At each step, GPT-4 is asked to propose the next task given the agent's current state and its history of completed and failed tasks, under an overarching directive to discover as many diverse things as possible. The prompt also constrains the proposal: the task should be achievable with the agent's current inventory, equipment, and skills, and should not repeat what has already been accomplished. This structure naturally guides the agent to explore new areas, craft new items, and take on progressively harder challenges, without anyone explicitly defining what those tasks should be.
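To make this concrete, the directive part of the curriculum prompt might look roughly like the following. This is a paraphrased sketch, not the paper's exact prompt text.

```python
# Paraphrased sketch of the curriculum directive (not the paper's exact wording).
# The directive pushes GPT-4 toward novel but achievable tasks instead of a fixed goal list.
CURRICULUM_DIRECTIVE = """\
You propose the next task for a Minecraft agent. The ultimate goal is to
discover as many diverse things as possible and become the best Minecraft
player in the world.
Rules:
- Phrase the task in a clear imperative form, e.g. "Mine 3 iron ore".
- The task must be achievable with the agent's current inventory, equipment,
  and learned skills; it should be only slightly harder than what came before.
- Do not propose a task that has already been completed, and avoid tasks that
  recently failed for reasons that have not changed.
"""
```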

The agent's current state is passed to GPT-4 as text rather than as a learned feature vector. The curriculum prompt summarizes the agent's inventory, equipment, nearby blocks and entities, biome, time of day, health, hunger, and position, along with the lists of previously completed and failed tasks. From this context, GPT-4 proposes a next task whose difficulty is matched to what the agent can currently do, which keeps the stream of tasks novel and diverse without any explicit reward or distance-in-feature-space machinery.
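A sketch of how that state might be serialized into the curriculum prompt follows. The field names are illustrative rather than the paper's exact schema, and it reuses the query_gpt4 helper and CURRICULUM_DIRECTIVE constant from the sketches above.

```python
# Sketch of turning the agent's state into prompt text for the curriculum.
# Field names are illustrative; the paper's prompt covers inventory, equipment,
# nearby blocks/entities, biome, time, health, hunger, position, and past tasks.

def format_observation(obs: dict, completed: list[str], failed: list[str]) -> str:
    """Render the agent's state and task history as plain text."""
    return (
        f"Biome: {obs['biome']}\n"
        f"Time: {obs['time_of_day']}\n"
        f"Health: {obs['health']}/20, Hunger: {obs['hunger']}/20\n"
        f"Position: {obs['position']}\n"
        f"Nearby blocks: {', '.join(obs['nearby_blocks'])}\n"
        f"Inventory: {obs['inventory']}\n"
        f"Completed tasks so far: {', '.join(completed) or 'None'}\n"
        f"Failed tasks that are too hard: {', '.join(failed) or 'None'}"
    )

def propose_next_task(obs: dict, completed: list[str], failed: list[str]) -> str:
    """Ask GPT-4 for the next task, given the formatted state."""
    return query_gpt4(CURRICULUM_DIRECTIVE, format_observation(obs, completed, failed))
```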

As the agent works through the proposed tasks, it naturally encounters new challenges. For example, it might discover that it needs to craft a certain tool to mine a new type of block, or that it needs to build a shelter to avoid taking damage at night. These follow-up tasks emerge from the environment dynamics and from the curriculum's push toward novelty and diversity.

Importantly, Voyager does not optimize a numeric reward with reinforcement learning. There are no gradient updates, no value functions, and no policy parameters to train: GPT-4 is queried purely as a black box, consistent with the training-free design described above.

Instead of Q-learning or policy-gradient updates, improvement happens at the level of programs through the iterative prompting mechanism. When the curriculum proposes a task, the agent prompts GPT-4, with its current state, the task, and relevant skills retrieved from the library, to write code that attempts the task. The code is executed in Minecraft; environment feedback (in-game observations and chat messages) and execution errors (interpreter tracebacks) are fed back to GPT-4, and a separate self-verification step checks whether the task was actually completed. The cycle repeats until verification succeeds or a retry budget is exhausted, and successful programs are added to the skill library for future use.
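A sketch of that refinement loop is below. It reuses the query_gpt4 helper from earlier, while execute_in_minecraft and self_verify are hypothetical stand-ins for running the generated JavaScript in the game (e.g. through Mineflayer) and for the GPT-4 self-verification call.

```python
# Sketch of the iterative prompting mechanism: generate code, run it, feed back
# errors and environment feedback, and stop once self-verification passes.
# `execute_in_minecraft` and `self_verify` are hypothetical stand-ins.

def iterative_prompting(task: str, reference_skills: str, max_rounds: int = 4):
    code, feedback, errors = "", "", ""
    for _ in range(max_rounds):
        prompt = (
            f"Task: {task}\n"
            f"Reference skills:\n{reference_skills}\n"
            f"Previous code:\n{code}\n"
            f"Execution errors:\n{errors}\n"
            f"Environment feedback:\n{feedback}\n"
            "Write an async JavaScript function (Mineflayer API) that completes the task."
        )
        code = query_gpt4("You write Minecraft control code.", prompt)
        feedback, errors = execute_in_minecraft(code)     # chat messages + interpreter errors
        success, critique = self_verify(task, feedback)   # a second GPT-4 call judges completion
        if success:
            return code, True
        feedback += f"\nCritique: {critique}"
    return code, False
```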

This combination of a GPT-4-driven curriculum and flexible, code-based skill learning lets Voyager engage in open-ended lifelong learning in the Minecraft environment, continuously expanding its capabilities and knowledge over time. Because skills are stored as interpretable, compositional programs, new abilities build on old ones and are not forgotten; such behaviors would be difficult to obtain by training a low-level policy from scratch with reinforcement learning.
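As a final sketch, the skill library can be pictured as an embedding index over skill descriptions: each stored program is keyed by an embedding of a short natural-language description of what it does, and new tasks retrieve the most similar programs for reuse. The embedding model below is a placeholder choice, not necessarily the one used in the paper.

```python
# Minimal sketch of an embedding-indexed skill library: programs are stored with
# an embedding of their description and retrieved by cosine similarity.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    out = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(out.data[0].embedding)

class SkillLibrary:
    def __init__(self):
        self.skills = []  # list of (description, code, embedding) triples

    def add(self, description: str, code: str) -> None:
        self.skills.append((description, code, embed(description)))

    def retrieve(self, query: str, top_k: int = 5) -> list[str]:
        """Return the code of the top_k stored skills most similar to the query."""
        q = embed(query)
        scored = [
            (float(q @ e) / (np.linalg.norm(q) * np.linalg.norm(e) + 1e-9), code)
            for _, code, e in self.skills
        ]
        return [code for _, code in sorted(scored, reverse=True)[:top_k]]
```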

Results

The paper presents empirical results that demonstrate Voyager's strong in-context lifelong learning capability and exceptional proficiency in playing Minecraft. Voyager obtains 3.3x more unique items, travels 2.3x longer distances, and unlocks key tech tree milestones up to 15.3x faster than prior state-of-the-art techniques. Additionally, Voyager is able to utilize the learned skill library in a new Minecraft world to solve novel tasks from scratch, while other techniques struggle to generalize.

Conclusion

Voyager represents a significant step towards open-ended embodied agents that can learn and adapt to new environments and tasks over time. The agent's ability to continuously explore and acquire new skills without human intervention makes it a promising approach for developing autonomous agents that can operate in complex and dynamic environments.

Created 2024-04-13T22:02:03-07:00