EnvGen: Generating and Adapting Environments via LLMs
for Training Embodied Agents

Abhay Zala*, Jaemin Cho*, Han Lin, Jaehong Yoon, Mohit Bansal
University of North Carolina at Chapel Hill
*: equal contribution

COLM 2024
Comparison of different methods for creating embodied agents. Previous works commonly use (a) small RL agents or (b) LLM agents to explore skills. In (c) EnvGen, we train a small RL agent with diverse LLM-generated environments that train different skills in parallel and can be adapted via feedback to help the agent progressively improve the skills it is weaker at. Our method benefits from the world knowledge of LLMs while maintaining efficient training through a lightweight RL agent.

Abstract

Recent state-of-the-art approaches for embodied learning via interaction directly employ large language models (LLMs) as agents to determine the next steps in an environment. Due to their world knowledge and reasoning capabilities, LLM agents achieve stronger performance than previous smaller agents based on reinforcement learning (RL); however, frequently calling LLMs is slow and expensive. This raises an interesting question: instead of directly employing LLMs as embodied agents, can we use LLMs' reasoning capabilities to adaptively create training environments that help smaller embodied RL agents learn useful skills they are weak at?

In this work, we propose EnvGen, a novel framework to address this question. First, we prompt an LLM to generate training environments that allow agents to quickly learn different tasks in parallel. Concretely, the LLM is given the task description and the simulator objectives that the agent should learn, and is then asked to generate a set of environment configurations (e.g., different terrains, items initially given to the agent, the chance of finding certain objects, etc.). Next, we train a small RL agent in a mixture of the original and LLM-generated environments. Then, we enable the LLM to continuously adapt the generated environments to progressively improve the skills the agent is weak at, by providing the LLM with feedback in the form of the agent's performance.

We demonstrate the usefulness of EnvGen with comprehensive experiments in the Crafter and Heist game environments. We find that a small RL agent trained with EnvGen can outperform SOTA methods, including a GPT-4 agent, and learns long-horizon tasks significantly faster. We also show that using an LLM to adapt environments dynamically outperforms curriculum learning approaches, and we show qualitatively how the LLM adapts training environments to help improve the RL agent's weaker skills over time. Additionally, EnvGen is substantially more efficient, as it uses only a small number of LLM calls (e.g., 4 in total), whereas LLM agents require one or more LLM calls per step (resulting in thousands of LLM calls per episode). Lastly, we present detailed ablation studies for EnvGen's design choices.

Method

In EnvGen, we generate multiple environments with an LLM so that the agent can learn different skills effectively. Training proceeds over N_cycle cycles, each consisting of the following four steps (a code sketch of the full cycle follows the step list).

Step 1: We provide an LLM with a prompt composed of four components (i.e., task description, environment details, output template, and feedback from the previous cycle), and ask the LLM to fill the template and output various environment configurations that can be used to train agents on different skills.
Step 2: We train the RL agent in multiple LLM-generated environments (i.e., LLM environments), so that it can learn different useful skills in parallel.
Step 3: We first train the RL agent in the original environment to mitigate overfitting to the LLM environments. Then we measure the current RL agent’s performance in different tasks in the original environment to check which skills/tasks the agent is still weak at.
Step 4: We provide the LLM with the agent performance from the original environment (measured in step 3) as feedback for adapting the LLM environments in the next cycle to focus on the weaker performing skills.
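To make the four-step cycle concrete, below is a minimal Python sketch of one way the training loop could be organized. All function names, the prompt fields, the example configurations, and the step budgets are hypothetical placeholders for illustration, not the released EnvGen implementation.

```python
# Hypothetical sketch of an EnvGen-style training cycle. Function names,
# the prompt format, the stub return values, and the step budgets are
# illustrative assumptions, not the authors' released code.
import random

N_CYCLES = 4  # a small number of cycles -> only a few LLM calls in total


def llm_generate_env_configs(prompt, feedback):
    """Step 1 (stub): ask the LLM to fill the output template with env configs."""
    # A real implementation would send `prompt` plus `feedback` to an LLM and
    # parse the environment configurations (e.g., JSON) it returns.
    return [
        {"terrain": "mountain", "initial_inventory": {"wood_pickaxe": 1}},
        {"terrain": "cave", "spawn_rate": {"coal": 0.5, "iron": 0.3}},
    ]


def train_agent(agent, env_configs, steps):
    """Steps 2 and 3a (stub): train the small RL agent (e.g., PPO) in the given envs."""
    return agent


def evaluate_agent(agent, env_config):
    """Step 3b (stub): measure per-task success rates in the original environment."""
    return {"collect coal": random.random(), "make stone pickaxe": random.random()}


def format_feedback(success_rates):
    """Step 4: turn measured performance into textual feedback for the LLM."""
    return "; ".join(f"'{task}' is {rate:.0%}" for task, rate in success_rates.items())


base_prompt = {
    "task_description": "...",     # what the agent should ultimately achieve
    "environment_details": "...",  # objects, terrains, achievements in the simulator
    "output_template": "...",      # schema the LLM should fill in
}
original_env = {"name": "original"}
agent, feedback = object(), None  # placeholder agent; no feedback before cycle 1

for cycle in range(N_CYCLES):
    env_configs = llm_generate_env_configs(base_prompt, feedback)  # Step 1
    agent = train_agent(agent, env_configs, steps=100_000)         # Step 2
    agent = train_agent(agent, [original_env], steps=100_000)      # Step 3a
    success_rates = evaluate_agent(agent, original_env)            # Step 3b
    feedback = format_feedback(success_rates)                      # Step 4
```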

Experiments

Below, we demonstrate the usefulness of the EnvGen method with comprehensive experiments and analysis. Please see the paper for additional analysis and ablation studies on EnvGen's design choices.

Comparison with State-of-the-Art Methods on Crafter Environment

Comparison of different agents in the Crafter (Hafner, 2022) environment. Following previous works, we report the score (the geometric mean of success rates across its 22 achievements) and the reward after 1M Crafter steps. We apply EnvGen to two models, PPO and Achievement Distillation. *: scores from the Crafter Scoreboard and Moon et al. (2023). †: average number of LLM calls needed to run a single episode, according to SPRING (Wu et al., 2023). PT: pretraining; AD: Achievement Distillation; ±: one standard deviation.
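For reference, the Crafter score is the geometric mean of per-achievement success rates, computed (following Hafner, 2022) as exp(mean(ln(1 + s_i))) − 1 with s_i in percent. A minimal sketch, with made-up success rates:

```python
# Sketch of the Crafter score: geometric mean of per-achievement success
# rates s_i (in percent), following Hafner (2022). Example rates are made up.
import math

def crafter_score(success_rates_percent):
    n = len(success_rates_percent)
    return math.exp(sum(math.log(1.0 + s) for s in success_rates_percent) / n) - 1.0

print(crafter_score([90.0, 35.0, 5.0]))  # hypothetical rates for 3 achievements -> ~26.0
```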

Comparison with Curriculum Learning Approaches

Comparison of RL agents trained in Crafter (Hafner, 2022) using no curriculum, an easy-to-hard curriculum, an adversarial curriculum, and our adaptive+dynamic environments. Agents are trained for 0.96M steps with the respective curriculum and then for 1M steps in the default Crafter environment.



Detailed Achievement Analysis on Crafter Environment

Success rates for all the Crafter achievements of two PPO agents – (1) Baseline: trained in Crafter for 1.96M steps, and (2) Ours: trained for 0.96M steps in CrafterEnvGen and for 1M steps in Crafter. Notably, training in CrafterEnvGen significantly improves the scores of long-horizon achievements (with many prerequisites) such as 'make stone pickaxe', 'make iron pickaxe', and 'make iron sword'.



Unlock times (the first moment when the agent completed an achievement) for three long-horizon achievements ('make stone pickaxe', 'make iron pickaxe', and 'make iron sword') of two PPO agents – (1) Baseline: trained in Crafter for 1.96M steps, and (2) Ours: trained for 0.96M steps in CrafterEnvGen and for 1M steps in Crafter. The plot shows the last 1M training steps out of 1.96M steps. Our agent trained in CrafterEnvGen environments unlocks the achievements much more quickly than the baseline agent trained only in the Crafter environment.



Adaptation of Training Environments Helps the Agent Improve Weaker Skills

Adaptation of training environments based on agent performance over EnvGen cycles. At the end of each cycle, the RL agent's performance is given to the LLM as feedback (e.g., 'Collect coal is 2%'). The LLM uses the feedback to adaptively generate new environments that help the agent progressively tackle the skills it was previously weak at. As training proceeds, our RL agent trained with EnvGen improves more rapidly than the baseline agent trained only in Crafter, by adaptively focusing learning on previously weaker skills (i.e., 'collect coal' and 'make stone pickaxe').
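As a concrete illustration of this feedback loop, the per-task success rates measured in the original environment could be serialized into a short message for the LLM roughly as follows; the numbers, the choice of tasks, and the wording are hypothetical.

```python
# Hypothetical sketch of turning measured success rates into the textual
# feedback sent to the LLM at the end of a cycle. Values are illustrative.
success_rates = {              # measured in the original Crafter environment
    "collect coal": 0.02,
    "make stone pickaxe": 0.05,
    "collect wood": 0.93,
}

# Highlight the two weakest skills so the next cycle's environments focus on them.
weakest = [task for task, _ in sorted(success_rates.items(), key=lambda kv: kv[1])[:2]]

feedback = (
    "Agent performance in the previous cycle: "
    + "; ".join(f"'{task}' is {rate:.0%}" for task, rate in success_rates.items())
    + ". Please generate environments that help the agent practice: "
    + ", ".join(weakest) + "."
)
print(feedback)
```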



Evaluation on Heist Environment

We also evaluate the effectiveness of the EnvGen framework in another game environment – Heist, a maze-navigation game. Training an agent in HeistEnvGen environments improves performance, increasing the average score (25.9% → 37.7%) and reward (4.1 → 5.5), while also stabilizing training by reducing the score variance (the standard deviation drops from 13.2% to 7.5%).

BibTeX

@inproceedings{Zala2024EnvGen,
      author    = {Abhay Zala* and Jaemin Cho* and Han Lin and Jaehong Yoon and Mohit Bansal},
      title     = {EnvGen: Generating and Adapting Environments via LLMs for Training Embodied Agents},
      year      = {2024},
      booktitle = {COLM},
}