GeomGym

About GeomGym

In this project, we simulate a physical, dynamic environment and place agents in it with a two-step task:

(1) to find the target amongst several objects with various properties through visual and tactile information, and

(2) to bring this target object to a designated target area.

To solve this task, the agents must acquire several skills relevant to exploring and interacting with their environment.

As the agents are not explicitly told which skills they need to acquire and how to acquire them, we are investigating their learning process and how it can be optimized using different reward systems.

Reinforcement Learning

  • RL is a type of machine learning where the agents learn to make decisions by interacting with an environment
  • The agents learn by trial and error, getting rewards for doing well and penalties for mistakes
  • The agents improve their behavior to maximize rewards and achieve their goal


How Does it Work?

  • The agents observe their states
  • They follow a policy, which is a set of rules that tells them what actions to take
  • The agents perform actions
  • The environment changes based on that action, and the agents receive rewards
  • This loop continues until the agents get really good at making decisions
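The loop above can be sketched in a few lines of Python (a toy one-dimensional environment for illustration, not our actual simulation):

```python
def policy(state):
    # Hypothetical hand-written policy: step toward the target at position 10.
    return 1 if state < 10 else -1

def step(state, action):
    # Toy environment: apply the action and reward the agent for being close to 10.
    new_state = state + action
    reward = -abs(10 - new_state)
    return new_state, reward

state = 0
total_reward = 0
for t in range(20):                        # one short episode
    action = policy(state)                 # the agent observes its state and acts
    state, reward = step(state, action)    # the environment changes and pays a reward
    total_reward += reward
    if state == 10:                        # goal reached: the episode ends
        break
```

In real reinforcement learning the policy is learned rather than hand-written; the reward signal is what drives that learning.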

Proximal Policy Optimization (PPO)

  • PPO is our training algorithm; it helps improve the agents' decision-making process
  • PPO balances between exploring new actions and sticking with learned strategies
  • PPO limits how much the agent’s policy can change at once, making learning more stable.
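The clipping step can be illustrated with the standard PPO clipped surrogate objective (a generic sketch, not our exact implementation; epsilon = 0.2 is a common default):

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    # ratio = probability of the action under the new policy / under the old policy
    clipped_ratio = max(min(ratio, 1 + epsilon), 1 - epsilon)
    # PPO keeps the more pessimistic of the two objectives,
    # so large policy changes cannot produce large updates
    return min(ratio * advantage, clipped_ratio * advantage)

# A ratio far from 1 is clipped, capping the incentive to change the policy:
print(clipped_surrogate(1.5, advantage=1.0))   # clipped to 1.2 * 1.0
print(clipped_surrogate(0.5, advantage=-1.0))  # clipped to 0.8 * -1.0
```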

What's Next?

Our agents are currently being trained on high-performance computing systems to tackle these challenges.

Next, we will expand the experiment:

  • Increase the number of agents, which drastically raises the complexity of the task
  • Add more parameters and variables (such as ground texture, object shapes, movement methods, environment shape), making the task more interesting and harder to solve
  • Use this set-up to look into interesting emergent behaviors like language

Introduction

This study explores language’s transformative role in a 3D simulation, where reinforcement learning agents communicate to match target objects. By incorporating embodiment, it extends traditional signaling games, revealing efficient collaboration mechanisms between agents. The findings underscore the robust and efficient emergence of language within the experimental framework.

Objective

  • Extend the conventional signaling game with embodiment, paving the way for future studies on embodied language emergence.
  • Explore partially emergent sender-receiver communication in a collaborative experiment: efficiently completing a rewarded matching task in which the agents are separated by an obstacle.
  • Provide a proof of concept by sharing our experimental results on grounded language rooted in the physical properties of the simulated multi-agent environment.

Figure 1: Top-down view of a level of the environment (rendered with Unreal Engine 5)

Figure 2: Schematic visualization of the final setup after successful training

Methodology

Environment


Image: Agents receive 30 floating-point values as observations, encoding camera data into a meaningful latent space.

Language: During training, the receiver gets language inputs from a predefined algorithm; in deployment, they come from the sender.

Done Function: A training episode terminates when an agent collides with a border wall or a potential target, or flips onto its head.

Reward Function:

The receiver gets a positive reward for approaching the correct target, scaled by a constant coefficient 𝑘, where 𝑑ₜ denotes the distance to the target at timestep 𝑡.

\( R_{\text{receiver}} = k \times (d_t - d_{t+1}) \)
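A sketch of this reward in code (distances are Euclidean; k = 1.0 is a placeholder value, not our actual constant):

```python
import math

def receiver_reward(pos_before, pos_after, target, k=1.0):
    """Reward proportional to how much closer the receiver got to the target."""
    d_t  = math.dist(pos_before, target)   # distance at timestep t
    d_t1 = math.dist(pos_after, target)    # distance at timestep t + 1
    return k * (d_t - d_t1)                # positive when approaching the target
```

Moving away from the target yields a symmetric negative reward, so the shaping cannot be exploited by looping paths.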


The sender is trained with a mean-squared-error reward function, using its actions before one-hot encoding as predictions and the correct utterance as labels; this allows it to choose not to utter anything without penalty.

\[ R_{\text{Utterance}} = -\frac{1}{N} \sum_{i=1}^{N} (p_i - u_i)^2 \times \text{IsMeaningful} \]
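This reward could be sketched as follows (assuming `is_meaningful` is a flag that is false when the sender stays silent, so silence is neither rewarded nor punished):

```python
def sender_reward(predictions, labels, is_meaningful):
    """Negative mean-squared error between the sender's pre-one-hot actions
    and the correct utterance; zero when the sender utters nothing."""
    if not is_meaningful:
        return 0.0
    n = len(predictions)
    return -sum((p - u) ** 2 for p, u in zip(predictions, labels)) / n
```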

An additional reward is used to encourage exploring the entire level, based on the trajectory up to the current timestep 𝑡, considering the last 20 timesteps.

\[ R_{\text{Exploration}}(t) = \sum_{i=1}^{t} \left( \sqrt{(x_t - x_{t-i})^2 + (y_t - y_{t-i})^2} \times d^i \right) \]
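A sketch of this bonus (assuming 𝑑 is a decay factor in (0, 1) and the trajectory is a list of (x, y) positions; the 20-step window follows the text above):

```python
import math

def exploration_reward(trajectory, d=0.9, window=20):
    """Decayed sum of distances from the current position to recent positions."""
    x_t, y_t = trajectory[-1]
    reward = 0.0
    # i = 1 is the previous timestep, i = 2 two steps back, and so on
    for i in range(1, min(window, len(trajectory) - 1) + 1):
        x_p, y_p = trajectory[-1 - i]
        reward += math.hypot(x_t - x_p, y_t - y_p) * d ** i
    return reward
```

An agent that keeps moving away from its recent positions collects a larger bonus than one that circles in place.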

Agent

  • We use our own PPO implementation, which is highly optimized for MuJoCo tasks.
  • Non-recursive communication enables direct information exchange between agents and reduces complexity.
  • A vanilla autoencoder preprocesses visual data from the Agents for reinforcement learning.
  • The autoencoder compresses images into a lower-dimensional latent space, aiming to minimize reconstruction error and learn a meaningful representation.
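The idea can be sketched with a hand-wired linear encoder/decoder (purely illustrative dimensions and weights; the real model is trained on camera images and produces the 30-value latent observation mentioned above):

```python
def matvec(M, v):
    # Plain matrix-vector product, so the sketch needs no external libraries.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# Hypothetical tiny autoencoder: 4-dimensional input, 2-dimensional latent space.
W_enc = [[1, 0, 0, 0],
         [0, 1, 0, 0]]       # encoder weights (4 -> 2), normally learned
W_dec = [[1, 0],
         [0, 1],
         [0, 0],
         [0, 0]]             # decoder weights (2 -> 4), normally learned

def reconstruction_error(x):
    latent = matvec(W_enc, x)      # compress the input into the latent space
    x_hat = matvec(W_dec, latent)  # reconstruct the input from the latent code
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
```

Training adjusts the weights to minimize this error over the data, which forces the latent code to keep the most informative components of the input.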

Analysis

Average Reward: whether the agent learns to move to the target object.

Average Length: the average episode length; decreases over time as the receiver learns to move toward target objects.

Accuracy: the share of correct picks by the receiver with respect to the reference object; measures the effectiveness of training.

Variance: how often the receiver touches different target objects; captures the agent's exploration behavior and decision-making.

Language Analysis: task performance, duration, symbol perplexity, symbol consistency, and symbol redundancy.
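The accuracy and variance metrics can be computed from an episode log (hypothetical data format: one picked-object id per episode plus the correct id; the spread measure below is one simple proxy, not necessarily the exact statistic we use):

```python
from collections import Counter

def accuracy(picks, targets):
    """Share of episodes in which the receiver picked the correct object."""
    return sum(p == t for p, t in zip(picks, targets)) / len(picks)

def pick_spread(picks, num_objects):
    """Variance of per-object pick frequencies: zero when picks are spread
    evenly over all objects, larger when the receiver fixates on a few."""
    counts = Counter(picks)
    freqs = [counts[o] / len(picks) for o in range(num_objects)]
    mean = 1 / num_objects
    return sum((f - mean) ** 2 for f in freqs) / num_objects
```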

Experiment Results

The accuracy of the receiver:

  • The receiver reached an accuracy rate exceeding 80 percent in selecting the correct target object.

The accuracy of the sender:

  • The sender exhibited a very good level of proficiency, accurately selecting the appropriate word in more than 90 percent of instances.
  • There was no observed symbol redundancy, as the language structure was predefined.

The reward of the sender:

  • The sender agent achieved a low language perplexity, aligning with our predictions and affirming the successful integration of the given language structure.

Conclusion and Outlook

This research advances the traditional reference game by incorporating embodiment, paving the way for the emerging field of embodied language.

Despite using fixed agent vocabularies, the study lays the foundation for a fully emergent setup. Moving forward, one of the most important milestones will be the switch to a fully emergent language setup. Further points to be addressed are longer sequences, recursion, and complex object attributes. Environmental factors, including shared world space and bidirectional setups, have additionally been identified as crucial long-term goals in extending this research.

Environment Creation

We created a dynamic environment using the MuJoCo engine, known for its realistic physics simulations. MuJoCo accurately and efficiently models complex interactions, like collision and movement, making it perfect for RL tasks. To set up the environment, we used our Blueprints library, which translates Python code into XML to design a structured space with walls, agents, distractor objects, target items, and sensors.
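The Python-to-XML translation can be sketched with the standard library (a minimal illustration using real MJCF element names, not our actual Blueprints API):

```python
import xml.etree.ElementTree as ET

def build_level(wall_positions):
    """Generate a minimal MJCF document with a floor and box-shaped walls."""
    mujoco = ET.Element("mujoco", model="geomgym_level")
    world = ET.SubElement(mujoco, "worldbody")
    ET.SubElement(world, "geom", type="plane", size="5 5 0.1")   # the floor
    for i, (x, y) in enumerate(wall_positions):
        body = ET.SubElement(world, "body", name=f"wall_{i}", pos=f"{x} {y} 0.5")
        ET.SubElement(body, "geom", type="box", size="0.1 1 0.5")
    return ET.tostring(mujoco, encoding="unicode")

xml_str = build_level([(2, 0), (-2, 0)])
```

Generating the XML programmatically like this makes it easy to randomize positions, sizes, and object counts on every reset.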

By randomizing the objects’ properties and changing conditions, the environment stays unpredictable and challenging, which helps prevent the agents from learning just one specific solution (also known as overfitting) and encourages more adaptable strategies.

The Agents and their Task

Our agents experience their environment through a laser-based sensor system that gives them complete information about their surroundings. But just like humans growing up, they have to learn to interpret this information to build an understanding of their surroundings.

To identify the target object, the agents have to learn how it looks and how it reacts when they push or grab it.

Finally, the agents have to learn the task itself. They are rewarded when they place the target object in the target area, but we don’t explicitly tell them what to do. They have to figure out what to do on their own and need to adjust their behavior according to the rewards and penalties they receive. These are only some examples of skills that the agents must learn, and there are many more necessary to solve the task at hand.

Development

Coming soon

Frequently Asked Questions


Where are the questions and answers?

Frequently asked questions and answers will be added in the near future.