In this project, we simulate a physical, dynamic environment and place agents in it with a two-step task:
(1) to find the target amongst several objects with various properties through visual and tactile information, and
(2) to bring this target object to a designated target area.
To solve this task, the agents must acquire several skills relevant to exploring and interacting with their environment.
As the agents are not explicitly told which skills they need to acquire or how to acquire them, we are investigating their learning process and how it can be optimized using different reward systems.
Our agents are currently being trained on high-performance computing systems to tackle these challenges.
Next, we will expand the experiment.
Figure 1: Top-down view of a level of the environment (rendered with the Unreal Engine 5)
Figure 2: Schematic visualization of the final setup after successful training
This study explores language’s transformative role in a 3D simulation, where reinforcement learning agents communicate to match target objects. By incorporating embodiment, it extends traditional signaling games [1], revealing efficient collaboration mechanisms between agents. The findings underscore the robust and efficient emergence of language within the experimental framework.
Image: Agents receive 30 floating-point values as observations, encoding the camera data into a meaningful latent space.
Language: During training, the receiver gets language inputs from a predefined algorithm; in deployment, they come from the sender.
Done Function: A training episode terminates when the agent collides with a border wall or a potential target, or flips onto its head.
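For concreteness, here is a minimal Python sketch of how a receiver observation could be assembled from the 30-value visual latent and the received symbol, together with a termination check matching the done conditions above. The vocabulary size, array layout, and function names are assumptions for illustration, not the project's actual code.

```python
import numpy as np

LATENT_DIM = 30   # camera data encoded into a 30-value latent vector (from the text)
VOCAB_SIZE = 8    # hypothetical size of the symbol vocabulary

def build_receiver_observation(visual_latent: np.ndarray, symbol_id: int) -> np.ndarray:
    """Concatenate the visual latent with a one-hot encoding of the received symbol."""
    assert visual_latent.shape == (LATENT_DIM,)
    symbol_one_hot = np.zeros(VOCAB_SIZE)
    symbol_one_hot[symbol_id] = 1.0
    return np.concatenate([visual_latent, symbol_one_hot])

def is_done(hit_border_wall: bool, touched_potential_target: bool, body_up_z: float) -> bool:
    """An episode terminates on a wall collision, contact with a potential target,
    or when the agent has flipped onto its head (its up-axis pointing downwards)."""
    flipped = body_up_z < 0.0
    return hit_border_wall or touched_potential_target or flipped
```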
Reward Function:
The receiver gets a positive reward for reducing its distance 𝑑 to the correct target, scaled by a constant coefficient 𝑘 (a short Python sketch of all three reward terms follows these formulas).
\( R_{\text{receiver}} = k \times (d_t - d_{t+1}) \)
The sender is trained with a mean-squared error reward function, using its actions before one-hot encoding as predictions and the correct utterance as labels, allowing it to choose not to utter anything without being penalized.
\[ R_{\text{Utterance}} = -\frac{1}{N} \sum_{i=1}^{N} (p_i - u_i)^2 \times \text{IsMeaningful} \]
An additional reward encourages exploration of the entire level, based on the trajectory up to the current timestep 𝑡, considering the last 20 timesteps and a decay factor 𝑑.
\[ R_{\text{Exploration}}(t) = \sum_{i=1}^{\min(t,\,20)} \left( \sqrt{(x_t - x_{t-i})^2 + (y_t - y_{t-i})^2} \times d^i \right) \]
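Below is a minimal Python sketch of the three reward terms defined above. The coefficient 𝑘, the decay factor 𝑑, and the 20-step window come from the text; the concrete values and function signatures are assumptions, not the project's implementation.

```python
import numpy as np

K = 1.0      # scaling coefficient k for the receiver's approach reward (value assumed)
DECAY = 0.9  # assumed value for the decay factor d in the exploration reward
WINDOW = 20  # the exploration reward considers the last 20 timesteps

def receiver_reward(dist_prev: float, dist_curr: float, k: float = K) -> float:
    """R_receiver = k * (d_t - d_{t+1}): positive when the receiver moves closer to the target."""
    return k * (dist_prev - dist_curr)

def utterance_reward(predictions: np.ndarray, correct_utterance: np.ndarray,
                     is_meaningful: bool) -> float:
    """Negative mean-squared error between the sender's pre-one-hot action values and the
    correct utterance; zero (no penalty) when the sender utters nothing meaningful."""
    if not is_meaningful:
        return 0.0
    return -float(np.mean((predictions - correct_utterance) ** 2))

def exploration_reward(trajectory_xy: np.ndarray, d: float = DECAY) -> float:
    """Sum of decayed distances from the current position (x_t, y_t) to the positions
    of the last WINDOW timesteps: rewards covering new ground."""
    x_t, y_t = trajectory_xy[-1]
    total = 0.0
    for i in range(1, min(len(trajectory_xy) - 1, WINDOW) + 1):
        x_prev, y_prev = trajectory_xy[-1 - i]
        total += np.hypot(x_t - x_prev, y_t - y_prev) * d ** i
    return float(total)
```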
Average Reward: indicates whether the agent learns to move to the target object.
Average Length: the average episode length; decreases over time as the receiver learns to move towards target objects.
Accuracy: the share of picks by the receiver that match the reference object; measures the effectiveness of training.
Variance: how often the receiver touches different target objects; reflects the agent’s exploration behavior and decision-making.
Language Analysis: Task Performance, Duration, Symbol Perplexity, Symbol Consistency, and Symbol Redundancy.
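As an example of the language analysis, here is a small Python sketch of how symbol perplexity could be computed from logged utterances; this particular formulation (perplexity of the empirical symbol distribution) is an assumption, not necessarily the exact metric used in the study.

```python
import numpy as np
from collections import Counter

def symbol_perplexity(utterances: list[int]) -> float:
    """Perplexity of the empirical symbol distribution: 1.0 when a single symbol is
    used for everything, close to the vocabulary size when usage is uniform."""
    counts = Counter(utterances)
    total = sum(counts.values())
    probs = np.array([c / total for c in counts.values()])
    entropy = -np.sum(probs * np.log2(probs))
    return float(2.0 ** entropy)

# Example: a sender that mostly repeats one symbol has low perplexity.
print(symbol_perplexity([0, 0, 0, 1, 2, 0]))  # ≈ 2.38
```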
Plots: the accuracy of the receiver, the accuracy of the sender, and the reward of the sender.
This research advances the traditional reference game by incorporating embodiment, paving the way for the emerging field of embodied language.
Despite using fixed agent vocabularies, the study lays the foundation for a fully emergent setup; the switch to a fully emergent language will be one of the most important milestones moving forward. Further points to be addressed are longer sequences, recursion, and complex object attributes. Environmental factors, including a shared world space and bidirectional setups, have additionally been identified as crucial long-term goals in extending this research.
We created a dynamic environment using the MuJoCo engine, known for its realistic physics simulations. MuJoCo accurately and efficiently models complex interactions, like collision and movement, making it perfect for RL tasks. To set up the environment, we used our Blueprints library, which translates Python code into XML to design a structured space with walls, agents, distractor objects, target items, and sensors.
By randomizing the objects’ properties and changing conditions, the environment stays unpredictable and challenging, which helps prevent the agents from learning just one specific solution (also known as overfitting) and encourages more adaptable strategies.
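To illustrate this pipeline, here is a minimal Python sketch that builds a small world as MJCF XML with randomized object placement and loads it with the official mujoco bindings. It does not use the project's Blueprints library; all names, sizes, and positions are placeholders.

```python
import random
import mujoco

def random_box(name: str, rgba: str) -> str:
    """XML for a free-floating box placed at a random position on the floor."""
    x, y = random.uniform(-2.0, 2.0), random.uniform(-2.0, 2.0)
    return (f'<body name="{name}" pos="{x:.2f} {y:.2f} 0.1">'
            f'<freejoint/><geom type="box" size="0.1 0.1 0.1" rgba="{rgba}"/></body>')

def build_world_xml() -> str:
    """A floor, one border wall, one target, and a few distractors with random placement."""
    target = random_box("target", "1 0 0 1")
    distractors = "".join(random_box(f"distractor_{i}", "0 0 1 1") for i in range(3))
    return f"""
    <mujoco>
      <worldbody>
        <geom name="floor" type="plane" size="5 5 0.1"/>
        <geom name="wall_north" type="box" pos="0 5 0.5" size="5 0.1 0.5"/>
        {target}
        {distractors}
      </worldbody>
    </mujoco>"""

model = mujoco.MjModel.from_xml_string(build_world_xml())
data = mujoco.MjData(model)
mujoco.mj_step(model, data)  # one physics step to confirm the generated world simulates
```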
Our agents experience their environment through a laser-based sensor system that gives them complete information about their surroundings. But just like humans growing up, they have to learn to interpret this information to build an understanding of their environment.
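One way such a laser-based sensor system can be modeled in MuJoCo is with rangefinder sensors, each of which measures the distance along its site's z-axis to the nearest surface. The sketch below generates a ring of such beams around an agent body; the beam count and site names are assumptions for illustration, not the project's actual configuration.

```python
import math

NUM_BEAMS = 12  # hypothetical number of laser beams around the agent

def laser_ring_xml(prefix: str = "laser") -> tuple[str, str]:
    """Return (site XML to place inside the agent's <body>, sensor XML for <sensor>)."""
    sites, sensors = [], []
    for i in range(NUM_BEAMS):
        angle = 2.0 * math.pi * i / NUM_BEAMS
        # Each site's z-axis points outward; a rangefinder measures distance along +z.
        sites.append(f'<site name="{prefix}_{i}" pos="0 0 0.1" '
                     f'zaxis="{math.cos(angle):.3f} {math.sin(angle):.3f} 0"/>')
        sensors.append(f'<rangefinder name="range_{i}" site="{prefix}_{i}"/>')
    return "\n".join(sites), "\n".join(sensors)

site_block, sensor_block = laser_ring_xml()
# site_block belongs inside the agent's <body> element; sensor_block inside <sensor>.
```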
To identify the target object, the agents have to learn how it looks and how it reacts when they push or grab it.
Finally, the agents have to learn the task itself. They are rewarded when they place the target object in the target area, but we never explicitly tell them what to do: they have to figure it out on their own and adjust their behavior according to the rewards and penalties they receive. These are only some of the skills the agents must learn; many more are necessary to solve the task at hand.
Coming soon
Frequently asked questions and answers will be added in the near future.