iVISPAR

About iVISPAR

iVISPAR is an interactive multi-modal benchmark designed to evaluate the spatial reasoning capabilities of VLMs acting as agents. With a robust system for automated experiments and evaluation, iVISPAR allows large-scale benchmarking of VLMs in complex spatial tasks. The platform ranks state-of-the-art models, highlighting their strengths and weaknesses while offering research insights into their spatial reasoning abilities and visual alignment challenges.

Environments

iVISPAR provides a flexible framework that supports multiple environments, enabling diverse and scalable testing setups. The Geom Board Puzzle is a variant of the classic sliding tile puzzle problem. It challenges VLMs with a broad range of spatial reasoning skills, including logical planning, spatial awareness, orientation handling, and multi-step problem-solving. Read more about our environments here:

Leaderboard

We evaluate a broad suite of state-of-the-art open-source and closed-source VLMs, comparing their performance while also providing optimal path solutions and a human baseline to assess the task’s complexity and feasibility for humans. Results indicate that while some VLMs perform well on simple spatial tasks, they encounter difficulties with more complex configurations and problem properties. See the top performers here:

Research

Vision-Language Models (VLMs) are known to struggle with spatial reasoning and visual alignment. Notably, while VLMs generally perform better in 2D vision compared to 3D or text-based representations, they consistently fall short of human performance, illustrating the persistent challenge of visual alignment. iVISPAR exposes critical gaps in current VLM capabilities, highlighting their limitations in achieving human-level cognition. Find out more here:

News

Contributors

Farbod Nosrat Nezami


Mohamad Ballout

Serwan Jassim

Prof. Dr. Elia Bruni @ University of Osnabrück, EMIBAS project


VLM Leaderboard

We evaluated the spatial reasoning capabilities of VLMs in our SGP environment on 3D vision and compared their performance to 2D vision and text-based modalities across 300 episodes each. To standardize gameplay, the number of actions per episode was capped at 20. Our selection of open- and closed-source VLMs is based on models that scored high on the OpenCompass Official Rankings and which support multi-image inputs and a minimum context length of 800 tokens.
VLMs’ success rates (completed games) over 900 episodes across the three modalities: 3D vision, 2D vision, and text.
Model                     Date     Completed Episodes (%)         Step-Deviation from Optimal Path
                                   All    3D     2D     Text      All    3D     2D     Text
Sonnet-3.5 [1]            Jan '25  54.56  28.67  89.67  45.33     3.05   4.10   1.44   3.60
Gemini-2.0-flash [2]      Jan '25  27.11  12.67  47.33  21.33     4.87   5.25   4.09   5.26
GPT-4o [3]                Jan '25  17.56   9.33  37.33   6.00     5.30   5.45   4.15   6.30
InternVL2.5-78B [4]       Jan '25  10.16   1.67   9.42  19.33     5.98   6.39   5.86   5.69
LLaVA-OneVision-72B [5]   Jan '25   8.22   0.67   1.33  22.67     6.35   6.75   6.81   5.50
Qwen2-72B [6]             Jan '25   5.89   0.67   1.67  15.33     6.37   6.66   6.54   5.90

All results are from Mayer, J., Ballout, M., Jassim, S., Nosrat Nezami, F., & Bruni, E. (2025). iVISPAR — An Interactive Visual-Spatial Reasoning Benchmark for VLMs. arXiv:2502.03214.

Model references:
[1] Claude Team. Introducing the next generation of Claude. anthropic.com/news/claude-3-family, 2024.
[2] Gemini Team. Gemini 2.0 flash (experimental), 2024.
[3] OpenAI. GPT-4o, 2024.
[4] Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv:2412.05271, 2024.
[5] Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Li, Y., Liu, Z., and Li, C. LLaVA-OneVision: Easy visual task transfer. arXiv:2408.03326, 2024.
[6] Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., Fan, Y., Dang, K., Du, M., Ren, X., Men, R., Liu, D., Zhou, C., Zhou, J., and Lin, J. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution. arXiv:2409.12191, 2024.

Spatial Reasoning in VLMs

Coming soon.

Common Errors


Complexity Scaling

Coming soon.

Environments

iVISPAR is an interactive and multi-modal puzzle simulator that provides agents with an image or text representation of the board state. By rendering in 3D space, iVISPAR offers a more realistic depiction of spatial scenes compared to traditional 2D grid puzzles. Additionally, it supports a 2D top-down view and a text-based representation. iVISPAR provides a flexible framework that supports multiple environments, enabling diverse and scalable testing setups.

Agents interact with the board by issuing natural language commands through a text-based API to apply actions to the board. The objective is to rearrange pieces on the board to match a goal configuration. iVISPAR allows for fine-grained scaling of complexity, customizable random puzzle generation, and benchmarking performance against multiple baseline models.

Geom Board Puzzle

The Sliding Geom Puzzle (SGP) replaces traditional numbered tiles with geometric objects (geoms) that are uniquely defined by color and shape attributes, increasing visual-spatial complexity and enhancing task scalability. This design shift requires models to interpret object features rather than follow simple numerical sequences, mirroring real-world spatial reasoning, where objects are distinguished by appearance, size, or structure. This approach aligns with physical tasks such as organizing items, assembling structures, or packing, promoting a more authentic evaluation of real-world spatial capabilities.


Game dynamics

In each episode, agents receive observations of the start and goal states, accompanied by task instructions. Agents apply move actions to geoms by referencing their unique color and shape combination and specifying the direction of intended movement. Geoms can be moved in cardinal directions (\( \text{LEFT, RIGHT, UP, DOWN} \)), with actions formatted as “move <color> <shape> <direction>”:



move blue sphere right

 

Actions are validated and applied if legal, and agents receive the updated board state after each move command regardless of the action’s success. Effective and ineffective actions both result in valid new board states but, respectively, decrease or increase the path length to the goal state. Invalid moves, such as moves to an occupied destination or out of bounds, fail to alter the board state, as do illegal commands, which violate the instructed action format. This action-perception loop repeats until the goal state is achieved or a step limit is reached. Due to limited context windows, VLM agents receive the task instructions at each time step.
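The validation rules above can be sketched in a few lines of Python. This is a hedged illustration, not iVISPAR's actual implementation: the board representation, function names, and coordinate convention are all assumptions.

```python
# Illustrative sketch of move validation on a square board (not iVISPAR's real API).
# A board is assumed to be a dict mapping (color, shape) -> (x, y) grid positions.

DIRECTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def parse_command(cmd):
    """Parse 'move <color> <shape> <direction>'; return None for illegal commands."""
    parts = cmd.strip().lower().split()
    if len(parts) != 4 or parts[0] != "move" or parts[3] not in DIRECTIONS:
        return None
    return parts[1], parts[2], parts[3]

def apply_action(board, size, cmd):
    """Apply a command to the board in place; return True only if it was effective."""
    parsed = parse_command(cmd)
    if parsed is None:                      # illegal command: violates the format
        return False
    color, shape, direction = parsed
    if (color, shape) not in board:         # no such geom on the board
        return False
    dx, dy = DIRECTIONS[direction]
    x, y = board[(color, shape)]
    target = (x + dx, y + dy)
    # Invalid moves (out of bounds or occupied destination) leave the board unchanged.
    if not (0 <= target[0] < size and 0 <= target[1] < size):
        return False
    if target in board.values():
        return False
    board[(color, shape)] = target
    return True
```

Both failure cases return the same unchanged board to the agent, mirroring how the simulator reports the state after every command regardless of success.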

Sliding Tile Puzzle (15-Tile Puzzle)


The sequential generalized sliding-tile puzzle (SGSTP) is a generalization of the classic 15-Tile Sliding Tile Puzzle. In the SGSTP, a set of \( n < m_1 \times m_2 \) tiles, each uniquely labeled \( 1, \dots, n \), is placed on a rectangular grid of size \( m_1 \times m_2 \), denoted by \( G = (V, E) \). The grid has \( m_1 \times m_2 - n \) empty positions that allow tile movement.

A configuration of tiles is represented as an injective mapping from the set \( \{1, \dots, n\} \) to positions \( V = \{(v_x, v_y) : 1 \leq v_x \leq m_2, 1 \leq v_y \leq m_1 \} \). Each tile must be repositioned from an arbitrary initial configuration \( S = \{s_1, \dots, s_n\} \) to a specified goal configuration \( \mathcal{G} = \{g_1, \dots, g_n\} \), such as an ordered row-major layout.

 

Let the movement path of tile \( i \), where \( 1 \leq i \leq n \), be expressed as \( p_i : \mathbb{N}_0 \to V \). The puzzle seeks a set of feasible paths \( P = \{p_1, \dots, p_n\} \) that satisfy the following conditions for all \( 1 \leq i, j \leq n \) with \( i \neq j \), and for all time steps \( t \geq 0 \):

 

Incremental Movement:
\( p_i(t+1) = p_i(t) \text{ or } (p_i(t+1), p_i(t)) \in E \)
Tiles either stay in place or move along a grid edge to an adjacent position.

 

Goal Achievement:
\( p_i(0) = s_i \text{ and } p_i(T) = g_i \text{ for some } T \geq 0 \)
Each tile must start at \( s_i \) and reach \( g_i \).

 

Exclusive Occupancy:
\( p_i(t) \neq p_j(t) \text{ for all } i \neq j \)
Two tiles cannot occupy the same position at the same time.

 

In this sequential version, tiles move one at a time. Therefore, the head-on collision and corner-following constraints found in the generalized sliding-tile puzzle are omitted, as simultaneous tile movements are not permitted.

You are a highly intelligent AI with exceptional spatial reasoning, tasked with solving a shape puzzle game on a 4 by 4 grid board.

Game Overview

The game consists of a grid board with two states: a current active state and a goal state.

Your objective is to generate valid actions to move objects on the board, step by step, along the shortest path until the current state matches the goal state:

  1. Analyze the current state
  2. Compare with the goal state
  3. Check past actions
  4. Generate a new valid action

Key Rules

Object Movement

  • Each object occupies exactly one tile on the board.
  • Objects cannot move beyond the 4 by 4 grid boundaries or occupy the same tile as another object.

Action Format

Your actions must follow this exact format:

move <object color> <object shape> <direction>

Replace <object color>, <object shape>, and <direction> with appropriate values from the lists below. Do not use quotation marks or angle brackets.

Valid Options

  • Object Colors: green, red, blue, yellow
  • Object Shapes: cube, sphere, pyramid, cylinder
  • Directions: up, down, left, right

Example Actions

  • move green cube down
  • move blue sphere up
  • move red pyramid left
  • move yellow cylinder right

Important Notes

  • No Coordinates: Only specify color, shape, and direction — no grid positions.
  • Valid Format Required: Every action must follow the exact format and rules.
  • Invalid Actions: Actions that do not change the state (e.g., blocked or out of bounds) are invalid.

Explain Your Reasoning

  • Before each action, explain your reasoning clearly.
  • End every response with this exact line:
    action: <your action>
    Replace <your action> with a valid move.

Analyze the Images

  • View your current active board state in the image: {text snippet active}
  • Match the goal state from: {text snippet goal}

Additionally Provided

  • The previous state image(s): {text snippet past}
  • Your previously suggested action
  • Use these to understand why an action failed and adjust your next one accordingly.

Invalid Actions

  • No Overlap: Objects cannot occupy the same tile.
  • If an action doesn’t move any object, it is invalid — try a different one.

Final Format Required

Always end your response with:

action: move <object color> <object shape> <direction>

Additionally, always end with:

description: <your object coordinate list>

Do not add any characters after the words action: or description:.
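On the harness side, a reply that follows this response format can be handled by scanning for the last well-formed action line. A minimal sketch, assuming the agent's reply arrives as a single string; the regex and function name are illustrative, not iVISPAR's actual parser.

```python
import re

# Matches a full 'action: move <color> <shape> <direction>' line anywhere in a reply.
ACTION_RE = re.compile(
    r"^action:\s*(move \w+ \w+ (?:up|down|left|right))\s*$",
    re.IGNORECASE | re.MULTILINE,
)

def extract_action(response):
    """Return the last well-formed action line in lowercase, or None if absent."""
    matches = ACTION_RE.findall(response)
    return matches[-1].lower() if matches else None
```

Taking the last match lets the agent reason freely in earlier lines, as the prompt requires, while still yielding exactly one executable command per reply.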

Development

Coming soon.

Frequently Asked Questions

FAQ

Where are the questions and answers?

Frequently asked questions and answers will be added in the near future.