iVISPAR is an interactive multi-modal benchmark designed to evaluate the spatial reasoning capabilities of VLMs acting as agents. With a robust system for automated experiments and evaluation, iVISPAR allows large-scale benchmarking of VLMs in complex spatial tasks. The platform ranks state-of-the-art models, highlighting their strengths and weaknesses while offering research insights into their spatial reasoning abilities and visual alignment challenges.
iVISPAR provides a flexible framework that supports multiple environments, enabling diverse and scalable testing setups. The Sliding Geom Puzzle (SGP) is a variant of the classic sliding-tile puzzle. It challenges VLMs with a broad range of spatial reasoning skills, including logical planning, spatial awareness, orientation handling, and multi-step problem-solving. Read more about our environments here:
We evaluate a broad suite of state-of-the-art open-source and closed-source VLMs, comparing their performance while also providing optimal path solutions and a human baseline to assess the task’s complexity and feasibility for humans. Results indicate that while some VLMs perform well on simple spatial tasks, they encounter difficulties with more complex configurations and problem properties. See the top performers here:
Vision-Language Models (VLMs) are known to struggle with spatial reasoning and visual alignment. Notably, while VLMs generally perform better on 2D vision than on 3D or text-based representations, they consistently fall short of human performance, illustrating the persistent challenge of visual alignment. iVISPAR exposes critical gaps in current VLM capabilities, underscoring their limitations in achieving human-level cognition. Find out more here:
| Model | Date | Completed Episodes (%) |||| Step-Deviation from Optimal Path ||||
|---|---|---|---|---|---|---|---|---|---|
| | | All | 3D | 2D | Text | All | 3D | 2D | Text |
| Sonnet-3.5 [1] | Jan '25 | 54.56 | 28.67 | 89.67 | 45.33 | 3.05 | 4.10 | 1.44 | 3.60 |
| Gemini-2.0-flash [2] | Jan '25 | 27.11 | 12.67 | 47.33 | 21.33 | 4.87 | 5.25 | 4.09 | 5.26 |
| GPT-4o [3] | Jan '25 | 17.56 | 9.33 | 37.33 | 6.00 | 5.30 | 5.45 | 4.15 | 6.30 |
| InternVL2.5-78B [4] | Jan '25 | 10.16 | 1.67 | 9.42 | 19.33 | 5.98 | 6.39 | 5.86 | 5.69 |
| LLaVA-OneVision-72B [5] | Jan '25 | 8.22 | 0.67 | 1.33 | 22.67 | 6.35 | 6.75 | 6.81 | 5.50 |
| Qwen2-VL-72B [6] | Jan '25 | 5.89 | 0.67 | 1.67 | 15.33 | 6.37 | 6.66 | 6.54 | 5.90 |

All results reported in Mayer, J., Ballout, M., Jassim, S., Nosrat Nezami, F., & Bruni, E. (2025). iVISPAR — An Interactive Visual-Spatial Reasoning Benchmark for VLMs. arXiv:2502.03214.

[1] Claude Team. Introducing the next generation of Claude. anthropic.com/news/claude-3-family, 2024.
[2] Gemini Team. Gemini 2.0 Flash (experimental), 2024.
[3] OpenAI. GPT-4o, 2024.
[4] Chen, Z., Wang, W., Cao, Y., et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv:2412.05271, 2024.
[5] Li, B., Zhang, Y., Guo, D., et al. LLaVA-OneVision: Easy visual task transfer. arXiv:2408.03326, 2024.
[6] Wang, P., Bai, S., Tan, S., et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv:2409.12191, 2024.
coming soon ..
coming soon ..
Agents interact with the board by issuing natural language commands through a text-based API to apply actions to the board. The objective is to rearrange pieces on the board to match a goal configuration. iVISPAR allows for a fine scaling of complexity, customizable random puzzle generation, and benchmarking performance with multiple baseline models.
The Sliding Geom Puzzle (SGP) replaces traditional numbered tiles with geometric objects (geoms) that are uniquely defined by color and shape attributes, increasing visual-spatial complexity and enhancing task scalability. This design shift requires models to interpret object features rather than follow simple numerical sequences, mirroring real-world spatial reasoning, where objects are distinguished by appearance, size, or structure. This approach aligns with physical tasks such as organizing items, assembling structures, or packing, promoting a more authentic evaluation of real-world spatial capabilities.
In each episode, agents receive observations of the start and goal states, accompanied by task instructions. Agents apply move actions to geoms by referencing their unique color and shape combination and specifying the direction of intended movement. Geoms can be moved in cardinal directions (\( \text{LEFT, RIGHT, UP, DOWN} \)), with actions formatted as “move <color> <shape> <direction>”:
move blue sphere right
Actions are validated and applied if legal, with agents receiving the updated board state after each move command regardless of the action's success. Effective and ineffective actions both result in valid new board states but, respectively, decrease or increase the path length to the goal state. Invalid moves, such as moves to occupied destinations or out-of-bounds actions, fail to alter the board state, as do illegal commands, which violate the instructed action format. This action-perception loop repeats until the goal state is achieved or a step limit is reached. Due to limited context windows, VLM agents receive the task instructions at each time step.
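The command format and validation rules described above can be sketched as a minimal board simulator. This is an illustration only; the class, method names, and coordinate convention are hypothetical, not iVISPAR's actual API:

```python
import re

class SlidingGeomBoard:
    """Toy sliding-geom board: geoms keyed by (color, shape) on an n x n grid."""

    DIRECTIONS = {"left": (-1, 0), "right": (1, 0), "up": (0, 1), "down": (0, -1)}
    ACTION_RE = re.compile(r"^move (\w+) (\w+) (left|right|up|down)$")

    def __init__(self, size, geoms):
        self.size = size              # board is size x size
        self.positions = dict(geoms)  # ("blue", "sphere") -> (x, y)

    def step(self, command):
        """Apply one agent command; return the (possibly unchanged) board state."""
        match = self.ACTION_RE.match(command.strip().lower())
        if not match:                 # illegal command: violates the action format
            return self.positions
        color, shape, direction = match.groups()
        geom = (color, shape)
        if geom not in self.positions:
            return self.positions
        dx, dy = self.DIRECTIONS[direction]
        x, y = self.positions[geom]
        target = (x + dx, y + dy)
        in_bounds = 0 <= target[0] < self.size and 0 <= target[1] < self.size
        occupied = target in self.positions.values()
        if in_bounds and not occupied:  # invalid moves leave the board unchanged
            self.positions[geom] = target
        return self.positions

board = SlidingGeomBoard(4, {("blue", "sphere"): (0, 0), ("red", "pyramid"): (1, 0)})
board.step("move blue sphere right")  # invalid: destination cell is occupied
board.step("move blue sphere up")     # legal: sphere slides to (0, 1)
```

Note that, as in the benchmark, an invalid or illegal command still returns a board state, so the agent always receives an observation after every move.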
A configuration of tiles is represented as an injective mapping from the set \( \{1, \dots, n\} \) to positions \( V = \{(v_x, v_y) : 1 \leq v_x \leq m_2, 1 \leq v_y \leq m_1 \} \) on an \( m_1 \times m_2 \) grid; \( E \) denotes the set of pairs of orthogonally adjacent positions, so that \( (V, E) \) forms a grid graph. Each tile must be repositioned from an arbitrary initial configuration \( S = \{s_1, \dots, s_n\} \) to a specified goal configuration \( G = \{g_1, \dots, g_n\} \), such as an ordered row-major layout.
Let the movement path of tile \( i \), where \( 1 \leq i \leq n \), be expressed as \( p_i : \mathbb{N}_0 \to V \). The puzzle seeks a set of feasible paths \( P = \{p_1, \dots, p_n\} \) that satisfy the following conditions for all \( 1 \leq i, j \leq n \) with \( i \neq j \), and for all time steps \( t \geq 0 \):
Incremental Movement:
\( p_i(t+1) = p_i(t) \text{ or } (p_i(t+1), p_i(t)) \in E \)
Tiles move to adjacent, unoccupied positions or stay still.
Goal Achievement:
\( p_i(0) = s_i \text{ and } p_i(T) = g_i \text{ for some } T \geq 0 \)
Each tile must start at \( s_i \) and reach \( g_i \).
Exclusive Occupancy:
\( p_i(t) \neq p_j(t) \text{ for all } i \neq j \)
Two tiles cannot occupy the same position at the same time.
In this sequential version, tiles move one at a time. Therefore, the head-on collision and corner-following constraints found in the generalized sliding-tile puzzle are omitted, as simultaneous tile movements are not permitted.
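The conditions above can be checked mechanically against a candidate path set. The following sketch uses my own function and variable names (not from the benchmark) and assumes all path sequences are padded to a common length, with waiting tiles repeating their cell:

```python
def valid_solution(paths, starts, goals, width, height):
    """Check a sequential sliding-tile solution against the stated conditions.

    paths[i] is tile i's position sequence [(x, y), ...]; all sequences
    share the same length T + 1.
    """
    n = len(paths)
    T = len(paths[0]) - 1
    for i in range(n):
        # Goal achievement: p_i(0) = s_i and p_i(T) = g_i.
        if paths[i][0] != starts[i] or paths[i][T] != goals[i]:
            return False
        for t in range(T):
            (x0, y0), (x1, y1) = paths[i][t], paths[i][t + 1]
            # Incremental movement: stay put or slide to an orthogonally
            # adjacent, in-bounds cell (an edge of the grid graph).
            if abs(x1 - x0) + abs(y1 - y0) > 1:
                return False
            if not (1 <= x1 <= width and 1 <= y1 <= height):
                return False
    for t in range(T + 1):
        # Exclusive occupancy: no two tiles share a cell at any time step.
        if len({paths[i][t] for i in range(n)}) != n:
            return False
    for t in range(T):
        # Sequential version: at most one tile moves per time step.
        if sum(paths[i][t] != paths[i][t + 1] for i in range(n)) > 1:
            return False
    return True

# On a 2 x 2 board, two tiles rotate around a free cell, one move at a time.
paths = [[(1, 1), (1, 2), (1, 2)], [(2, 1), (2, 1), (1, 1)]]
print(valid_solution(paths, [(1, 1), (2, 1)], [(1, 2), (1, 1)], width=2, height=2))
# prints True
```

The final loop enforces the sequential restriction, which is why a direct head-on swap of two adjacent tiles fails even though it never violates exclusive occupancy at any single time step.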
You are a highly intelligent AI with exceptional spatial reasoning, tasked with solving a shape puzzle game on a 4 by 4 grid board.
The game consists of a grid board with two states: a current active state and a goal state.
Your objective is to generate valid actions to move objects on the board, step by step, along the shortest path until the current state matches the goal state:
Your actions must follow this exact format:
move <object color> <object shape> <direction>
Replace <object color>, <object shape>, and <direction> with appropriate values from the lists below. Do not use quotation marks or angle brackets.
move green cube down
move blue sphere up
move red pyramid left
move yellow cylinder right
action: <your action>
Replace <your action> with a valid move.
{text snippet active}
{text snippet goal}
{text snippet past}
Always end your response with:
action: move <object color> <object shape> <direction>
Additionally, always end with:
description: <your object coordinate list>
Do not add any characters after the words action: or description:.
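Given the response-format rules above, a harness can extract the agent's move from a model reply with a simple pattern match. This is a sketch; the exact parsing used by iVISPAR may differ:

```python
import re

# Matches a line of the form "action: move <color> <shape> <direction>".
ACTION_RE = re.compile(r"action:\s*move (\w+) (\w+) (left|right|up|down)\s*$",
                       re.MULTILINE)

def parse_action(response: str):
    """Return (color, shape, direction) from the last 'action:' line, or None."""
    matches = ACTION_RE.findall(response)
    return matches[-1] if matches else None

reply = "The sphere must go up to reach its goal cell.\naction: move blue sphere up"
print(parse_action(reply))  # prints ('blue', 'sphere', 'up')
```

Taking the last match is a defensive choice: models sometimes restate the format before committing to a final action line.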
Coming soon
Frequently asked questions and answers will be added in the near future.