Multi-image spatial reasoning remains challenging for current multimodal large language models (MLLMs). While single-view perception is inherently 2D, reasoning over multiple views requires building a coherent scene understanding across viewpoints. In particular, we study perspective taking, where a model must construct a 3D understanding of the scene from multi-view observations and use it to reason from a new, language-specified viewpoint. We propose CamCue, a pose-aware multi-image framework that grounds language-specified viewpoints to explicit camera poses and generates the corresponding imagined view. To support this setting, we curate CamCue-Data, with 27,668 training and 508 test instances pairing multi-view images and camera poses with diverse target-viewpoint descriptions and perspective-shift questions. The test split also includes human-annotated viewpoint descriptions to evaluate generalization to human language. CamCue improves overall accuracy by 9.06 percentage points and predicts target poses from natural-language viewpoint descriptions with over 90% accuracy at the 20° rotation and 0.5 translation error thresholds. This direct grounding avoids expensive test-time search-and-match, reducing inference time from 256.6s to 1.45s per example and enabling fast, interactive use in real-world scenarios.
CamCue improves perspective taking in multi-image spatial reasoning by grounding language-specified viewpoints to explicit camera poses. The framework uses camera pose as a geometric anchor for cross-view fusion and generates pose-conditioned imagined views to support answering perspective-shift questions.
We curate CamCue-Data with 27,668 training and 508 test instances, pairing multi-view images and camera poses with diverse target-viewpoint descriptions and perspective-shift questions. The test split also includes human-annotated viewpoint descriptions to evaluate generalization to human language.
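For concreteness, a single CamCue-Data instance can be pictured roughly as below. This is a minimal sketch under assumptions: the field names and pose format are illustrative, not the released schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CamCueInstance:
    """Illustrative (assumed) layout of one CamCue-Data instance."""
    image_paths: List[str]            # multi-view observation images of the scene
    camera_poses: List[List[float]]   # one camera pose per view (assumed flattened [R|t])
    viewpoint_description: str        # natural-language description of the target viewpoint
    target_pose: List[float]          # ground-truth camera pose for the described viewpoint
    question: str                     # perspective-shift question asked from that viewpoint
    answer: str                       # ground-truth answer
```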
CamCue injects per-view camera pose into the visual tokens, grounds natural-language viewpoint descriptions to a target camera pose, and synthesizes a pose-conditioned imagined target view to support answering. This direct grounding avoids expensive test-time search-and-match, reducing inference time from 256.6s to 1.45s per example and enabling fast, interactive use in real-world scenarios.
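To make the pose-injection step concrete, here is a minimal sketch assuming a small MLP that maps each view's flattened [R|t] pose to the token width and adds it to that view's visual tokens. The module name, pose parameterization, and dimensions are assumptions for illustration, not the actual CamCue implementation.

```python
import torch
import torch.nn as nn

class PoseInjector(nn.Module):
    """Hypothetical sketch: embed each view's camera pose and add it to that view's visual tokens."""
    def __init__(self, pose_dim: int = 12, hidden_dim: int = 4096):
        super().__init__()
        # MLP mapping a flattened 3x4 [R|t] pose (12 values, assumed) to the visual-token width.
        self.pose_mlp = nn.Sequential(
            nn.Linear(pose_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, visual_tokens: torch.Tensor, poses: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (num_views, tokens_per_view, hidden_dim)
        # poses:         (num_views, pose_dim), one camera pose per view
        pose_emb = self.pose_mlp(poses)               # (num_views, hidden_dim)
        return visual_tokens + pose_emb.unsqueeze(1)  # broadcast the pose over each view's tokens
```

The grounding of the language-specified viewpoint to a target pose and the pose-conditioned view synthesis are separate steps not covered by this sketch.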
Examples of pose-conditioned imagined views generated by CamCue for perspective-shift reasoning. The model predicts target poses from natural-language viewpoint descriptions with over 90% accuracy at the 20° rotation and 0.5 translation error thresholds.
Camera pose estimation accuracy under different viewpoint description sources. Values are the percentage of samples with rotation/translation error within each threshold.
| Viewpoint Description | R@5° ↑ | R@10° ↑ | R@20° ↑ | t@0.1 ↑ | t@0.3 ↑ | t@0.5 ↑ |
|---|---|---|---|---|---|---|
| Synthetic | 19.3 | 35.4 | 91.5 | 12.0 | 62.4 | 92.9 |
| Human Description | 30.1 | 56.9 | 100.0 | 19.5 | 74.8 | 95.1 |
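The thresholded accuracies above can be computed with a geodesic rotation error and a Euclidean translation error. The sketch below assumes 3×3 rotation matrices and translation vectors expressed in the same (normalized) scale as the 0.1/0.3/0.5 thresholds; the exact evaluation protocol used for the table is not specified here.

```python
import numpy as np

def rotation_error_deg(R_pred: np.ndarray, R_gt: np.ndarray) -> float:
    """Geodesic angle (degrees) between two 3x3 rotation matrices."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def translation_error(t_pred: np.ndarray, t_gt: np.ndarray) -> float:
    """Euclidean distance between predicted and ground-truth camera translations."""
    return float(np.linalg.norm(t_pred - t_gt))

def accuracy_at_thresholds(errors, thresholds):
    """Percentage of samples whose error falls within each threshold, as reported above."""
    errors = np.asarray(errors)
    return {thr: 100.0 * float(np.mean(errors <= thr)) for thr in thresholds}

# Usage (assumed per-sample error lists over the test split):
#   accuracy_at_thresholds(rot_errors, [5, 10, 20])        -> R@5°, R@10°, R@20°
#   accuracy_at_thresholds(trans_errors, [0.1, 0.3, 0.5])  -> t@0.1, t@0.3, t@0.5
```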
Question-answering accuracy (%) by question type for base MLLMs, with MindJourney, and with CamCue.

| Model | Overall (Avg.) | Attribute | Visibility | Distance | Order | Relative | Relation Count |
|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 71.06 | 93.00 | 84.31 | 71.43 | 59.73 | 55.29 | 55.29 |
| +MindJourney | 72.83 | 92.00 | 84.31 | 80.22 | 65.75 | 50.59 | 50.59 |
| +CamCue | 80.12 | 92.00 | 88.24 | 83.52 | 78.52 | 60.00 | 60.00 |
| Qwen2.5-VL-3B | 67.52 | 97.00 | 80.39 | 62.64 | 57.05 | 50.59 | 50.59 |
| +MindJourney | 70.28 | 95.00 | 76.47 | 69.23 | 64.64 | 50.59 | 50.59 |
| +CamCue | 75.92 | 94.12 | 82.35 | 80.43 | 67.97 | 63.53 | 63.53 |
| InternVL-2.5-8B | 68.11 | 89.00 | 76.47 | 80.22 | 58.56 | 52.94 | 52.94 |
| +MindJourney | 74.21 | 93.00 | 82.35 | 85.71 | 64.64 | 55.29 | 55.29 |
| +CamCue | 77.36 | 94.00 | 80.39 | 89.01 | 72.38 | 54.12 | 54.12 |
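As a quick sanity check, the 9.06-point overall gain quoted in the abstract matches the Qwen2.5-VL-7B row above; a few lines of Python reproduce the per-backbone gains from the table values (treated as absolute percentage points):

```python
# Overall accuracy from the table above: (base model, +CamCue).
overall = {
    "Qwen2.5-VL-7B":   (71.06, 80.12),
    "Qwen2.5-VL-3B":   (67.52, 75.92),
    "InternVL-2.5-8B": (68.11, 77.36),
}
for model, (base, with_camcue) in overall.items():
    print(f"{model}: +{with_camcue - base:.2f} points")  # 9.06, 8.40, 9.25
```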
@article{zhang2025camcue,
title={Predicting Camera Pose from Perspective Descriptions for Spatial Reasoning},
author={Zhang, Xuejun and Tiwari, Aditi and Wang, Zhenhailong and Ji, Heng},
journal={arXiv preprint},
year={2025}
}