Predicting Camera Pose from Perspective Descriptions for Spatial Reasoning

University of Illinois Urbana-Champaign

We propose CamCue, a pose-aware multi-image framework that grounds language-specified viewpoints to explicit camera poses and generates the corresponding imagined view.

Abstract

Multi-image spatial reasoning remains challenging for current multimodal large language models (MLLMs). While single-view perception is inherently 2D, reasoning over multiple views requires building a coherent scene understanding across viewpoints. In particular, we study perspective taking, where a model must integrate multi-view observations into a coherent 3D understanding and use it to reason from a new, language-specified viewpoint. We propose CamCue, a pose-aware multi-image framework that grounds language-specified viewpoints to explicit camera poses and generates the corresponding imagined view. To support this setting, we curate CamCue-Data with 27,668 training and 508 test instances pairing multi-view images and poses with diverse target-viewpoint descriptions and perspective-shift questions. The test split also includes human-annotated viewpoint descriptions to evaluate generalization to natural human language. CamCue improves overall accuracy by 9.06% and predicts target poses from natural-language viewpoint descriptions with over 90% of predictions within a 20° rotation error and a 0.5 translation error threshold. This direct grounding avoids expensive test-time search-and-match, reducing inference time from 256.6s to 1.45s per example and enabling fast, interactive use in real-world scenarios.

Overview

CamCue improves perspective taking in multi-image spatial reasoning by grounding language-specified viewpoints to explicit camera poses. The framework uses camera pose as a geometric anchor for cross-view fusion and generates pose-conditioned imagined views to support answering perspective-shift questions.

Figure: CamCue overview, grounding language-specified viewpoints to camera poses for perspective-shift reasoning.

Data

We curate CamCue-Data with 27,668 training and 508 test instances, each pairing multi-view images and camera poses with diverse target-viewpoint descriptions and perspective-shift questions. The test split also includes human-annotated viewpoint descriptions to evaluate generalization to natural human language.

Figure: CamCue-Data pairs multi-view images and camera poses with perspective-shift questions.
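
To make the instance format concrete, here is a minimal sketch of one data instance as a Python dict. All field names, file paths, and values are hypothetical illustrations of the structure described above, not the released schema.

# A hypothetical CamCue-Data instance (illustrative field names, not the
# released schema): multi-view observations with camera poses, a
# language-specified target viewpoint, and a perspective-shift question.
example = {
    "views": [
        {"image": "scene_0421/view_0.png",
         "pose": {"rotation": [[1, 0, 0], [0, 1, 0], [0, 0, 1]],   # 3x3 R
                  "translation": [0.0, 0.0, 0.0]}},                # 3-vector t
        {"image": "scene_0421/view_1.png",
         "pose": {"rotation": [[0, 0, 1], [0, 1, 0], [-1, 0, 0]],
                  "translation": [1.2, 0.0, 0.4]}},
    ],
    # Language-specified target viewpoint, to be grounded to a camera pose.
    "viewpoint_description": "Standing behind the sofa, facing the window.",
    # Perspective-shift question answered from the imagined target view.
    "question": "From this viewpoint, is the lamp to the left of the TV?",
    "answer": "Yes",
}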

Framework

CamCue injects per-view camera pose into visual tokens, grounds the natural-language viewpoint description to a target camera pose, and synthesizes a pose-conditioned imagined target view to support answering. This direct grounding avoids expensive test-time search-and-match, reducing inference time from 256.6s to 1.45s per example and enabling fast, interactive use in real-world scenarios.

Figure: CamCue framework architecture.
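
As one way to picture the pose-injection step, the PyTorch sketch below adds a learned embedding of each view's 4x4 camera extrinsic to that view's visual tokens, so that cross-view attention can use pose as a geometric anchor. This is a minimal sketch under our own assumptions (flattened extrinsics passed through a small MLP); PoseInjection and all shapes are illustrative, not CamCue's released architecture.

import torch
import torch.nn as nn

class PoseInjection(nn.Module):
    """Add a learned camera-pose embedding to each view's visual tokens.
    A sketch of one plausible design, not CamCue's released architecture."""

    def __init__(self, d_model: int):
        super().__init__()
        # Project the flattened 4x4 extrinsic (16 values) to the token width.
        self.pose_proj = nn.Sequential(
            nn.Linear(16, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )

    def forward(self, visual_tokens: torch.Tensor, extrinsics: torch.Tensor):
        # visual_tokens: (batch, views, tokens, d_model)
        # extrinsics:    (batch, views, 4, 4) world-to-camera matrices
        pose_emb = self.pose_proj(extrinsics.flatten(-2))  # (batch, views, d_model)
        # Broadcast each view's pose embedding over that view's tokens.
        return visual_tokens + pose_emb.unsqueeze(2)

# Toy usage: 3 views, 196 visual tokens each, token width 1024.
tokens = torch.randn(1, 3, 196, 1024)
poses = torch.eye(4).expand(1, 3, 4, 4).contiguous()
print(PoseInjection(1024)(tokens, poses).shape)  # torch.Size([1, 3, 196, 1024])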

Camera Prediction

Examples of pose-conditioned imagined views generated by CamCue for perspective-shift reasoning. The model predicts target poses from natural-language viewpoint descriptions with over 90% of predictions within a 20° rotation error and a 0.5 translation error threshold.

Figure: Examples of pose-conditioned imagined views.

Camera pose estimation accuracy under different viewpoint description sources. Values are the percentage of samples with rotation/translation error within each threshold.

                        Rotation Acc. ↑ (%)      Translation Acc. ↑ (%)
Viewpoint Description   R@5°    R@10°   R@20°    t@0.1   t@0.3   t@0.5
------------------------------------------------------------------------
Synthetic               19.3    35.4     91.5    12.0    62.4    92.9
Human Description       30.1    56.9    100.0    19.5    74.8    95.1
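
For reference, these thresholded accuracies can be computed as in the NumPy sketch below. The geodesic rotation error is the standard metric behind the R@5°/10°/20° columns; treating translation error as the Euclidean distance between predicted and ground-truth camera positions is our assumption, since the unit of the 0.1/0.3/0.5 thresholds is not restated here.

import numpy as np

def rotation_error_deg(R_pred: np.ndarray, R_gt: np.ndarray) -> float:
    """Geodesic angle (degrees) between two 3x3 rotation matrices."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def translation_error(t_pred: np.ndarray, t_gt: np.ndarray) -> float:
    """L2 distance between camera positions (our assumed definition)."""
    return float(np.linalg.norm(t_pred - t_gt))

def accuracy_at(errors, threshold: float) -> float:
    """Fraction of samples whose error falls within the threshold."""
    return float(np.mean(np.asarray(errors) <= threshold))

# Toy usage: a prediction offset from ground truth by a 12-degree yaw.
R_gt = np.eye(3)
theta = np.radians(12.0)
R_pred = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
err = rotation_error_deg(R_pred, R_gt)
print(err)                        # ~12.0 degrees
print(accuracy_at([err], 20.0))   # 1.0 -> counts toward R@20°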

Main results on perspective-shift reasoning (accuracy, %)

Model              Overall (Avg.)  Attribute  Visibility  Distance  Order  Relative Relation  Count
----------------------------------------------------------------------------------------------------
Qwen2.5-VL-7B           71.06        93.00      84.31       71.43   59.73        55.29        55.29
 +MindJourney           72.83        92.00      84.31       80.22   65.75        50.59        50.59
 +CamCue                80.12        92.00      88.24       83.52   78.52        60.00        60.00
Qwen2.5-VL-3B           67.52        97.00      80.39       62.64   57.05        50.59        50.59
 +MindJourney           70.28        95.00      76.47       69.23   64.64        50.59        50.59
 +CamCue                75.92        94.12      82.35       80.43   67.97        63.53        63.53
InternVL-2.5-8B         68.11        89.00      76.47       80.22   58.56        52.94        52.94
 +MindJourney           74.21        93.00      82.35       85.71   64.64        55.29        55.29
 +CamCue                77.36        94.00      80.39       89.01   72.38        54.12        54.12

Citation

@article{zhang2025camcue,
  title={Predicting Camera Pose from Perspective Descriptions for Spatial Reasoning},
  author={Zhang, Xuejun and Tiwari, Aditi and Wang, Zhenhailong and Ji, Heng},
  journal={arXiv preprint},
  year={2025}
}