Towards Consistent Video Geometry Estimation

Zhu Yu¹,*,† Jingnan Gao³,* Runmin Zhang¹ Lingteng Qiu²

Zhengyi Zhao² Rui Peng² Yichao Yan³ Kejie Qiu² Siyu Zhu⁴

Zilong Dong² Si-Yuan Cao¹ Hui-Liang Shen¹

¹Zhejiang University ²Tongyi Lab, Alibaba Group ³Shanghai Jiao Tong University ⁴Fudan University

†Internship at Tongyi Lab *Equal contribution


Figure (teaser): video frames, consistent point maps, depth, normals, and streaming inference. ViGeo is a unified feed-forward foundation model for video geometry estimation. It predicts temporally consistent depth, surface normals, and dense point maps from raw video frames. With dynamic chunking attention, the same trained model switches between full-sequence reconstruction and streaming inference without retraining.

Abstract

This work presents ViGeo, a feed-forward foundation model for recovering spatially dense and temporally consistent geometry from video sequences. Built upon a plain transformer architecture without task-specific architectural modifications, ViGeo supports streaming, full-sequence, and long-video inference within a unified model. The key design is dynamic chunking attention, which exposes the model to both bidirectional and causal temporal contexts during training and allows it to adapt its attention pattern at test time without retraining.

To improve supervision quality, we further introduce a completion-based data refinement framework. This framework trains a video depth completion teacher that conditions on sparse and noisy annotations and exploits video/multi-view context to produce dense, temporally coherent, and geometrically reliable training targets. Beyond depth and point maps, ViGeo also predicts surface normals within the same framework. Trained solely on public datasets, ViGeo achieves state-of-the-art performance across online, offline, and long-video depth estimation, surface normal estimation, and video point map estimation.
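The dynamic chunking attention mentioned in the abstract can be pictured with a small attention-mask sketch. This page does not give the exact formulation, so the following is only one plausible reading and not the authors' implementation: frames are grouped into chunks, each frame attends bidirectionally within its own chunk and causally to all earlier chunks. The function name `chunked_attention_mask` and the `chunk_size` parameter are hypothetical.

```python
def chunked_attention_mask(num_frames, chunk_size):
    """Build a frame-level attention mask (assumed scheme, not from the paper).

    mask[i][j] is True when frame i may attend to frame j. Frames in the
    same chunk see each other in both directions; frames also see every
    frame that belongs to an earlier chunk.
    """
    if num_frames <= 0 or chunk_size <= 0:
        raise ValueError("num_frames and chunk_size must be positive")
    mask = []
    for i in range(num_frames):
        chunk_i = i // chunk_size  # index of the chunk containing frame i
        # Allow attention to any frame whose chunk is not later than chunk_i.
        mask.append([(j // chunk_size) <= chunk_i for j in range(num_frames)])
    return mask


# chunk_size >= num_frames: one chunk, fully bidirectional (full-sequence mode).
full_sequence = chunked_attention_mask(4, 4)

# chunk_size == 1: each frame sees only itself and the past (streaming mode).
streaming = chunked_attention_mask(3, 1)

# Intermediate chunk sizes mix both patterns.
mixed = chunked_attention_mask(4, 2)
```

Under this reading, varying `chunk_size` during training would expose the model to both extremes, and picking a value at test time would select the inference mode without changing the weights.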

Benchmark Overview

ViGeo improves relative error across video depth, streaming depth, long-video depth, point-map estimation, and monocular depth benchmarks.

Bar chart comparing ViGeo with previous state of the art on video depth, streaming depth, long-video depth, video point map, and monocular depth.
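The page does not say which relative error is plotted. A common choice in depth estimation is the absolute relative error (AbsRel), the mean of |prediction - ground truth| / ground truth over valid pixels. The sketch below assumes that metric and treats non-positive ground-truth values as invalid; the function name `abs_rel` is hypothetical, and benchmarks typically also align prediction scale before evaluation, which is omitted here.

```python
def abs_rel(pred, gt):
    """Absolute relative error over pixels with positive ground-truth depth.

    pred and gt are flat sequences of depth values of equal length.
    Assumed metric for illustration; not confirmed by the source page.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")
    # Skip pixels without a valid (positive) ground-truth depth.
    errors = [abs(p - g) / g for p, g in zip(pred, gt) if g > 0]
    if not errors:
        raise ValueError("no valid ground-truth depths")
    return sum(errors) / len(errors)


# Two pixels: a perfect prediction and one that is 20% off.
score = abs_rel([2.0, 4.0], [2.0, 5.0])
```

Lower values are better, so an improvement in this metric means a smaller number.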

Video Depth Estimation

Monocular Depth Comparison

BibTeX

@article{yu2026vigeo,
  title={Towards Consistent Video Geometry Estimation},
  author={Yu, Zhu and Gao, Jingnan and Zhang, Runmin and Qiu, Lingteng and Zhao, Zhengyi and Peng, Rui and
          Yan, Yichao and Qiu, Kejie and Zhu, Siyu and Dong, Zilong and Cao, Si-Yuan and Shen, Hui-Liang},
  journal={arXiv preprint arXiv:2605.30060},
  url={https://arxiv.org/abs/2605.30060},
  year={2026}
}
