SpatialTrackerV2 is a novel framework for 3D point tracking that estimates world-space 3D trajectories for arbitrary 2D pixels in monocular videos. Unlike previous methods that rely on offline depth and pose estimators, our approach decomposes 3D motion into scene geometry, camera ego-motion, and fine-grained point-wise motion, all within a fully differentiable, end-to-end architecture. This unified design enables scalable training across diverse data sources, including synthetic sequences, posed RGB-D videos, and unlabeled in-the-wild footage. By jointly learning geometry and motion, SpatialTrackerV2 outperforms all prior 3D tracking methods by a clear margin, while also delivering strong results in 2D tracking and dynamic 3D reconstruction.
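To make the decomposition concrete, the sketch below shows how a world-space 3D trajectory can be composed from the three factors named above: scene geometry (per-point depth), camera ego-motion (per-frame poses), and a residual point-wise motion term. This is a minimal illustration, not the released implementation; all tensor names and shapes are assumptions.

```python
# Minimal sketch of the motion decomposition: world-space trajectory =
# pose(t) applied to (depth-unprojected pixel + point-wise motion residual).
# Tensor names and shapes are illustrative assumptions, not the released API.
import torch

def compose_world_trajectory(uv, depth, intrinsics, cam_to_world, point_motion):
    """
    uv           : (T, N, 2)  2D pixel positions of N tracked points over T frames
    depth        : (T, N)     per-point depth sampled from predicted depth maps
    intrinsics   : (3, 3)     camera intrinsic matrix K
    cam_to_world : (T, 4, 4)  per-frame camera-to-world poses (ego-motion)
    point_motion : (T, N, 3)  fine-grained point-wise motion residual (camera frame)
    returns      : (T, N, 3)  world-space 3D trajectories
    """
    # Back-project pixels into the camera frame using depth (scene geometry).
    ones = torch.ones_like(uv[..., :1])
    pix_h = torch.cat([uv, ones], dim=-1)                       # (T, N, 3)
    cam_pts = (torch.linalg.inv(intrinsics) @ pix_h.unsqueeze(-1)).squeeze(-1)
    cam_pts = cam_pts * depth.unsqueeze(-1)                     # scale rays by depth

    # Add the fine-grained point-wise motion term.
    cam_pts = cam_pts + point_motion

    # Lift into world space with the per-frame camera pose (ego-motion).
    R = cam_to_world[:, None, :3, :3]                           # (T, 1, 3, 3)
    t = cam_to_world[:, None, :3, 3]                            # (T, 1, 3)
    world_pts = (R @ cam_pts.unsqueeze(-1)).squeeze(-1) + t
    return world_pts
```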
Our method consists of two main components. First, a VGGT-style network extracts high-level semantic features from the input video to initialize consistent scene geometry and camera motion. Then, a track refiner iteratively updates all 4D attributes, including 2D and 3D point tracks, trajectory-wise dynamic probabilities, and camera poses.
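At the interface level, the two-stage design can be sketched as below: a feed-forward frontend produces initial geometry and poses, and a refiner then iterates over all 4D attributes. The class, method names, and signatures here are hypothetical placeholders for illustration only.

```python
# Schematic sketch of the two-stage pipeline (hypothetical names, not the released code).
import torch.nn as nn

class SpatialTrackerV2Sketch(nn.Module):
    def __init__(self, frontend: nn.Module, refiner: nn.Module, num_iters: int = 4):
        super().__init__()
        self.frontend = frontend    # VGGT-style network: features, geometry, camera motion
        self.refiner = refiner      # iterative track refiner
        self.num_iters = num_iters  # number of refinement iterations (assumed)

    def forward(self, video, query_pixels):
        # Stage 1: extract high-level semantic features and initialize
        # consistent scene geometry and camera motion.
        feats, depth_maps, cam_poses = self.frontend(video)

        # Stage 2: iteratively update all 4D attributes: 2D/3D point tracks,
        # trajectory-wise dynamic probabilities, and camera poses.
        tracks_2d, tracks_3d, dyn_prob = self.refiner.init_tracks(query_pixels, depth_maps)
        for _ in range(self.num_iters):
            tracks_2d, tracks_3d, dyn_prob, cam_poses = self.refiner(
                feats, tracks_2d, tracks_3d, dyn_prob, cam_poses
            )
        return tracks_2d, tracks_3d, dyn_prob, cam_poses
```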
We present qualitative results of SpatialTrackerV2 across diverse scenarios. All results are generated by our model in a purely feed-forward manner, with inference taking only 10–20 seconds per sequence.
"Passing a basketball in Pstudio"
"A turtle swimming in the sea"
"The protagonist in Mission: Impossible rides a motorcycle."
"The dancer is performing a breakdance."
@inproceedings{xiao2025spatialtracker,
  title     = {SpatialTrackerV2: 3D Point Tracking Made Easy},
  author    = {Xiao, Yuxi and Wang, Jianyuan and Xue, Nan and Karaev, Nikita and Makarov, Iurii and Kang, Bingyi and Zhu, Xin and Bao, Hujun and Shen, Yujun and Zhou, Xiaowei},
  booktitle = {ICCV},
  year      = {2025}
}