Stereoscopic fisheye videos from the internet are an untapped source of high-quality 4D data: (1) there are hundreds of thousands of them, and (2) since they're designed to capture immersive VR experiences, they have wide field-of-view stereo imagery with a standardized stereo baseline, precisely the kind of information that's useful in reconstructing pseudo-metric 4D scenes. These videos contain a pretty even spread of what we might see in everyday life—some examples are shown here:
We process these videos with a careful combination of state-of-the-art methods for (1) stereo depth estimation, (2) 2D point tracking, and (3) a stereo structure-from-motion system optimized for dynamic stereo videos. From these methods, we extract per-frame camera poses, per-pixel pseudo-metric depth from stereo disparity, and long-term 2D point tracks.
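For intuition, here is a minimal sketch of the pseudo-metric depth step: with a known stereo baseline, disparity converts to depth via z = f·B/d. The pinhole-style model, focal length, and baseline values below are illustrative assumptions, not the exact camera model or calibration used in the pipeline.

```python
import numpy as np

# Minimal sketch (not the pipeline's code): converting stereo disparity to
# pseudo-metric depth. Focal length and baseline are assumed values for
# illustration only.
FOCAL_PX = 800.0     # assumed focal length in pixels
BASELINE_M = 0.063   # assumed stereo baseline in meters

def disparity_to_depth(disparity_px, focal_px=FOCAL_PX, baseline_m=BASELINE_M):
    """Depth in meters from horizontal disparity in pixels: z = f * B / d."""
    d = np.asarray(disparity_px, dtype=np.float64)
    depth = np.full(d.shape, np.inf)   # zero disparity -> effectively infinite depth
    valid = d > 1e-6
    depth[valid] = focal_px * baseline_m / d[valid]
    return depth

# Example: 10 px of disparity maps to ~5 m under these assumptions.
print(disparity_to_depth(np.array([10.0, 2.5, 0.0])))  # [ 5.04  20.16  inf ]
```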
We then fuse these quantities into 4D reconstructions by lifting the 2D tracks into 3D with their depth and aligning all the scene content by compensating for the known camera motion. The result is temporally consistent, high-quality dynamic reconstructions with long-term correspondence over time.
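As a rough sketch of that lifting step (assuming simple pinhole intrinsics and known camera-to-world poses; the actual pipeline operates on wide field-of-view imagery), a 2D track and its per-frame depths can be unprojected and transformed into a shared world frame like this:

```python
import numpy as np

def lift_track_to_world(track_uv, track_depth, K, cam_to_world):
    """
    track_uv:      (T, 2) pixel coordinates of one 2D track over T frames.
    track_depth:   (T,)   per-frame depth of the tracked pixel, in meters.
    K:             (3, 3) camera intrinsics (pinhole assumption for this sketch).
    cam_to_world:  (T, 4, 4) per-frame camera-to-world poses.
    Returns (T, 3): the 3D trajectory of the track in a shared world frame.
    """
    # Unproject pixels to camera-frame rays, then scale by depth.
    uv1 = np.concatenate([track_uv, np.ones((len(track_uv), 1))], axis=1)   # (T, 3)
    rays = (np.linalg.inv(K) @ uv1.T).T                                     # (T, 3)
    pts_cam = rays * track_depth[:, None]

    # Compensate for camera motion by moving each point into the world frame.
    pts_h = np.concatenate([pts_cam, np.ones((len(pts_cam), 1))], axis=1)   # (T, 4)
    pts_world = np.einsum('tij,tj->ti', cam_to_world, pts_h)[:, :3]
    return pts_world
```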
This gives us pretty reasonable 4D scenes, but there's still work left to be done. The precision of stereo depth predictions can be limited by the content of the scene (e.g., distant objects seen with little parallax have a wide range of plausible depths, so their estimates can vary significantly from one frame to another). This results in jittery or noisy 3D tracks. To compensate, we additionally run an optimization over the 3D trajectories that removes this noise.
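As a simplified stand-in for that optimization (not the exact objective used in the paper), one can keep each lifted 3D track close to its noisy observations while penalizing acceleration, which reduces to a small linear solve per track:

```python
import numpy as np

def smooth_3d_track(noisy_xyz, weights=None, smoothness=10.0):
    """
    Denoise one 3D trajectory by minimizing
        sum_t w_t * ||x_t - y_t||^2  +  smoothness * sum_t ||x_{t+1} - 2 x_t + x_{t-1}||^2,
    i.e., a confidence-weighted data term plus an acceleration penalty.
    noisy_xyz: (T, 3) lifted 3D track (T >= 3); weights: optional (T,) per-frame confidence.
    """
    T = len(noisy_xyz)
    w = np.ones(T) if weights is None else np.asarray(weights, dtype=np.float64)

    # Second-order temporal difference operator D2: (T-2, T).
    D2 = np.zeros((T - 2, T))
    for t in range(T - 2):
        D2[t, t:t + 3] = [1.0, -2.0, 1.0]

    # Normal equations: (W + smoothness * D2^T D2) x = W y, solved per coordinate.
    A = np.diag(w) + smoothness * D2.T @ D2
    b = w[:, None] * noisy_xyz
    return np.linalg.solve(A, b)   # (T, 3) smoothed track
```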
To validate that our dataset is useful for learning about the structure and motion of real-world scenes, we use it to train a variant of DUSt3R. DUSt3R takes as input a pair of images and predicts a 3D point for each pixel in both images (in a shared coordinate frame), but it fails when the scene is dynamic, e.g., when there is scene motion between the two images. This is largely because DUSt3R is trained on static data (there aren't many sources of ground-truth dynamic 3D scenes). We extend DUSt3R by adding a notion of time, yielding a model we call DynaDUSt3R, and train it on our Stereo4D dataset. Given a pair of frames from any real-world video, DynaDUSt3R predicts per-pixel 3D points for each frame, as well as the 3D motion trajectories that connect them in time.
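To make those inputs and outputs concrete, here is a hypothetical sketch of that interface; the names and array shapes are assumptions for illustration and are not the released DynaDUSt3R code or API.

```python
from dataclasses import dataclass
import numpy as np

# Hypothetical sketch of the predicted quantities described above; names and
# shapes are assumptions, not the actual DynaDUSt3R implementation.
@dataclass
class ShapeAndMotion:
    points1: np.ndarray   # (H, W, 3) 3D point for every pixel of frame 1
    points2: np.ndarray   # (H, W, 3) 3D point for every pixel of frame 2
                          # (both in a shared coordinate frame)
    tracks: np.ndarray    # (H, W, S, 3) 3D motion trajectory per frame-1 pixel,
                          # sampled at S timesteps between the two frames

def start_consistency(pred: ShapeAndMotion) -> float:
    """Mean distance between each trajectory's first sample and the frame-1
    point it should start from; a simple sanity check on shape vs. motion."""
    return float(np.linalg.norm(pred.tracks[..., 0, :] - pred.points1, axis=-1).mean())
```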
Click on the different examples below to see predicted shape and motion from various image pairs.
Input image pair | Reconstruction
---|---
@article{jin2024stereo4d,
  title={Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos},
  author={Jin, Linyi and Tucker, Richard and Li, Zhengqi and Fouhey, David and Snavely, Noah and Holynski, Aleksander},
  journal={arXiv preprint},
  year={2024},
}