Ten papers by CSE researchers at CVPR 2025

CSE-affiliated authors are presenting new research in the area of computer vision, from humanoid robotics to 3D reconstruction.
Six color map visualizations comparing how the CLIP and OAK models categorize shape, color, and texture in images. OAK's categorizations in each case are much more clearly defined than CLIP's.
Visualization of CLIP visual features and nearest neighbor examples from CLIP (row 1) and OAK (row 2) on shape, color, and texture. From “Open Ad-hoc Categorization with Contextualized Feature Learning” by Zilin Wang, Sangwoo Mo, Stella X. Yu, and coauthors.

Researchers affiliated with Computer Science and Engineering at the University of Michigan are presenting ten papers in the main track of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). CVPR is the premier international conference for new research in computer vision and related topics. This year's event is taking place June 11-15 in Nashville, TN.

Topics covered by CSE researchers at the conference include humanoid robotics, cross-platform mobile agents, large-scale 3D data for language models, multimodal graph learning, dynamic camera pose estimation, and more. The papers being presented are listed below, with the names of authors affiliated with CSE in bold:

Let Humanoids Hike! Integrative Skill Development on Complex Trails
Kwan-Yee Lin, Stella X. Yu

Abstract: Hiking on complex trails demands balance, agility, and adaptive decision-making over unpredictable terrain. Current humanoid research remains fragmented and inadequate for hiking: locomotion focuses on motor skills without long-term goals or situational awareness, while semantic navigation overlooks real-world embodiment and local terrain variability. We propose training humanoids to hike on complex trails, driving integrative skill development across visual perception, decision making, and motor execution. We develop a learning framework, LEGO-H, that enables a vision-equipped humanoid robot to hike complex trails autonomously. We introduce two technical innovations: 1) A temporal vision transformer variant, tailored into a Hierarchical Reinforcement Learning framework, anticipates future local goals to guide movement, seamlessly integrating locomotion with goal-directed navigation. 2) Latent representations of joint movement patterns, combined with hierarchical metric learning that enhances the Privileged Learning scheme, enable smooth policy transfer from privileged training to onboard execution. These components allow LEGO-H to handle diverse physical and environmental challenges without relying on predefined motion patterns. Experiments across varied simulated trails and robot morphologies highlight LEGO-H's versatility and robustness, positioning hiking as a compelling testbed for embodied autonomy and LEGO-H as a baseline for future humanoid development.

A grayscale rendering of a humanoid robot navigating uneven terrain. Different parts of the robot’s path are labeled, including goal anticipation, trail end, and the robot’s visual input. A side bar of five smaller images shows the robot performing diverse motor skills on different terrains.
The humanoid robot (H1), a) equipped with vision, learns to b) anticipate near-future local goals to guide locomotion along the trail autonomously. Bubble size (large to small) indicates anticipated goal direction; color shows temporal order (orange to green). Left: The authors' LEGO-H framework generalizes to different humanoid robots (e.g., G1, a smaller robot), which adaptively c) develop diverse motor skills and d) form embodied path exploration strategies to hike trails with varied terrains and obstacles.
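
To give a flavor of the hierarchical design the abstract describes, the sketch below pairs a high-level module that anticipates a local goal from recent visual features with a low-level policy that turns proprioception and that goal into joint targets. All module sizes, dimensions, and interfaces here are hypothetical placeholders chosen for readability, not the authors' LEGO-H implementation.

```python
# Illustrative hierarchical "anticipate a local goal, then act" loop.
# NOT the authors' LEGO-H implementation; sizes and interfaces are assumptions.
import torch
import torch.nn as nn

class GoalAnticipator(nn.Module):
    """High level: summarizes recent visual features and proposes a local goal."""
    def __init__(self, feat_dim=256, goal_dim=3, nhead=4, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=nhead, batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.goal_head = nn.Linear(feat_dim, goal_dim)  # e.g., a local goal direction

    def forward(self, visual_feats):            # (B, T, feat_dim)
        h = self.temporal_encoder(visual_feats)
        return self.goal_head(h[:, -1])          # (B, goal_dim)

class LocomotionPolicy(nn.Module):
    """Low level: maps proprioception plus the current local goal to joint targets."""
    def __init__(self, proprio_dim=48, goal_dim=3, act_dim=19):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(proprio_dim + goal_dim, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, proprio, goal):
        return self.net(torch.cat([proprio, goal], dim=-1))

# One step of the hierarchical loop on dummy data.
anticipator, policy = GoalAnticipator(), LocomotionPolicy()
visual_feats = torch.randn(1, 8, 256)   # features from the last 8 frames
proprio = torch.randn(1, 48)            # joint angles, velocities, etc.
goal = anticipator(visual_feats)        # refreshed every few control steps
action = policy(proprio, goal)          # joint position targets
```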

Scalable Video-to-Dataset Generation for Cross-Platform Mobile Agents
Yunseok Jang, Yeda Song, Sungryull Sohn, Lajanugen Logeswaran, Tiange Luo, Dong-Ki Kim, GyungHoon Bae, Honglak Lee

Abstract: Recent advancements in Large Language Models (LLMs) and Vision-Language Models (VLMs) have sparked significant interest in developing GUI visual agents. We introduce MONDAY (Mobile OS Navigation Task Dataset for Agents from YouTube), a large-scale dataset of 313K annotated frames from 20K instructional videos capturing diverse real-world mobile OS navigation across multiple platforms. Models that include MONDAY in their pre-training phases demonstrate robust cross-platform generalization capabilities, consistently outperforming models trained on existing single-OS datasets while achieving an average performance gain of 18.11%p on an unseen mobile OS platform. To enable continuous dataset expansion as mobile platforms evolve, we present an automated framework that leverages publicly available video content to create comprehensive task datasets without manual annotation. Our framework comprises robust OCR-based scene detection (95.04% F1 score), near-perfect UI element detection (99.87% hit ratio), and novel multi-step action identification to extract reliable action sequences across diverse interface configurations. We contribute both the MONDAY dataset and our automated collection framework to facilitate future research in mobile OS navigation.
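
As a rough illustration of the OCR-based scene detection step mentioned in the abstract, the snippet below flags frames where the on-screen text changes sharply, assuming OCR text has already been extracted per frame by some OCR engine. The similarity measure and threshold are stand-ins for illustration, not components of the MONDAY framework.

```python
# Minimal sketch of OCR-text-based scene-change detection over video frames.
from difflib import SequenceMatcher

def text_similarity(a: str, b: str) -> float:
    """Ratio in [0, 1] of how much on-screen text two frames share."""
    return SequenceMatcher(None, a, b).ratio()

def detect_scene_boundaries(frame_texts, threshold=0.6):
    """Return frame indices where the visible UI text changes sharply,
    a rough proxy for a new screen in a mobile OS navigation video."""
    boundaries = []
    for i in range(1, len(frame_texts)):
        if text_similarity(frame_texts[i - 1], frame_texts[i]) < threshold:
            boundaries.append(i)
    return boundaries

# Example: the third frame switches from the home screen to the Settings app.
frames = [
    "Messages Photos Camera Settings",
    "Messages Photos Camera Settings",
    "Settings Wi-Fi Bluetooth Display Sounds",
]
print(detect_scene_boundaries(frames))  # [2]
```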

Open Ad-hoc Categorization with Contextualized Feature Learning
Zilin Wang, Sangwoo Mo, Stella X. Yu, Sima Behpour, Liu Ren

Abstract: Unlike common categories for plants and animals, ad-hoc categories such as things to sell at a garage sale are created to help people achieve a certain task. Likewise, AI agents need to adaptively categorize visual scenes in response to changing tasks. We thus study open ad-hoc categorization, where we learn to infer novel concepts and name images according to a varying categorization purpose, a few labeled exemplars, and many unlabeled images. We develop a simple method that combines top-down text guidance (CLIP) with bottom-up image clustering (GCD) to learn contextualized visual features and align visual clusters with CLIP semantics, enabling predictions for both known and novel classes. Benchmarked on the multi-label datasets Stanford and Clevr-4, our method, OAK, significantly outperforms baselines in providing accurate predictions across contexts and identifying novel concepts, e.g., it achieves 87.4% novel accuracy on Stanford Mood, surpassing CLIP and GCD by over 50%. OAK offers interpretable saliency maps, focusing on hands, faces, and backgrounds for Action, Mood, and Location contexts, respectively.

An image of a garage sale with various household items arranged on a table and in a yard. Different items are outlined in various colors demonstrating OAK’s ability to identify items using a labeled exemplar, semantic categories, and visual clusters.
The authors study open ad-hoc categorization (OAK), such as identifying things to sell at a garage sale to achieve a specific goal (selling unwanted items). Given the context garage sale and labeled exemplars such as shoes, the model needs to recognize all items in the scene that can be sold at the garage sale, including novel ones. Supervised models like CLIP focus on 1) closed-world generalization, recognizing other shoes; 2) novel semantic categories can be discovered by contextual expansion from shoes to hats; and unsupervised methods like GCD discover 3) novel visual clusters, identifying suitcases.
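
The general recipe the abstract outlines, bottom-up clustering of image features aligned top-down with context-specific text embeddings, can be sketched as follows. The features are assumed to be precomputed (for example by CLIP), and the clustering, threshold, and naming rule below are illustrative simplifications rather than OAK's actual training procedure.

```python
# Sketch: cluster image embeddings, then name each cluster by its most similar
# context label; clusters far from every label are flagged as novel.
import numpy as np
from sklearn.cluster import KMeans

def align_clusters_with_labels(image_feats, text_feats, label_names,
                               n_clusters=10, novel_threshold=0.25):
    image_feats = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    text_feats = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)

    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(image_feats)
    centroids = km.cluster_centers_
    centroids = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)

    sims = centroids @ text_feats.T                  # (n_clusters, n_labels)
    names = []
    for row in sims:
        best = int(np.argmax(row))
        names.append(label_names[best] if row[best] >= novel_threshold else "novel")
    return km.labels_, names

# Dummy example with random "CLIP-like" features for 200 images and 3 context labels.
rng = np.random.default_rng(0)
img = rng.normal(size=(200, 512))
txt = rng.normal(size=(3, 512))
labels, cluster_names = align_clusters_with_labels(img, txt, ["shoes", "hats", "suitcases"])
```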

3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination
Jianing Yang, Xuweiyi Chen, Nikhil Madaan, Madhavan Iyengar, Shengyi Qian, David Fouhey, Joyce Chai

Abstract: The integration of language and 3D perception is crucial for embodied agents and robots that comprehend and interact with the physical world. While large language models (LLMs) have demonstrated impressive language understanding and generation capabilities, their adaptation to 3D environments (3D-LLMs) remains in its early stages. A primary challenge is a lack of large-scale datasets with dense grounding between language and 3D scenes. We introduce 3D-GRAND, a pioneering large-scale dataset comprising 40,087 household scenes paired with 6.2 million densely-grounded scene-language instructions. Our results show that instruction tuning with 3D-GRAND significantly enhances grounding capabilities and reduces hallucinations in 3D-LLMs. As part of our contributions, we propose a comprehensive benchmark 3D-POPE to systematically evaluate hallucination in 3D-LLMs, enabling fair comparisons of models. Our experiments highlight a scaling effect between dataset size and 3D-LLM performance, emphasizing the importance of large-scale 3D-text datasets for embodied AI research. Our results demonstrate early signals for effective sim-to-real transfer, indicating that models trained on large synthetic data can perform well on real-world 3D scans. Through 3D-GRAND and 3D-POPE, we aim to equip the embodied AI community with resources and insights to lead to more reliable and better-grounded 3D-LLMs.
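
To illustrate what a POPE-style hallucination probe looks like in practice, the snippet below scores yes/no answers about object existence against ground-truth object lists. The actual 3D-POPE protocol (question construction, negative sampling, metrics) may differ; this only conveys the idea.

```python
# Rough sketch of an existence-probe evaluation for object hallucination.
def evaluate_existence_probes(model_answers, ground_truth_objects, probes):
    """model_answers: dict mapping (scene_id, object) -> "yes"/"no"
    ground_truth_objects: dict mapping scene_id -> set of objects present
    probes: list of (scene_id, object) questions asked of the model."""
    tp = fp = tn = fn = 0
    for scene_id, obj in probes:
        present = obj in ground_truth_objects[scene_id]
        said_yes = model_answers[(scene_id, obj)] == "yes"
        if said_yes and present:
            tp += 1
        elif said_yes and not present:
            fp += 1                    # hallucinated object
        elif not said_yes and present:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(probes)
    return {"precision": precision, "recall": recall, "accuracy": accuracy}

# Tiny worked example: the model hallucinates a sofa in scene "s1".
gt = {"s1": {"chair", "table"}}
probes = [("s1", "chair"), ("s1", "sofa"), ("s1", "table")]
answers = {("s1", "chair"): "yes", ("s1", "sofa"): "yes", ("s1", "table"): "yes"}
print(evaluate_existence_probes(answers, gt, probes))
```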

Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass
Jianing Yang, Alexander Sax, Kevin Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, Matt Feiszli

Abstract: Multi-view 3D reconstruction remains a core challenge in computer vision, particularly in applications requiring accurate and scalable representations across diverse perspectives. Current leading methods such as DUSt3R employ a fundamentally pairwise approach, processing images in pairs and necessitating costly global alignment procedures to reconstruct from multiple views. In this work, we propose Fast 3D Reconstruction (Fast3R), a novel multi-view generalization to DUSt3R that achieves efficient and scalable 3D reconstruction by processing many views in parallel. Fast3R’s Transformer-based architecture forwards N images in a single forward pass, bypassing the need for iterative alignment. Through extensive experiments on camera pose estimation and 3D reconstruction, Fast3R demonstrates state-of-the-art performance, with significant improvements in inference speed and reduced error accumulation. These results establish Fast3R as a robust alternative for multi-view applications, offering enhanced scalability without compromising reconstruction accuracy.

Eight example images demonstrating Fast3R's reconstruction of 3D scenes, including a laptop keyboard, an apple, a piece of cake on a plate, a yellow “wet floor” sign in a building lobby, a living room, a teddy bear, and a conference room.
Qualitative examples of Fast3R’s output.
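
The core idea of processing all views in one forward pass, rather than aligning pairwise reconstructions, can be sketched with a toy model that concatenates patch tokens from every image and runs them through a single transformer. The dimensions, view embedding, and per-patch pointmap head below are placeholders, not Fast3R's actual architecture.

```python
# Toy "all views in one forward pass" pointmap predictor.
import torch
import torch.nn as nn

class MultiViewPointmapNet(nn.Module):
    def __init__(self, patch=8, dim=192, nhead=6, depth=4, max_views=64):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.view_embed = nn.Embedding(max_views, dim)   # distinguishes tokens by view
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, 3)                    # a 3D point (x, y, z) per patch

    def forward(self, images):                           # (B, N, 3, H, W)
        B, N, C, H, W = images.shape
        tokens = self.patch_embed(images.flatten(0, 1))  # (B*N, dim, H/p, W/p)
        tokens = tokens.flatten(2).transpose(1, 2)       # (B*N, P, dim)
        P = tokens.shape[1]
        tokens = tokens.reshape(B, N * P, -1)            # all views in one sequence
        view_ids = torch.arange(N).repeat_interleave(P)  # (N*P,)
        tokens = tokens + self.view_embed(view_ids)
        tokens = self.encoder(tokens)                    # single forward pass over all views
        points = self.head(tokens)                       # (B, N*P, 3)
        return points.reshape(B, N, P, 3)

# 12 views of a 64x64 scene in one pass; no pairwise global alignment step.
net = MultiViewPointmapNet()
out = net(torch.randn(2, 12, 3, 64, 64))
print(out.shape)  # torch.Size([2, 12, 64, 3])
```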

Mosaic of Modalities: A Comprehensive Benchmark for Multimodal Graph Learning
Jing Zhu, Yuhang Zhou, Shengyi Qian, Zhongmou He, Tong Zhao, Neil Shah, Danai Koutra

Abstract: Graph machine learning has made significant strides in recent years, yet the integration of visual information with graph structure and its potential for improving performance in downstream tasks remains an underexplored area. To address this critical gap, we introduce the Multimodal Graph Benchmark (MM-GRAPH), a pioneering benchmark that incorporates both visual and textual information into graph learning tasks. MM-GRAPH extends beyond existing text-attributed graph benchmarks, offering a more comprehensive evaluation framework for multimodal graph learning. Our benchmark comprises seven diverse datasets of varying scales (ranging from thousands to millions of edges), designed to assess algorithms across different tasks in real-world scenarios. These datasets feature rich multimodal node attributes, including visual data, which enables a more holistic evaluation of various graph learning frameworks in complex, multimodal environments. To support advancements in this emerging field, we provide an extensive empirical study on various graph learning frameworks when presented with features from multiple modalities, particularly emphasizing the impact of visual information. This study offers valuable insights into the challenges and opportunities of integrating visual data into graph learning.
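
A minimal example of the kind of setup the benchmark evaluates, nodes carrying both text and visual embeddings that are fused before message passing, is sketched below. The feature sizes and the simple concatenate-then-average scheme are illustrative choices, not anything prescribed by MM-GRAPH.

```python
# Sketch: fuse per-node text and visual features, then do one step of
# mean aggregation over neighbors.
import numpy as np

def fuse_and_propagate(text_feats, visual_feats, edges, num_nodes):
    """Concatenate per-node text and visual features, then average each node's
    fused feature with its neighbors' (one round of message passing)."""
    fused = np.concatenate([text_feats, visual_feats], axis=1)  # (N, d_t + d_v)
    out = fused.copy()
    degree = np.ones(num_nodes)                 # count the node itself
    for u, v in edges:                          # undirected edge list
        out[u] += fused[v]; degree[u] += 1
        out[v] += fused[u]; degree[v] += 1
    return out / degree[:, None]

# 4 nodes with 8-d text and 8-d visual features on a small path graph.
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
vis = rng.normal(size=(4, 8))
h = fuse_and_propagate(text, vis, edges=[(0, 1), (1, 2), (2, 3)], num_nodes=4)
print(h.shape)  # (4, 16)
```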

SIR-DIFF: Sparse Image Sets Restoration with Multi-View Diffusion Model
Yucheng Mao, Boyang Wang, Nilesh Kulkarni, Jeong Joon Park

Abstract: The computer vision community has developed numerous techniques for digitally restoring true scene information from single-view degraded photographs, an important yet extremely ill-posed task. In this work, we tackle image restoration from a different perspective by jointly denoising multiple photographs of the same scene. Our core hypothesis is that degraded images capturing a shared scene contain complementary information that, when combined, better constrains the restoration problem. To this end, we implement a powerful multi-view diffusion model that jointly generates uncorrupted views by extracting rich information from multi-view relationships. Our experiments show that our multi-view approach outperforms existing single-view image and even video-based methods on image deblurring and super-resolution tasks. Critically, our model is trained to output 3D consistent images, making it a promising tool for applications requiring robust multi-view integration, such as 3D reconstruction or pose estimation.

Four grids of images showing how SIR-DIFF improves 3D reconstructions. The top two grids show blurry photos of a classroom; the bottom two show blurry photos of a video game controller. Both are restored to produce clear images and 3D reconstructions.
Sparse View Image Restoration. Our diffusion model takes multi-view images and jointly enhances their visual quality while maintaining 3D consistency. (Top) Four motion-blurred input images are processed by our method, resulting in sharp outputs that significantly outperform single-view restoration methods as shown in the corresponding purple boxes. (Bottom) Our method can consistently restore multi-view images (4 out of 50 shown), leading to accurate 3D reconstructions.
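
The sketch below shows one training step for a conditional denoising diffusion model that treats several views jointly: the clean views are noised with a standard DDPM-style schedule, and a network predicts that noise given the degraded views as conditioning. The tiny convolutional network and schedule are placeholders for illustration, not the SIR-DIFF model.

```python
# Toy training step for conditional multi-view denoising diffusion.
import torch
import torch.nn as nn
import torch.nn.functional as F

V, C, H, W = 4, 3, 32, 32              # 4 views of the same scene
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

# Predicts noise for all views jointly; input is [noisy clean | degraded] views.
denoiser = nn.Sequential(
    nn.Conv2d(2 * V * C, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, V * C, 3, padding=1),
)

clean = torch.rand(1, V * C, H, W)                 # sharp views, stacked on channels
degraded = clean + 0.3 * torch.randn_like(clean)   # e.g., blurred or noisy observations

t = torch.randint(0, T, (1,))
noise = torch.randn_like(clean)
a = alpha_bar[t].view(1, 1, 1, 1)
noisy = a.sqrt() * clean + (1 - a).sqrt() * noise  # forward diffusion at step t

pred_noise = denoiser(torch.cat([noisy, degraded], dim=1))
loss = F.mse_loss(pred_noise, noise)               # standard epsilon-prediction objective
loss.backward()
```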

3D-MVP: 3D Multiview Pretraining for Manipulation
Shengyi Qian, Kaichun Mo, Valts Blukis, David Fouhey, Dieter Fox, Ankit Goyal

Abstract: Recent works have shown that visual pretraining on egocentric datasets using masked autoencoders (MAE) can improve generalization for downstream robotics tasks. However, these approaches pretrain only on 2D images, while many robotics applications require 3D scene understanding. In this work, we propose 3D-MVP, a novel approach for 3D Multi-View Pretraining using masked autoencoders. We leverage Robotic View Transformer (RVT), which uses a multi-view transformer to understand the 3D scene and predict gripper pose actions. We split RVT’s multi-view transformer into visual encoder and action decoder, and pretrain its visual encoder using masked autoencoding on large-scale 3D datasets such as Objaverse. We evaluate 3D-MVP on a suite of virtual robot manipulation tasks and demonstrate improved performance over baselines. Our results suggest that 3D-aware pretraining is a promising approach to improve generalization of vision-based robotic manipulation policies.
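
For readers unfamiliar with masked autoencoding, the sketch below shows the basic pretraining step: mask most patch tokens, encode only the visible ones, and reconstruct the masked patches. It is a single-view toy that omits positional embeddings for brevity, so it simplifies away much of what 3D-MVP does with RVT's multi-view encoder and large 3D datasets.

```python
# Minimal MAE-style masked-autoencoding step on dummy patch tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, num_patches, mask_ratio = 128, 64, 0.75
enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
dec_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
decoder = nn.TransformerEncoder(dec_layer, num_layers=1)
to_pixels = nn.Linear(dim, 3 * 8 * 8)             # reconstruct an 8x8 RGB patch
mask_token = nn.Parameter(torch.zeros(1, 1, dim))

patch_tokens = torch.randn(2, num_patches, dim)   # embedded patches (B, P, dim)
target_pixels = torch.rand(2, num_patches, 3 * 8 * 8)

# Randomly keep 25% of the patches, as in masked autoencoding.
num_keep = int(num_patches * (1 - mask_ratio))
perm = torch.rand(2, num_patches).argsort(dim=1)
keep, masked = perm[:, :num_keep], perm[:, num_keep:]

visible = torch.gather(patch_tokens, 1, keep.unsqueeze(-1).expand(-1, -1, dim))
encoded = encoder(visible)

# The decoder sees encoded visible tokens plus learned mask tokens for the rest.
full = torch.cat([encoded, mask_token.expand(2, masked.shape[1], dim)], dim=1)
decoded = decoder(full)[:, num_keep:]             # predictions at the masked slots
pred = to_pixels(decoded)
target = torch.gather(target_pixels, 1, masked.unsqueeze(-1).expand(-1, -1, 3 * 8 * 8))
loss = F.mse_loss(pred, target)                   # loss computed only on masked patches
loss.backward()
```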

Dynamic Camera Poses and Where to Find Them
Chris Rockwell, Joseph Tung, Tsung-Yi Lin, Ming-Yu Liu, David Fouhey, Chen-Hsuan Lin

Abstract: Annotating camera poses on dynamic Internet videos at scale is critical for advancing fields like realistic video generation and simulation. However, collecting such a dataset is difficult, as most Internet videos are unsuitable for pose estimation. Furthermore, annotating dynamic Internet videos presents significant challenges even for state-of-the-art methods. In this paper, we introduce DynPose-100K, a large-scale dataset of dynamic Internet videos annotated with camera poses. Our collection pipeline addresses filtering using a carefully combined set of task-specific and generalist models. For pose estimation, we combine the latest techniques of point tracking, dynamic masking, and structure-from-motion to achieve improvements over the state-of-the-art approaches. Our analysis and experiments demonstrate that DynPose-100K is both large-scale and diverse across several key attributes, opening up avenues for advancements in various downstream applications.
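
One ingredient the pipeline combines, estimating relative camera pose only from correspondences on static content, can be illustrated with generic OpenCV routines on synthetic data, as below. DynPose-100K's actual filtering, tracking, and structure-from-motion components are considerably more involved than this sketch.

```python
# Sketch: discard correspondences on moving objects, then estimate relative pose.
import cv2
import numpy as np

def relative_pose_static_only(pts1, pts2, is_dynamic, K):
    """pts1, pts2: (N, 2) matched pixel coordinates in two frames.
    is_dynamic: (N,) boolean mask from a dynamic-object segmentation/tracker."""
    static = ~is_dynamic
    p1, p2 = pts1[static].astype(np.float64), pts2[static].astype(np.float64)
    E, inliers = cv2.findEssentialMat(p1, p2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, p1, p2, K, mask=inliers)
    return R, t   # rotation and unit-scale translation between the two frames

# Synthetic example: project random 3D points into two cameras.
rng = np.random.default_rng(0)
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
X = rng.uniform([-2, -2, 4], [2, 2, 8], size=(120, 3))        # 3D scene points
R_true, t_true = np.eye(3), np.array([[0.2], [0.0], [0.0]])   # relative camera motion
proj1 = (K @ X.T).T
proj2 = (K @ (R_true @ X.T + t_true)).T
pts1 = proj1[:, :2] / proj1[:, 2:]
pts2 = proj2[:, :2] / proj2[:, 2:]

is_dynamic = np.zeros(120, dtype=bool)
is_dynamic[100:] = True                           # pretend a tracker flagged these points
pts2[100:] += rng.uniform(5, 15, size=(20, 2))    # moving objects violate epipolar geometry
R, t = relative_pose_static_only(pts1, pts2, is_dynamic, K)
```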

Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos
Linyi Jin, Richard Tucker, Zhengqi Li, David Fouhey, Noah Snavely, Aleksander Holynski

Abstract: Learning to understand dynamic 3D scenes from imagery is crucial for applications ranging from robotics to scene reconstruction. Yet, unlike other problems where large-scale supervised training has enabled rapid progress, directly supervising methods for recovering 3D motion remains challenging due to the fundamental difficulty of obtaining ground truth annotations. We present a system for mining high-quality 4D reconstructions from internet stereoscopic, wide-angle videos. Our system fuses and filters the outputs of camera pose estimation, stereo depth estimation, and temporal tracking methods into high-quality dynamic 3D reconstructions. We use this method to generate large-scale data in the form of world-consistent, pseudo-metric 3D point clouds with long-term motion trajectories. We demonstrate the utility of this data by training a variant of DUSt3R to predict structure and 3D motion from real-world image pairs, showing that training on our reconstructed data enables generalization to diverse real-world scenes.
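
The final fusion step described in the abstract comes down to geometry: with per-frame depth, intrinsics, and camera-to-world poses, tracked pixels unproject into world-consistent 3D points, and a 2D track becomes a 3D motion trajectory. The snippet below works through that step on synthetic inputs; in the paper, these quantities are estimated from stereo video by the mining pipeline.

```python
# Sketch: lift a tracked pixel to a world-space 3D trajectory.
import numpy as np

def unproject(u, v, depth, K, cam_to_world):
    """Lift pixel (u, v) with depth (in meters) to a world-space 3D point."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    point_cam = ray * depth                       # point in camera coordinates
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return R @ point_cam + t

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])

# A pixel tracked over three frames, with its depth and the camera pose per frame.
track_uv = [(300, 250), (305, 248), (311, 247)]
track_depth = [2.0, 2.05, 2.1]
poses = [np.eye(4) for _ in range(3)]
for i, pose in enumerate(poses):
    pose[0, 3] = 0.1 * i                          # camera translates along x

trajectory = np.array([
    unproject(u, v, d, K, P)
    for (u, v), d, P in zip(track_uv, track_depth, poses)
])
print(trajectory.shape)   # (3, 3): the point's world-space path over time
```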

CVPR Workshops

In addition to the above papers appearing in the main track of the conference, CSE researchers are involved in a variety of workshops co-located with CVPR this year.

Rada Mihalcea is an invited speaker at the Demographic Diversity in Computer Vision workshop and is giving a talk titled “Bridging the Digital Divide in Language-Vision Models”; Joyce Chai is among the lead organizers of the 3D-LLM/VLA workshop on Bridging Language, Vision and Action in 3D Environments; and CSE researchers are presenting the following paper at the Computer Vision in the Wild workshop:

Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation
Ziqiao Ma, Jing Ding, Xuejun Zhang, Dezhi Luo, Jiahe Ding, Sihan Xu, Yuchen Huang, Run Peng, Joyce Chai

Abstract: Referring Expression Generation (REG) is a core task for evaluating the pragmatic competence of vision-language systems, requiring not only accurate semantic grounding but also adherence to principles of cooperative communication (Grice, 1975). However, current evaluations of vision-language models (VLMs) often overlook the pragmatic dimension, reducing REG to a region-based captioning task and neglecting Gricean maxims. In this work, we revisit REG from a pragmatic perspective, introducing a new dataset (RefOI) of 1.5k images annotated with both written and spoken referring expressions. Through a systematic evaluation of state-of-the-art VLMs, we identify three key failures of pragmatic competence: (1) failure to uniquely identify the referent, (2) inclusion of excessive or irrelevant information, and (3) misalignment with human pragmatic preference, such as the underuse of minimal spatial cues. We also show that standard automatic evaluations fail to capture these pragmatic violations, reinforcing superficial cues rather than genuine referential success. Our findings call for a renewed focus on pragmatically informed models and evaluation frameworks that align with real human communication.
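
Two of the failure modes listed above, failing to uniquely identify the referent and including excessive information, can be checked mechanically once a scene is represented as objects with attributes. The toy check below does exactly that; it is a simplified stand-in for the human and automatic evaluations used in the paper, and the scene representation is hypothetical.

```python
# Toy pragmatic checks: uniqueness of the referent and redundant attributes.
def analyze_expression(expression_attrs, objects, target_id):
    """expression_attrs: set of attribute strings mentioned in the expression.
    objects: dict of object_id -> set of ground-truth attributes."""
    matches = [oid for oid, attrs in objects.items() if expression_attrs <= attrs]
    unique = matches == [target_id]

    # An attribute is redundant if dropping it still identifies the target uniquely.
    redundant = set()
    for attr in expression_attrs:
        reduced = expression_attrs - {attr}
        still_unique = [oid for oid, attrs in objects.items() if reduced <= attrs]
        if still_unique == [target_id]:
            redundant.add(attr)
    return {"uniquely_identifies": unique, "redundant_attributes": redundant}

scene = {
    "obj1": {"dog", "black", "sitting"},
    "obj2": {"dog", "brown", "running"},
    "obj3": {"cat", "black", "sitting"},
}
# "the black sitting dog" is unique, but either "black" or "sitting" alone would suffice.
print(analyze_expression({"dog", "black", "sitting"}, scene, target_id="obj1"))
```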