Yuhao Su

Ph.D. Candidate in Computer Science

NOVA Lab

Khoury College of Computer Sciences

Northeastern University

Boston, MA

su.yuh@northeastern.edu

Yuhao Su is a Ph.D. candidate in Computer Science at the Khoury College of Computer Sciences, Northeastern University. He completed his first year of coursework remotely during the pandemic (2020-2021) before beginning research in Boston in 2021. His research spans multimodal LLMs, video understanding, and data-efficient, interactive AI. His doctoral work focuses on temporal action segmentation, object correspondence, active learning, and feedback learning, under the supervision of Prof. Ehsan Elhamifar.

During his Ph.D., he deepened his expertise in multimodal LLMs and video understanding through a research internship at UII America, where he developed MedVidBench, a large-scale multi-task medical video understanding benchmark, and MedGRPO, a multi-task reinforcement learning framework.

Before Northeastern, he earned his B.A. in Mathematics and Computer Science from the University of Minnesota.


Selected Works

MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding

Yuhao Su, Anwesa Choudhuri, Zhongpai Gao, Benjamin Planche, Van Nguyen Nguyen, Meng Zheng, Yuhan Shen, Arun Innanje, Terrence Chen, Ehsan Elhamifar, Ziyan Wu
Task: This work introduces the MedVidBench benchmark (531K video-instruction pairs) and the MedGRPO RL framework, which uses cross-dataset reward normalization and a medical LLM judge to stabilize training and advance medical video understanding (a toy sketch of the normalization idea follows this entry).
Results: Supervised fine-tuning on MedVidBench outperforms GPT-4.1 and Gemini-2.5-Flash across all tasks, with MedGRPO further improving performance over the SFT baseline on multiple tasks.
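
To make cross-dataset reward normalization concrete, below is a minimal sketch of one way to rescale rewards from heterogeneous tasks to a shared scale before advantage estimation. Everything here, from the class name to the toy dataset labels, is an illustrative assumption, not the MedGRPO implementation.

```python
# Illustrative sketch only: per-dataset running statistics for reward
# normalization in GRPO-style RL. All names and dataset labels are hypothetical.
import math
from collections import defaultdict

class CrossDatasetRewardNormalizer:
    """Tracks running reward mean/variance per dataset so rewards from
    heterogeneous tasks are comparable before advantage estimation."""

    def __init__(self, eps: float = 1e-8):
        self.stats = defaultdict(lambda: {"n": 0, "mean": 0.0, "m2": 0.0})
        self.eps = eps

    def update(self, dataset: str, reward: float) -> None:
        # Welford's online algorithm for a numerically stable mean/variance.
        s = self.stats[dataset]
        s["n"] += 1
        delta = reward - s["mean"]
        s["mean"] += delta / s["n"]
        s["m2"] += delta * (reward - s["mean"])

    def normalize(self, dataset: str, reward: float) -> float:
        s = self.stats[dataset]
        var = s["m2"] / s["n"] if s["n"] > 1 else 1.0
        return (reward - s["mean"]) / (math.sqrt(var) + self.eps)

normalizer = CrossDatasetRewardNormalizer()
for ds, r in [("surgery_qa", 0.8), ("surgery_qa", 0.2), ("endoscopy_qa", 12.0)]:
    normalizer.update(ds, r)
print(normalizer.normalize("surgery_qa", 0.8))  # reward rescaled within its dataset
```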

RegionAligner: Bridging Ego-Exo Views for Object Correspondence via Unified Text-Visual Learning

Yuhao Su, Ehsan Elhamifar
Task: RegionAligner is a unified text-visual framework that uses VLMs to filter distractors and applies region-guided supervision to tackle the challenging problem of cross-view object correspondence between egocentric and exocentric videos.
Results: RegionAligner significantly outperforms baselines on Ego-Exo4D, improving IoU by 10.16% (ego-to-exo) and 6.04% (exo-to-ego), and also adapts to unsupervised settings (the IoU metric is sketched after this entry).
WACV 2026
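
The IoU gains above compare predicted and ground-truth object masks in the target view. As a reference point, here is a minimal sketch of that metric; the array shapes and the helper name mask_iou are assumptions for illustration.

```python
# Illustrative sketch of mask IoU for cross-view object correspondence.
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two boolean (H, W) masks: the predicted object region in
    the target view vs. the ground-truth region."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 1.0

# Toy example: a predicted exo-view mask partially overlapping the ground truth.
pred = np.zeros((4, 4), dtype=bool); pred[1:3, 1:3] = True
gt = np.zeros((4, 4), dtype=bool); gt[1:4, 1:4] = True
print(f"IoU = {mask_iou(pred, gt):.3f}")  # 4 / 9 = 0.444
```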

Two-Stage Active Learning for Efficient Temporal Action Segmentation

Yuhao Su, Ehsan Elhamifar
Task: To reduce the cost of frame-level annotation for Temporal Action Segmentation (TAS), this work introduces a two-stage active learning framework that leverages contrastive prototype learning to select which videos, and which frames within them, to annotate (a toy selection sketch follows this entry).
Results: The framework achieves 95% of fully supervised performance using only 0.35% of labeled frames, substantially reducing annotation effort, and is the first active learning work for TAS.
ECCV 2024
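
Below is a toy sketch of what a two-stage, prototype-based selection loop could look like: rank videos by how poorly learned action prototypes explain their frames, then pick the least-explained frames within the chosen videos. The distance criterion and every name here are illustrative assumptions, not the paper's selection rules.

```python
# Illustrative two-stage active-learning selection via prototype distances.
import numpy as np

def select_for_annotation(frame_feats, prototypes, n_videos=2, n_frames=3):
    """frame_feats: dict of video_id -> (T, D) unit-normalized frame features.
    prototypes: (K, D) unit-normalized action prototypes."""
    # Stage 1: score each video by the mean cosine distance of its frames to
    # the nearest prototype (higher = prototypes explain the video worse).
    scores = {}
    for vid, feats in frame_feats.items():
        dist = 1.0 - feats @ prototypes.T        # (T, K) cosine distances
        scores[vid] = dist.min(axis=1).mean()
    chosen = sorted(scores, key=scores.get, reverse=True)[:n_videos]

    # Stage 2: within each chosen video, pick the frames farthest from any
    # prototype as the most informative to label.
    picks = {}
    for vid in chosen:
        dist = 1.0 - frame_feats[vid] @ prototypes.T
        picks[vid] = np.argsort(dist.min(axis=1))[-n_frames:].tolist()
    return picks

rng = np.random.default_rng(0)
protos = rng.normal(size=(5, 16))
protos /= np.linalg.norm(protos, axis=1, keepdims=True)
feats = {f"v{i}": rng.normal(size=(20, 16)) for i in range(4)}
for v in feats:
    feats[v] /= np.linalg.norm(feats[v], axis=1, keepdims=True)
print(select_for_annotation(feats, protos))  # e.g. {'v2': [...], 'v0': [...]}
```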

Under Review

Research on interactive and training-free temporal action segmentation, with a focus on multimodal reasoning and feedback-driven learning.