Shihao Wang
I am a second-year Ph.D. student in the Department of Computing at The Hong Kong Polytechnic University, advised by Prof. Lei Zhang. My research spans 3D perception and planning for autonomous vehicles and robotics, multimodal foundation models, streaming video understanding, and test-time adaptation for agentic systems, with publications at top venues including CVPR, ICCV, NeurIPS, ICLR, and AAAI. I also collaborate closely with NVIDIA Research, contributing to both foundational multimodal models and autonomous driving systems.
My work includes the Eagle VLM series, which powers NVIDIA's commercial multimodal models and the Isaac GR00T humanoid robotics platform, as well as OmniDrive and Hydra-MDP, which connect 3D perception with multimodal reasoning for end-to-end autonomous driving and earned top awards at the CVPR 2024 Autonomous Driving Grand Challenge. I am also the lead author of StreamPETR (ICCV 2023), a streaming paradigm for camera-based 3D perception that reached #1 among online methods on the nuScenes leaderboard and has been widely adopted in academia and industry.
Looking ahead, I am developing memory-centric, self-evolving AI agents capable of persistent long-horizon reasoning across virtual and physical domains, pushing beyond task-specific solutions toward reliable general-purpose autonomy.
🔥 News
- 2025.09: 🎉 Two papers have been accepted to NeurIPS 2025.
- 2025.06: 🎉 Contributed to the development and public release of GR00T N1.5.
- 2025.06: 🎉 Hydra-NeXt accepted to ICCV 2025.
- 2025.02: 🎉 OmniDrive accepted to CVPR 2025.
- 2024.11: 🎉 Eagle accepted to ICLR 2025.
- 2024.06: 🏆 1st Place in End-to-End Driving at Scale, 2nd Place in Driving with Language, CVPR 2024 Autonomous Driving Grand Challenge.
- 2023.11: 🎉 Far3D accepted to AAAI 2024.
- 2023.10: 🎉 Joined the NVIDIA AV Applied Research Group as a Research Intern.
- 2023.02: 🎉 StreamPETR accepted to ICCV 2023.
- 2022.10: 🎉 Joined the Foundation Model Group at MEGVII Technology as a Research Intern.
📝 Publications

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding
Shihao Wang, Guo Chen, De-an Huang, Zhiqi Li, Minghan Li, Guilin Liu, Jose M. Alvarez, Lei Zhang and Zhiding Yu.
arXiv 2025

GR00T N1.5: An Improved Open Foundation Model for Generalist Humanoid Robots
Johan Bjorck, Valts Blukis, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi “Jim” Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, Joel Jang, Xiaowei Jiang, Kaushil Kundalia, Jan Kautz, Zhiqi Li, Kevin Lin, Zongyu Lin, Loic Magne, Yunze Man, Ajay Mandlekar, Avnish Narayan, Soroush Nasiriany, Scott Reed, You Liang Tan, Guanzhi Wang, Jing Wang, Qi Wang, Shihao Wang, Jiannan Xiang, Yuqi Xie, Yinzhen Xu, Seonghyeon Ye, Zhiding Yu, Yizhou Zhao, Zhe Zhang, Ruijie Zheng, Yuke Zhu
Blog 2025

Hydra-NeXt: Robust Closed-Loop Driving with Open-Loop Training
Zhenxin Li, Shihao Wang, Shiyi Lan, Zhiding Yu, Zuxuan Wu and Jose M. Alvarez.
ICCV 2025

Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models
Guo Chen, Zhiqi Li, Shihao Wang, Jindong Jiang, Yicheng Liu, Lidong Lu, De-An Huang, Wonmin Byeon, Matthieu Le, Tuomas Rintamaki, Tyler Poon, Max Ehrlich, Tong Lu, Limin Wang, Bryan Catanzaro, Jan Kautz, Andrew Tao, Zhiding Yu and Guilin Liu.
NeurIPS 2025

Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models
Zhiqi Li, Guo Chen, Shilong Liu, Shihao Wang, Vibashan VS, Yishen Ji, Shiyi Lan, Hao Zhang, Yilin Zhao, Subhashree Radhakrishnan, Nadine Chang, Karan Sapra, Amala Sanjay Deshmukh, Tuomas Rintamaki, Matthieu Le, Ilia Karmanov, Lukas Voegtle, Philipp Fischer, De-An Huang, Timo Roman, Tong Lu, Jose M. Alvarez, Bryan Catanzaro, Jan Kautz, Andrew Tao, Guilin Liu and Zhiding Yu.
arXiv 2025

Eagle: Exploring the Design Space for Multimodal LLMs with Mixture of Encoders
Min Shi, Fuxiao Liu, Shihao Wang, Shijia Liao, Subhashree Radhakrishnan, Yilin Zhao, De-An Huang, Hongxu Yin, Karan Sapra, Yaser Yacoob, Humphrey Shi, Bryan Catanzaro, Andrew Tao, Jan Kautz, Zhiding Yu and Guilin Liu.
ICLR 2025

OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning
Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li and Jose M. Alvarez.
CVPR 2025

Far3D: Expanding the Horizon for Surround-View 3D Object Detection
Xiaohui Jiang, Shuanglin Li, Yingfei Liu, Shihao Wang, Fan Jia, Tiancai Wang, Lijin Han and Xiangyu Zhang.
AAAI 2024

StreamPETR: Exploring Object-Centric Temporal Modeling for Efficient Multi-View 3D Object Detection
Shihao Wang, Yingfei Liu, Tiancai Wang, Ying Li and Xiangyu Zhang.
ICCV 2023
🎖 Honors and Awards
- 2024.06, 1st Place, CVPR 2024 Autonomous Driving Grand Challenge (End-to-End Driving at Scale)
- 2024.06, 2nd Place, CVPR 2024 Autonomous Driving Grand Challenge (Driving with Language)
- 2023.11, 1st Place, nuScenes leaderboard on camera-only 3D object tracking
📖 Education
- 2024.09 - Present, Ph.D. in Computer Vision, Department of Computing, Hong Kong Polytechnic University (PolyU), Hong Kong
- 2021.09 - 2024.06, M.Sc. in Vehicle Engineering, Beijing Institute of Technology (BIT), Beijing, GPA: 89.5/100
- 2017.09 - 2021.06, B.Sc. in Vehicle Engineering, Beijing Institute of Technology (BIT), Beijing, GPA: 87.5/100
💬 Invited Talks
- 2024.07, OmniDrive: Advancing autonomous driving 3D perception, reasoning, and planning with large models
- 2023.12, Sparse vectorized representation for long-term temporal modeling
- 2023.08, How can BEV perception be achieved without BEV features?
💻 Internships
- 2023.10 - 2025.01, NVIDIA, Beijing, China.
  - Research Intern, AV Applied Research Group. Contributed to the Eagle VLM family, VideoITG, Hydra-MDP, OmniDrive, and GR00T N1.5.
- 2022.04 - 2023.07, MEGVII Technology, Beijing, China.
  - Research Intern, Foundation Model Group (PI: Xiangyu Zhang). Developed StreamPETR and Far3D, among other projects.