VLM Daily Papers

Daily papers related to VLA/AV/3D Understanding from cs.CV

November 04, 2025

UniLION: Towards Unified Autonomous Driving Model with Linear Group RNNs

Although transformers have demonstrated remarkable capabilities across various domains, their quadratic attention mechanisms introduce significant computational overhead when processing long-sequence data. In this paper, we present a unified autonomous driving model, UniLION, which efficiently handles large-scale LiDAR point clouds, high-resolution multi-view images, and even temporal sequences based on the linear group RNN operator (i.e., performs linear RNN for grouped features). Remarkably, UniLION serves as a single versatile architecture that can seamlessly support multiple specialized variants (i.e., LiDAR-only, temporal LiDAR, multi-modal, and multi-modal temporal fusion configurations) without requiring explicit temporal or multi-modal fusion modules. Moreover, UniLION consistently delivers competitive and even state-of-the-art performance across a wide range of core tasks, including 3D perception (e.g., 3D object detection, 3D object tracking, 3D occupancy prediction, BEV map segmentation), prediction (e.g., motion prediction), and planning (e.g., end-to-end planning). This unified paradigm naturally simplifies the design of multi-modal and multi-task autonomous driving systems while maintaining superior performance. Ultimately, we hope UniLION offers a fresh perspective on the development of 3D foundation models in autonomous driving. Code is available at https://github.com/happinesslz/UniLION
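As a rough illustration of the "linear RNN for grouped features" idea mentioned in the abstract, the sketch below splits a 1D feature sequence into fixed-size groups and runs a scalar linear recurrence inside each group. It is a toy under stated assumptions, not UniLION's operator: the function names, the scalar `decay` and `gain` parameters, and the contiguous grouping are all hypothetical.

```python
def linear_rnn(xs, decay=0.5, gain=1.0):
    """Scalar linear recurrence h_t = decay * h_{t-1} + gain * x_t (hypothetical parameters)."""
    h = 0.0
    out = []
    for x in xs:
        h = decay * h + gain * x
        out.append(h)
    return out


def linear_group_rnn(features, group_size, decay=0.5, gain=1.0):
    """Run the recurrence independently inside each contiguous group (assumed grouping)."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    out = []
    for start in range(0, len(features), group_size):
        group = features[start:start + group_size]
        # The hidden state is reset at every group boundary.
        out.extend(linear_rnn(group, decay, gain))
    return out


if __name__ == "__main__":
    print(linear_group_rnn([1.0, 1.0, 1.0, 1.0], group_size=2))
```

Each element is visited once, so the cost grows linearly with sequence length, in contrast to the quadratic cost of full attention that the abstract refers to.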

TLDR: The paper introduces UniLION, a unified autonomous driving model using linear group RNNs for efficient processing of large-scale LiDAR, multi-view images, and temporal sequences, achieving competitive performance across various tasks without explicit fusion modules.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (8/10)
Potential Impact: (9/10)
Overall: (9/10)

Authors: Zhe Liu, Jinghua Hou, Xiaoqing Ye, Jingdong Wang, Hengshuang Zhao, Xiang Bai

3EED: Ground Everything Everywhere in 3D

Visual grounding in 3D is the key for embodied agents to localize language-referred objects in open-world environments. However, existing benchmarks are limited to indoor focus, single-platform constraints, and small scale. We introduce 3EED, a multi-platform, multi-modal 3D grounding benchmark featuring RGB and LiDAR data from vehicle, drone, and quadruped platforms. We provide over 128,000 objects and 22,000 validated referring expressions across diverse outdoor scenes -- 10x larger than existing datasets. We develop a scalable annotation pipeline combining vision-language model prompting with human verification to ensure high-quality spatial grounding. To support cross-platform learning, we propose platform-aware normalization and cross-modal alignment techniques, and establish benchmark protocols for in-domain and cross-platform evaluations. Our findings reveal significant performance gaps, highlighting the challenges and opportunities of generalizable 3D grounding. The 3EED dataset and benchmark toolkit are released to advance future research in language-driven 3D embodied perception.

TLDR: The paper introduces 3EED, a large-scale, multi-platform 3D grounding benchmark designed for embodied agents operating in diverse outdoor environments, addressing limitations of existing datasets.

Relevance: (10/10)
Novelty: (9/10)
Clarity: (9/10)
Potential Impact: (9/10)
Overall: (9/10)

Authors: Rong Li, Yuhao Dong, Tianshuai Hu, Ao Liang, Youquan Liu, Dongyue Lu, Liang Pan, Lingdong Kong, Junwei Liang, Ziwei Liu

Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process

Vision-language-action (VLA) models aim to understand natural language instructions and visual observations and to execute corresponding actions as an embodied agent. Recent work integrates future images into the understanding-acting loop, yielding unified VLAs that jointly understand, generate, and act -- reading text and images and producing future images and actions. However, these models either rely on external experts for modality unification or treat image generation and action prediction as separate processes, limiting the benefits of direct synergy between these tasks. Our core philosophy is to optimize generation and action jointly through a synchronous denoising process, where the iterative refinement enables actions to evolve from initialization, under constant and sufficient visual guidance. We ground this philosophy in our proposed Unified Diffusion VLA and Joint Discrete Denoising Diffusion Process (JD3P), which is a joint diffusion process that integrates multiple modalities into a single denoising trajectory to serve as the key mechanism enabling understanding, generation, and acting to be intrinsically synergistic. Our model and theory are built on a unified tokenized space of all modalities and a hybrid attention mechanism. We further propose a two-stage training pipeline and several inference-time techniques that optimize performance and efficiency. Our approach achieves state-of-the-art performance on benchmarks such as CALVIN, LIBERO, and SimplerEnv with 4× faster inference than autoregressive methods, and we demonstrate its effectiveness through in-depth analysis and real-world evaluations. Our project page is available at https://irpn-eai.github.io/UD-VLA.github.io/.

TLDR: This paper introduces a Unified Diffusion VLA model with a Joint Discrete Denoising Diffusion Process (JD3P) for vision-language-action tasks, achieving SOTA results with faster inference by jointly optimizing generation and action.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (9/10)

Authors: Jiayi Chen, Wenxuan Song, Pengxiang Ding, Ziyang Zhou, Han Zhao, Feilong Tang, Donglin Wang, Haoang Li

OmniVLA: Unifying Multi-Sensor Perception for Physically-Grounded Multimodal VLA

Vision-language-action (VLA) models have shown strong generalization for action prediction through large-scale vision-language pretraining. However, most existing models rely solely on RGB cameras, limiting their perception and, consequently, manipulation capabilities. We present OmniVLA, an omni-modality VLA model that integrates novel sensing modalities for physically-grounded spatial intelligence beyond RGB perception. The core of our approach is the sensor-masked image, a unified representation that overlays spatially grounded and physically meaningful masks onto the RGB images, derived from sensors including an infrared camera, a mmWave radar, and a microphone array. This image-native unification keeps sensor input close to RGB statistics to facilitate training, provides a uniform interface across sensor hardware, and enables data-efficient learning with lightweight per-sensor projectors. Built on this, we present a multisensory vision-language-action model architecture and train the model based on an RGB-pretrained VLA backbone. We evaluate OmniVLA on challenging real-world tasks where sensor-modality perception is needed to guide the manipulation. OmniVLA achieves an average task success rate of 84%, significantly outperforming the RGB-only and raw-sensor-input baseline models by 59% and 28% respectively, while also showing higher learning efficiency and stronger generalization capability.

TLDR: OmniVLA introduces a unified multi-sensor VLA model using a sensor-masked image representation, significantly improving performance on real-world manipulation tasks compared to RGB-only and raw-sensor baselines.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (9/10)

Authors: Heyu Guo, Shanmu Wang, Ruichun Ma, Shiqi Jiang, Yasaman Ghasempour, Omid Abari, Baining Guo, Lili Qi

Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models

Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved 2D visual understanding, prompting interest in their application to complex 3D reasoning tasks. However, it remains unclear whether these models can effectively capture the detailed spatial information required for robust real-world performance, especially cross-view consistency, a key requirement for accurate 3D reasoning. Considering this issue, we introduce Viewpoint Learning, a task designed to evaluate and improve the spatial reasoning capabilities of MLLMs. We present the Viewpoint-100K dataset, consisting of 100K object-centric image pairs with diverse viewpoints and corresponding question-answer pairs. Our approach employs a two-stage fine-tuning strategy: first, foundational knowledge is injected into the baseline MLLM via Supervised Fine-Tuning (SFT) on Viewpoint-100K, resulting in significant improvements across multiple tasks; second, generalization is enhanced through Reinforcement Learning using the Group Relative Policy Optimization (GRPO) algorithm on a broader set of questions. Additionally, we introduce a hybrid cold-start initialization method designed to simultaneously learn viewpoint representations and maintain coherent reasoning. Experimental results show that our approach significantly activates the spatial reasoning ability of MLLMs, improving performance on both in-domain and out-of-domain reasoning tasks. Our findings highlight the value of developing foundational spatial skills in MLLMs, supporting future progress in robotics, autonomous systems, and 3D scene understanding.
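The GRPO step mentioned in the abstract scores each sampled answer relative to the other answers drawn for the same question. A minimal sketch of that group-relative normalization follows; the function name, the use of the population standard deviation, and the example rewards are illustrative assumptions, and the paper's actual reward design is not described in this abstract.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards):
    """Normalize rewards within one group of sampled answers (hypothetical helper).

    Each advantage is (reward - group mean) / group standard deviation.
    If every reward in the group is identical, all advantages are zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, some variants differ
    if sigma == 0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


if __name__ == "__main__":
    # Four sampled answers to one question: two correct (1.0), two wrong (0.0).
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

Because the baseline is the group mean, no separate value network is needed to estimate it.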

TLDR: This paper introduces Viewpoint Learning and the Viewpoint-100K dataset to enhance the spatial reasoning capabilities of MLLMs, using a two-stage fine-tuning approach (SFT + RL) and a hybrid cold-start initialization method.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)

Authors: Xiaoyu Zhan, Wenxuan Huang, Hao Sun, Xinyu Fu, Changfeng Ma, Shaosheng Cao, Bohan Jia, Shaohui Lin, Zhenfei Yin, Lei Bai, Wanli Ouyang, Yuanqi Li, Jie Guo, Yanwen Guo

MARS: Multi-Agent Robotic System with Multimodal Large Language Models for Assistive Intelligence

Multimodal large language models (MLLMs) have shown remarkable capabilities in cross-modal understanding and reasoning, offering new opportunities for intelligent assistive systems, yet existing systems still struggle with risk-aware planning, user personalization, and grounding language plans into executable skills in cluttered homes. We introduce MARS - a Multi-Agent Robotic System powered by MLLMs for assistive intelligence and designed for smart home robots supporting people with disabilities. The system integrates four agents: a visual perception agent for extracting semantic and spatial features from environment images, a risk assessment agent for identifying and prioritizing hazards, a planning agent for generating executable action sequences, and an evaluation agent for iterative optimization. By combining multimodal perception with hierarchical multi-agent decision-making, the framework enables adaptive, risk-aware, and personalized assistance in dynamic indoor environments. Experiments on multiple datasets demonstrate the superior overall performance of the proposed system in risk-aware planning and coordinated multi-agent execution compared with state-of-the-art multimodal models. The proposed approach also highlights the potential of collaborative AI for practical assistive scenarios and provides a generalizable methodology for deploying MLLM-enabled multi-agent systems in real-world environments.

TLDR: The paper introduces MARS, a multi-agent robotic system using MLLMs for assistive intelligence in smart homes, focusing on risk-aware planning and personalized assistance. It demonstrates improved performance in experiments involving risk-aware planning and multi-agent execution.

Relevance: (8/10)
Novelty: (7/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)

Authors: Renjun Gao, Peiyan Zhong

PixelVLA: Advancing Pixel-level Understanding in Vision-Language-Action Model

Vision-Language-Action models (VLAs) are emerging as powerful tools for learning generalizable visuomotor control policies. However, current VLAs are mostly trained on large-scale image-text-action data and remain limited in two key ways: (i) they struggle with pixel-level scene understanding, and (ii) they rely heavily on textual prompts, which reduces their flexibility in real-world settings. To address these challenges, we introduce PixelVLA, the first VLA model designed to support both pixel-level reasoning and multimodal prompting with text and visual inputs. Our approach is built on a new visuomotor instruction tuning framework that integrates a multiscale pixel-aware encoder with a visual prompting encoder. To train PixelVLA effectively, we further propose a two-stage automated annotation pipeline that generates Pixel-160K, a large-scale dataset with pixel-level annotations derived from existing robot data. Experiments on three standard VLA benchmarks and two VLA model variants show that PixelVLA improves manipulation success rates by 10.1%-17.8% over OpenVLA, while requiring only 1.5% of its pretraining cost. These results demonstrate that PixelVLA can be integrated into existing VLAs to enable more accurate, efficient, and versatile robot control in complex environments. The dataset and code will be released as open source.

TLDR: PixelVLA is a new VLA model designed for pixel-level reasoning and multimodal prompting, trained on a novel pixel-level annotated dataset. It achieves improved manipulation success rates while significantly reducing pretraining costs.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)

Authors: Wenqi Liang, Gan Sun, Yao He, Jiahua Dong, Suyan Dai, Ivan Laptev, Salman Khan, Yang Cong

Driving scenario generation and evaluation using a structured layer representation and foundational models

Rare and challenging driving scenarios are critical for autonomous vehicle development. Since they are difficult to encounter, simulating or generating them using generative models is a popular approach. Following previous efforts to structure driving scenario representations in a layer model, we propose a structured five-layer model to improve the evaluation and generation of rare scenarios. We use this model alongside large foundational models to generate new driving scenarios using a data augmentation strategy. Unlike previous representations, our structure introduces subclasses and characteristics for every agent of the scenario, allowing us to compare them using an embedding specific to our layer model. We study and adapt two metrics to evaluate the relevance of a synthetic dataset in the context of a structured representation: the diversity score estimates how different the scenarios of a dataset are from one another, while the originality score calculates how similar a synthetic dataset is to a real reference set. This paper showcases both metrics in different generation setups, as well as a qualitative evaluation of synthetic videos generated from structured scenario descriptions. The code and extended results can be found at https://github.com/Valgiz/5LMSG.
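The abstract names a diversity score and an originality score but does not define them here. The sketch below shows one plausible pair of stand-ins over scenario embeddings, purely as an assumption: diversity as the mean pairwise Euclidean distance within a dataset, and originality as the mean distance from each synthetic scenario to its nearest real one (lower means closer to the reference set). The function names and the choice of Euclidean distance are hypothetical, not the paper's definitions.

```python
from itertools import combinations
from math import dist


def diversity_score(embeddings):
    """Mean pairwise Euclidean distance within one dataset (assumed definition)."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


def originality_score(synthetic, reference):
    """Mean nearest-neighbor distance from synthetic to reference embeddings (assumed definition)."""
    if not synthetic or not reference:
        raise ValueError("both sets must be non-empty")
    nearest = [min(dist(s, r) for r in reference) for s in synthetic]
    return sum(nearest) / len(nearest)


if __name__ == "__main__":
    synthetic = [(0.0, 0.0), (3.0, 4.0)]
    real = [(0.0, 1.0), (3.0, 4.0)]
    print(diversity_score(synthetic), originality_score(synthetic, real))
```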

TLDR: This paper introduces a structured five-layer model and foundational models for generating diverse and original driving scenarios, with metrics for evaluation. The authors also provide code and extended results.

Relevance: (9/10)
Novelty: (7/10)
Clarity: (8/10)
Potential Impact: (8/10)
Overall: (8/10)

Authors: Arthur Hubert, Gamal Elghazaly, Raphaël Frank

Discriminately Treating Motion Components Evolves Joint Depth and Ego-Motion Learning

Unsupervised learning of depth and ego-motion, two fundamental 3D perception tasks, has made significant strides in recent years. However, most methods treat ego-motion as an auxiliary task, either mixing all motion types or excluding depth-independent rotational motions in supervision. Such designs limit the incorporation of strong geometric constraints, reducing reliability and robustness under diverse conditions. This study introduces a discriminative treatment of motion components, leveraging the geometric regularities of their respective rigid flows to benefit both depth and ego-motion estimation. Given consecutive video frames, network outputs first align the optical axes and imaging planes of the source and target cameras. Optical flows between frames are transformed through these alignments, and deviations are quantified to impose geometric constraints individually on each ego-motion component, enabling more targeted refinement. These alignments further reformulate the joint learning process into coaxial and coplanar forms, where depth and each translation component can be mutually derived through closed-form geometric relationships, introducing complementary constraints that improve depth robustness. DiMoDE, a general depth and ego-motion joint learning framework incorporating these designs, achieves state-of-the-art performance on multiple public datasets and a newly collected diverse real-world dataset, particularly under challenging conditions. Our source code will be publicly available at mias.group/DiMoDE upon publication.

TLDR: This paper introduces a novel unsupervised depth and ego-motion learning framework (DiMoDE) that discriminatively treats motion components, improving robustness and accuracy by leveraging geometric constraints and achieving state-of-the-art results.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)

Authors: Mengtan Zhang, Zizhan Guo, Hongbo Zhao, Yi Feng, Zuyi Xiong, Yue Wang, Shaoyi Du, Hanli Wang, Rui Fan

SE(3)-PoseFlow: Estimating 6D Pose Distributions for Uncertainty-Aware Robotic Manipulation

Object pose estimation is a fundamental problem in robotics and computer vision, yet it remains challenging due to partial observability, occlusions, and object symmetries, which inevitably lead to pose ambiguity and multiple hypotheses consistent with the same observation. While deterministic deep networks achieve impressive performance under well-constrained conditions, they are often overconfident and fail to capture the multi-modality of the underlying pose distribution. To address these challenges, we propose a novel probabilistic framework that leverages flow matching on the SE(3) manifold for estimating 6D object pose distributions. Unlike existing methods that regress a single deterministic output, our approach models the full pose distribution with a sample-based estimate and enables reasoning about uncertainty in ambiguous cases such as symmetric objects or severe occlusions. We achieve state-of-the-art results on Real275, YCB-V, and LM-O, and demonstrate how our sample-based pose estimates can be leveraged in downstream robotic manipulation tasks such as active perception for disambiguating uncertain viewpoints or guiding grasp synthesis in an uncertainty-aware manner.

TLDR: This paper introduces SE(3)-PoseFlow, a novel probabilistic framework using flow matching on the SE(3) manifold to estimate 6D object pose distributions, achieving state-of-the-art results and enabling uncertainty-aware robotic manipulation.

Relevance: (8/10)
Novelty: (9/10)
Clarity: (8/10)
Potential Impact: (8/10)
Overall: (8/10)

Authors: Yufeng Jin, Niklas Funk, Vignesh Prasad, Zechu Li, Mathias Franzius, Jan Peters, Georgia Chalvatzaki

Source-Only Cross-Weather LiDAR via Geometry-Aware Point Drop

LiDAR semantic segmentation degrades in adverse weather because refraction, scattering, and point dropouts corrupt geometry. Prior work in weather simulation, mixing-based augmentation, domain randomization, and uncertainty or boundary regularization improves robustness but still overlooks structural vulnerabilities near boundaries, corners, and sparse regions. We present a Light Geometry-aware adapter. The module aligns azimuth and applies horizontal circular padding to preserve neighbor continuity across the 0~360 degree wrap-around boundary. A local-window K-Nearest Neighbors search gathers nearby points and computes simple local statistics, which are compressed into compact geometry-aware cues. During training, these cues drive region-aware regularization that stabilizes predictions in structurally fragile areas. The adapter is plug-and-play, complements augmentation, and can be enabled only during training with negligible inference cost. We adopt a source-only cross-weather setup where models train on SemanticKITTI and are evaluated on SemanticSTF without target labels or fine-tuning. The adapter improves mIoU by 7.9 percentage points over the data-centric augmentation baseline and by 0.6 points over the class-centric regularization baseline. These results indicate that geometry-driven regularization is a key direction for all-weather LiDAR segmentation.
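The horizontal circular padding described in the abstract can be pictured on a range image whose columns index azimuth: the first and last columns are physical neighbors, so each row is padded with values wrapped from the opposite side. The sketch below is a minimal version on nested lists; the function name and the range-image layout are assumptions for illustration, not the adapter's implementation.

```python
def circular_pad_rows(range_image, pad):
    """Pad each row on both sides with values wrapped from the opposite edge.

    range_image: list of rows, each row indexed by azimuth bin (assumed layout).
    pad: number of columns to wrap on each side.
    """
    if pad < 0:
        raise ValueError("pad must be non-negative")
    padded = []
    for row in range_image:
        row = list(row)
        if pad > len(row):
            raise ValueError("pad cannot exceed the row width")
        if pad == 0:
            padded.append(row)
            continue
        # Left padding comes from the last columns, right padding from the first.
        padded.append(row[-pad:] + row + row[:pad])
    return padded


if __name__ == "__main__":
    print(circular_pad_rows([[1, 2, 3, 4]], pad=1))
```

With this padding, a local window centered on the first azimuth bin sees the last bin as its left neighbor, which is what keeps neighborhoods continuous across the wrap-around boundary.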

TLDR: This paper introduces a geometry-aware adapter for LiDAR semantic segmentation that improves robustness in adverse weather conditions by focusing on structural vulnerabilities. It achieves this through region-aware regularization during training and a plug-and-play design for negligible inference cost.

Relevance: (8/10)
Novelty: (7/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (8/10)

Authors: YoungJae Cheong, Jhonghyun An

LiDAR-VGGT: Cross-Modal Coarse-to-Fine Fusion for Globally Consistent and Metric-Scale Dense Mapping

Reconstructing large-scale colored point clouds is an important task in robotics, supporting perception, navigation, and scene understanding. Despite advances in LiDAR inertial visual odometry (LIVO), its performance remains highly sensitive to extrinsic calibration. Meanwhile, 3D vision foundation models, such as VGGT, suffer from limited scalability in large environments and inherently lack metric scale. To overcome these limitations, we propose LiDAR-VGGT, a novel framework that tightly couples LiDAR inertial odometry with the state-of-the-art VGGT model through a two-stage coarse-to-fine fusion pipeline. First, a pre-fusion module with robust initialization refinement efficiently estimates VGGT poses and point clouds with coarse metric scale within each session. Then, a post-fusion module enhances cross-modal 3D similarity transformation, using bounding-box-based regularization to reduce scale distortions caused by inconsistent FOVs between LiDAR and camera sensors. Extensive experiments across multiple datasets demonstrate that LiDAR-VGGT achieves dense, globally consistent colored point clouds and outperforms both VGGT-based methods and LIVO baselines. The implementation of our proposed novel color point cloud evaluation toolkit will be released as open source.
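A 3D similarity transformation, as referred to in the abstract, maps a point p to s·R·p + t, where s is a scale, R a rotation, and t a translation. The sketch below only applies such a transform to a list of points, to show how a scale factor can bring scale-free reconstructions into metric units. The function name and argument layout are hypothetical, and estimating s, R, and t (the part the paper addresses) is not shown.

```python
def apply_similarity(points, scale, rotation, translation):
    """Apply p' = scale * (rotation @ p) + translation to each 3D point.

    points: iterable of (x, y, z) tuples.
    rotation: 3x3 nested list, assumed to be a valid rotation matrix.
    translation: (tx, ty, tz).
    """
    transformed = []
    for p in points:
        rotated = [sum(rotation[i][j] * p[j] for j in range(3)) for i in range(3)]
        transformed.append(tuple(scale * rotated[i] + translation[i] for i in range(3)))
    return transformed


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    # Scale a unit-less point by 2 and shift it 1 unit along x.
    print(apply_similarity([(1.0, 1.0, 1.0)], 2.0, identity, (1.0, 0.0, 0.0)))
```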

TLDR: This paper presents LiDAR-VGGT, a novel framework that tightly integrates LiDAR inertial odometry with VGGT using a coarse-to-fine fusion pipeline to achieve globally consistent and metric-scale dense colored point clouds, outperforming existing methods.

Relevance: (9/10)
Novelty: (8/10)
Clarity: (9/10)
Potential Impact: (8/10)
Overall: (8/10)

Authors: Lijie Wang, Lianjie Guo, Ziyi Xu, Qianhao Wang, Fei Gao, Xieyuanli Chen

Terrain-Enhanced Resolution-aware Refinement Attention for Off-Road Segmentation

Off-road semantic segmentation suffers from thick, inconsistent boundaries, sparse supervision for rare classes, and pervasive label noise. Designs that fuse only at low resolution blur edges and propagate local errors, whereas maintaining high-resolution pathways or repeating high-resolution fusions is costly and fragile to noise. We introduce a resolution-aware token decoder that balances global semantics, local consistency, and boundary fidelity under imperfect supervision. Most computation occurs at a low-resolution bottleneck; a gated cross-attention injects fine-scale detail, and only a sparse, uncertainty-selected set of pixels is refined. The components are co-designed and tightly integrated: global self-attention with lightweight dilated depthwise refinement restores local coherence; a gated cross-attention integrates fine-scale features from a standard high-resolution encoder stream without amplifying noise; and a class-aware point refinement corrects residual ambiguities with negligible overhead. During training, we add a boundary-band consistency regularizer that encourages coherent predictions in a thin neighborhood around annotated edges, with no inference-time cost. Overall, the results indicate competitive performance and improved stability across transitions.

TLDR: This paper introduces a resolution-aware token decoder for off-road semantic segmentation that balances global semantics, local consistency, and boundary fidelity using a gated cross-attention mechanism and boundary consistency regularization.

Relevance: (8/10)
Novelty: (7/10)
Clarity: (8/10)
Potential Impact: (7/10)
Overall: (7/10)

Authors: Seongkyu Choi, Jhonghyun An

Saliency-Guided Domain Adaptation for Left-Hand Driving in Autonomous Steering

Domain adaptation is required for automated driving models to generalize well across diverse road conditions. This paper explores a domain-adaptation training method that adapts PilotNet, an end-to-end deep learning-based model, to left-hand driving conditions using real-world Australian highway data. Four training methods were evaluated: (1) a baseline model trained on U.S. right-hand driving data, (2) a model trained on flipped U.S. data, (3) a model pretrained on U.S. data and then fine-tuned on Australian highways, and (4) a model pretrained on flipped U.S. data and then fine-tuned on Australian highways. This setup examines whether incorporating flipped data enhances the model adaptation by providing an initial left-hand driving alignment. The paper compares model performance regarding steering prediction accuracy and attention, using saliency-based analysis to measure attention shifts across significant road regions. Results show that pretraining on flipped data alone worsens prediction stability due to misaligned feature representations, but significantly improves adaptation when followed by fine-tuning, leading to lower prediction error and stronger focus on left-side cues. To validate this approach across different architectures, the same experiments were done on ResNet, which confirmed similar adaptation trends. These findings emphasize the importance of preprocessing techniques, such as flipped-data pretraining, followed by fine-tuning to improve model adaptation with minimal retraining requirements.
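The flipped-data setups above rely on mirroring right-hand driving frames so they resemble left-hand traffic. A minimal sketch of such a flip follows; it mirrors each image row and negates the steering label. Negating the label is a common augmentation convention and an assumption here, since the abstract only states that the data is flipped, and the function name and sample layout are hypothetical.

```python
def flip_sample(image, steering_angle):
    """Mirror an image left-to-right and negate its steering label.

    image: list of rows, each row a list of pixel values.
    steering_angle: signed angle; the sign convention is an assumption.
    """
    flipped = [list(reversed(row)) for row in image]
    return flipped, -steering_angle


if __name__ == "__main__":
    frame = [[1, 2, 3], [4, 5, 6]]
    print(flip_sample(frame, 0.25))
```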

TLDR: This paper explores a domain adaptation technique for autonomous driving, specifically adapting a steering model (PilotNet and ResNet) to left-hand driving by pretraining on flipped right-hand driving data and then fine-tuning on Australian highway data, demonstrating improved performance and attention on left-side cues.

Relevance: (7/10)
Novelty: (6/10)
Clarity: (9/10)
Potential Impact: (7/10)
Overall: (7/10)

Authors: Zahra Mehraban, Sebastien Glaser, Michael Milford, Ronald Schroeter