🚗 DriveQA: Passing the Driving Knowledge Test

Boston University¹   Washington University in St. Louis²
ICCV 2025
*Equal Contribution

Can LLMs Pass a Driving Knowledge Test?
We introduce a multimodal dataset to evaluate the traffic rule-following capabilities of MLLMs.

DriveQA Teaser

DriveQA: A comprehensive multimodal benchmark that evaluates and improves MLLMs' driving knowledge through 474K QA pairs covering traffic rules, signs, and right-of-way scenarios.

Abstract

If a Large Language Model (LLM) were to take a driving knowledge test today, would it pass? Beyond standard spatial and visual question-answering (QA) tasks on current autonomous driving benchmarks, driving knowledge tests require a complete understanding of all traffic rules, signage, and right-of-way principles. To pass this test, human drivers must discern various edge cases that rarely appear in real-world datasets. In this work, we present DriveQA, an extensive open-source text and vision-based benchmark that exhaustively covers traffic regulations and scenarios. Through our experiments using DriveQA, we show that (1) state-of-the-art LLMs and Multimodal LLMs (MLLMs) perform well on basic traffic rules but exhibit significant weaknesses in numerical reasoning and complex right-of-way scenarios, traffic sign variations, and spatial layouts, (2) fine-tuning on DriveQA improves accuracy across multiple categories, particularly in regulatory sign recognition and intersection decision-making, (3) controlled variations in DriveQA-V provide insights into model sensitivity to environmental factors such as lighting, perspective, distance, and weather conditions, and (4) pretraining on DriveQA enhances downstream driving task performance, leading to improved results on real-world datasets such as nuScenes and BDD, while also demonstrating that models can internalize text and synthetic traffic knowledge to generalize effectively across downstream QA tasks.
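To make the evaluation protocol concrete, below is a minimal sketch of how a multiple-choice DriveQA-style entry could be loaded and scored per category. The JSON field names (question, options, answer, category) and the model.answer wrapper are illustrative assumptions, not the released data format or API.

import json

def evaluate(qa_path, model):
    # Per-category accuracy on multiple-choice QA pairs. The record
    # schema used here is an assumed example, not the released format.
    correct, total = {}, {}
    with open(qa_path) as f:
        records = json.load(f)
    for r in records:
        cat = r["category"]  # e.g., "Right-of-Way and Lane Selection"
        prompt = r["question"] + "\n" + "\n".join(
            f"({k}) {v}" for k, v in r["options"].items())
        pred = model.answer(prompt)  # hypothetical MLLM wrapper
        correct[cat] = correct.get(cat, 0) + int(pred == r["answer"])
        total[cat] = total.get(cat, 0) + 1
    return {c: correct[c] / total[c] for c in total}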



📊 Large-Scale Benchmark

474K QA pairs
26K text-based questions
448K vision-based tasks
220 traffic sign types

🔍 Comprehensive Evaluation

19 question categories
Controlled variations
Explanations included
Real + synthetic data

🚗 Real-World Transfer

Improved nuScenes performance
Better BDD-OIA results
Sim-to-real generalization
Downstream task gains


Dataset Statistics


DriveQA-T
Distribution of Question Types in DriveQA-T. The benchmark covers five key domains and 19 sub-class types.
DriveQA-WordCloud
Word Cloud of Questions in DriveQA. The figure summarizes the frequency of terms appearing in the DriveQA benchmark.

Main Results


Text QA Results
Challenging Categories on DriveQA-T. We show results on the 10 most difficult question types. Limits: Speed and Distance Limits; Alcohol: Blood Alcohol Limits and DUI Laws; Passing: Passing Rules and Lane Usage in Restricted Situations; Penalties: Driver's License Penalties; Parking: Parking and Wheel Positioning; Highway: Passing Rules and Lane Usage on Highways; Turning: Turning Rules; Signs: Traffic Signs and Signals; Headlight: Headlight Usage; Intersection: Right-of-Way and Lane Selection. Average summarizes all 19 question types. The top method is marked in green and the second best in light green.
Image QA Results
Summarized Results on DriveQA-V. We show model performance (accuracy %) on VQA. The dataset is divided into two main categories: intersections and signs (further broken down by camera perspective and sign type).
Image QA Results
Role of Difficult Questions and Distractors. Accuracy degradation on a hard subset of DriveQA-T and on a challenging set of DriveQA-V with negative sampling exposes the limitations of current models, including GPT-4o, in accurately understanding complex traffic rules and signs.
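One plausible way to build such a challenging option set is negative sampling over embedding similarity, so the wrong choices are signs that look or read like the correct one. The sketch below is a sketch under stated assumptions: the embed function (any image or text encoder returning L2-normalized vectors) and the candidate sign list are placeholders, not the paper's actual pipeline.

import numpy as np

def hard_distractors(target, candidates, embed, k=3):
    # Return the k candidate signs most similar to the target, to be
    # used as hard negatives in a multiple-choice question.
    # Assumes embed() returns L2-normalized vectors, so the dot
    # product below equals cosine similarity.
    t = embed(target)
    scored = sorted(((float(np.dot(t, embed(c))), c)
                     for c in candidates if c != target), reverse=True)
    return [c for _, c in scored[:k]]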

Sim-to-Real Transferability and Downstream Task Performance


Sim-to-Real Generalization
Sim-to-Real Generalization. We pre-train on synthetic DriveQA (DQA) and evaluate on real-world Mapillary images. The Mapillary dataset comprises challenging scenarios with varied traffic sign placements, occlusions, and illumination conditions.
End-to-End Trajectory Planning Results on nuScenes
End-to-End Trajectory Planning Results on nuScenes. We compute the L2 error at different prediction horizons (1s, 2s, and 3s). Lower L2 error indicates that knowledge learned from our DriveQA (DQA) dataset transfers from simulation to real-world driving tasks.
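For reference, the L2 planning metric can be computed as below. The 2 Hz waypoint rate follows the common nuScenes convention; whether the error is averaged over all waypoints up to each horizon (as here) or taken only at the horizon timestep varies across evaluation protocols, so treat this as one common variant rather than the paper's exact implementation.

import numpy as np

def l2_at_horizons(pred, gt, hz=2.0, horizons=(1.0, 2.0, 3.0)):
    # pred, gt: (T, 2) arrays of planned / ground-truth (x, y) waypoints.
    # Returns the mean Euclidean error over all waypoints up to each horizon.
    dists = np.linalg.norm(pred - gt, axis=1)
    return {h: float(dists[: int(h * hz)].mean()) for h in horizons}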
Evaluation on BDD-OIA Dataset
Evaluation on BDD-OIA Dataset. We report mean F1 score (mF1) and overall F1 score (F1 all) for both action and explanation tasks. The results show that fine-tuning on DriveQA improves performance on both tasks.
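The two F1 variants can be read as macro- and micro-averaged multi-label F1: mF1 averages per-class scores, while F1 all pools every (sample, class) decision. The sketch below uses scikit-learn with toy labels; whether this exactly matches the BDD-OIA protocol is an assumption.

import numpy as np
from sklearn.metrics import f1_score

# Toy multi-label targets/predictions for 4 actions (forward, stop, left, right).
y_true = np.array([[1, 0, 0, 1], [0, 1, 0, 0], [1, 0, 1, 0]])
y_pred = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 1, 0]])

mf1 = f1_score(y_true, y_pred, average="macro")     # mean of per-class F1
f1_all = f1_score(y_true, y_pred, average="micro")  # pooled over all decisions
print(f"mF1 = {mf1:.3f}, F1 all = {f1_all:.3f}")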

BibTeX

@inproceedings{wei2025driveqa,
  title={DriveQA: Passing the Driving Knowledge Test},
  author={Wei, Maolin and Liu, Wanzhou and Ohn-Bar, Eshed},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2025}
}

Please cite DriveQA if you find it helpful!