Towards Knowledge-driven Autonomous Driving

arXiv:2312.04316v1 [cs.RO] 7 Dec 2023

Xin Li∗, Yeqi Bai∗, Pinlong Cai∗†, Licheng Wen, Daocheng Fu, Bo Zhang, Xuemeng Yang, Xinyu Cai, Tao Ma, Jianfei Guo, Xing Gao, Min Dou, Botian Shi†, Yong Liu, Liang He and Yu Qiao

X. Li, Y. Bai, P. Cai, L. Wen, D. Fu, B. Zhang, X. Yang, X. Cai, T. Ma, J. Guo, X. Gao, M. Dou, B. Shi and Y. Qiao are with Shanghai Artificial Intelligence Laboratory. X. Li and L. He are also with East China Normal University. T. Ma is also with the Chinese University of Hong Kong. Y. Liu is with Zhejiang University. ∗ indicates equal contribution. † denotes corresponding authors: Pinlong Cai (caipinlong@pjlab.org.cn) and Botian Shi (shibotian@pjlab.org.cn).

Abstract—This paper explores the emerging knowledge-driven autonomous driving technologies. Our investigation highlights the limitations of current autonomous driving systems, in particular their sensitivity to data bias, difficulty in handling long-tail scenarios, and lack of interpretability. Conversely, knowledge-driven methods with the abilities of cognition, generalization and life-long learning emerge as a promising way to overcome these challenges. This paper delves into the essence of knowledge-driven autonomous driving and examines its core components: dataset & benchmark, environment, and driver agent. By leveraging large language models, world models, neural rendering, and other advanced artificial intelligence techniques, these components collectively contribute to a more holistic, adaptive, and intelligent autonomous driving system. The paper systematically organizes and reviews previous research efforts in this area, and provides insights and guidance for future research and practical applications of autonomous driving. We will continually share the latest updates on cutting-edge developments in knowledge-driven autonomous driving along with the relevant valuable open-source resources at: https://github.com/PJLab-ADG/awesome-knowledge-driven-AD.

Index Terms—Knowledge-driven, Autonomous driving, Simulation, Driver agent

1 INTRODUCTION

In recent years, autonomous driving has undergone substantial development, primarily propelled by continuous advancements in sensor technology [1]–[3], rapid progress in machine learning and artificial intelligence (AI) [4]–[6], and innovations in high-precision mapping and positioning technologies [7], [8]. The positive influence of regulations and policies has further contributed to this progress. Despite noteworthy advancements in autonomous driving, persistent challenges remain. An overreliance on data-driven approaches exposes systems to data bias, resulting in overfitting on training data [9], [10]. This challenge impedes existing autonomous driving systems from effectively addressing long-tail and cross-domain issues [3], [11], thereby limiting their adaptability in new environments. Moreover, existing autonomous driving systems lack interpretability [12]–[14]. Data-driven algorithms are often perceived as black boxes, making it difficult to provide human-understandable explanations for their decisions. This opacity makes it hard to confirm whether a model genuinely makes intelligent decisions and restricts the potential for guiding further optimization of the system. Despite numerous attempts to address these issues [15]–[17], no universally reliable method can satisfactorily resolve them. Consequently, addressing challenges such as data bias, long-tail issues, cross-domain problems, and the lack of interpretability remains a critical focus for ongoing research and development in autonomous driving.

Contemporary autonomous driving methodologies involve training models on extensive accumulated datasets to impart proficient driving capabilities [18], [19]. Data-driven models tend to prioritize common cases while overlooking rare corner cases. This constraint is rooted in the assumption that data are independent and identically distributed (i.i.d.), which underlies data-driven methodologies and proves challenging to meet in real-world scenarios [20]–[22]. Despite the expanding scale of data collection, an inherent limitation remains: finite data cannot encompass an infinite array of corner cases [13], [23], [24]. To make fundamental strides in autonomous driving, it is crucial to explore technological changes and replicate human learning patterns in driving through modeling [25], [26]. As underscored by Yann LeCun [27], human proficiency in mastering fundamental driving skills and adeptly adapting to diverse and unpredictable scenarios, such as navigating complex traffic conditions and changing weather, requires merely dozens of hours of professional practice. This accentuates the efficient learning and knowledge summarization capabilities inherent in humans. Knowledge is the concretization and generalization of human representations of scenes and events in the real world, representing a summary of experience and causal reasoning [28].

The foundational concepts and significant implications of knowledge-driven approaches can be elucidated through the evolutionary trends in AI. Fig. 1 illustrates the different technological paradigms.

Fig. 1. Comparison of three technical paradigms to autonomous driving. (1) The rule-based paradigm utilizes the understanding of driving scenarios that is summarized in the scenario semantic space to guide driving. (2) The data-based paradigm tends to model driving scenarios into the representation space, which is subsequently inferred back to the real world to accomplish driving tasks. (3) The knowledge-driven paradigm induces information of driving scenarios into a knowledge-augmented representation space, which can be deduced to generalized knowledge in the scenario semantic space, subsequently inferring the scenarios to guide driving through knowledge reflection.

(1) The rule-driven paradigm depends on meticulous logical reasoning or thorough empirical validation using manually crafted rules. These methods aim to encapsulate specific observed phenomena in the real world to facilitate an understanding of driving scenarios from the semantic space. However, handcrafted rules cannot cope with highly complex learning tasks. Moreover, the complexity and diversity of the real world impose evident limitations on these methods, which cannot tolerate the fuzziness of continuous spaces and noisy data [29]. (2) The data-driven paradigm establishes connectivity-based systems supported by massive data and computational power, capable of emulating the thought processes and world exploration of humans. However, the learned representation space processed by data-driven models differs significantly from the scenario semantic space of the human cognitive system, lacking the composability of knowledge and the interpretability of logic [29]. Moreover, data-driven models inevitably encounter data bias or catastrophic forgetting [30], [31]. (3) The knowledge-driven paradigm aims to integrate the characteristics of the rule-driven and data-driven paradigms, and is a crucial support for propelling significant advancements in the current AI field [29], [32]. Knowledge-driven methods aim to induce information of driving scenarios into a knowledge-augmented representation space and deduce it to the generalized driving semantic space. This enables the emulation of human understanding of the real world and the acquisition of learning and reasoning capabilities from experience. Thus, knowledge-driven approaches will be an indispensable pathway for the evolution of the next generation of autonomous driving systems.

Currently, knowledge-driven methods are gradually emerging, with early research endeavors seeking to incorporate knowledge to enhance system performance, particularly in the realm of autonomous driving [33]–[37]. However, these studies have not yet been systematically organized and summarized. The knowledge-driven paradigm typically comprises the following key components:

Dataset & Benchmark. Datasets are digitized perceptions of the real world gathered through various sensors, represented in forms such as images [12], [38], point clouds [39], [40], etc. The datasets can be endowed with semantic information through manual or automated annotations to construct mechanistic connections between different objects aligned with human cognition [41]–[46]. The benchmarks established on the datasets serve as evaluation metrics for assessing model performance. This is not only a crucial step in developing data-driven methods but also a prerequisite for constructing large models with general understanding capabilities. However, overemphasizing the inference capabilities of models on datasets may result in the "overfitting" dilemma, thereby significantly constraining the models' generalization abilities.

Environment. Environments serve as cradles for intelligent agents, providing the resource conditions necessary for their survival. The natural world constitutes the only real environment. In contrast to the extended iteration cycles and high trial-and-error costs of the real environment, AI agents can engage in rapid learning and continuous iteration within closed-loop virtual environments. Emerging neural rendering technologies facilitate extensive 3D scene reconstruction at a low cost, creating highly realistic road scenes to robustly support closed-loop environment construction [47]–[51]. The world model, designed to model the environment, has the potential to enhance the authentic understanding of driving scenarios, facilitating the progression of autonomous driving from perception to cognition [52]–[55]. Both neural rendering technologies and world models can facilitate the realization of closed-loop virtual simulations to effectively generate rare corner cases that are difficult to capture in the real world [55], [56].

Driver Agent.
Knowledge-driven methods shift from passive, data-centric learning to active, cognition-based understanding of the world by systematically applying domain knowledge and reasoning capabilities [57]–[59]. This transformation enables autonomous driving to effectively understand and adapt to unseen driving scenarios [60]–[62]. As possessing rich human driving experience and common sense, Large Language Models (LLMs) are commonly employed as foundation models for knowledge-driven autonomous driving nowadays to actively understand, interact, acquire knowledge, and reason from driving scenarios [46], [63]– [65]. Similar to embodied AI’s standpoint, true intelligence can only be achieved by curiosity-driven first-person intelligence in the environment [66], [67]. Intelligent agents can continually explore and comprehend their surroundings to support autonomous decision-making and creativity. Analogous to embodied AI, the driving agent should possess the ability to interact with the driving environment, engaging in exploration, understanding, memory, and reflection to achieve genuine intelligence [65], [68]. PREPRINT The objective of this paper is to comprehensively summarize the emerging technological trend involved with knowledge-driven autonomous driving. We delve into the system framework and core components of knowledge-driven autonomous driving, subsequently analyzing the opportunities and challenges in this field. This paper seeks to provide valuable insights for future research and practical application of autonomous driving, striving to steer its development towards greater safety, reliability, and efficiency. 2 W HAT IS AND W HY K NOWLEDGE - DRIVEN AU TONOMOUS D RIVING ? This section delineates the advantages of knowledge-driven approaches over data-driven methods, illustrated through examples drawn from the evolution of Computer Vision (CV) technologies. Subsequently, we discuss the surge in the development of knowledge-driven techniques driven by generative models like LLMs, and emphasize the significance of data-driven methods in the advancement of autonomous driving. 2.1 Paradigm: Data-driven vs. Knowledge-driven Limitations of Data-Driven Paradigm. While existing autonomous driving systems have achieved success in many aspects under the data-driven paradigm, they still struggle to adapt to new driving situations, suffer from overfitting issues caused by data bias, and cannot explain their decisions, ultimately failing to reach a satisfactory level of autonomous driving. The main reason behind these limitations is that data-driven methods emphasize training for specific domains and typically result in systems that excel at the training datasets [69]–[71], but exhibit weak generalization and scalability [72]–[74]. This inherent limitation presents a formidable obstacle for autonomous driving systems in coping with the diverse and unpredictable corner cases that frequently arise in real-world driving scenarios [11], [75]. Advantages of Knowledge-driven Paradigm. In contrast to traditional data-driven methods, knowledge-driven autonomous driving enables vehicles to have a comprehensive understanding of their surroundings. Essentially, knowledge-driven autonomous driving involves a reasoned, knowledge-based understanding of the real world, enabling it to handle various complex driving scenarios and adapt to ever-changing environments. 
This understanding involves not only object detection but also semantics understanding [57] and context-aware relationship reasoning within the environment [76], to solve complex problems such as multitasking learning and end-to-end learning [77], [78]. Furthermore, the recent emergence of research related to the world model is an advanced form of scenario understanding [27], [52], [53], [79], [80], which is capable of understanding the world and even generating predictions of future world content. The knowledge-driven paradigm can improve system interpretability and trustworthiness, making it easier for human to comprehend the decisions and actions of autonomous driving. The prevailing belief asserts that attaining human-like driving capabilities is pivotal in realizing autonomous driving [81], [82]. Data-driven approaches tend to learn different driving abilities from various driving scenarios, whereas their performances are constrained by the size of the collected dataset [83]. Data-driven methods only fit the inputs and outputs of the dataset for a few specific tasks, which makes the acquired capabilities only able 3 to deal with driving scenarios that are closely related to the collected dataset, and cannot generalize and scale to other unseen scenarios. Nevertheless, as the volume of collected data grows, the coverage possibility for new corner cases diminishes, and the marginal effect of capability enhancement becomes increasingly pronounced [84], [85]. In contrast, knowledge-driven methods incorporate human knowledge and common sense into the autonomous driving system, facilitating the establishment of interconnections between different driving domains derived from realworld driving scenarios. Analogous to how a human only needs to have seen an ostrich in a zoo to recognize an ostrich running on a road, the knowledge-driven methods enable understanding and decision reasoning for complex autonomous driving scenarios through generalized scenario understanding capabilities acquired in other domains [13], [86]. Therefore, this approach is anticipated to bridge the gap between various driving domains, ultimately resulting in more generalized driving capabilities. Remarkably, the capabilities derived from this paradigm demonstrate the capacity to drive in broader domains compared to those obtained through data-driven approaches. This concept is further elucidated in Fig. 2: although data-driven approaches can acquire driving capability by extracting features from datasets, both single-domain learning and multi-domain learning are abstractions in high-dimensional spaces, ci or C ′ , with limited generalization capabilities. While knowledge-driven approaches can compress the driving capability space Cˆ into a low-dimensional manifold space [87] by summarizing the experiences from multi-domain data to construct foundational models with general comprehension capabilities. The driving scenario corresponding to this space not only includes the data collected during training, but also covers a lot of unseen data, including a large number of corner cases. 2.2 Exploring the Knowledge-driven Trend in CV Tasks In recent years, the traditional computer vision community has witnessed a significant transformation, shifting from the perceptive paradigm to the cognitive paradigm. In the earlier phase, data-driven methods predominantly focused on task completion without a profound understanding of the underlying semantics. 
This resulted in models that were effective at discrimination tasks but lacked true comprehension of the data. Image Classification, for example, has historically been a cornerstone of CV. Traditional data-driven methods, such as Convolutional Neural Networks (CNNs), focused on training models to recognize and categorize images. These methods excelled at specific tasks, including handwritten digit classification [88] and pioneering classification research on ImageNet [89]. Data-driven approaches for 2D Object Detection [90] aimed to locate and classify objects within images. Methods like Faster R-CNN [91] and YOLO [92] were widely adopted for this purpose. However, these methods primarily emphasized task performance without deep semantic understanding. Semantic/Instance Segmentation involves identifying object boundaries and their categories. Techniques like U-Net [93] and Mask R-CNN [94] are representative of data-driven approaches that excelled at segmentation tasks but did not emphasize semantic comprehension.

In contrast, knowledge-driven approaches aim to empower CV tasks with a deeper understanding of semantics and recognition. To address the issue that traditional methods fail to genuinely understand data, some research has shifted towards training generative models or combining multimodal data to learn more robust data representations.

Fig. 2. Comparison between the single-domain data-driven paradigm (left), the cross-domain data-driven paradigm (center), and the knowledge-driven paradigm (right). The gray × marks in the driving scenario represent corner cases; a transition to green × indicates that the corresponding method can handle them. Data-driven approaches focus on collecting domain-specific data d_i and obtaining driving capabilities c_i that are limited to handling only similar or corresponding domains d_i′. Even when multiple-domain data-driven approaches are implemented, they can only learn the driving capability C′ for processing the union of datasets D′. In contrast, knowledge-driven approaches aim to understand coherent features across domains by incorporating human knowledge or common sense and to establish relationships between features, achieving a broader range of driving capabilities Ĉ that far exceeds the performance of single-domain and cross-domain data-driven methods, i.e., D̂ ≫ D′ > {d_1, d_2, …, d_n}.

For instance, Image Captioning [95] attempts to make models comprehend the content of images and generate descriptive text, thereby demonstrating the model's true understanding of the image content. Visual Question Answering (VQA) [96] verifies the model's reasoning ability by constructing complex question-answer pairs related to image content. There are even datasets like Visual Genome [97] that support multiple complex tasks, such as object detection, image description, and object relationship inference, simultaneously. Moreover, with the increase in computational power, research in this domain has expanded from images to videos. Until now, research in the field of CV remains dynamic.
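As one concrete illustration of this shift toward semantically grounded recognition, a language-supervised model such as CLIP can be queried with free-form category descriptions it was never explicitly trained to classify. The following is a minimal sketch using the Hugging Face transformers wrappers; the checkpoint name, image path, and label set are illustrative choices rather than anything prescribed by the works cited here.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; any CLIP variant would serve the same purpose.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("road_scene.jpg")  # hypothetical image of an unusual road scene
labels = [
    "a photo of an ostrich on a road",
    "a photo of a deer on a road",
    "a photo of a pedestrian on a road",
]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores, turned into probabilities over the label set.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```

Zero-shot queries of this kind hint at the generalized scenario understanding that knowledge-driven autonomous driving seeks to exploit.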
The emergence of Generative Adversarial Networks (GAN) [98], [99] and Variational Autoencoders (VAE) [100] validates the potential of generative models, while the Diffusion Model [101]–[103] has elevated cross-modal understanding to a new level. 2.3 LLM: A Milestone for Knowledge-driven Approaches Recently, LLMs have achieved remarkable performance. These models have achieved remarkable performance by leveraging extensive training on massive text datasets, showcasing powerful text generation and comprehension capabilities. LLMs have demonstrated their competence in understanding natural language and tackling diverse complex tasks [104], emerging as a milestone in the development of knowledge-driven methods. Some notable examples of LLMs include GPT-3 [105], PaLM [106], LLaMA [107], and GPT-4 [108]. Notably, the emergent capability in LLMs is one of their most distinguishing features compared to smaller language models. Specifically, capabilities such as contextual learning [109], instruction following [110], [111], and chain of thought reasoning [112] are three typical emergent abilities in LLMs. Specifically, ChatGPT [113] and GPT-4 [108] represent significant advancements in LLM capabilities, especially in natural language understanding and generation. It’s worth noting that LLMs are seen as equipped with human-like intelligence and common sense to hold the potential to bring us closer to the field of Artificial General Intelligence (AGI) [104], [114]. Remarkable breakthroughs in LLMs underscore the critical importance of highquality data. These models exhibit robust reasoning capabilities and also possess emergent capacity, which lays a solid foundation for the development of knowledge-driven autonomous driving. 2.4 Significance of Knowledge-driven Methods to Autonomous Driving Data is critical to the development of autonomous driving technology, which relies on massive amounts of data to optimize algorithmic models to be able to recognize and understand the road environment to make the right decisions and actions [6], [115]. For example, the huge amount of data and driving scenarios accumulated by Tesla is an important reason for being able to stay ahead of the curve in autonomous driving algorithms. As the autonomous driving task is evolving from a single perception task to an integrated multi-task of perception and decision-making [116], the diversity and richness of autonomous driving data modalities are becoming critical. However, models trained solely on large amounts of collected data can only be third-person intelligence [66], [67], which refers to an AI system that observes, analyzes, and evaluates human behaviors and performances from a bystander’s perspective. However, the ultimate form of autonomous driving will be the realization of a generalized AI for the driving domain [117], [118], which makes the shift from the data-driven paradigm to the knowledge-driven paradigm an inevitable requirement for the evolution of autonomous driving. The knowledge-driven paradigm does not completely detach from the original data-driven approaches but adds the design of knowledge or common sense based on the data-driven approaches, such as common sense judgment, empirical induction, logical reasoning, etc. Knowledge-driven methods rely on AI agents to explore the environment and acquire general knowledge, as opposed to the implementation of predefined human rules or the portrayal of abstract characteristics from collected data [27], [79]. 
Specifically, the iterative updating of the knowledge-driven approach requires the continuous summarization of data from the agent's interaction with the environment to form new specialized domain knowledge and enhance specialized capabilities [32], [119].

Fig. 3. Key components in knowledge-driven autonomous driving.

Recent advancements in autonomous driving reflect this shift from purely data-driven methodologies toward knowledge-driven ones.

Transformation of Perception Module. Previous autonomous driving perception modules usually perform open-loop fitting on a dataset to recognize and localize semantic information in the scene, including 3D object detection [69], [120], [121], lane detection [122], [123], semantic segmentation [124], [125], etc. The inputs to the perception module are usually images captured by cameras and point clouds collected by LiDAR. Correspondingly, there are camera-only [120], [126], LiDAR-only [69], [127], [128], and LiDAR-camera fusion [121], [129], [130] schemes for perception methods. Recently, many scholars have realized that a full understanding of the environment requires a shift from perception to cognition. Since in-vehicle sensor setups typically provide comprehensive coverage across multiple sensor types and viewpoints, the collected multimodal data needs to be semantically aligned in a high-dimensional space to realize a true understanding of the driving scene [131], [132].

Knowledge-embedded Decision-making and Planning. Early automated driving decision planning was usually done by building explicit mathematical models to fit driving data, including the classical car-following and lane-changing models [133]–[135]. To improve the applicability of these models to different scenarios, the explicit mathematical models need to be continuously refined based on expert knowledge, which increases their complexity. However, the diversity of real-world scenarios makes such improvements increasingly challenging. As a result, researchers often resort to manually designed state machines to address as many corner cases encountered in real-vehicle testing as possible [136]–[138]. Contrastingly, another category of modeling concepts aims to harness the exploratory capabilities of heuristic search methods and the approximation power of deep learning, with the goal of surmounting challenges associated with manual design. Despite these efforts, these approaches continue to encounter difficulties in complex scenarios. Heuristic search methods heavily rely on human-designed heuristic functions, and the dimension explosion also poses a challenge to achieving approximately optimal solutions within finite time [139], [140]. Reinforcement learning methods require closed-loop training in simulation engines or even real environments at high cost, and the convergence of the model often depends on the reasonableness of the manually designed reward function [141], [142]. Although it is possible to obtain the reward function from data by inverse reinforcement learning methods [143], it also means that the model is less capable of generalizing to different environments.
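To make the closed-loop training setting discussed above concrete, the sketch below rolls out a random policy in the HighwayEnv simulator and applies a hand-crafted reward-shaping term of the kind such methods depend on. It assumes the highway-env package and its gymnasium interface; the shaping term and the use of the info dictionary are illustrative assumptions, not a prescribed recipe.

```python
import gymnasium as gym
import highway_env  # registers the highway-v0 family of environments

env = gym.make("highway-v0")
obs, info = env.reset(seed=0)

def shaped_reward(info, base_reward):
    # Illustrative hand-crafted shaping: penalize crashes heavily and keep
    # the environment's own progress signal. The "crashed" key is assumed
    # to be present in this environment's info dict.
    crash_penalty = -10.0 if info.get("crashed", False) else 0.0
    return base_reward + crash_penalty

terminated = truncated = False
episode_return = 0.0
while not (terminated or truncated):
    action = env.action_space.sample()  # a learned policy would act here
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += shaped_reward(info, reward)

print("episode return under the hand-crafted reward:", episode_return)
```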
Incorporating human knowledge to support autonomous driving also presents a significant challenge for decision planning. Compared with the insurmountable limitations of other decision planning models, including social force-based models [144], [145], risk field-based models [146], [147], etc., the powerful knowledge utilization and reasoning capabilities recently demonstrated by LLMs are more suitable for understanding, reasoning, and decision making for autonomous driving [13], [148], [149]. The Trend towards Modular Convergence. The end-to-end technology route was also the plain idea of early research in autonomous driving. For example, CMU’s Navlab implemented an autonomous driving system based on an end-to-end model as early as the 1980s [150], which used visual sensor data as inputs and directly outputted steering wheel angle, brake pedal strength, and other in-line signals to control the vehicle. However, this was limited by the uncertainty brought by the arithmetic conditions and black-box system at that time. With the diversified and uneven development of autonomous driving perception, planning, control, and other technologies, emerging autonomous driving companies represented by Tesla and Waymo have gradually constructed a modular-based autonomous driving pipeline [151], [152], which has become prevalent autonomous driving solutions. Subsequently, perception, planning, and decision-making have shown a trend of convergence, including the integration of prediction and decision-making, and even end-to-end autonomous driving [5], [78], [153], [154]. Researchers generally realize that autonomous driving is oriented to the ultimate goal of vehicle performance such as safety and efficiency [155], [156]. End-toend autonomous driving can avoid overall performance degradation due to heterogeneous optimization directions and cascading information transfer errors [116]. From a knowledge-driven perspective, perception, prediction, planning, and control have a sequential causal relationship, which is easily evidenced in common driving scenarios. For instance, a cyclist turning back in a non-motorized lane could indicate an intention to make a turn, and a vehicle activating its turn signal while proceeding straight may signify an upcoming lane change within a few seconds. The separate perception modules, which merely convey bounding boxes to the prediction and decision-making PREPRINT modules, present challenges in ensuring the subsequent modules’ performance effectiveness. In contrast, the end-to-end frameworks based on module fusion can efficiently extract and convey features closely associated with the driving task. However, existing end-toend frameworks still represent only a high level of abstraction of knowledge and are unable to articulate the utilization of driving knowledge manifested in the model output [5], [157]. Therefore, the textualized explanation of scene understanding and logical reasoning provided by the LLMs is anticipated to enhance the credibility and robustness of the existing end-to-end framework. In summary, the knowledge-driven paradigm stands at the forefront of recent advancements in autonomous driving technology. When equipped with high-quality data and a suitable environmental platform, the pivotal question becomes the design of effective knowledge-driven modeling solutions. This entails integrating human driving experience and common sense into the system, developing knowledge models endowed with the capability to reason and solve intricate driving challenges. 
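As a simplified illustration of the language-based scene reasoning discussed in this section, the sketch below prompts a chat-style LLM with a textual scene description and a constrained action set. It assumes an OpenAI-style chat-completions client; the scene, the action vocabulary, and the model name are illustrative and not drawn from any cited system.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is configured in the environment

scene_description = (
    "Ego vehicle travelling at 12 m/s in the right lane of a two-lane urban road. "
    "A cyclist 20 m ahead in the adjacent bike lane glances back over their shoulder. "
    "A delivery van is double-parked 35 m ahead, partially blocking the bike lane."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; any chat-capable model
    messages=[
        {
            "role": "system",
            "content": (
                "You are a cautious driving assistant. Reason step by step about the "
                "scene, then output exactly one action from "
                "{KEEP_LANE, SLOW_DOWN, CHANGE_LANE_LEFT, STOP} with a one-sentence justification."
            ),
        },
        {"role": "user", "content": scene_description},
    ],
)
print(response.choices[0].message.content)
```

In practice, such textual reasoning would be grounded in perception outputs and validated by a downstream planner rather than executed directly.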
Knowledge-driven modeling approaches empower autonomous driving systems to adeptly navigate evolving traffic and road scenarios, thereby enhancing system performance, interpretability, and safety. In the following sections, we will introduce the knowledge-driven system framework synthesized with key components, as illustrated in Fig. 3. This includes the development of datasets and benchmarks, how to construct high-quality environments, and how to acquire knowledge-driven driver agents for autonomous driving. 3 DATASETS & B ENCHMARKS The safety and reliability of autonomous driving systems have always been crucial evaluation factors. For the evaluation of knowledge-driven autonomous driving, researchers develop and assess these systems using appropriate datasets, benchmarks, and metrics. Traditional data-driven autonomous driving datasets [43], [158]–[163] provide mappings from sensor data to perception, prediction, and planning labels. Accompanied by the emergence of knowledge-driven autonomous driving, various groups of researchers augment preexisting [41], [164]–[169] or recently acquired datasets [45], [170]–[173] with different types of knowledge, mainly in the modality of natural language and gaze heatmap. By incorporating external knowledge, the intelligence level of autonomous driving models gradually evolves from perception level to cognition level, ensuring stronger reliability and interpretability. This section first introduces traditional autonomous driving datasets and then delves into existing knowledge-augmented autonomous driving datasets and corresponding benchmarking tasks. Finally, this section presents commonly used tasks and evaluation metrics in knowledge-oriented autonomous driving benchmarks. 3.1 Traditional Datasets This section provides a detailed introduction to traditional autonomous driving datasets, which are also visualized in Fig. 4(a). KITTI dataset [158] is a collection of sensor data recorded in and around Karlsruhe, Germany, with the main purpose of advancing computer vision and robotic algorithms for autonomous driving. It includes camera images, laser scans, high-precision GPS measurements, and IMU accelerations. The dataset provides precise instructions for accessing the data and insights into sensor limitations and common challenges. The sensor setup consists of grayscale and color cameras, a 3D laser scanner, and an inertial 6 and GPS navigation system. The dataset is divided into categories such as ’Road’, ’City’, ’Residential’, ’Campus’, and ’Person’, and includes raw data, object annotations in the form of 3D bounding boxes, tracklets, and calibration data. Cityscapes dataset [159] is a large-scale benchmark suite for semantic urban scene understanding. It consists of stereo video sequences captured from a moving vehicle in 50 different cities, primarily in Germany. The dataset includes 5,000 images with high-quality pixel-level annotations and an additional 20,000 images with coarse annotations. The data recording and annotation methodology were designed to capture the variability of outdoor street scenes. The dataset provides both fine and coarse pixellevel annotations for 30 visual classes, including instance-level labels for humans and vehicles. The annotations were carefully quality controlled, and the dataset includes vehicle odometry, outside temperature, and GPS tracks. The Cityscapes dataset surpasses previous attempts in terms of size, annotation quality, scene variability, and complexity. 
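As a brief aside on how such benchmarks are typically consumed in practice, torchvision ships a wrapper for the Cityscapes annotations described above; a minimal loading sketch, assuming the official archives have been extracted under an illustrative local path, looks as follows.

```python
from torchvision.datasets import Cityscapes

# Assumes the official Cityscapes archives have been extracted under ./data/cityscapes.
dataset = Cityscapes(
    root="./data/cityscapes",
    split="train",
    mode="fine",             # the 5,000 finely annotated images
    target_type="semantic",  # pixel-level class maps
)

image, semantic_map = dataset[0]
print(len(dataset), image.size)
```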
Berkeley DeepDrive Video dataset (BDDV) [160] is a large and diverse dataset consisting of real driving videos and GPS/IMU data. It covers various driving scenarios such as cities, highways, towns, and rural areas in major US cities. Compared to earlier datasets like KITTI and Cityscapes, BDDV stands out in terms of its scale, providing over 10,000 hours of driving videos. Additionally, BDDV includes smartphone sensor data such as GPS, IMU, gyroscope, and magnetometer readings, which can be used to analyze vehicle trajectory and dynamics. The dataset aims to capture the diversity of driving scenes, car makes and models, and driving behaviors. This makes BDDV suitable for learning a generic driving model.

Honda Research Institute Driving dataset (HDD) [161] is a collection of sensor data recorded from an instrumented vehicle in the San Francisco Bay Area. The dataset includes video from three cameras, 3D LiDAR data, GPS signals, and signals from the vehicle's CAN bus. The data collection aimed to capture diverse traffic scenes and driver behaviors. The dataset consists of 104 hours of video, with annotations based on a 4-layer representation of driver behavior. The annotation methodology incorporates both objective criteria and subjective judgment. The dataset provides insights into driver behavior, including goal-oriented actions, stimulus-driven actions, causes, and attention. The dataset is around 150 GB in size, including 137 sessions with an average duration of 45 minutes.

nuScenes dataset [43] is a collection of driving data from Boston and Singapore, featuring diverse locations, weather conditions, and driving scenarios. The dataset includes 84 logs with 15 hours of driving data, captured using Renault Zoe electric cars equipped with various sensors. The data is carefully synchronized, and localization is achieved through a robust LiDAR-based method. Highly accurate human-annotated semantic maps and baseline routes are provided. The dataset contains 1,000 manually selected interesting scenes, covering high traffic density, rare events, and challenging situations. Expert annotators provide detailed annotations for 23 object classes, including pedestrians and vehicles. The dataset encourages research on long-tail problems and offers high-frequency sensor frames.

Waymo Open dataset (WOD) [162] provides sensor data collected using five LiDAR sensors and five high-resolution pinhole cameras. The LiDAR data includes the first two returns of each laser pulse, while the camera images are captured using rolling shutter scanning. The dataset offers ground truth annotations for both LiDAR and camera data, including 3D bounding boxes for objects in LiDAR data and 2D bounding boxes for objects in camera images. Multiple coordinate systems are used, such as global, vehicle, sensor, and image frames. The dataset covers suburban and urban areas, with approximately 12 million labeled 3D LiDAR objects and 12 million labeled 2D image objects.

Fig. 4. (a) Traditional and (b) knowledge-augmented autonomous driving datasets. The arrow indicates that the knowledge-augmented datasets are derived from the corresponding source dataset through secondary annotation.

3.2 Knowledge-augmented Datasets

This section provides a detailed introduction to knowledge-augmented autonomous driving datasets, which are also visualized in Fig. 4(b). Additionally, Table 1 presents key attributes of existing knowledge-augmented datasets.
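Before introducing the individual datasets, the following schematic and entirely hypothetical record illustrates the pattern they share: language- or gaze-based knowledge is attached, through secondary annotation, to an existing sensor frame. All field names and values are illustrative and not taken from any specific dataset.

```python
# A schematic (hypothetical) record showing how knowledge-augmented datasets
# typically attach language-based annotations to an underlying sensor frame.
knowledge_augmented_sample = {
    "source_dataset": "nuScenes",        # the traditional dataset being augmented
    "sample_token": "abc123...",         # pointer to a synchronized sensor frame
    "sensors": ["CAM_FRONT", "LIDAR_TOP"],
    "caption": "A pedestrian is waiting at the crosswalk while the light is red.",
    "qa_pairs": [
        {"question": "Is it safe to proceed straight?",
         "answer": "No, yield until the pedestrian has crossed."},
    ],
    "referred_object": {
        "command": "pull over behind the parked truck",
        "box_3d": [12.4, -1.8, 0.9, 4.6, 2.0, 1.8, 0.02],
    },
    "gaze_heatmap_path": "gaze/abc123.png",  # optional attention annotation
}
```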
Berkeley DeepDrive eXplanation (BDD-X) dataset [166] contains over 77 hours of driving videos with accompanying textual justifications for driving actions. It includes diverse driving conditions and activities, such as lane changes and turns, annotated by human annotators familiar with US driving rules. It consists of a training set, a validation set, and a test set with a total of 6,984 videos. The BDD-X dataset aims to improve the trust and user-friendliness of self-driving cars by providing explanations for their decisions. To fulfill this goal, the dataset utilizes three benchmarking tasks, namely vehicle control, explanation generation, and scene captioning.

Cityscapes-Ref dataset [165] focuses on object referring in videos, incorporating language descriptions and human gaze. It includes 5,000 stereo video sequences from the Cityscapes dataset, with annotations for object descriptions, bounding boxes, and gaze recordings. The dataset aims to address the limitations of previous datasets by providing temporal and spatial context as well as gaze information. The Cityscapes-Ref dataset employs the task of object referring for benchmarking.

DR(eye)VE dataset [170], [174] consists of 555,000 frames from 74 sequences, captured during a driving experiment with eight drivers in various contexts and weather conditions. The dataset includes eye-tracking data from SMI ETG glasses and car-centric views from a roof-mounted camera. The dataset enables the analysis of driver behavior and attention in real-life driving scenarios. Fixation maps are computed using a temporal sliding window, and attention drifts are labeled for evaluation purposes. Multiple baselines are tested on the DR(eye)VE dataset for the task of gaze prediction.

Honda Research Institute-Advice dataset (HAD) [167] consists of 5,675 driving video clips with human-annotated textual advice. The videos are collected from the HDD dataset [161] and include various driving activities in urban settings. Annotators describe the driver's actions and provide attention descriptions from a driving instructor's perspective. The dataset contains a total of 25,549 action descriptions and 20,080 attention descriptions. The advice covers topics such as speed, driving maneuvers, traffic conditions, and road elements. Multiple baseline methods are evaluated on the HAD dataset for the task of vehicle control.

Talk2Car dataset [41] is built upon the nuScenes dataset and includes 850 videos with written commands for autonomous driving. The dataset covers different cities, weather conditions, and times of day. Each video has annotations for six cameras, LiDAR, GPS, IMU, radar, and 3D bounding boxes for 23 object classes. The dataset contains 11,959 commands, with an average of 11.01 words per command. The dataset provides a wide distribution of commands, object distances, and object categories. The Talk2Car dataset employs the task of object referring for benchmarking.

DADA-2000 dataset [171], [175] is a collection of accident videos obtained from various video websites. It consists of 658,476 frames from 2,000 videos, covering a duration of 6.1 hours. The dataset includes diverse accident categories and provides annotations for spatial crash objects, temporal accident windows, and attention maps. It offers a comprehensive representation of accident situations in driving scenes and is more complex compared to previous datasets for driving accident analysis. This dataset utilizes gaze prediction as the primary benchmarking task.

TABLE 1
Key attributes of existing knowledge-augmented datasets. C, L, and R stand for Camera, LiDAR, and Radar, respectively.

Dataset | Sensors | Knowledge Form | Tasks | Metrics
BDD-X [166] | C | Explanation | Vehicle Control, Explanation Generation, Scene Captioning | MAE, MDC, BLEU-4, METEOR, CIDEr-D
Cityscapes-Ref [165] | C | Object Referral, Gaze Heatmap | Object Referring | Acc@1
DR(eye)VE [170] | C | Gaze Heatmap | Gaze Prediction | CC, KLD, IG
HAD [167] | C | Advice | Vehicle Control | MAE, MDC
Talk2Car [41] | C+L+R | Object Referral | Object Referring | IoU@0.5
DADA-2000 [171] | C | Gaze Heatmap, Crash Objects, Accident Window | Gaze Prediction | CC, KLD, NSS, SIM
HDBD [172] | C | Gaze Heatmap, Takeover Intention | Driver Takeover Detection | AUC
Refer-KITTI [164] | C+L | Object Referral | Object Referring, Object Tracking | HOTA
DRAMA [173] | C | Advice, Risk Localization | Motion Planning | L2 Error, Collision Rate
Rank2Tell [45] | C+L | Object Referral, Importance Ranking | Importance Estimation, Scene Captioning | F1 Score, Accuracy, BLEU-4, METEOR, ROUGE, CIDEr
DriveLM [169] | C+L+R | Scene Captioning, Question Answering | Scene Captioning, Question Answering | -
NuScenes-QA [168] | C+L+R | Question Answering | Question Answering | Exist, Count, Object, Status, Comparison, Acc

HRI Driver Behavior dataset (HDBD) [172] contains driver behavior data collected from simulator and real scene videos. The dataset includes behavioral and physiological signals from 28 participants, along with environmental and vehicle sensory information. The data was collected using eye-tracking devices, physiological sensors, and vehicle/driving simulator sensory data. The dataset includes human-AV interaction data from 32 participants, focusing on monitoring L2 automated driving through intersections. The dataset provides information on takeover intention, HMI transparency levels, maneuvers, weather conditions, and synchronized signals for analysis. The authors evaluate multiple baseline methods on the HDBD dataset for the driver takeover detection task.

Rank2Tell dataset [45] consists of 116 video clips captured at intersections using multiple cameras, LiDAR sensors, and GPS in diverse traffic scenes. The dataset focuses on identifying and ranking important agents that can influence the ego vehicle's driving. Annotations are provided by five annotators, considering agent identification, localization, ranking, and captioning. The dataset emphasizes explainability by providing captions that explain why agents are deemed significant. The dataset enables the evaluation of agent importance perception and caption diversity in traffic scenes. This dataset employs two benchmarking tasks, namely importance estimation and scene captioning.

Refer-KITTI dataset [164] is constructed based on the public KITTI dataset [158] and is aimed at referring understanding. It utilizes instance-level box annotations from KITTI and a labeling tool to efficiently annotate referent objects across frames. The dataset features diverse scenes and provides descriptive statistics on object numbers and temporal dynamics. Refer-KITTI includes 818 expressions and is split into 15 training videos and 3 testing videos, offering flexibility and temporal challenges for referent object association. This dataset utilizes object referring and tracking as the primary benchmarking task.

DRAMA dataset [173] is designed for evaluating visual reasoning capabilities in driving scenarios. It consists of 17,785 interactive driving scenarios recorded from urban roads in Tokyo. The dataset includes synchronized videos, CAN signals, and IMU information. Annotations are provided through object-level and video-level questions and answers, focusing on identifying important objects and generating associated attributes and captions. The dataset statistics highlight the distribution of labels, object types, visual attributes, and reasoning descriptions. The DRAMA dataset utilizes motion planning as the primary benchmarking task.

DriveLM dataset [169] is an autonomous driving dataset that connects LLMs and autonomous driving systems. It incorporates linguistic information and reasoning abilities to facilitate perception, prediction, and planning (P3) in autonomous driving. The dataset includes frame-based QA pairs connected in a graph-style structure, covering perception, prediction, and planning tasks. It is based on the nuScenes dataset and aims to enhance the reasoning and decision-making capabilities of autonomous driving systems. Scene captioning and question answering tasks are incorporated for benchmarking.

nuScenes-QA dataset [168] is constructed for 3D question answering in driving scenarios. It combines scene graphs generated from 3D annotations with manually designed question templates to generate question-answer pairs. The dataset contains 459,941 pairs based on 34,149 visual scenes, with a wide range of question types and lengths. It is the largest 3D-related question answering dataset, providing balanced distributions of questions and answers. The dataset poses challenges for models due to its complexity and diverse visual semantics.

3.3 Benchmarking Tasks and Evaluation Metrics

This section offers an in-depth overview of various benchmarking tasks and associated evaluation metrics specific to knowledge-driven autonomous driving.

Motion Prediction and Planning involves forecasting the trajectories of various traffic participants (vehicles, pedestrians, etc.) and planning the future movements of an ego vehicle in both open-loop and closed-loop manners. Key metrics for motion prediction include Average Displacement Error (ADE), Final Displacement Error (FDE), Miss Rate, Overlap Rate, Average Heading Error (AHE), and Mean Average Precision (mAP). ADE assesses the average displacement error of the closest prediction to the ground truth trajectory, while FDE evaluates the displacement error at a specific future time step. Miss Rate is determined by whether the model's predictions for traffic participants fall within certain thresholds of the ground truth trajectory. Overlap Rate examines the incidence of predicted trajectories overlapping with other objects in the scenario. AHE is defined as the average of the heading angle differences between the predicted trajectory and the ground truth. mAP provides a comprehensive evaluation by categorizing trajectories and measuring the precision and recall of the predictions against the ground truth. For the task of open-loop planning, metrics are similar to those of motion prediction, as they involve predicting the ego vehicle's future trajectory. In contrast, closed-loop ego vehicle planning tasks entail following the output trajectory from the method and continuously interacting with traffic participants in the dynamic scene. Key metrics for closed-loop planning tasks typically include No at-fault Collisions, Drivable Area Compliance, Speed Limit Compliance, Comfort, and Time to Collision (TTC) within bounds.
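The open-loop displacement metrics above reduce to simple distance computations over matched waypoints; a minimal numpy sketch is given below. The toy trajectories are illustrative, and multi-modal predictors usually report the minimum ADE/FDE over their predicted modes.

```python
import numpy as np

def ade_fde(pred, gt):
    """Average and Final Displacement Error for one trajectory.

    pred, gt: arrays of shape (T, 2) holding (x, y) waypoints at matched timesteps.
    """
    dists = np.linalg.norm(pred - gt, axis=-1)  # per-timestep Euclidean error
    return dists.mean(), dists[-1]

# Toy example: a prediction that drifts laterally by 0.5 m over 3 s (6 steps at 2 Hz).
t = np.linspace(0.5, 3.0, 6)
gt = np.stack([10.0 * t, np.zeros_like(t)], axis=-1)            # straight at 10 m/s
pred = np.stack([10.0 * t, np.linspace(0.0, 0.5, 6)], axis=-1)  # drifting prediction
ade, fde = ade_fde(pred, gt)
print(f"ADE = {ade:.2f} m, FDE = {fde:.2f} m")
```

Closed-loop metrics, by contrast, are computed on the trajectory the ego vehicle actually executes while interacting with other agents in simulation, rather than on a one-shot prediction.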
These metrics ensure that the ego vehicle’s trajectory avoids collisions with other vehicles, drives within the mapped drivable area, and obeys speed limits at all times. Comfort is measured by evaluating the minimum and maximum longitudinal and lateral accelerations of the ego vehicle’s driven trajectory. Scene Captioning and Explanation Generation. Given a stream or a frame of sensory data, e.g. camera and/or LiDAR data, these two tasks require the captioning model and explaining model to generate description and reasoning texts. To evaluate the performance of captioning and explaining models, metrics including BLEU [176], METEOR [177], ROUGE [178], CIDEr [179], CIDEr-D [180] are adopted, whose details are discussed below. BLEU [176] is an automatic evaluation metric that measures the similarity between a machine-generated translation and reference translations based on n-gram precision. It calculates the precision of n-grams up to a 4-gram level by counting matching n-grams. The modified precisions for each n-gram length are combined using a weighted geometric mean to compute the BLEU score, which ranges from 0 to 1. METEOR [177] is an automatic evaluation metric used to assess the quality of machine-generated translations or text generation systems. It captures overall quality, fluency, and adequacy. METEOR calculates the number of matching unigrams between the machine-generated translation and reference translations, considering exact word matches, stemming, and synonymy. Precision, recall, alignment, and ordering scores are combined using a weighted harmonic mean to obtain the final METEOR score. ROUGE [178] is an automatic evaluation metric commonly used in NLP to assess text summarization systems. It quantifies the overlap between the generated summary and reference summaries. ROUGE involves preprocessing, n-gram matching, calculation 9 of recall and precision, and computation of the F-measure as the harmonic mean of recall and precision. ROUGE scores are typically computed for multiple n-gram lengths and aggregated to obtain an overall score. Consensus-based Image Description Evaluation (CIDEr) [179] is an evaluation metric used in the field of computer vision and image captioning to assess the quality of automatically generated image captions. It aims to capture both the relevance and diversity of the generated captions. CIDEr measures the consensus between the generated captions and the human-generated reference captions. The CIDEr metric computes the similarity between the generated captions and the reference captions based on n-gram matching and term frequency-inverse document frequency (TF-IDF) weighting. CIDEr with Diversity (CIDEr-D) [180] is an extension of the CIDEr metric that incorporates diversity into the evaluation. It encourages the generation of diverse and informative captions by penalizing captions that are similar to each other. CIDEr-D achieves this by adding a diversity term to the original CIDEr score, which measures the uniqueness of the generated captions. Object Referring involves referring to specific objects within images or scenes using natural language descriptions. In object referring, a typical scenario involves an image or a scene accompanied by a textual description that refers to a particular object or region of interest within that visual input. The goal is to develop models that can comprehend the textual description and effectively map it to the corresponding object or region in the image. Commonly used metrics include Acc@1 and IoU@0.5. 
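Both criteria, described in detail next, reduce to a ranking check combined with a simple geometric overlap test. A minimal sketch with axis-aligned 2D boxes and illustrative values is shown below.

```python
import numpy as np

def iou_2d(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# For each referral instance: the ground-truth box and the model's top-ranked box.
gt_boxes  = [(10, 10, 50, 60), (100, 40, 140, 90)]
top1_pred = [(12, 12, 48, 58), (30, 30, 70, 80)]

hits = [iou_2d(p, g) >= 0.5 for p, g in zip(top1_pred, gt_boxes)]
print("fraction of instances with top-1 IoU >= 0.5:", np.mean(hits))
```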
The Acc@1 metric is a commonly used evaluation measure to assess the performance of models in accurately localizing or identifying referred objects. Formally, let N denote the total number of object referral instances in the evaluation dataset. For each instance, the model generates a ranked list of predictions, typically consisting of bounding boxes or class labels. The Acc@1 metric measures the percentage of instances where the ground truth annotation for the referred object aligns with the top-ranked prediction made by the model.

The IoU@0.5 metric is a commonly used evaluation measure to assess the accuracy of object localization. Formally, let N denote the total number of object referral instances in the evaluation dataset. For each instance, the model generates a predicted bounding box for the referred object, and a corresponding ground truth bounding box is provided. The IoU@0.5 metric calculates the percentage of instances where the Intersection over Union between the predicted bounding box and the ground truth bounding box equals or exceeds 0.5.

Gaze Prediction involves predicting the spatial probability distribution of a person's gaze within a given visual scene in autonomous driving. Commonly used evaluation metrics include Pearson's Correlation Coefficient (CC) [181], Kullback-Leibler Divergence (KLD) [182], Information Gain (IG) [183], Normalized Scanpath Saliency (NSS), and the Similarity Metric (SIM). To be specific, CC [181] measures the linear relationship between two variables. It quantifies the strength and direction of the linear association between the predicted attention map and the ground-truth fixations. Pearson's correlation coefficient ranges from -1 to 1, where a value of 1 indicates a perfect positive linear relationship, 0 indicates no linear relationship, and -1 indicates a perfect negative linear relationship. KLD [182] quantifies the amount of information lost when comparing the probability distribution of the predicted attention maps to the ground-truth distribution. A smaller KLD value indicates a lower amount of information loss, meaning that the predicted maps closely resemble the ground-truth distribution. NSS computes the average of the normalized (z-scored) values of the predicted attention map at the ground-truth fixation locations. It measures how well the predicted attention map aligns with the ground-truth fixations, with higher values indicating better alignment. SIM evaluates the similarity between the predicted attention map and the ground-truth distribution. A larger SIM value indicates a better approximation of the ground-truth distribution by the predicted attention map.
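These distribution-level criteria amount to a few array operations over the predicted and ground-truth maps. The sketch below follows the conventions commonly used in saliency benchmarks; exact smoothing and normalization choices vary from benchmark to benchmark, and the maps here are random stand-ins.

```python
import numpy as np

def cc(pred, gt):
    return np.corrcoef(pred.ravel(), gt.ravel())[0, 1]

def kld(pred, gt, eps=1e-8):
    p = pred / (pred.sum() + eps)
    g = gt / (gt.sum() + eps)
    return np.sum(g * np.log(eps + g / (p + eps)))

def nss(pred, fixation_mask):
    z = (pred - pred.mean()) / (pred.std() + 1e-8)  # z-score the predicted map
    return z[fixation_mask > 0].mean()              # average at fixated pixels

def sim(pred, gt, eps=1e-8):
    p = pred / (pred.sum() + eps)
    g = gt / (gt.sum() + eps)
    return np.minimum(p, g).sum()                   # histogram intersection

pred_map = np.random.rand(36, 64)       # illustrative predicted saliency map
fixations = np.zeros_like(pred_map)
fixations[18, 32] = 1                   # single ground-truth fixation
gt_map = fixations + 0.01               # crudely smoothed fixation density
print(cc(pred_map, gt_map), kld(pred_map, gt_map),
      nss(pred_map, fixations), sim(pred_map, gt_map))
```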
Question Answering. In autonomous driving scenarios, the Question Answering task refers to the process of answering questions related to the visual perception of the autonomous vehicle. It involves analyzing the visual data captured by the vehicle's sensors, such as cameras, LiDAR, or radar, and providing meaningful answers to questions about the environment. The questions in NuScenes-QA [168] can be categorized into five groups based on their query formats. The first category is "Exist", which involves querying whether a particular object exists in the scene. The second category is "Count", where the model is asked to count objects in the scene that meet specific conditions mentioned in the question. The third category is "Object", which tests the model's ability to recognize objects in the scene based on language descriptions. The fourth category is "Status", which involves querying the status of a specified object. Lastly, the fifth category is "Comparison", where the model is requested to compare specified objects or their statuses.

4 ENVIRONMENT

Similar to other AI agent systems, autonomous driving systems require continuous iteration through training to enhance performance, thereby strengthening their adaptability in the environment. Training can utilize collected datasets from real-world environments or take place within constructed closed-loop simulation environments [184]–[186]. Previous autonomous driving algorithms predominantly rely on the former approach. This involves initial offline training and testing using collected data, followed by deploying the model on vehicles for on-road testing. New issues identified during road testing prompt engineers to repeat the entire process. However, this process involves considerable human and material resources, as several stages incur significant costs, including data collection, annotation, and model training [18]. Thus, some researchers have shifted focus towards "virtual testing" [184], [187]. Shadow mode [188] represents a typical virtual testing approach to self-supervised training by constructing supervisory signals based on the real environment and human driver decisions. Shadow mode enables cloud-based training through data feedback or on-vehicle training through federated learning. Testing on simulation engines is another highly anticipated approach [189]–[191]. The self-training and iteration processes within simulated environments can reduce the costs of data collection and annotation, and they align more closely with human learning skills: observation, interaction, and imitation [192]. This methodology is expected to play a crucial role in the knowledge-driven autonomous driving paradigm. Additionally, the emergence of world models enables us to contemplate key issues in scene understanding and construction from the perspective of generative models. As shown in Fig. 5, we demonstrate the combination of the real-world environment and the virtual simulation.

Fig. 5. From the real-world environment to the virtual simulation environment. The utilization of graphics engines enables the perception of real-world environments and the assembly of virtual simulated environments, but this approach incurs high costs. Implicit reconstruction methods, which render simulated environments by collecting data from multiple sources, emerge as a promising and cost-effective solution. Integrating knowledge and data to construct world models facilitates a genuine understanding of the environment, enabling the accomplishment of diverse tasks, particularly in synthesizing data to support closed-loop simulations.

This section shows the role of the environment in knowledge-driven autonomous driving from three perspectives: (1) simulation engines; (2) high-fidelity sensor simulation; (3) world models.

4.1 Simulation Engines

Simulation engines for autonomous vehicles refer to computer-based simulations of real-world scenarios, including urban roads, highways, various weather conditions, and traffic situations, to facilitate improved training and evaluation of algorithm performance [193]. Compared to traditional on-road testing, simulation engines offer several advantages.
4 ENVIRONMENT

Similar to other AI agent systems, autonomous driving systems require continuous iteration through training to enhance performance, thereby strengthening their adaptability in the environment. Training can utilize collected datasets from real-world environments or take place within constructed closed-loop simulation environments [184]–[186]. Previous autonomous driving algorithms predominantly rely on the former approach. This involves initial offline training and testing using collected data, followed by deploying the model on vehicles for on-road testing. New issues identified during road testing prompt engineers to repeat the entire process. However, this process involves considerable human and material resources, as several stages incur significant costs, including data collection, annotation, and model training [18]. Thus, some researchers have shifted focus towards "virtual testing" [184], [187]. Shadow mode [188] represents a typical virtual testing approach to self-supervised training by constructing supervisory signals based on the real environment and human driver decisions. Shadow mode enables cloud-based training through data feedback or on-vehicle training through federated learning. Testing on simulation engines is another highly anticipated approach [189]–[191]. The self-training and iteration processes within simulated environments can reduce the costs of data collection and annotation, and align more closely with human learning skills: observation, interaction, and imitation [192]. This methodology is expected to play a crucial role in the knowledge-driven autonomous driving paradigm. Additionally, the emergence of world models enables us to contemplate key issues in scene understanding and construction from the perspective of generative models. As shown in Fig. 5, we demonstrate the combination of the real-world environment and the virtual simulation.

Fig. 5. From the real-world environment to the virtual simulation environment. The utilization of graphics engines enables the perception of real-world environments and the assembly of virtual simulated environments, while this approach incurs high costs. Implicit reconstruction methods, which render simulated environments by collecting data from multiple sources, emerge as a promising and cost-effective solution. Integrating knowledge and data to construct world models facilitates a genuine understanding of the environment, enabling the accomplishment of diverse tasks, particularly in synthesizing data to support closed-loop simulations.

This section shows the role of the environment in knowledge-driven autonomous driving from three perspectives: (1) simulation engines; (2) high-fidelity sensor simulation; (3) world models.

4.1 Simulation Engines

Simulation engines for autonomous vehicles refer to computer-based simulations of real-world scenarios, including urban roads, highways, various weather conditions, and traffic situations, to facilitate improved training and evaluation of algorithm performance [193]. Compared to traditional on-road testing, simulation engines offer several advantages. Firstly, simulation engines provide a safer and more controlled environment, mitigating potential risks associated with testing autonomous vehicles on real roads. Secondly, simulation engines can generate large-scale annotated datasets, which are crucial for training deep learning models. Additionally, simulation engines assist development teams in faster iteration and debugging, enabling anomaly detection and algorithm optimization, thereby enhancing development efficiency. Lastly, simulation engines can generate diverse scenarios in a well-controlled environment, ensuring that the system can respond correctly to various challenges.

In existing autonomous driving systems, distinct simulation tools are available for each stage, including perception, decision-making, planning, and control. For example, SUMO [190] and LimSim [191] can simulate traffic flows and model the motion interactions between vehicles; HighwayEnv [194], nuPlan [195], and waymax [196] provide closed-loop simulation for decision-making; CarSim [197] provides simulation for vehicle dynamics. However, comprehensive testing for autonomous driving necessitates simulation engines that encompass various stages, creating a simulation environment closely resembling the real world. Therefore, Virtual Test Drive [198] and CARLA [189] are designed based on game engines, such as Unreal Engine [199] and Unity Engine [200], aiming to establish three-dimensional end-to-end closed-loop simulation environments. Nevertheless, this construction method still demands substantial human and material resources for manually designing road structures and creating three-dimensional objects, posing challenges for large-scale applications. Moreover, simulators based on these game engines still exhibit significant deficiencies, contributing to domain gaps that impair the accuracy of algorithms trained in simulated scenarios when applied in the real world.

The current research trend involves integrating real-world data into simulation engines [201]. Firstly, the realism of simulation engines can be heightened by leveraging knowledge gained from real-world data. Furthermore, datasets often contain precise annotation information, facilitating a more comprehensive evaluation of the capabilities of the autonomous driving algorithm across different perception and decision-making modules within the simulation engine. Lastly, real-world data may encompass challenging scenarios, and incorporating these scenarios into simulation engines aids in testing the algorithms' robustness when confronted with various challenges. It is noteworthy that, as datasets cannot cover all conceivable scenarios, simulation engines also need to synthesize data to encompass diverse scenarios.
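As a concrete picture of the closed-loop setup such tools provide, the sketch below rolls out a short episode in HighwayEnv [194]; it assumes the gymnasium-based API of recent highway-env releases, and the randomly sampled action is only a stand-in for an actual driving policy.

```python
import gymnasium as gym
import highway_env  # noqa: F401  (importing registers the highway-v0 family of environments)

# Closed-loop rollout: the agent acts, the simulator returns the next observation
# and a reward, and a crash or time limit ends the episode.
env = gym.make("highway-v0")
obs, info = env.reset(seed=0)
total_reward, done, truncated = 0.0, False, False

while not (done or truncated):
    action = env.action_space.sample()  # placeholder for a learned or rule-based policy
    obs, reward, done, truncated, info = env.step(action)
    total_reward += reward

print(f"episode return: {total_reward:.2f}, crashed: {info.get('crashed', False)}")
env.close()
```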
4.2 High-Fidelity Sensor Simulation

Although synthesizing large-scale data through simulation engines is advantageous for training autonomous driving systems [202], achieving high-fidelity sensor simulation within these engines is a current research challenge. Due to the poor rendering realism of autonomous driving simulators based on game engines [203], meeting the requirements of end-to-end closed-loop simulation for sensor simulation becomes difficult. Consequently, models trained in these closed-loop simulators struggle to reflect their real-world capabilities. Rendering quality has thus become a focal area of research in simulation engine technology.

In recent years, the emergence of neural rendering technologies, like Neural Radiance Fields (NeRF) [204], has shed light on this direction. Neural rendering models objects through implicit representations, calculating the difference between the rendering result and the ground truth and using backpropagation to refine the representation, ultimately achieving high-quality 3D reconstruction and rendering. Following its introduction, neural rendering rapidly expanded from single-object reconstruction to applications in indoor environments [205]–[208], static scenes (BlockNeRF [209]), and dynamic scenarios (NeuRAD [210]). Subsequently, UniSim [51] achieved decoupled 3D reconstruction of foreground objects, demonstrating generalization and the ability to generate new data. StreetSurf [50] achieved decoupled reconstruction of close-range, mid-range (streets), and far-range (sky) scenes, further enhancing the quality of street scene reconstruction. MARS [48] also utilized NeRF technology to construct an autonomous driving simulation engine. Additionally, ReSimAD [72] validated the performance improvement brought about by applying data generated by neural rendering to perception algorithm training, demonstrating the importance of high-fidelity sensor simulation.

Despite the widespread attention neural rendering technology has garnered in academia and industry, and ongoing efforts to better apply this technology to autonomous driving scenarios, challenges persist in constructing simulation engines based on neural rendering. Firstly, neural rendering fundamentally remains a 3D reconstruction algorithm: it demands high-quality reconstruction data and is sensitive to motion blur, pose errors, lighting changes, lens flares, and other imperfections in the input data. Secondly, the pursuit of high fidelity in 3D reconstruction compromises its generalization, making it challenging to generate photorealistic virtual scenes in the way that diffusion models [211], GANs [98], and other generative models can. Thirdly, large-scale scene reconstruction and rendering pose significant computational challenges, impacting the feasibility of constructing sensor-level high-fidelity simulation engines using neural rendering in terms of reconstruction speed and real-time rendering. Due to the limited generalization capabilities of neural rendering for scenes, environment simulation built upon it can only originate from reconstruction data, making it difficult to contribute to the generation of corner cases that are relevant to autonomous driving simulation. Fig. 6 demonstrates a promising technical framework for the generalized environment based on neural rendering.

Fig. 6. Generalized environment based on neural rendering. (1) Real-world data: processing and annotating multi-view images and point clouds, and comprehending scenes through information derived from GPS and poses; (2) Full scene reconstruction: neural rendering technology can decouple and reconstruct the foreground and background separately, and various generalized scenes can be generated using dynamic trajectory generation techniques; (3) Sensor simulation: exploring different types of in-vehicle sensors, different layout schemes, and simulations under weather and other disturbances.
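The implicit-representation-plus-backpropagation loop underlying NeRF-style methods can be pictured with the standard volume-rendering quadrature sketched below; the radiance_field callable and the toy scene are illustrative stand-ins, and in practice the same computation is written in an autodiff framework so that the photometric error can be backpropagated into the learned field.

```python
import numpy as np

def render_ray(radiance_field, origin, direction, near=0.5, far=50.0, n_samples=64):
    """Discrete volume-rendering quadrature used by NeRF-style renderers:
    sample the field along a camera ray and alpha-composite densities and colors."""
    t = np.linspace(near, far, n_samples)                 # depths of the samples along the ray
    points = origin + t[:, None] * direction              # (n_samples, 3) sample positions
    sigma, rgb = radiance_field(points)                   # densities (n,), colors (n, 3)
    delta = np.append(np.diff(t), 1e10)                   # spacing between consecutive samples
    alpha = 1.0 - np.exp(-sigma * delta)                  # per-segment opacity
    transmittance = np.cumprod(np.append(1.0, 1.0 - alpha[:-1] + 1e-10))
    weights = alpha * transmittance                       # contribution of each sample
    return (weights[:, None] * rgb).sum(axis=0)           # composited pixel color

def toy_field(points):
    """Stand-in for a learned implicit representation: a fuzzy colored sphere of radius 2."""
    dist = np.linalg.norm(points, axis=-1)
    sigma = np.where(dist < 2.0, 5.0, 0.0)
    rgb = np.tile(np.array([0.8, 0.3, 0.2]), (points.shape[0], 1))
    return sigma, rgb

pixel = render_ray(toy_field, origin=np.array([0.0, 0.0, -10.0]), direction=np.array([0.0, 0.0, 1.0]))
# In training, the error between `pixel` and the captured pixel would be backpropagated
# (via an autodiff framework) to refine the implicit representation.
```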
Drawing upon multi-view images, LiDAR-collected point clouds, precise GPS coordinates, sensor poses, and multi-sensor calibrations [212], neural rendering technology exhibits the capability to independently reconstruct the foreground and background within a given scene. The foreground reconstruction encapsulates the nuanced portrayal of movements and interactions among traffic participants. The latest dynamic trajectory generation techniques [213], [214] can facilitate the generation of varied traffic flows distinct from the original scene. The achievement of high-fidelity sensor simulation necessitates a thorough consideration of diverse sensor types, placements [215], and potential environmental perturbations, including those induced by varying weather conditions.

4.3 Environment Understanding by World Model

The world model aims to simulate and understand physical laws and phenomena in the real world, and can be considered an abstract representation of the environment [222]. The main idea is to build an abstract representation of the real world by learning from data acquired from multiple sensors, such as images, sounds, and other sensor readings. The model can then use this abstract representation to make inferences and predictions in order to make decisions about unknown situations. Such models have a wide range of applications in areas such as robot control, autonomous driving, and game AI. Currently, the world model is usually built as an end-to-end deep learning framework that can be trained using self-supervised or weakly supervised methods directly from raw sensor data, without manual feature extraction. The advantage of this approach is that it can handle complex nonlinear relationships between different objects in the scene and adaptively fit different environments and tasks. This makes the world model a universal way of understanding the real world, similar to the way humans think [79]. The JEPA model [27] aims to construct mapping relationships between different inputs in the encoding space by minimizing input information and prediction errors.

The world model can enhance the ability of autonomous driving to understand the environment and support large-scale, high-quality driving video generation [217]–[221], as shown in Table 2.

TABLE 2
Overall comparison of existing methods for generating realistic driving scenarios. Methods differ in the priors they condition on (3D boxes, HD maps), the outputs they produce (multi-view images, video), and their generation quality evaluated on nuScenes [43] (FID↓, FVD↓).

Method | Outputs | FID↓ | FVD↓
BEVGen [216] | Multi-view images | 25.54 | -
BEVControl [217] | Multi-view images | 24.85 | -
MagicDrive [218] | Multi-view images | 16.20 | -
DrivingDiffusion [219] | Multi-view images | 15.85 | -
WoVoGen [220] | Multi-view images | 27.60 | -
Drive-WM [55] | Multi-view images | 12.99 | -
GAIA-1 [52] | Video | - | -
DriveDreamer [53] | Video | 52.6 | 452.0
DrivingDiffusion [219] | Video | 15.8 | 332.0
Drive-WM [55] | Video | 15.8 | 122.7
WoVoGen [220] | Video | - | 417.7
ADriver-I [221] | Video | 5.5 | 97.0

For example, DriveDreamer [53] uses a diffusion model to construct a comprehensive representation of complex environments, enabling recognition of structured traffic constraints and the ability to predict the future. GAIA-1 [52] is a fully end-to-end generative model that utilizes video, text, and action inputs to generate real driving scenarios, and also enables prediction of future tokenized sequences. Differing from the aforementioned approaches, Zhang et al. [80] propose an unsupervised world model on sensor data derived from point clouds: it tokenizes point clouds using a vector quantized variational autoencoder (VQ-VAE) [223] combined with a PointNet, and adopts a combination of generative masked modeling and discrete diffusion for learning a world model. OccWorld [224] can forecast future scene evolutions and ego movements jointly based on the given past 3D occupancy observations in a self-supervised manner. The predictive capability of world models involves inferring the relative positions and movement trends of other vehicles based on current and past scene information, enabling the modeling of potential effects of various actions and informed decision-making [80], [225].
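As a schematic of this action-conditioned prediction, the sketch below encodes an observation into a latent state, rolls it forward under candidate action sequences, and scores their imagined consequences; the architecture, dimensions, and reward head are illustrative assumptions, not a description of any of the models above.

```python
import torch
import torch.nn as nn

class LatentWorldModel(nn.Module):
    """Generic latent world model: encode an observation, then roll the latent
    state forward under candidate actions (a sketch, not any specific published method)."""
    def __init__(self, obs_dim=256, act_dim=2, latent_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, latent_dim), nn.ReLU())
        self.dynamics = nn.GRUCell(act_dim, latent_dim)   # predicts the next latent given an action
        self.reward_head = nn.Linear(latent_dim, 1)       # e.g., a progress/comfort proxy

    def rollout(self, obs, actions):
        z = self.encoder(obs)                             # (B, latent_dim)
        returns = 0.0
        for a in actions.unbind(dim=1):                   # actions: (B, T, act_dim)
            z = self.dynamics(a, z)                       # imagined next state
            returns = returns + self.reward_head(z)
        return returns                                    # scored consequence of the action plan

# Scoring two candidate plans for one scene (toy shapes, purely illustrative).
model = LatentWorldModel()
obs = torch.randn(1, 256)
plans = torch.randn(2, 5, 2)                              # 2 plans, 5 steps, 2-D actions
scores = torch.stack([model.rollout(obs, p.unsqueeze(0)) for p in plans])
```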
Beyond merely predicting original sensor signals, world models are intended to emulate human thinking and comprehension of the real world. To achieve this, world models need to incorporate expert experience embedding and interactive learning [226]–[228], enhancing their multitasking capabilities and establishing them as foundational models for knowledge-driven autonomous driving [229].

5 DRIVER AGENTS

This section initially delves into the development of embodied AI and its connection to autonomous driving. Following that, it succinctly summarizes recent studies focusing on LLMs in autonomous driving, leveraging their robust reasoning and interpretable capabilities. Ultimately, a generalized knowledge-driven framework is introduced, spotlighting crucial components such as cognition, memory, planning, and reflection, with the overarching goal of enhancing scene understanding and decision-making.

5.1 Embodied AI

Embodied AI [230]–[232] is a facet of intelligence emphasizing the direct interaction between an intelligent system and its environment, involving perception, understanding, and action. Notably, advancements in embodied intelligence have concentrated on humanoid robots and embodied AGI. As the ideal form of embodied AI, humanoid robots have been improving in autonomy, flexibility, and intelligence [233]; for example, the Optimus humanoid robot introduced by Tesla, whose motion control ability has been steadily evolving, provides a strong hardware foundation for the development of embodied AI. Meanwhile, embodied AGI is also considered an important way to realize advanced AI and has attracted the attention of many scholars [234].

LLMs are anticipated to elevate natural and human-like text and image interactions within the domain of embodied AI [235]. They play a pivotal role in assisting embodied AI systems in comprehending and perceiving their surroundings, interpreting intricate task descriptions, formulating task plans, collaborating seamlessly with other system modules, adapting to dynamic environments, and facilitating social interactions with humans through natural language exchanges [119], [236]. Despite these advantages, it is imperative to address potential drawbacks, such as decision uncertainty. The uncertainties of LLMs also bring risks to embodied AI, potentially resulting in biases or errors in information processing, thereby compromising the systems' functionality and the reliability of task completion.

Autonomous driving can be considered within the realm of embodied AI, whereas the open and dynamic traffic environment faced by autonomous driving necessitates a heightened focus on system reliability and generalization [237]. While autonomous driving systems can draw on the common sense understanding and logical reasoning ability of LLMs, they cannot completely rely on LLMs' output as final decisions.
Therefore, adopting a knowledge-driven paradigm can enhance autonomous driving by integrating mechanisms for long-term learning and knowledge accumulation, facilitating prompt adaptation to environmental changes through immediate feedback and adjustment.

5.2 Applying LLMs to Enhance Autonomous Driving

As shown in Table 3, the rapid advancement of LLMs provides a foundation for injecting human knowledge and common sense into driver agents, sparking numerous new research endeavors. This learning ability is of particular significance for the perception module of the autonomous driving system, as it greatly improves the system's adaptability and generalization capabilities in changing and complex driving environments. Talk2BEV [42] augments BEV maps with language to enable general-purpose linguistic reasoning for driving scenarios. Language Prompt [44] uses language prompts as semantic cues and combines LLMs with 3D detection and tracking tasks. Although it achieves better performance compared to other methods, the advantages of LLMs do not directly affect the tracking task; rather, the tracking task serves as a query to assist LLMs in performing 3D detection.

As for planning, decision-making, and control in autonomous driving, numerous studies aim to harness the robust commonsense comprehension and reasoning capabilities of LLMs to aid drivers [13], [240]. Some works seek to emulate and even fully replace drivers [61], [68], [241], [245]. When employing LLMs for closed-loop control in autonomous driving, the majority of research efforts [13], [68], [245] incorporate a memory module to capture driving scenarios, experiences, and other crucial driving information.

As is well known, an end-to-end autonomous driving system takes raw sensor data as input and generates a plan and/or low-level control actions as output. We recognize that end-to-end autonomous driving aligns seamlessly with the multimodal input-to-text structure of LLMs. Owing to this inherent compatibility, several studies are now exploring the viability of integrating LLMs into end-to-end autonomous driving. In contrast to conventional end-to-end autonomous driving systems [5], [252], end-to-end systems based on LLMs exhibit robust interpretability, trustworthiness, and advanced scene comprehension capabilities, which opens up avenues for the practical application and implementation of end-to-end autonomous driving [61], [246], [247], [249], [250].

Understanding driving scenes correctly and at a high level, as in visual question answering or captioning tasks, is crucial for ensuring driving safety. Driving with LLMs [229] evaluates a model's visual and spatial understanding of driving scenes through visual question answering and captioning tasks. More recently, showcasing the proficiency of GPT-4V [108], On the Road with GPT-4V [253] provides comprehensive tests of GPT-4V across diverse traffic scenarios, spanning from basic scene understanding to complex causal reasoning. Various exploratory efforts have utilized Vision Language Models (VLMs) to comprehend traffic scenes through specific downstream tasks.

As mentioned in Section 4.1, simulation is pivotal in the advancement of autonomous driving. Yet, existing simulation platforms face constraints in replicating the realism and diversity of agent behaviors, hindering the effective translation of simulation results into real-world applications. SurrealDriver [64] introduces a novel generative driver agent simulation framework leveraging LLMs. It demonstrates the ability to perceive intricate driving scenarios and generate realistic driving maneuvers.
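The memory-plus-few-shot-prompting pattern used by several of these closed-loop approaches can be sketched as follows; the retrieval scheme, action vocabulary, and the llm callable are illustrative placeholders rather than the interface of any particular system.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Experience:
    scene: str      # compact text description of a past driving scenario
    decision: str   # the action that was taken
    outcome: str    # e.g. "success" or a short reflection on what went wrong

def retrieve(memory: List[Experience], scene: str, k: int = 3) -> List[Experience]:
    """Toy retriever: rank stored experiences by word overlap with the current scene.
    A real system would use embedding similarity over a vector store."""
    words = set(scene.lower().split())
    return sorted(memory, key=lambda e: -len(words & set(e.scene.lower().split())))[:k]

def build_prompt(scene: str, memory: List[Experience]) -> str:
    shots = "\n\n".join(
        f"Scenario: {e.scene}\nDecision: {e.decision}\nOutcome: {e.outcome}"
        for e in retrieve(memory, scene)
    )
    return (
        "You are a cautious driving assistant. Obey traffic rules and prioritize safety.\n\n"
        f"Past experiences:\n{shots}\n\n"
        f"Current scenario: {scene}\n"
        "Answer with one of KEEP_LANE, CHANGE_LEFT, CHANGE_RIGHT, BRAKE and a one-sentence reason."
    )

def decide(scene: str, memory: List[Experience], llm: Callable[[str], str]) -> str:
    """Query the LLM and log the new case so that later reflection can label its outcome."""
    answer = llm(build_prompt(scene, memory))
    memory.append(Experience(scene, answer, outcome="pending reflection"))
    return answer

# Usage with a stub in place of a real chat model:
memory = [Experience("slow truck ahead in the right lane, left lane clear", "CHANGE_LEFT", "success")]
print(decide("dense fog, vehicle braking 30 m ahead", memory, llm=lambda p: "BRAKE: low visibility and closing distance."))
```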
TABLE 3
Knowledge-driven methods based on LLMs in autonomous driving.

Category | Method | Modalities | Characteristics
Perception | Language Prompt [44] | Image, Text | LLM (GPT-3.5 [113]), language prompts, tracking
Perception | Can You Text What Is Happening [238] | Image, Text | LLM (DistilBERT [239]), trajectory prediction
Decision-making & Planning & Control | Drive Like A Human [13] | 2D BEV, Text | LLM (GPT-3.5), closed-loop system, decision-making and control
Decision-making & Planning & Control | Drive as You Speak [240] | 2D BEV, Map, GNSS, Radar, LiDAR, Image | LLM (GPT-4 [108]), decision-making
Decision-making & Planning & Control | DiLu [68] | Text | LLM (GPT-3.5), agent, memory module, knowledge, reasoning, decision-making and control
Decision-making & Planning & Control | LanguageMPC [241] | Text | LLM (GPT-3.5), decision-making and control
Decision-making & Planning & Control | Talk2BEV [42] | Image, Text | Large vision language models (LVLMs) BLIP-2 [242] and LLaVA [243], augmented bird's-eye view (BEV) maps
Decision-making & Planning & Control | TrafficGPT [244] | Text | LLM (GPT-3.5), analysis, decision-making
Decision-making & Planning & Control | Receive Reason React [245] | Text | LLM (GPT-4), reasoning, decision-making
End-to-End | DriveGPT4 [61] | Image, Text, Action | LLM (LLaMA2 [107]), action, reasoning
End-to-End | GPT-Driver [246] | Image, Text, Action | LLM (GPT-3.5), motion planner, trajectory generation and control
End-to-End | Drive Anywhere [247] | Image, Text | LLM (BLIP), open-set learning, ViT [248], perception
End-to-End | Agent-Driver [249] | Image, Text, Action | LLM (GPT-3.5), agent, tool library, reasoning, cognitive memory
End-to-End | DESIGN-Agent [250] | Image, Text | LLM (GPT-3.5), agent, reasoning
VQA & Captioning | Driving with LLMs [229] | Text | LLM (LLaMA [107], GPT-3.5), question answering
VQA & Captioning | Dolphins [132] | Image, Text | LLM (OpenFlamingo [251]), Vision Language Action, Grounded Chain of Thought (GCoT), reflection
VQA & Captioning | LINGO-1 [131] | Image, Text, Action | LLM (GPT-3.5), Vision Language Action, reasoning
Simulation | SurrealDriver [64] | Text | LLM (GPT-3.5), generative simulation, human-like driving behaviors

The common sense understanding and logical reasoning of LLMs are vital for autonomous driving. However, directly applying LLMs as decision-makers may face challenges [13], [240]. To address this, adopting few-shot prompts guides the model in understanding unknown scenarios, considering interpretability and reasonability [68]. Despite the advantages of few-shot prompts, challenges exist, especially in complex tasks where the number of prompts may be insufficient. Building powerful autonomous driving systems involves fine-tuning generalized models for specific driving scenarios [254], leveraging deep learning on extensive driving data. Autonomous driving systems need to comprehensively understand traffic environments, road structures, and human behavior, integrating text and image information for enhanced perception. Incorporating interaction processes and competitive games enables systems to grasp the behaviors of other traffic participants and learn complex decision-making strategies. Large-scale training in simulators improves generalization, while iterative optimization, real-time feedback, and an emphasis on safety standards lead to continuous improvement in model performance.

5.3 Generalized Knowledge-driven Framework

A generalized knowledge-driven framework, inspired by recent advancements like Smallville [65], DiLu [68], LLM-Brain [255], etc., is essential for autonomous driving. This framework integrates various components and technologies, as depicted in Fig. 7, encompassing cognition, planning, reflection, memory, and more.

Fig. 7. Generalized knowledge-driven framework.
Cognitive understanding transcends traditional detection and segmentation tasks, demanding a profound comprehension of specific task environments. Crucially, planning correct actions based on object relationships becomes pivotal, with autonomous reflection necessary in the face of decision failures leading to anomalies. The memory module is enriched by both positive and negative samples, contributing to knowledge distillation. In a closed-loop continuous learning system, accumulated knowledge guides decision-making and reflection processes. Despite the general domain knowledge provided by rapidly advancing LLMs, precise performance in autonomous driving tasks mandates the empowerment and enhancement of knowledge-driven frameworks.

Cognition. Various sensors such as cameras, radar, and LiDAR are employed to capture environmental information, which is subsequently transformed into semantic representations of the environment [256]–[259]. This information can be processed by leveraging LLMs, enabling semantic understanding and logical reasoning [42], [260], [261]. LLM-based systems demonstrate the capability to identify objects on roads and comprehend traffic signs [262], [263]. However, to enhance scene understanding, LLMs necessitate closed-loop environments, incorporating positive and negative feedback, overcoming hallucinations, and continuously expanding knowledge through lifelong learning [264], [265]. Cognition, involving the comprehension of objects and their interconnections, demands continuous fine-tuning of cognitive models in autonomous driving to address scenarios from the simplistic to the sophisticated through interaction with the environment.

Memory. The outcomes of semantic understanding are stored in the internal memory, constructing a dynamic perception of the environment [266]. This enables the system to retain and continually update its understanding of the surroundings. Furthermore, historical driving experiences and knowledge are archived in the internal memory. When confronted with a comparable situation, the system retrieves past semantic understanding and driving decisions to adeptly address similar scenarios. Distinctions between long-term and short-term memory are essential. Memory cultivated through numerous similar scenes fine-tunes the foundation model, fostering a rapid reasoning ability akin to a human's unconditioned reflexes. Conversely, short-term memory only preserves recent and unfamiliar scenarios, ensuring swift adaptation to diverse environments.

Planning. By amalgamating sensing results, historical knowledge, and LLMs' reasoning capabilities, the system formulates decisions for path planning, speed control, and obstacle avoidance [241], [246], [267]. Ensuring planned behaviors align with traffic rules and safety standards is crucial for achieving secure autonomous driving. While LLMs serve as a means of knowledge extraction and utilization, they function as a linguistic bridge between existing human knowledge and machine execution processes, facilitating interpretive reasoning and decision-making. However, LLMs, as carriers of general knowledge, require artificially designed prompts and few-shot examples for application in vehicle manipulation. Moreover, relying solely on LLMs for driving decisions is a transitional approach; developing large-scale symbolic models tailored to autonomous driving represents a more specialized avenue.
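A minimal skeleton of the closed loop formed by the cognition, memory, planning, and reflection components discussed in this subsection might look as follows; each stage is a pluggable callable, the memory retrieval is deliberately toy, and none of it reflects the concrete architecture of any cited system.

```python
class KnowledgeDrivenAgent:
    """Illustrative skeleton of the cognition-memory-planning-reflection loop (cf. Fig. 7)."""
    def __init__(self, cognize, plan, reflect):
        self.cognize = cognize          # sensors -> semantic scene description
        self.plan = plan                # (scene, recalled experiences) -> action
        self.reflect = reflect          # (scene, action, outcome) -> lesson or None
        self.memory = []                # accumulated experiences and distilled lessons

    def recall(self, scene, k=3):
        # Toy similarity: shared words between the scene and stored entries.
        words = set(scene.split())
        return sorted(self.memory, key=lambda m: -len(words & set(m["scene"].split())))[:k]

    def step(self, raw_observation, env_feedback):
        scene = self.cognize(raw_observation)
        action = self.plan(scene, self.recall(scene))
        outcome = env_feedback(action)                 # success/failure signal from the environment
        lesson = self.reflect(scene, action, outcome)  # failures yield a corrective lesson
        self.memory.append({"scene": scene, "action": action,
                            "outcome": outcome, "lesson": lesson})
        return action

# Stubbed components, purely to show how the pieces connect:
agent = KnowledgeDrivenAgent(
    cognize=lambda obs: obs,  # assume the observation is already a text scene description
    plan=lambda scene, recalled: "BRAKE" if "pedestrian" in scene else "KEEP_LANE",
    reflect=lambda scene, action, outcome: None if outcome == "success" else f"avoid {action} here",
)
agent.step("pedestrian crossing ahead", env_feedback=lambda action: "success")
```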
Reflection. The driving decisions undergo interpretation using LLMs, contributing to an understanding of the system's behaviors. Analyzing the LLMs' outputs allows for the evaluation of the system's decision rationality, facilitating continuous optimization and learning to enhance performance and robustness [60], [268], [269]. Additionally, reflection can incorporate expert systems, leveraging accident cases from datasets or human-derived lessons to swiftly identify and localize potential issues, thereby finding suitable solutions for knowledge-driven systems.

6 OPPORTUNITIES AND CHALLENGES

Knowledge-embedding dataset. Ensuring dataset richness involves covering daily driving situations, emergencies, and extreme weather conditions. This diversity enhances the model's ability to understand and adapt to various realistic driving scenarios comprehensively. The use of natural language annotation, closely resembling a driver's thought and decision-making process, improves the model's understanding of human behavior and aligns it with real driving cognition. Annotators with ample driving experience ensure accurate annotation of diverse driving situations, focusing on scenario understanding for enhanced accuracy and quality. While language has demonstrated impressive proficiency in knowledge-embedding datasets, it cannot be conclusively stated that language is the only way to represent knowledge. Therefore, delving into more suitable methods of knowledge representation presents a worthy research direction.

Efficient and realistic virtual environment. Virtual environments need to overcome challenges through refined neural rendering technology, achieving efficiency and realism in simulations. Optimizing 3D reconstruction algorithms strikes a balance between high fidelity and generalization, focusing on adaptability. Diverse and realistic virtual landscapes result from independently reconstructing foreground and background using various data sources. Techniques like Gaussian Splatting [270] offer efficiency in the handling of large-scale scenes, enabling real-time, high-performance virtual driving environments. Proactive exploration in environment understanding aims to construct intelligent models simulating real-world physical laws. Leveraging data from multiple sensors establishes an abstract representation of the environment. Incorporating such environments enhances training and testing for autonomous driving systems, fostering continuous advancements in the field.

VLMs. VLMs offer enhanced integration compared to LLMs, aiming to approach human-level perception and understanding. Crucial for decision-making and behavior planning, VLMs excel in surrounding perception and scene understanding [271]–[274]. VLMs perform better in traffic scenario understanding by fusing visual and linguistic information and comprehending complex situations involving road signs, traffic signs, and pedestrians. Their multimodal semantic understanding ensures reliable interpretation of traffic participants' states and behaviors, particularly excelling in deep understanding and reasoning in intricate scenes. However, it is essential to note that specialized learning is required for VLMs' 3D spatial understanding and driving skills, presenting a focus for future research and development.
Requirements and Validation of Knowledge-Driven Approaches. Knowledge-driven autonomous driving demands enhanced cognitive and understanding capabilities, necessitating comprehension of common objects and the intricate relationships between them based on physical laws and traffic rules. This involves understanding vehicle movements and interactions with other traffic participants, and ensuring maneuvers comply with traffic regulations. Knowledge-driven approaches extend beyond traditional performance metrics, requiring comprehensive validation of the entire process, from scenario understanding to vehicle maneuvering. Such validation enhances system transparency, aligns decision-making processes with intuitive human knowledge, and ultimately strengthens the credibility and safety of autonomous driving systems, reducing the risk of generating hallucinatory decisions [275], [276].

7 CONCLUSION

Knowledge-driven autonomous driving is a revolutionary paradigm that promises to break through the current bottlenecks of autonomous driving. It emphasizes life-long learning, iterative evolution, and the integration of multimodal data, promising improved performance, safety, and interpretability in autonomous driving systems. The transition towards knowledge-driven autonomous driving reflects a pivotal evolution in technology development, emphasizing scenario understanding and reasoned decision-making. First, we introduce the foundational components of knowledge-driven autonomous driving: Dataset & Benchmark, Environment, and Driver Agent. These components, especially when synergized with advanced technologies like LLMs, world models, and neural rendering, collectively enhance the intelligence of autonomous systems. This integration facilitates a deeper and more holistic interaction with the driving environment, thereby augmenting the system's overall capabilities. Next, we present a comprehensive knowledge-driven framework for autonomous driving, including critical components like cognition, planning, reflection, and memory, aiming to empower autonomous driving systems with scenario understanding, strategic decision-making, and life-long learning. Finally, we also highlight opportunities and challenges in knowledge-driven autonomous driving, including the importance of diverse datasets for comprehensive model training, the incorporation of natural language annotation for alignment with human thought processes, the creation of efficient virtual environments through refining neural rendering and optimizing 3D reconstruction, and the integration of LLMs for decision-making and behavior planning in complex driving scenarios, concluding with an emphasis on verification measures for autonomous vehicles. Nevertheless, the journey towards fully realizing the potential of knowledge-driven autonomous driving is not devoid of challenges. This paper aims to highlight the significance of adopting knowledge-driven approaches in the evolving landscape of autonomous driving technologies. Our objective is to steer future research and practical applications in the direction of creating more intelligent, adaptable, and robust autonomous driving systems.

REFERENCES

[1] Y. Li and J. Ibanez-Guzman, "Lidar for autonomous driving: The principles, challenges, and trends for automotive lidar and perception systems," IEEE Signal Processing Magazine, vol. 37, no. 4, pp. 50–61, 2020.
[2] J. Van Brummelen, M. O'Brien, D.
Gruyer, and H. Najjaran, “Autonomous vehicle perception: The technology of today and tomorrow,” Transportation Research Part C: Emerging Technologies, vol. 89, pp. 384–406, 2018. C. Xiang, C. Feng, X. Xie, B. Shi, H. Lu, Y. Lv, M. Yang, and Z. Niu, “Multi-sensor fusion and cooperative perception for autonomous driving: A review,” IEEE Intelligent Transportation Systems Magazine, 2023. Y. Zhang, Z. Zhu, W. Zheng, J. Huang, G. Huang, J. Zhou, and J. Lu, “BEVerse: Unified perception and prediction in birds-eye-view for vision-centric autonomous driving,” arXiv preprint arXiv:2205.09743, 2022. Y. Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wang et al., “Planning-oriented autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17 853–17 862. L. Chen, Y. Li, C. Huang, B. Li, Y. Xing, D. Tian, L. Li, Z. Hu, X. Na, Z. Li et al., “Milestones in autonomous driving and intelligent vehicles: Survey of surveys,” IEEE Transactions on Intelligent Vehicles, vol. 8, no. 2, pp. 1046–1056, 2022. Z. Bao, S. Hossain, H. Lang, and X. Lin, “High-definition map generation technologies for autonomous driving: a review,” arXiv preprint arXiv:2206.05400, 2022. J. Cheng, L. Zhang, Q. Chen, X. Hu, and J. Cai, “A review of visual slam methods for autonomous driving vehicles,” Engineering Applications of Artificial Intelligence, vol. 114, p. 104992, 2022. Z. Cao, X. Li, K. Jiang, W. Zhou, X. Liu, N. Deng, and D. Yang, “Autonomous driving policy continual learning with one-shot disengagement case,” IEEE Transactions on Intelligent Vehicles, vol. 8, no. 2, pp. 1380–1391, 2022. S. Huang, B. Zhang, B. Shi, H. Li, Y. Li, and P. Gao, “SUG: Singledataset unified generalization for 3D point cloud classification,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 8644–8652. J. Wang, X. Wang, T. Shen, Y. Wang, L. Li, Y. Tian, H. Yu, L. Chen, J. Xin, X. Wu et al., “Parallel vision for long-tail regularization: Initial results from IVFC autonomous driving testing,” IEEE Transactions on Intelligent Vehicles, vol. 7, no. 2, pp. 286–299, 2022. É. Zablocki, H. Ben-Younes, P. Pérez, and M. Cord, “Explainability of deep vision-based autonomous driving systems: Review and challenges,” International Journal of Computer Vision, vol. 130, no. 10, pp. 2425–2452, 2022. D. Fu, X. Li, L. Wen, M. Dou, P. Cai, B. Shi, and Y. Qiao, “Drive like a human: Rethinking autonomous driving with large language models,” arXiv preprint arXiv:2307.07162, 2023. J. Zhang, J. Pu, J. Chen, H. Fu, Y. Tao, S. Wang, Q. Chen, Y. Xiao, S. Chen, Y. Cheng et al., “DSiV: Data science for intelligent vehicles,” IEEE Transactions on Intelligent Vehicles, 2023. H. Shao, L. Wang, R. Chen, H. Li, and Y. Liu, “Safety-enhanced autonomous driving using interpretable sensor fusion transformer,” in Conference on Robot Learning. PMLR, 2023, pp. 726–737. T. Jing, H. Xia, R. Tian, H. Ding, X. Luo, J. Domeyer, R. Sherony, and Z. Ding, “Inaction: Interpretable action decision making for autonomous driving,” in European Conference on Computer Vision. Springer, 2022, pp. 370–387. Y. Guan, Y. Ren, Q. Sun, S. E. Li, H. Ma, J. Duan, Y. Dai, and B. Cheng, “Integrated decision and control: Toward interpretable and computationally efficient driving intelligence,” IEEE Transactions on Cybernetics, vol. 53, no. 2, pp. 859–873, 2022. B. Yu, C. Chen, J. Tang, S. Liu, and J.-L. 
Gaudiot, “Autonomous vehicles digital twin: A practical paradigm for autonomous driving system development,” Computer, vol. 55, no. 9, pp. 26–34, 2022. L. Masello, B. Sheehan, F. Murphy, G. Castignani, K. McDonnell, and C. Ryan, “From traditional to autonomous vehicles: A systematic review of data availability,” Transportation Research Record, vol. 2676, no. 4, pp. 161–193, 2022. P. Wu, X. Jia, L. Chen, J. Yan, H. Li, and Y. Qiao, “Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline,” Advances in Neural Information Processing Systems, vol. 35, pp. 6119–6132, 2022. L. Fantauzzo, E. Fanı̀, D. Caldarola, A. Tavera, F. Cermelli, M. Ciccone, and B. Caputo, “Feddrive: Generalizing federated learning to semantic segmentation in autonomous driving,” in IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2022, pp. 11 504– 11 511. V. P. Chellapandi, L. Yuan, S. H. Zak, and Z. Wang, “A survey PREPRINT [23] [24] [25] [26] [27] [28] [29] [30] [31] [32] [33] [34] [35] [36] [37] [38] [39] [40] [41] [42] [43] of federated learning for connected and automated vehicles,” arXiv preprint arXiv:2303.10677, 2023. D. Bogdoll, J. Breitenstein, F. Heidecker, M. Bieshaar, B. Sick, T. Fingscheidt, and M. Zöllner, “Description of corner cases in automated driving: Goals and challenges,” in Proceedings of the IEEE/CVF International Conference on Computer Vision Workshop, 2021, pp. 1023– 1028. H. X. Liu and S. Feng, ““curse of rarity” for autonomous vehicles,” arXiv preprint arXiv:2207.02749, 2022. W. Wang, L. Wang, C. Zhang, C. Liu, L. Sun et al., “Social interactions for autonomous driving: A review and perspectives,” Foundations and Trends® in Robotics, vol. 10, no. 3-4, pp. 198–376, 2022. A. Sestino, A. M. Peluso, C. Amatulli, and G. Guido, “Let me drive you! the effect of change seeking and behavioral control in the artificial intelligence-based self-driving cars,” Technology in Society, vol. 70, p. 102017, 2022. Y. LeCun, “A path towards autonomous machine intelligence version 0.9.2, 2022-06-27,” Open Review, vol. 62, 2022. H. J. Levesque, “Knowledge representation and reasoning,” Annual Review of Computer Science, vol. 1, no. 1, pp. 255–287, 1986. W. Wang, Y. Yang, and F. Wu, “Towards data-and knowledge-driven artificial intelligence: A survey on neuro-symbolic computing,” arXiv preprint arXiv:2210.15889, 2022. I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio, “An empirical investigation of catastrophic forgetting in gradient-based neural networks,” arXiv preprint arXiv:1312.6211, 2013. J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska et al., “Overcoming catastrophic forgetting in neural networks,” Proceedings of the National Academy of Sciences, vol. 114, no. 13, pp. 3521–3526, 2017. B. Zhang, J. Zhu, and H. Su, “Toward the third generation artificial intelligence,” Science China Information Sciences, vol. 66, no. 2, p. 121101, 2023. C. Tang, N. Srishankar, S. Martin, and M. Tomizuka, “Grounded relational inference: Domain knowledge driven explainable autonomous driving,” arXiv preprint arXiv:2102.11905, 2021. L. Sur, C. Tang, Y. Niu, E. Sachdeva, C. Choi, T. Misu, M. Tomizuka, and W. Zhan, “Domain knowledge driven pseudo labels for interpretable goal-conditioned interactive trajectory prediction,” in IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2022, pp. 13 034–13 041. M. Bahari, I. Nejjar, and A. 
Alahi, “Injecting knowledge in datadriven vehicle trajectory predictors,” Transportation Research Part C: Emerging Technologies, vol. 128, p. 103010, 2021. Q. Lan and Q. Tian, “Instance, scale, and teacher adaptive knowledge distillation for visual detection in autonomous driving,” IEEE Transactions on Intelligent Vehicles, vol. 8, no. 3, pp. 2358–2370, 2022. A. Khan, “A framework for autonomous process design: Towards datadriven and knowledge-driven systems,” Ph.D. dissertation, University of Cambridge, 2023. K. Huang, B. Shi, X. Li, X. Li, S. Huang, and Y. Li, “Multi-modal sensor fusion for auto driving perception: A survey,” arXiv preprint arXiv:2202.02703, 2022. R. Abbasi, A. K. Bashir, H. J. Alyamani, F. Amin, J. Doh, and J. Chen, “Lidar point cloud compression, processing and learning for autonomous driving,” IEEE Transactions on Intelligent Transportation Systems, vol. 24, no. 1, pp. 962–979, 2022. B. Fei, W. Yang, L. Liu, T. Luo, R. Zhang, Y. Li, and Y. He, “Selfsupervised learning for pre-training 3D point clouds: A survey,” arXiv preprint arXiv:2305.04691, 2023. T. Deruyttere, S. Vandenhende, D. Grujicic, L. Van Gool, and M. F. Moens, “Talk2Car: Taking control of your self-driving car,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, 2019, pp. 2088–2098. V. Dewangan, T. Choudhary, S. Chandhok, S. Priyadarshan, A. Jain, A. K. Singh, S. Srivastava, K. M. Jatavallabhula, and K. M. Krishna, “Talk2bev: Language-enhanced bird’s-eye view maps for autonomous driving,” arXiv preprint arXiv:2310.02251, 2023. H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuScenes: A multimodal dataset for autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11 621–11 631. 17 [44] [45] [46] [47] [48] [49] [50] [51] [52] [53] [54] [55] [56] [57] [58] [59] [60] [61] [62] [63] [64] [65] D. Wu, W. Han, T. Wang, Y. Liu, X. Zhang, and J. Shen, “Language prompt for autonomous driving,” arXiv preprint arXiv:2309.04379, 2023. E. Sachdeva, N. Agarwal, S. Chundi, S. Roelofs, J. Li, B. Dariush, C. Choi, and M. Kochenderfer, “Rank2tell: A multimodal driving dataset for joint importance ranking and reasoning,” arXiv preprint arXiv:2309.06597, 2023. T. Schick, J. Dwivedi-Yu, R. Dessı̀, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer: Language models can teach themselves to use tools,” arXiv preprint arXiv:2302.04761, 2023. X. Hu, G. Xiong, Z. Zang, P. Jia, Y. Han, and J. Ma, “PC-NeRF: Parentchild neural radiance fields under partial sensor data loss in autonomous driving environments,” arXiv preprint arXiv:2310.00874, 2023. Z. Wu, T. Liu, L. Luo, Z. Zhong, J. Chen, H. Xiao, C. Hou, H. Lou, Y. Chen, R. Yang et al., “MARS: An instance-aware, modular and realistic simulator for autonomous driving,” arXiv preprint arXiv:2307.15058, 2023. Z. Li, L. Li, and J. Zhu, “READ: Large-scale neural scene rendering for autonomous driving,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 2, pp. 1522–1529, 2023. J. Guo, N. Deng, X. Li, Y. Bai, B. Shi, C. Wang, C. Ding, D. Wang, and Y. Li, “Streetsurf: Extending multi-view implicit surface reconstruction to street views,” arXiv preprint arXiv:2306.04988, 2023. Z. Yang, Y. Chen, J. Wang, S. Manivasagam, W.-C. Ma, A. J. Yang, and R. 
Urtasun, “Unisim: A neural closed-loop sensor simulator,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1389–1399. A. Hu, L. Russell, H. Yeo, Z. Murez, G. Fedoseev, A. Kendall, J. Shotton, and G. Corrado, “Gaia-1: A generative world model for autonomous driving,” arXiv preprint arXiv:2309.17080, 2023. X. Wang, Z. Zhu, G. Huang, X. Chen, and J. Lu, “Drivedreamer: Towards real-world-driven world models for autonomous driving,” arXiv preprint arXiv:2309.09777, 2023. C. Min, D. Zhao, L. Xiao, Y. Nie, and B. Dai, “Uniworld: Autonomous driving pre-training via world models,” arXiv preprint arXiv:2308.07234, 2023. Y. Wang, J. He, L. Fan, H. Li, Y. Chen, and Z. Zhang, “Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving,” arXiv preprint arXiv:2311.17918, 2023. Y. Li, F. Liu, L. Xing, Y. He, C. Dong, C. Yuan, J. Chen, and L. Tong, “Data generation for connected and automated vehicle tests using deep learning models,” Accident Analysis & Prevention, vol. 190, p. 107192, 2023. K. Muhammad, T. Hussain, H. Ullah, J. Del Ser, M. Rezaei, N. Kumar, M. Hijji, P. Bellavista, and V. H. C. de Albuquerque, “Vision-based semantic segmentation in scene understanding for autonomous driving: Recent achievements, challenges, and outlooks,” IEEE Transactions on Intelligent Transportation Systems, 2022. L. Fan, D. Cao, C. Zeng, B. Li, Y. Li, and F.-Y. Wang, “Cognitivebased crack detection for road maintenance: An integrated system in cyber-physical-social systems,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2022. L. Li, T. Zhou, W. Wang, J. Li, and Y. Yang, “Deep hierarchical semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1246–1257. Y. Cui, S. Huang, J. Zhong, Z. Liu, Y. Wang, C. Sun, B. Li, X. Wang, and A. Khajepour, “DriveLLM: Charting the path toward full autonomous driving with large language models,” IEEE Transactions on Intelligent Vehicles, 2023. Z. Xu, Y. Zhang, E. Xie, Z. Zhao, Y. Guo, K. K. Wong, Z. Li, and H. Zhao, “DriveGPT4: Interpretable end-to-end autonomous driving via large language model,” arXiv preprint arXiv:2310.01412, 2023. D. I. Mikhailov, “Optimizing national security strategies through llm-driven artificial intelligence integration,” arXiv preprint arXiv:2305.13927, 2023. S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg et al., “Sparks of artificial general intelligence: Early experiments with GPT-4,” arXiv preprint arXiv:2303.12712, 2023. Y. Jin, X. Shen, H. Peng, X. Liu, J. Qin, J. Li, J. Xie, P. Gao, G. Zhou, and J. Gong, “SurrealDriver: Designing generative driver agent simulation framework in urban contexts based on large language model,” arXiv preprint arXiv:2309.13193, 2023. J. S. Park, J. C. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein, “Generative agents: Interactive simulacra of human behavior,” arXiv preprint arXiv:2304.03442, 2023. PREPRINT [66] [67] [68] [69] [70] [71] [72] [73] [74] [75] [76] [77] [78] [79] [80] [81] [82] [83] [84] [85] [86] [87] [88] Y. Peng, J. Han, Z. Zhang, L. Fan, T. Liu, S. Qi, X. Feng, Y. Ma, Y. Wang, and S.-C. Zhu, “The tong test: Evaluating artificial general intelligence through dynamic embodied physical and social interactions,” Engineering, 2023. S. Gildert and G. 
Rose, “Building and testing a general intelligence embodied in a humanoid robot,” arXiv preprint arXiv:2307.16770, 2023. L. Wen, D. Fu, X. Li, X. Cai, T. Ma, P. Cai, M. Dou, B. Shi, L. He, and Y. Qiao, “Dilu: A knowledge-driven approach to autonomous driving with large language models,” arXiv preprint arXiv:2309.16292, 2023. T. Yin, X. Zhou, and P. Krahenbuhl, “Center-based 3D object detection and tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11 784–11 793. Z. Guo, X. Gao, J. Zhou, X. Cai, and B. Shi, “SceneDM: Scene-level multi-agent trajectory generation with consistent diffusion models,” arXiv preprint arXiv:2311.15736, 2023. X. Li, B. Shi, Y. Hou, X. Wu, T. Ma, Y. Li, and L. He, “Homogeneous multi-modal feature fusion and interaction for 3D object detection,” in European Conference on Computer Vision. Springer, 2022, pp. 691– 707. B. Zhang, X. Cai, J. Yuan, D. Yang, J. Guo, R. Xia, B. Shi, M. Dou, T. Chen, S. Liu et al., “ReSimAD: Zero-shot 3D domain transfer for autonomous driving with source reconstruction and target simulation,” arXiv preprint arXiv:2309.05527, 2023. X. Pan, Y. You, Z. Wang, and C. Lu, “Virtual to real reinforcement learning for autonomous driving,” arXiv preprint arXiv:1704.03952, 2017. D. Li, L. Meng, J. Li, K. Lu, and Y. Yang, “Domain adaptive state representation alignment for reinforcement learning,” Information Sciences, vol. 609, pp. 1353–1368, 2022. D. Bogdoll, S. Guneshka, and J. M. Zöllner, “One ontology to rule them all: Corner case scenarios for autonomous driving,” in European Conference on Computer Vision. Springer, 2022, pp. 409–425. R. Fernandez-Rojas, A. Perry, H. Singh, B. Campbell, S. Elsayed, R. Hunjet, and H. A. Abbass, “Contextual awareness in humanadvanced-vehicle systems: a survey,” IEEE Access, vol. 7, pp. 33 304– 33 328, 2019. K. Ishihara, A. Kanervisto, J. Miura, and V. Hautamaki, “Multi-task learning with attention for end-to-end autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2902–2911. S. Casas, A. Sadat, and R. Urtasun, “MP3: A unified model to map, perceive, predict and plan,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14 403–14 412. M. Mitchell, “AI’s challenge of understanding the world,” p. eadm8175, 2023. L. Zhang, Y. Xiong, Z. Yang, S. Casas, R. Hu, and R. Urtasun, “Learning unsupervised world models for autonomous driving via discrete diffusion,” arXiv preprint arXiv:2311.01017, 2023. W. Schwarting, A. Pierson, J. Alonso-Mora, S. Karaman, and D. Rus, “Social behavior for autonomous vehicles,” Proceedings of the National Academy of Sciences, vol. 116, no. 50, pp. 24 972–24 978, 2019. Z.-X. Xia, W.-C. Lai, L.-W. Tsao, L.-F. Hsu, C.-C. H. Yu, H.-H. Shuai, and W.-H. Cheng, “A human-like traffic scene understanding system: A survey,” IEEE Industrial Electronics Magazine, vol. 15, no. 1, pp. 6–15, 2020. D. Dubois, P. Hájek, and H. Prade, “Knowledge-driven versus datadriven logics,” Journal of Logic, Language and Information, vol. 9, pp. 65–89, 2000. M. O’Kelly, A. Sinha, H. Namkoong, R. Tedrake, and J. C. Duchi, “Scalable end-to-end autonomous vehicle testing via rare-event simulation,” Advances in Neural Information Processing Systems, vol. 31, 2018. X. Yan, Z. Zou, S. Feng, H. Zhu, H. Sun, and H. X. Liu, “Learning naturalistic driving environment with statistical realism,” Nature Communications, vol. 14, no. 1, p. 2037, 2023. S. Kothawade, V. 
Khandelwal, K. Basu, H. Wang, and G. Gupta, “AUTO-DISCERN: autonomous driving using common sense reasoning,” arXiv preprint arXiv:2110.13606, 2021. L. K. Saul and S. T. Roweis, “Think globally, fit locally: unsupervised learning of low dimensional manifolds,” Journal of machine learning research, vol. 4, no. Jun, pp. 119–155, 2003. L. Deng, “The mnist database of handwritten digit images for machine learning research [best of the web],” IEEE signal processing magazine, vol. 29, no. 6, pp. 141–142, 2012. 18 [89] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in IEEE Conference on Computer Vision and Pattern Recognition. Ieee, 2009, pp. 248–255. [90] Z. Zou, K. Chen, Z. Shi, Y. Guo, and J. Ye, “Object detection in 20 years: A survey,” Proceedings of the IEEE, 2023. [91] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” Advances in Neural Information Processing Systems, vol. 28, 2015. [92] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788. [93] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. Springer, 2015, pp. 234–241. [94] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in Proceedings of the IEEE International onference on computer vision, 2017, pp. 2961–2969. [95] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, “Image captioning with semantic attention,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4651–4659. [96] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, “Vqa: Visual question answering,” in Proceedings of the IEEE International Conference on Computer vision, 2015, pp. 2425– 2433. [97] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma et al., “Visual genome: Connecting language and vision using crowdsourced dense image annotations,” International journal of computer vision, vol. 123, pp. 32–73, 2017. [98] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Advances in Neural Information Processing Systems, vol. 27, 2014. [99] X. Pan, A. Tewari, T. Leimkühler, L. Liu, A. Meka, and C. Theobalt, “Drag your gan: Interactive point-based manipulation on the generative image manifold,” in ACM SIGGRAPH 2023 Conference Proceedings, 2023, pp. 1–11. [100] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013. [101] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020. [102] P. Dhariwal and A. Nichol, “Diffusion models beat gans on image synthesis,” Advances in Neural Information Processing Systems, vol. 34, pp. 8780–8794, 2021. [103] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “Highresolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10 684–10 695. [104] W. X. Zhao, K. Zhou, J. 
Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong et al., “A survey of large language models,” arXiv preprint arXiv:2303.18223, 2023. [105] L. Floridi and M. Chiriatti, “GPT-3: Its nature, scope, limits, and consequences,” Minds and Machines, vol. 30, pp. 681–694, 2020. [106] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann et al., “PaLM: Scaling language modeling with pathways,” arXiv preprint arXiv:2204.02311, 2022. [107] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023. [108] OpenAI, “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023. [109] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in Neural Information Processing systems, vol. 33, pp. 1877–1901, 2020. [110] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language models to follow instructions with human feedback,” Advances in Neural Information Processing Systems, vol. 35, pp. 27 730–27 744, 2022. [111] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le, “Finetuned language models are zero-shot learners,” arXiv preprint arXiv:2109.01652, 2021. PREPRINT [112] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in Neural Information Processing Systems, vol. 35, pp. 24 824–24 837, 2022. [113] OpenAI, “Introducing ChatGPT,” https://openai.com/blog/chatgpt/, 2023. [114] D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “MiniGPT-4: Enhancing vision-language understanding with advanced large language models,” arXiv preprint arXiv:2304.10592, 2023. [115] E. Yurtsever, J. Lambert, A. Carballo, and K. Takeda, “A survey of autonomous driving: Common practices and emerging technologies,” IEEE Access, vol. 8, pp. 58 443–58 469, 2020. [116] L. Chen, P. Wu, K. Chitta, B. Jaeger, A. Geiger, and H. Li, “Endto-end autonomous driving: Challenges and frontiers,” arXiv preprint arXiv:2306.16927, 2023. [117] Z. Liu, H. Jiang, H. Tan, and F. Zhao, “An overview of the latest progress and core challenge of autonomous vehicle technologies,” in MATEC Web of Conferences, vol. 308. EDP Sciences, 2020. [118] F. Dou, J. Ye, G. Yuan, Q. Lu, W. Niu, H. Sun, L. Guan, G. Lu, G. Mai, N. Liu et al., “Towards artificial general intelligence (AGI) in the internet of things (IoT): Opportunities and challenges,” arXiv preprint arXiv:2309.07438, 2023. [119] Z. Xi, W. Chen, X. Guo, W. He, Y. Ding, B. Hong, M. Zhang, J. Wang, S. Jin, E. Zhou et al., “The rise and potential of large language model based agents: A survey,” arXiv preprint arXiv:2309.07864, 2023. [120] Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Y. Qiao, and J. Dai, “BEVFormer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers,” in European Conference on Computer Vision. Springer, 2022, pp. 1–18. [121] Z. Liu, H. Tang, A. Amini, X. Yang, H. Mao, D. L. Rus, and S. Han, “BevFusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation,” in IEEE International Conference on Robotics and Automation. IEEE, 2023, pp. 2774–2781. [122] Y. 
Hou, Z. Ma, C. Liu, and C. C. Loy, “Learning lightweight lane detection cnns by self attention distillation,” in Proceedings of the IEEE/CVF International Conference on Computer vision, 2019, pp. 1013–1021. [123] L. Chen, C. Sima, Y. Li, Z. Zheng, J. Xu, X. Geng, H. Li, C. He, J. Shi, Y. Qiao et al., “Persformer: 3D lane detection via perspective transformer and the openlane benchmark,” in European Conference on Computer Vision. Springer, 2022, pp. 550–567. [124] L. Kong, Y. Liu, X. Li, R. Chen, W. Zhang, J. Ren, L. Pan, K. Chen, and Z. Liu, “Robo3d: Towards robust and reliable 3D perception against corruptions,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 19 994–20 006. [125] Y. Liu, R. Chen, X. Li, L. Kong, Y. Yang, Z. Xia, Y. Bai, X. Zhu, Y. Ma, Y. Li et al., “Uniseg: A unified multi-modal lidar segmentation network and the openpcseg codebase,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 21 662–21 673. [126] J. Huang, G. Huang, Z. Zhu, Y. Ye, and D. Du, “BEVDet: Highperformance multi-camera 3D object detection in bird-eye-view,” arXiv preprint arXiv:2112.11790, 2021. [127] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, “Pointpillars: Fast encoders for object detection from point clouds,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 12 697–12 705. [128] J. Deng, S. Shi, P. Li, W. Zhou, Y. Zhang, and H. Li, “Voxel RCNN: Towards high performance voxel-based 3D object detection,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 2, pp. 1201–1209, 2021. [129] X. Bai, Z. Hu, X. Zhu, Q. Huang, Y. Chen, H. Fu, and C.-L. Tai, “TransFusion: Robust lidar-camera fusion for 3D object detection with transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1090–1099. [130] X. Li, T. Ma, Y. Hou, B. Shi, Y. Yang, Y. Liu, X. Wu, Q. Chen, Y. Li, Y. Qiao et al., “LoGoNet: Towards accurate 3D object detection with local-to-global cross-modal fusion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17 524–17 534. [131] Wayve, “Lingo-1: Exploring natural language for autonomous driving,” https://wayve.ai/thinking/lingo-natural-language-autonomous-driving/, 2023. [132] Y. Ma, Y. Cao, J. Sun, M. Pavone, and C. Xiao, “Dolphins: Multimodal language model for driving,” arXiv preprint arXiv:2312.00438, 2023. [133] D. C. Gazis, R. Herman, and R. W. Rothery, “Nonlinear follow-theleader models of traffic flow,” Operations research, vol. 9, no. 4, pp. 545–567, 1961. 19 [134] M. Treiber, A. Hennecke, and D. Helbing, “Congested traffic states in empirical observations and microscopic simulations,” Physical review E, vol. 62, no. 2, p. 1805, 2000. [135] A. Kesting, M. Treiber, and D. Helbing, “General lane-changing model mobil for car-following models,” Transportation Research Record, vol. 1999, no. 1, pp. 86–94, 2007. [136] T. Hülnhagen, I. Dengler, A. Tamke, T. Dang, and G. Breuel, “Maneuver recognition using probabilistic finite-state machines and fuzzy logic,” in 2010 ieee intelligent vehicles symposium. IEEE, 2010, pp. 65–70. [137] S.-H. Bae, S.-H. Joo, J.-W. Pyo, J.-S. Yoon, K. Lee, and T.-Y. Kuc, “Finite state machine based vehicle system for autonomous driving in urban environments,” in International Conference on Control, Automation and Systems. IEEE, 2020, pp. 1181–1186. [138] J.-A. Bolte, A. Bar, D. Lipinski, and T. 
[138] J.-A. Bolte, A. Bar, D. Lipinski, and T. Fingscheidt, “Towards corner case detection for autonomous driving,” in IEEE Intelligent Vehicles Symposium, 2019, pp. 438–445.
[139] L. Ma, J. Xue, K. Kawabata, J. Zhu, C. Ma, and N. Zheng, “A fast RRT algorithm for motion planning of autonomous road vehicles,” in International IEEE Conference on Intelligent Transportation Systems. IEEE, 2014, pp. 1033–1038.
[140] L. Wen, Z. Fu, P. Cai, D. Fu, S. Mao, and B. Shi, “TrafficMCTS: A closed-loop traffic flow generation framework with group-based Monte Carlo tree search,” arXiv preprint arXiv:2308.12797, 2023.
[141] Y. Guo, Q. Zhang, J. Wang, and S. Liu, “Hierarchical reinforcement learning-based policy switching towards multi-scenarios autonomous driving,” in International Joint Conference on Neural Networks (IJCNN). IEEE, 2021, pp. 1–8.
[142] T. Rupprecht and Y. Wang, “A survey for deep reinforcement learning in Markovian cyber–physical systems: Common problems and solutions,” Neural Networks, vol. 153, pp. 13–36, 2022.
[143] S. Arora and P. Doshi, “A survey of inverse reinforcement learning: Challenges, methods and progress,” Artificial Intelligence, vol. 297, p. 103500, 2021.
[144] D. Helbing and P. Molnar, “Social force model for pedestrian dynamics,” Physical Review E, vol. 51, no. 5, p. 4282, 1995.
[145] D. Yang, Ü. Özgüner, and K. Redmill, “Social force based microscopic modeling of vehicle-crowd interaction,” in IEEE Intelligent Vehicles Symposium (IV). IEEE, 2018, pp. 1537–1542.
[146] J. Wang, J. Wu, and Y. Li, “The driving safety field based on driver–vehicle–road interactions,” IEEE Transactions on Intelligent Transportation Systems, vol. 16, no. 4, pp. 2203–2214, 2015.
[147] J. Wang, J. Wu, X. Zheng, D. Ni, and K. Li, “Driving safety field theory modeling and its application in pre-collision warning system,” Transportation Research Part C: Emerging Technologies, vol. 72, pp. 306–324, 2016.
[148] Y. Liu, F. Wu, Z. Liu, K. Wang, F. Wang, and X. Qu, “Can language models be used for real-world urban-delivery route optimization?” The Innovation, vol. 4, no. 6, 2023.
[149] C. Cui, Y. Ma, X. Cao, W. Ye, Y. Zhou, K. Liang, J. Chen, J. Lu, Z. Yang, K.-D. Liao et al., “A survey on multimodal large language models for autonomous driving,” arXiv preprint arXiv:2311.12320, 2023.
[150] D. A. Pomerleau, “ALVINN: An autonomous land vehicle in a neural network,” Advances in Neural Information Processing Systems, vol. 1, 1988.
[151] W. Schwarting, J. Alonso-Mora, and D. Rus, “Planning and decision-making for autonomous vehicles,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 1, pp. 187–210, 2018.
[152] Y. Ma, Z. Wang, H. Yang, and L. Yang, “Artificial intelligence applications in the development of autonomous vehicles: A survey,” IEEE/CAA Journal of Automatica Sinica, vol. 7, no. 2, pp. 315–329, 2020.
[153] J. Daudelin, G. Jing, T. Tosun, M. Yim, H. Kress-Gazit, and M. Campbell, “An integrated system for perception-driven autonomy with modular robots,” Science Robotics, vol. 3, no. 23, p. eaat4983, 2018.
[154] A. Sadat, S. Casas, M. Ren, X. Wu, P. Dhawan, and R. Urtasun, “Perceive, predict, and plan: Safe motion planning through interpretable semantic representations,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIII 16. Springer, 2020, pp. 414–430.
[155] A. Vahidi and A. Sciarretta, “Energy saving potentials of connected and automated vehicles,” Transportation Research Part C: Emerging Technologies, vol. 95, pp. 822–843, 2018.
[156] Y. Wang, P. Cai, and G. Lu, “Cooperative autonomous traffic organization method for connected automated vehicles in multi-intersection road networks,” Transportation Research Part C: Emerging Technologies, vol. 111, pp. 458–476, 2020.
[157] P. S. Chib and P. Singh, “Recent advancements in end-to-end autonomous driving using deep learning: A survey,” IEEE Transactions on Intelligent Vehicles, 2023.
[158] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The KITTI dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
[159] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The Cityscapes dataset for semantic urban scene understanding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3213–3223.
[160] H. Xu, Y. Gao, F. Yu, and T. Darrell, “End-to-end learning of driving models from large-scale video datasets,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2174–2182.
[161] V. Ramanishka, Y.-T. Chen, T. Misu, and K. Saenko, “Toward driving scene understanding: A dataset for learning driver behavior and causal reasoning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7699–7707.
[162] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine et al., “Scalability in perception for autonomous driving: Waymo open dataset,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2446–2454.
[163] W. K. Fong, R. Mohan, J. V. Hurtado, L. Zhou, H. Caesar, O. Beijbom, and A. Valada, “Panoptic nuScenes: A large-scale benchmark for lidar panoptic segmentation and tracking,” IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 3795–3802, 2022.
[164] D. Wu, W. Han, T. Wang, X. Dong, X. Zhang, and J. Shen, “Referring multi-object tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14633–14642.
[165] A. B. Vasudevan, D. Dai, and L. Van Gool, “Object referring in videos with language and human gaze,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4129–4138.
[166] J. Kim, A. Rohrbach, T. Darrell, J. Canny, and Z. Akata, “Textual explanations for self-driving vehicles,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 563–578.
[167] J. Kim, T. Misu, Y.-T. Chen, A. Tawari, and J. Canny, “Grounding human-to-vehicle advice for self-driving vehicles,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 10591–10599.
[168] T. Qian, J. Chen, L. Zhuo, Y. Jiao, and Y.-G. Jiang, “NuScenes-QA: A multi-modal visual question answering benchmark for autonomous driving scenario,” arXiv preprint arXiv:2305.14836, 2023.
[169] DriveLM Contributors, “DriveLM: Drive on language,” https://github.com/OpenDriveLab/DriveLM, 2023.
[170] S. Alletto, A. Palazzi, F. Solera, S. Calderara, and R. Cucchiara, “DR(eye)VE: A dataset for attention-based tasks with applications to autonomous and assisted driving,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2016, pp. 54–60.
[171] J. Fang, D. Yan, J. Qiao, J. Xue, H. Wang, and S. Li, “DADA-2000: Can driving accident be predicted by driver attention? Analyzed by a benchmark,” in IEEE Intelligent Transportation Systems Conference. IEEE, 2019, pp. 4303–4309.
[172] Y. Qiu, C. Busso, T. Misu, and K. Akash, “Incorporating gaze behavior using joint embedding with scene context for driver takeover detection,” in IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2022, pp. 4633–4637.
[173] S. Malla, C. Choi, I. Dwivedi, J. H. Choi, and J. Li, “DRAMA: Joint risk localization and captioning in driving,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 1043–1052.
[174] A. Palazzi, D. Abati, F. Solera, R. Cucchiara et al., “Predicting the driver’s focus of attention: The DR(eye)VE project,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 7, pp. 1720–1733, 2018.
[175] J. Fang, D. Yan, J. Qiao, J. Xue, and H. Yu, “DADA: Driver attention prediction in driving accident scenarios,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 6, pp. 4959–4971, 2022.
[176] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: A method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2002, pp. 311–318.
[177] A. Lavie and A. Agarwal, “METEOR: An automatic metric for MT evaluation with improved correlation with human judgments,” in Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Association for Computational Linguistics, 2007, pp. 65–72.
[178] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, 2004, pp. 74–81.
[179] R. Vedantam, C. L. Zitnick, and D. Parikh, “CIDEr: Consensus-based image description evaluation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 4566–4575.
[180] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, “SPICE: Semantic propositional image caption evaluation,” in European Conference on Computer Vision (ECCV), 2016, pp. 382–398.
[181] K. Pearson, “Note on regression and inheritance in the case of two parents,” Proceedings of the Royal Society of London, vol. 58, no. 347–352, pp. 240–242, 1895.
[182] S. Kullback and R. A. Leibler, “On information and sufficiency,” The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79–86, 1951.
[183] J. R. Quinlan, “Induction of decision trees,” Machine Learning, vol. 1, no. 1, pp. 81–106, 1986.
[184] A. Stocco, B. Pulfer, and P. Tonella, “Mind the gap! A study on the transferability of virtual vs physical-world testing of autonomous driving systems,” IEEE Transactions on Software Engineering, 2022.
[185] C. Zhang, R. Guo, W. Zeng, Y. Xiong, B. Dai, R. Hu, M. Ren, and R. Urtasun, “Rethinking closed-loop training for autonomous driving,” in European Conference on Computer Vision. Springer, 2022, pp. 264–282.
[186] S. Feng, H. Sun, X. Yan, H. Zhu, Z. Zou, S. Shen, and H. X. Liu, “Dense reinforcement learning for safety validation of autonomous vehicles,” Nature, vol. 615, no. 7953, pp. 620–627, 2023.
[187] L. Li, X. Wang, K. Wang, Y. Lin, J. Xin, L. Chen, L. Xu, B. Tian, Y. Ai, J. Wang et al., “Parallel testing of vehicle intelligence via virtual-real interaction,” Science Robotics, vol. 4, no. 28, p. eaaw4106, 2019.
[188] K. Othman, “Exploring the implications of autonomous vehicles: A comprehensive review,” Innovative Infrastructure Solutions, vol. 7, no. 2, p. 165, 2022.
[189] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “CARLA: An open urban driving simulator,” in Conference on Robot Learning. PMLR, 2017, pp. 1–16.
[190] P. A. Lopez, M. Behrisch, L. Bieker-Walz, J. Erdmann, Y.-P. Flötteröd, R. Hilbrich, L. Lücken, J. Rummel, P. Wagner, and E. Wießner, “Microscopic traffic simulation using SUMO,” in International Conference on Intelligent Transportation Systems. IEEE, 2018, pp. 2575–2582.
[191] L. Wen, D. Fu, S. Mao, P. Cai, M. Dou, and Y. Li, “LimSim: A long-term interactive multi-scenario traffic simulator,” arXiv preprint arXiv:2307.06648, 2023.
[192] A. Zador, S. Escola, B. Richards, B. Ölveczky, Y. Bengio, K. Boahen, M. Botvinick, D. Chklovskii, A. Churchland, C. Clopath et al., “Toward next-generation artificial intelligence: Catalyzing the NeuroAI revolution,” arXiv preprint arXiv:2210.08340, 2022.
[193] X. Zhao, Y. Gao, S. Jin, Z. Xu, Z. Liu, W. Fan, and P. Liu, “Development of a cyber-physical-system perspective based simulation platform for optimizing connected automated vehicles dedicated lanes,” Expert Systems with Applications, vol. 213, p. 118972, 2023.
[194] E. Leurent, “An environment for autonomous driving decision-making,” https://github.com/eleurent/highway-env, 2018.
[195] H. Caesar, J. Kabzan, K. S. Tan, W. K. Fong, E. Wolff, A. Lang, L. Fletcher, O. Beijbom, and S. Omari, “nuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles,” arXiv preprint arXiv:2106.11810, 2021.
[196] C. Gulino, J. Fu, W. Luo, G. Tucker, E. Bronstein, Y. Lu, J. Harb, X. Pan, Y. Wang, X. Chen, J. D. Co-Reyes, R. Agarwal, R. Roelofs, Y. Lu, N. Montali, P. Mougin, Z. Yang, B. White, A. Faust, R. McAllister, D. Anguelov, and B. Sapp, “Waymax: An accelerated, data-driven simulator for large-scale autonomous driving research,” in Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, 2023.
[197] M. W. Sayers, “Vehicle models for RTS applications,” Vehicle System Dynamics, vol. 32, no. 4-5, pp. 421–438, 1999.
[198] Hexagon, “Virtual Test Drive: Complete tool-chain for driving simulation applications,” https://hexagon.com/products/virtual-test-drive.
[199] Epic Games, “Unreal Engine: The world’s most advanced real-time 3D creation tool for photoreal visuals and immersive experiences,” https://www.unrealengine.com/.
[200] Unity Technologies, “Unity Engine: Unity’s real-time 3D development engine lets artists, designers, and developers collaborate to create amazing immersive and interactive experiences,” https://unity.com/products/unity-engine/.
[201] Q. Li, Z. Peng, L. Feng, Q. Zhang, Z. Xue, and B. Zhou, “MetaDrive: Composing diverse driving scenarios for generalizable reinforcement learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 3, pp. 3461–3475, 2022.
[202] W. Li, C. Pan, R. Zhang, J. Ren, Y. Ma, J. Fang, F. Yan, Q. Geng, X. Huang, H. Gong et al., “AADS: Augmented autonomous driving simulation using data-driven algorithms,” Science Robotics, vol. 4, no. 28, p. eaaw0863, 2019.
[203] Z. Yang, Y. Chai, D. Anguelov, Y. Zhou, P. Sun, D. Erhan, S. Rafferty, and H. Kretzschmar, “SurfelGAN: Synthesizing realistic sensor data for autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11118–11127.
[204] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “NeRF: Representing scenes as neural radiance fields for view synthesis,” Communications of the ACM, vol. 65, no. 1, pp. 99–106, 2021.
[205] Z. Chen, C. Wang, Y.-C. Guo, and S.-H. Zhang, “StructNeRF: Neural radiance fields for indoor scenes with structural hints,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[206] X. Wu, J. Xu, Z. Zhu, H. Bao, Q. Huang, J. Tompkin, and W. Xu, “Scalable neural indoor scene rendering,” ACM Transactions on Graphics, vol. 41, no. 4, 2022.
[207] W. Chang, Y. Zhang, and Z. Xiong, “Depth estimation from indoor panoramas with neural scene representation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 899–908.
[208] Y. Wei, S. Liu, J. Zhou, and J. Lu, “Depth-guided optimization of neural radiance fields for indoor multi-view stereo,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[209] M. Tancik, V. Casser, X. Yan, S. Pradhan, B. Mildenhall, P. P. Srinivasan, J. T. Barron, and H. Kretzschmar, “Block-NeRF: Scalable large scene neural view synthesis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8248–8258.
[210] A. Tonderski, C. Lindström, G. Hess, W. Ljungbergh, L. Svensson, and C. Petersson, “NeuRAD: Neural rendering for autonomous driving,” arXiv preprint arXiv:2311.15260, 2023.
[211] L. Yang, Z. Zhang, Y. Song, S. Hong, R. Xu, Y. Zhao, W. Zhang, B. Cui, and M.-H. Yang, “Diffusion models: A comprehensive survey of methods and applications,” ACM Computing Surveys, 2022.
[212] G. Yan, Z. Liu, C. Wang, C. Shi, P. Wei, X. Cai, T. Ma, Z. Liu, Z. Zhong, Y. Liu et al., “OpenCalib: A multi-sensor calibration toolbox for autonomous driving,” Software Impacts, vol. 14, p. 100393, 2022.
[213] C. Jiang, A. Cornman, C. Park, B. Sapp, Y. Zhou, D. Anguelov et al., “MotionDiffuser: Controllable multi-agent motion prediction using diffusion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 9644–9653.
[214] Z. Zhong, D. Rempe, D. Xu, Y. Chen, S. Veer, T. Che, B. Ray, and M. Pavone, “Guided conditional diffusion for controllable traffic simulation,” in IEEE International Conference on Robotics and Automation. IEEE, 2023, pp. 3560–3566.
[215] X. Cai, W. Jiang, R. Xu, W. Zhao, J. Ma, S. Liu, and Y. Li, “Analyzing infrastructure lidar placement with realistic lidar simulation library,” in IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 5581–5587.
[216] A. Swerdlow, R. Xu, and B. Zhou, “Street-view image generation from a bird’s-eye view layout,” arXiv preprint arXiv:2301.04634, 2023.
[217] K. Yang, E. Ma, J. Peng, Q. Guo, D. Lin, and K. Yu, “BEVControl: Accurately controlling street-view elements with multi-perspective consistency via BEV sketch layout,” arXiv preprint arXiv:2308.01661, 2023.
[218] R. Gao, K. Chen, E. Xie, L. Hong, Z. Li, D.-Y. Yeung, and Q. Xu, “MagicDrive: Street view generation with diverse 3D geometry control,” arXiv preprint arXiv:2310.02601, 2023.
[219] X. Li, Y. Zhang, and X. Ye, “DrivingDiffusion: Layout-guided multi-view driving scene video generation with latent diffusion model,” arXiv preprint arXiv:2310.07771, 2023.
[220] J. Lu, Z. Huang, J. Zhang, Z. Yang, and L. Zhang, “WoVoGen: World volume-aware diffusion for controllable multi-camera driving scene generation,” arXiv preprint arXiv:2312.02934, 2023.
[221] F. Jia, W. Mao, Y. Liu, Y. Zhao, Y. Wen, C. Zhang, X. Zhang, and T. Wang, “ADriver-I: A general world model for autonomous driving,” arXiv preprint arXiv:2311.13549, 2023.
[222] D. Ha and J. Schmidhuber, “World models,” arXiv preprint arXiv:1803.10122, 2018.
[223] A. Van Den Oord, O. Vinyals et al., “Neural discrete representation learning,” Advances in Neural Information Processing Systems, vol. 30, 2017.
[224] W. Zheng, W. Chen, Y. Huang, B. Zhang, Y. Duan, and J. Lu, “OccWorld: Learning a 3D occupancy world model for autonomous driving,” arXiv preprint arXiv:2311.16038, 2023.
[225] Z. Zhang, A. Liniger, D. Dai, F. Yu, and L. Van Gool, “TrafficBots: Towards world models for autonomous driving simulation and motion prediction,” arXiv preprint arXiv:2303.04116, 2023.
[226] A. Martino, M. Iannelli, and C. Truong, “Knowledge injection to counter large language model (LLM) hallucination,” in European Semantic Web Conference. Springer, 2023, pp. 182–185.
[227] D. Lenat and G. Marcus, “Getting from generative AI to trustworthy AI: What LLMs might learn from Cyc,” arXiv preprint arXiv:2308.04445, 2023.
[228] G. Agrawal, T. Kumarage, Z. Alghami, and H. Liu, “Can knowledge graphs reduce hallucinations in LLMs?: A survey,” arXiv preprint arXiv:2311.07914, 2023.
[229] L. Chen, O. Sinavski, J. Hünermann, A. Karnsund, A. J. Willmott, D. Birch, D. Maund, and J. Shotton, “Driving with LLMs: Fusing object-level vector modality for explainable autonomous driving,” arXiv preprint arXiv:2310.01957, 2023.
[230] R. Pfeifer and F. Iida, “Embodied artificial intelligence: Trends and challenges,” in Embodied Artificial Intelligence: International Seminar, Dagstuhl Castle, Germany, July 7-11, 2003, Revised Papers. Springer, 2004, pp. 1–26.
[231] L. Smith and M. Gasser, “The development of embodied cognition: Six lessons from babies,” Artificial Life, vol. 11, no. 1-2, pp. 13–29, 2005.
[232] J. Duan, S. Yu, H. L. Tan, H. Zhu, and C. Tan, “A survey of embodied AI: From simulators to research tasks,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 6, no. 2, pp. 230–244, 2022.
[233] X. Zhu, Y. Chen, H. Tian, C. Tao, W. Su, C. Yang, G. Huang, B. Li, L. Lu, X. Wang et al., “Ghost in the Minecraft: Generally capable agents for open-world environments via large language models with text-based knowledge and memory,” arXiv preprint arXiv:2305.17144, 2023.
[234] R. Law, K. J. Lin, H. Ye, and D. K. C. Fong, “Artificial intelligence research in hospitality: A state-of-the-art review and future directions,” International Journal of Contemporary Hospitality Management, 2023.
[235] D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, W. Huang, Y. Chebotar, P. Sermanet, D. Duckworth, S. Levine, V. Vanhoucke, K. Hausman, M. Toussaint, K. Greff, A. Zeng, I. Mordatch, and P. Florence, “PaLM-E: An embodied multimodal language model,” arXiv preprint arXiv:2303.03378, 2023.
[236] L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y. Lin et al., “A survey on large language model based autonomous agents,” arXiv preprint arXiv:2308.11432, 2023.
[237] J. A. Oravec, “The future of embodied AI: Containing and mitigating the dark and creepy sides of robotics, autonomous vehicles, and AI,” in Good Robot, Bad Robot: Dark and Creepy Sides of Robotics, Autonomous Vehicles, and AI. Springer, 2022, pp. 245–276.
[238] A. Keysan, A. Look, E. Kosman, G. Gürsun, J. Wagner, Y. Yu, and B. Rakitsch, “Can you text what is happening? Integrating pre-trained language encoders into trajectory prediction models for autonomous driving,” arXiv preprint arXiv:2309.05282, 2023.
[239] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter,” arXiv preprint arXiv:1910.01108, 2019.
[240] C. Cui, Y. Ma, X. Cao, W. Ye, and Z. Wang, “Drive as you speak: Enabling human-like interaction with large language models in autonomous vehicles,” arXiv preprint arXiv:2309.10228, 2023.
[241] H. Sha, Y. Mu, Y. Jiang, L. Chen, C. Xu, P. Luo, S. E. Li, M. Tomizuka, W. Zhan, and M. Ding, “LanguageMPC: Large language models as decision makers for autonomous driving,” arXiv preprint arXiv:2310.03026, 2023.
[242] J. Li, D. Li, S. Savarese, and S. Hoi, “BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” arXiv preprint arXiv:2301.12597, 2023.
[243] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” arXiv preprint arXiv:2304.08485, 2023.
[244] S. Zhang, D. Fu, Z. Zhang, B. Yu, and P. Cai, “TrafficGPT: Viewing, processing and interacting with traffic foundation models,” arXiv preprint arXiv:2309.06719, 2023.
[245] C. Cui, Y. Ma, X. Cao, W. Ye, and Z. Wang, “Receive, reason, and react: Drive as you say with large language models in autonomous vehicles,” arXiv preprint arXiv:2310.08034, 2023.
[246] J. Mao, Y. Qian, H. Zhao, and Y. Wang, “GPT-Driver: Learning to drive with GPT,” arXiv preprint arXiv:2310.01415, 2023.
[247] T.-H. Wang, A. Maalouf, W. Xiao, Y. Ban, A. Amini, G. Rosman, S. Karaman, and D. Rus, “Drive anywhere: Generalizable end-to-end autonomous driving with multi-modal foundation models,” arXiv preprint arXiv:2310.17642, 2023.
[248] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
[249] J. Mao, J. Ye, Y. Qian, M. Pavone, and Y. Wang, “A language agent for autonomous driving,” arXiv preprint arXiv:2311.10813, 2023.
[250] Anonymous, “3D dense captioning beyond nouns: A middleware for autonomous driving,” in Submitted to The Twelfth International Conference on Learning Representations, 2023, under review. [Online]. Available: https://openreview.net/forum?id=8T7m27VC3S
[251] A. Awadalla, I. Gao, J. Gardner, J. Hessel, Y. Hanafy, W. Zhu, K. Marathe, Y. Bitton, S. Gadre, S. Sagawa, J. Jitsev, S. Kornblith, P. W. Koh, G. Ilharco, M. Wortsman, and L. Schmidt, “OpenFlamingo: An open-source framework for training large autoregressive vision-language models,” arXiv preprint arXiv:2308.01390, 2023.
[252] X. Jia, Y. Gao, L. Chen, J. Yan, P. L. Liu, and H. Li, “DriveAdapter: Breaking the coupling barrier of perception and planning in end-to-end autonomous driving,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 7953–7963.
[253] L. Wen, X. Yang, D. Fu, X. Wang, P. Cai, X. Li, T. Ma, Y. Li, L. Xu, D. Shang et al., “On the road with GPT-4V(ision): Early explorations of visual-language model on autonomous driving,” arXiv preprint arXiv:2311.05332, 2023.
[254] B. Peng, C. Li, P. He, M. Galley, and J. Gao, “Instruction tuning with GPT-4,” arXiv preprint arXiv:2304.03277, 2023.
[255] J. Mai, J. Chen, B. Li, G. Qian, M. Elhoseiny, and B. Ghanem, “LLM as a robotic brain: Unifying egocentric memory and control,” arXiv preprint arXiv:2304.09349, 2023.
[256] J. Li, X. Zhang, J. Li, Y. Liu, and J. Wang, “Building and optimization of 3D semantic map based on lidar and camera fusion,” Neurocomputing, vol. 409, pp. 394–407, 2020.
[257] J. S. Berrio, M. Shan, S. Worrall, and E. Nebot, “Camera-lidar integration: Probabilistic sensor fusion for semantic mapping,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 7, pp. 7637–7652, 2021.
[258] C. Premebida and U. Nunes, “Fusing lidar, camera and semantic information: A context-based approach for pedestrian detection,” The International Journal of Robotics Research, vol. 32, no. 3, pp. 371–384, 2013.
[259] S. Wang, W. Li, W. Liu, X. Liu, and J. Zhu, “LiDAR2Map: In defense of lidar-based semantic map construction using online camera distillation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 5186–5195.
[260] J. de Curtò, I. de Zarzà, and C. T. Calafate, “Semantic scene understanding with large language models on unmanned aerial vehicles,” Drones, vol. 7, no. 2, p. 114, 2023.
[261] S. Wu, H. Fei, L. Qu, W. Ji, and T.-S. Chua, “NExT-GPT: Any-to-any multimodal LLM,” arXiv preprint arXiv:2309.05519, 2023.
[262] A. Elhafsi, R. Sinha, C. Agia, E. Schmerling, I. A. Nesnas, and M. Pavone, “Semantic anomaly detection with large language models,” Autonomous Robots, pp. 1–21, 2023.
[263] X. Zhou, M. Liu, B. L. Zagar, E. Yurtsever, and A. C. Knoll, “Vision language models in autonomous driving and intelligent transportation systems,” arXiv preprint arXiv:2310.14414, 2023.
[264] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar, “Voyager: An open-ended embodied agent with large language models,” arXiv preprint arXiv:2305.16291, 2023.
[265] A. Zhao, D. Huang, Q. Xu, M. Lin, Y.-J. Liu, and G. Huang, “ExpeL: LLM agents are experiential learners,” arXiv preprint arXiv:2308.10144, 2023.
[266] K. Zhang, F. Zhao, Y. Kang, and X. Liu, “Memory-augmented LLM personalization with short- and long-term memory coordination,” arXiv preprint arXiv:2309.11696, 2023.
[267] S. Wang, Y. Zhu, Z. Li, Y. Wang, L. Li, and Z. He, “ChatGPT as your vehicle co-pilot: An initial attempt,” IEEE Transactions on Intelligent Vehicles, 2023.
[268] N. Shinn, F. Cassano, A. Gopinath, K. R. Narasimhan, and S. Yao, “Reflexion: Language agents with verbal reinforcement learning,” in Thirty-seventh Conference on Neural Information Processing Systems, 2023.
[269] T. X. Olausson, J. P. Inala, C. Wang, J. Gao, and A. Solar-Lezama, “Demystifying GPT self-repair for code generation,” arXiv preprint arXiv:2306.09896, 2023.
[270] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, “3D Gaussian splatting for real-time radiance field rendering,” ACM Transactions on Graphics (ToG), vol. 42, no. 4, pp. 1–14, 2023.
[271] OpenAI, “GPT-4V(ision) system card,” https://openai.com/research/gpt-4v-system-card, 2023.
[272] F. Sammani, T. Mukherjee, and N. Deligiannis, “NLX-GPT: A model for natural language explanations in vision and vision-language tasks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8322–8332.
[273] P. Gao, J. Han, R. Zhang, Z. Lin, S. Geng, A. Zhou, W. Zhang, P. Lu, C. He, X. Yue, H. Li, and Y. Qiao, “LLaMA-Adapter V2: Parameter-efficient visual instruction model,” arXiv preprint arXiv:2304.15010, 2023.
[274] W. Huang, C. Wang, R. Zhang, Y. Li, J. Wu, and L. Fei-Fei, “VoxPoser: Composable 3D value maps for robotic manipulation with language models,” arXiv preprint arXiv:2307.05973, 2023.
[275] H. Ye, T. Liu, A. Zhang, W. Hua, and W. Jia, “Cognitive mirage: A review of hallucinations in large language models,” arXiv preprint arXiv:2309.06794, 2023.
[276] V. Rawte, A. Sheth, and A. Das, “A survey of hallucination in large foundation models,” arXiv preprint arXiv:2309.05922, 2023.