Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation

Zhixiang Wei1*  Lin Chen1,2*  Yi Jin1*  Xiaoxiao Ma1  Tianle Liu1  Pengyang Lin1,2  Ben Wang1  Huaian Chen1†  Jinjin Zheng1
1 University of Science and Technology of China   2 Shanghai AI Laboratory
{zhixiangwei,chlin,xiao xiao,tleliu,lpyang27,wblzgrsn,anchen}@mail.ustc.edu.cn   {jinyi08,jjzheng}@ustc.edu.cn
* indicates equal contributions. † Corresponding authors.

arXiv:2312.04265v1 [cs.CV] 7 Dec 2023

[Figure 1: (a) Stronger pre-trained models: average mIoU of prior SOTA DGSS methods versus frozen and fine-tuned VFMs (CLIP, SAM, EVA02, DINOv2); (b) Fewer trainable parameters: mIoU versus trainable parameters (M); (c) Superior generalization ability: qualitative comparison of Input / GT, WildNet, VFM, and Ours.]

Figure 1. Vision Foundation Models (VFMs) are stronger pre-trained models that serve as robust backbones, effortlessly outperforming previous state-of-the-art Domain Generalized Semantic Segmentation (DGSS) methods, as shown in (a). Yet, the extensive parameters of VFMs make them challenging to train. To address this, we introduce a robust fine-tuning approach to efficiently harness VFMs for DGSS. As illustrated in (b) and (c), the proposed methods achieve superior generalizability with fewer trainable parameters within backbones.

Abstract

In this paper, we first assess and harness various Vision Foundation Models (VFMs) in the context of Domain Generalized Semantic Segmentation (DGSS). Driven by the motivation of Leveraging Stronger pre-trained models and Fewer trainable parameters for Superior generalizability, we introduce a robust fine-tuning approach, namely "Rein", to parameter-efficiently harness VFMs for DGSS. Built upon a set of trainable tokens, each linked to distinct instances, Rein precisely refines and forwards the feature maps from each layer to the next layer within the backbone. This process produces diverse refinements for different categories within a single image. With fewer trainable parameters, Rein efficiently fine-tunes VFMs for DGSS tasks, surprisingly surpassing full parameter fine-tuning. Extensive experiments across various settings demonstrate that Rein significantly outperforms state-of-the-art methods. Remarkably, with just an extra 1% of trainable parameters within the frozen backbone, Rein achieves an mIoU of 68.1% on Cityscapes, without accessing any real urban-scene datasets.

1. Introduction

Prior works [26, 28, 30, 52, 60, 62, 68] in Domain Generalized Semantic Segmentation (DGSS) focus on improving prediction accuracy across multiple unseen domains without accessing their data, thus enabling high generalization for real applications. Since models are fine-tuned using datasets [10, 53] that are either limited in scale or different in image style from the target domain, complex data augmentation approaches [4, 49, 69] and domain-invariant feature extraction strategies [8, 47, 59, 63] have been widely explored in previous DGSS work. These methods result in enhanced generalization when applied to classic backbones, e.g., VGGNet [58], MobileNetV2 [55], and ResNet [18].

In recent years, large-scale Vision Foundation Models (VFMs) like CLIP [51], MAE [19], SAM [33], EVA02 [14,
1 Previous DGSS methods Frozen backbone of VFMs Methods GTR[49] AdvStyle[69] WildNet[35] SPC[22] PASTA[4] TLDR[31] CLIP-ViT-L[51] MAE-L[19] SAM-H[33] EVA02-L[14] DINOv2-L[46] Publications TIP21 NIPS22 CVPR22 CVPR23 ICCV23 ICCV23 ICML21 CVPR22 ICCV23 arXiv23 arXiv23 mIoU (Citys) 43.7 43.4 45.8 46.7 45.3 47.6 53.7 43.3 57.0 56.5 63.3 mIoU (BDD) 39.6 40.3 41.7 43.7 42.3 44.9 48.7 37.8 47.1 53.6 56.1 mIoU (Map) 39.1 42.0 47.1 45.5 48.6 48.8 55.0 48.0 58.4 58.6 63.9 mIoU (Average) 40.8 41.9 44.9 45.3 45.4 47.1 52.4 43.0 54.2 56.2 61.1 Table 1. Performance benchmarking of multiple VFMs and previous DGSS methods under the GTAV → Cityscapes (Citys) + BDD100K (BDD) + Mapillary (Map) generalization setting. Models are fine-tuned on GTAV and tested on Cityscapes, BDD100K and Mapillary. The best results are highlighted. Without specialized design, frozen VFMs demonstrate significantly stronger performance. 15], and DINOv2 [46] have significantly advanced the boundaries of performance in a variety of computer vision challenges. Giving the remarkable generalization of these VFMs across various unseen scenes, two intuitive questions emerge: How do VFMs perform in the context of DGSS? And How to harness VFMs for DGSS? We attempt to answer these questions as follows: features, generate an attention-like similarity map. This map enables Rein to perform precise refinement tailored to each instance within an image, significantly boosting VFMs in the context of DGSS. Moreover, to reduce the number of trainable parameters, we employ shared weights across MLPs in different layers and design our learnable tokens by multiplying two low-rank matrices. Extensive experiments on various DGSS settings demonstrate that the proposed Rein outperforms existing DGSS methods by a large margin with fewer trainable parameters. In a nutshell, the main contributions of this paper are as follows: • We first assess various Vision Foundation Models (VFMs) in the context of Domain Generalized Semantic Segmentation (DGSS). Our extensive experiments in the DGSS framework highlight the impressive generalization capabilities of VFMs. The findings confirm that VFMs serve as Stronger backbones, thereby establishing a significant benchmark in this field. • We present a robust fine-tuning method, namely “Rein”, to parameter-efficiently harness VFMs. At its core, Rein consists of a set of learnable tokens, each directly linked to distinct instances. With deliberate design, this linkage enables Rein to refine feature maps at an instancelevel within each backbone layer. As a result, Rein reinforces the ability of VFMs in DGSS tasks, achieving this with Fewer trainable parameters while preserving the pre-trained knowledge. • Comprehensive experiments across various DGSS settings demonstrate that Rein employs Fewer trainable parameters to effectively leverage Stronger VFMs for achieving Superior generalizability. This performance surpasses existing DGSS methods by a large margin. Notably, Rein is designed to integrate smoothly with existing plain vision transformers, improving their generalization ability and making training more efficient. Stronger: We begin by evaluating and comparing the performance of various VFMs against existing DGSS methods. To ensure a fair comparison, we use image encoders from a variety of VFMs as the backbone for feature extraction in all cases. These backbones are coupled with the widely-used decode head, i.e., Mask2Former [7], to generate semantic predictions. As illustrated in Tab. 
1, while previous DGSS methods have showcased commendable results, they perform less effectively compared to frozen VFMs. This finding clearly demonstrates the powerful potential of VFMs in DGSS, outperforming traditional backbones like ResNet [18] and MobileNetV2 [55], thereby establishing VFMs as a meaningful benchmark in the field. Fewer: Although VFMs have exhibited impressive generalization capabilities, fine-tuning them for DGSS tasks poses a challenge. The datasets [10, 53] commonly used in DGSS tasks are significantly smaller in scale compared to ImageNet [11], and fine-tuning VFMs with their huge number of trainable parameters on these datasets result in limited generalizability [27]. To address this issue, instead of the difficult task of large datasets collection, we resort to fine-tuning VFMs with fewer trainable parameters. However, most existing parameter-efficient fine-tuning strategies, which fine-tune a large-scale model with fewer trainable parameters, are primarily designed for adapting large language models [20, 21, 36, 38, 40, 66, 71] or classification networks [5, 23]. These methods often lack precision in refining features for distinct instances within a single image, thereby limiting their effectiveness in DGSS contexts. 2. Related Works Superior: In this work, we introduce a robust and efficient fine-tuning approach, namely “Rein”. Tailored for DGSS tasks, Rein employs fewer trainable parameters to harness stronger VFMs for achieving superior generalization. At its core, Rein comprises a set of randomly initialized tokens, each directly linked to different instances. These tokens, through a dot-product operation with VFMs Domain Generalized Semantic Segmentation. Domain Generalized Semantic Segmentation (DGSS) focuses on enhancing model generalizability. This field typically involves training models on a set of source domain data to enhance their performance on distinct and unseen target domain datasets. Various approaches [12, 22, 24, 25, 50] 2 Frozen Backbone, Tunable Rein Input Image 𝑓𝑖 𝑓𝑖 ×𝑇𝑖𝑇 𝑆𝑖 ×𝑀(𝑇𝑖 ) MLP S … Layer 𝐿𝑖 𝑚 … =𝑟 𝐴 𝑟 × 𝐵 𝑐 (𝑟 ≪ 𝑐) 𝑀𝑓 𝑀𝑠 … 𝑀𝑄 𝑄 … 𝑀 𝐿 𝑃 … Learnable Tokens 𝑇𝑖 (𝑚 × 𝑐) 𝑓𝑜 𝑓𝑖 ′ Layer 𝐿𝑖+1 softmax … Head MLP Similarity Map 𝑆𝑖 MLP max & avg & last Figure 2. An overview of proposed Rein. Rein primarily consists of a collection of low-rank learnable tokens, denoted as T = {T1 , T2 , . . . , TN }. These tokens establish direct connections to distinct instances, facilitating instance-level feature refinement. This mechanism results in the generation of an enhancement feature map fi′ = fi + Rein(fi ) for each layer within backbone. All MLPs share same parameters to reduce the number of parameters. The notation max & avg & last refers to the equation Eq. (8) and Eq. (10). tion task; EVA02 [14, 15], which integrates Masked Image Modeling pre-training with CLIP’s vision features as the pretext task’s target; and DINOv2 [46], which is pretrained on extensive, carefully curated datasets without explicit supervision. These VFMs have shown remarkable performance in a variety of downstream applications, demonstrating their impressive generalization capabilities. Yet, a dedicated investigation into their performance in the specific context of DGSS tasks remains unexplored. have been proposed to address this issue within DGSS, with representative methods including splitting the learned features into domain-invariant and domain-specific components [59, 63], or employing meta-learning to train more robust models [29]. 
A standard scenario in DGSS is generalizing from one urban-scene dataset to another, for instance, from the synthetic GTAV [53] dataset to the realworld Cityscapes [10]. In this classic setting, certain techniques [8, 47, 48] have achieved notable performance through learning feature normalization/whitening schemes, while others [35] have further improved segmentation results through feature-level style transfer and the introduction of additional data. Additionally, strong data augmentation [4, 49, 69] often simply and effectively enhances model robustness. However, most of previous DGSS methods generally utilize outdated backbones like ResNet [18], VGGNet [58], MobileNetV2 [55], and ShuffleNetV2 [43], thereby leaving the efficacy of stronger Vision Foundation Models (VFMs) in DGSS relatively unexplored. Parameter-Efficient Fine-tuning. In the NLP domain, parameter-efficient fine-tuning (PEFT) has achieved notable success by freezing most parameters of the foundation model and fine-tuning a select few. Various strategies have been introduced, such as BitFit [66], which tweaks only the bias-terms of the model, or just a subset of these terms; Prompt-tuning [36], which learns soft prompts to condition frozen language models to perform specific downstream tasks; Adapter-tuning [20], which incorporates extra lightweight modules within each Transformer layer; and notably, LoRA [21], which injects trainable rank decomposition matrices into each layer of transformer architecture, yielding significant influence. PEFT methods are gaining traction in computer vision as well, exemplified by Visual Prompt Tuning [23], which prepends prompts into the input sequence of Transformer layers for fine-tuning, and AdaptFormer [5], which replaces MLP block in the transformer encoder with an AdaptMLP comprising two sub-branches. However, these methodologies are primarily tuned for classification tasks, where each image contains only one target to identify. Our endeavor is tailored for segmentation tasks, refining feature maps at the object-level for each instance in the image, thereby achieving superior performance. Vision Foundation Models. The concept of a Foundation Model, initially introduced by Bommasani et al. [1] in the field of Natural Language Processing (NLP), defined as “the base models trained on large-scale data in a selfsupervised or semi-supervised manner that can be adapted for several other downstream tasks”. While models like the ViT [13] and Swin Transformer [41] have demonstrated excellent performance, the quest for a Vision Foundation Model (VFM) similar to their NLP counterparts is ongoing. This pursuit has yielded significant advancements with the advent of models such as CLIP [51], which learn highquality visual representation by exploring contrastive learning with large-scale image text pairs; MAE [19], utilizing a masked image modeling framework for learning latent image representations; SAM [33], which develops a promptable model and pre-train it on a broad dataset for segmenta3 Fine-tune Trainable mIoU Method Params∗ Citys BDD Map Avg. 
Full 304.15M 51.3 47.6 54.3 51.1 CLIP [51] Freeze 0.00M 53.7 48.7 55.0 52.4 (ViT-Large) 2.99M 57.1 54.7 60.5 57.4 Rein Full 330.94M 53.7 50.8 58.1 54.2 MAE [19] Freeze 0.00M 43.3 37.8 48.0 43.0 (Large) Rein 2.99M 55.0 49.3 58.6 54.3 Full 632.18M 57.6 51.7 61.5 56.9 SAM [33] Freeze 0.00M 57.0 47.1 58.4 54.2 (Huge) Rein 4.51M 59.6 52.0 62.1 57.9 Full 304.24M 62.1 56.2 64.6 60.9 EVA02 [14, 15] Freeze 0.00M 56.5 53.6 58.6 56.2 (Large) Rein 2.99M 65.3 60.5 64.9 63.6 Full 304.20M 63.7 57.4 64.2 61.7 DINOV2 [46] Freeze 0.00M 63.3 56.1 63.9 61.1 (Large) 2.99M 66.4 60.4 66.1 64.3 Rein Backbone Backbone EVA02 (Large) [14, 15] DINOv2 (Large) [46] Table 2. Performance Comparison with the proposed Rein across Multiple VFMs as Backbones under the GTAV → Cityscapes (Citys) + BDD100K (BDD) + Mapillary (Map) generalization setting. Models are fine-tuned on GTAV and tested on Cityscapes, BDD100K and Mapillary. The best results are highlighted. ∗ denotes trainable parameters in backbones. 3. Methods Table 3. Performance Comparison of the proposed Rein against other DGSS and PEFT methods under the GTAV → Cityscapes (Citys) + BDD100K (BDD) + Mapillary (Map) generalization setting. Models are fine-tuned on GTAV and tested on Cityscapes, BDD100K and Mapillary. The best results are highlighted. ∗ denotes trainable parameters in backbones. 3.1. Preliminary Driven by the motivation that Leveraging Stronger pretrained models and Fewer trainable parameters for Superior generalizability, we choose to fine-tune VFMs with a reduced parameter set. A straightforward thought might involve a smaller decode head; however, this method merely acts as a passive receiver of feature maps from the backbone, lacking the flexibility to effectively adapt a frozen backbone for generating task-specific or scene-specific features. In contrast, we propose to embed a mechanism, named “Rein”, between the layers within the backbone. Rein actively refines and forwards the feature maps from each layer to the subsequent one. This approach allows us to more effectively utilize the powerful capabilities of VFMs, much like using rein to control a horse. Given a pre-trained VFM with parameters ΦM , consisting of a sequence of layers L1 , L2 , . . . , LN , a decode head H parameterized by θh , and the Rein strategy with parameters θr , the optimization objective can be written as: arg min θR ,θh Nd X Loss(Hθh (FΦM ,θR (xi )), yi ), Fine-tune Trainable mIoU Method Params∗ Citys BDD Map Avg. 304.24M 62.1 56.2 64.6 60.9 Full +AdvStyle [69] 304.24M 63.1 56.4 64.0 61.2 +PASTA [4] 304.24M 61.8 57.1 63.6 60.8 +GTR-LTR [49] 304.24M 59.8 57.4 63.2 60.1 Freeze 0.00M 56.5 53.6 58.6 56.2 +AdvStyle [69] 0.00M 51.4 51.6 56.5 53.2 +PASTA [4] 0.00M 57.8 52.3 58.5 56.2 +GTR-LTR [49] 0.00M 52.5 52.8 57.1 54.1 +LoRA [21] 1.18M 55.5 52.7 58.3 55.5 +AdaptFormer [5] 3.17M 63.7 59.9 64.2 62.6 +VPT [23] 3.69M 62.2 57.7 62.5 60.8 +Rein (ours) 2.99M 65.3 60.5 64.9 63.6 304.20M 63.7 57.4 64.2 61.7 Full +AdvStyle [69] 304.20M 60.8 58.0 62.5 60.4 +PASTA [4] 304.20M 62.5 57.2 64.7 61.5 304.20M 62.7 57.4 64.5 61.6 +GTR-LTR [4] Freeze 0.00M 63.3 56.1 63.9 61.1 +AdvStyle [69] 0.00M 61.5 55.1 63.9 60.1 +PASTA [4] 0.00M 62.1 57.2 64.5 61.3 0.00M 60.2 57.7 62.2 60.0 +GTR-LTR [4] +LoRA [21] 0.79M 65.2 58.3 64.6 62.7 +AdaptFormer [5] 3.17M 64.9 59.0 64.2 62.7 +VPT [23] 3.69M 65.2 59.4 65.5 63.3 +Rein (ours) 2.99M 66.4 60.4 66.1 64.3 3.2. Core of Rein For simple implementation across different VFMs, we opt not to modify MLP weights at specific positions as described in the [5, 21]. 
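Concretely, the training setup of Eq. (1) keeps the VFM parameters Φ_M frozen and passes gradients only to the Rein parameters θ_r and the decode-head parameters θ_h. The sketch below is a minimal PyTorch-style illustration of that setup, using the 1e-4 learning rate reported in the implementation details; the module names `backbone`, `rein`, and `decode_head` are our own illustrative assumptions, not the authors' API.

```python
import torch

def build_rein_optimizer(model, lr=1e-4, weight_decay=0.05):
    """Sketch of the objective in Eq. (1): the VFM backbone (Phi_M) is frozen,
    and only the Rein parameters (theta_r) and decode-head parameters (theta_h)
    are optimized. Attribute names are illustrative assumptions."""
    # Freeze every backbone parameter so Phi_M keeps its pre-trained knowledge.
    for p in model.backbone.parameters():
        p.requires_grad = False
    # Collect the remaining trainable parameters: Rein tokens/MLPs and the head.
    trainable = list(model.rein.parameters()) + list(model.decode_head.parameters())
    return torch.optim.AdamW(trainable, lr=lr, weight_decay=weight_decay)
```

Where the trainable capacity sits inside this frozen forward pass is what distinguishes Rein from weight-injection schemes such as LoRA [21] or AdaptFormer [5].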
Instead, our approach focuses on refining the output feature maps at each layer within the VFMs, as illustrated in Fig. 2. Precisely, for the features f_i produced by the i-th layer L_i, Rein produces enhanced feature maps for the next layer as follows:

  f_1 = L_1(Embed(x)),  f_1 ∈ R^{n×c},
  f_{i+1} = L_{i+1}(f_i + ∆f_i),  i = 1, 2, . . . , N − 1,        (2)
  f_out = f_N + ∆f_N,

where f_i' = f_i + ∆f_i symbolizes the refined feature map, x is the input image, Embed denotes the patch embedding layer in VFMs, n represents the number of patches, and c is the dimensionality of f_1, f_2, . . . , f_N. In Eq. (1), x_i and y_i denote the input image and its corresponding ground truth, respectively, N_d signifies the total number of datasets, and F_{Φ_M,θ_r} represents the forward process of the VFM after applying the Rein strategy. Note that the layers L_1, L_2, . . . , L_N are kept frozen, and our focus is on training an efficient module, Rein, to generate ∆f_i as follows:

  ∆f_i = Rein(f_i),  ∆f_i ∈ R^{n×c},  i = 1, 2, . . . , N.        (3)

In the context of DGSS, an ideal ∆f_i should assist VFMs to bridge two types of gaps. The first is the gap in scene between the pre-training dataset and the target scene, exemplified by the contrast between ImageNet [11] and urban-scene images [10, 53]. The second is the task divergence between pre-training and fine-tuning, such as the differences between masked image modeling and semantic segmentation tasks.

To establish this dual bridge, Rein starts with a set of learnable tokens T = {T_i ∈ R^{m×c} | i ∈ N, 1 ≤ i ≤ N}, where each token sequence T_i is randomly initialized, and m denotes the sequence length of T_i. Rein freezes the backbone and embeds knowledge learned from the fine-tuning dataset into these tokens, thereby bridging the gap in scene relative to the pre-training dataset. Moreover, considering the essential need in semantic segmentation to discern multiple instances within a single image, Rein implements an attention-inspired mechanism, which enables VFMs to make tailored adjustments to the features of distinct instances, thereby aiding VFMs in adapting to the differences between semantic segmentation and pre-training tasks. Specifically, Rein employs a dot-product operation to generate a similarity map S_i, which captures the associations between feature vectors in f_i and the tokens in T:

  S_i = f_i × T_i^T,  S_i ∈ R^{n×m},        (4)

where T_i represents the token sequence of the i-th layer and m indicates the number of tokens in T_i. As S quantitatively evaluates the relationships between various tokens and feature vectors, Rein can apply a softmax function to align each patch with a unique instance:

  S_i = Softmax(f_i × T_i^T / √c).        (5)

Leveraging the feature-to-token similarity map S_i, we can compute a preliminary estimate of ∆f_i using the equation:

  ∆f̄_i = S_i(:, 2:m) × [T_i(2:m) × W_{T_i} + b_{T_i}],        (6)

where W_{T_i} and b_{T_i} denote the weights and biases of an MLP, respectively. This MLP enables the transformation of T_i across different feature spaces during the computation of S_i and ∆f̄_i. Optionally, Rein can pre-calculate T_i × W_{T_i} + b_{T_i} to reduce inference time. Notably, S_i(:, 2:m) selects columns 2 to m of S_i, and T_i(2:m) denotes the selection of rows 2 to m of T_i. This selection is particularly useful in handling challenging samples that might not have corresponding tokens in T_i. In these cases, the total similarity in the respective row of S_i remains 1, potentially leading to erroneous modifications. To counter this, Rein excludes the first token of T_i and the first column of S_i, enabling the sum of each row in S_i to range between 0 and 1, thereby reducing the risk of inappropriate alterations. To enhance the flexibility in feature adjustment, Rein utilizes an MLP composed of W_{f_i} and b_{f_i} to produce the final feature modifications ∆f_i:

  ∆f_i = (∆f̄_i + f_i) × W_{f_i} + b_{f_i}.        (7)

Benefiting from these instance-level ∆f_i adjustments, Rein is capable of generating diverse modifications for various categories within a single image. The details of Rein will be explained in the next section.
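To make the flow of Eqs. (4)-(7) concrete, the sketch below implements the per-layer refinement for a single backbone layer under assumed shapes (n patches, m tokens, dimension c). The class and attribute names (ReinLayerRefine, mlp_token, mlp_delta) and the initialization are ours for illustration, and the layer-shared and low-rank tricks described later are omitted here.

```python
import torch
import torch.nn as nn

class ReinLayerRefine(nn.Module):
    """Sketch of Eqs. (4)-(7) for one backbone layer: tokens T_i produce a
    feature-to-token similarity map S_i, which is turned into an instance-aware
    refinement Delta f_i. Not the authors' released implementation."""

    def __init__(self, num_tokens=100, dim=1024):
        super().__init__()
        self.tokens = nn.Parameter(torch.empty(num_tokens, dim).uniform_(-0.02, 0.02))  # T_i
        self.mlp_token = nn.Linear(dim, dim)   # W_T, b_T in Eq. (6)
        self.mlp_delta = nn.Linear(dim, dim)   # W_f, b_f in Eq. (7)

    def forward(self, f_i):                    # f_i: (n, c) features of layer L_i
        c = f_i.shape[-1]
        # Eqs. (4)-(5): similarity map S_i = Softmax(f_i T_i^T / sqrt(c)), shape (n, m).
        s_i = torch.softmax(f_i @ self.tokens.t() / c ** 0.5, dim=-1)
        # Eq. (6): drop the first token/column so each row of S_i sums to at most 1.
        delta_bar = s_i[:, 1:] @ self.mlp_token(self.tokens[1:])
        # Eq. (7): final refinement, added to f_i before the next layer as in Eq. (2).
        return self.mlp_delta(delta_bar + f_i)
```

The layer-shared weights of Eq. (11) and the low-rank token parameterization of Eq. (12), introduced below, further shrink this module.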
3.3. Details of Rein

Linking tokens to instances. At the core of Rein, we establish an implicit yet effective linkage between tokens and instances, which has demonstrated notable performance, as detailed in Sec. 4. This connection is further reinforced by utilizing object queries, a key component in DETR [2]-style decode heads [6, 7, 67], as intermediaries. These queries are empirically proven to establish a direct association with instances. Specifically, we generate layer-wise queries Q_i from our learnable tokens T_i via linear transformation:

  Q_i = T_i × W_{Q_i} + b_{Q_i},  Q_i ∈ R^{m×c′},        (8)

where W_{Q_i} and b_{Q_i} signify the weights and biases, respectively, and c′ denotes the dimension of Q_i. However, due to the complexity arising from the large number of layers in VFMs, transforming the diverse Q_i into a single query Q poses computational challenges. To address this, Rein computes both the maximal component Q_max ∈ R^{m×c′} and the average component Q_avg ∈ R^{m×c′} using the following equation:

  Q_max(j, k) = max_{i=1,2,...,N} Q_i(j, k),
  Q_avg(j, k) = (1/N) Σ_{i=1}^{N} Q_i(j, k).        (9)

Subsequently, Q is derived as:

  Q = Concat([Q_max, Q_avg, Q_N]) × W_Q + b_Q.        (10)

By mapping T onto Q, which subsequently links to instances, Rein achieves enhanced performance with a marginal increase in parameters.

Layer-shared MLP weights. To address the redundancy of parameters in the layer-specific MLP weights, specifically W_{T_i} in Eq. (6), W_{f_i} in Eq. (7), and W_{Q_i} in Eq. (8), which collectively contribute to a substantial trainable parameter count, we adopt a new strategy. Since the learnable T_i is capable of producing distinct ∆f_i for each layer, we design the role of the MLP to primarily perform consistent linear transformations across different feature spaces for each layer within the backbone. To this end, we employ shared MLP weights across layers as outlined in the equations:

  [W_{T_1}, b_{T_1}] = [W_{T_2}, b_{T_2}] = . . . = [W_{T_N}, b_{T_N}],
  [W_{f_1}, b_{f_1}] = [W_{f_2}, b_{f_2}] = . . . = [W_{f_N}, b_{f_N}],        (11)
  [W_{Q_1}, b_{Q_1}] = [W_{Q_2}, b_{Q_2}] = . . . = [W_{Q_N}, b_{Q_N}].

[Entries of Tables 4 and 5; captions follow below.] Methods Publication RobustNet [8] PintheMem [29] SAN-SAW [50] WildNet [35] DIGA [61] SPC [22] EVA02 - Frozen [14, 15] EVA02 + Rein DINOv2 - Frozen [46] DINOv2 + Rein CVPR 21 CVPR 22 CVPR 22 CVPR 22 CVPR 23 CVPR 23 arXiV 23 arXiV 23 - Citys 37.7 44.5 42.1 43.7 46.4 46.4 55.8 63.5 64.8 68.1 mIoU BDD Map Avg. 34.1 38.5 36.8 38.1 42.7 41.8 37.7 42.9 40.9 39.9 43.3 42.3 33.9 43.5 41.3 43.2 48.2 45.9 55.1 59.1 56.7 60.7 63.9 62.7 60.2 65.2 63.4 60.5 67.1 65.2 Methods IBN [47] DRPC [65] GTR [49] SAN-SAW [50] WildNet [35] HGFormer [12] Freeze Rein (Ours) Freeze Rein (Ours) BDD 48.6 49.9 50.8 53.0 50.9 61.5 57.8 64.1 63.4 65.0 mIoU Map Avg. 57.0 52.8 56.3 53.1 57.2 54.0 59.8 56.4 58.8 54.9 72.1 66.8 63.8 60.8 69.5 66.8 69.7 66.7 72.3 68.7 Mapillary [45]) and synthetic datasets (GTAV [53], Synthia [54]).
In detail, Cityscapes (denoted as Citys) is an autonomous driving dataset that contains 2975 training images and 500 validation images, each with the resolution of 2048 × 1024. BDD100K (shortened to BDD) and Mapillary (denoted by Map) offer 1,000 (1280 × 720) and 2,000 (1902 × 1080) validation images, respectively. GTAV, a synthetic dataset, presents 24,966 labeled images obtained from the game. Synthia, another synthetic dataset, provides 25,000 images created by photo-realistic rendering. Implementation details. We utilize the MMSegmentation [9] codebase for our implementation. For superior performance, mask2former [7], a widely-used segmentation head, is integrated with various VFMs that serve as the backbone. Additional experiments involving other decode heads are detailed in the supplementary material. For the training phase, the AdamW optimizer [42] is employed, setting the learning rate at 1e-5 for the backbone and 1e-4 for both the decode head and the proposed Rein. Aiming to efficient training process, we utilize a configuration of 40,000 iterations with a batch size of 4, and crop images to a resolution of 512 × 512. Our approach includes only basic data augmentation, following Mask2Former [7]. Thanks to our streamlined training configuration and reduced number of trainable parameters, Rein can fine-tune models like DINOv2-Large or EVA02-Large on a single RTX 3090Ti GPU within 12 hours for superior generalization ability. Low-rank token sequence. Recognizing the potential for information overlap among diverse learnable tokens, such as the high similarity between tokens representing a car’s headlight and a bicycle’s light, Rein adopts a strategy to generate a low-rank token sequence T as follows: A ∈ Rm×r , B ∈ Rr×c , ResNet50 [18] ResNet50 [18] ResNet50 [18] ResNet50 [18] ResNet101 [18] Swin-L [41] EVA02-L [14] EVA02-L [14] DINOv2-L [46] DINOv2-L [46] Trainable Parameters∗ 23.58M 23.58M 23.58M 23.58M 42.62M 196.03M 0.00M 2.99M 0.00M 2.99M Table 5. Performance Comparison of the Rein against other DGSS methods under Cityscapes → BDD100K (BDD) +Mapillary (Map) generalization. Models are fine-tuned on Cityscapes and tested on BDD and Map. The best results are highlighted. Table 4. Performance Comparison of the proposed Rein against other DGSS methods under GTAV + Synthia → Cityscapes (Citys) + BDD100K (BDD) +Mapillary (Map) generalization. Models are fine-tuned on GTAV and Synthia, tested on Cityscapes, BDD100K and Mapillary. The best results are highlighted. T i = Ai × Bi , Backbone (12) where c denotes the dimension of Ti , m is the length of sequence Ti , and r represents the rank, with r ≪ c. Here, matrices A and B are constructed as low-rank matrices. To reduce inference time, Rein can precompute and store T . By implementing this low-rank token sequence approach, Rein significantly reduces the number of parameter. 4. Experiments 4.1. Settings Visual Foundation Models. To thoroughly assess the influence of Visual Foundation Models (VFMs) within the context of DGSS, we analyze five distinct VFMs, each with different training strategies and datasets. Our selection includes CLIP [51], a language-image pre-training model; MAE [19], known for its masked pre-training approach; SAM [33], which leverages a large-scale segmentation dataset; EVA02 [14, 15] combines CLIP with masked image modeling; and DINOv2 [46], based on self-supervised pretraining with curated dataset. 
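As a brief aside before the remaining settings, the low-rank token sequence of Eq. (12) above amounts to only a few lines of code. The sketch below is an illustrative assumption (class name, attribute names, and initialization are ours), not the released implementation.

```python
import torch
import torch.nn as nn

class LowRankTokens(nn.Module):
    """Sketch of Eq. (12): the length-m token sequence T_i is parameterized as
    A_i x B_i with rank r << c, reducing token parameters from m*c to r*(m+c)."""

    def __init__(self, m=100, c=1024, r=16):
        super().__init__()
        self.A = nn.Parameter(torch.randn(m, r) * 0.02)  # A_i in R^{m x r}, illustrative init
        self.B = nn.Parameter(torch.randn(r, c) * 0.02)  # B_i in R^{r x c}

    def forward(self):
        # T_i = A_i x B_i; at inference the product can be precomputed and cached.
        return self.A @ self.B
```

With m = 100, c = 1024, and r = 16 (the values adopted in the ablations), each layer stores 16 × (100 + 1024) ≈ 18K token parameters instead of 100 × 1024 ≈ 102K.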
For balancing precision and efficiency, we mainly employ the ViT-Large architecture for these VFMs, except SAM, which utilizes a ViT-Huge image encoder, as described in its original paper [33]. We establish two fundamental baselines for VFMs: “Full”, where we fine-tune the entire network, and “Freeze”, in which all backbone parameters are fixed, with training solely on the segmentation head. More details about VFMs are available in the supplementary material. Datasets. We evaluate VFMs and proposed methods on both real-world datasets (Cityscapes [10], BDD100K [64], 4.2. Comparison with State-of-The-Art Methods In this section, we comprehensively evaluate Rein over five datasets within three generalization settings: GTAV → Citys + BDD + Map, GTAV + Synthia → Citys + BDD + Map, and Citys → BDD + Map. Rein is benchmarked against state-of-the-art (SOTA) methods, which can be classified into two groups, including domain generalized semantic segmentation (DGSS) methods[4, 8, 12, 22, 29, 35, 47, 49, 50, 61, 65, 69], and parameter-efficient fine-tuning (PEFT) approaches [5, 21, 23]. 6 Backbone EVA02 (Large) [14, 15] DINOv2 (Large) [46] Fine-tune Trainable road side. build. wall fence pole light sign vege terr. sky pers. rider car truck bus train moto. bicy. mIoU Method Params∗ Full 304.24M 89.3 46.9 89.9 47.7 45.6 50.1 56.8 42.2 88.8 48.4 89.9 75.8 49.0 90.5 45.3 69.2 55.9 44.4 55.1 62.2 Freeze 0.00M 93.1 52.7 88.0 47.4 31.1 41.7 46.0 39.6 85.7 41.4 89.5 67.5 39.7 89.0 47.0 72.8 46.3 19.2 35.2 56.5 Rein-core 52.84M 91.1 53.8 90.0 50.3 47.7 46.6 56.4 42.9 87.8 44.2 90.4 73.5 44.2 91.8 58.1 77.2 57.3 43.4 57.3 63.4 + Rein-link 59.33M 90.9 48.5 90.0 52.6 49.4 49.1 57.2 39.8 88.9 46.5 90.5 74.4 44.0 91.0 52.3 80.7 67.3 44.3 60.3 64.1 + Rein-share 5.02M 92.7 54.3 90.0 51.8 48.6 48.8 55.3 45.0 88.9 46.7 89.8 73.7 43.3 90.6 49.5 81.1 69.6 41.7 50.2 63.4 + Rein-lora 2.99M 91.7 51.8 90.1 52.8 48.4 48.2 56.0 42.0 89.1 44.1 90.2 74.2 47.0 91.1 54.5 84.1 78.9 47.2 59.4 65.3 Full 304.20M 89.0 44.5 89.6 51.1 46.4 49.2 60.0 38.9 89.1 47.5 91.7 75.8 48.2 91.7 52.5 82.9 81.0 30.4 49.9 63.7 Freeze 0.00M 92.1 55.2 90.2 57.2 48.5 49.5 56.7 47.7 89.3 47.8 91.1 74.2 46.7 92.2 62.6 77.5 47.7 29.6 47.2 61.1 Rein-core 52.84M 92.4 57.8 90.6 56.8 50.7 50.5 57.5 44.8 89.8 47.0 91.1 75.9 47.2 91.9 60.1 80.3 59.8 37.9 52.3 64.9 + Rein-link 59.33M 91.2 55.5 90.6 55.6 52.5 51.1 59.7 45.1 89.8 47.1 91.1 75.8 47.1 92.6 64.6 82.2 65.5 40.4 52.7 65.8 + Rein-share 5.02M 93.5 61.2 90.7 57.7 53.2 52.4 58.0 50.1 89.7 49.9 90.7 74.8 45.0 91.7 58.5 80.1 66.3 36.9 50.7 65.8 + Rein-lora 2.99M 92.4 59.1 90.7 58.3 53.7 51.8 58.2 46.4 89.8 49.4 90.8 73.9 43.3 92.3 64.3 81.6 70.9 40.4 54.0 66.4 Table 6. Ablation Study about Rein under Cityscapes → BDD100K generalization in terms of mIoU. Components are sequentially incorporated. To better illustrate the gains contributed by each component, we employ varying shades of yellow to demonstrate the relative performance of the Freeze and Rein methods. The best results across all methods are highlighted. Input RobustNet GTR WildNet Ours GT Citys BDD Map Figure 3. Qualitative Comparison under GTAV → Cityscapes (Citys) + BDD100K (BDD) + Mapillary (Map) generalization setting. Investigation and comparison of various VFMs. Our analysis of VFMs and proposed Rein in the GTAV → Citys + BDD + Map setting is presented in Tables 1 and 2. In this setup, models are fine-tuned using GTAV and evaluated on Cityscapes, BDD100K, and Mapillary. 
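For reference, the per-dataset mIoU values reported throughout can be reproduced with a standard confusion-matrix computation over the 19 urban-scene classes, averaged across the target datasets. The sketch below is a generic implementation of that metric, not the authors' evaluation code (the MMSegmentation toolbox used in this paper provides its own).

```python
import numpy as np

def update_confusion(conf_mat, pred, gt, num_classes=19, ignore_index=255):
    """Accumulate one image; `pred` and `gt` are integer label maps."""
    mask = gt != ignore_index
    idx = num_classes * gt[mask].astype(int) + pred[mask].astype(int)
    conf_mat += np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)
    return conf_mat

def mean_iou(conf_mat):
    """mIoU from a confusion matrix (rows: ground truth, columns: prediction)."""
    inter = np.diag(conf_mat)
    union = conf_mat.sum(0) + conf_mat.sum(1) - inter
    iou = inter / np.maximum(union, 1)
    return iou[union > 0].mean()
```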
Note that, due to the fixed and relatively small number of trainable parameters in the decode head (20.6M), the count of trainable parameters presented in the tables are focused solely on the backbone and the PEFT module. Our results, as detailed in Table 1, indicate that frozen VFMs significantly outperform previous DGSS methods without specialized design. Moreover, as shown in Table 2, VFMs with full parameter fine-tuning exhibit enhanced performance relative to their frozen counterparts. Remarkably, Rein achieves even superior generalization capabilities, surpassing the full parameter fine-tuning with merely an extra 1% of trainable parameters compared to the original backbone. Visual samples for qualitative comparison are given in Fig. 3. tation or consistency constraints, (e.g., AdvStyle, PASTA, and GTR), do not exhibit significant performance improvement. On the other hand, PEFT methods have demonstrated notable advancements. For instance, AdaptFormer outperforms the “Freeze” baseline using EVA02 as the backbone, while VPT shows improved performance over “Full” with DINOv2. Employing the same backbones (DINOv2 and EVA02), proposed Rein achieves superior performance and surpass previous DGSS and PEFT methods. Multi-source generalization. In this part, we compare Rein against other DGSS methods under GTAV + Synthia → Citys + BDD + Map setting, in which networks are finetuned using both GTAV and Synthia datasets, and tested on Cityscapes, BDD100K, and Mapillary. As shown in Table 4, we report the performance of Rein employing two VFMs, EVA02 and DINOv2. Our results demonstrate that Rein significantly surpasses existing DGSS methods by a large margin in average mIoU (from 45.9% to 65.2%). Cityscapes-to-other datasets generalization. The generalization from one real-world dataset to others is pivotal for practical applications in the field. To this end, we conduct experiments under the Citys → BDD + Map generalization setting. In this context, Rein, when coupled with the DINOv2-Large, demonstrates superior performance across all datasets. This underscores the effectiveness of Rein in Comparing Rein with SOTA on identical backbones. We conduct a comprehensive performance comparison of the proposed Rein against existing DGSS and PEFT methods under the GTAV → Citys + BDD + Map setting, as detailed in Table 3. Owing to the robust feature extraction capabilities inherent in VFMs, DGSS methods, which typically enhance generalizability through strong data augmen7 generalizing to diverse real-world scenarios.  $YHUDJHP,R8  4.3. Ablation Studies and Analysis In this subsection, we conduct extensive ablation studies within two settings: GTAV → Citys and GTAV → Citys + BDD + Map. For all experiments, Rein is applied to two VFMs, i.e., EVA02 and DINOv2. Analysis of the key components. Table 6 is dedicated to thoroughly examining the effectiveness of each component within Rein, focusing on how they influence recognition performance across various semantic categories. In the GTAV → Citys generalization setting, we sequentially incorporate different components of Rein and assess their impact on enhancing performance when applied to two VFMs, EVA02 and DINOv2. Interestingly, we observe that the “Freeze” baseline occasionally exhibit better recognition for specific categories, e.g., ‘road, sidewalk’, compared to the “Full” baseline. This suggests that VFMs lose some pre-training knowledge during fine-tuning, and “Freeze” helps to prevent. Similarly, our methods mitigate this knowledge forgetting. 
Furthermore, our methods show improved recognition capabilities for the majority of the 19 categories. For example, in recognizing ‘wall, motorcycle, bicycle’, our approach significantly outperforms both the “Full” and “Freeze” baselines. Overall, “Rein-core” boosts the average performance across 19 classes. Furthermore, “Rein-link”, as mentioned in Sec. 3.3, further boosts accuracy for certain objects, including ‘car, bus, train, motorcycle’, especially when DINOv2 serve as the backbone. The strategy of employing layer-shared MLP weights efficiently reduces the number of trainable parameters from 59.33M to 5.02M. Lastly, the incorporation of a low-rank token sequence not only further reduces the number of trainable parameters but also positively influences the performance of the model. Study on token length m. The core component of Rein is a set of learnable tokens T ∈ Rm×c . We explored various lengths m for the token sequence, ranging from 25 to 200. As demonstrated in Fig. 4, models with m = 100 and m = 150 both achieve a strong mIoU of 64.3% when utilizing DINOv2 as the backbone, and models with m = 100 achieve the optimal mIoU of 63.6% when using EVA02. We ultimately selected m = 100 as the most suitable parameter, which is consistently applied in subsequent experiments. Study on rank r As shown in Table 7, we turn our attention to the effect of rank r on model performance. When employing EVA02 as the backbone, the peak performance is achieved at r = 16. Similarly, with DINOv2 as the backbone, the optimal results are observed at r = 16 and r = 32. Consequently, unlike LoRA [21], we opt for a comparatively higher value of r = 16 for our model. Speed, memory, and storage. For practical applications, training speed, GPU memory usage, and model storage re- 7UDLQDEOH3DUDPHWHUV 0                 (9$ ',12Y      /HQJWKP       Figure 4. Ablation study on token length m. Rank r Params Citys BDD Map Avg. Citys DINOv2 BDD (Large) Map [46] Avg. EVA02 (Large) [14] 4 8 16 32 64 2.67M 2.77M 2.99M 3.42M 4.28M 62.6 63.5 65.3 63.8 63.4 58.5 58.9 60.5 60.5 60.2 63.7 63.8 64.9 64.5 64.3 61.6 62.1 63.6 62.9 62.7 65.8 66.1 66.4 66.1 66.4 60.2 60.3 60.4 60.7 61.0 65.2 65.1 66.1 65.9 65.0 63.7 63.9 64.3 64.3 64.1 Table 7. Ablation study on lora dim r. VFMs Method EVA02 (Large) DINOv2 (Large) Full Rein Full Rein Training Time 11.8 h 10.5 h 11.2 h 9.5 h GPU Storage Memory 15.9 GB 1.22 GB 12.5 GB 1.23 GB 14.7 GB 1.22 GB 10.0 GB 1.23 GB Table 8. Training Time, GPU Memory, and Storage. quirements are crucial. Lower training speeds and reduced GPU memory usage are beneficial for development of new methods and adaptation for new tasks. As shown in Table 8, compared to “Full” baseline, proposed Rein improves training speed and reduces GPU memory usage. Additionally, Rein marginally increases the storage needs by only 0.01GB. A significant advantage of Rein is that models trained under different settings can share the same backbone parameters. This means that for deployment in diverse tasks and settings, we can only swap the rein weights (0.01GB) and head weights (0.08GB), rather than all parameters. 5. Conclusions In this paper, we assess and harness Vision Foundation Models (VFMs) in the context of DGSS. Driven by the motivation that Leveraging Stronger pre-trained models and Fewer trainable parameters for Superior generalizability, we first investigate the performance of VFMs under diverse DGSS settings. 
Subsequently, we introduce a robust fine-tuning approach, namely Rein, to parameter-efficiently harness VFMs for DGSS. With few extra trainable parameters, Rein significantly enhances the generalization ability of VFMs, outperforming SOTA methods by a large margin. Rein can be seamlessly integrated as a plug-and-play adapter for existing VFMs based on the plain vision transformer architecture, improving generalization while making training efficient. Extensive experiments across various settings demonstrate the substantial potential of VFMs in the DGSS field, validating the effectiveness of the proposed Rein in parameter-efficiently harnessing VFMs for DGSS.

Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation
Supplementary Material

[Figure 5: four panels (EVA02 and DINOv2 backbones, each with the SemFPN and Mask2Former decode heads); each panel shows test mIoU (bars) and training loss (curve) for the Freeze, Rein, and Full configurations.]

Figure 5. The curves of training loss and test metrics display consistent trends across different VFMs and decode heads: intuitively, as trainable parameters increase from 0.00M (Freeze) → 2.53M (Rein) → 304.24M (Full), the training loss monotonically decreases, indicating that a greater number of trainable parameters indeed better fit the training dataset. However, the test metrics on the target dataset initially rise and then fall, forming an inverted U-shape. This pattern suggests that the "Full" baseline overfits the training data, leading to diminished test performance. These findings are aligned with our motivation that Leveraging Stronger pre-trained models and Fewer trainable parameters leads to Superior generalizability. The blue bar charts in the figure represent the average mIoU tested on the Cityscapes, BDD100K, and Mapillary datasets, while the yellow line denotes the training loss during fine-tuning on the GTAV dataset.

6. Discussion about Fewer Trainable Parameters

Fig. 5 showcases a consistent trend across four different configurations. As trainable parameters increase from 0.00M (Freeze) → 2.53M (Rein) → 304.24M (Full), the training loss monotonically decreases. However, the test metrics on the target dataset peak with Rein, which employs 2.53 million parameters and incurs a sub-optimal training loss. In contrast, the "Full" baseline, despite recording the lowest training loss, only achieves sub-optimal test performance, a clear indicator of overfitting when compared to other setups. This observation aligns with the conclusions in [16, 27], supporting our observation that leveraging Stronger pre-trained models and Fewer trainable parameters can lead to Superior generalizability.

Classical neural network theory [16, 17] points out that as model capacity increases, the empirical risk (or training risk) monotonically decreases, indicating an improved fit to the training data. Conversely, the true risk (or test risk) typically exhibits a "U-shaped" curve, initially decreasing and then increasing, a phenomenon known as overfitting. From a modern viewpoint, the scaling law [27] suggests that on a smaller fixed dataset, performance stops improving as model parameters increase, leading to overfitting. In the majority of general tasks, the practice of early stopping, based on evaluation data, can partly mitigate overfitting.
However, in the field of domain generalization, the unknown test data distribution makes acquiring a valid evaluation dataset unavailable. Moreover, fine-tuning datasets are often smaller compared to ImageNet [11] or LVD142M [46]. Hence, employing fewer trainable parameters emerges as a strategic approach to mitigate overfitting. In our main paper, extensive experiments comprehensively demonstrate Rein’s pivotal role in enhancing the generalization capabilities of VFMs. This enhancement may be attributed to two factors: 1) Rein’s improved fitting capability for VFMs, ensuring better alignment with training data; 2) Rein’s reduction of overfitting in VFMs during finetuning on smaller datasets, thus exhibiting enhanced generalization in testing. To delve into this, we analyze and compare the average training loss in the final 1000 iterations of the fine-tuning phase and their corresponding test metrics for various VFMs and decode heads. 7. Other decode head Our experiments on Rein employ the Mask2Former [7] decode head, which shares structures or core concepts with numerous methods in dense prediction tasks [2, 6, 37, 44, 67]. The universality of Mask2Former highlights the significance of our findings for a range of segmentation tasks, including instance and panoptic segmentation. Furthermore, to demonstrate Rein’s effectiveness in enhancing backbone generalization and its robustness across various decode heads, we conduct supplementary experiments using the popular SemFPN decode head [32], in the GTAV→ Cityscapes + BDD100K + Mapillary setting. As shown in Table 9, Rein surpasses the “Full” and “Freeze” baselines, employing 2.53 million trainable parameters within the backbone, while the SemFPN decode head comprises 1.63 million parameters. Owing to the ab1 Fine-tune Trainable mIoU Method Params∗ Citys BDD Map Avg. Full 304.24M 58.5 56.9 62.0 59.1 EVA02 [14, 15] Freeze 0.00M 54.1 51.2 54.3 53.2 (Large) Rein 2.53M 61.4 58.5 62.0 60.7 304.20M 61.2 55.9 62.5 59.9 Full DINOV2 [46] Freeze 0.00M 58.9 56.4 60.3 58.5 (Large) Rein 2.53M 63.6 59.0 63.7 62.1 7th, 11th, 15th, and 23rd layers directly into the decoding head. Backbone SAM. Aligning with the methodology described in the foundational paper [33], we employ the ViT-Huge architecture as our image encoder, making use of pre-trained weights that were trained on SA-1B [33] for a promptable segmentation task. The patch size of this model is set to 16 × 16, and each layer is designed to output features with a dimensionality of 1280, summing up to a total of 32 layers. The positional embeddings of the model are upscaled to a length of 1024 via bicubic interpolation. From this model, we extract features from the 7th, 15th, 23rd, and 31st layers and feed them into the decoder. Table 9. Performance Comparison with the proposed Rein with SemFPN [32] as Backbones under the GTAV → Cityscapes (Citys) + BDD100K (BDD) + Mapillary (Map) generalization setting. Models are fine-tuned on GTAV and tested on Cityscapes, BDD100K and Mapillary. The best results are highlighted. ∗ denotes trainable parameters in backbones. EVA02. In our approach, we adopt the largest scale configuration, EVA02-L, as our structural backbone, as suggested in the paper [14]. This particular model configuration determines its patch size as 16, with each layer producing feature maps of 1024 dimensions, across a total of 24 layers. 
EVA02 undergoes training through a combination of CLIP and Masked Image Modeling techniques on an aggregated dataset that includes IN-21K [11], CC12M [3], CC3M [57], COCO [39], ADE20K [70], Object365 [56], and OpenImages [34]. Mirroring the approach used in previous models, we upscale the positional embeddings to 1024 through bilinear interpolation, and the patch embed layer’s convolutional kernel size is augmented to 16 × 16 via bicubic interpolation. Features from the 7th, 11th, 15th, and 23rd layers are then processed through the decode head. sence of object queries in SemFPN, the “linking tokens to instance” mechanism, described in Sec.3.3, is not utilized, resulting in a reduction of Rein’s trainable parameters from 2.99 million to 2.53 million. When compared to the complete Rein configuration using the Mask2Former, using SemFPN achieves sub-optimal performance, evident in the 64.3% mIoU reported in Table 2 and 62.1% mIoU in Table 9, both implemented with DINOv2-Large. These findings guide our decision to focus on experiments involving Mask2Former in the main paper. 8. More details about VFMs CLIP. In our study, we utilize the ViT-Large architecture, setting the patch size to 16 × 16. Each layer of this architecture outputs features with a dimensionality of 1024, making use of the pre-trained weights from the foundational work [51]. Our model undergoes a pre-training phase through contrastive learning, employing publicly available image-caption data. This data is compiled through a blend of web crawling from select websites and integrating widely-used, existing image datasets. For the model’s pre-trained weights, which have a patch size of 14 × 14 and an original pre-training image size of 224 × 224, we adopt bilinear interpolation to upscale the positional embeddings to a length of 1024. Moreover, trilinear interpolation is utilized to enlarge the kernel size of the patch embed layer to 16 × 16. Our model comprises 24 layers, and the features extracted from the 7th, 11th, 15th, and 23rd layers (counting from the zeroth layer) are subsequently channeled into the decoding head. MAE. Employing the ViT-Large architecture, our model outputs features from each layer with a dimensionality of 1024, maintaining a patch size of 16 × 16. This model capitalizes on the pre-trained weights as delineated in the original work [19], and it undergoes self-supervised training using masked image modeling on ImageNet-1K. The architecture is composed of 24 layers, directing features from the DINOv2. Our choice of backbone for this study is DINOv2-L, which has been distilled from DINOv2-g. As noted in the original documentation, DINOv2-L occasionally surpasses the performance of DINOv2-g [46]. Sharing the same patch size, dimensionality, and layer count as EVA02-L, we apply equivalent processing to both the positional embeddings and patch embed layer of DINOv2-L. The features extracted from the 7th, 11th, 15th, and 23rd layers are subsequently fed into the decode head. DINOv2 is originally pretrained in a self-supervised fashion on the LVD-142M [46] dataset, following the procedures outlined in its respective paper. 9. Algorithm of Proposed Rein Algorithm 1 outlines the training procedure for Rein, wherein the weights conform to the constraints specified in Eq. (11). 
In this context, the variable c represents the number of channels in the feature maps of model M, N denotes the total number of layers within M, T indicates the overall number of training iterations, and r is defined as a hyperparameter that is considerably smaller than c. 2 With the rapid development of generative models research, we anticipate that our work could leverage highquality generated samples to approach the performance of models trained with supervision on real datasets. Furthermore, we are prepared to investigate how VFMs can enhance the performance of semantic segmentation models trained on real datasets under various adverse weather conditions or on special road types. Finally, further exploration is necessary to investigate how Rein can be extended to tasks such as instance segmentation, panoptic segmentation, open-vocabulary segmentation, and even object detection. Algorithm 1: Training process of Rein. Input: A sequence of input data and corresponding labels {(xi , yi ) | t ∈ N, 1 ≤ i ≤ Nd }; Pre-trained Vision Foundation Model M, consisting of a patch embed layer Lemb , and layers L1 , L2 , . . . , LN ; a decode head H; and a proposed module Rein R. The module Rein comprises the following matrices and vectors, initialized as specified: Ai ∈ Rm×r , Bi ∈ Rr×c , WTi ∈ Rc×c , Wfi ∈ Rc×c , ′ WQi ∈ Rc×c , c bTi ∈ R , bfi ∈ R c , ′ bQi ∈ Rc , uniformly initialized, uniformly initialized, uniformly initialized, initialized to zero, uniformly initialized, initialized to zero, initialized to zero, initialized to zero, for each i ∈ N, 1 ≤ i ≤ N . Additionally, ′ ′ WQ ∈ R3c ×c is uniformly initialized, and ′ bQ ∈ Rc is initialized to zero. Output: The optimized H and R. for t ← 1 to T do Get batch data:(x, y) f0 = Lemb (x) for i ← 1 to N do fi = Li (fi−1 ) Ti = Ai × Bi f ×T T Si = Sof tmax( i√c i ) ∆f¯i = Si (:, 2 : m) × [Ti (2 : m) × WTi + bTi ] ∆fi = (∆f¯i + fi ) × Wfi + bfi Qi = Ti × WQi + bQi fi = fi + ∆fi Ft ⊆ {f0 , f1 , . . . , fN } Calculate Qmax and Qavg by Eq. (9) Q = Concat([Qmax , Qavg , QN ]) × WQ + bQ y¯t = H(Ft , Q) Optimize H and R by Loss(ȳ, y) 10. Qualitative Results and Future works In this section, we showcase our prediction results across various datasets, including Cityscapes, BDD100K, and Mapillary, as depicted in Fig.6, Fig.8, and Fig.7. All models are trained on the GTAV dataset without any fine-tuning on real-world urban-scene datasets. Our method outshines other approaches in accuracy, especially in categories like traffic signs, bicycles, traffic lights, sidewalks, roads, and trucks, demonstrating high precision for both large objects and smaller targets. Notably, despite not specifically optimizing for night-time segmentation, Rein’s performance during night conditions is surprisingly high, almost akin to daytime performance, as illustrated in Fig.6. 3 Input RobustNet WildNet GTR Ours GT Figure 6. Prediction results of DINOv2+Rein on the BDD100K validation set. The model is fine-tuned exclusively on the GTAV dataset, without access to any real-world urban-scene datasets. 4 Input RobustNet WildNet GTR Ours GT Figure 7. Prediction results of DINOv2+Rein on the Cityscapes validation set. The model is fine-tuned exclusively on the GTAV dataset, without access to any real-world urban-scene datasets. 5 Input RobustNet WildNet GTR Ours GT Figure 8. Prediction results of DINOv2+Rein on the Mapillary validation set. The model is fine-tuned exclusively on the GTAV dataset, without access to any real-world urban-scene datasets. 6 References tation. 
[1] Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, et al. On the opportunities and risks of foundation models, 2022.
[2] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020.
[3] Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568, 2021.
[4] Prithvijit Chattopadhyay, Kartik Sarangmath, Vivek Vijaykumar, and Judy Hoffman. PASTA: Proportional amplitude spectrum training augmentation for syn-to-real domain generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19288–19300, 2023.
[5] Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, and Ping Luo. AdaptFormer: Adapting vision transformers for scalable visual recognition. Advances in Neural Information Processing Systems, 35:16664–16678, 2022.
[6] Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation. Advances in Neural Information Processing Systems, 34:17864–17875, 2021.
[7] Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1290–1299, 2022.
[8] Sungha Choi, Sanghun Jung, Huiwon Yun, Joanne T. Kim, Seungryong Kim, and Jaegul Choo. RobustNet: Improving domain generalization in urban-scene segmentation via instance selective whitening. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11580–11590, 2021.
[9] MMSegmentation Contributors. MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark. https://github.com/open-mmlab/mmsegmentation, 2020.
[10] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016.
[11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[12] Jian Ding, Nan Xue, Gui-Song Xia, Bernt Schiele, and Dengxin Dai. HGFormer: Hierarchical grouping transformer for domain generalized semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15413–15423, 2023.
[13] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[14] Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. EVA-02: A visual representation for neon genesis. arXiv preprint arXiv:2303.11331, 2023.
[15] Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. EVA: Exploring the limits of masked visual representation learning at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19358–19369, 2023.
[16] Stuart Geman, Elie Bienenstock, and René Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1–58, 1992.
[17] Trevor Hastie, Robert Tibshirani, and Jerome H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[19] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022.
[20] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019.
[21] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
[22] Wei Huang, Chang Chen, Yong Li, Jiacheng Li, Cheng Li, Fenglong Song, Youliang Yan, and Zhiwei Xiong. Style projected clustering for domain generalized semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3061–3071, 2023.
[23] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In European Conference on Computer Vision, pages 709–727. Springer, 2022.
[24] Xueying Jiang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Domain generalization via balancing training difficulty and model capability. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18993–19003, 2023.
[25] Mengmeng Jing, Xiantong Zhen, Jingjing Li, and Cees G. M. Snoek. Order-preserving consistency regularization for domain adaptation and generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18916–18927, 2023.
[26] Juwon Kang, Sohyun Lee, Namyup Kim, and Suha Kwak. Style Neophile: Constantly seeking novel styles for domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7130–7140, 2022.
[27] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
[28] Hyeonseong Kim, Yoonsu Kang, Changgyoon Oh, and Kuk-Jin Yoon. Single domain generalization for LiDAR semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17587–17598, 2023.
[29] Jin Kim, Jiyoung Lee, Jungin Park, Dongbo Min, and Kwanghoon Sohn. Pin the memory: Learning to generalize semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4350–4360, 2022.
[30] Namyup Kim, Taeyoung Son, Jaehyun Pahk, Cuiling Lan, Wenjun Zeng, and Suha Kwak. WEDGE: Web-image assisted domain generalization for semantic segmentation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9281–9288. IEEE, 2023.
[31] Sunghwan Kim, Dae-hwan Kim, and Hoseong Kim. Texture learning domain randomization for domain generalized segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 677–687, 2023.
[32] Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Dollár. Panoptic feature pyramid networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6399–6408, 2019.
[33] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4015–4026, 2023.
[34] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. International Journal of Computer Vision, 128(7):1956–1981, 2020.
[35] Suhyeon Lee, Hongje Seong, Seongwon Lee, and Euntai Kim. WildNet: Learning domain generalized semantic segmentation from the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9936–9946, 2022.
[36] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.
[37] Feng Li, Hao Zhang, Huaizhe Xu, Shilong Liu, Lei Zhang, Lionel M. Ni, and Heung-Yeung Shum. Mask DINO: Towards a unified transformer-based framework for object detection and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3041–3050, 2023.
[38] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021.
[39] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V, pages 740–755. Springer, 2014.
[40] Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Tam, Zhengxiao Du, Zhilin Yang, and Jie Tang. P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 61–68, 2022.
[41] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
[42] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[43] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pages 116–131, 2018.
[44] Rodrigo Marcuzzi, Lucas Nunes, Louis Wiesmann, Jens Behley, and Cyrill Stachniss. Mask-based panoptic LiDAR segmentation for autonomous driving. IEEE Robotics and Automation Letters, 8(2):1141–1148, 2023.
[45] Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulò, and Peter Kontschieder. The Mapillary Vistas dataset for semantic understanding of street scenes. In Proceedings of the IEEE International Conference on Computer Vision, pages 4990–4999, 2017.
[46] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
[47] Xingang Pan, Ping Luo, Jianping Shi, and Xiaoou Tang. Two at once: Enhancing learning and generalization capacities via IBN-Net. In Proceedings of the European Conference on Computer Vision (ECCV), pages 464–479, 2018.
[48] Xingang Pan, Xiaohang Zhan, Jianping Shi, Xiaoou Tang, and Ping Luo. Switchable whitening for deep representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1863–1871, 2019.
[49] Duo Peng, Yinjie Lei, Lingqiao Liu, Pingping Zhang, and Jun Liu. Global and local texture randomization for synthetic-to-real semantic segmentation. IEEE Transactions on Image Processing, 30:6594–6608, 2021.
[50] Duo Peng, Yinjie Lei, Munawar Hayat, Yulan Guo, and Wen Li. Semantic-aware domain generalized segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2594–2605, 2022.
[51] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[52] Nikhil Reddy, Abhinav Singhal, Abhishek Kumar, Mahsa Baktashmotlagh, and Chetan Arora. Master of all: Simultaneous generalization of urban-scene segmentation to all adverse weather conditions. In European Conference on Computer Vision, pages 51–69. Springer, 2022.
[53] Stephan R. Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. Playing for data: Ground truth from computer games. In Computer Vision – ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part II, pages 102–118. Springer, 2016.
[54] German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M. Lopez. The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[55] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
[56] Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8430–8439, 2019.
[57] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.
[58] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[59] Zhiqiang Tang, Yunhe Gao, Yi Zhu, Zhi Zhang, Mu Li, and Dimitris N. Metaxas. CrossNorm and SelfNorm for generalization under distribution shifts. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 52–61, 2021.
[60] Jan-Aike Termöhlen, Timo Bartels, and Tim Fingscheidt. A re-parameterized vision transformer (ReVT) for domain-generalized semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4376–4385, 2023.
[61] Wei Wang, Zhun Zhong, Weijie Wang, Xi Chen, Charles Ling, Boyu Wang, and Nicu Sebe. Dynamically instance-guided adaptation: A backward-free approach for test-time domain adaptive semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24090–24099, 2023.
[62] Zhenyao Wu, Xinyi Wu, Xiaoping Zhang, Lili Ju, and Song Wang. SiamDoGe: Domain generalizable semantic segmentation using siamese network. In European Conference on Computer Vision, pages 603–620. Springer, 2022.
[63] Qi Xu, Liang Yao, Zhengkai Jiang, Guannan Jiang, Wenqing Chu, Wenhui Han, Wei Zhang, Chengjie Wang, and Ying Tai. DIRL: Domain-invariant representation learning for generalizable semantic segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2884–2892, 2022.
[64] Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell. BDD100K: A diverse driving dataset for heterogeneous multitask learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2636–2645, 2020.
[65] Xiangyu Yue, Yang Zhang, Sicheng Zhao, Alberto Sangiovanni-Vincentelli, Kurt Keutzer, and Boqing Gong. Domain randomization and pyramid consistency: Simulation-to-real generalization without accessing target domain data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2100–2110, 2019.
[66] Elad Ben Zaken, Shauli Ravfogel, and Yoav Goldberg. BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. arXiv preprint arXiv:2106.10199, 2021.
[67] Bowen Zhang, Zhi Tian, Quan Tang, Xiangxiang Chu, Xiaolin Wei, Chunhua Shen, et al. SegViT: Semantic segmentation with plain vision transformers. Advances in Neural Information Processing Systems, 35:4971–4982, 2022.
[68] Yuhang Zhang, Shishun Tian, Muxin Liao, Guoguang Hua, Wenbin Zou, and Chen Xu. Learning shape-invariant representation for generalizable semantic segmentation. IEEE Transactions on Image Processing, 2023.
[69] Zhun Zhong, Yuyang Zhao, Gim Hee Lee, and Nicu Sebe. Adversarial style augmentation for domain generalized urban-scene segmentation. Advances in Neural Information Processing Systems, 35:338–350, 2022.
[70] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ADE20K dataset. International Journal of Computer Vision, 127:302–321, 2019.
[71] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022.