Fine-tune vision foundation model for crack segmentation in civil infrastructures

K. Ge1, C. Wang2, T.Y. Guo1*
1 Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
2 Department of Civil Engineering, Tsinghua University, Beijing, China

Abstract
Large-scale foundation models have become the mainstream method in the field of deep learning, while in civil engineering the scale of AI models is strictly limited. In this work, a vision foundation model is introduced for crack segmentation. Two parameter-efficient fine-tuning methods, adapter and low-rank adaptation, are adopted to fine-tune the foundation model in the field of semantic segmentation, the Segment Anything Model (SAM). The fine-tuned model, CrackSAM, is much larger than all existing crack segmentation models but shows excellent performance. To test the zero-shot performance of the proposed method, two unique datasets related to road and exterior wall cracks are collected, annotated and open-sourced, totalling 810 images. Comparative experiments are conducted with twelve mature semantic segmentation models. On datasets with artificial noise and on previously unseen datasets, the performance of CrackSAM far exceeds that of all state-of-the-art models. CrackSAM exhibits remarkable superiority, particularly in challenging conditions such as dim lighting, shadows, road markings, construction joints, and other interference factors. Such cross-scenario results demonstrate the outstanding zero-shot capability of foundation models and provide new ideas for the development of vision models in civil engineering.

Keywords: Crack segmentation, Parameter-efficient fine-tuning, Vision Transformer, Transfer learning, Zero-shot

* Corresponding author: Y. T. Guo (guoyutao@sz.tsinghua.edu.cn)
Emails: K. Ge (gk22@mails.tsinghua.edu.cn),

1. Introduction
Cracks are a common form of damage in engineering structures; they may reduce the load-bearing capacity and stiffness of the structures and lead to the corrosion of internal reinforcement, thereby reducing durability and even causing structural failure [1]. Therefore, identifying and analysing cracks is important in structural health monitoring (SHM). Traditionally, the detection of cracks is carried out manually, which is costly, subjective and inefficient. Emerging SHM methods have paved the way for more automated, efficient and intelligent monitoring. Multiple non-destructive monitoring methods are widely used in crack analyses, such as contact-based technologies like sensors [2] and contactless methods including ultrasound [3] and infrared thermography [4]. A prominent technology is the unmanned aerial vehicle (UAV). Equipped with devices such as high-resolution cameras, radar, and infrared cameras [5], UAVs have been applied in a series of crack assessment tasks [6][7][8].

The aim of crack segmentation is to classify crack images at the pixel level to distinguish cracks from the background, which entails image processing technologies. More than a decade ago, the main approaches for crack segmentation were filters [9], wavelet transforms [10], and other operations to denoise crack images. In recent years, deep learning has achieved rapid progress and is widely employed in computer vision (CV) tasks. Therefore, neural networks have become the mainstream approach for crack segmentation since 2016 [11].
These models can be divided into two categories from the architecture perspective: CNN-based networks and Transformer-based [12] networks. The former can be seen as a series of stacked local filters, enhancing the model's receptive field through multi-scale feature fusion. The latter effectively addresses the challenge of capturing long-distance dependencies through the attention mechanism.

A problem that cannot be ignored still exists: a crack segmentation model trained on a certain dataset may not be generalizable to other datasets. Pre-trained models typically absorb the biases of the training set and tend to overfit it, and thus perform poorly on unseen datasets. Chen et al. noticed the problem of cross-scenario/scale generalizability of defect detection models, where a pre-trained crack segmentation model is not readily generalizable to sophisticated defect types and large-scale images, with the intersection over union (IoU) dropping from 46.9% to 14.2% and 1.3%, respectively [13]. Crack segmentation is a highly class-imbalanced classification problem and is affected by many interference factors [14], including very fine cracks, low resolution, different shooting distances, blurring, shadows, occlusions, traffic and pedestrian flows, and different working conditions. However, the crack images commonly used for training are much cleaner. These factors seriously hinder the deployment of pre-trained models to identify cracks in engineering practice.

Inspired by the excellent generalization and zero-shot performance of large-scale foundation models, in this paper parameter-efficient fine-tuning (PEFT) technologies are applied to introduce the vision foundation model SAM into crack segmentation. Two datasets with severe interference under different working conditions are collected. Evaluation of the proposed CrackSAM focuses on the inference performance on datasets with artificial noise and the zero-shot performance on previously unseen datasets. There have been relatively few studies on the zero-shot capability of pre-trained crack segmentation models on different datasets. However, this is precisely the aspect of greatest concern in engineering; without it, models remain at the research stage and cannot be practically implemented. To the best of the authors' knowledge, this is the first work to fine-tune a large vision foundation model for crack segmentation. The contributions of this paper are summarized as follows:
• A road crack dataset and an exterior wall crack dataset are collected using a mobile phone and a UAV.
• Two PEFT methods, adapter [15] and low-rank adaptation (LoRA) [16], are employed to adapt SAM for crack segmentation.
• The fine-tuned CrackSAM exhibits outstanding performance on datasets with artificial noise and previously unseen datasets when compared to twelve state-of-the-art (SOTA) models. Excellent zero-shot identification of cracks is achieved without any additional training.
• The collected and labelled datasets, models, and pre-trained weights will be publicly available after acceptance.

The rest of this paper is arranged as follows. Section 2 provides the relevant studies related to this paper. Section 3 describes the preparation of datasets. Section 4 introduces the model architecture and fine-tuning methods. Section 5 presents the ablation studies and experiment results. Section 6 compares the proposed method with twelve SOTA methods.

2. Relevant work
2.1 Models for crack segmentation
Semantic segmentation tasks lay great emphasis on global context information.
Consequently, among CNN-based crack segmentation architectures, the UNet architecture [17] with skip connections and pyramid architectures fusing multi-scale feature maps, such as FPN [18], PSPNet [19], and DeepLabV3+ [20], have achieved excellent performance in various crack segmentation tasks. Zhang et al. [21] used an improved UNet architecture to integrate high-level features and shallow features of cracks. Ren et al. [22] employed methods such as dilated convolutions, spatial pyramid pooling, and skip connections to achieve feature aggregation and resolution reconstruction in crack segmentation. Dais et al. [23] combined UNet and FPN, integrating multiple backbones such as VGG, ResNet, MobileNet, etc., to conduct comparative experiments in crack segmentation for masonry structures. Due to the dominance of the Transformer architecture in recent years in CV tasks, Vision Transformer (ViT) [24], Swin Transformer [25], SegFormer [26] and many other Transformer-based architectures have been widely used in crack segmentation tasks. Shamsabadi et al. [27] used a TransUNet model with a hybrid CNN-ViT backbone to segment cracks in a dataset with very limited semantic information. The performance was superior to CNN-based UNet and DeepLabV3+, while also exhibiting stronger noise robustness. Guo et al. [28] employed an encoder-decoder architecture with a Swin Transformer backbone, achieving superior segmentation results on road surface cracks compared to models with UNet and ResNet as backbones. However, small models trained on limited datasets still face issues of insufficient recognition capabilities and poor generalization.

2.2 Vision foundation models
Foundation models can be regarded as a new general paradigm of AI, and have stronger capabilities than traditional models. They are usually based on Transformer architectures with billions of parameters. The realization of foundation models entails large-scale pre-training on huge datasets using massive GPUs [29]. In the field of CV, SAM [30] is a recently proposed foundation model for semantic segmentation, which is trained on over 1 billion masks from 11 million images. The large-scale pre-training endows SAM with the zero-shot ability to respond to various downstream tasks. SegGPT [31] is a similar work, which is capable of segmenting everything in context with one single model. SEEM [32] achieves semantic, instance, and panoptic segmentation through diverse prompts such as textual, visual, and referring region prompts. DINOv2 [33] is a vision foundation model trained in a self-supervised manner. The model learns directly from images without the need for text guidance. Its backbone can be employed in various downstream tasks such as image classification, instance recognition, semantic segmentation, and depth estimation.

However, directly applying vision foundation models for crack segmentation is not feasible. Ahmadi et al. [34] tried to directly utilize SAM for crack segmentation and found that SAM did not perform well on spalled cracks. Moreover, the masks of cracks cannot be directly obtained, as shown in Figure 1. Consequently, it is necessary to fine-tune SAM to learn the specific semantics of cracks.

Figure 1 Example image of directly applying SAM for crack segmentation.

2.3 Transfer learning and zero-shot learning
Transfer learning is a technique that involves leveraging knowledge learned from one task or domain to improve the performance of another related task or domain.
It involves transferring the learned representations, features, and patterns acquired during the training process of a source task to a target task with limited labelled data. Transfer learning has been widely practiced in crack segmentation tasks. Zhou et al. [35] utilized the pre-trained weights from ImageNet-1k [36] to initialize the weights of the backbone in order to alleviate the data dependency of the Swin Transformer. The parameters of the backbone are frozen for the first 50 epochs and unfrozen for the last 50 epochs. Lau et al. [37] compared such a "two-stage" training strategy with the standard training procedure and found that the UNet trained with the former method performed better. Gao et al. [38] established a hierarchical transfer learning architecture, where the pre-trained model for localization and segmentation tasks directly inherits the well-trained backbone used for classification tasks.

Zero-shot learning is a subfield of transfer learning. The definition of zero-shot learning is to classify test instances into unseen classes [39]. In crack segmentation tasks, there may be confusing interference from objects that have never been seen before in practical engineering applications, and the pre-trained model needs to successfully classify this type of semantic information. Therefore, in this work, zero-shot capability is evidenced by the ability to identify cracks under complex working conditions beyond the training set.

In the field of AI-aided SHM, traditional transfer learning methods can be summarized as initializing the backbone of the model with high-quality pre-trained weights (Figure 2(a)). During training, it is common in some studies to only fine-tune downstream networks (Figure 2(b)) or to freeze certain layers initially and then unfreeze and train all layers together. Such a full-training process requires significant GPU memory and sufficient data support. Nevertheless, the scarcity of high-quality annotated datasets and limited computational resources make full training of a foundation model impractical in civil engineering. For this reason, PEFT methods must be adopted.

Figure 2 Fine-tune methods for pre-trained models. (a) Full fine-tuning. (b) Fine-tune downstream networks. (c) PEFT.

2.4 PEFT
PEFT is a lighter but more efficient fine-tuning method, specially designed for Transformer architectures. With fewer resources and training iterations, PEFT can preserve the original knowledge of pre-trained foundation models, avoid catastrophic forgetting, and reduce overfitting. By introducing a few trainable parameters (less than 5%) that do not exist in the original network, the pre-trained foundation model can be adapted to downstream tasks. Compared to other methods that train some or all of the layers, PEFT is essentially "delta-tuning" [40], as shown in Figure 2(c). For example, prefix-tuning achieves performance comparable to full fine-tuning in the full-data setting by adding a trainable "soft prompt" to all the key and value matrices in Transformer layers [41], and even outperforms full fine-tuning in low-data settings. Prompt-tuning, as a simpler version of prefix-tuning, just adds the soft prompt to the input embeddings [42].

PEFT of SAM has already been applied in medical image segmentation. Chen et al. [43] proposed SAM-Adapter, which is demonstrated to be effective in camouflaged object detection, shadow detection, and polyp segmentation. Wu et al.
[44] also fine-tuned SAM with an adapter-based strategy on 19 medical image segmentation tasks, including CT, MRI, ultrasound, fundus, and dermoscopic images. Zhang et al. [45] applied a LoRA-based strategy to fine-tune SAM for multi-organ segmentation on the Synapse dataset, which is on par with the SOTA method. However, similar methods have not yet been applied in crack segmentation tasks. The most widely utilized PEFT technologies, adapter and LoRA, will be involved in this work.

3. Data preparation
3.1 Dataset for pre-training
The large labelled crack segmentation dataset collected by khanhha [46] is utilized in this study. The khanhha dataset is a union of multiple open-source sub-datasets, including CRACK500 [47], GAPs384 [48], CFD [49], AEL [50], CrackTree200 [51], CrackForest [49], and DeepCrack [52]. It has 9603 images for training and 1695 images for testing, with a resolution of 448×448. The dataset comes from various sources, including road surfaces, pavements, walls, bridges, and so on. The richness of sources allows the model trained on this dataset to perceive cracks of different working conditions and scales, thus having a certain degree of generalization. Of course, the inconsistent annotation thickness across sub-datasets can also have a certain negative impact on the model. Note that the annotation of the sub-dataset CrackTree200 is too fine compared to the other datasets, with crack mask widths of only one pixel. This inconsistent labelling strategy made it difficult for previous studies [53][54] to identify cracks in the CrackTree200 dataset, so this dataset is relabelled manually by experts. The relabelled CrackTree200 dataset will also be made available together with the following collected datasets.

3.2 Datasets for zero-shot
Two unique crack datasets are captured for evaluating the model's zero-shot performance, namely Road420 and Facade390. The pixel-level binary masks of all the collected datasets are obtained by expert annotation. All captured images and masks are converted into RGB and grayscale pictures, respectively, and down-sampled to a size of 448×448.

3.2.1 Road420
Road420 consists of 420 images of asphalt concrete and cement concrete road surfaces with cracks. The pictures contain a lot of interfering information, such as shadows, occlusions, road signs, vehicles, manhole covers, people, leaves, etc. Some small cracks become difficult for the naked eye to recognize after down-sampling. Some of the pictures are taken at night. The image semantics of some interfering factors have never appeared in the khanhha dataset, and the deliberate introduction of such interference makes zero-shot learning very challenging on this dataset. All the images are captured with an iPhone 14 Pro. Some representative sample images are shown in Figure 3.

Figure 3 Sample images of Road420.

3.2.2 Facade390
Facade390 is composed of cracks on the exterior walls and columns of buildings captured by UAV. Because the UAV must maintain a safe distance from the building during operation and may drift during hovering, the captured images can be blurry and some fine cracks may not be clearly visible. The identification of these cracks is susceptible to interference from various sources such as wall stains, peeling, water traces, shadows, paint, vegetation, construction joints, and other extraneous factors. The cracks in Facade390 are generally not structural cracks, but may bring risks such as water seepage. The UAV employed in this study is the DJI Mini 3. Some representative sample images are presented in Figure 4.

Figure 4 Sample images of Facade390.
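As a reference for how the raw captures are standardized, the preprocessing described at the beginning of Section 3.2 (3-channel images, grayscale binary masks, down-sampling to 448×448) can be sketched roughly as follows. This is an illustrative sketch assuming OpenCV; the function name and paths are hypothetical and not the authors' actual pipeline.

```python
import cv2

TARGET = 448  # every image and mask is down-sampled to 448x448

def preprocess_pair(img_path, mask_path, out_img, out_mask):
    """Resize an image/mask pair to 448x448; images stay 3-channel, masks grayscale."""
    img = cv2.imread(img_path, cv2.IMREAD_COLOR)       # BGR; convert to RGB when feeding a model
    mask = cv2.imread(mask_path, cv2.IMREAD_GRAYSCALE)
    img = cv2.resize(img, (TARGET, TARGET), interpolation=cv2.INTER_AREA)
    # nearest-neighbour keeps the annotation strictly binary after resizing
    mask = cv2.resize(mask, (TARGET, TARGET), interpolation=cv2.INTER_NEAREST)
    mask = (mask > 127).astype("uint8") * 255           # re-binarize to a 0/255 grayscale label
    cv2.imwrite(out_img, img)
    cv2.imwrite(out_mask, mask)
```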
3.2.3 Concrete3k
Concrete3k is a large ready-made dataset with 3000 image-label pairs of concrete cracks contributed by [53][55]; it is also leveraged for zero-shot evaluation.

4. Methodology
4.1 Overall architecture

Figure 5 The original overall architecture of SAM.

SAM is composed of three key components: an image encoder, a prompt encoder and a mask decoder, as shown in Figure 5. The latter two parts are much lighter than the image encoder.

4.1.1 ViT block
The ViT block is made up of two parts: window attention and MLP, as shown in Figure 6.

Figure 6 The architecture of ViT block.

First, the input patches x_p ∈ R^(H×W×C) undergo a window partition with window size w and are separated into N non-overlapping windows x ∈ R^(N×w×w×C), where N = HW/w^2. The window size is set to 14 here. Then, multi-head self-attention is carried out on x: x is divided along the channel dimension and fed into multiple attention heads. In the i-th head, the query Q_i, key K_i and value V_i are obtained via learnable linear layers:

Q_i / K_i / V_i = W_{Q_i/K_i/V_i} x_i + b_{Q_i/K_i/V_i}    (1)

A dot product is computed to calculate the similarity scores between Q and K, which is then divided by the square root of the dimension of K for scaling. After a learnable positional embedding is added, the result is normalized by a softmax activation function. The resulting attention weights are multiplied with V to obtain the output of each head:

\mathrm{Atten}_i = \mathrm{softmax}\left(\frac{Q_i K_i^{T}}{\sqrt{d_k}} + pos\right) V_i    (2)

The outputs of all heads are concatenated and the output of the attention layer x_a is obtained through a linear layer:

x_a = \mathrm{concat}(\mathrm{Atten}_1, \mathrm{Atten}_2, \ldots, \mathrm{Atten}_n) W + b    (3)

Finally, the windows of x_a are reorganized back into the original shape x_o ∈ R^(H×W×C).

The MLP is a multi-layer fully connected network that expands the original dimension four times and compresses it back, with a GELU activation [56] in between. The inputs of the window attention and the MLP are normalized through layer normalization [57]. At the same time, residual connections [58] are added to each module to achieve stable propagation of gradients in deep networks.
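A minimal PyTorch sketch of the window attention described above (Eqs. (1)-(3)) is given below for illustration. It is not the authors' implementation: the relative positional term pos is simplified to a single learnable additive bias, H and W are assumed divisible by the window size, and all names are illustrative.

```python
import torch
import torch.nn as nn

def window_partition(x, w):
    """(B, H, W, C) -> (B*H/w*W/w, w, w, C); assumes H and W are divisible by w."""
    B, H, W, C = x.shape
    x = x.view(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(-1, w, w, C)

def window_unpartition(windows, w, H, W):
    """Inverse of window_partition."""
    B = windows.shape[0] // (H // w * W // w)
    x = windows.view(B, H // w, W // w, w, w, -1).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(B, H, W, -1)

class WindowAttention(nn.Module):
    def __init__(self, dim, num_heads, window_size=14):
        super().__init__()
        self.num_heads, self.w = num_heads, window_size
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)       # W_Q, W_K, W_V in one layer (Eq. 1)
        self.proj = nn.Linear(dim, dim)          # concatenate heads + linear (Eq. 3)
        # simplified learnable additive position term standing in for "pos" in Eq. (2)
        self.pos = nn.Parameter(torch.zeros(num_heads, window_size ** 2, window_size ** 2))

    def forward(self, x):                        # x: (B, H, W, C)
        B, H, W, C = x.shape
        win = window_partition(x, self.w).reshape(-1, self.w * self.w, C)
        Bn, N, _ = win.shape
        qkv = self.qkv(win).reshape(Bn, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)     # each: (Bn, heads, N, C/heads)
        attn = (q @ k.transpose(-2, -1)) * self.scale + self.pos   # Eq. (2)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(Bn, N, C)
        out = self.proj(out)
        return window_unpartition(out.reshape(-1, self.w, self.w, C), self.w, H, W)
```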
4.1.2 Image encoder
The image encoder is an MAE [59] pre-trained ViT, and consists of a patch embedding layer, a learnable positional embedding, a ViT backbone and a neck. The patch embedding layer converts the input image into a small-sized, high-dimensional feature map through a 16×16 convolution with a stride of 16. The absolute positional embedding is added to each position of the feature map. The backbone is a stack of ViT blocks. Based on the size of the backbone, there are three options: ViT-H, ViT-L, and ViT-B. The embedding dimension, number of blocks and number of attention heads are 768, 12 and 12 for ViT-B; 1024, 24 and 16 for ViT-L; and 1280, 32 and 16 for ViT-H. In the neck, the image embedding is fed to a point-wise convolution and a 3×3 convolution to reduce the dimension to 256. Each convolution is immediately followed by a layer normalization. The output of the image encoder is a 16× downscaled image embedding of the original size with 256 dimensions.

4.1.3 Prompt encoder
The prompt encoder receives sparse (points, boxes and text) or dense (masks) prompts. However, in this task the segmentation object (crack) is fixed, so the prompt input is simplified to None. The default is a learnable embedding, which is added to each position of the image embedding.

4.1.4 Mask decoder
A set of learnable output tokens is concatenated with the sparse prompt embeddings, and the obtained tokens as well as the image embedding and its positional embedding are fed into a 2-layer two-way Transformer. The tokens are learnable vectors and interact with image features through the attention mechanism. Details of the two-way Transformer can be found in the original code. Within the two-way Transformer, the following steps are mainly performed: first, an 8-head self-attention is conducted on the input tokens; then, cross attention is performed from tokens to image embedding, where tokens are regarded as the query and the image embedding is treated as the key and value; after that, tokens are updated through an MLP with a hidden dimension of 2048 and a ReLU activation; at last, an image-to-token cross attention is carried out. The attention layers consist of 8 heads each. The channel dimension for the query, key, and value in the self-attention layers is set to 256, while the dimension for the cross-attention layers is set to 128. After each attention and MLP layer, a residual connection and a layer normalization are added. Before being fed into each attention layer, the original prompt tokens and image positional embedding are re-added to the queries and keys for a better memory of the prompt tokens' information. The updated tokens and image embedding output by the first layer of the two-way Transformer serve as the input of the second layer.

A two-layer transposed convolution with a stride of 2 and a kernel size of 2×2 up-samples the 1/16-size image embedding to 1/4 size. The channels of the transposed convolution layers are 64 and 32. A GELU activation is performed after each convolution and a layer normalization is added between them. The tokens are updated through a final cross attention with the image embedding. The dimension of the tokens is transformed to 32 through a 3-layer MLP, and the output of the MLP acts as a linear classifier with a shape of (num_class, 32), which predicts the mask foreground probability at each position of the image embedding. The shape of the image embedding is (32, H/4, W/4). Finally, the output low-resolution masks with a shape of (num_class, H/4, W/4) are obtained by a point-wise product. High-resolution masks can be derived through bilinear interpolation of the low-resolution masks.

Because the main parameters are concentrated in the ViT blocks, PEFT is conducted on them. The lightweight prompt encoder and mask decoder are also fine-tuned together [45].

Figure 7 The architecture of the proposed CrackSAM.

The general architecture of the fine-tuned SAM, CrackSAM, is illustrated in Figure 7. Note that in this architecture the bilinear interpolation of the low-resolution masks is placed before feeding them into the final classifier, to obtain more accurate masks of cracks. This change is also made to the other comparative models in subsequent sections.
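The parameter-freezing scheme of Figure 7 can be sketched as follows. This is an assumption-laden illustration, not the authors' code: it presumes a SAM-style model object with image_encoder, prompt_encoder and mask_decoder attributes, and that the injected delta parameters carry "adapter" or "lora" in their parameter names.

```python
def mark_trainable(sam_model):
    """Freeze the heavy image encoder; keep only the delta (adapter/LoRA) weights,
    the prompt encoder and the mask decoder trainable, as in Figure 7."""
    for name, p in sam_model.image_encoder.named_parameters():
        # delta parameters are assumed to contain "adapter" or "lora" in their names
        p.requires_grad = ("adapter" in name) or ("lora" in name)
    for p in sam_model.prompt_encoder.parameters():
        p.requires_grad = True
    for p in sam_model.mask_decoder.parameters():
        p.requires_grad = True
    trainable = sum(p.numel() for p in sam_model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in sam_model.parameters())
    print(f"trainable: {trainable / 1e6:.1f}M / total: {total / 1e6:.1f}M")
```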
4.2 Adapter

Figure 8 Fine-tune strategy of adapter.

The simple design of the adapter makes it the most commonly used method in the field of PEFT. There are many variants of adapter, which can be either sequential or parallel. In this work, an adapter is sequentially inserted behind the attention layer and another is inserted in parallel with the MLP [44], as shown in Figure 8. The GELU activation function is employed in the middle, as it provides smoother gradients compared to ReLU.

An adapter can be regarded as a smaller MLP designed with a bottleneck structure to reduce the parameter count. Initially, the adapter employs a down-projection linear layer with parameters W_down ∈ R^(d×m) to project the original d-dimensional features to a smaller dimension m. Subsequently, a non-linear activation function is applied, followed by an up-projection layer with parameters W_up ∈ R^(m×d) to restore the features to the d-dimensional space. Notably, a residual connection is incorporated in this process. The middle dimension m is constrained such that m ≪ d. Denoting the input as x and the output after adaptation as x', the transformation is formally expressed as Eq. (4):

x' = W_{up}\,\mathrm{GELU}(W_{down} x + b_{down}) + b_{up} + x    (4)

For a parallel adapter, there is no need for an additional residual connection inside the adapter branch, but a scaling factor s is required to control the extent of the adapter's update. Given an input x_m of the MLP and its adapted output x_m', the formula is as follows (Eq. (5)):

x_m' = s \cdot \left( W_{up}\,\mathrm{GELU}(W_{down}\,\mathrm{LN}(x_m) + b_{down}) + b_{up} \right) + \mathrm{MLP}(\mathrm{LN}(x_m)) + x_m    (5)

During fine-tuning, the weights of the attention layer and the MLP are frozen, and only the weights of the adapter are trained. When the middle dimension is a multiple of the input dimension, fine-tuning the adapter is equivalent to fine-tuning a newly added MLP.

4.3 Low-rank adaptation
LoRA is a reparameterization method that converts the original parameters in a neural network into a parameter-efficient form [40]. A neural network usually consists of a large number of full-rank matrix operations. When migrating to downstream tasks, LoRA assumes that the pre-trained model has a small intrinsic rank, and the updating of weights can be achieved on this small subspace. For a pre-trained weight matrix W_O ∈ R^(d×k), a bypass ΔW ∈ R^(d×k) is added to constrain the update of its weights, and ΔW is decomposed into the product of matrices A ∈ R^(d×r) and B ∈ R^(r×k) using low-rank decomposition, with the rank r ≪ min(d, k). As shown in Figure 9, for the original path y = W_O x, the updated result y' is:

y' = (W_O + \Delta W) x = W_O x + ABx    (6)

During fine-tuning, the weight matrix W_O is kept frozen, and only the matrices A and B are fine-tuned. Matrix A is initialized using random Gaussian initialization, and matrix B is initialized to 0 [16]. As a more general approach, LoRA can theoretically be added to any set of weights; in that limit, fine-tuning with LoRA becomes roughly equivalent to full fine-tuning. From a parameter-efficient perspective, LoRA is usually added to the attention weights, typically on the query and value components. Due to the adoption of low-rank decomposition, LoRA has fewer parameters compared to the adapter. Additionally, the parallel design of LoRA can reduce the inference latency caused by the sequential execution of the adapter.

Figure 9 Fine-tune strategy of LoRA.
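A minimal PyTorch sketch of the two delta modules (Eqs. (4)-(6)) is given below for illustration; the scaling factor s and the parallel-branch residual of Eq. (5) are left to the caller, and the class names are illustrative rather than the authors' code.

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter of Eq. (4); for the parallel variant (Eq. (5)) the caller
    applies the scaling factor s and adds the MLP branch and residual."""
    def __init__(self, dim, mid_dim=32):
        super().__init__()
        self.down = nn.Linear(dim, mid_dim)   # W_down, b_down
        self.act = nn.GELU()
        self.up = nn.Linear(mid_dim, dim)     # W_up, b_up

    def forward(self, x):
        return self.up(self.act(self.down(x))) + x   # residual connection

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer W_O with a low-rank bypass (Eq. (6)).
    Per the text above, A (r -> d) is Gaussian-initialized and B (k -> r) starts at zero."""
    def __init__(self, base: nn.Linear, rank=4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False           # W_O stays frozen
        self.B = nn.Linear(base.in_features, rank, bias=False)
        self.A = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.B.weight)         # bypass contributes nothing at step 0
        nn.init.normal_(self.A.weight, std=0.02)

    def forward(self, x):
        return self.base(x) + self.A(self.B(x))   # y' = W_O x + ABx
```

Wrapping the query and value projections of each ViT block with such a LoRA layer corresponds to the "qv" configuration examined in the ablation studies below.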
5. Experiment and results
5.1 Implementation details
The primary loss function for semantic segmentation is cross-entropy (Eq. (7)). However, due to the highly imbalanced nature of crack segmentation, the cross-entropy loss tends to converge rapidly to zero during training, thereby allowing the background to dominate the loss [35]. Hence, a more effective approach is to use a weighted combination of cross-entropy and Dice loss (Eq. (8)), as shown in Eq. (9):

L_{CE} = -\,y \log(\hat{y}) - (1 - y)\log(1 - \hat{y})    (7)

L_{Dice} = 1 - \frac{2\,|X \cap Y|}{|X| + |Y|}    (8)

L = \lambda L_{CE} + (1 - \lambda) L_{Dice}    (9)

In Eq. (7), y represents the ground-truth label, taking values of 0 or 1, and \hat{y} denotes the predicted probability. In Eq. (8), X corresponds to the mask region of the true labels, and Y represents the mask region of the predicted labels. The parameter λ in Eq. (9) serves as a weighting coefficient and is set to 0.2 in this study.

The learning rate is adjusted using a "poly" policy incorporating a warm-up strategy. Initially, for the first 300 iterations, the learning rate linearly increases from 0 to the initial learning rate of 0.0004. Subsequently, throughout the remaining iterations, the learning rate dynamically scales by multiplying with (1 - (iter - warm_up)/max_iter)^{power}, where power is set to 6. The maximum iteration limit is set to 140 epochs, with a batch size of 8. The model is optimized with AdamW, with parameters β1, β2, and weight_decay set to 0.9, 0.999, and 0.01, respectively. A threshold of 0.5 is chosen for mask binarization. Due to the relative abundance of the training data, only basic data augmentation techniques are employed, including random rotation and random flip. The training process is accelerated using automatic mixed precision and TensorFloat-32 [45]. Pre-trained weights of SAM are loaded and frozen before training. The best-performing checkpoint of the delta and head on the validation set is saved and selected for subsequent testing. At the inference stage, the final predicted mask is obtained by returning the index of the maximum value along the channel dimension of the masks. The model is built on the PyTorch framework and trained on a single 24 GB RTX 3090 GPU.

5.2 Evaluation metrics
Precision (Pr), recall (Re), F1-score (F1), and IoU are employed to evaluate the segmentation performance of the model, as defined in Eqs. (10)-(13):

\mathrm{Pr} = \frac{TP}{TP + FP}    (10)

\mathrm{Re} = \frac{TP}{TP + FN}    (11)

F1 = \frac{2 \cdot \mathrm{Pr} \cdot \mathrm{Re}}{\mathrm{Pr} + \mathrm{Re}}    (12)

\mathrm{IoU} = \frac{TP}{TP + FP + FN}    (13)

where true positive (TP) denotes pixels representing cracks that are correctly classified, false positive (FP) represents background pixels erroneously classified as cracks, and false negative (FN) indicates crack pixels misclassified as background.
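As an illustration of Eqs. (7)-(13), a compact PyTorch sketch of the weighted loss and the pixel-level metrics is given below. It uses a single-logit binary formulation for brevity, whereas the actual model outputs a (num_class, H, W) map followed by an argmax; the function names are illustrative.

```python
import torch
import torch.nn.functional as F

def combined_loss(logits, target, lam=0.2, eps=1e-6):
    """Weighted cross-entropy + Dice loss of Eqs. (7)-(9); logits and target are (B, H, W) floats."""
    prob = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, target)        # Eq. (7), averaged over pixels
    inter = (prob * target).sum(dim=(1, 2))
    dice = 1 - (2 * inter + eps) / (prob.sum(dim=(1, 2)) + target.sum(dim=(1, 2)) + eps)  # Eq. (8)
    return lam * ce + (1 - lam) * dice.mean()                       # Eq. (9), lambda = 0.2

def pixel_metrics(pred, target):
    """Pr, Re, F1 and IoU of Eqs. (10)-(13) for binary 0/1 masks."""
    tp = ((pred == 1) & (target == 1)).sum().item()
    fp = ((pred == 1) & (target == 0)).sum().item()
    fn = ((pred == 0) & (target == 1)).sum().item()
    pr = tp / (tp + fp + 1e-6)
    re = tp / (tp + fn + 1e-6)
    f1 = 2 * pr * re / (pr + re + 1e-6)
    iou = tp / (tp + fp + fn + 1e-6)
    return pr, re, f1, iou
```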
5.3 Ablation study
In this section, ablation studies are conducted on the parameter settings of the proposed architecture. For the adapter, it is necessary to study the size of the middle dimension and the scaling factor. For LoRA, the positions where LoRA is applied and the size of the rank should be investigated. The settings of these hyperparameters vary greatly for different downstream tasks. In addition, the impact of the size of the backbone and the combination of the two fine-tuning methods are also studied. The following experiments report the metrics Pr, Re, F1, and IoU (Eqs. (10)-(13)) when the well-trained crack segmentation model infers on the test set. Meanwhile, the model's generalization ability is evaluated by directly applying the pre-trained model to three new datasets (Road420, Facade390, and Concrete3k) in a zero-shot manner without any additional training, and measuring the IoU metric.

5.3.1 CrackSAM_Adapter
As shown in Table 1, introducing a few parameters is sufficient to achieve excellent transfer to the downstream task. Even when the middle dimension is set to 1, the model can still achieve decent precision. An interesting phenomenon is that increasing the middle dimension continuously improves the metrics on the test set, but blindly increasing parameters leads to a decrease in generalization. When the middle dimension is 32, the IoU on the test set only decreases by 0.3% compared to when the middle dimension is 64. However, there are improvements of 1.9%, 3.7%, and 4.8% when performing zero-shot learning on Road420, Facade390, and Concrete3k, respectively. Considering that the number of parameters in the adapter is directly proportional to the middle dimension, setting the middle dimension to 32 is more appropriate.

Table 1 Ablation study on the middle dimension of adapter. (Pr, Re, F1 and IoU are inference metrics on the test set; the last three columns are zero-shot IoU on the new datasets.)
Middle dimension | Pr | Re | F1 | IoU | Road420 | Facade390 | Concrete3k
dim=1  | 0.7554 | 0.7786 | 0.7515 | 0.6270 | 0.5310 | 0.4618 | 0.6743
dim=16 | 0.7664 | 0.7968 | 0.7696 | 0.6479 | 0.6139 | 0.4772 | 0.6461
dim=32 | 0.7676 | 0.7965 | 0.7704 | 0.6495 | 0.6149 | 0.4718 | 0.6718
dim=64 | 0.7674 | 0.8002 | 0.7719 | 0.6513 | 0.6033 | 0.4548 | 0.6412

Note that the model's IoU on the Facade390 dataset is relatively low. This is because Facade390 is mainly composed of cracks in building exterior wall materials, which are very fine compared to road cracks, while the masks in the training set are mostly coarse segment-wise annotations, such as the masks in the CRACK500 subset. Fine annotations are less prevalent in the training set, resulting in lower IoU during zero-shot on Facade390. In fact, a well-tuned CrackSAM model can accurately detect the majority of cracks in the test set (Re > 0.9). Given this, in the ablation experiments, priority is given to evaluating generalization ability based on the Road420 and Concrete3k datasets.

The scaling factor s is introduced to balance the task-agnostic features generated by the frozen backbone and the task-specific features generated by the tunable parallel adapters. As shown in Table 2, setting the scaling factor to 0.2 yields better performance in terms of generalization.

Table 2 Ablation study on the scaling factor of adapter. (Same column layout as Table 1.)
Scaling factor | Pr | Re | F1 | IoU | Road420 | Facade390 | Concrete3k
s=0.1 | 0.7671 | 0.7953 | 0.7693 | 0.6480 | 0.6042 | 0.4487 | 0.6597
s=0.2 | 0.7676 | 0.7965 | 0.7704 | 0.6495 | 0.6149 | 0.4718 | 0.6718
s=0.5 | 0.7716 | 0.7934 | 0.7706 | 0.6499 | 0.6090 | 0.4354 | 0.6635
s=1   | 0.7702 | 0.7958 | 0.7709 | 0.6500 | 0.6006 | 0.4313 | 0.6426
s=2   | 0.7751 | 0.7902 | 0.7707 | 0.6494 | 0.5981 | 0.4586 | 0.6548

5.3.2 CrackSAM_LoRA
Table 3 Ablation study on the rank of LoRA. (Pr, Re, F1 and IoU are inference metrics on the test set; the last three columns are zero-shot IoU on the new datasets.)
Rank | Pr | Re | F1 | IoU | Road420 | Facade390 | Concrete3k
r=1  | 0.7509 | 0.7941 | 0.7585 | 0.6352 | 0.6176 | 0.4494 | 0.6516
r=4  | 0.7620 | 0.7918 | 0.7639 | 0.6416 | 0.6222 | 0.4544 | 0.6798
r=8  | 0.7656 | 0.7925 | 0.7665 | 0.6448 | 0.6201 | 0.4601 | 0.6800
r=16 | 0.7657 | 0.7947 | 0.7687 | 0.6473 | 0.6200 | 0.4573 | 0.6727

According to Table 3, similar to the adapter, fine-tuning LoRA with a rank of 1 is already quite effective, and at this point the parameters of the LoRA component amount to only 0.16M. As the rank increases, the metrics on the test set continuously improve. The model's generalization reaches saturation when the rank is set to 4 or 8, while it decreases slightly when the rank is set to 16. The decrease in generalization caused by over-parameterization is similar to that of the adapter. Considering performance and cost, a rank of 4 or 8 is more reasonable.

The LoRA layer can be applied to the query, key, value, and output matrices in the attention layer. As shown in Table 3 and Table 4, when the rank is 8 and the LoRA layer is applied only to the query, even though the number of parameters is equivalent to the situation where the rank is 4 and LoRA is applied to both the query and value, the latter achieves higher metrics. This implies that the position of the LoRA layer is a crucial factor. When the rank is 8 and LoRA is applied to both query and value, the generalization ability on Road420 and Concrete3k improves by 6.9% and 3.6% compared to applying LoRA only to the query. Applying LoRA to all four matrices has a similar effect to excessively increasing the rank, resulting in a slight improvement in the metrics on the test set but a decrease in zero-shot capability. Therefore, it suffices to add LoRA to the query and value matrices alone.

Table 4 Ablation study on the weight type of LoRA. (Same column layout as Table 3.)
Weight type | Pr | Re | F1 | IoU | Road420 | Facade390 | Concrete3k
Wq             | 0.7489 | 0.7964 | 0.7575 | 0.6344 | 0.5800 | 0.5122 | 0.6562
Wq, Wv         | 0.7656 | 0.7925 | 0.7665 | 0.6448 | 0.6201 | 0.4601 | 0.6800
Wq, Wk, Wv, Wo | 0.7717 | 0.7887 | 0.7690 | 0.6476 | 0.6183 | 0.4602 | 0.6501

When comparing the adapter and LoRA, the former demonstrates slightly higher metrics on the test set, while the latter exhibits better generalization performance. Considering that LoRA has fewer parameters than the adapter, CrackSAM_LoRA is more recommended for engineering applications.

5.3.3 Combine two PEFT methods or use neither
Here, a comparison is made between using both PEFT methods simultaneously and using neither, i.e., employing only the traditional fine-tuning of the head (Figure 2(b)). Three different parameter scales of adapter and LoRA combinations are tested. According to Table 5, the effectiveness of combining PEFT methods depends on the parameter scale. When an adapter with a middle dimension of 32 and LoRA with a rank of 8 are added to the model simultaneously, the metrics improve slightly compared to the model with only LoRA of rank 8 shown in Table 4. However, considering the obvious increase in computational cost, there is no significant necessity in combining multiple PEFT approaches. When fine-tuning only the prompt encoder and mask decoder, there is a significant performance decline: the IoU metrics of the rank-8 query-and-value configuration in Table 4 on the test set and the three new datasets are approximately 15.9%, 28.5%, 7.3%, and 12.2% higher, respectively, than those of the head-only fine-tuning in Table 5.
This clearly demonstrates the superiority of PEFT over traditional fine-tuning methods, because completely freezing the backbone makes it more challenging for the model to extract semantic information related to cracks.

Table 5 Experimental results of combining two methods and using neither method. (Pr, Re, F1 and IoU are inference metrics on the test set; the last three columns are zero-shot IoU on the new datasets. Note: s = scaling factor; dim = middle dimension; qv = apply LoRA to query and value matrices; r = rank.)
Delta type | Pr | Re | F1 | IoU | Road420 | Facade390 | Concrete3k
No PEFT, only fine-tune head            | 0.6951 | 0.7188 | 0.6843 | 0.5564 | 0.4826 | 0.4288 | 0.6059
adapter(s=0.2, dim=8) + LoRA(qv, r=2)   | 0.7596 | 0.7959 | 0.7657 | 0.6438 | 0.6132 | 0.4560 | 0.6628
adapter(s=0.2, dim=16) + LoRA(qv, r=4)  | 0.7637 | 0.8005 | 0.7703 | 0.6488 | 0.6188 | 0.4639 | 0.6798
adapter(s=0.2, dim=32) + LoRA(qv, r=8)  | 0.7664 | 0.7959 | 0.7696 | 0.6485 | 0.6230 | 0.4862 | 0.6835

5.3.4 Size of backbone
According to Table 6, the size of the backbone has a significant impact on the fine-tuned model. In general, the larger the backbone, the more powerful the segmentation ability after fine-tuning. Progressing from ViT-B to ViT-L and then to ViT-H, the segmentation performance and generalization ability improve for both adapter and LoRA. This observation aligns with the scaling law of large language models. It may be attributed to the richer features extracted by the stronger backbone and the smaller intrinsic dimension it has. Under the same parameter configuration, fine-tuning becomes more effective for large-scale backbones. Therefore, in the other experiments, this paper adopts only ViT-H as the backbone of CrackSAM.

Table 6 Ablation study on the size of backbone. (Pr, Re, F1 and IoU are inference metrics on the test set; the last three columns are zero-shot IoU on the new datasets.)
Delta type | Backbone | Pr | Re | F1 | IoU | Road420 | Facade390 | Concrete3k
adapter | ViT-B | 0.7574 | 0.7920 | 0.7610 | 0.6379 | 0.5859 | 0.4680 | 0.6573
adapter | ViT-L | 0.7611 | 0.8004 | 0.7682 | 0.6464 | 0.6263 | 0.4700 | 0.6672
adapter | ViT-H | 0.7676 | 0.7965 | 0.7704 | 0.6495 | 0.6149 | 0.4718 | 0.6718
LoRA    | ViT-B | 0.7512 | 0.7823 | 0.7523 | 0.6286 | 0.5905 | 0.4787 | 0.6557
LoRA    | ViT-L | 0.7623 | 0.7849 | 0.7608 | 0.6379 | 0.6162 | 0.4862 | 0.6791
LoRA    | ViT-H | 0.7620 | 0.7918 | 0.7639 | 0.6416 | 0.6222 | 0.4544 | 0.6798

6. Comparison with SOTA models
Semantic segmentation models typically consist of three main components: backbone, neck, and head. The backbone plays a crucial role in extracting high-level and semantically rich features. The neck assists in fusing information across multiple scales, while the head can be viewed as a decoder, receiving multi-scale features from the backbone and neck. It achieves the desired mask through aggregation, up-sampling, and refinement.

Twelve other models that have performed well in the field of semantic segmentation are selected for comparative experiments with CrackSAM. These include VGG-UNet [46], Swin-UPerNet [25], MobileNet [60], UNet-FCN and UNet-PSPNet [17], ResNet-DeepLabV3+ [58], ViT-UPerNet [24], SegFormer [26], HRNet-FCN [61], and ResNet-PSPNet [62]. To ensure a fair comparison, the selected comparative models have significant differences in parameter quantity and include various types of image backbones, such as CNN-based UNet, ResNet-50, and ResNet-101, as well as attention-based architectures like Swin-T, ViT-B, and Mix Transformer (MiT-B5). One reason for choosing these models is their outstanding performance in prior crack segmentation tasks [27][28][53][63]. The architectures of the comparative models are configured using the settings from the OpenMMLab segmentation toolkit [64].
The main training configuration is closely aligned with that of CrackSAM. The maximum number of iterations is set to 200 epochs, and the initial learning rate is determined through multiple trial-and-error adjustments. Based on the idea of transfer learning, the parameters are initialized using the weights of baseline models pre-trained on large datasets such as Cityscapes [65].

Figure 10 The changes of F1-score of different models on the validation set with training epochs. Batch size is set to 8. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

As shown in Figure 10, fine-tuning a vision foundation model using PEFT requires fewer iterations to achieve convergence compared to training an entire, relatively small model. Consequently, the total training duration does not significantly increase. This study mainly evaluates the performance of CrackSAM from two perspectives: robustness and generalization.

6.1 Evaluation on datasets with artificial noise
To assess the robustness of the model, artificial noise is introduced into the test set. This paper primarily investigates the following two cases when introducing artificial noise:

Case 1: For the input image I, reduce its brightness and apply Gaussian blur:

I' = (I - bri) \otimes K    (14)

where bri represents the brightness reduction, ⊗ denotes the convolution operation, and K represents the Gaussian kernel. Specifically, the image is converted to the HSV color space and 50 is subtracted from the V channel to decrease the brightness. Then, a 2D Gaussian filter is used to smooth the image, with a kernel size of 9×9 and both directional standard deviations set to 0.

Case 2: Apply severe blur to the input image followed by down-sampling:

I' = (I \otimes K) \downarrow_S    (15)

where ↓ denotes the down-sampling operation and S is the scaling factor. Gaussian blur with a kernel size of 21×21 is applied to the image, which is then down-sampled to half its original size using cubic interpolation, followed by interpolation back to the original size.

Figure 11 Inference results of comparative experiments on the test set with artificial noise.

The experimental results are listed in Table 7, and some predicted masks are shown in Figure 11. Figure 11 (a)-(d) come from Case 1, simulating a dim environment. The remaining figures are from Case 2, representing a blurred situation. As shown in the figure, the proposed CrackSAM performs well in identifying various forms of cracks, such as linear (Figure 11 (c)), branched (Figure 11 (d)), and webbed (Figure 11 (b)) cracks. It is capable of predicting on different materials (asphalt, concrete), various structures (road surfaces, walls, etc.), and diverse crack thicknesses, brightness levels, and contrast ratios. Figure 11 (a) is a non-crack image. Neither CrackSAM_adapter nor CrackSAM_LoRA outputs any mask, while the other four comparative models identified construction joints and paint as cracks. It turns out that the crack classification task can be subsumed into the crack segmentation task, eliminating the need for a two-stage design: when the model does not return any crack mask, the image is a non-crack image. In extremely dark conditions (Figure 11 (c)), CrackSAM can still identify some cracks, although a few may be indistinguishable from the background. In extremely blurry situations (Figure 11 (d), (e), (h)), CrackSAM can find most cracks that are difficult for the naked eye to discern, while other models can hardly detect the cracks present in the image.
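For reference, the two noise cases defined above (Eqs. (14) and (15)) can be reproduced approximately with OpenCV as follows; the function names are illustrative and the parameters follow the description in the text.

```python
import cv2
import numpy as np

def add_noise_case1(img_bgr):
    """Case 1 (Eq. (14)): subtract 50 from the HSV V channel, then 9x9 Gaussian blur."""
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)
    h, s, v = cv2.split(hsv)
    v = np.clip(v.astype(np.int16) - 50, 0, 255).astype(np.uint8)
    dark = cv2.cvtColor(cv2.merge([h, s, v]), cv2.COLOR_HSV2BGR)
    return cv2.GaussianBlur(dark, (9, 9), 0)

def add_noise_case2(img_bgr):
    """Case 2 (Eq. (15)): 21x21 Gaussian blur, downsample to 1/2 with cubic
    interpolation, then interpolate back to the original size."""
    blurred = cv2.GaussianBlur(img_bgr, (21, 21), 0)
    h, w = blurred.shape[:2]
    small = cv2.resize(blurred, (w // 2, h // 2), interpolation=cv2.INTER_CUBIC)
    return cv2.resize(small, (w, h), interpolation=cv2.INTER_CUBIC)
```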
Figure 12 shows the IoU of different models under different Gaussian kernel sizes when Gaussian blur is added to the images in the test set. As shown in the figure, CrackSAM demonstrates much higher robustness than the other models. Despite the comparable performance of the compared models on the unprocessed test set, their IoU decreases significantly when Gaussian blur is added. When the kernel size is 25, CrackSAM's performance even surpasses that of ViT-B and Swin-B with a kernel size of 15.

Figure 12 Variation of IoU for different models under different Gaussian kernel sizes.

Table 7 reveals that almost all models achieved satisfactory results on the original test set (IoU ≥ 0.62), except for UNet-PSPNet, which performed relatively worse. Excluding UNet-PSPNet, the maximum gap between CrackSAM and the other 11 SOTA models is 4.6% on the original test set. However, significant differences emerge on the noisy test sets, with variations reaching 27.0% and 42.0% in the two cases. The classic UNet architecture performs poorly on severely blurred test sets, with UNet-FCN and UNet-PSPNet experiencing accuracy drops of 54.4% and 60.3% after adding severe blur, while CrackSAM_LoRA only experiences a 23.4% drop. Among all the models, CrackSAM_adapter stands out as the most accurate model on the test set, while CrackSAM_LoRA performs best on the noisy test sets. From this, it can be seen that CrackSAM's robustness is much better than that of traditional models.

6.2 Zero-shot performance on unseen datasets

Figure 13 Zero-shot results of comparative experiments on Road420.

As shown in Figure 13, the proposed model demonstrates satisfactory predictions at various scales, in various environments, and under different interferences. In Figure 13 (a), cracks captured from a distant view become challenging to discern for the naked eye after down-sampling, yet the AI models used in this experiment can still identify cracks in the image. This effectively showcases the superiority of deep learning algorithms in crack segmentation. For cracks captured at night, as shown in Figure 13 (c), (d), and (e), CrackSAM maintains highly accurate predictions, particularly in Figure 13 (e) where the cracks almost merge with shadows, yet CrackSAM still accurately identifies them. Other models, however, are obviously affected by shadows and road markings, resulting in numerous artifacts. In Figure 13 (b), CrackSAM correctly distinguishes between construction joints and cracks, but other models are misled by the sidewalk in the image. In Figure 13 (f), CrackSAM is not affected by occlusion from people and accurately outputs three segments of cracks, whereas other models either produce artifacts or fail to recognize all three segments. CrackSAM correctly segments the cracks in Figure 13 (g) with little influence from tire tracks on the road surface. In Figure 13 (h), the segmentation performance of CrackSAM remains unaffected by the presence of a cup and refracted light.

Figure 14 Zero-shot results of comparative experiments on Facade390.

Because annotation thickness affects the IoU during zero-shot on Facade390, resulting in generally lower IoU values, it is necessary to examine the segmentation result figures when conducting comparative analysis on this dataset. As depicted in Figure 14, the proposed model can effectively identify cracks in an automated and efficient manner when combined with a UAV, even when images are captured from different angles and distances.
In Figure 14 (a), other models struggle to distinguish between construction joints and cracks, whereas CrackSAM can. Figure 14 (b) and (c) showcase red building facades with peeling, where CrackSAM successfully segments small cracks while other models fail to do so. Figure 14 (d) illustrates walls with paint and peeling, where, due to severe interference, the comparative models cannot segment complete cracks, whereas CrackSAM provides results closest to the ground truth. Figure 14 (f) exhibits surface cracks on a column with damp stains and grass; CrackSAM's segmentation mask closely resembles the actual crack morphology. In Figure 14 (g), at the junction of a beam and column with tree shadows, CrackSAM identifies all four cracks, including the two small ones at the top of the image.

According to Table 7, similar to robustness, the generalization gap among different models is also substantial, reaching 42.9%, 33.0%, and 31.1% on the three new datasets. Lightweight networks like MobileNet-V3 and HRNet-FCN show decent zero-shot performance on Concrete3k but perform poorly on the interference-filled Road420, demonstrating the limited feature extraction capability of small models. Some larger models, such as ViT-B and Swin-B, perform similarly to CrackSAM on the test set, with acceptable performance on Road420, but their prediction accuracy is sensitive to noise. The widely used ResNet models demonstrate reasonable robustness, but their performance on the Road420 dataset is unexpectedly poor. Both fine-tuned versions of CrackSAM exhibit excellent cross-dataset generalization capabilities. CrackSAM_LoRA is the best-performing model during zero-shot on the Road420 and Concrete3k datasets, while CrackSAM_adapter performs better on Facade390. Among the twelve SOTA models, SegFormer is the best in both robustness and generalization. However, the proposed CrackSAM achieves a significant improvement in IoU compared to SegFormer on the two noisy test sets and the three new datasets, with increases of up to 11.1%, 10.8%, 7.0%, 2.1%, and 4.1%, indicating a notable performance boost. Table 7 also illustrates the significance of studying the robustness and generalization of crack segmentation models, as it truly impacts the models' feasibility for real-world applications.

In summary, through comparative experiments, the proposed CrackSAM has achieved the best results in terms of test set accuracy, robustness, and zero-shot performance. The robust generalization ability of CrackSAM is primarily attributed to the power of the large backbone and the effectiveness of the PEFT method. This is evident from the comparative and ablation experiments, as CrackSAM does not perform as outstandingly when ViT-H and PEFT are not utilized.

Table 7 Comparison results on noisy datasets and unseen datasets with other SOTA models. (Values are IoU. Noisy test set 1: -50 bri + blur(k=9); noisy test set 2: ×1/2 + blur(k=21); Road420, Facade390 and Concrete3k are zero-shot. Note: bri = brightness; k = kernel size.)
Model | Backbone | Parameters | Test set | Noisy test set 1 | Noisy test set 2 | Road420 | Facade390 | Concrete3k
CrackSAM_adapter (dim=32, s=0.2) | ViT-H | 641.9M (Tunable 9.1M) | 0.6495 | 0.5466 | 0.4763 | 0.6149 | 0.4718 | 0.6718
CrackSAM_LoRA (qv, rank=4) | ViT-H | 637.2M (Tunable 4.4M) | 0.6416 | 0.5782 | 0.4915 | 0.6222 | 0.4544 | 0.6798
VGG-UNet | VGG16 | 53.91M | 0.6419 | 0.4337 | 0.3472 | 0.5126 | 0.4547 | 0.5152
Swin-UPerNet | Swin-T | 58.9M | 0.6199 | 0.4745 | 0.3963 | 0.4628 | 0.4065 | 0.4778
Swin-UPerNet | Swin-B | 120.0M | 0.6428 | 0.5003 | 0.3857 | 0.5262 | 0.4593 | 0.5655
MobileNet-V3 | MobileNet-V3 | 3.28M | 0.6208 | 0.5068 | 0.3738 | 0.4447 | 0.4322 | 0.6154
UNet-FCN | UNet | 28.99M | 0.6255 | 0.4218 | 0.2852 | 0.4531 | 0.3677 | 0.4682
UNet-PSPNet | UNet | 28.97M | 0.5594 | 0.3535 | 0.2222 | 0.3555 | 0.3163 | 0.5262
ResNet-DeepLabV3+ | ResNet-101 | 60.2M | 0.6402 | 0.5115 | 0.4088 | 0.3827 | 0.4399 | 0.5791
ResNet-DeepLabV3+ | ResNet-50 | 41.2M | 0.6395 | 0.5077 | 0.3947 | 0.3918 | 0.4402 | 0.5601
ResNet-PSPNet | ResNetV1c-101 | 65.59M | 0.6346 | 0.5110 | 0.4207 | 0.4084 | 0.4327 | 0.5544
ViT-UPerNet | ViT-B | 142.1M | 0.6328 | 0.4714 | 0.3554 | 0.5171 | 0.4276 | 0.6027
SegFormer | MiT-B5 | 82.0M | 0.6484 | 0.5204 | 0.4436 | 0.5817 | 0.4622 | 0.6533
HRNet-FCN | HRNet-W18 | 9.63M | 0.6356 | 0.5055 | 0.4434 | 0.4172 | 0.4214 | 0.6322

Figure 15 Some prediction situations with low IoU. (a) Wrong classification. (b) Thicker prediction mask. (c) Controversial annotations subject to subjective judgements.

Issues affecting the model's accuracy can generally be categorized into three situations. In the first scenario, the model fails to recognize the semantics of a particular object, as illustrated in Figure 15 (a); this reflects a deficiency of the model. The second situation arises when the model correctly identifies a crack and successfully outputs its mask, but the output mask is much coarser than the annotation, so Precision and IoU are lower while Recall is very high, as depicted in Figure 15 (b). This situation does not impact the normal recognition of cracks and is considered acceptable. The third type is that annotators, when providing high-quality annotations on high-resolution images, may label not only the main crack but also adjacent minor cracks and defects. Whether these tiny defects should be annotated depends on the annotator's subjective judgment (Figure 15 (c)). After down-sampling, information about these minor defects is severely lost, rendering them unidentifiable. This situation often occurs in the annotation of asphalt road cracks, where asphalt and cracks have similar brightness and contrast, leading to ambiguous situations.

7. Conclusions
This paper fine-tuned the Segment Anything Model using PEFT methods for crack segmentation. The proposed CrackSAM was pre-trained on over 11k images. Two new labelled datasets comprising 810 images were collected with a smartphone and a UAV for zero-shot evaluation. The main conclusions of this paper are as follows:

(1) PEFT technology was utilized, with SAM's image encoder frozen and trainable delta modules (adapter and LoRA) introduced on the ViT backbone. Fine-tuning was applied to the head and the delta. The pre-trained vision foundation model can thereby be introduced into crack segmentation effectively.

(2) The proposed CrackSAM based on PEFT improved the IoU score greatly compared to the traditional method of only fine-tuning the head. The fine-tuning of CrackSAM followed the scaling law, where using ViT-H as the backbone instead of ViT-B resulted in additional performance gains. The combination of the ViT-H backbone and the PEFT method is the main reason for the successful performance of CrackSAM.

(3) Excessive over-parameterization can possibly enhance performance on the test set but may not necessarily generalize to other datasets.
Increasing the middle dimension of the adapter, raising the rank of LoRA, or applying LoRA at more positions often came with an increase in computational cost and a simultaneous decrease in generalization. Therefore, the design of fine-tuning entails a trade-off between complexity, performance, and generalization.

(4) CrackSAM worked exceptionally well in cross-scale and cross-scenario situations, exhibiting strong robustness and generalization capabilities. In the evaluation on two artificially introduced noise scenarios and three previously unseen datasets, CrackSAM demonstrated a great improvement in IoU compared to the twelve SOTA models, ranging from 11.1% - 63.6%, 10.8% - 121.2%, 7.0% - 75.0%, 2.1% - 49.2%, and 4.1% - 45.2%, respectively.

(5) Satisfactory results were achieved on the test set by almost all models, but there was a significant difference in generalization. In complex environments with severe interference, noticeable advantages were demonstrated by CrackSAM. Considering the various factors that may affect crack segmentation in real-world deployment, it is essential to study the robustness and zero-shot capabilities of a newly proposed architecture.

Considering that the lack of large benchmark datasets in the field of crack segmentation constrains the performance of AI models and their practical applications, the authors call for more open-source efforts and the establishment of a large-scale crack segmentation dataset with a unified standard. If deploying a lighter network is necessary, it is recommended to employ knowledge distillation to train a lightweight model with guidance from CrackSAM. Furthermore, other powerful vision foundation models, such as the self-supervised DINOv2 [33], can be leveraged to detect cracks. Such techniques can also be extended to segment other types of structural defects, enabling "segment everything" in the field of SHM. These will be the focus of future work.

8. Declarations
8.1. Funding
The authors gratefully acknowledge the financial support provided by the National Natural Science Foundation of China, Grant No. 52308179.
8.2. Conflicts of interest
There are no conflicts of interest for this paper.
8.3. Data availability
All the utilized models, pre-trained weights, and the labelled datasets will be publicly available on https://github.com/KG-TSI-Civil/CrackSAM after acceptance.

References
[1] Zawad, Md Rahat Shahriar, et al. "A comparative review of image processing based crack detection techniques on civil engineering structures." Journal of Soft Computing in Civil Engineering 5.3 (2021): 58-74.
[2] Wan, Kai Tai, and Christopher KY Leung. "Applications of a distributed fiber optic crack sensor for concrete structures." Sensors and Actuators A: Physical 135.2 (2007): 458-464.
[3] Aggelis, D. G., et al. "Combined use of thermography and ultrasound for the characterization of subsurface cracks in concrete." Construction and Building Materials 24.10 (2010): 1888-1897.
[4] Tashan, Jawdat, and R. Al-Mahaidi. "Detection of cracks in concrete strengthened with CFRP systems using infra-red thermography." Composites Part B: Engineering 64 (2014): 116-125.
[5] Azimi, Mohsen, Armin Dadras Eslamlou, and Gokhan Pekcan. "Data-driven structural health monitoring and damage detection through deep learning: State-of-the-art review." Sensors 20.10 (2020): 2778.
[6] Wang, Long, and Zijun Zhang. "Automatic detection of wind turbine blade surface cracks based on UAV-taken images." IEEE Transactions on Industrial Electronics 64.9 (2017): 7293-7303.
8. Declarations

8.1. Funding
The authors gratefully acknowledge the financial support provided by the National Natural Science Foundation of China, Grant No. 52308179.

8.2. Conflicts of interest
There are no conflicts of interest for this paper.

8.3. Data availability
All the utilized models, pre-trained weights, and the labelled datasets will be publicly available at https://github.com/KG-TSI-Civil/CrackSAM after acceptance.

References
[1] Zawad, Md Rahat Shahriar, et al. "A comparative review of image processing based crack detection techniques on civil engineering structures." Journal of Soft Computing in Civil Engineering 5.3 (2021): 58-74.
[2] Wan, Kai Tai, and Christopher KY Leung. "Applications of a distributed fiber optic crack sensor for concrete structures." Sensors and Actuators A: Physical 135.2 (2007): 458-464.
[3] Aggelis, D. G., et al. "Combined use of thermography and ultrasound for the characterization of subsurface cracks in concrete." Construction and Building Materials 24.10 (2010): 1888-1897.
[4] Tashan, Jawdat, and R. Al-Mahaidi. "Detection of cracks in concrete strengthened with CFRP systems using infra-red thermography." Composites Part B: Engineering 64 (2014): 116-125.
[5] Azimi, Mohsen, Armin Dadras Eslamlou, and Gokhan Pekcan. "Data-driven structural health monitoring and damage detection through deep learning: State-of-the-art review." Sensors 20.10 (2020): 2778.
[6] Wang, Long, and Zijun Zhang. "Automatic detection of wind turbine blade surface cracks based on UAV-taken images." IEEE Transactions on Industrial Electronics 64.9 (2017): 7293-7303.
[7] Liu, Yu-Fei, et al. "Image-based crack assessment of bridge piers using unmanned aerial vehicles and three-dimensional scene reconstruction." Computer-Aided Civil and Infrastructure Engineering 35.5 (2020): 511-529.
[8] Gopalakrishnan, Kasthurirangan, et al. "Crack damage detection in unmanned aerial vehicle images of civil infrastructure using pre-trained deep learning model." Int. J. Traffic Transp. Eng 8.1 (2018): 1-14.
[9] Sinha, Sunil K., and Paul W. Fieguth. "Automated detection of cracks in buried concrete pipe images." Automation in Construction 15.1 (2006): 58-72.
[10] Subirats, Peggy, et al. "Automation of pavement surface crack detection using the continuous wavelet transform." 2006 International Conference on Image Processing. IEEE, 2006.
[11] Zhang, Lei, et al. "Road crack detection using deep convolutional neural network." 2016 IEEE international conference on image processing (ICIP). IEEE, 2016.
[12] Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems 30 (2017).
[13] Chen, Junjie, Weisheng Lu, and Jinfeng Lou. "Automatic concrete defect detection and reconstruction by aligning aerial images onto semantic-rich building information model." Computer-Aided Civil and Infrastructure Engineering 38.8 (2023): 1079-1098.
[14] Kondo, Yuki, and Norimichi Ukita. "Joint Learning of Blind Super-Resolution and Crack Segmentation for Realistic Degraded Images." arXiv preprint arXiv:2302.12491 (2023).
[15] Houlsby, Neil, et al. "Parameter-efficient transfer learning for NLP." International Conference on Machine Learning. PMLR, 2019.
[16] Hu, Edward J., et al. "LoRA: Low-rank adaptation of large language models." arXiv preprint arXiv:2106.09685 (2021).
[17] Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-net: Convolutional networks for biomedical image segmentation." Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. Springer International Publishing, 2015.
[18] Lin, Tsung-Yi, et al. "Feature pyramid networks for object detection." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
[19] Zhao, Hengshuang, et al. "Pyramid scene parsing network." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
[20] Chen, Liang-Chieh, et al. "Encoder-decoder with atrous separable convolution for semantic image segmentation." Proceedings of the European conference on computer vision (ECCV). 2018.
[21] Zhang, Lingxin, Junkai Shen, and Baijie Zhu. "A research on an improved Unet-based concrete crack detection algorithm." Structural Health Monitoring 20.4 (2021): 1864-1879.
[22] Ren, Yupeng, et al. "Image-based concrete crack detection in tunnels using deep fully convolutional networks." Construction and Building Materials 234 (2020): 117367.
[23] Dais, Dimitris, et al. "Automatic crack classification and segmentation on masonry surfaces using convolutional neural networks and transfer learning." Automation in Construction 125 (2021): 103606.
[24] Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).
[25] Liu, Ze, et al. "Swin transformer: Hierarchical vision transformer using shifted windows." Proceedings of the IEEE/CVF international conference on computer vision. 2021.
[26] Xie, Enze, et al. "SegFormer: Simple and efficient design for semantic segmentation with transformers." Advances in Neural Information Processing Systems 34 (2021): 12077-12090.
[27] Shamsabadi, Elyas Asadi, et al. "Vision transformer-based autonomous crack detection on asphalt and concrete surfaces." Automation in Construction 140 (2022): 104316.
[28] Guo, Feng, et al. "Pavement crack detection based on transformer network." Automation in Construction 145 (2023): 104646.
[29] Bommasani, Rishi, et al. "On the opportunities and risks of foundation models." arXiv preprint arXiv:2108.07258 (2021).
[30] Kirillov, Alexander, et al. "Segment anything." arXiv preprint arXiv:2304.02643 (2023).
[31] Wang, Xinlong, et al. "SegGPT: Segmenting everything in context." arXiv preprint arXiv:2304.03284 (2023).
[32] Zou, Xueyan, et al. "Segment everything everywhere all at once." arXiv preprint arXiv:2304.06718 (2023).
[33] Oquab, Maxime, et al. "DINOv2: Learning robust visual features without supervision." arXiv preprint arXiv:2304.07193 (2023).
[34] Ahmadi, Mohsen, et al. "Application of segment anything model for civil infrastructure defect assessment." arXiv preprint arXiv:2304.12600 (2023).
[35] Zhou, Zhong, Junjie Zhang, and Chenjie Gong. "Hybrid semantic segmentation for tunnel lining cracks based on Swin Transformer and convolutional neural network." Computer-Aided Civil and Infrastructure Engineering (2023).
[36] Deng, Jia, et al. "Imagenet: A large-scale hierarchical image database." 2009 IEEE conference on computer vision and pattern recognition. IEEE, 2009.
[37] Lau, Stephen LH, et al. "Automated pavement crack segmentation using u-net-based convolutional neural network." IEEE Access 8 (2020): 114892-114899.
[38] Gao, Yuqing, et al. "Multiattribute multitask transformer framework for vision-based structural health monitoring." Computer-Aided Civil and Infrastructure Engineering (2023).
[39] Wang, Wei, et al. "A survey of zero-shot learning: Settings, methods, and applications." ACM Transactions on Intelligent Systems and Technology (TIST) 10.2 (2019): 1-37.
[40] Ding, Ning, et al. "Parameter-efficient fine-tuning of large-scale pre-trained language models." Nature Machine Intelligence 5.3 (2023): 220-235.
[41] Li, Xiang Lisa, and Percy Liang. "Prefix-tuning: Optimizing continuous prompts for generation." arXiv preprint arXiv:2101.00190 (2021).
[42] Lester, Brian, Rami Al-Rfou, and Noah Constant. "The power of scale for parameter-efficient prompt tuning." arXiv preprint arXiv:2104.08691 (2021).
[43] Chen, Tianrun, et al. "SAM Fails to Segment Anything?--SAM-Adapter: Adapting SAM in Underperformed Scenes: Camouflage, Shadow, and More." arXiv preprint arXiv:2304.09148 (2023).
[44] Wu, Junde, et al. "Medical SAM Adapter: Adapting segment anything model for medical image segmentation." arXiv preprint arXiv:2304.12620 (2023).
[45] Zhang, Kaidong, and Dong Liu. "Customized segment anything model for medical image segmentation." arXiv preprint arXiv:2304.13785 (2023).
[46] Khanhha, n.d. Khanhha/crack_segmentation. GitHub. URL https://github.com/khanhha/crack_segmentation#Dataset (accessed 11.9.23).
[47] Yang, Fan, et al. "Feature pyramid and hierarchical boosting network for pavement crack detection." IEEE Transactions on Intelligent Transportation Systems 21.4 (2019): 1525-1535.
[48] Eisenbach, Markus, et al. "How to get pavement distress detection ready for deep learning? A systematic approach." 2017 international joint conference on neural networks (IJCNN). IEEE, 2017.
[49] Shi, Yong, et al. "Automatic road crack detection using random structured forests." IEEE Transactions on Intelligent Transportation Systems 17.12 (2016): 3434-3445.
[50] Amhaz, Rabih, et al. "Automatic crack detection on two-dimensional pavement images: An algorithm based on minimal path selection." IEEE Transactions on Intelligent Transportation Systems 17.10 (2016): 2718-2729.
[51] Zou, Qin, et al. "CrackTree: Automatic crack detection from pavement images." Pattern Recognition Letters 33.3 (2012): 227-238.
[52] Liu, Yahui, et al. "DeepCrack: A deep hierarchical feature learning architecture for crack segmentation." Neurocomputing 338 (2019): 139-153.
[53] Li, Yongshang, et al. "Real-time high-resolution neural network with semantic guidance for crack segmentation." Automation in Construction 156 (2023): 105112.
[54] Tabernik, Domen, Matic Šuc, and Danijel Skočaj. "Automated detection and segmentation of cracks in concrete surfaces using joined segmentation and classification deep neural network." Construction and Building Materials 408 (2023): 133582.
[55] Wang, Wenjun, and Chao Su. "Automatic concrete crack segmentation model based on transformer." Automation in Construction 139 (2022): 104275.
[56] Hendrycks, Dan, and Kevin Gimpel. "Gaussian error linear units (GELUs)." arXiv preprint arXiv:1606.08415 (2016).
[57] Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E. Hinton. "Layer normalization." arXiv preprint arXiv:1607.06450 (2016).
[58] He, Kaiming, et al. "Identity mappings in deep residual networks." Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14. Springer International Publishing, 2016.
[59] He, Kaiming, et al. "Masked autoencoders are scalable vision learners." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.
[60] Howard, Andrew, et al. "Searching for mobilenetv3." Proceedings of the IEEE/CVF international conference on computer vision. 2019.
[61] Sun, Ke, et al. "Deep high-resolution representation learning for human pose estimation." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019.
[62] He, Tong, et al. "Bag of tricks for image classification with convolutional neural networks." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019.
[63] Kulkarni, Shreyas, et al. "CrackSeg9k: a collection and benchmark for crack segmentation datasets and frameworks." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022.
[64] Chen, Kai, et al. "MMDetection: Open mmlab detection toolbox and benchmark." arXiv preprint arXiv:1906.07155 (2019).
[65] Cordts, Marius, et al. "The cityscapes dataset for semantic urban scene understanding." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.