Fine-tune vision foundation model for crack segmentation in civil infrastructures

K. Ge1, C. Wang2, T.Y. Guo1*
1 Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
2 Department of Civil Engineering, Tsinghua University, Beijing, China

Abstract
Large-scale foundation models have become the mainstream method in the field of deep learning, while in civil engineering the scale of AI models is strictly limited. In this work, a vision foundation model is introduced for crack segmentation. Two parameter-efficient fine-tuning methods, adapter and low-rank adaptation, are adopted to fine-tune the foundation model in the field of semantic segmentation, the Segment Anything Model (SAM). The fine-tuned model, CrackSAM, is much larger than all existing crack segmentation models but shows excellent performance. To test the zero-shot performance of the proposed method, two unique datasets related to road and exterior wall cracks are collected, annotated and open-sourced, totalling 810 images. Comparative experiments are conducted with twelve mature semantic segmentation models. On datasets with artificial noise and on previously unseen datasets, the performance of CrackSAM far exceeds that of all state-of-the-art models. CrackSAM exhibits remarkable superiority, particularly in challenging conditions such as dim lighting, shadows, road markings, construction joints, and other interference factors. Such cross-scenario results demonstrate the outstanding zero-shot capability of foundation models and provide new ideas for the development of vision models in civil engineering.

Keywords: Crack segmentation, Parameter-efficient fine-tuning, Vision Transformer, Transfer learning, Zero-shot

* Corresponding author: Y. T. Guo (guoyutao@sz.tsinghua.edu.cn)
Emails: K. Ge (gk22@mails.tsinghua.edu.cn),

1. Introduction
Cracks are a common form of damage in engineering structures; they may reduce the load-bearing capacity and stiffness of the structures and lead to the corrosion of internal reinforcement, thereby reducing durability and even causing structural failure [1]. Therefore, identifying and analysing cracks is important in structural health monitoring (SHM). Traditionally, the detection of cracks is carried out manually, which is costly, subjective and inefficient. Emerging SHM methods have paved the way for more automated, efficient and intelligent monitoring. Multiple non-destructive monitoring methods are widely used in crack analyses, such as contact-based technologies like sensors [2] and contactless methods including ultrasound [3] and infrared thermography [4]. A prominent technology is the unmanned aerial vehicle (UAV). Equipped with devices such as high-resolution cameras, radar, and infrared cameras [5], UAVs have been applied in a series of crack assessment tasks [6][7][8].

The aim of crack segmentation is to classify crack images at the pixel level to distinguish cracks from the background, which entails image processing technologies. More than a decade ago, the main approaches for crack segmentation were filters [9], wavelet transforms [10], and other operations to denoise crack images. In recent years, deep learning has achieved rapid progress and is widely employed in computer vision (CV) tasks. Therefore, neural networks have become the mainstream approach for crack segmentation since 2016 [11].
These models can be divided into two categories from the architecture perspective: CNN-based networks and Transformer-based [12] networks. The former can be seen as a series of stacked local filters, enhancing the model's receptive field through multi-scale feature fusion. The latter effectively addresses the challenge of capturing long-distance dependencies through the attention mechanism.

A problem that cannot be ignored still exists: a crack segmentation model trained on a certain dataset may not be generalizable to other datasets. Pre-trained models typically absorb the biases of the training set and tend to overfit it, and thus perform poorly on unseen datasets. Chen et al. noticed the problem of cross-scenario/scale generalizability of defect detection models, where a pre-trained crack segmentation model is not readily generalizable to sophisticated defect types and large-scale images, with the intersection over union (IoU) dropping from 46.9% to 14.2% and 1.3%, respectively [13]. Crack segmentation is a highly class-imbalanced classification problem and is affected by many interference factors [14], including very fine cracks, low resolution, different shooting distances, blurring, shadows, occlusions, traffic and pedestrian flows, and different working conditions. However, the crack images commonly used for training are much cleaner. These factors seriously hinder the deployment of pre-trained models to identify cracks in engineering practice.

Inspired by the excellent generalization and zero-shot performance of large-scale foundation models, in this paper parameter-efficient fine-tuning (PEFT) technologies are applied to introduce the vision foundation model SAM into crack segmentation. Two datasets with severe interference under different working conditions are collected. Evaluation of the proposed CrackSAM focuses on the inference performance on datasets with artificial noise and the zero-shot performance on previously unseen datasets. There have been relatively few studies on the zero-shot capability of pre-trained crack segmentation models on different datasets. However, this is precisely the aspect of greatest concern in engineering; without it, models remain at the research stage and cannot be practically implemented. To the best of the authors' knowledge, this is the first work to fine-tune a large vision foundation model for crack segmentation. The contributions of this paper are summarized as follows:
• A road crack dataset and an exterior wall crack dataset are collected using a mobile phone and a UAV.
• Two PEFT methods, adapter [15] and low-rank adaptation (LoRA) [16], are employed to adapt SAM for crack segmentation.
• The fine-tuned CrackSAM exhibits outstanding performance on datasets with artificial noise and previously unseen datasets when compared to twelve state-of-the-art (SOTA) models. Excellent zero-shot identification of cracks is achieved without any additional training.
• The collected and labelled datasets, models, and pre-trained weights will be publicly available after acceptance.

The rest of this paper is arranged as follows. Section 2 provides the relevant studies related to this paper. Section 3 describes the preparation of datasets. Section 4 introduces the model architecture and fine-tuning methods. Section 5 presents the ablation studies and experiment results. Section 6 compares the proposed method with twelve SOTA methods.

2. Relevant work
2.1 Models for crack segmentation
Semantic segmentation tasks lay great emphasis on global context information.
Consequently, among CNN-based crack segmentation architectures, the UNet architecture [17] with skip connections and pyramid architectures fusing multi-scale feature maps, such as FPN [18], PSPNet [19], and DeepLabV3+ [20], have achieved excellent performance in various crack segmentation tasks. Zhang et al. [21] used an improved UNet architecture to integrate high-level features and shallow features of cracks. Ren et al. [22] employed methods such as dilated convolutions, spatial pyramid pooling, and skip connections to achieve feature aggregation and resolution reconstruction in crack segmentation. Dais et al. [23] combined UNet and FPN, integrating multiple backbones such as VGG, ResNet, MobileNet, etc., to conduct comparative experiments in crack segmentation for masonry structures. Due to the dominance of the Transformer architecture in recent years in CV tasks, Vision Transformer (ViT) [24], Swin Transformer [25], SegFormer [26] and many other Transformer-based architectures have been widely used in crack segmentation tasks. Shamsabadi et al. [27] used a TransUNet model with a hybrid CNN-ViT backbone to segment cracks in a dataset with very limited semantic information. The performance was superior to CNN-based UNet and DeepLabV3+, while also exhibiting stronger noise robustness. Guo et al. [28] employed an encoder-decoder architecture with a Swin Transformer backbone, achieving superior segmentation results on road surface cracks compared to models with UNet and ResNet as backbones. However, small models trained on limited datasets still face issues of insufficient recognition capabilities and poor generalization.

2.2 Vision foundation models
Foundation models can be regarded as a new general paradigm of AI, and have stronger capabilities than traditional models. They are usually based on Transformer architectures with billions of parameters. The realization of foundation models entails large-scale pre-training on huge datasets using massive GPUs [29]. In the field of CV, SAM [30] is a recently proposed foundation model for semantic segmentation, which is trained on over 1 billion masks from 11 million images. The large-scale pre-training endows SAM with the zero-shot ability to respond to various downstream tasks. SegGPT [31] is a similar work, which is capable of segmenting everything in context with one single model. SEEM [32] achieves semantic, instance, and panoptic segmentation through diverse prompts such as textual, visual, and referring region prompts. DINOv2 [33] is a vision foundation model trained in a self-supervised manner. The model learns directly from images without the need for text guidance. Its backbone can be employed in various downstream tasks such as image classification, instance recognition, semantic segmentation, and depth estimation.

However, directly applying vision foundation models for crack segmentation is not feasible. Ahmadi et al. [34] tried to directly utilize SAM for crack segmentation and found that SAM did not perform well on spalled cracks. Moreover, the masks of cracks cannot be directly obtained, as shown in Figure 1. Consequently, it is necessary to fine-tune SAM to learn the specific semantics of cracks.

Figure 1 Example image of directly applying SAM for crack segmentation.

2.3 Transfer learning and zero-shot learning
Transfer learning is a technique that involves leveraging knowledge learned from one task or domain to improve the performance of another related task or domain.
It involves transferring the learned representations, features, and patterns acquired during the training process of a source task to a target task with limited labelled data. Transfer learning has been widely practiced in crack segmentation tasks. Zhou et al. [35] utilized the pre-trained weights from ImageNet-1k [36] to initialize the weights of the backbone in order to alleviate the data dependency of the Swin Transformer. The parameters of the backbone are frozen for the first 50 epochs and unfrozen for the last 50 epochs. Lau et al. [37] compared such a "two-stage" training strategy with the standard training procedure and found that the UNet trained with the former method performed better. Gao et al. [38] established a hierarchical transfer learning architecture, where the pre-trained model for localization and segmentation tasks directly inherits the well-trained backbone used for classification tasks.

Zero-shot learning is a subfield of transfer learning. The definition of zero-shot learning is to classify test instances into unseen classes [39]. In crack segmentation tasks, there may be confusing interference from objects that have never been seen before in practical engineering applications, and the pre-trained model needs to successfully classify this type of semantic information. Therefore, in this work, zero-shot capability is evidenced by the ability to identify cracks under complex working conditions beyond the training set.

In the field of AI-aided SHM, traditional transfer learning methods can be summarized as initializing the backbone of the model with high-quality pre-trained weights (Figure 2(a)). During training, it is common in some studies to only fine-tune downstream networks (Figure 2(b)) or to freeze certain layers initially and then unfreeze and train all layers together. Such a full-training process requires significant GPU memory and sufficient data support. Nevertheless, the scarcity of high-quality annotated datasets and limited computational resources make full training of a foundation model impractical in civil engineering. For this reason, PEFT methods must be adopted.

Figure 2 Fine-tune methods for pre-trained models. (a) Full fine-tuning. (b) Fine-tune downstream networks. (c) PEFT.

2.4 PEFT
PEFT is a lighter but more efficient fine-tuning method, specially designed for Transformer architectures. With fewer resources and training iterations, PEFT can preserve the original knowledge of pre-trained foundation models, avoid catastrophic forgetting, and reduce overfitting. By introducing a few trainable parameters (less than 5%) that do not exist in the original network, the pre-trained foundation model can be adapted to downstream tasks. Compared to other methods that train some or all of the layers, PEFT is essentially "delta-tuning" [40], as shown in Figure 2(c). For example, prefix-tuning achieves performance comparable to full fine-tuning in the full-data setting by adding a trainable "soft prompt" to all the key and value matrices in Transformer layers [41], and even outperforms full fine-tuning in low-data settings. Prompt-tuning, as a simpler version of prefix-tuning, just adds the soft prompt to the input embeddings [42].

PEFT of SAM has already been applied in medical image segmentation. Chen et al. [43] proposed SAM-Adapter, which is demonstrated to be effective in camouflaged object detection, shadow detection, and polyp segmentation. Wu et al.
[44] also fine-tuned SAM with an adapter-based strategy on 19 medical image segmentation tasks, including CT, MRI, ultrasound, fundus, and dermoscopic images. Zhang et al. [45] applied a LoRA-based strategy to fine-tune SAM for multi-organ segmentation on the Synapse dataset, which is on par with the SOTA method. However, similar methods have not yet been applied in crack segmentation tasks. The most widely utilized PEFT technologies, adapter and LoRA, will be involved in this work.

3. Data preparation
3.1 Dataset for pre-training
The large labelled crack segmentation dataset collected by khanhha [46] is utilized in this study. The khanhha dataset is a union of multiple open-source sub-datasets, including CRACK500 [47], GAPs384 [48], CFD [49], AEL [50], CrackTree200 [51], CrackForest [49], and DeepCrack [52]. It has 9603 images for training and 1695 images for testing, with a resolution of 448×448. The dataset comes from various sources, including road surfaces, pavements, walls, bridges, and so on. The richness of sources allows the model trained on this dataset to perceive cracks of different working conditions and scales, thus having a certain degree of generalization. Of course, the inconsistent annotation thickness across sub-datasets can also have a certain negative impact on the model. Note that the annotation of the sub-dataset CrackTree200 is too fine compared to the other datasets, with crack mask widths of only one pixel. This inconsistent labelling strategy made it difficult for previous studies [53][54] to identify cracks in the CrackTree200 dataset, so this dataset is relabelled manually by experts. The relabelled CrackTree200 dataset will also be made available together with the following collected datasets.

3.2 Datasets for zero-shot
Two unique crack datasets are captured for evaluating the model's zero-shot performance, namely Road420 and Facade390. The pixel-level binary masks of all the collected datasets are obtained by expert annotation. All captured images and masks are converted into RGB and grayscale pictures, respectively, and down-sampled to a size of 448×448.

3.2.1 Road420
Road420 consists of 420 images of asphalt concrete and cement concrete road surfaces with cracks. The pictures contain a lot of interfering information, such as shadows, occlusions, road signs, vehicles, manhole covers, people, leaves, etc. Some small cracks become difficult for the naked eye to recognize after down-sampling. Some of the pictures are taken at night. The image semantics of some interfering factors have never appeared in the khanhha dataset, and the deliberate introduction of such interference makes zero-shot learning very challenging on this dataset. All the images are captured with an iPhone 14 Pro. Some representative sample images are shown in Figure 3.

Figure 3 Sample images of Road420.

3.2.2 Facade390
Facade390 is composed of cracks on the exterior walls and columns of buildings captured by UAV. Because the UAV must maintain a safe distance from the building during operation and may drift during hovering, the captured images can be blurry and some fine cracks may not be clearly visible. The identification of these cracks is susceptible to interference from various sources such as wall stains, peeling, water traces, shadows, paint, vegetation, construction joints, and other extraneous factors. The cracks in Facade390 are generally not structural cracks, but may bring risks such as water seepage. The UAV employed in this study is the DJI Mini 3. Some representative sample images are presented in Figure 4.

Figure 4 Sample images of Facade390.
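As a reference for how the raw captures are standardized, the preprocessing described at the beginning of Section 3.2 (3-channel images, grayscale binary masks, down-sampling to 448×448) can be sketched roughly as follows. This is an illustrative sketch assuming OpenCV; the function name and paths are hypothetical and not the authors' actual pipeline.

```python
import cv2

TARGET = 448  # every image and mask is down-sampled to 448x448

def preprocess_pair(img_path, mask_path, out_img, out_mask):
    """Resize an image/mask pair to 448x448; images stay 3-channel, masks grayscale."""
    img = cv2.imread(img_path, cv2.IMREAD_COLOR)       # BGR; convert to RGB when feeding a model
    mask = cv2.imread(mask_path, cv2.IMREAD_GRAYSCALE)
    img = cv2.resize(img, (TARGET, TARGET), interpolation=cv2.INTER_AREA)
    # nearest-neighbour keeps the annotation strictly binary after resizing
    mask = cv2.resize(mask, (TARGET, TARGET), interpolation=cv2.INTER_NEAREST)
    mask = (mask > 127).astype("uint8") * 255           # re-binarize to a 0/255 grayscale label
    cv2.imwrite(out_img, img)
    cv2.imwrite(out_mask, mask)
```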
3.2.3 Concrete3k
Concrete3k is a large ready-made dataset with 3000 image-label pairs of concrete cracks contributed by [53][55]; it is also leveraged for zero-shot evaluation.

4. Methodology
4.1 Overall architecture

Figure 5 The original overall architecture of SAM.

SAM is composed of three key components: an image encoder, a prompt encoder and a mask decoder, as shown in Figure 5. The latter two parts are much lighter than the image encoder.

4.1.1 ViT block
The ViT block is made up of two parts: window attention and MLP, as shown in Figure 6.

Figure 6 The architecture of ViT block.

First, the input patches x_p ∈ R^(H×W×C) undergo a window partition with window size w and are separated into N non-overlapping windows x ∈ R^(N×w×w×C), where N = HW/w^2. The window size is set to 14 here. Then, multi-head self-attention is carried out on x: x is divided along the channel dimension and fed into multiple attention heads. In the i-th head, the query Q_i, key K_i and value V_i are obtained via learnable linear layers:

Q_i / K_i / V_i = W_{Q_i/K_i/V_i} x_i + b_{Q_i/K_i/V_i}    (1)

A dot product is computed to calculate the similarity scores between Q and K, which is then divided by the square root of the dimension of K for scaling. After a learnable positional embedding is added, the result is normalized by a softmax activation function. The resulting attention weights are multiplied with V to obtain the output of each head:

\mathrm{Atten}_i = \mathrm{softmax}\left(\frac{Q_i K_i^{T}}{\sqrt{d_k}} + pos\right) V_i    (2)

The outputs of all heads are concatenated and the output of the attention layer x_a is obtained through a linear layer:

x_a = \mathrm{concat}(\mathrm{Atten}_1, \mathrm{Atten}_2, \ldots, \mathrm{Atten}_n) W + b    (3)

Finally, the windows of x_a are reorganized back into the original shape x_o ∈ R^(H×W×C).

The MLP is a multi-layer fully connected network that expands the original dimension four times and compresses it back, with a GELU activation [56] in between. The inputs of the window attention and the MLP are normalized through layer normalization [57]. At the same time, residual connections [58] are added to each module to achieve stable propagation of gradients in deep networks.
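A minimal PyTorch sketch of the window attention described above (Eqs. (1)-(3)) is given below for illustration. It is not the authors' implementation: the relative positional term pos is simplified to a single learnable additive bias, H and W are assumed divisible by the window size, and all names are illustrative.

```python
import torch
import torch.nn as nn

def window_partition(x, w):
    """(B, H, W, C) -> (B*H/w*W/w, w, w, C); assumes H and W are divisible by w."""
    B, H, W, C = x.shape
    x = x.view(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(-1, w, w, C)

def window_unpartition(windows, w, H, W):
    """Inverse of window_partition."""
    B = windows.shape[0] // (H // w * W // w)
    x = windows.view(B, H // w, W // w, w, w, -1).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(B, H, W, -1)

class WindowAttention(nn.Module):
    def __init__(self, dim, num_heads, window_size=14):
        super().__init__()
        self.num_heads, self.w = num_heads, window_size
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)       # W_Q, W_K, W_V in one layer (Eq. 1)
        self.proj = nn.Linear(dim, dim)          # concatenate heads + linear (Eq. 3)
        # simplified learnable additive position term standing in for "pos" in Eq. (2)
        self.pos = nn.Parameter(torch.zeros(num_heads, window_size ** 2, window_size ** 2))

    def forward(self, x):                        # x: (B, H, W, C)
        B, H, W, C = x.shape
        win = window_partition(x, self.w).reshape(-1, self.w * self.w, C)
        Bn, N, _ = win.shape
        qkv = self.qkv(win).reshape(Bn, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)     # each: (Bn, heads, N, C/heads)
        attn = (q @ k.transpose(-2, -1)) * self.scale + self.pos   # Eq. (2)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(Bn, N, C)
        out = self.proj(out)
        return window_unpartition(out.reshape(-1, self.w, self.w, C), self.w, H, W)
```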
4.1.2 Image encoder
The image encoder is an MAE [59] pre-trained ViT, and consists of a patch embedding layer, a learnable positional embedding, a ViT backbone and a neck. The patch embedding layer converts the input image into a small-sized, high-dimensional feature map through a 16×16 convolution with a stride of 16. The absolute positional embedding is added to each position of the feature map. The backbone is a stack of ViT blocks. Based on the size of the backbone, there are three options: ViT-H, ViT-L, and ViT-B. The embedding dimension, number of blocks and number of attention heads are 768, 12 and 12 for ViT-B; 1024, 24 and 16 for ViT-L; and 1280, 32 and 16 for ViT-H. In the neck, the image embedding is fed to a point-wise convolution and a 3×3 convolution to reduce the dimension to 256. Each convolution is immediately followed by a layer normalization. The output of the image encoder is a 16× downscaled image embedding of the original size with 256 dimensions.

4.1.3 Prompt encoder
The prompt encoder receives sparse (points, boxes and text) or dense (masks) prompts. However, in this task the segmentation object (crack) is fixed, so the prompt input is simplified to None. The default is a learnable embedding, which is added to each position of the image embedding.

4.1.4 Mask decoder
A set of learnable output tokens is concatenated with the sparse prompt embeddings, and the obtained tokens as well as the image embedding and its positional embedding are fed into a 2-layer two-way Transformer. The tokens are learnable vectors and interact with image features through the attention mechanism. Details of the two-way Transformer can be found in the original code. Within the two-way Transformer, the following steps are mainly performed: first, an 8-head self-attention is conducted on the input tokens; then, cross attention is performed from tokens to image embedding, where tokens are regarded as the query and the image embedding is treated as the key and value; after that, tokens are updated through an MLP with a hidden dimension of 2048 and a ReLU activation; at last, an image-to-token cross attention is carried out. The attention layers consist of 8 heads each. The channel dimension for the query, key, and value in the self-attention layers is set to 256, while the dimension for the cross-attention layers is set to 128. After each attention and MLP layer, a residual connection and a layer normalization are added. Before being fed into each attention layer, the original prompt tokens and image positional embedding are re-added to the queries and keys for a better memory of the prompt tokens' information. The updated tokens and image embedding output by the first layer of the two-way Transformer serve as the input of the second layer.

A two-layer transposed convolution with a stride of 2 and a kernel size of 2×2 up-samples the 1/16-size image embedding to 1/4 size. The channels of the transposed convolution layers are 64 and 32. A GELU activation is performed after each convolution and a layer normalization is added between them. The tokens are updated through a final cross attention with the image embedding. The dimension of the tokens is transformed to 32 through a 3-layer MLP, and the output of the MLP acts as a linear classifier with a shape of (num_class, 32), which predicts the mask foreground probability at each position of the image embedding. The shape of the image embedding is (32, H/4, W/4). Finally, the output low-resolution masks with a shape of (num_class, H/4, W/4) are obtained by a point-wise product. High-resolution masks can be derived through bilinear interpolation of the low-resolution masks.

Because the main parameters are concentrated in the ViT blocks, PEFT is conducted on them. The lightweight prompt encoder and mask decoder are also fine-tuned together [45].

Figure 7 The architecture of the proposed CrackSAM.

The general architecture of the fine-tuned SAM, CrackSAM, is illustrated in Figure 7. Note that in this architecture the bilinear interpolation of the low-resolution masks is placed before feeding them into the final classifier, to obtain more accurate masks of cracks. This change is also made to the other comparative models in subsequent sections.
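The parameter-freezing scheme of Figure 7 can be sketched as follows. This is an assumption-laden illustration, not the authors' code: it presumes a SAM-style model object with image_encoder, prompt_encoder and mask_decoder attributes, and that the injected delta parameters carry "adapter" or "lora" in their parameter names.

```python
def mark_trainable(sam_model):
    """Freeze the heavy image encoder; keep only the delta (adapter/LoRA) weights,
    the prompt encoder and the mask decoder trainable, as in Figure 7."""
    for name, p in sam_model.image_encoder.named_parameters():
        # delta parameters are assumed to contain "adapter" or "lora" in their names
        p.requires_grad = ("adapter" in name) or ("lora" in name)
    for p in sam_model.prompt_encoder.parameters():
        p.requires_grad = True
    for p in sam_model.mask_decoder.parameters():
        p.requires_grad = True
    trainable = sum(p.numel() for p in sam_model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in sam_model.parameters())
    print(f"trainable: {trainable / 1e6:.1f}M / total: {total / 1e6:.1f}M")
```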
4.2 Adapter

Figure 8 Fine-tune strategy of adapter.

The simple design of the adapter makes it the most commonly used method in the field of PEFT. There are many variants of adapter, which can be either sequential or parallel. In this work, an adapter is sequentially inserted behind the attention layer and another is inserted in parallel with the MLP [44], as shown in Figure 8. The GELU activation function is employed in the middle, as it provides smoother gradients compared to ReLU.

An adapter can be regarded as a smaller MLP designed with a bottleneck structure to reduce the parameter count. Initially, the adapter employs a down-projection linear layer with parameters W_down ∈ R^(d×m) to project the original d-dimensional features to a smaller dimension m. Subsequently, a non-linear activation function is applied, followed by an up-projection layer with parameters W_up ∈ R^(m×d) to restore the features to the d-dimensional space. Notably, a residual connection is incorporated in this process. The middle dimension m is constrained such that m ≪ d. Denoting the input as x and the output after adaptation as x', the transformation is formally expressed as Eq. (4):

x' = W_{up}\,\mathrm{GELU}(W_{down} x + b_{down}) + b_{up} + x    (4)

For a parallel adapter, there is no need for an additional residual connection inside the adapter branch, but a scaling factor s is required to control the extent of the adapter's update. Given an input x_m of the MLP and its adapted output x_m', the formula is as follows (Eq. (5)):

x_m' = s \cdot \left( W_{up}\,\mathrm{GELU}(W_{down}\,\mathrm{LN}(x_m) + b_{down}) + b_{up} \right) + \mathrm{MLP}(\mathrm{LN}(x_m)) + x_m    (5)

During fine-tuning, the weights of the attention layer and the MLP are frozen, and only the weights of the adapter are trained. When the middle dimension is a multiple of the input dimension, fine-tuning the adapter is equivalent to fine-tuning a newly added MLP.

4.3 Low-rank adaptation
LoRA is a reparameterization method that converts the original parameters in a neural network into a parameter-efficient form [40]. A neural network usually consists of a large number of full-rank matrix operations. When migrating to downstream tasks, LoRA assumes that the pre-trained model has a small intrinsic rank, and the updating of weights can be achieved on this small subspace. For a pre-trained weight matrix W_O ∈ R^(d×k), a bypass ΔW ∈ R^(d×k) is added to constrain the update of its weights, and ΔW is decomposed into the product of matrices A ∈ R^(d×r) and B ∈ R^(r×k) using low-rank decomposition, with the rank r ≪ min(d, k). As shown in Figure 9, for the original path y = W_O x, the updated result y' is:

y' = (W_O + \Delta W) x = W_O x + ABx    (6)

During fine-tuning, the weight matrix W_O is kept frozen, and only the matrices A and B are fine-tuned. Matrix A is initialized using random Gaussian initialization, and matrix B is initialized to 0 [16]. As a more general approach, LoRA can theoretically be added to any set of weights; in that limit, fine-tuning with LoRA becomes roughly equivalent to full fine-tuning. From a parameter-efficient perspective, LoRA is usually added to the attention weights, typically on the query and value components. Due to the adoption of low-rank decomposition, LoRA has fewer parameters compared to the adapter. Additionally, the parallel design of LoRA can reduce the inference latency caused by the sequential execution of the adapter.

Figure 9 Fine-tune strategy of LoRA.
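A minimal PyTorch sketch of the two delta modules (Eqs. (4)-(6)) is given below for illustration; the scaling factor s and the parallel-branch residual of Eq. (5) are left to the caller, and the class names are illustrative rather than the authors' code.

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter of Eq. (4); for the parallel variant (Eq. (5)) the caller
    applies the scaling factor s and adds the MLP branch and residual."""
    def __init__(self, dim, mid_dim=32):
        super().__init__()
        self.down = nn.Linear(dim, mid_dim)   # W_down, b_down
        self.act = nn.GELU()
        self.up = nn.Linear(mid_dim, dim)     # W_up, b_up

    def forward(self, x):
        return self.up(self.act(self.down(x))) + x   # residual connection

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer W_O with a low-rank bypass (Eq. (6)).
    Per the text above, A (r -> d) is Gaussian-initialized and B (k -> r) starts at zero."""
    def __init__(self, base: nn.Linear, rank=4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False           # W_O stays frozen
        self.B = nn.Linear(base.in_features, rank, bias=False)
        self.A = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.B.weight)         # bypass contributes nothing at step 0
        nn.init.normal_(self.A.weight, std=0.02)

    def forward(self, x):
        return self.base(x) + self.A(self.B(x))   # y' = W_O x + ABx
```

Wrapping the query and value projections of each ViT block with such a LoRA layer corresponds to the "qv" configuration examined in the ablation studies below.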
5. Experiment and results
5.1 Implementation details
The primary loss function for semantic segmentation is cross-entropy (Eq. (7)). However, due to the highly imbalanced nature of crack segmentation, the cross-entropy loss tends to converge rapidly to zero during training, thereby allowing the background to dominate the loss [35]. Hence, a more effective approach is to use a weighted combination of cross-entropy and Dice loss (Eq. (8)), as shown in Eq. (9):

L_{CE} = -\,y \log(\hat{y}) - (1 - y)\log(1 - \hat{y})    (7)

L_{Dice} = 1 - \frac{2\,|X \cap Y|}{|X| + |Y|}    (8)

L = \lambda L_{CE} + (1 - \lambda) L_{Dice}    (9)

In Eq. (7), y represents the ground-truth label, taking values of 0 or 1, and \hat{y} denotes the predicted probability. In Eq. (8), X corresponds to the mask region of the true labels, and Y represents the mask region of the predicted labels. The parameter λ in Eq. (9) serves as a weighting coefficient and is set to 0.2 in this study.

The learning rate is adjusted using a "poly" policy incorporating a warm-up strategy. Initially, for the first 300 iterations, the learning rate linearly increases from 0 to the initial learning rate of 0.0004. Subsequently, throughout the remaining iterations, the learning rate dynamically scales by multiplying with (1 - (iter - warm_up)/max_iter)^{power}, where power is set to 6. The maximum iteration limit is set to 140 epochs, with a batch size of 8. The model is optimized with AdamW, with parameters β1, β2, and weight_decay set to 0.9, 0.999, and 0.01, respectively. A threshold of 0.5 is chosen for mask binarization. Due to the relative abundance of the training data, only basic data augmentation techniques are employed, including random rotation and random flip. The training process is accelerated using automatic mixed precision and TensorFloat-32 [45]. Pre-trained weights of SAM are loaded and frozen before training. The best-performing checkpoint of the delta and head on the validation set is saved and selected for subsequent testing. At the inference stage, the final predicted mask is obtained by returning the index of the maximum value along the channel dimension of the masks. The model is built on the PyTorch framework and trained on a single 24 GB RTX 3090 GPU.

5.2 Evaluation metrics
Precision (Pr), recall (Re), F1-score (F1), and IoU are employed to evaluate the segmentation performance of the model, as defined in Eqs. (10)-(13):

\mathrm{Pr} = \frac{TP}{TP + FP}    (10)

\mathrm{Re} = \frac{TP}{TP + FN}    (11)

F1 = \frac{2 \cdot \mathrm{Pr} \cdot \mathrm{Re}}{\mathrm{Pr} + \mathrm{Re}}    (12)

\mathrm{IoU} = \frac{TP}{TP + FP + FN}    (13)

where true positive (TP) denotes pixels representing cracks that are correctly classified, false positive (FP) represents background pixels erroneously classified as cracks, and false negative (FN) indicates crack pixels misclassified as background.
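As an illustration of Eqs. (7)-(13), a compact PyTorch sketch of the weighted loss and the pixel-level metrics is given below. It uses a single-logit binary formulation for brevity, whereas the actual model outputs a (num_class, H, W) map followed by an argmax; the function names are illustrative.

```python
import torch
import torch.nn.functional as F

def combined_loss(logits, target, lam=0.2, eps=1e-6):
    """Weighted cross-entropy + Dice loss of Eqs. (7)-(9); logits and target are (B, H, W) floats."""
    prob = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, target)        # Eq. (7), averaged over pixels
    inter = (prob * target).sum(dim=(1, 2))
    dice = 1 - (2 * inter + eps) / (prob.sum(dim=(1, 2)) + target.sum(dim=(1, 2)) + eps)  # Eq. (8)
    return lam * ce + (1 - lam) * dice.mean()                       # Eq. (9), lambda = 0.2

def pixel_metrics(pred, target):
    """Pr, Re, F1 and IoU of Eqs. (10)-(13) for binary 0/1 masks."""
    tp = ((pred == 1) & (target == 1)).sum().item()
    fp = ((pred == 1) & (target == 0)).sum().item()
    fn = ((pred == 0) & (target == 1)).sum().item()
    pr = tp / (tp + fp + 1e-6)
    re = tp / (tp + fn + 1e-6)
    f1 = 2 * pr * re / (pr + re + 1e-6)
    iou = tp / (tp + fp + fn + 1e-6)
    return pr, re, f1, iou
```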
5.3 Ablation study
In this section, ablation studies are conducted on the parameter settings of the proposed architecture. For the adapter, it is necessary to study the size of the middle dimension and the scaling factor. For LoRA, the positions where LoRA is applied and the size of the rank should be investigated. The settings of these hyperparameters vary greatly for different downstream tasks. In addition, the impact of the size of the backbone and the combination of the two fine-tuning methods are also studied. The following experiments report the metrics Pr, Re, F1, and IoU (Eqs. (10)-(13)) when the well-trained crack segmentation model infers on the test set. Meanwhile, the model's generalization ability is evaluated by directly applying the pre-trained model to three new datasets (Road420, Facade390, and Concrete3k) in a zero-shot manner without any additional training, and measuring the IoU metric.

5.3.1 CrackSAM_Adapter
As shown in Table 1, introducing a few parameters is sufficient to achieve excellent transfer to the downstream task. Even when the middle dimension is set to 1, the model can still achieve decent precision. An interesting phenomenon is that increasing the middle dimension continuously improves the metrics on the test set, but blindly increasing parameters leads to a decrease in generalization. When the middle dimension is 32, the IoU on the test set only decreases by 0.3% compared to when the middle dimension is 64. However, there are improvements of 1.9%, 3.7%, and 4.8% when performing zero-shot learning on Road420, Facade390, and Concrete3k, respectively. Considering that the number of parameters in the adapter is directly proportional to the middle dimension, setting the middle dimension to 32 is more appropriate.

Table 1 Ablation study on the middle dimension of adapter. (Pr, Re, F1 and IoU are inference metrics on the test set; the last three columns are zero-shot IoU on the new datasets.)
Middle dimension | Pr | Re | F1 | IoU | Road420 | Facade390 | Concrete3k
dim=1  | 0.7554 | 0.7786 | 0.7515 | 0.6270 | 0.5310 | 0.4618 | 0.6743
dim=16 | 0.7664 | 0.7968 | 0.7696 | 0.6479 | 0.6139 | 0.4772 | 0.6461
dim=32 | 0.7676 | 0.7965 | 0.7704 | 0.6495 | 0.6149 | 0.4718 | 0.6718
dim=64 | 0.7674 | 0.8002 | 0.7719 | 0.6513 | 0.6033 | 0.4548 | 0.6412

Note that the model's IoU on the Facade390 dataset is relatively low. This is because Facade390 is mainly composed of cracks in building exterior wall materials, which are very fine compared to road cracks, while the masks in the training set are mostly coarse segment-wise annotations, such as the masks in the CRACK500 subset. Fine annotations are less prevalent in the training set, resulting in lower IoU during zero-shot on Facade390. In fact, a well-tuned CrackSAM model can accurately detect the majority of cracks in the test set (Re > 0.9). Given this, in the ablation experiments, priority is given to evaluating generalization ability based on the Road420 and Concrete3k datasets.

The scaling factor s is introduced to balance the task-agnostic features generated by the frozen backbone and the task-specific features generated by the tunable parallel adapters. As shown in Table 2, setting the scaling factor to 0.2 yields better performance in terms of generalization.

Table 2 Ablation study on the scaling factor of adapter. (Same column layout as Table 1.)
Scaling factor | Pr | Re | F1 | IoU | Road420 | Facade390 | Concrete3k
s=0.1 | 0.7671 | 0.7953 | 0.7693 | 0.6480 | 0.6042 | 0.4487 | 0.6597
s=0.2 | 0.7676 | 0.7965 | 0.7704 | 0.6495 | 0.6149 | 0.4718 | 0.6718
s=0.5 | 0.7716 | 0.7934 | 0.7706 | 0.6499 | 0.6090 | 0.4354 | 0.6635
s=1   | 0.7702 | 0.7958 | 0.7709 | 0.6500 | 0.6006 | 0.4313 | 0.6426
s=2   | 0.7751 | 0.7902 | 0.7707 | 0.6494 | 0.5981 | 0.4586 | 0.6548

5.3.2 CrackSAM_LoRA
Table 3 Ablation study on the rank of LoRA. (Pr, Re, F1 and IoU are inference metrics on the test set; the last three columns are zero-shot IoU on the new datasets.)
Rank | Pr | Re | F1 | IoU | Road420 | Facade390 | Concrete3k
r=1  | 0.7509 | 0.7941 | 0.7585 | 0.6352 | 0.6176 | 0.4494 | 0.6516
r=4  | 0.7620 | 0.7918 | 0.7639 | 0.6416 | 0.6222 | 0.4544 | 0.6798
r=8  | 0.7656 | 0.7925 | 0.7665 | 0.6448 | 0.6201 | 0.4601 | 0.6800
r=16 | 0.7657 | 0.7947 | 0.7687 | 0.6473 | 0.6200 | 0.4573 | 0.6727

According to Table 3, similar to the adapter, fine-tuning LoRA with a rank of 1 is already quite effective, and at this point the parameters of the LoRA component amount to only 0.16M. As the rank increases, the metrics on the test set continuously improve. The model's generalization reaches saturation when the rank is set to 4 or 8, while it decreases slightly when the rank is set to 16. The decrease in generalization caused by over-parameterization is similar to that of the adapter. Considering performance and cost, a rank of 4 or 8 is more reasonable.

The LoRA layer can be applied to the query, key, value, and output matrices in the attention layer. As shown in Table 3 and Table 4, when the rank is 8 and the LoRA layer is applied only to the query, even though the number of parameters is equivalent to the situation where the rank is 4 and LoRA is applied to both the query and value, the latter achieves higher metrics. This implies that the position of the LoRA layer is a crucial factor. When the rank is 8 and LoRA is applied to both query and value, the generalization ability on Road420 and Concrete3k improves by 6.9% and 3.6% compared to applying LoRA only to the query. Applying LoRA to all four matrices has a similar effect to excessively increasing the rank, resulting in a slight improvement in the metrics on the test set but a decrease in zero-shot capability. Therefore, it suffices to add LoRA to the query and value matrices alone.

Table 4 Ablation study on the weight type of LoRA. (Same column layout as Table 3.)
Weight type | Pr | Re | F1 | IoU | Road420 | Facade390 | Concrete3k
Wq             | 0.7489 | 0.7964 | 0.7575 | 0.6344 | 0.5800 | 0.5122 | 0.6562
Wq, Wv         | 0.7656 | 0.7925 | 0.7665 | 0.6448 | 0.6201 | 0.4601 | 0.6800
Wq, Wk, Wv, Wo | 0.7717 | 0.7887 | 0.7690 | 0.6476 | 0.6183 | 0.4602 | 0.6501

When comparing the adapter and LoRA, the former demonstrates slightly higher metrics on the test set, while the latter exhibits better generalization performance. Considering that LoRA has fewer parameters than the adapter, CrackSAM_LoRA is more recommended for engineering applications.

5.3.3 Combine two PEFT methods or use neither
Here, a comparison is made between using both PEFT methods simultaneously and using neither, i.e., employing only the traditional fine-tuning of the head (Figure 2(b)). Three different parameter scales of adapter and LoRA combinations are tested. According to Table 5, the effectiveness of combining PEFT methods depends on the parameter scale. When an adapter with a middle dimension of 32 and LoRA with a rank of 8 are added to the model simultaneously, the metrics improve slightly compared to the model with only LoRA of rank 8 shown in Table 4. However, considering the obvious increase in computational cost, there is no significant necessity in combining multiple PEFT approaches. When fine-tuning only the prompt encoder and mask decoder, there is a significant performance decline: the IoU metrics of the rank-8 query-and-value configuration in Table 4 on the test set and the three new datasets are approximately 15.9%, 28.5%, 7.3%, and 12.2% higher, respectively, than those of the head-only fine-tuning in Table 5.
This clearly demonstrates the superiority of PEFT over traditional fine-tuning methods, because completely freezing the backbone makes it more challenging for the model to extract semantic information related to cracks.

Table 5 Experimental results of combining two methods and using neither method. (Pr, Re, F1 and IoU are inference metrics on the test set; the last three columns are zero-shot IoU on the new datasets. Note: s = scaling factor; dim = middle dimension; qv = apply LoRA to query and value matrices; r = rank.)
Delta type | Pr | Re | F1 | IoU | Road420 | Facade390 | Concrete3k
No PEFT, only fine-tune head            | 0.6951 | 0.7188 | 0.6843 | 0.5564 | 0.4826 | 0.4288 | 0.6059
adapter(s=0.2, dim=8) + LoRA(qv, r=2)   | 0.7596 | 0.7959 | 0.7657 | 0.6438 | 0.6132 | 0.4560 | 0.6628
adapter(s=0.2, dim=16) + LoRA(qv, r=4)  | 0.7637 | 0.8005 | 0.7703 | 0.6488 | 0.6188 | 0.4639 | 0.6798
adapter(s=0.2, dim=32) + LoRA(qv, r=8)  | 0.7664 | 0.7959 | 0.7696 | 0.6485 | 0.6230 | 0.4862 | 0.6835

5.3.4 Size of backbone
According to Table 6, the size of the backbone has a significant impact on the fine-tuned model. In general, the larger the backbone, the more powerful the segmentation ability after fine-tuning. Progressing from ViT-B to ViT-L and then to ViT-H, the segmentation performance and generalization ability improve for both adapter and LoRA. This observation aligns with the scaling law of large language models. It may be attributed to the richer features extracted by the stronger backbone and the smaller intrinsic dimension it has. Under the same parameter configuration, fine-tuning becomes more effective for large-scale backbones. Therefore, in the other experiments, this paper adopts only ViT-H as the backbone of CrackSAM.

Table 6 Ablation study on the size of backbone. (Pr, Re, F1 and IoU are inference metrics on the test set; the last three columns are zero-shot IoU on the new datasets.)
Delta type | Backbone | Pr | Re | F1 | IoU | Road420 | Facade390 | Concrete3k
adapter | ViT-B | 0.7574 | 0.7920 | 0.7610 | 0.6379 | 0.5859 | 0.4680 | 0.6573
adapter | ViT-L | 0.7611 | 0.8004 | 0.7682 | 0.6464 | 0.6263 | 0.4700 | 0.6672
adapter | ViT-H | 0.7676 | 0.7965 | 0.7704 | 0.6495 | 0.6149 | 0.4718 | 0.6718
LoRA    | ViT-B | 0.7512 | 0.7823 | 0.7523 | 0.6286 | 0.5905 | 0.4787 | 0.6557
LoRA    | ViT-L | 0.7623 | 0.7849 | 0.7608 | 0.6379 | 0.6162 | 0.4862 | 0.6791
LoRA    | ViT-H | 0.7620 | 0.7918 | 0.7639 | 0.6416 | 0.6222 | 0.4544 | 0.6798

6. Comparison with SOTA models
Semantic segmentation models typically consist of three main components: backbone, neck, and head. The backbone plays a crucial role in extracting high-level and semantically rich features. The neck assists in fusing information across multiple scales, while the head can be viewed as a decoder, receiving multi-scale features from the backbone and neck. It achieves the desired mask through aggregation, up-sampling, and refinement.

Twelve other models that have performed well in the field of semantic segmentation are selected for comparative experiments with CrackSAM. These include VGG-UNet [46], Swin-UPerNet [25], MobileNet [60], UNet-FCN and UNet-PSPNet [17], ResNet-DeepLabV3+ [58], ViT-UPerNet [24], SegFormer [26], HRNet-FCN [61], and ResNet-PSPNet [62]. To ensure a fair comparison, the selected comparative models have significant differences in parameter quantity and include various types of image backbones, such as CNN-based UNet, ResNet-50, and ResNet-101, as well as attention-based architectures like Swin-T, ViT-B, and Mix Transformer (MiT-B5). One reason for choosing these models is their outstanding performance in prior crack segmentation tasks [27][28][53][63]. The architectures of the comparative models are configured using the settings from the OpenMMLab segmentation toolkit [64].
The main training configuration is closely aligned with that of CrackSAM. The maximum number of iterations is set to 200 epochs, and the initial learning rate is determined through multiple trial-and-error adjustments. Based on the idea of transfer learning, the parameters are initialized using the weights of baseline models pre-trained on large datasets such as Cityscapes [65].

Figure 10 The changes of F1-score of different models on the validation set with training epochs. Batch size is set to 8. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

As shown in Figure 10, fine-tuning a vision foundation model using PEFT requires fewer iterations to achieve convergence compared to training an entire, relatively small model. Consequently, the total training duration does not significantly increase. This study mainly evaluates the performance of CrackSAM from two perspectives: robustness and generalization.

6.1 Evaluation on datasets with artificial noise
To assess the robustness of the model, artificial noise is introduced into the test set. This paper primarily investigates the following two cases when introducing artificial noise:

Case 1: For the input image I, reduce its brightness and apply Gaussian blur:

I' = (I - bri) \otimes K    (14)

where bri represents the brightness reduction, ⊗ denotes the convolution operation, and K represents the Gaussian kernel. Specifically, the image is converted to the HSV color space and 50 is subtracted from the V channel to decrease the brightness. Then, a 2D Gaussian filter is used to smooth the image, with a kernel size of 9×9 and both directional standard deviations set to 0.

Case 2: Apply severe blur to the input image followed by down-sampling:

I' = (I \otimes K) \downarrow_S    (15)

where ↓ denotes the down-sampling operation and S is the scaling factor. Gaussian blur with a kernel size of 21×21 is applied to the image, which is then down-sampled to half its original size using cubic interpolation, followed by interpolation back to the original size.

Figure 11 Inference results of comparative experiments on the test set with artificial noise.

The experimental results are listed in Table 7, and some predicted masks are shown in Figure 11. Figure 11 (a)-(d) come from Case 1, simulating a dim environment. The remaining figures are from Case 2, representing a blurred situation. As shown in the figure, the proposed CrackSAM performs well in identifying various forms of cracks, such as linear (Figure 11 (c)), branched (Figure 11 (d)), and webbed (Figure 11 (b)) cracks. It is capable of predicting on different materials (asphalt, concrete), various structures (road surfaces, walls, etc.), and diverse crack thicknesses, brightness levels, and contrast ratios. Figure 11 (a) is a non-crack image. Neither CrackSAM_adapter nor CrackSAM_LoRA outputs any mask, while the other four comparative models identified construction joints and paint as cracks. It turns out that the crack classification task can be subsumed into the crack segmentation task, eliminating the need for a two-stage design: when the model does not return any crack mask, the image is a non-crack image. In extremely dark conditions (Figure 11 (c)), CrackSAM can still identify some cracks, although a few may be indistinguishable from the background. In extremely blurry situations (Figure 11 (d), (e), (h)), CrackSAM can find most cracks that are difficult for the naked eye to discern, while other models can hardly detect the cracks present in the image.
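For reference, the two noise cases defined above (Eqs. (14) and (15)) can be reproduced approximately with OpenCV as follows; the function names are illustrative and the parameters follow the description in the text.

```python
import cv2
import numpy as np

def add_noise_case1(img_bgr):
    """Case 1 (Eq. (14)): subtract 50 from the HSV V channel, then 9x9 Gaussian blur."""
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)
    h, s, v = cv2.split(hsv)
    v = np.clip(v.astype(np.int16) - 50, 0, 255).astype(np.uint8)
    dark = cv2.cvtColor(cv2.merge([h, s, v]), cv2.COLOR_HSV2BGR)
    return cv2.GaussianBlur(dark, (9, 9), 0)

def add_noise_case2(img_bgr):
    """Case 2 (Eq. (15)): 21x21 Gaussian blur, downsample to 1/2 with cubic
    interpolation, then interpolate back to the original size."""
    blurred = cv2.GaussianBlur(img_bgr, (21, 21), 0)
    h, w = blurred.shape[:2]
    small = cv2.resize(blurred, (w // 2, h // 2), interpolation=cv2.INTER_CUBIC)
    return cv2.resize(small, (w, h), interpolation=cv2.INTER_CUBIC)
```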
Figure 12 shows the IoU of different models under different Gaussian kernel sizes when Gaussian blur is added to the images in the test set. As shown in the figure, CrackSAM demonstrates much higher robustness than the other models. Despite the comparable performance of the compared models on the unprocessed test set, their IoU decreases significantly when Gaussian blur is added. When the kernel size is 25, CrackSAM's performance even surpasses that of ViT-B and Swin-B with a kernel size of 15.

Figure 12 Variation of IoU for different models under different Gaussian kernel sizes.

Table 7 reveals that almost all models achieved satisfactory results on the original test set (IoU ≥ 0.62), except for UNet-PSPNet, which performed relatively worse. Excluding UNet-PSPNet, the maximum gap between CrackSAM and the other 11 SOTA models is 4.6% on the original test set. However, significant differences emerge on the noisy test sets, with variations reaching 27.0% and 42.0% in the two cases. The classic UNet architecture performs poorly on severely blurred test sets, with UNet-FCN and UNet-PSPNet experiencing accuracy drops of 54.4% and 60.3% after adding severe blur, while CrackSAM_LoRA only experiences a 23.4% drop. Among all the models, CrackSAM_adapter stands out as the most accurate model on the test set, while CrackSAM_LoRA performs best on the noisy test sets. From this, it can be seen that CrackSAM's robustness is much better than that of traditional models.

6.2 Zero-shot performance on unseen datasets

Figure 13 Zero-shot results of comparative experiments on Road420.

As shown in Figure 13, the proposed model demonstrates satisfactory predictions at various scales, in various environments, and under different interferences. In Figure 13 (a), cracks captured from a distant view become challenging to discern for the naked eye after down-sampling, yet the AI models used in this experiment can still identify cracks in the image. This effectively showcases the superiority of deep learning algorithms in crack segmentation. For cracks captured at night, as shown in Figure 13 (c), (d), and (e), CrackSAM maintains highly accurate predictions, particularly in Figure 13 (e) where the cracks almost merge with shadows, yet CrackSAM still accurately identifies them. Other models, however, are obviously affected by shadows and road markings, resulting in numerous artifacts. In Figure 13 (b), CrackSAM correctly distinguishes between construction joints and cracks, but other models are misled by the sidewalk in the image. In Figure 13 (f), CrackSAM is not affected by occlusion from people and accurately outputs three segments of cracks, whereas other models either produce artifacts or fail to recognize all three segments. CrackSAM correctly segments the cracks in Figure 13 (g) with little influence from tire tracks on the road surface. In Figure 13 (h), the segmentation performance of CrackSAM remains unaffected by the presence of a cup and refracted light.

Figure 14 Zero-shot results of comparative experiments on Facade390.

Because annotation thickness affects the IoU during zero-shot on Facade390, resulting in generally lower IoU values, it is necessary to examine the segmentation result figures when conducting comparative analysis on this dataset. As depicted in Figure 14, the proposed model can effectively identify cracks in an automated and efficient manner when combined with a UAV, even when images are captured from different angles and distances.
In Figure 14 (a), other models struggle to distinguish between construction joints and cracks, whereas CrackSAM can. Figure 14 (b) and (c) showcase red building facades with peeling, where CrackSAM successfully segments small cracks while other models fail to do so. Figure 14 (d) illustrates walls with paint and peeling, where, due to severe interference, the comparative models cannot segment complete cracks, whereas CrackSAM provides results closest to the ground truth. Figure 14 (f) exhibits surface cracks on a column with damp stains and grass; CrackSAM's segmentation mask closely resembles the actual crack morphology. In Figure 14 (g), at the junction of a beam and column with tree shadows, CrackSAM identifies all four cracks, including the two small ones at the top of the image.

According to Table 7, similar to robustness, the generalization gap among different models is also substantial, reaching 42.9%, 33.0%, and 31.1% on the three new datasets. Lightweight networks like MobileNet-V3 and HRNet-FCN show decent zero-shot performance on Concrete3k but perform poorly on the interference-filled Road420, demonstrating the limited feature extraction capability of small models. Some larger models, such as ViT-B and Swin-B, perform similarly to CrackSAM on the test set, with acceptable performance on Road420, but their prediction accuracy is sensitive to noise. The widely used ResNet models demonstrate reasonable robustness, but their performance on the Road420 dataset is unexpectedly poor. Both fine-tuned versions of CrackSAM exhibit excellent cross-dataset generalization capabilities. CrackSAM_LoRA is the best-performing model during zero-shot on the Road420 and Concrete3k datasets, while CrackSAM_adapter performs better on Facade390. Among the twelve SOTA models, SegFormer is the best in both robustness and generalization. However, the proposed CrackSAM achieves a significant improvement in IoU compared to SegFormer on the two noisy test sets and the three new datasets, with increases of up to 11.1%, 10.8%, 7.0%, 2.1%, and 4.1%, indicating a notable performance boost. Table 7 also illustrates the significance of studying the robustness and generalization of crack segmentation models, as it truly impacts the models' feasibility for real-world applications.

In summary, through comparative experiments, the proposed CrackSAM has achieved the best results in terms of test set accuracy, robustness, and zero-shot performance. The robust generalization ability of CrackSAM is primarily attributed to the power of the large backbone and the effectiveness of the PEFT method. This is evident from the comparative and ablation experiments, as CrackSAM does not perform as outstandingly when ViT-H and PEFT are not utilized.

Table 7 Comparison results on noisy datasets and unseen datasets with other SOTA models. (Values are IoU. Noisy test set 1: -50 bri + blur(k=9); noisy test set 2: ×1/2 + blur(k=21); Road420, Facade390 and Concrete3k are zero-shot. Note: bri = brightness; k = kernel size.)
Model | Backbone | Parameters | Test set | Noisy test set 1 | Noisy test set 2 | Road420 | Facade390 | Concrete3k
CrackSAM_adapter (dim=32, s=0.2) | ViT-H | 641.9M (Tunable 9.1M) | 0.6495 | 0.5466 | 0.4763 | 0.6149 | 0.4718 | 0.6718
CrackSAM_LoRA (qv, rank=4) | ViT-H | 637.2M (Tunable 4.4M) | 0.6416 | 0.5782 | 0.4915 | 0.6222 | 0.4544 | 0.6798
VGG-UNet | VGG16 | 53.91M | 0.6419 | 0.4337 | 0.3472 | 0.5126 | 0.4547 | 0.5152
Swin-UPerNet | Swin-T | 58.9M | 0.6199 | 0.4745 | 0.3963 | 0.4628 | 0.4065 | 0.4778
Swin-UPerNet | Swin-B | 120.0M | 0.6428 | 0.5003 | 0.3857 | 0.5262 | 0.4593 | 0.5655
MobileNet-V3 | MobileNet-V3 | 3.28M | 0.6208 | 0.5068 | 0.3738 | 0.4447 | 0.4322 | 0.6154
UNet-FCN | UNet | 28.99M | 0.6255 | 0.4218 | 0.2852 | 0.4531 | 0.3677 | 0.4682
UNet-PSPNet | UNet | 28.97M | 0.5594 | 0.3535 | 0.2222 | 0.3555 | 0.3163 | 0.5262
ResNet-DeepLabV3+ | ResNet-101 | 60.2M | 0.6402 | 0.5115 | 0.4088 | 0.3827 | 0.4399 | 0.5791
ResNet-DeepLabV3+ | ResNet-50 | 41.2M | 0.6395 | 0.5077 | 0.3947 | 0.3918 | 0.4402 | 0.5601
ResNet-PSPNet | ResNetV1c-101 | 65.59M | 0.6346 | 0.5110 | 0.4207 | 0.4084 | 0.4327 | 0.5544
ViT-UPerNet | ViT-B | 142.1M | 0.6328 | 0.4714 | 0.3554 | 0.5171 | 0.4276 | 0.6027
SegFormer | MiT-B5 | 82.0M | 0.6484 | 0.5204 | 0.4436 | 0.5817 | 0.4622 | 0.6533
HRNet-FCN | HRNet-W18 | 9.63M | 0.6356 | 0.5055 | 0.4434 | 0.4172 | 0.4214 | 0.6322

Figure 15 Some prediction situations with low IoU. (a) Wrong classification. (b) Thicker prediction mask. (c) Controversial annotations subject to subjective judgements.

Issues affecting the model's accuracy can generally be categorized into three situations. In the first scenario, the model fails to recognize the semantics of a particular object, as illustrated in Figure 15 (a); this reflects a deficiency of the model. The second situation arises when the model correctly identifies a crack and successfully outputs its mask, but the output mask is much coarser than the annotation, so Precision and IoU are lower while Recall is very high, as depicted in Figure 15 (b). This situation does not impact the normal recognition of cracks and is considered acceptable. The third type is that annotators, when providing high-quality annotations on high-resolution images, may label not only the main crack but also adjacent minor cracks and defects. Whether these tiny defects should be annotated depends on the annotator's subjective judgment (Figure 15 (c)). After down-sampling, information about these minor defects is severely lost, rendering them unidentifiable. This situation often occurs in the annotation of asphalt road cracks, where asphalt and cracks have similar brightness and contrast, leading to ambiguous situations.

7. Conclusions
This paper fine-tuned the Segment Anything Model using PEFT methods for crack segmentation. The proposed CrackSAM was pre-trained on over 11k images. Two new labelled datasets comprising 810 images were collected with a smartphone and a UAV for zero-shot evaluation. The main conclusions of this paper are as follows:

(1) PEFT technology was utilized, with SAM's image encoder frozen and trainable delta modules (adapter and LoRA) introduced on the ViT backbone. Fine-tuning was applied to the head and the delta. The pre-trained vision foundation model can thereby be introduced into crack segmentation effectively.

(2) The proposed CrackSAM based on PEFT improved the IoU score greatly compared to the traditional method of only fine-tuning the head. The fine-tuning of CrackSAM followed the scaling law, where using ViT-H as the backbone instead of ViT-B resulted in additional performance gains. The combination of the ViT-H backbone and the PEFT method is the main reason for the successful performance of CrackSAM.

(3) Excessive over-parameterization can possibly enhance performance on the test set but may not necessarily generalize to other datasets.
Increasing the middle dimension of the adapter, raising the rank of LoRA, or applying LoRA at more positions often came with an increase in computational cost and a simultaneous decrease in generalization. Therefore, the design of fine-tuning entails a trade-off between complexity, performance, and generalization.

(4) CrackSAM worked exceptionally well in cross-scale and cross-scenario situations, exhibiting strong robustness and generalization capabilities. In the evaluation on two artificially introduced noise scenarios and three previously unseen datasets, CrackSAM demonstrated a great improvement in IoU compared to the twelve SOTA models, ranging from 11.1% - 63.6%, 10.8% - 121.2%, 7.0% - 75.0%, 2.1% - 49.2%, and 4.1% - 45.2%, respectively.

(5) Satisfactory results were achieved on the test set by almost all models, but there was a significant difference in generalization. In complex environments with severe interference, noticeable advantages were demonstrated by CrackSAM. Considering the various factors that may affect crack segmentation in real-world deployment, it is essential to study the robustness and zero-shot capabilities of a newly proposed architecture.

Considering that the lack of large benchmark datasets in the field of crack segmentation constrains the performance of AI models and their practical applications, the authors call for more open-source efforts and the establishment of a large-scale crack segmentation dataset with a unified standard. If deploying a lighter network is necessary, it is recommended to employ knowledge distillation to train a lightweight model with guidance from CrackSAM. Furthermore, other powerful vision foundation models, such as the self-supervised DINOv2 [33], can be leveraged to detect cracks. Such techniques can also be extended to segment other types of structural defects, enabling "segment everything" in the field of SHM. These will be the focus of future work.

8. Declarations
8.1. Funding
The authors gratefully acknowledge the financial support provided by the National Natural Science Foundation of China, Grant No. 52308179.
8.2. Conflicts of interest
There are no conflicts of interest for this paper.
8.3. Data availability
All the utilized models, pre-trained weights, and the labelled datasets will be publicly available on https://github.com/KG-TSI-Civil/CrackSAM after acceptance.

References
[1] Zawad, Md Rahat Shahriar, et al. "A comparative review of image processing based crack detection techniques on civil engineering structures." Journal of Soft Computing in Civil Engineering 5.3 (2021): 58-74.
[2] Wan, Kai Tai, and Christopher KY Leung. "Applications of a distributed fiber optic crack sensor for concrete structures." Sensors and Actuators A: Physical 135.2 (2007): 458-464.
[3] Aggelis, D. G., et al. "Combined use of thermography and ultrasound for the characterization of subsurface cracks in concrete." Construction and Building Materials 24.10 (2010): 1888-1897.
[4] Tashan, Jawdat, and R. Al-Mahaidi. "Detection of cracks in concrete strengthened with CFRP systems using infra-red thermography." Composites Part B: Engineering 64 (2014): 116-125.
[5] Azimi, Mohsen, Armin Dadras Eslamlou, and Gokhan Pekcan. "Data-driven structural health monitoring and damage detection through deep learning: State-of-the-art review." Sensors 20.10 (2020): 2778.
[6] Wang, Long, and Zijun Zhang. "Automatic detection of wind turbine blade surface cracks based on UAV-taken images." IEEE Transactions on Industrial Electronics 64.9 (2017): 7293-7303.
8. Declarations

8.1. Funding
The authors gratefully acknowledge the financial support provided by the National Natural Science Foundation of China, Grant No. 52308179.

8.2. Conflicts of interest
There are no conflicts of interest for this paper.

8.3. Data availability
All the utilized models, pre-trained weights, and the labelled datasets will be publicly available at https://github.com/KG-TSI-Civil/CrackSAM after acceptance.

References
[1] Zawad, Md Rahat Shahriar, et al. "A comparative review of image processing based crack detection techniques on civil engineering structures." Journal of Soft Computing in Civil Engineering 5.3 (2021): 58-74.
[2] Wan, Kai Tai, and Christopher KY Leung. "Applications of a distributed fiber optic crack sensor for concrete structures." Sensors and Actuators A: Physical 135.2 (2007): 458-464.
[3] Aggelis, D. G., et al. "Combined use of thermography and ultrasound for the characterization of subsurface cracks in concrete." Construction and Building Materials 24.10 (2010): 1888-1897.
[4] Tashan, Jawdat, and R. Al-Mahaidi. "Detection of cracks in concrete strengthened with CFRP systems using infra-red thermography." Composites Part B: Engineering 64 (2014): 116-125.
[5] Azimi, Mohsen, Armin Dadras Eslamlou, and Gokhan Pekcan. "Data-driven structural health monitoring and damage detection through deep learning: State-of-the-art review." Sensors 20.10 (2020): 2778.
[6] Wang, Long, and Zijun Zhang. "Automatic detection of wind turbine blade surface cracks based on UAV-taken images." IEEE Transactions on Industrial Electronics 64.9 (2017): 7293-7303.
[7] Liu, Yu-Fei, et al. "Image-based crack assessment of bridge piers using unmanned aerial vehicles and three-dimensional scene reconstruction." Computer-Aided Civil and Infrastructure Engineering 35.5 (2020): 511-529.
[8] Gopalakrishnan, Kasthurirangan, et al. "Crack damage detection in unmanned aerial vehicle images of civil infrastructure using pre-trained deep learning model." Int. J. Traffic Transp. Eng 8.1 (2018): 1-14.
[9] Sinha, Sunil K., and Paul W. Fieguth. "Automated detection of cracks in buried concrete pipe images." Automation in Construction 15.1 (2006): 58-72.
[10] Subirats, Peggy, et al. "Automation of pavement surface crack detection using the continuous wavelet transform." 2006 International Conference on Image Processing. IEEE, 2006.
[11] Zhang, Lei, et al. "Road crack detection using deep convolutional neural network." 2016 IEEE international conference on image processing (ICIP). IEEE, 2016.
[12] Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems 30 (2017).
[13] Chen, Junjie, Weisheng Lu, and Jinfeng Lou. "Automatic concrete defect detection and reconstruction by aligning aerial images onto semantic-rich building information model." Computer-Aided Civil and Infrastructure Engineering 38.8 (2023): 1079-1098.
[14] Kondo, Yuki, and Norimichi Ukita. "Joint Learning of Blind Super-Resolution and Crack Segmentation for Realistic Degraded Images." arXiv preprint arXiv:2302.12491 (2023).
[15] Houlsby, Neil, et al. "Parameter-efficient transfer learning for NLP." International Conference on Machine Learning. PMLR, 2019.
[16] Hu, Edward J., et al. "LoRA: Low-rank adaptation of large language models." arXiv preprint arXiv:2106.09685 (2021).
[17] Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-net: Convolutional networks for biomedical image segmentation." Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. Springer International Publishing, 2015.
[18] Lin, Tsung-Yi, et al. "Feature pyramid networks for object detection." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
[19] Zhao, Hengshuang, et al. "Pyramid scene parsing network." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
[20] Chen, Liang-Chieh, et al. "Encoder-decoder with atrous separable convolution for semantic image segmentation." Proceedings of the European conference on computer vision (ECCV). 2018.
[21] Zhang, Lingxin, Junkai Shen, and Baijie Zhu. "A research on an improved Unet-based concrete crack detection algorithm." Structural Health Monitoring 20.4 (2021): 1864-1879.
[22] Ren, Yupeng, et al. "Image-based concrete crack detection in tunnels using deep fully convolutional networks." Construction and Building Materials 234 (2020): 117367.
[23] Dais, Dimitris, et al. "Automatic crack classification and segmentation on masonry surfaces using convolutional neural networks and transfer learning." Automation in Construction 125 (2021): 103606.
[24] Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).
[25] Liu, Ze, et al. "Swin transformer: Hierarchical vision transformer using shifted windows." Proceedings of the IEEE/CVF international conference on computer vision. 2021.
[26] Xie, Enze, et al. "SegFormer: Simple and efficient design for semantic segmentation with transformers." Advances in Neural Information Processing Systems 34 (2021): 12077-12090.
[27] Shamsabadi, Elyas Asadi, et al. "Vision transformer-based autonomous crack detection on asphalt and concrete surfaces." Automation in Construction 140 (2022): 104316.
[28] Guo, Feng, et al. "Pavement crack detection based on transformer network." Automation in Construction 145 (2023): 104646.
[29] Bommasani, Rishi, et al. "On the opportunities and risks of foundation models." arXiv preprint arXiv:2108.07258 (2021).
[30] Kirillov, Alexander, et al. "Segment anything." arXiv preprint arXiv:2304.02643 (2023).
[31] Wang, Xinlong, et al. "SegGPT: Segmenting everything in context." arXiv preprint arXiv:2304.03284 (2023).
[32] Zou, Xueyan, et al. "Segment everything everywhere all at once." arXiv preprint arXiv:2304.06718 (2023).
[33] Oquab, Maxime, et al. "DINOv2: Learning robust visual features without supervision." arXiv preprint arXiv:2304.07193 (2023).
[34] Ahmadi, Mohsen, et al. "Application of segment anything model for civil infrastructure defect assessment." arXiv preprint arXiv:2304.12600 (2023).
[35] Zhou, Zhong, Junjie Zhang, and Chenjie Gong. "Hybrid semantic segmentation for tunnel lining cracks based on Swin Transformer and convolutional neural network." Computer-Aided Civil and Infrastructure Engineering (2023).
[36] Deng, Jia, et al. "Imagenet: A large-scale hierarchical image database." 2009 IEEE conference on computer vision and pattern recognition. IEEE, 2009.
[37] Lau, Stephen LH, et al. "Automated pavement crack segmentation using u-net-based convolutional neural network." IEEE Access 8 (2020): 114892-114899.
[38] Gao, Yuqing, et al. "Multiattribute multitask transformer framework for vision-based structural health monitoring." Computer-Aided Civil and Infrastructure Engineering (2023).
[39] Wang, Wei, et al. "A survey of zero-shot learning: Settings, methods, and applications." ACM Transactions on Intelligent Systems and Technology (TIST) 10.2 (2019): 1-37.
[40] Ding, Ning, et al. "Parameter-efficient fine-tuning of large-scale pre-trained language models." Nature Machine Intelligence 5.3 (2023): 220-235.
[41] Li, Xiang Lisa, and Percy Liang. "Prefix-tuning: Optimizing continuous prompts for generation." arXiv preprint arXiv:2101.00190 (2021).
[42] Lester, Brian, Rami Al-Rfou, and Noah Constant. "The power of scale for parameter-efficient prompt tuning." arXiv preprint arXiv:2104.08691 (2021).
[43] Chen, Tianrun, et al. "SAM Fails to Segment Anything?--SAM-Adapter: Adapting SAM in Underperformed Scenes: Camouflage, Shadow, and More." arXiv preprint arXiv:2304.09148 (2023).
[44] Wu, Junde, et al. "Medical SAM Adapter: Adapting segment anything model for medical image segmentation." arXiv preprint arXiv:2304.12620 (2023).
[45] Zhang, Kaidong, and Dong Liu. "Customized segment anything model for medical image segmentation." arXiv preprint arXiv:2304.13785 (2023).
[46] Khanhha, n.d. Khanhha/crack_segmentation. GitHub. URL https://github.com/khanhha/crack_segmentation#Dataset (accessed 11.9.23).
[47] Yang, Fan, et al. "Feature pyramid and hierarchical boosting network for pavement crack detection." IEEE Transactions on Intelligent Transportation Systems 21.4 (2019): 1525-1535.
[48] Eisenbach, Markus, et al. "How to get pavement distress detection ready for deep learning? A systematic approach." 2017 international joint conference on neural networks (IJCNN). IEEE, 2017.
[49] Shi, Yong, et al. "Automatic road crack detection using random structured forests." IEEE Transactions on Intelligent Transportation Systems 17.12 (2016): 3434-3445.
[50] Amhaz, Rabih, et al. "Automatic crack detection on two-dimensional pavement images: An algorithm based on minimal path selection." IEEE Transactions on Intelligent Transportation Systems 17.10 (2016): 2718-2729.
[51] Zou, Qin, et al. "CrackTree: Automatic crack detection from pavement images." Pattern Recognition Letters 33.3 (2012): 227-238.
[52] Liu, Yahui, et al. "DeepCrack: A deep hierarchical feature learning architecture for crack segmentation." Neurocomputing 338 (2019): 139-153.
[53] Li, Yongshang, et al. "Real-time high-resolution neural network with semantic guidance for crack segmentation." Automation in Construction 156 (2023): 105112.
[54] Tabernik, Domen, Matic Šuc, and Danijel Skočaj. "Automated detection and segmentation of cracks in concrete surfaces using joined segmentation and classification deep neural network." Construction and Building Materials 408 (2023): 133582.
[55] Wang, Wenjun, and Chao Su. "Automatic concrete crack segmentation model based on transformer." Automation in Construction 139 (2022): 104275.
[56] Hendrycks, Dan, and Kevin Gimpel. "Gaussian error linear units (GELUs)." arXiv preprint arXiv:1606.08415 (2016).
[57] Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E. Hinton. "Layer normalization." arXiv preprint arXiv:1607.06450 (2016).
[58] He, Kaiming, et al. "Identity mappings in deep residual networks." Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14. Springer International Publishing, 2016.
[59] He, Kaiming, et al. "Masked autoencoders are scalable vision learners." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.
[60] Howard, Andrew, et al. "Searching for mobilenetv3." Proceedings of the IEEE/CVF international conference on computer vision. 2019.
[61] Sun, Ke, et al. "Deep high-resolution representation learning for human pose estimation." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019.
[62] He, Tong, et al. "Bag of tricks for image classification with convolutional neural networks." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019.
[63] Kulkarni, Shreyas, et al. "CrackSeg9k: a collection and benchmark for crack segmentation datasets and frameworks." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022.
[64] Chen, Kai, et al. "MMDetection: Open mmlab detection toolbox and benchmark." arXiv preprint arXiv:1906.07155 (2019).
[65] Cordts, Marius, et al. "The cityscapes dataset for semantic urban scene understanding." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.