Adventures of Trustworthy Vision-Language Models: A Survey

Mayank Vatsa, Anubhooti Jain, Richa Singh
IIT Jodhpur, India
mvatsa@iitj.ac.in, jain.44@iitj.ac.in, richa@iitj.ac.in

arXiv:2312.04231v1 [cs.CV] 7 Dec 2023

Abstract

Recently, transformers have become incredibly popular in computer vision and vision-language tasks. This notable rise in their usage can be primarily attributed to the capabilities offered by attention mechanisms and the outstanding ability of transformers to adapt and apply themselves to a variety of tasks and domains. Their versatility and state-of-the-art performance have established them as indispensable tools for a wide array of applications. However, in the constantly changing landscape of machine learning, the assurance of the trustworthiness of transformers holds utmost importance. This paper conducts a thorough examination of vision-language transformers, employing three fundamental principles of responsible AI: Bias, Robustness, and Interpretability. The primary objective of this paper is to delve into the intricacies and complexities associated with the practical use of transformers, with the overarching goal of advancing our comprehension of how to enhance their reliability and accountability.

Introduction

Inspired by their performance on language-based tasks (Vaswani et al. 2017; Devlin et al. 2019), transformers were proposed for vision-based tasks, where they process images as patch tokens (Dosovitskiy et al. 2021). Even with the modality change, the basic architecture remained the same. These architectures were further extended to accommodate both modalities, giving birth to transformer-based vision-language models (Figure 1). Their self-attention module makes convolutions unnecessary, with (Park and Kim 2022) stating that multi-head self-attention acts as a low-pass filter while convolutions act as high-pass filters. Their impressive success has been attributed to their ability to model long-range dependencies and their weak inductive biases, leading to better generalization. (Long et al. 2022) discusses a general architecture for Vision-Language Pre-trained Models (VLPMs), breaking the architecture into four categories, namely, Vision-Language Raw Input Data, Vision-Language Representation, Vision-Language Interaction Model, and Vision-Language Representation. (Long et al. 2022; Du et al. 2022; Fields and Kennington 2023) survey VLPMs based on their architecture, pre-training tasks and objectives, and downstream tasks, showcasing that VLPMs continue to grow not only in terms of accuracy but size as well, as the newer models have parameters in billions and can perform several tasks with human-like accuracy.

Figure 1: An example of a vision-language model pre-trained using Cross-Modal Vision-Language Modeling and finetuned for Visual Question Answering.

As shown in Figure 2, compared to 2018, there has been a big surge in articles about "vision-language transformer" in 2022, nearly 9.5 times more, and an even larger increase, nearly 12.5 times more, in 2021.
A similar trend is seen with the term 'vision transformer,' with roughly 15 times more articles in 2022 compared to 2018 and an astounding approximately 21 times more in 2021.

Figure 2: Keyword Analysis for research papers pertaining to two keywords, 'vision-language transformer' (red) and 'vision transformer' (blue), from 2018 to 2022.

Many of these models are trained on heavy open-web datasets and are finetuned for different tasks ranging from classification-based to generative-based. (Ross, Katz, and Barbu 2021; Birhane, Prabhu, and Kahembwe 2021; Srinivasan and Bisk 2022) have shown that these heavy and high-performing models suffer from different biases like gender and cultural bias. A detailed review of one of the vision-language transformers by (Srinivasan and Bisk 2022) depicts gender bias, with purse being the preferred term for the female gender while briefcase being the preferred term for the male gender. Just like bias, cases can be made for robustness and interpretability, reiterating the need for a proper study of transformer models. Efforts have been made to study transformers in this light for vision and language-based models individually, but collectively, there are only a few studies so far. Hence, we present an extensive survey of these VLPMs from a dependability and trust point-of-view by curating different practices, methods, and models proposed for VLPMs, first expanding on bias, followed by robustness, and finally, interpretability. In the end, we also discuss open challenges in the field. With this study, we hope to present the current state of VLPMs regarding reliability and highlight some research gaps that can help improve the overall state of VLPMs.

An Overview of VLPMs

In VLPMs, both single and dual architecture models have emerged as powerful tools. Here, we present a brief overview of these architectures and various pre-training and downstream tasks.

Single and Dual Architectures: While VLPMs have their own different architectures, they can be broadly categorized into two types of architectures (Figure 3). Single-stream models fuse both modalities early on with a single transformer using joint cross-modal encoding, like the VisualBERT (Li et al. 2019) and ViLT (Kim, Son, and Kim 2021) models. Dual-stream models, on the other hand, process the two modalities separately before modeling them jointly, like the ViLBERT (Lu et al. 2019) and LXMERT (Tan and Bansal 2019) models. VLPMs can also be divided on the basis of how visual features are extracted, like region features, usually pulled from object detectors, used by models like ViLBERT (Lu et al. 2019), grid features used by models like Pixel-BERT (Huang et al. 2020), or patch projections used by models like ViLT (Kim, Son, and Kim 2021).

Figure 3: Generic single and dual-stream architectures for pre-trained vision-language transformer models. The tokens represented in the figure include the positional embeddings.
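To make the two fusion patterns concrete, the following is a minimal, illustrative PyTorch sketch, not the architecture of any specific model mentioned above; the embedding size, layer counts, head counts, and token counts are assumptions chosen only for the example.

```python
# Minimal sketch contrasting single-stream and dual-stream fusion over
# already-embedded text and vision tokens. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

d_model = 768

class SingleStreamFusion(nn.Module):
    """Concatenate text and vision tokens and run one joint transformer."""
    def __init__(self, depth=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, text_tokens, vision_tokens):
        joint = torch.cat([text_tokens, vision_tokens], dim=1)  # (B, Lt+Lv, D)
        return self.encoder(joint)

class DualStreamFusion(nn.Module):
    """Encode each modality separately, then exchange information via cross-attention."""
    def __init__(self, depth=2):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(make_layer(), num_layers=depth)
        self.vision_encoder = nn.TransformerEncoder(make_layer(), num_layers=depth)
        self.text_to_vision = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.vision_to_text = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

    def forward(self, text_tokens, vision_tokens):
        t = self.text_encoder(text_tokens)
        v = self.vision_encoder(vision_tokens)
        # Queries from one modality attend to keys/values of the other.
        t_fused, _ = self.vision_to_text(t, v, v)
        v_fused, _ = self.text_to_vision(v, t, t)
        return t_fused, v_fused

if __name__ == "__main__":
    text = torch.randn(2, 16, d_model)     # 16 text tokens per example
    vision = torch.randn(2, 49, d_model)   # 49 image patch tokens per example
    print(SingleStreamFusion()(text, vision).shape)   # (2, 65, 768)
    t, v = DualStreamFusion()(text, vision)
    print(t.shape, v.shape)                            # (2, 16, 768) (2, 49, 768)
```

The single-stream variant lets one encoder attend across both modalities jointly, while the dual-stream variant keeps separate encoders and exchanges information through cross-attention, mirroring the two families described above.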
Pre-training Tasks: Pre-training has been found to be very beneficial for transformers and, by extension, for VLPMs. The models are pre-trained on large datasets to solve different pre-training tasks in a supervised or self-supervised fashion. VLPMs generally use image-caption pairs for pre-training, drawn from paired as well as unpaired open-web datasets, depending on the pre-training task. One of the most common pre-training tasks in language models is Masked Language Modeling, and it can be easily mapped to Cross-Modal Masked Language Modeling in the vision-language domain as well. The task is generally used in a self-supervised setting where some tokens are masked randomly, and the goal is to predict the masked tokens. Another common task is Cross-Modal Masked Region Modeling, where tokens are masked out in the visual sequence. Cross-modal alignment is a task where the goal is to pair image and text, also known as Image-Text Matching (ITM). Cross-modal Contrastive Learning is another pre-training task quite similar to ITM but in a contrastive manner, in that matched image-text pairs are pushed together and non-matched pairs are pushed apart using a contrastive loss. The large datasets used for pre-training have been considered to be a cause of bias (Park and Choi 2022; Radford et al. 2021).
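As an illustration of the contrastive objective just described, here is a hedged sketch of a CLIP-style image-text contrastive loss; the embedding dimension, batch size, and temperature value are assumptions for the example rather than settings from any particular model.

```python
# Sketch of an image-text contrastive objective: matched pairs lie on the
# diagonal of the similarity matrix and are pulled together, while all other
# pairings in the batch act as negatives.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (B, D) pooled embeddings of paired images and captions."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature    # (B, B) similarity matrix
    targets = torch.arange(image_emb.size(0))          # i-th image matches i-th caption
    loss_i2t = F.cross_entropy(logits, targets)        # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)    # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)

# Random embeddings stand in for encoder outputs in this example.
loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```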
Downstream Tasks: Once VLPMs are pre-trained, they are finetuned to perform specific downstream tasks such as Image Captioning, Visual Question Answering, Image-Text Retrieval, Natural Language for Visual Reasoning, and Visual Commonsense Reasoning. Broadly, the tasks can be categorized as generative, classification, and retrieval tasks. Task-specific datasets are used for finetuning the model, where the heads of the VLPMs are modified based on the downstream task. VLPMs have shown impressive accuracy with these tasks. The learned representation helps finetune the model for specified tasks quickly, especially with the rich information flowing between the two modalities. We can draw two important observations from this overview of VLPMs:

• The architecture of VLPMs differs significantly from CNNs. Consequently, it is crucial to develop methods specifically tailored to the VLPM architecture rather than merely extending approaches originally designed for CNNs. This ensures a more accurate and equitable evaluation of their performance.

• Most recent VLPMs undergo training on datasets derived from the open web, which is a combination of various sources. This amalgamation raises concerns about the potential incorporation of biases present in the content from the open web into the models themselves (Mittal et al. 2023).

Table 1: Summarizing research studies that have proposed different bias metrics.

| Bias | Study | Models Under Review | Bias Metric |
| Social Bias (Gender and Race) | (Ross, Katz, and Barbu 2021) | ViLBERT, VisualBERT | Grounded SEAT and WEAT |
| Gender Bias | (Srinivasan and Bisk 2021) | VL-BERT | Association Scores |
| Social Bias (Gender and Race) | (Hirota, Nakashima, and Garcia 2022b) | NIC, SAT, Att2In, UpDn, Transformer, Oscar, NIC+ | Leakage in Image Captioning (LIC) |
| Social Bias | (Zhang, Wang, and Sang 2022) | ALBEF, TCL, ViLT | CounterBias |
| Stereotypical Bias (Gender, Profession, Race, and Religion) | (Zhou, Lai, and Jiang 2022) | VisualBERT, LXMERT, ViLT, CLIP, ALBEF, FLAVA | Vision-language relevance score and vision-language bias score |
| Quantifying bias before and after finetuning | (Ranjit et al. 2023) | ResNet, BiT, CLIP, MoCo, SimCLR | Bias Transfer Score (BTS) |
| Emotional and Racial Bias | (Bakr et al. 2023) | NIC, SAT, Att2In, UpDn, Transformer, Oscar, NIC+ | ImageCaptioner2 |

Bias and Fairness

Fairness in AI systems has been primarily viewed as protecting sensitive attributes in a way that no group faces a disadvantage or biased decision. Biases like gender or racial bias have proven harmful, especially when they affect humans in real life (Singh et al. 2022). VLPMs are as vulnerable to bias as their CNN counterparts. They deal with two modalities and often two-stage training, allowing more biases to be introduced, like pre-training bias or bias against a particular modality. Literature has shown that VLPMs are heavily influenced by the language modality, which can sometimes be harmful; (Kervadec et al. 2021) showed this with reference to the Visual Question Answering (VQA) task.

Data and Bias

Data has been considered the primary source of bias as it is a representation of the world that the model is trying to learn. With VLPMs, this can be an even bigger issue as pre-training requires large datasets. Many well-known VLPMs today have been trained on large, heavy datasets crawled from the Internet, giving less control and oversight during data collection. This can lead the model to learn harmful representations. (Zhao, Wang, and Russakovsky 2021) examines some widely used multimodal datasets for bias and shows offensive texts and stereotypes embedded within them. (Bhargava and Forsyth 2019) specifically examines dataset bias by studying the COCO dataset (Lin et al. 2014), a manually annotated dataset for the image captioning task. The authors not only depict gender and racial bias but also analyze recent captioning models to see the differences in their performance from a lens of bias. Some studies have looked at task-specific datasets as well: (Hirota, Nakashima, and Garcia 2022a) analyze five Visual Question Answering (VQA) datasets for gender and racial bias. (Garcia et al. 2023) focuses on datasets crawled from the Internet without much oversight from a demographic point-of-view while also showcasing how societal bias is an issue across various tasks and datasets.

Bias Estimation and Mitigation

(Sudhakar et al. 2021) studies biases present in vision transformers by visualizing self-attention modules, noting encoded bias in the query matrix. To study and mitigate these biases, they further proposed an alignment approach called TADeT. (Ross, Katz, and Barbu 2021) further measured social biases in the joint embeddings by proposing Grounded WEAT and SEAT while also proposing a new dataset for testing biases in the grounded setting. The study concludes that bias comes from the language modality, and the vision modality does not help mitigate biases. Moreover, the authors of CLIP (Radford et al. 2021), a heavily used VLPM known for its zero-shot capabilities, conducted their own bias study, postulating that it may encode social biases owing to the large open dataset used for its training. The authors tested zero-shot and linear-probe instances of the model to mark the potential sources of biases and harmful markers. (Zhang, Wang, and Sang 2022) proposes the CounterBias method and FairVLP framework to quantify social bias in VLPMs in a counterfactual manner while proposing a new dataset to measure gender bias. (Srinivasan and Bisk 2022) studies gender bias, particularly in the VL-BERT model, by modifying both language and vision modalities and computing association scores. They further create templates for entities to measure the bias in three instances - pre-training, visual context at inference, and language context at inference. This is particularly interesting, as investigating the bias at different stages can not only help dissect the effectiveness of different modalities but can also allow examination of how VLPMs evolve after the modalities integrate, giving a new perspective on merging the multiple modalities effectively.
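Several of the association-score style metrics above (e.g., Grounded WEAT and SEAT) boil down to comparing embedding-space similarities between target concepts and attribute sets. The following is a hedged numpy sketch of a WEAT-style effect size that illustrates that idea only; the cosine similarity, the random stand-in vectors, the example set sizes, and the word-list examples in the comments are assumptions, not the exact protocol of the cited studies.

```python
# WEAT-style association test on embedding vectors (illustrative only).
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, A, B):
    """s(w, A, B): mean similarity to attribute set A minus mean similarity to B."""
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    """Effect size comparing target sets X, Y (e.g., career vs. family concepts)
    against attribute sets A, B (e.g., male vs. female terms)."""
    s_X = [association(x, A, B) for x in X]
    s_Y = [association(y, A, B) for y in Y]
    return (np.mean(s_X) - np.mean(s_Y)) / np.std(s_X + s_Y, ddof=1)

# Random vectors stand in for embeddings pulled from a VLPM.
rng = np.random.default_rng(0)
emb = lambda n: [rng.normal(size=256) for _ in range(n)]
print(weat_effect_size(emb(8), emb(8), emb(8), emb(8)))
```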
(Hirota, Nakashima, and Garcia 2022b) introduced a new metric, Leakage in Image Captioning (LIC), to measure bias towards a particular attribute for the task of image captioning. The metric requires annotations for the protected attribute and can also use embeddings that have pre-existing bias. Furthermore, VLStereoSet (Zhou, Lai, and Jiang 2022) measured stereotypical biases in VLPMs using probing tasks, testing their tendency to select stereotypical statements as captions for anti-stereotypical images. The stereotypes are based on four categories: gender, profession, race, and religion. They also proposed two metrics, called the vision-language relevance score and the vision-language bias score, using which they concluded that the state-of-the-art VLPMs under consideration not only encode stereotypical bias but that these biases are more complex than language bias alone and need further study. Several studies have proposed mitigation techniques to deal with bias (Hendricks et al. 2018; Amend, Wazzan, and Souvenir 2021; Zhao, Andrews, and Xiang 2023; Wang and Russakovsky 2023). As can be noticed in these studies, different components and parts of the entire vision-language processing pipeline are put under consideration. Even when looking for societal biases such as gender and racial bias, there is a lack of commonality across studies, yet none of the observations and results can be dismissed as less crucial. We feel that there is a lack of standard metrics and a common protocol for measuring bias in multimodal models so far. In Table 1, we have tried to summarize some of these studies, detailing the metrics they used and the models they examined for bias. VLPMs can encode bias, with more opportunities to do so than unimodal models.

Robustness

While accuracy focuses on correctness, robustness focuses on security by assessing the model for vulnerabilities in adversarial settings (Singh et al. 2020). Like CNNs, transformers are vulnerable to adversarial attacks. We first discuss how transformers perform against their CNN counterparts. Many have postulated that transformers are more robust than CNNs, but we believe that architectural differences were not considered by the adversarial methods used in these studies. We discuss the robustness of VLPMs exclusively in a separate subsection.

Transformers vs CNNs

Several transformer architectures have performed better than CNNs, but are they more robust? (Bhojanapalli et al. 2021) measures the robustness of ViT architectures to answer this very question and compares them with their ResNet counterparts for the task of image classification. Perturbations are added to the input in adversarial settings to measure robustness. The robustness is measured in parts, starting with natural perturbations like blurring, digitizing, and adding Gaussian noise. It is then measured with respect to distribution shift and using adversarial attacks. All the comparisons are made across varying sizes of ViT and ResNet architectures, concluding that transformers have a slight edge compared to ResNets and that, with sufficient data, ViTs can outperform their ResNet counterparts.
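The natural-perturbation part of such comparisons essentially measures how accuracy degrades as controlled corruption is added to the inputs. Below is a rough, generic PyTorch sketch of that evaluation loop; the model, data loader, and noise level are placeholders rather than the exact setup of the studies cited above.

```python
# Generic accuracy-under-noise check: compare clean accuracy against accuracy
# on inputs corrupted with Gaussian noise of a chosen standard deviation.
import torch

@torch.no_grad()
def accuracy(model, loader, noise_sigma=0.0, device="cpu"):
    model.eval().to(device)
    correct, total = 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        if noise_sigma > 0:
            images = images + noise_sigma * torch.randn_like(images)
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total

# Hypothetical usage: compare a ViT and a ResNet under increasing noise levels.
# for sigma in (0.0, 0.05, 0.1):
#     print(sigma, accuracy(vit_model, val_loader, sigma),
#           accuracy(resnet_model, val_loader, sigma))
```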
(Shao et al. 2022) studied the robustness of transformers by exposing them to white-box and transfer adversarial attacks, concluding that ViTs are more robust than CNNs. The study also observes that ViTs have spurious correlations and are less sensitive to high-frequency perturbations. Adding tokens for learning high-frequency patterns in ViTs improves classification accuracy but reduces the robustness of the architecture. Hybrid architectures combining ViTs and CNNs can reduce the robustness gap between the two architectures. Most of the studies focus on transfer attacks in lieu of attacks specific to transformers. (Bai et al. 2021; Pinto, Torr, and Dokania 2022) study the robustness of transformers versus CNNs, questioning previous studies (Bhojanapalli et al. 2021; Shao et al. 2022) that show transformers to be more robust than CNNs and claiming that the settings used to compare the architectures were unfair. These studies show that transformers are not more robust than CNNs, although on out-of-distribution samples transformers outperform CNNs. (Mao et al. 2022) proposed a Robust Vision Transformer (RVT) after studying the components affecting the robustness of the model, proposing a new patch-wise augmentation and a position-aware attention scaling (PAAS) to boost the RVT, in addition to modifying damaging elements in the architecture for better robustness. RVT can be used as a backbone or vision encoder for different VLPMs, just like the Trade-off between Robustness and Accuracy of Vision Transformers (TORA-ViTs) (Li and Xu 2023) that can combine predictive and robust features in a trade-off manner. (Mishra, Sachdeva, and Baral 2022) performed a comparative study to measure the robustness of pre-trained transformers on noisy data. The noisy data is created using poison attacks like label flipping and is compared under adversarial filtering augmentation. They introduced a novel robustness metric called Mean Rate of change of Accuracy with change in Poisoning (MRAP), using which they observed that the models are not robust under adversarial filtering. In most of these studies, the comparison between CNNs and transformers is drawn using existing attacks proposed originally for CNNs, but it is important to devise attacks that exploit vulnerabilities of the latter, keeping in mind the critical architectural differences between the two.
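For reference, many of the white-box comparisons of this kind rely on standard gradient-based attacks such as PGD, which are architecture-agnostic. The following is a minimal, hedged PGD sketch; the epsilon budget, step size, and step count are illustrative choices, and the classifier is a placeholder.

```python
# Minimal PGD (projected gradient descent) attack sketch under an L-infinity budget.
import torch
import torch.nn.functional as F

def pgd_attack(model, images, labels, eps=8/255, alpha=2/255, steps=10):
    adv = images.clone().detach()
    adv = adv + torch.empty_like(adv).uniform_(-eps, eps)   # random start
    adv = adv.clamp(0, 1)
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = F.cross_entropy(model(adv), labels)
        grad = torch.autograd.grad(loss, adv)[0]
        with torch.no_grad():
            adv = adv + alpha * grad.sign()                  # ascend the loss
            adv = images + (adv - images).clamp(-eps, eps)   # project to the eps-ball
            adv = adv.clamp(0, 1)                            # keep valid pixel range
    return adv.detach()
```

Because nothing in this recipe is transformer-specific, it illustrates the concern raised above: such attacks do not probe vulnerabilities unique to attention-based architectures.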
VLPMs and their Robustness

VLPMs are studied under the robustness lens but not as extensively as unimodal transformers. (Li, Gan, and Liu 2020) studies VLPMs over linguistic variation, logical reasoning, visual content manipulation, and answer distribution shift. These models have already shown better performance in terms of accuracy; still, for robustness, the authors propose an adversarial training strategy called MANGO, or Multimodal Adversarial Noise Generator, to fool the models. Further, efforts have been made to devise methods exclusively for transformers, like the Patch-wise Adversarial Removal (PAR) method (Shi and Han 2021) that processes each patch separately to generate adversarial samples in a black-box setting. The patches are processed based on noise sensitivity, and the method can be extended to CNNs as well. (Li et al. 2021) proposed a new benchmark for adversarial robustness on the task of VQA. (Wei et al. 2022) proposed a dual attack framework, namely, the Pay No Attention (PNA) method and the PatchOut attack, which skips attention gradients in order to create adversarial samples and improve transferability across transformers. Since the attack framework is sensitive to the transformer architecture, the attacks target both the patches, by perturbing only a subset of them at each iteration, and the attention module, by skipping some attention gradients. Other than attacks, (Ma et al. 2022) investigated how VLPMs perform, in terms of accuracy, on data with missing or incomplete modalities (examining only one modality at a time) and how performance can be improved using different fusion strategies. They concluded not only that transformers are sensitive to missing modalities but also that there is no single optimal fusion strategy, as multimodal fusion affects the robustness of these models and is dataset-dependent. (Salin et al. 2022) analyzes VLPMs to gain better insight into the multimodal relationship using probing tasks, concluding that concepts like position and size are difficult for the models under consideration to understand. (Zhao et al. 2023) studies adversarial vulnerability in a black-box setting to perform a realistic adversarial study by manipulating visual inputs. (Schlarmann and Hein 2023), on the other hand, studied adversarial robustness against imperceptible attacks on VQA and image captioning tasks for well-known multimodal foundation models, and (Mao et al. 2023) studies zero-shot adversarial robustness. The authors proposed text-guided contrastive adversarial training (TeCoA) to be used along with finetuning to improve zero-shot adversarial robustness. All these studies try to examine robustness by either formulating transformer-specific attacks, proposing new benchmarks, carefully looking at different architectural components, or optimizing training strategies. However, a proper and common framework can better help compare the various VLPMs. The architectural differences alone make this a difficult but essential task that needs to be looked at.

Interpretability and Explainability

Irrespective of the architecture, it is imperative that we can interpret as well as explain the decisions made by a model. Transformers have relied heavily on attention to provide that explanation. A few methods originally proposed for CNNs have been extended to transformers as well, like Grad-CAM (Selvaraju et al. 2017). We have categorized the proposed methods into two categories, namely, gradient and visualization-based methods, and probing tasks. While visualization-based methods usually use inter- and intra-modality interactions to visually explain the decisions, probing tasks are specifically designed to explain a particular aspect or component of the models and can be restrictive. Finally, we discuss attention and how reliable it is as an explanation.

Gradient-based and Visualization-based Methods

Among several explanation methods proposed in the literature, many have been extended to transformer-based models. We first present the different gradient and visualization-based methods that are more in line with transformers and VLPMs. Attention maps are a well-known method for interpreting transformer models. Modifications of these methods have been proposed in the literature, like Attention Rollout (Abnar and Zuidema 2020), which combines attention across layers to obtain averaged attention.
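As a concrete illustration, here is a short sketch of the Attention Rollout computation as it is usually described: head-averaged attention per layer, an added identity term for the residual connections, row renormalization, and a product across layers. The layer, head, and token counts in the example are arbitrary; real usage would take the attention maps from an actual model's forward pass.

```python
# Attention Rollout sketch (after Abnar and Zuidema 2020).
import torch

def attention_rollout(attentions):
    """attentions: list of per-layer attention tensors of shape (heads, tokens, tokens)."""
    num_tokens = attentions[0].size(-1)
    rollout = torch.eye(num_tokens)
    for attn in attentions:
        attn = attn.mean(dim=0)                       # average over heads
        attn = attn + torch.eye(num_tokens)           # account for residual connections
        attn = attn / attn.sum(dim=-1, keepdim=True)  # renormalize rows
        rollout = attn @ rollout                      # compose with earlier layers
    return rollout  # row i: how much each input token contributes to token i

# Example with random attention maps for a 12-layer, 12-head model over 50 tokens.
maps = [torch.rand(12, 50, 50).softmax(dim=-1) for _ in range(12)]
print(attention_rollout(maps)[0])  # contributions of all tokens to the first (e.g., [CLS]) token
```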
(Voita et al. 2019) modified the LRP method specifically for transformers, overcoming its computational barriers. Further, the Relevancy Map method, or HilaCAM (Chefer, Gur, and Wolf 2021), uses the self-attention and co-attention modules, together with the classification tokens appended during downstream tasks and their associated values, to generate a relevancy map that tracks interactions between different modalities by backpropagating relevancies. The method applies to both unimodal and multimodal models. Apart from these methods, VL-InterpreT (Aflalo et al. 2022) is more of a tool that provides an interactive interface for looking at interactions between modalities from a bottom-up perspective. It uses four types of modality attention heads: language-to-vision attention, vision-to-language attention, language-to-language attention, and vision-to-vision attention, allowing it to look at interactions within and between modalities. MultiViz (Liang et al. 2022) is another method to analyze multimodal models, interpreting unimodal interactions, cross-modal interactions, multimodal representations, and multimodal prediction. gScoreCAM (Chen et al. 2022) studied the CLIP (Radford et al. 2021) model specifically to understand large multimodal models. Using gScoreCAM, objects can be visualized as seen by the model by linearly combining the highest-gradient activations as attention. (Pan et al. 2021) proposes interpretability-aware redundancy reduction (IA-RED2) to make transformers cost-efficient while keeping the architecture human-understandable. The study in (Chefer, Schwartz, and Wolf 2022) manipulates the relevancy maps to improve the model's robustness: lower relevance is assigned to the background pixels, so the foreground is considered with more confidence. (Qiang et al. 2022) proposes the AttCAT explanation method, which uses attentive class activation tokens built on encoded features, gradients, and attention weights to provide the explanation. B-cos transformers are proposed by (Böhle, Fritz, and Schiele 2023), which are highly interpretable, providing holistic explanations. (Nalmpantis et al. 2023) proposes another interpretation method called Vision DiffMask, which identifies the parts of the input relevant to the final prediction using a gating mechanism. A faithfulness test is also used to showcase the validity of this post-hoc method, with the authors concluding that there is a lack of faithfulness tests in the literature. (Choi, Jin, and Han 2023) proposes Adversarial Normalization: I Can Visualize Everything (ICE) to visualize the transformer architecture effectively. It uses adversarial normalization and patch-wise classification for each token, separating background and foreground pixels. The most common theme in these methods is exploiting attention weights and gradients to make the information flow more targeted. Another theme is to extend available methods by making them computationally efficient.

Probing Tasks

Most of the explanation methods for VLPMs are based on probing tasks. These tasks are designed to study a particular aspect of the model and thus are hard to generalize. The VALUE, or Vision And Language Understanding Evaluation, method (Cao et al. 2020) gives several probing tasks to understand how pre-training helps the learned representations. The authors made several important observations, among others: (i) the pre-trained models attend to language more than vision, something that has been corroborated throughout the literature; (ii) there is a set of attention heads that capture cross-modal interactions; and (iii) plotting attention can depict interpretable visual relations, as was corroborated in the previous section as well.
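In practice, a probing task usually amounts to training a small classifier on top of frozen model features to test whether a property (object counts, categories, and so on) is recoverable from them. The following is a generic, hedged sketch of that recipe; the encoder, data loader, feature dimension, and hyperparameters are placeholders rather than the setup of any specific study cited here.

```python
# Generic linear-probing sketch: the backbone stays frozen and only a linear
# classifier is trained on its pooled features for the probed property.
import torch
import torch.nn as nn

def train_linear_probe(encoder, loader, feat_dim, num_classes, epochs=5, lr=1e-3):
    encoder.eval()                                   # frozen backbone
    probe = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for inputs, labels in loader:
            with torch.no_grad():
                feats = encoder(inputs)              # (B, feat_dim) pooled features
            loss = loss_fn(probe(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return probe
```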
(Dahlgren Lindström et al. 2020) further proposes three probing tasks for the visual-semantic space, which are relevant for image-caption pairs and train separate classifiers for probing. The tasks are (i) a direct probing task designed for the number of objects, (ii) a direct probing task for object categories, and (iii) a task for semantic congruence. (Hendricks and Nematzadeh 2021) furthermore proposes probing tasks for verb understanding by collecting image-sentence pairs with 421 verbs commonly found in the Conceptual Captions dataset (Sharma et al. 2018). (Salin et al. 2022) proposed a set of probing tasks to better understand the representations generated by vision-language models, comparing the representations at the pre-trained and finetuned levels. Further, the datasets are designed carefully for multimodal probing, trying to reduce dependency on bias while making predictions. While probing tasks are helpful and can answer meaningfully with regard to particular problems, they have to be carefully crafted for relevant results and are very specific. At times, extra models or classifiers are required for probing, making the probing tasks applicable to selected models only.

Dissecting Attention

As can be seen in this section so far, attention is heavily used in the methods proposed to explain and interpret VLPMs. In fact, attention is one of the main reasons to which the success of transformers has been attributed. However, some recent studies have pointed out that attention is not a reliable parameter for explaining a model's decision. For VLPMs in particular, fusing the modalities can make it difficult to interpret how the attention is distributed and how it should be explained. (Serrano and Smith 2019) evaluated attention for text classification, concluding that while attention can be helpful for inspecting intermediate components, it is not a good indicator for justifying a decision. Further, (Jain and Wallace 2019) studied the relationship between attention weights and the final decision for several NLP tasks and concluded that attention weights often do not agree with gradient-based methods for computing feature importance; hence, they do not provide helpful or meaningful explanations. While these studies concluded that attention is not reliable as a justification tool, they have been limited to language-based tasks, and a proper in-depth analysis is needed given how heavily current methods rely on the mechanism to interpret the models. (Park and Choi 2022) computed a relation between attention maps and input attributions by proposing the Input-Attribution and Attention Score Vector (IAV), combining attention with attribution-based methods to utilize both components as a justification tool. Such methods can help alleviate this mistrust of attention. (Sahiner et al. 2022) studies attention under convex duality, which can help provide interpretability for the architecture. (Liu et al. 2022) takes polarity into consideration along with attention; the authors propose a faithfulness violation test that can help quantify the quality of different explanation methods. We believe that attention needs to be evaluated as an interpretability metric for more vision and vision-language tasks. Combining the module with other established methods, like attribution-based methods, or examining the methods on controlled benchmarks can help.
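To make the kind of comparison performed in these attention-reliability studies concrete, the sketch below checks how well attention weights rank tokens relative to a gradient-times-input attribution for a single example, using Kendall's tau as the agreement measure. The random inputs, the choice of gradient-times-input, and the use of Kendall's tau are illustrative assumptions rather than the exact protocol of the cited work.

```python
# Agreement check between attention weights and a gradient-based attribution.
import numpy as np
from scipy.stats import kendalltau

def agreement(attention_weights, token_embeddings, grad_wrt_embeddings):
    """All arrays are per-token quantities for a single example."""
    grad_x_input = np.abs((token_embeddings * grad_wrt_embeddings).sum(axis=-1))
    tau, _ = kendalltau(attention_weights, grad_x_input)
    return tau  # low or negative tau: attention ranks tokens unlike the gradients

# Toy example with random values standing in for model outputs.
rng = np.random.default_rng(1)
print(agreement(rng.random(16), rng.normal(size=(16, 64)), rng.normal(size=(16, 64))))
```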
Open Challenges and Opportunities

The previous sections discuss several methods and techniques to make VLPMs fair, robust, explainable, and interpretable. However, they also highlight a lack of architecture-specific methods and standard protocols. Even with all the progress, there are several open challenges that require further development and analysis. Here, we discuss some of the open challenges for improving different aspects of the trustworthiness of VLPMs.

Trustworthiness of VLPMs: The concept of trustworthiness as a whole is lacking in the current analysis of VLPMs. A formalized and standardized framework can help set baselines for the growing number of transformer architectures. One basic need is to make these models trustworthy enough that their decisions can be trusted and relied upon while staying away from harmful biases, for example by using faithfulness tests for quantifying a model's explainability. As we continue to use these models for security-critical applications, we need to be able to depend on the models and their decisions.

Examining Attention: Attention mechanisms are often used to explain how models make decisions by creating visual representations that provide reasoning behind these decisions. However, to better understand and interpret attention, especially in the context of vision and cross-modality, we need to thoroughly examine attention modules. Analyzing models under adversarial conditions can also help us gain valuable insights and improve our understanding of attention mechanisms. Additionally, attention is a critical factor in ensuring the trustworthiness of transformer models. Therefore, we should examine attention from three different angles: its impact on model performance, its role in explaining decisions, and its role in understanding the model's reasoning.

Probing the Vision Modality: The literature has time and again reiterated that, for VLPMs, decisions are influenced more strongly by the language modality than by the visual modality. We believe a big gap exists in systematically reviewing how the vision modality affects decisions and how we can better utilize it to avoid language bias. While tasks like VQA have recognized language bias, VLPMs as a generalized architecture have not been explored for this bias as extensively. Better pre-training tasks that align the vision modality along with cross-modality interactions can be a way forward for improving the generalization as well as the effect of the vision modality on the entire model. Moreover, vision plays a crucial role in understanding object semantics in tasks like object detection and semantic segmentation, and thus its reduced influence in vision-language tasks can be seen as a disadvantage. Studying the alignment between the vision and text modalities can also be a way forward.

Better Generalized Methods: There is a need for better generalized methods that can evaluate not only between CNNs and transformers but also between different architecture formats within transformers. Also, with increasing hybrid architectures, such methods can help create a better comparison framework, providing effective baselines for future studies. Some studies (Gui et al. 2022; Tang et al. 2023) have used one modality to guide the other while training or used one modality to train the multimodal models, which can allow correcting for bias or adversarial vulnerabilities.

Cross-modality and Universal Weights: Transformer models are known for their similar architecture, even when processing different modalities. However, the pre-trained weights are not as easily adapted between the modalities, and alignment remains an open challenge.
Aligning the two modalities can help improve the representations for VLPMs and better project the two modalities into a similar projection space. A universal model that can represent both modalities similarly can help with performance as well as robustness; however, there is still a gap in obtaining universal pre-trained weights that can adapt to different modalities, and this requires further research.

Strategic Pre-training: Pre-training has been demonstrated to be beneficial for transformers, but it is costly. It can be a tedious process that requires large datasets and pre-training tasks that utilize heavy computing power. We have also seen how these large datasets can be a potential source of bias. With better and more focused pre-training strategies (Zhou et al. 2020), the training cost can be reduced while improving task-aware performance. With proper strategies in place, bias at the pre-training stage can be mitigated or avoided during finetuning.

Interplay of VLPMs with Audio Models: In several multimedia applications, ranging from audio-visual scene comprehension to speech-driven image recognition and immersive human-computer interactions, the fusion of vision, language, and audio plays a pivotal role. Consequently, it becomes imperative to explore the interplay between audio models and VLPMs to enhance our capabilities in perception, understanding, and communication, thereby offering more enriched and immersive experiences.

Responsible ML Datasets: The trustworthiness of VLPMs and transformer models is intricately tied to their training data. These algorithms learn patterns from the data they are exposed to, which may inadvertently incorporate any inherent flaws present in the data, thereby influencing their behavior. Therefore, it is important to understand the crucial role of Responsible Machine Learning Datasets (Mittal et al. 2023), encompassing aspects such as privacy (Chhabra et al. 2018) and adherence to regulatory standards. In addition, machine unlearning concepts should be explored to ensure these systems can adapt and comply with evolving regulatory norms.

Discussion

Despite the remarkable human-like performance demonstrated by Vision-Language Pre-trained Models (VLPMs) and Vision Transformers, it is of paramount importance not to underestimate the crucial dimension of trustworthiness. As VLPMs continue to gain widespread adoption on a global scale, a rigorous examination becomes imperative. This paper presents a comprehensive analysis of VLPMs, addressing three essential dimensions: bias/fairness, robustness, and explainability/interpretability. Firstly, we scrutinize biases within VLPMs, recognizing that while datasets often serve as the primary source of bias, biases can also seep into the models and algorithms themselves. Addressing this issue requires a thorough evaluation and mitigation study, a challenge further complicated by VLPMs' multidimensional nature encompassing both vision and language. Establishing a robust framework is essential to conduct bias assessments tailored to these complex models effectively. Next, we discuss the robustness of VLPMs. While VLPMs have been extensively compared to their CNN counterparts, a noticeable gap exists when it comes to architecture-specific studies that explore vulnerabilities unique to VLPMs. Finally, we explore VLPMs using visualization-based and probing methods, which, although limited in availability, provide valuable insights to enhance our comprehension of VLPMs' inner workings.
We also highlighted some of the open challenges confronting VLPMs. We hope that this study serves as a foundation for researchers to identify gaps and work towards enhancing both the performance and trustworthiness of these models.

Acknowledgement

The work is partially supported through the grant from Technology Innovation Hub (TIH) at IIT Jodhpur. M. Vatsa is partially supported through the Swarnajayanti Fellowship.

References

Abnar, S.; and Zuidema, W. H. 2020. Quantifying Attention Flow in Transformers. In ACL, 4190–4197.
Aflalo, E.; Du, M.; Tseng, S.-Y.; Liu, Y.; Wu, C.; Duan, N.; and Lal, V. 2022. VL-InterpreT: An interactive visualization tool for interpreting vision-language transformers. In IEEE CVPR, 21406–21415.
Amend, J. J.; Wazzan, A.; and Souvenir, R. 2021. Evaluating Gender-Neutral Training Data for Automated Image Captioning. In IEEE International Conference on Big Data (Big Data), 1226–1235.
Bai, Y.; Mei, J.; Yuille, A. L.; and Xie, C. 2021. Are transformers more robust than CNNs? NeurIPS, 34: 26831–26843.
Bakr, E. M.; Sun, P.; Li, L. E.; and Elhoseiny, M. 2023. ImageCaptioner2: Image Captioner for Image Captioning Bias Amplification Assessment. CoRR, abs/2304.04874.
Bhargava, S.; and Forsyth, D. A. 2019. Exposing and Correcting the Gender Bias in Image Captioning Datasets and Models. CoRR, abs/1912.00578.
Bhojanapalli, S.; Chakrabarti, A.; Glasner, D.; Li, D.; Unterthiner, T.; and Veit, A. 2021. Understanding robustness of transformers for image classification. In IEEE CVPR.
Birhane, A.; Prabhu, V. U.; and Kahembwe, E. 2021. Multimodal datasets: misogyny, pornography, and malignant stereotypes. arXiv preprint arXiv:2110.01963.
Böhle, M.; Fritz, M.; and Schiele, B. 2023. Holistically Explainable Vision Transformers. CoRR, abs/2301.08669.
Cao, J.; Gan, Z.; Cheng, Y.; Yu, L.; Chen, Y.-C.; and Liu, J. 2020. Behind the scene: Revealing the secrets of pre-trained vision-and-language models. In ECCV, 565–580. Springer.
Chefer, H.; Gur, S.; and Wolf, L. 2021. Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers. In IEEE CVPR, 397–406.
Chefer, H.; Schwartz, I.; and Wolf, L. 2022. Optimizing Relevance Maps of Vision Transformers Improves Robustness. In NeurIPS.
Chen, P.; Li, Q.; Biaz, S.; Bui, T.; and Nguyen, A. 2022. gScoreCAM: What objects is CLIP looking at? In ACCV.
Chhabra, S.; Singh, R.; Vatsa, M.; and Gupta, G. 2018. Anonymizing k Facial Attributes via Adversarial Perturbations. In IJCAI, 656–662.
Choi, H.; Jin, S.; and Han, K. 2023. Adversarial Normalization: I Can Visualize Everything (ICE). In IEEE/CVF CVPR.
Dahlgren Lindström, A.; Björklund, J.; Bensch, S.; and Drewes, F. 2020. Probing Multimodal Embeddings for Linguistic Properties: the Visual-Semantic Case. In COLING.
Devlin, J.; Chang, M.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT.
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; Uszkoreit, J.; and Houlsby, N. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In ICLR.
Du, Y.; Liu, Z.; Li, J.; and Zhao, W. X. 2022. A Survey of Vision-Language Pre-Trained Models. In IJCAI, 5436–5443. Survey Track.
Fields, C.; and Kennington, C. 2023. Vision Language Transformers: A Survey. CoRR, abs/2307.03254.
Garcia, N.; Hirota, Y.; Wu, Y.; and Nakashima, Y. 2023. Uncurated Image-Text Datasets: Shedding Light on Demographic Bias. In IEEE/CVF CVPR, 6957–6966.
Gui, L.; Huang, Q.; Hauptmann, A.; Bisk, Y.; and Gao, J. 2022. Training Vision-Language Transformers from Captions Alone. CoRR, abs/2205.09256.
Hendricks, L. A.; Burns, K.; Saenko, K.; Darrell, T.; and Rohrbach, A. 2018. Women Also Snowboard: Overcoming Bias in Captioning Models. In ECCV.
Hendricks, L. A.; and Nematzadeh, A. 2021. Probing Image-Language Transformers for Verb Understanding. In ACL/IJCNLP.
Hirota, Y.; Nakashima, Y.; and Garcia, N. 2022a. Gender and Racial Bias in Visual Question Answering Datasets. In FAccT, 1280–1292.
Hirota, Y.; Nakashima, Y.; and Garcia, N. 2022b. Quantifying Societal Bias Amplification in Image Captioning. In IEEE/CVF CVPR, 13440–13449.
Huang, Z.; Zeng, Z.; Liu, B.; Fu, D.; and Fu, J. 2020. Pixel-BERT: Aligning image pixels with text by deep multi-modal transformers. arXiv preprint arXiv:2004.00849.
Jain, S.; and Wallace, B. C. 2019. Attention is not Explanation. In ACL: Human Language Technologies, Volume 1 (Long and Short Papers), 3543–3556.
Kervadec, C.; Antipov, G.; Baccouche, M.; and Wolf, C. 2021. Roses are red, violets are blue... but should VQA expect them to? In IEEE CVPR, 2776–2785.
Kim, W.; Son, B.; and Kim, I. 2021. ViLT: Vision-and-language transformer without convolution or region supervision. In ICML, 5583–5594.
Li, L.; Gan, Z.; and Liu, J. 2020. A closer look at the robustness of vision-and-language pre-trained models. arXiv preprint arXiv:2012.08673.
Li, L.; Lei, J.; Gan, Z.; and Liu, J. 2021. Adversarial VQA: A New Benchmark for Evaluating the Robustness of VQA Models. In IEEE/CVF ICCV, 2022–2031.
Li, L. H.; Yatskar, M.; Yin, D.; Hsieh, C.-J.; and Chang, K.-W. 2019. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557.
Li, Y.; and Xu, C. 2023. Trade-off between Robustness and Accuracy of Vision Transformers. In IEEE/CVF CVPR.
Liang, P. P.; Lyu, Y.; Chhablani, G.; Jain, N.; Deng, Z.; Wang, X.; Morency, L.-P.; and Salakhutdinov, R. 2022. MultiViz: An Analysis Benchmark for Visualizing and Understanding Multimodal Models. arXiv preprint arXiv:2207.00056.
Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; et al. 2014. Microsoft COCO: Common objects in context. In ECCV.
Liu, Y.; Li, H.; Guo, Y.; Kong, C.; Li, J.; and Wang, S. 2022. Rethinking Attention-Model Explainability through Faithfulness Violation Test. In ICML PMLR.
Long, S.; Cao, F.; Han, S. C.; and Yang, H. 2022. Vision-and-Language Pretrained Models: A Survey. In IJCAI, 5530–5537. Survey Track.
Lu, J.; Batra, D.; Parikh, D.; and Lee, S. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. NeurIPS, 32.
Ma, M.; Ren, J.; Zhao, L.; Testuggine, D.; and Peng, X. 2022. Are Multimodal Transformers Robust to Missing Modality? In IEEE CVPR, 18177–18186.
Mao, C.; Geng, S.; Yang, J.; Wang, X.; and Vondrick, C. 2023. Understanding Zero-shot Adversarial Robustness for Large-Scale Models. In ICLR.
Mao, X.; Qi, G.; Chen, Y.; Li, X.; Duan, R.; Ye, S.; He, Y.; and Xue, H. 2022. Towards robust vision transformer. In IEEE CVPR, 12042–12051.
Mishra, S.; Sachdeva, B. S.; and Baral, C. 2022. Pretrained Transformers Do not Always Improve Robustness. arXiv preprint arXiv:2210.07663.
Mittal, S.; Thakral, K.; Singh, R.; Vatsa, M.; Glaser, T.; Canton-Ferrer, C.; and Hassner, T. 2023. On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms. CoRR, abs/2310.15848.
Nalmpantis, A.; Panagiotopoulos, A.; Gkountouras, J.; Papakostas, K.; and Aziz, W. 2023. Vision DiffMask: Faithful Interpretation of Vision Transformers with Differentiable Patch Masking. In IEEE/CVF CVPR.
Pan, B.; Jiang, Y.; Panda, R.; Wang, Z.; Feris, R.; and Oliva, A. 2021. IA-RED2: Interpretability-Aware Redundancy Reduction for Vision Transformers. CoRR, abs/2106.12620.
Park, B.; and Choi, J. 2022. Explanation on Pretraining Bias of Finetuned Vision Transformer. arXiv preprint arXiv:2211.15428.
Park, N.; and Kim, S. 2022. How Do Vision Transformers Work? In ICLR.
Pinto, F.; Torr, P. H. S.; and Dokania, P. K. 2022. An Impartial Take to the CNN vs Transformer Robustness Contest. In ECCV, volume 13673, 466–480.
Qiang, Y.; Pan, D.; Li, C.; Li, X.; Jang, R.; and Zhu, D. 2022. AttCAT: Explaining Transformers via Attentive Class Activation Tokens. In NeurIPS.
Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; et al. 2021. Learning transferable visual models from natural language supervision. In ICML.
Ranjit, J.; Wang, T.; Ray, B.; and Ordonez, V. 2023. Variation of Gender Biases in Visual Recognition Models Before and After Finetuning. CoRR, abs/2303.07615.
Ross, C.; Katz, B.; and Barbu, A. 2021. Measuring Social Biases in Grounded Vision and Language Embeddings. In NAACL-HLT, 998–1008.
Sahiner, A.; Ergen, T.; Ozturkler, B.; Pauly, J. M.; Mardani, M.; and Pilanci, M. 2022. Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers. In ICML PMLR, volume 162, 19050–19088.
Salin, E.; Farah, B.; Ayache, S.; and Favre, B. 2022. Are Vision-Language Transformers Learning Multimodal Representations? A Probing Perspective. In AAAI.
Schlarmann, C.; and Hein, M. 2023. On the Adversarial Robustness of Multi-Modal Foundation Models. CoRR, abs/2308.10741.
Selvaraju, R. R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; and Batra, D. 2017. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In IEEE ICCV, 618–626.
Serrano, S.; and Smith, N. A. 2019. Is Attention Interpretable? In ACL, 2931–2951.
Shao, R.; Shi, Z.; Yi, J.; Chen, P.-Y.; and Hsieh, C.-J. 2022. On the Adversarial Robustness of Vision Transformers. TMLR.
Sharma, P.; Ding, N.; Goodman, S.; and Soricut, R. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL (Vol 1: Long Papers), 2556–2565.
Shi, Y.; and Han, Y. 2021. Decision-based black-box attack against vision transformers via patch-wise adversarial removal. arXiv preprint arXiv:2112.03492.
Singh, R.; Agarwal, A.; Singh, M.; Nagpal, S.; and Vatsa, M. 2020. On the Robustness of Face Recognition Algorithms Against Attacks and Bias. In AAAI, 13583–13589.
Singh, R.; Majumdar, P.; Mittal, S.; and Vatsa, M. 2022. Anatomizing Bias in Facial Analysis. In AAAI, 12351–12358.
Srinivasan, T.; and Bisk, Y. 2021. Worst of Both Worlds: Biases Compound in Pre-trained Vision-and-Language Models. CoRR, abs/2104.08666.
Srinivasan, T.; and Bisk, Y. 2022. Worst of Both Worlds: Biases Compound in Pre-trained Vision-and-Language Models. In GeBNLP, 77–85.
Sudhakar, S.; Prabhu, V.; Krishnakumar, A.; and Hoffman, J. 2021. Mitigating bias in visual transformers via targeted alignment. BMVC.
Tan, H.; and Bansal, M. 2019. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In EMNLP-IJCNLP.
Tang, S.; Wang, Y.; Kong, Z.; Zhang, T.; Li, Y.; Ding, C.; Wang, Y.; Liang, Y.; and Xu, D. 2023. You Need Multiple Exiting: Dynamic Early Exiting for Accelerating Unified Vision Language Model. In IEEE/CVF CVPR 2023, 10781–10791.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is All you Need. In NeurIPS.
Voita, E.; Talbot, D.; Moiseev, F.; Sennrich, R.; and Titov, I. 2019. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. In ACL.
Wang, A.; and Russakovsky, O. 2023. Overcoming Bias in Pretrained Models by Manipulating the Finetuning Dataset. CoRR, abs/2303.06167.
Wei, Z.; Chen, J.; Goldblum, M.; Wu, Z.; Goldstein, T.; and Jiang, Y.-G. 2022. Towards transferable adversarial attacks on vision transformers. In AAAI, volume 36, 2668–2676.
Zhang, Y.; Wang, J.; and Sang, J. 2022. Counterfactually Measuring and Eliminating Social Bias in Vision-Language Pre-training Models. In ACM Multimedia, 4996–5004.
Zhao, D.; Andrews, J. T. A.; and Xiang, A. 2023. Men Also Do Laundry: Multi-Attribute Bias Amplification. In ICML PMLR, 42000–42017.
Zhao, D.; Wang, A.; and Russakovsky, O. 2021. Understanding and Evaluating Racial Biases in Image Captioning. In IEEE/CVF ICCV, 14810–14820.
Zhao, Y.; Pang, T.; Du, C.; Yang, X.; Li, C.; Cheung, N.; and Lin, M. 2023. On Evaluating Adversarial Robustness of Large Vision-Language Models. CoRR, abs/2305.16934.
Zhou, K.; Lai, E.; and Jiang, J. 2022. VLStereoSet: A Study of Stereotypical Bias in Pre-trained Vision-Language Models. In IJCNLP, 527–538.
Zhou, L.; Palangi, H.; Zhang, L.; Hu, H.; Corso, J. J.; and Gao, J. 2020. Unified Vision-Language Pre-Training for Image Captioning and VQA. In AAAI, 13041–13049.