Graph Convolutions Enrich the Self-Attention in Transformers!

arXiv:2312.04234v1 [cs.LG] 7 Dec 2023

Hyowon Wi∗, Yonsei University, wihyowon@yonsei.ac.kr
Jeongwhan Choi∗, Yonsei University, jeongwhan.choi@yonsei.ac.kr
Yehjin Shin, Yonsei University, yehjin.shin@yonsei.ac.kr
Jayoung Kim, Yonsei University, jayoung.kim@yonsei.ac.kr
Kookjin Lee, Arizona State University, kookjin.lee@asu.edu
Nathaniel Trask, University of Pennsylvania, ntrask@seas.upenn.edu
Noseong Park†, Yonsei University, noseong@yonsei.ac.kr

∗ Equal Contribution. † Corresponding Author.

ABSTRACT

Transformers, renowned for their self-attention mechanism, have achieved state-of-the-art performance across various tasks in natural language processing, computer vision, time-series modeling, etc. However, one of the challenges with deep Transformer models is the oversmoothing problem, where representations across layers converge to indistinguishable values, leading to significant performance degradation. We interpret the original self-attention as a simple graph filter and redesign it from a graph signal processing (GSP) perspective. We propose graph-filter-based self-attention (GFSA) to learn a general yet effective one, whose complexity, however, is slightly larger than that of the original self-attention mechanism. We demonstrate that GFSA improves the performance of Transformers in various fields, including computer vision, natural language processing, graph pattern classification, speech recognition, and code classification.

1 INTRODUCTION

Transformers are arguably one of the most notable achievements in the field of deep learning. They now show state-of-the-art performance in various fields, ranging from computer vision to natural language processing to graph pattern classification to speech recognition, and so forth (Vaswani et al., 2017; Devlin et al., 2019; Radford et al., 2018; 2019; Dosovitskiy et al., 2021; Touvron et al., 2021a; Zhou et al., 2021a; Liu et al., 2021; Gulati et al., 2020; Latif et al., 2023; Ying et al., 2021; Rampášek et al., 2022). Recently, several studies have been conducted to better understand them (Gong et al., 2021; Ali et al., 2023; Wang et al., 2022); there is a common agreement among researchers that the self-attention is one of the keys leading to this success. However, several studies also point out potential limitations of the self-attention (Zhou et al., 2021a; Dong et al., 2021; Gong et al., 2021; Guo et al., 2023). For instance, Shi et al. (2022) revealed an analogy between the self-attention and the residual graph convolutional network (GCN), showing that BERT also suffers from a notorious problem of GCNs, called oversmoothing, i.e., tokens' latent representations become similar to each other at deeper layers. In every self-attention layer, value vectors are aggregated in a weighted-average manner since each row-wise sum of the attention matrix is always 1. Although each self-attention layer has its own attention matrix, this aggregation method causes the oversmoothing problem, not only in Transformers but also in graph neural networks (Oono & Suzuki, 2020; Cai & Wang, 2020; Wang et al., 2021a; Rusch et al., 2023; Keriven, 2022; Zhou et al., 2021a; Gong et al., 2021; Yan et al., 2022; Noci et al., 2022; Shi et al., 2022; Wang et al., 2022; Ali et al., 2023; Wu et al., 2023a;b).
However, we confine our discussion to the oversmoothing of Transformers (cf. Section 2). Inspired by these observations, we redesign the self-attention from the perspective of graph signal processing (GSP). However, performing graph convolutions in the self-attention layer may incur non-trivial computational overheads. Therefore, our key design point is to learn a general but effective graph filter with minimal overheads. In general, a graph filter on a graph G is represented by a polynomial expression based on its adjacency or Laplacian matrix — in this regard, the existing self-attention mechanism can be understood as the simplest graph filter with Ā only, where Ā ∈ [0, 1]^{n×n} denotes a learned attention matrix and n is the number of input tokens. Our proposed graph filter consists of an identity term and two matrix polynomial terms, Ā and Ā^K — one could design better filters with more polynomial terms, but we avoid this since Transformers already require very large computation. The K-th power, Ā^K, may also require high computation when the number of tokens is large. To avoid this, we further approximate Ā^K using an element-wise first-order Taylor approximation. Therefore, one can consider our proposed graph filter to be only one step more complicated than the filter used by the original self-attention mechanism. However, its efficacy is tremendous for various Transformers in various fields (cf. Figure 1). Our proposed filter enriches the self-attention with more diverse frequency information (see Figure 2(a)) — low (resp. high) frequency signals on G mean that neighboring nodes have similar (resp. different) values. Therefore, our method is able to not only effectively address the oversmoothing problem but also learn better latent representations for downstream tasks.

There exist a couple of prior works that enrich the self-attention mechanism with high-frequency information (Wang et al., 2022; Bai et al., 2022). In comparison with them, our proposed graph filter is distinctive in the following aspects: i) our proposed filter is more effective and shows better performance with comparable computational overheads, ii) our proposed filter is well-aligned with recent advancements in the GCN community — in other words, some graph filters used by recent advanced GCN methods are special cases of our proposed graph filter, which is not the case for prior works, and iii) other methods were typically studied for certain domains only, whereas we test our method in 6 domains — for instance, DiversePatch (Gong et al., 2021) works only for Vision Transformers.

[Figure 1: radar chart comparing backbones vs. backbones + GFSA on automatic speech recognition (LibriSpeech 960h), causal language modeling (PTB), image classification (ImageNet-1k), natural language understanding (CoLA), code classification (Devign), and graph classification (ZINC).]

Figure 1: Our proposed GFSA performs better than selected Transformers in various domains. We achieve these results with only tens to hundreds of additional parameters to Transformers. We replace the self-attention layer of selected Transformers in various fields with our proposed graph filter-based layer without changing other parts. Therefore, the accuracy increases in them are solely due to our proposed graph filter-based self-attention.
These enriched Transformers increase the model performance by 1.63% for image classification, 6.25% for natural language understanding, 0.31% for causal language modeling, 4.03% for graph classification, 4.76% for speech recognition, and 2.40% for code classification (see Figure 1). We summarize our core contributions as follows:

• We provide a novel perspective on self-attention as a graph filter. This perspective allows us to design a more effective self-attention that can address the oversmoothing problem.
• We propose a graph filter-based self-attention (GFSA) mechanism that is general yet effective. GFSA learns a graph filter with an identity term and two polynomial terms, which is more effective than the simple self-attention mechanism.
• We demonstrate that GFSA can significantly improve the performance of Transformers on a variety of tasks. GFSA achieves improved results on computer vision, natural language processing, graph pattern classification, speech recognition, and code classification.

2 BACKGROUND & RELATED WORK

2.1 SELF-ATTENTION IN TRANSFORMERS

The core building block of the Transformer architecture is the self-attention mechanism, which enables the model to learn attention patterns over its input tokens (Vaswani et al., 2017). The self-attention mechanism, denoted as SA : R^{n×d} → R^{n×d}, can be expressed as follows:

    SA(X) = softmax( (X W_qry)(X W_key)^⊤ / √d ) X W_val = Ā X W_val,    (1)

where X ∈ R^{n×d} is the input feature matrix, W_key ∈ R^{d×d}, W_qry ∈ R^{d×d}, and W_val ∈ R^{d×d} are the key, query, and value trainable parameters, respectively, and d is the dimension of each token. The self-attention mechanism allows the model to weigh the importance of each token in the input sequence relative to the others, enabling the model to better capture long-range contextual information. The Transformer architecture includes multiple layers, each with a multi-head self-attention layer followed by a position-wise feed-forward layer.

2.2 SELF-ATTENTION AND GRAPH CONVOLUTIONAL FILTER

The self-attention matrix used in Transformers can be seen as a normalized adjacency matrix of a graph (Shi et al., 2022; Guo et al., 2023; Maskey et al., 2023). A weighted graph G with adjacency matrix A can be constructed by using the input tokens as n nodes and setting the edge weight between node i and node j to exp((X W_qry)_i^⊤ (X W_key)_j). We can then rewrite the (i, j)-th entry of the self-attention matrix Ā as

    Ā_{i,j} = exp((X W_qry)_i^⊤ (X W_key)_j) / Σ_{k=1}^{n} exp((X W_qry)_i^⊤ (X W_key)_k).

This allows Ā to be interpreted as a normalized adjacency matrix; in other words, Ā = D^{-1} A, where D = diag(d_1, d_2, ..., d_n) and d_i = Σ_j A_{i,j}.

Our new attention method is designed on top of graph signal processing (GSP), which has a close connection to discrete signal processing (DSP) (Sandryhaila & Moura, 2013; 2014). In DSP, a discrete signal of length n can be represented by a vector x ∈ R^n. Let g ∈ R^n be a filter that we want to apply to x. The convolution x ∗ g can be written as follows:

    y_i = Σ_{j=1}^{n} x_j g_{i−j},    (2)

where the index i refers to the i-th element of each vector. GSP can be understood as a generalized concept of DSP: signals are defined on the nodes of a graph, and the graph's structure influences signal processing operations.
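As a concrete illustration of Equation 2, the following minimal NumPy sketch evaluates the convolution sum directly and checks it against numpy.convolve. It uses 0-based indexing and treats out-of-range filter taps as zero, which is an assumption about boundary handling rather than something specified in the paper.

    import numpy as np

    n = 8
    rng = np.random.default_rng(0)
    x = rng.standard_normal(n)   # discrete signal
    g = rng.standard_normal(n)   # filter

    # y_i = sum_j x_j * g_{i-j}, with g taken as zero outside its support
    y = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if 0 <= i - j < n:
                y[i] += x[j] * g[i - j]

    # The same values appear as the first n entries of the full linear convolution.
    assert np.allclose(y, np.convolve(x, g, mode="full")[:n])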
In addition, a graph convolution filter on a graph with n nodes can be written with a shift operator S as follows:

    y = Σ_{k=0}^{K} w_k S^k x = H x,    (3)

where x ∈ R^n is a 1-dimensional graph signal, K is the maximum order of the polynomial, and w_k ∈ [−∞, ∞] is a coefficient. S is an n × n matrix whose (i, j)-th element is non-zero if and only if there is an edge from node i to j. Two representative choices of S are the adjacency and Laplacian matrices. Any linear and shift-invariant graph filter H can be written as Σ_{k=0}^{K} w_k S^k with a large enough value of K, which is called a matrix polynomial (Marques et al., 2020). We note that this graph filtering operation can be extended to d-dimensional cases as in Equation 1. Inspired by (Zou et al., 2022; Maskey et al., 2023), we rely on the singular value domain analysis to understand the low/high-pass filter characteristics of filters (cf. Figure 2). See more discussion in Appendix N.

In the context of self-attention within Transformers, the core part of the self-attention in Equation 1, i.e., ĀX, can be considered as a d-dimensional graph filter with Ā only, i.e., H = Ā. Our goal in this paper is to design a simple (for computational purposes) yet effective form of H.

[Figure 2 panels: (a) Filter response, (b) Cosine similarity across layer index, (c) Singular values, each comparing DeiT and DeiT + GFSA.]
Figure 2: Filter frequency response, cosine similarity, and singular values on ImageNet-1k for DeiT-S and DeiT-S + GFSA. Details and more visualizations are in Appendix D.

2.3 OVERSMOOTHING IN GCNS AND TRANSFORMERS

Oversmoothing is a phenomenon observed in deep learning models, especially in GCNs (Kipf & Welling, 2017; Veličković et al., 2018). As information is aggregated over multiple layers for multiple nodes (tokens), latent representations tend to become similar to each other, leading to a loss of distinctiveness in the representations (Oono & Suzuki, 2020; Zhou et al., 2020; Rusch et al., 2023). Surprisingly, an oversmoothing-like phenomenon is also observed in Transformers (Wang et al., 2022; Shi et al., 2022). Unlike CNNs, Transformers cannot benefit from simply deepening layers after a certain depth. Earlier studies hypothesize that this may be due to issues such as attention or feature collapse, or due to uniformity among patches or tokens (Zhou et al., 2021a; Gong et al., 2021; Yan et al., 2022). Dong et al. (2021) also point out that the output of a pure Transformer, i.e., an attention mechanism without skip connections or MLPs, tends to converge to a rank-1 matrix. This analysis is followed by Noci et al. (2022), who suggest that rank collapse incurs vanishing gradients of attention queries and keys. In this context, the self-attention acts as a low-pass filter, since it computes a weighted average of the value vectors of tokens. Wang et al. (2022, Theorem 1) also reveal that the self-attention is a low-pass filter, continuously reducing high-frequency information. This nature contributes to the oversmoothing phenomenon, as unique high-frequency features are lost in deeper layers of the network, further worsening the uniformity of token representations. Therefore, we extend the term "oversmoothing" to describe the degeneration challenge observed in Transformers.
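To make the low-pass behavior above concrete, the following self-contained PyTorch sketch builds a softmax attention matrix from random tokens and random projections (toy sizes and random weights are stand-ins, not a trained model), checks that it coincides with the degree-normalized adjacency D^{-1}A from Section 2.2, and shows that applying it repeatedly drives the average pairwise cosine similarity of token representations toward 1.

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    n, d = 16, 32                                    # tokens, hidden size (toy values)
    X = torch.randn(n, d)
    W_qry = torch.randn(d, d) / d ** 0.5             # scaled so scores stay in a sane range
    W_key = torch.randn(d, d) / d ** 0.5

    # Attention matrix as in Equation 1.
    scores = (X @ W_qry) @ (X @ W_key).T / d ** 0.5
    A_bar = F.softmax(scores, dim=-1)

    # Equivalently, the degree-normalized adjacency D^{-1} A with A = exp(scores).
    A = scores.exp()
    assert torch.allclose(A_bar, A / A.sum(dim=-1, keepdim=True), atol=1e-6)

    def mean_cosine_similarity(H):
        # Average pairwise cosine similarity between token representations.
        Hn = F.normalize(H, dim=-1)
        sim = Hn @ Hn.T
        return sim[~torch.eye(n, dtype=torch.bool)].mean().item()

    # Repeatedly applying the row-stochastic A_bar smooths the tokens (low-pass behavior).
    H = X.clone()
    for layer in range(6):
        H = A_bar @ H
        print(f"layer {layer + 1}: mean cosine similarity = {mean_cosine_similarity(H):.3f}")

The printed similarity rises toward 1 within a few applications, which is exactly the degeneration that the term "oversmoothing" describes here.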
There have been many empirical countermeasures proposed for Vision Transformers (ViTs), such as patch diversification (Zhou et al., 2021b; Gong et al., 2021), rank collapse alleviation (Zhou et al., 2021a; Zhang et al., 2021), and training stabilization (Touvron et al., 2021a; Zhai et al., 2023). Similar alleviating methods have also been proposed in the field of NLP, such as unsupervised learning (Chen et al., 2023), to resolve the oversmoothing and token uniformity (or information diffusion) problems (Dong et al., 2021; Yan et al., 2022). There are studies on utilizing high-frequency information via frequency domain analyses (Wang et al., 2022; Bai et al., 2022), but they are not designed on top of graph filtering perspectives. This paper addresses the oversmoothing problem with graph filters, since the self-attention mechanism is a basic graph filtering operation as shown in the previous subsection.

3 PROPOSED METHOD

3.1 GRAPH FILTER-BASED SELF-ATTENTION LAYERS

Let Ā ∈ [0, 1]^{n×n}, where n is the number of tokens in the input to the self-attention layer, be a self-attention matrix. Since Transformers use multi-head self-attention, there are multiple such matrices. For simplicity but without loss of generality, we discuss only one head in one layer. From the perspective of GSP, Ā corresponds to a special case where i) nodes are completely connected, and ii) Ā is normalized. Ā can be used as a shift operator. When K is large, however, it is expensive to calculate Ā^K. We therefore use the following element-wise first-order Taylor approximation:

    Ā^K ≈ Ā + (K − 1)(Ā² − Ā),    (4)

where Ā² − Ā is an approximated derivative at the position of K = 1 (see details in Appendix P). In the end, we propose to use the following graph filter, where the two lowest-order terms and one high-order term of the matrix polynomial are used:

    H := w_0 I + w_1 Ā + w_K (Ā + (K − 1)(Ā² − Ā)),    (5)

where w_0, w_1, w_K are coefficients and K is a hyperparameter. The coefficients can be learnable weights, and we learn them with gradient descent algorithms. Our proposed graph filter-based self-attention (GFSA) is defined with the graph filter H as follows:

    GFSA(X) = H X W_val.    (6)

We replace the original self-attention layer in various Transformers with the proposed graph filter-based layer without changing other parts. Therefore, GFSA can be plugged into any Transformer that relies on the self-attention.

3.2 THEORETICAL ANALYSIS

In this subsection, we provide a theorem that gives an upper bound on the error introduced by approximating the power of a matrix with a first-order Taylor expansion. The theorem specifically analyzes the error for the matrix Ā^K when approximated using a first-order Taylor expansion.

Theorem 3.1 (Error bound for first-order Taylor approximation). Define the error term as the difference between the exact value and the first-order Taylor approximation of Ā^K:

    E_K = ‖Ā^K − (Ā + (K − 1)(Ā² − Ā))‖,    (7)

where ‖·‖ denotes the L∞ norm. Then, the error E_K can be bounded as follows:

    E_K ≤ 2K.    (8)

The error bound provides an upper limit on the difference between the true value of Ā^K and its approximation using the given formula, when the row-sum of Ā is 1 due to the softmax. The proof of Theorem 3.1 is in Appendix A. Computing high-order matrix powers can be computationally intensive, especially for large matrices such as the self-attention in natural language processing, where the number of tokens (words) can be tens of thousands.
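As a small worked illustration of Equation 4 (on a hypothetical 2 × 2 row-stochastic matrix, not one taken from a trained model), take K = 3:

    \bar{A} = \begin{pmatrix} 0.9 & 0.1 \\ 0.1 & 0.9 \end{pmatrix}, \quad
    \bar{A}^2 = \begin{pmatrix} 0.82 & 0.18 \\ 0.18 & 0.82 \end{pmatrix}, \quad
    \bar{A}^3 = \begin{pmatrix} 0.756 & 0.244 \\ 0.244 & 0.756 \end{pmatrix},

    \bar{A} + (K-1)\,(\bar{A}^2 - \bar{A}) = \begin{pmatrix} 0.74 & 0.26 \\ 0.26 & 0.74 \end{pmatrix}, \quad
    E_3 = \left\| \bar{A}^3 - \begin{pmatrix} 0.74 & 0.26 \\ 0.26 & 0.74 \end{pmatrix} \right\| = 0.032 \le 2K = 6.

For this well-behaved matrix the approximation error is far below the worst-case bound of Theorem 3.1.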
The first-order Taylor approximation provides a simpler computation that can significantly reduce the required computational resources and time. Theorem 3.1 ensures that even with this simplification, the error is bounded and predictable.

3.3 DISCUSSION

This subsection discusses GFSA in terms of its comparison with other models and its mitigation of the oversmoothing problem.

Comparison to Transformers. In the field of computer vision, there has been recent research on adjusting the frequency response of ViTs. HAT (Bai et al., 2022) improves ViT by directly increasing the high-frequency components of images through adversarial training, i.e., modifying inputs. Wang et al. (2022) re-balance the low- and high-frequency components of the attention map and the feature map, respectively. In the NLP domain, Shi et al. (2022) redesign BERT with multiple fusions, inspired by JKNet (Xu et al., 2018).

Comparison to GCNs. Comparisons to GCNs that can be interpreted as graph filters (Kipf & Welling, 2017; Defferrard et al., 2016; Gasteiger et al., 2019) are inevitable. GFSA without a high-order term is analogous to ChebNet (Defferrard et al., 2016) with K = 1. In addition, GFSA reduces to the vanilla GCN (Kipf & Welling, 2017) when K = 1, w_0 = 0, w_1 = 1. GPR-GNN (Chien et al., 2021), which approximates graph convolutions using the monomial basis, is identical to GFSA if GFSA only considers terms up to order 1 and learns their coefficients. When we use only a high-order term and w_K is learned to be a negative value, GFSA can become similar to the reaction-diffusion layer of GREAD (Choi et al., 2023), ĀX + β(Ā − Ā²)X, depending on the higher-order terms.

How to alleviate the oversmoothing problem? The key to the behavior of our proposed filter is the coefficients {w_0, w_1, w_K} — note that in the self-attention of Transformers, w_0 = w_K = 0 and w_1 = 1. Since our method can learn any appropriate values for them for a downstream task, it can reduce to low-pass-only, high-pass-only, or combined filters. In particular, our filter reduces to GPR-GNN and GREAD, two popular methods for preventing oversmoothing in graph convolutional networks, under certain learned settings of {w_0, w_1, w_K}. We use Ā² − Ā to approximate Ā^K (see Equation 4). As mentioned earlier, the low/high-pass filter characteristics are decided by {w_0, w_1, w_K}. Therefore, the approximated Ā^K via (K − 1)(Ā² − Ā) and the coefficients {w_0, w_1, w_K} can alleviate the oversmoothing problem of the self-attention by constituting an appropriate graph filter for a downstream task.

4 EXPERIMENTS

In this section, we demonstrate the effectiveness of GFSA through a series of experiments. These experiments encompass various tasks: i) language understanding and causal language modeling, ii) image classification, iii) graph classification, iv) speech recognition, and v) code classification. We replace the self-attention of the base Transformers in those fields with our GFSA. Our modification adds only tens to hundreds of parameters, which are negligible in comparison with the original size of the base models. We summarize the runtime of base models vs. base models with GFSA in Appendix L.

4.1 EXPERIMENTS ON LANGUAGE MODELS

To investigate the efficacy of GFSA, we present experiments on natural language understanding and causal language modeling tasks in the subsections below.

4.1.1 NATURAL LANGUAGE UNDERSTANDING

Experimental Setting.
We integrate GFSA into three pre-trained language models: BERT, ALBERT, and RoBERTa. We evaluate them on the GLUE benchmark, which includes three categories of natural language understanding tasks: i) single-sentence tasks CoLA and SST-2; ii) similarity and paraphrasing tasks MRPC, QQP, and STS-B; iii) natural language inference tasks MNLI, QNLI, and RTE. For the MNLI task, we experiment on both the matched (MNLI-m) and mismatched (MNLI-mm) versions. Following Devlin et al. (2019), we report Matthews correlation for CoLA, F1 scores for QQP and MRPC, Spearman correlations for STS-B, and accuracy scores for the other tasks. For each task, we select the best hyperparameters for GFSA, and the other hyperparameters are fixed. We compare our GFSA with ContraNorm (Guo et al., 2023), one of the related methods that address oversmoothing. We finetune ContraNorm with the hyperparameters recommended in its original paper. We initialize with the pretrained language models and finetune with the added GFSA for 5 epochs. The detailed experimental settings are in Appendix E.1.

Results. The results are shown in Table 1. When GFSA is plugged into the backbones, the average performance scores improve across all models over the pure backbones. This indicates that GFSA is effective both in larger models like BERT and RoBERTa and in relatively smaller models like ALBERT. It is worth mentioning that for RoBERTa finetuned on the CoLA dataset, there is a large-margin increase from 60.34% to 64.11%, i.e., a 3.77% improvement with only 144 additional parameters. When compared to ContraNorm, GFSA shows a greater performance improvement on average. Figure 4 in Appendix D shows that these performance enhancements can be attributed to addressing the oversmoothing issue through the designed graph filter.

4.1.2 CAUSAL LANGUAGE MODELING

Experimental Setting. We also validate the effectiveness of GFSA on causal language modeling problems. We finetune GPT2 on the following three datasets: Penn Treebank (PTB) (Marcus et al., 1993), WikiText-2, and WikiText-103 (Merity et al., 2016). Following the evaluation method in (Yao et al., 2022), we finetune models for 15 epochs on PTB, 4 epochs on WikiText-103, and 10 epochs on WikiText-2, and report the perplexity for the sensitivity metric, i.e., higher perplexity means higher sensitivity. The detailed experimental settings are in Appendix F.1.

Table 1: Results comparison on the validation set of GLUE tasks. Avg denotes the average performance over all tasks.
Method | #Params | CoLA | SST-2 | MRPC | QQP | STS-B | MNLI-m/mm | QNLI | RTE | Avg
BERT_BASE (Devlin et al., 2019) | 110M | 56.79 | 93.81 | 88.70 | 88.32 | 88.16 | 84.96/84.15 | 91.63 | 66.06 | 82.51
BERT_BASE + ContraNorm | 110M | 59.89 | 93.92 | 89.88 | 88.51 | 88.36 | 85.11/84.50 | 91.84 | 69.31 | 83.48
BERT_BASE + GFSA | 110M | 59.56 | 94.15 | 90.60 | 88.46 | 88.33 | 85.12/85.06 | 91.95 | 68.95 | 83.58
ALBERT_BASE (Lan et al., 2019) | 11M | 57.86 | 92.32 | 91.80 | 85.30 | 90.37 | 85.37/84.37 | 91.76 | 76.90 | 84.01
ALBERT_BASE + ContraNorm | 11M | 57.45 | 93.00 | 92.83 | 87.78 | 90.55 | 85.06/84.57 | 92.28 | 78.70 | 84.69
ALBERT_BASE + GFSA | 11M | 60.21 | 93.23 | 92.47 | 87.79 | 90.63 | 85.29/84.92 | 92.17 | 78.70 | 85.05
RoBERTa_BASE (Liu et al., 2019) | 125M | 60.34 | 94.84 | 92.28 | 88.86 | 89.99 | 87.94/87.30 | 92.57 | 78.70 | 85.87
RoBERTa_BASE + ContraNorm | 125M | 63.06 | 95.41 | 93.17 | 88.91 | 90.34 | 87.88/87.40 | 92.82 | 80.51 | 86.61
RoBERTa_BASE + GFSA | 125M | 64.11 | 95.41 | 93.52 | 89.09 | 90.35 | 87.99/87.54 | 92.97 | 80.14 | 86.79

Table 2: Results comparison on GPT-2 finetuned with GFSA (perplexity, lower is better)

Method | #Params | PTB | WikiText-2 | WikiText-103 | Avg
GPT2 (Radford et al., 2019) | 117M | 19.513 | 20.966 | 15.939 | 18.806
GPT2 + GFSA | 117M | 19.450 | 20.923 | 15.919 | 18.764

Results. Table 2 shows the perplexity on PTB, WikiText-2, and WikiText-103. Across all datasets, GPT2 with GFSA consistently outperforms the vanilla GPT2. Our GFSA improves the average perplexity from 18.806 to 18.764. Note that the performance improvements are made with only 144 additional learnable parameters for 12 layers with 12 heads, which are negligible in comparison with the size of GPT2.

4.2 EXPERIMENTS ON VISION TRANSFORMERS

Experimental Setting. We aim to demonstrate the efficacy of our GFSA across a spectrum of ViT backbones with different depth settings and training modes. We choose DeiT (Touvron et al., 2021a), CaiT (Touvron et al., 2021b), and Swin (Liu et al., 2021) as backbones, and the models are trained from scratch. When training the 12-layer DeiT, we follow the same training recipe, hyperparameters, and data augmentation as (Touvron et al., 2021a). When training the 24-layer DeiT, we follow the settings of (Gong et al., 2021) and set the dropout rate to 0.2. For CaiT, we apply our proposed GFSA only to the patch embedding layer. For detailed experimental settings, see Appendix G.1.

Results. All experimental evaluations are summarized in Table 3. We compare various state-of-the-art models on the ImageNet-1k benchmark. We select them from three classes: CNN only, CNN + Transformer, and pure Transformer. In the Transformer category, we only test lighter models with comparable numbers of parameters, such as ViT-S and DeiT-S. The results show that the proposed GFSA successfully enhances DeiT, CaiT, and Swin across all depth settings and training methods. GFSA adds fewer than 72 additional parameters to the 12-layer DeiT while improving the top-1 accuracy by 1.63%. Compared to existing techniques, the improvement by GFSA already surpasses DeepViT-24B (0.6%) (Zhou et al., 2021a), LayerScale (0.7%) (Touvron et al., 2021b), LateInsertion (0.6%) (Touvron et al., 2021b), and HAT (1.38%) (Bai et al., 2022). To sum up, we observe that both shallow and deep ViTs benefit from GFSA: i) the filter response plot shows that GFSA can preserve higher-frequency representations (cf. Figure 2 (a)), and ii) Figure 2 (b) shows that GFSA can mitigate the increase in the cosine similarity of representations as the layers get deeper. We also show experiments with the same ContraNorm setting in Appendix G.4.

4.3 EXPERIMENTS ON AUTOMATIC SPEECH RECOGNITION

Experimental Setting.
We conduct automatic speech recognition (ASR) experiments on the LibriSpeech dataset (Panayotov et al., 2015) (http://www.openslr.org/12), which consists of audio recordings paired with their transcriptions. For implementation, we follow the recipes of SpeechBrain (Ravanelli et al., 2021) and describe the detailed settings in Appendix I.1. For our experiments, we use Branchformer (Peng et al., 2022), one of the state-of-the-art models for speech recognition, as well as a pure Transformer.

Table 3: Comparison with state-of-the-art models on the ImageNet-1k dataset. The number in (↑) indicates the performance improvement over the base model.

Category | Method | Input Size | #Layers | #Params | Top-1 Acc
CNN | ResNet-152 (He et al., 2016) | 224 | 152 | 230M | 78.1
CNN | DenseNet-201 (Huang et al., 2017) | 224 | 201 | 77M | 77.6
CNN + Transformer | CVT-21 (Wu et al., 2021) | 224 | 21 | 32M | 82.5
CNN + Transformer | Refiner (Zhou et al., 2021b) | 224 | 16 | 86M | 81.2
Transformer | ViT-S/16 (Dosovitskiy et al., 2021) | 224 | 12 | 49M | 78.1
Transformer | ViT-B/16 (Dosovitskiy et al., 2021) | 224 | 12 | 86M | 79.8
Transformer | DeiT-S (Touvron et al., 2021a) | 224 | 12 | 22M | 79.8
Transformer | DeiT-S Distilled (Touvron et al., 2021a) | 224 | 12 | 22M | 81.2
Transformer | DeiT-S + LayerScale (Touvron et al., 2021b) | 224 | 12 | 22M | 80.5
Transformer | DeiT-S + LateInsertion (Touvron et al., 2021b) | 224 | 12 | 22M | 80.5
Transformer | DeiT-S + ClassAttention (Touvron et al., 2021b) | 224 | 12 | 22M | 80.6
Transformer | DeiT-S + AttnScale (Wang et al., 2022) | 224 | 12 | 22M | 80.7
Transformer | DeiT-S + FeatScale (Wang et al., 2022) | 224 | 12 | 22M | 80.9
Transformer | DeiT-S + HAT (Bai et al., 2022) | 224 | 12 | 22M | 80.9
Transformer | DeiT-S + Diverse (Chen et al., 2022) | 224 | 12 | 22M | 80.6
Transformer | DeiT-S + ContraNorm (Guo et al., 2023) | 224 | 12 | 22M | 80.4
Transformer | Swin-S (Liu et al., 2021) | 224 | 12 | 50M | 82.9
Transformer | T2T-ViT-24 (Yuan et al., 2021) | 224 | 24 | 64M | 82.3
Transformer | DeepViT-24B (Zhou et al., 2021a) | 224 | 24 | 36M | 80.1
Transformer | DeiT-S (Touvron et al., 2021a) | 224 | 24 | 43M | 80.5
Transformer | CaiT-S (Touvron et al., 2021b) | 224 | 24 | 47M | 82.6
Transformer | DeiT-S + DiversePatch (Gong et al., 2021) | 224 | 24 | 44M | 82.2
Transformer | DeiT-S + LayerScale (Touvron et al., 2021b) | 224 | 12 | 22M | 82.4
Transformer | DeiT-S + AttnScale (Wang et al., 2022) | 224 | 24 | 44M | 81.1
Transformer | DeiT-S + FeatScale (Wang et al., 2022) | 224 | 24 | 44M | 81.3
Transformer | DeiT-S + ContraNorm (Guo et al., 2023) | 224 | 24 | 43M | 80.7
GFSA | DeiT-S + GFSA | 224 | 12 | 43M | 81.1 (↑ 1.3)
GFSA | DeiT-S + GFSA | 224 | 24 | 43M | 81.5 (↑ 1.0)
GFSA | CaiT-S + GFSA | 224 | 24 | 43M | 82.8 (↑ 0.2)
GFSA | Swin-S + GFSA | 224 | 12 | 50M | 83.0 (↑ 0.1)

Table 4: Results for ASR training on LibriSpeech 100h and 960h with GFSA (WER, lower is better)

Method | #Params | 100h test-clean WER | 100h test-other WER | 960h test-clean WER | 960h test-other WER
Transformer | 71.5M | 11.02 | 25.42 | 2.42 | 5.50
Transformer + GFSA | 71.5M | 10.30 | 24.30 | 2.31 | 5.49
Branchformer (Peng et al., 2022) | 109.8M | 9.63 | 22.43 | 2.13 | 5.00
Branchformer + GFSA | 109.8M | 9.60 | 22.25 | 2.11 | 4.94

Results. Table 4 compares word error rates (WERs) on LibriSpeech 100h and 960h. For LibriSpeech 100h, Transformer+GFSA achieves 10.30/24.30 on the test-clean/other sets, where the test-clean WER is a 6.53% improvement over the Transformer model. For LibriSpeech 960h, Transformer+GFSA shows a test-clean WER of 2.31, a 4.55% improvement over the Transformer. Figure 7 in Appendix I.2 depicts the learning curves of train loss and valid loss when using GFSA, showing the effectiveness of our proposed filter.
Table 5: Experimental results and number of parameters on ZINC

Method | #Params | MAE
Graphormer | 500K | 0.1240
Graphormer + GFSA | 500K | 0.1189

Table 6: Experimental results and number of parameters on PCQM4M and PCQM4Mv2

Method | #Params | PCQM4M Train | PCQM4M Validate | PCQM4Mv2 Train | PCQM4Mv2 Validate
Graphormer | 48.3M | 0.0535 | 0.1286 | 0.0250 | 0.0862
Graphormer + GFSA | 48.3M | 0.0312 | 0.1193 | 0.0249 | 0.0860

4.4 EXPERIMENTS ON GRAPH TRANSFORMERS

Experimental Setting. To confirm the efficacy of GFSA on graph Transformers, we conduct experiments on a commonly used graph-level prediction dataset (i.e., ZINC) from the benchmarking-GNN leaderboards (Dwivedi et al., 2023) and the recent OGB-LSC quantum chemistry regression challenge (i.e., PCQM4M-LSC) (Hu et al., 2021), which is currently the biggest graph-level prediction dataset and contains more than 3.8M graphs in total. We choose Graphormer (Ying et al., 2021) as the backbone architecture and use its framework (https://github.com/microsoft/Graphormer) for implementation. Mean absolute errors (MAEs) are reported for both regression tasks. For detailed experimental settings, see Appendix H.1.

Results. Tables 5 and 6 show that Graphormer with GFSA consistently outperforms the vanilla model across all datasets. As can be seen from the train MAE values in Table 6, using GFSA allows the model to learn more effectively during the training process. Notably, plugging in GFSA improves the performance by a large margin, e.g., a 7.20% relative decline in validation MAE on PCQM4M. Due to space constraints, results with standard deviations are included in Tables 14 and 15 in Appendix H.2.

4.5 EXPERIMENTS ON CODE CLASSIFICATION

Experimental Setting. We conduct a code defect detection task based on the Devign dataset provided by Zhou et al. (2019). We use RoBERTa (Liu et al., 2019), CodeBERT (Feng et al., 2020), PLBART (Ahmad et al., 2021), and CodeT5 (Wang et al., 2021b) as our backbone models. The detailed settings are in Appendix J.1.

Table 7: Code defect accuracy for models with GFSA plugged in. The number in (↑) indicates the improvement rate (%) over the base model.

Method | Accuracy
RoBERTa (Liu et al., 2019) | 62.88
RoBERTa + GFSA | 64.39 (↑ 2.40%)
CodeBERT (Feng et al., 2020) | 64.31
CodeBERT + GFSA | 64.49 (↑ 0.12%)
PLBART (Ahmad et al., 2021) | 62.63
PLBART + GFSA | 62.96 (↑ 0.52%)
CodeT5-small (Wang et al., 2021b) | 63.25
CodeT5-small + GFSA | 63.69 (↑ 0.70%)
CodeT5-base (Wang et al., 2021b) | 63.51
CodeT5-base + GFSA | 64.75 (↑ 1.95%)

Results. Table 7 shows the accuracy of all models; the models with GFSA perform better than the base models. The biggest improvement is 2.40% for RoBERTa. In the case of CodeT5-base, using GFSA shows an accuracy of 64.75, an improvement of 1.95% from 63.51. CodeT5-small+GFSA has only about 100 additional parameters compared to CodeT5-small with 60M parameters and, even more impressively, it surpasses the accuracy of CodeT5-base. In Appendix J.2, we include case studies for this task. We also report the results of the code clone detection task in Appendix K. In Table 17, CodeT5-small+GFSA outperforms CodeT5-small with an F1 of 94.36.

4.6 RUNTIME OVERHEADS

Our GFSA does not significantly increase training time. We summarize the runtime of base models vs. base models with GFSA in Appendix L. On the GLUE tasks in natural language understanding, BERT with GFSA improves the average performance from 82.51% to 83.58% (see Table 1) with an overhead of less than 3 minutes in total training time. Considering the performance improvements, we think the increases in training time are negligible.
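A rough way to gauge this overhead in isolation is the following micro-benchmark sketch. It times plain attention aggregation against the GFSA filter of Equation 5 on random matrices; the token count, head dimension, and coefficient values are illustrative assumptions, and this is not the measurement protocol of Appendix L.

    import time
    import torch

    def bench(fn, iters=100):
        # Simple wall-clock timing helper.
        start = time.perf_counter()
        for _ in range(iters):
            fn()
        return (time.perf_counter() - start) / iters

    n, d, K = 197, 64, 3                        # e.g., a ViT-S-like token count and head dimension
    att = torch.softmax(torch.randn(n, n), dim=-1)
    V = torch.randn(n, d)
    w0, w1, wK = 0.0, 1.0, 0.1                  # illustrative coefficients
    I = torch.eye(n)

    plain = lambda: att @ V
    gfsa = lambda: (w0 * I + w1 * att + wK * (att + (K - 1) * (att @ att - att))) @ V

    print(f"plain attention: {bench(plain) * 1e6:.1f} us/iter")
    print(f"GFSA attention:  {bench(gfsa) * 1e6:.1f} us/iter")

The extra cost is dominated by the single additional n × n matrix product for Ā², which is the trade-off discussed above.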
5 FINAL REMARKS

Our proposed graph filter-based self-attention (GFSA) mechanism achieves high performance with improvements on a variety of tasks in computer vision, natural language processing, graph pattern classification, speech recognition, and code classification. GFSA is a simple yet effective method that enriches the self-attention in Transformers with more diverse frequency information. This enables GFSA to address the oversmoothing problem and learn better latent representations for downstream tasks. At the same time, GFSA does not bring significant overheads to those Transformers' empirical runtime complexity. One could use more complicated graph filters to enhance accuracy further, but our goal is to find a balance between accuracy enhancements and overheads in runtime complexity. We believe that GFSA is a promising new direction for improving Transformers. It is simple to implement and can be used in conjunction with other techniques to further improve the performance of Transformers. We hope that our work will inspire more research on graph filter-based self-attention and its applications.

ETHICAL STATEMENTS

In terms of the broader impact of this research on society, we do not foresee strongly negative consequences. However, this paper may have implications for the carbon footprint and accessibility of learning algorithms. The computation required for machine learning research is rapidly growing, resulting in a larger carbon footprint (Schwartz et al., 2020). Our study improves performance while increasing runtime only very slightly. Nevertheless, in future research, it will also be important to study and improve our GFSA by taking carbon footprints into account. GFSA improves the performance of existing Transformer-based models, which can have many positive impacts on society through services that utilize natural language processing, computer vision, and speech recognition. However, it will also be important to improve GFSA by considering other dimensions of AI, such as robustness to adversarial examples, fairness, and explainability.

REPRODUCIBILITY STATEMENT

To ensure the reproducibility and completeness of this paper, we include an Appendix with 17 sections. Appendix A provides the complete proof of Theorem 3.1. Appendix B contains the empirical study on the error related to Theorem 3.1. Appendix C provides PyTorch-style pseudocode for our GFSA method, which helps to implement GFSA in any Transformer that uses a pure self-attention. All experiments in the paper are reproducible with the additional implementation details provided in Appendices E to K.

REFERENCES

Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. Unified pre-training for program understanding and generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2655–2668, 2021.

Ameen Ali, Tomer Galanti, and Lior Wolf. Centered self-attention layers. arXiv preprint arXiv:2306.01610, 2023.

Jiawang Bai, Li Yuan, Shu-Tao Xia, Shuicheng Yan, Zhifeng Li, and Wei Liu. Improving vision transformers by revisiting high-frequency components. In European Conference on Computer Vision, pp. 1–18. Springer, 2022.

Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. The fifth pascal recognizing textual entailment challenge.
TAC, 7:8, 2009. 10 Graph Convolutions Enrich the Self-Attention in Transformers! Edward De Brouwer, Jaak Simm, Adam Arany, and Yves Moreau. GRU-ODE-Bayes: Continuous modeling of sporadically-observed time series. In Advances in Neural Information Processing Systems (NeurIPS), 2019. Chen Cai and Yusu Wang. A note on over-smoothing for graph neural networks. arXiv preprint arXiv:2006.13318, 2020. Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055, 2017. Nuo Chen, Linjun Shou, Ming Gong, Jian Pei, Bowen Cao, Jianhui Chang, Daxin Jiang, and Jia Li. Alleviating over-smoothing for unsupervised sentence representation. arXiv preprint arXiv:2305.06154, 2023. Tianlong Chen, Zhenyu Zhang, Yu Cheng, Ahmed Awadallah, and Zhangyang Wang. The principle of diversity: Training stronger vision transformers calls for reducing all levels of redundancy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12020–12030, 2022. Zihan Chen, Hongbo Zhang, Xiaoji Zhang, and Leqi Zhao. Quora question pairs, 2018. Eli Chien, Jianhao Peng, Pan Li, and Olgica Milenkovic. Adaptive universal generalized PageRank graph neural network. In Proceedings of the International Conference on Learning Representations (ICLR), 2021. Jeongwhan Choi, Seoyoung Hong, Noseong Park, and Sung-Bae Cho. Gread: Graph neural reaction-diffusion networks. In International Conference on Machine Learning (ICML), pp. 5722–5747. PMLR, 2023. Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems (NeurIPS), 2016. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https: //aclanthology.org/N19-1423. Bill Dolan and Chris Brockett. Automatically constructing a corpus of sentential paraphrases. In Third International Workshop on Paraphrasing (IWP2005), 2005. Yihe Dong, Jean-Baptiste Cordonnier, and Andreas Loukas. Attention is not all you need: Pure attention loses rank doubly exponentially with depth. In International Conference on Machine Learning (ICML), pp. 2793–2803. PMLR, 2021. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations (ICLR), 2021. Vijay Prakash Dwivedi, Chaitanya K Joshi, Anh Tuan Luu, Thomas Laurent, Yoshua Bengio, and Xavier Bresson. Benchmarking graph neural networks. Journal of Machine Learning Research, 24(43):1–48, 2023. Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. CodeBERT: A pre-trained model for programming and natural languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 1536–1547, 2020. 
Johannes Gasteiger, Stefan Weißenberger, and Stephan Günnemann. Diffusion improves graph learning. In Advances in Neural Information Processing Systems (NeurIPS), volume 32, 2019. 11 Graph Convolutions Enrich the Self-Attention in Transformers! Chengyue Gong, Dilin Wang, Meng Li, Vikas Chandra, and Qiang Liu. Vision transformers with patch diversification. arXiv preprint arXiv:2104.12753, 2021. Alex Graves and Navdeep Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the 31st International Conference on Machine Learning (ICML), volume 32, pp. 1764–1772. PMLR, 22–24 Jun 2014. Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, et al. Conformer: Convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100, 2020. Xiaojun Guo, Yifei Wang, Tianqi Du, and Yisen Wang. ContraNorm: A contrastive learning perspective on oversmoothing and beyond. In Proceedings of the International Conference on Learning Representations (ICLR), 2023. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016. Weihua Hu, Matthias Fey, Hongyu Ren, Maho Nakata, Yuxiao Dong, and Jure Leskovec. OGBLSC: A large-scale challenge for machine learning on graphs. arXiv preprint arXiv:2103.09430, 2021. Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708, 2017. John J Irwin, Teague Sterling, Michael M Mysinger, Erin S Bolstad, and Ryan G Coleman. Zinc: a free tool to discover chemistry for biology. Journal of chemical information and modeling, 52 (7):1757–1768, 2012. Nicolas Keriven. Not too little, not too much: a theoretical analysis of graph (over) smoothing. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, pp. 2268–2281, 2022. Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In Proceedings of the International Conference on Learning Representations (ICLR), 2017. Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019. Siddique Latif, Aun Zaidi, Heriberto Cuayahuitl, Fahad Shamshad, Moazzam Shoukat, and Junaid Qadir. Transformers in speech processing: A survey. arXiv preprint arXiv:2303.11607, 2023. Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. BART: Denoising sequence-to-sequence pretraining for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7871–7880, 2020. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019. Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 10012–10022, 2021. 
Ilya Loshchilov and Frank Hutter. arXiv:1711.05101, 2017. Decoupled weight decay regularization. arXiv preprint Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of english: The penn treebank. 1993. 12 Graph Convolutions Enrich the Self-Attention in Transformers! Antonio G. Marques, Santiago Segarra, and Gonzalo Mateos. Signal processing on directed graphs: The role of edge directionality when processing and learning from network data. IEEE Signal Processing Magazine, 37(6):99–116, 2020. Sohir Maskey, Raffaele Paolino, Aras Bacho, and Gitta Kutyniok. A fractional graph laplacian approach to oversmoothing. In Advances in Neural Information Processing Systems (NeurIPS), 2023. Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016. Lorenzo Noci, Sotiris Anagnostidis, Luca Biggio, Antonio Orvieto, Sidak Pal Singh, and Aurelien Lucchi. Signal propagation in transformers: Theoretical perspectives and the role of rank collapse. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, pp. 27198–27211, 2022. Kenta Oono and Taiji Suzuki. Graph neural networks exponentially lose expressive power for node classification. In Proceedings of the International Conference on Learning Representations (ICLR), 2020. Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 5206–5210. IEEE, 2015. Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le. Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech 2019, 2019. Yifan Peng, Siddharth Dalmia, Ian Lane, and Shinji Watanabe. Branchformer: Parallel mlp-attention architectures to capture local and global context for speech recognition and understanding. In International Conference on Machine Learning (ICML), pp. 17627–17643. PMLR, 2022. Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020. Ladislav Rampášek, Michael Galkin, Vijay Prakash Dwivedi, Anh Tuan Luu, Guy Wolf, and Dominique Beaini. Recipe for a general, powerful, scalable graph transformer. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, pp. 14501–14515, 2022. Mirco Ravanelli, Titouan Parcollet, Peter Plantinga, Aku Rouhe, Samuele Cornell, Loren Lugosch, Cem Subakan, Nauman Dawalatabad, Abdelwahab Heba, Jianyuan Zhong, Ju-Chieh Chou, SungLin Yeh, Szu-Wei Fu, Chien-Feng Liao, Elena Rastorgueva, François Grondin, William Aris, Hwidong Na, Yan Gao, Renato De Mori, and Yoshua Bengio. SpeechBrain: A general-purpose speech toolkit, 2021. arXiv:2106.04624. T. Konstantin Rusch, Michael M. Bronstein, and Siddhartha Mishra. A survey on oversmoothing in graph neural networks. arXiv preprint arXiv: Arxiv-2303.10993, 2023. Aliaksei Sandryhaila and José MF Moura. 
Discrete signal processing on graphs. IEEE transactions on signal processing, 61(7):1644–1656, 2013. Aliaksei Sandryhaila and Jose MF Moura. Discrete signal processing on graphs: Frequency analysis. IEEE Transactions on Signal Processing, 62(12):3042–3054, 2014. Roy Schwartz, Jesse Dodge, Noah A Smith, and Oren Etzioni. Green AI. Communications of the ACM, 63(12):54–63, 2020. 13 Graph Convolutions Enrich the Self-Attention in Transformers! Han Shi, Jiahui Gao, Hang Xu, Xiaodan Liang, Zhenguo Li, Lingpeng Kong, Stephen Lee, and James T Kwok. Revisiting over-smoothing in bert from the perspective of graph. In Proceedings of the International Conference on Learning Representations (ICLR), 2022. Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631–1642, 2013. Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning (ICML), pp. 10347–10357. PMLR, 2021a. Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 32–42, 2021b. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems (NeurIPS), volume 30, 2017. Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph Attention Networks. In Proceedings of the International Conference on Learning Representations (ICLR), 2018. Guangtao Wang, Rex Ying, Jing Huang, and Jure Leskovec. Multi-hop attention graph neural network. In IJCAI, 2021a. Peihao Wang, Wenqing Zheng, Tianlong Chen, and Zhangyang Wang. Anti-oversmoothing in deep vision transformers via the fourier domain analysis: From theory to practice. In Proceedings of the International Conference on Learning Representations (ICLR), 2022. Wei Wang, Ming Yan, and Chen Wu. Multi-granularity hierarchical attention fusion networks for reading comprehension and question answering. arXiv preprint arXiv:1811.11934, 2018. Wenhan Wang, Ge Li, Bo Ma, Xin Xia, and Zhi Jin. Detecting code clones with graph neural network and flow-augmented abstract syntax tree. In 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 261–271, 2020. doi: 10.1109/ SANER48275.2020.9054857. Yue Wang, Weishi Wang, Shafiq Joty, and Steven CH Hoi. CodeT5: Identifier-aware unified pretrained encoder-decoder models for code understanding and generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 8696–8708, 2021b. Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625–641, 2019. Ross Wightman. Pytorch image models. pytorch-image-models, 2019. https://github.com/rwightman/ Adina Williams, Nikita Nangia, and Samuel R Bowman. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426, 2017. 
Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, and Lei Zhang. CvT: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 22–31, 2021. Xinyi Wu, Amir Ajorlou, Zihui Wu, and Ali Jadbabaie. Demystifying oversmoothing in attentionbased graph neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2023a. Xinyi Wu, Zhengdao Chen, William Wei Wang, and Ali Jadbabaie. A non-asymptotic analysis of oversmoothing in graph neural networks. In Proceedings of the International Conference on Learning Representations (ICLR), 2023b. 14 Graph Convolutions Enrich the Self-Attention in Transformers! Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. Representation learning on graphs with jumping knowledge networks. In International Conference on Machine Learning (ICML), pp. 5453–5462, 2018. Hanqi Yan, Lin Gui, Wenjie Li, and Yulan He. Addressing token uniformity in transformers via singular value transformation. In Uncertainty in Artificial Intelligence, pp. 2181–2191. PMLR, 2022. Zhewei Yao, Xiaoxia Wu, Conglong Li, Connor Holmes, Minjia Zhang, Cheng Li, and Yuxiong He. Random-ltd: Random and layerwise token dropping brings efficient training for large-scale transformers. arXiv preprint arXiv:2211.11586, 2022. Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, and Tie-Yan Liu. Do transformers really perform badly for graph representation? In Advances in Neural Information Processing Systems (NeurIPS), volume 34, pp. 28877–28888, 2021. Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zi-Hang Jiang, Francis EH Tay, Jiashi Feng, and Shuicheng Yan. Tokens-to-token vit: Training vision transformers from scratch on imagenet. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 558–567, 2021. Shuangfei Zhai, Tatiana Likhomanenko, Etai Littwin, Dan Busbridge, Jason Ramapuram, Yizhe Zhang, Jiatao Gu, and Joshua M. Susskind. Stabilizing transformer training by preventing attention entropy collapse. In Proceedings of the 40th International Conference on Machine Learning (ICML), volume 202, pp. 40770–40803. PMLR, 2023. Aston Zhang, Alvin Chan, Yi Tay, Jie Fu, Shuohang Wang, Shuai Zhang, Huajie Shao, Shuochao Yao, and Roy Ka-Wei Lee. On orthogonality constraints for transformers. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp. 375–382, 2021. Ying Zhang, Mohammad Pezeshki, Philémon Brakel, Saizheng Zhang, César Laurent, Yoshua Bengio, and Aaron Courville. Towards end-to-end speech recognition with deep convolutional neural networks. Interspeech 2016, 2016. Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Zihang Jiang, Qibin Hou, and Jiashi Feng. DeepViT: Towards deeper vision transformer. arXiv preprint arXiv:2103.11886, 2021a. Daquan Zhou, Yujun Shi, Bingyi Kang, Weihao Yu, Zihang Jiang, Yuan Li, Xiaojie Jin, Qibin Hou, and Jiashi Feng. Refiner: Refining self-attention for vision transformers. arXiv preprint arXiv:2106.03714, 2021b. Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. Graph neural networks: A review of methods and applications. AI open, 1:57–81, 2020. Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, and Yang Liu. 
Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. In Advances in Neural Information Processing Systems (NeurIPS), volume 32, 2019.

Chunya Zou, Andi Han, Lequan Lin, and Junbin Gao. A simple yet effective svd-gcn for directed graphs. arXiv preprint arXiv:2205.09335, 2022.

Appendix

Table of Contents

A Proof of Theorem 3.1
B Empirical Error Bound
C Implementation of GFSA
D Oversmoothing and Additional Visualizations
E Natural Language Understanding: E.1 Detailed Experimental Settings, E.2 Sensitivity to K
F Causal Language Modeling: F.1 Detailed Experimental Settings, F.2 Sensitivity to K
G Image Classification: G.1 Detailed Experimental Settings, G.2 FLOPs & Throughput, G.3 Sensitivity to K, G.4 Additional Experiments with Guo et al. (2023)'s setting
H Graph Classification: H.1 Detailed Experimental Settings, H.2 Experimental results with standard deviation
I Automatic Speech Recognition: I.1 Detailed Experimental Settings, I.2 Training Curve
J Code Defect Detection: J.1 Detailed Experimental Settings, J.2 Case Study
K Code Clone Detection: K.1 Detailed Experimental Settings, K.2 Experiment Result
L Time Complexity and Empirical Runtime Analysis
M Inference Time
N Frequency Analyses in the Singular Value Domain
O Matrix Polynomial vs. Graph Fourier Transform
P Approximation of Ā^K in Equation 4
Q GFSA in Selected Layers: A Strategy for Model Efficiency

A PROOF OF THEOREM 3.1

Proof. We begin by observing that the L∞ norm of Ā² is bounded by:

    ‖Ā²‖ ≤ ‖Ā‖².    (9)

By the induction principle, it follows that for any positive integer k,

    ‖Ā^k‖ ≤ ‖Ā‖^k.    (10)

Given that Ā is a matrix normalized with softmax: i) all the elements of Ā lie within [0, 1], and ii) the sum of the elements in each row of Ā is equal to 1 (potentially less than 1 if an attention mask is used). We can then calculate the upper bound of the L∞ norm of Ā as follows:

    ‖Ā‖ = max_i Σ_j |Ā_{i,j}| ≤ max_i 1 = 1.    (11)

Now, considering the error term E_K as given by Theorem 3.1 and applying the triangle inequality for matrix norms:

    E_K = ‖Ā^K − (Ā + (K − 1)(Ā² − Ā))‖    (12)
        ≤ ‖Ā^K‖ + ‖Ā‖ + (K − 1)(‖Ā²‖ + ‖Ā‖).    (13)

From the prior expressions, we deduce that ‖Ā^K‖ ≤ ‖Ā‖^K ≤ 1^K = 1. Thus, the upper bound of the error term E_K becomes:

    E_K ≤ 1 + 1 + (K − 1)(1 + 1) = 2K.    (14)
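A minimal numerical check of this bound is sketched below, assuming a random softmax-normalized Ā rather than attention taken from a trained model:

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    n = 64
    A_bar = F.softmax(torch.randn(n, n), dim=-1)         # row-stochastic attention matrix

    def linf_norm(M):
        # Matrix L-infinity norm: maximum absolute row sum.
        return M.abs().sum(dim=-1).max().item()

    for K in (2, 4, 6, 8, 10):
        exact = torch.matrix_power(A_bar, K)
        approx = A_bar + (K - 1) * (A_bar @ A_bar - A_bar)   # Equation 4
        E_K = linf_norm(exact - approx)
        assert E_K <= 2 * K                                  # Theorem 3.1
        print(f"K={K:2d}  E_K={E_K:.4f}  bound={2 * K}")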
B EMPIRICAL ERROR BOUND

In this section, we empirically compare the theoretical error bound in Theorem 3.1 with the actual errors. Figure 3 shows the theoretical upper bound of E_K and the actual E_K for the attention of the last layer in a trained pure backbone. DeiT-S and Graphormer are trained on ImageNet-1k and ZINC, respectively, and BERT is finetuned on STS-B. For all models, the actual E_K is always smaller than the theoretical error bound.

Figure 3: Theoretical and actual values of E_K for (a) DeiT-S, (b) BERT, and (c) Graphormer.

C IMPLEMENTATION OF GFSA

The pseudocode of our GFSA is shown in Algorithm 1. For implementation, w_0 and w_1 can optionally be set as hyperparameters.

Algorithm 1 PyTorch-style pseudocode for GFSA

# h: number of attention heads, n: number of tokens
w_0 = torch.zeros(h)
w_1 = torch.ones(h)
w_K = torch.zeros(h)
I = torch.eye(n)[None, None, ...]

def GFSA(att, K):
    # att: original self-attention matrix of shape (batch, heads, n, n)
    # att_K: approximated high-order term, Ā^K ≈ Ā + (K-1)(Ā² − Ā)
    att_K = att + (K - 1) * (torch.matmul(att, att) - att)
    # gf_att: GFSA attention, a per-head weighted sum of I, Ā, and the high-order term
    gf_att = (w_0[None, :, None, None] * I
              + w_1[None, :, None, None] * att
              + w_K[None, :, None, None] * att_K)
    return gf_att

D OVERSMOOTHING AND ADDITIONAL VISUALIZATIONS

In Figure 2, we show visualizations of the oversmoothing characteristics of DeiT. We also provide visualizations in other domains: Figure 4 shows the filter response, cosine similarity, and singular values of BERT finetuned on the STS-B dataset of the GLUE tasks, and Figure 5 shows the same for Graphormer finetuned on ZINC.

To characterize self-attention, we first analyze the filter response of self-attention in the frequency domain. We follow the method used by Wang et al. (2022) for spectral visualization of the self-attention matrix. As shown in Figure 2 (a), DeiT has a near-zero magnitude for the high frequencies, which is characteristic of a low-pass filter and is likely to result in oversmoothing when applied multiple times. We follow the calculation method of Guo et al. (2023) for cosine similarity. As shown in Figure 2 (b), the increase in similarity as the layers of the model get deeper is related to the oversmoothing problem. To further analyze this issue, we also consider dimensionality collapse in Transformer-based models and plot the singular value distribution of the features in the last block. As shown in Figure 2 (c), insignificant, near-zero values dominate the feature distribution. As the layers get deeper, the similarity of features increases and dimensional collapse occurs. The same oversmoothing problem appears in BERT and Graphormer, as shown in Figures 4 and 5.

Figure 4: Filter frequency response, cosine similarity, and singular values on STS-B for BERT and BERT+GFSA.
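As a rough illustration of how these diagnostics can be computed (our own sketch, not the authors' visualization code; it assumes a single matrix of token features taken from one block, with random features standing in for real activations), the mean pairwise cosine similarity and the normalized singular values can be obtained as follows:

import torch

def oversmoothing_diagnostics(X):
    # X: token features from one Transformer block, shape (n_tokens, d)
    Xn = torch.nn.functional.normalize(X, dim=-1)
    cos = Xn @ Xn.T                                   # pairwise cosine similarities
    n = X.shape[0]
    mean_cos = (cos.sum() - n) / (n * (n - 1))        # average over off-diagonal pairs
    sv = torch.linalg.svdvals(X)                      # singular values, in descending order
    return mean_cos.item(), (sv / sv.max()).tolist()  # normalized singular values

# Example with DeiT-S-like shapes (197 tokens, 384-dim features) as a stand-in
mean_cos, normalized_sv = oversmoothing_diagnostics(torch.randn(197, 384))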
Figure 5: Filter frequency response, cosine similarity, and singular values on ZINC for Graphormer and Graphormer+GFSA.

E NATURAL LANGUAGE UNDERSTANDING

E.1 DETAILED EXPERIMENTAL SETTINGS

Dataset. The benchmark datasets we used are listed below.

CoLA The Corpus of Linguistic Acceptability (Warstadt et al., 2019) consists of English acceptability judgments drawn from books and journal articles. The target task is binary classification: each sentence is judged to be grammatically acceptable or not.

SST-2 The Stanford Sentiment Treebank (Socher et al., 2013) is a dataset in which each sentence is sourced from movie reviews and accompanied by human annotations of its sentiment. The target task is binary sentiment classification for a single sentence.

MRPC The Microsoft Research Paraphrase Corpus (Dolan & Brockett, 2005) is a corpus of sentence pairs automatically extracted from online news sources and annotated by humans. The target is to determine whether the sentences in each pair are semantically equivalent.

QQP The Quora Question Pairs (Chen et al., 2018) dataset is a collection of question pairs from the community question-answering website Quora. The target is to determine whether the questions in each pair are semantically equivalent.

STS-B The Semantic Textual Similarity Benchmark (Cer et al., 2017) is a collection of sentence pairs drawn from news headlines, video and image captions, and natural language inference data with human annotations. The target is a regression task to predict a similarity score from 0 to 5.

MNLI The Multi-Genre Natural Language Inference Corpus (Williams et al., 2017) is a crowdsourced collection of sentence pairs with textual entailment annotations. Given a premise sentence and a hypothesis sentence, the task is to predict whether the premise entails the hypothesis (entailment), contradicts it (contradiction), or neither (neutral). The standard test set consists of private labels from the authors and evaluates both the matched (in-domain) and mismatched (cross-domain) sections.

QNLI The Stanford Question Answering (Wang et al., 2018) dataset is a question-answering dataset consisting of question-paragraph pairs, where one of the sentences in the paragraph contains the answer to the corresponding question written by an annotator. The task is to determine whether the context sentence contains the answer to the question.

RTE The Recognizing Textual Entailment (Bentivogli et al., 2009) dataset comes from a series of annual textual entailment challenges. The target task is binary entailment classification.

BERT. BERT (Devlin et al., 2019) consists of 12 layers, 12 heads, a hidden size of 768, a maximum sequence length of 512, and an MLP dimension of 3072.
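For reference, the following sketch (ours; it relies only on the public HuggingFace datasets/transformers APIs rather than on the authors' training scripts, and uses CoLA as an example task) shows how a GLUE task and a BERT-base checkpoint with the above configuration can be loaded:

from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load one GLUE task (CoLA) and a 12-layer, 12-head, 768-dimensional BERT-base model.
dataset = load_dataset("glue", "cola")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

encoded = dataset["train"].map(lambda ex: tokenizer(ex["sentence"], truncation=True), batched=True)
print(model.config.num_hidden_layers, model.config.num_attention_heads, model.config.hidden_size)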
Figure 6: Sensitivity results for various K with BERT_BASE finetuned on GLUE tasks: (a) CoLA, (b) SST-2, (c) MRPC, (d) QQP, (e) STS-B, (f) MNLI-m, (g) MNLI-mm, (h) QNLI, and (i) RTE.

ALBERT. ALBERT (Lan et al., 2019) consists of 12 layers, 12 heads, a hidden size of 768, a maximum sequence length of 512, an embedding dimension of 128, and an MLP dimension of 3072.

RoBERTa. RoBERTa (Liu et al., 2019) consists of 12 layers, 12 heads, a hidden size of 768, a maximum sequence length of 514, and an MLP dimension of 3072.

Training. For implementation, we adopt the HuggingFace framework. We train all models for 5 epochs with a batch size of 32. Linear learning rate decay is used, and the initial learning rate is set to 2 × 10−5. We use the AdamW (Loshchilov & Hutter, 2017) optimizer, and weight decay is set to 0. All models are trained on a single NVIDIA RTX A5000 24GB GPU.

E.2 SENSITIVITY TO K

In this section, we explore the influence of the polynomial order K in our GFSA, conducting experiments on BERT_BASE finetuned on the GLUE tasks. We search for values of K from 2 to 10, and the results are presented in Figure 6. An optimal K exists for each dataset, and the performance of models using GFSA is generally robust to changes in K.

F CAUSAL LANGUAGE MODELING

F.1 DETAILED EXPERIMENTAL SETTINGS

Dataset. The benchmark datasets we used are listed below.

PTB The Penn Treebank (Marcus et al., 1993) dataset is a collection of text documents that have been extensively annotated with linguistic information, primarily syntactic and grammatical structures.

WikiText The WikiText (Merity et al., 2016) dataset is a collection of over 100 million tokens extracted from the set of verified good and featured articles on Wikipedia. Compared to the preprocessed version of the Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger.

GPT2. GPT-2 (Radford et al., 2019) is a Transformer pretrained on a very large corpus of English data in a self-supervised fashion without any human labeling. Inputs and labels are generated automatically from the raw text, and the model is trained to guess the next word in a sentence. For implementation, we adopt the HuggingFace framework (https://github.com/huggingface/transformers). For all experiments, GPT-2 has 12 layers with 12 attention heads, a hidden size of 768, and a maximum sequence length of 1024, resulting in a total of 117 million parameters.

Training. We finetune GPT-2 with a batch size of 4, a learning rate of 5 × 10−5, and linear learning rate decay using the AdamW (Loshchilov & Hutter, 2017) optimizer. We also apply dropout with probability 0.1. Following Yao et al. (2022), we train models for 15 epochs on PTB, 4 epochs on WikiText-103, and 10 epochs on WikiText-2. We evaluate with perplexity, a commonly used metric for language models, particularly in language modeling and text generation tasks; perplexity measures how well a language model can predict a sequence of words in a given text or test set.
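Concretely, for a test sequence of N tokens, the perplexity reported here is the exponentiated average negative log-likelihood under the model p_θ (the standard definition, stated for completeness):

\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_{\theta}(w_i \mid w_{<i})\right).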
All experiments are conducted on a single NVIDIA RTX 3090 24GB GPU.

F.2 SENSITIVITY TO K

We conduct a sensitivity study on K for GPT-2 across all datasets, and the results are presented in Table 8. For PTB and WikiText-2, GFSA exhibits the best performance when K is high, typically around 8 or 9. For WikiText-103, however, GFSA achieves the best perplexity when K is small, specifically when K is 3 or 4.

Table 8: Results comparison on GPT-2 finetuned with GFSA
Method | #Params | PTB | WikiText-2 | WikiText-103
GPT2 (Radford et al., 2019) | 117M | 19.513 | 20.966 | 15.939
GPT2 + GFSA (K = 2) | 117M | 19.459 | 20.929 | 15.920
GPT2 + GFSA (K = 3) | 117M | 19.456 | 20.927 | 15.919
GPT2 + GFSA (K = 4) | 117M | 19.453 | 20.927 | 15.919
GPT2 + GFSA (K = 5) | 117M | 19.452 | 20.925 | 15.920
GPT2 + GFSA (K = 6) | 117M | 19.451 | 20.925 | 15.920
GPT2 + GFSA (K = 7) | 117M | 19.450 | 20.925 | 15.921
GPT2 + GFSA (K = 8) | 117M | 19.450 | 20.924 | 15.921
GPT2 + GFSA (K = 9) | 117M | 19.450 | 20.923 | 15.921

G IMAGE CLASSIFICATION

G.1 DETAILED EXPERIMENTAL SETTINGS

Our code is implemented on top of the timm library (Wightman, 2019). Our training recipe is the same as the experimental setting of Wang et al. (2022), which follows the training recipes of Touvron et al. (2021a) and Touvron et al. (2021b). To apply our GFSA to existing base models such as DeiT, CaiT, and Swin, we consider a range of K between 2 and 5. For 12-layer DeiT, we follow the same hyperparameters as Wang et al. (2022). We set the dropout rate to 0 and 0.2 for 12-layer and 24-layer DeiT, respectively. For CaiT, we apply our GFSA only to the patch embedding layers. All other hyperparameters are kept consistent with the original papers of DeiT (Touvron et al., 2021a), CaiT (Touvron et al., 2021b), and Swin (Liu et al., 2021). All models are trained on NVIDIA RTX 3090 24GB GPUs.

G.2 FLOPS & THROUGHPUT

In Table 9, we report the number of FLOPs and the throughput. With GFSA plugged in, the FLOP count is either identical or only marginally higher: DeiT-S with 24 layers shows a slight FLOP increase, while the remaining settings have the same number of FLOPs. Throughput tends to decrease because calculating the high-order term is an additional cost.

Table 9: Experimental evaluation of GFSA plugged into DeiT-S, CaiT-S, and Swin-S
Backbone | Method | Input Size | #Layers | #Params | #FLOPs | Throughput | Top-1 Acc
DeiT | DeiT-S | 224 | 12 | 22.0M | 4.57G | 856.07 | 79.8
DeiT | DeiT-S + GFSA | 224 | 12 | 22.0M | 4.57G | 614.54 | 81.1 (↑ 1.3)
DeiT | DeiT-S | 224 | 24 | 43.3M | 9.09G | 423.68 | 80.5
DeiT | DeiT-S + GFSA | 224 | 24 | 43.3M | 9.10G | 314.75 | 81.5 (↑ 1.0)
CaiT | CaiT-S | 224 | 24 | 46.9M | 9.34G | 574.66 | 82.6
CaiT | CaiT-S + GFSA | 224 | 24 | 47.0M | 9.34G | 406.96 | 82.8 (↑ 0.2)
Swin | Swin-S | 224 | 24 | 49.6M | 8.75G | 912.38 | 82.9
Swin | Swin-S + GFSA | 224 | 24 | 49.6M | 8.75G | 714.60 | 83.0 (↑ 0.1)
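As a reference for how such throughput numbers can be reproduced (our own sketch, not the authors' benchmark script; it assumes a CUDA device, a batch size of 64, and the timm model name deit_small_patch16_224), one can time forward passes as follows:

import time
import torch
import timm

model = timm.create_model("deit_small_patch16_224", pretrained=False).cuda().eval()
x = torch.randn(64, 3, 224, 224, device="cuda")
with torch.no_grad():
    for _ in range(10):            # warm-up iterations
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(50):            # timed iterations
        model(x)
    torch.cuda.synchronize()
print(f"{64 * 50 / (time.time() - start):.1f} images/s")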
G.3 SENSITIVITY TO K

We also perform a sensitivity analysis for K. Tables 10 and 11 show the results for DeiT-S and CaiT-S with GFSA plugged in. For 12-layer DeiT-S, GFSA reaches its highest top-1 accuracy of 81.12 when K = 3. With K = 2, the performance is worse than the original DeiT-S, but with K of 3 or higher, the performance exceeds the original DeiT-S and, most surprisingly, even the 24-layer DeiT-S. CaiT-S shows the highest performance of 82.84 when K = 4: its accuracy is slightly lower than that of the original CaiT-S when K = 2, but it starts to exceed the accuracy of CaiT-S when K is 3 or higher.

Table 10: Sensitivity to K for 12-layer DeiT-S + GFSA
K | 2 | 3 | 4 | 5
Top-1 Acc (%) | 79.27 | 81.12 | 80.86 | 81.07

Table 11: Sensitivity to K for 24-layer CaiT-S + GFSA
K | 2 | 3 | 4
Top-1 Acc (%) | 82.54 | 82.65 | 82.84

G.4 ADDITIONAL EXPERIMENTS WITH GUO ET AL. (2023)'S SETTING

To make a fair comparison with ContraNorm (Guo et al., 2023), one of the related studies that mitigates oversmoothing, we run additional experiments that match their experimental setup.

Experimental Setting. We follow the training recipe used by Guo et al. (2023), which is a slightly modified version of Touvron et al. (2021a)'s recipe. Guo et al. (2023) use the AdamW optimizer with cosine learning rate decay. We select DeiT-T and DeiT-S for ImageNet-100 and ImageNet-1k, respectively; "T" and "S" denote the tiny and small model sizes. For all experiments, the image size is set to 224×224. We train each model for 300 epochs with a batch size of 1024. For ContraNorm, we train with their recommended hyperparameters. All models are trained on 4 NVIDIA RTX A6000 48GB GPUs.

Experimental Results. In Table 12, DeiT-T and DeiT-S with GFSA outperform vanilla DeiT-T and DeiT-S in all layer settings. On ImageNet-100, GFSA improves the performance of 12-layer DeiT-T by 1.52%, and the largest gain is a 4.88% improvement on 16-layer DeiT-T. This shows that the effect of GFSA is larger than that of ContraNorm. For ImageNet-1k, with 16 layers, GFSA raises the performance of DeiT-S to 80.83%, a 3.23% improvement.

Table 12: Experimental results on ImageNet-100 and ImageNet-1k
Dataset | Method | #Layers=12 | #Layers=16 | #Layers=24
ImageNet-100 | DeiT-T | 76.52 | 75.34 | 76.76
ImageNet-100 | DeiT-T + ContraNorm | 77.03 | 78.72 | 78.12
ImageNet-100 | DeiT-T + GFSA | 77.68 | 79.02 | 78.64
ImageNet-1k | DeiT-S | 77.32 | 78.25 | 77.69
ImageNet-1k | DeiT-S + ContraNorm | 77.80 | 79.04 | 78.67
ImageNet-1k | DeiT-S + GFSA | 79.86 | 80.83 | 79.15

Sensitivity to K. In Table 13, we report a sensitivity analysis for K under this setting. For ImageNet-100, the performance of GFSA generally improves when K is 4 or 5. On the other hand, GFSA performs better at lower K for the 16- and 24-layer settings on ImageNet-1k.

Table 13: Sensitivity to K on ImageNet-100 and ImageNet-1k
Dataset | Method | K | #Layers=12 | #Layers=16 | #Layers=24
ImageNet-100 | DeiT-T + GFSA | 2 | 76.92 | 78.14 | 78.40
ImageNet-100 | DeiT-T + GFSA | 3 | 77.41 | 77.76 | 78.09
ImageNet-100 | DeiT-T + GFSA | 4 | 77.01 | 79.02 | 78.64
ImageNet-100 | DeiT-T + GFSA | 5 | 77.68 | 78.14 | 78.64
ImageNet-1k | DeiT-S + GFSA | 2 | 79.84 | 80.83 | 79.15
ImageNet-1k | DeiT-S + GFSA | 3 | 79.85 | 79.39 | 79.07
ImageNet-1k | DeiT-S + GFSA | 4 | 79.86 | 79.44 | 79.10

H GRAPH CLASSIFICATION

H.1 DETAILED EXPERIMENTAL SETTINGS

Dataset. The benchmark datasets we used are listed below.

ZINC ZINC (Irwin et al., 2012) is the most popular real-world molecular dataset for graph property regression of constrained solubility, an important chemical property for designing generative GNNs for molecules. Uniform sampling is adopted for data splitting. We use the small-scale ZINC subset.

PCQM4M PCQM4M-LSC (Hu et al., 2021) consists of 2D molecular graphs; the task is to predict the DFT (density functional theory)-calculated HOMO-LUMO energy gap of molecules given their graphs, one of the most practically relevant quantum chemical properties in molecular science.
PCQM4M-LSC is unprecedentedly large in scale compared to other labeled graph-level prediction datasets, containing more than 3.8M graphs. We use the PCQM4M and PCQM4Mv2 large-scale datasets.

Graphormer. Following Graphormer (Ying et al., 2021), we use Graphormer for PCQM4M and Graphormer_SLIM for ZINC. Graphormer consists of 12 encoder layers, an encoder embedding dimension of 768, and an MLP dimension of 768; it employs 32 encoder heads with a hidden dimension of 24 per head. Graphormer_SLIM consists of 12 encoder layers, an encoder embedding dimension of 80, and an MLP dimension of 80; it employs 8 encoder heads with a hidden dimension of 10 per head.

Training. We adopt the training recipe of Graphormer (Ying et al., 2021). We use the AdamW (Loshchilov & Hutter, 2017) optimizer with coefficients 0.9 and 0.999 for the running averages of the gradient and its square, and use the mean absolute error (MAE) as the loss function. We use polynomial learning rate decay, with the initial learning rate set to 2 × 10−4 and the end learning rate set to 1 × 10−9. For ZINC, we set the batch size to 256, the maximum number of epochs to 10k, and the warm-up steps to 40k. For PCQM4M and PCQM4Mv2, we set the batch size to 1024, the maximum number of epochs to 300, and the warm-up steps to 60k. All models are trained on a single NVIDIA RTX 3090 24GB GPU. We conduct experiments with 4 different seeds.

H.2 EXPERIMENTAL RESULTS WITH STANDARD DEVIATION

We conducted experiments following the experimental environments of Graphormer (Ying et al., 2021) using 4 different seeds. Due to space constraints, only the mean values are reported in Tables 5 and 6. In this section, we report the means and standard deviations in Tables 14 and 15.

Table 14: Experimental results and number of parameters on ZINC
Method | #Params | MAE
Graphormer | 500K | 0.1240±0.006
Graphormer + GFSA | 500K | 0.1189±0.002

Table 15: Experimental results and number of parameters on PCQM4M and PCQM4Mv2
Method | #Params | PCQM4M Train | PCQM4M Validate | PCQM4Mv2 Train | PCQM4Mv2 Validate
Graphormer | 48.3M | 0.0535±0.038 | 0.1286±0.016 | 0.0250±0.000 | 0.0862±0.000
Graphormer + GFSA | 48.3M | 0.0312±0.001 | 0.1193±0.000 | 0.0249±0.000 | 0.0860±0.000

I AUTOMATIC SPEECH RECOGNITION

I.1 DETAILED EXPERIMENTAL SETTINGS

Dataset. We conduct experiments on the LibriSpeech dataset (Panayotov et al., 2015) (http://www.openslr.org/12), which consists of audio recordings paired with their transcriptions. LibriSpeech contains approximately 1,000 hours of read English speech sampled at 16 kHz. We keep the original 16,000 Hz sampling rate and compute 80-dimensional log-Mel filterbanks over a 25 ms sliding window with a 10 ms stride. The filterbank features are then normalized to zero mean and unit variance per input sequence. For implementation, we follow the recipes of SpeechBrain (Ravanelli et al., 2021).

Evaluation Metric. The word error rate (WER, %) is derived from the Levenshtein distance and compares a reference to a hypothesized word-level transcription. It is calculated by summing the number of word insertions, deletions, and substitutions and dividing by the total number of words in the reference transcription.
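Written out, with S, D, and I denoting the numbers of substituted, deleted, and inserted words and N the number of words in the reference transcription (the standard definition, stated for completeness):

\mathrm{WER} = \frac{S + D + I}{N} \times 100\%.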
Vanilla Transformer. We use a vanilla Transformer to apply our GFSA, implemented with the SpeechBrain (Ravanelli et al., 2021) framework. The vanilla Transformer consists of i) a 1D convolution to perform striding, ii) a Transformer encoder with 12 layers, 4 heads, an embedding dimension of 512, an MLP dimension of 2048, and post-LayerNorm, iii) a decoder with 6 layers, 4 heads, an embedding dimension of 512, an MLP dimension of 2048, and joint beam search, and iv) an external Transformer language model with 12 layers, 12 heads, an embedding dimension of 768, and an MLP dimension of 3072.

Branchformer. We also plug our GFSA into Branchformer (Peng et al., 2022), one of the state-of-the-art models. Branchformer has two parallel branches, one capturing global interactions using attention and the other capturing more localized context using a convolutional gating MLP. The Branchformer architecture for speech recognition consists of i) a 1D convolution to perform striding, ii) a Branchformer encoder with 18 layers, 8 heads, an embedding dimension of 512, and an MLP dimension of 3072, iii) a decoder with 6 layers, 8 heads, an embedding dimension of 512, a convolutional spatial gating unit (CSGU) dimension of 3072, and joint beam search, and iv) an external Transformer language model with 12 layers, 12 heads, an embedding dimension of 768, and an MLP dimension of 3072.

Training. We follow a training recipe from SpeechBrain (Ravanelli et al., 2021). The standard LibriSpeech validation sets (dev-clean and dev-other) are used to tune all parameters and select the best models; the test sets (test-clean and test-other) are used only to report the final WER performance. We train the pure Transformer for 100 epochs and the Branchformer for 120 epochs with a batch size of 16. We use SpecAugment (Park et al., 2019) for data augmentation on all models; SpecAugment applies time and frequency masking as well as time warping to the input spectrum. For Branchformer, we use the AdamW (Loshchilov & Hutter, 2017) optimizer with coefficients 0.9 and 0.98 for the running averages of the gradient and its square. The learning rate and weight decay in all models are 0.0008 and 0.01, respectively. We use a connectionist temporal classification (CTC) loss (Graves & Jaitly, 2014; Zhang et al., 2016). We also apply dropout with probability 0.1 and label smoothing with weight 0.1 to mitigate overfitting. We fix the random seed to 74443 in all experiments. All models are trained on a single NVIDIA RTX A6000 48GB GPU.

Hyperparameters. In Table 16, we describe the main hyperparameters used in the automatic speech recognition task. For Transformer+GFSA and Branchformer+GFSA, we also report the best K hyperparameter.
Table 16: Main hyperparameters used in ASR

Transformer. Encoder: Transformer (12 layers); Decoder: Transformer (6 layers) + (CTC/ATT joint) beam search + TransformerLM; Augmentation: SpecAugment; Features: 40 fbanks; Pretraining: no; Dropout: 0.1; Batchnorm: yes; Number of epochs: 100; Batch size: 32; Learning rate: 0.0008; LR scheduler: Noam; Optimizer: Adam; Loss: CTC + KLdiv (label smoothing loss); CTC weight: 0.3.

Transformer+GFSA. Identical to the Transformer setting above, with K: 2.

Branchformer. Encoder: Branchformer; Decoder: Transformer (6 layers) + (CTC/ATT joint) beam search + TransformerLM; Augmentation: SpecAugment; Features: 40 fbanks; Pretraining: no; Dropout: 0.1; Batchnorm: yes; Number of epochs: 120; Batch size: 16; Learning rate: 0.0008; LR scheduler: Noam; Optimizer: AdamW with coefficients 0.9 and 0.98; Loss: CTC + KLdiv (label smoothing loss); CTC weight: 0.3.

Branchformer+GFSA. Identical to the Branchformer setting above, with K: 3.

I.2 TRAINING CURVE

We compare the training and validation loss curves on LibriSpeech 100h in Figure 7. The training loss curve with GFSA is lower than that of the pure Transformer, and GFSA stabilizes the loss curve slightly earlier.

Figure 7: Training curves on LibriSpeech 100h: (a) training loss and (b) validation loss for Transformer and Transformer+GFSA.

J CODE DEFECT DETECTION

J.1 DETAILED EXPERIMENTAL SETTINGS

Dataset. We use the Devign dataset provided by Zhou et al. (2019), a binary classification task that evaluates whether a C-language code snippet is vulnerable to software systems or not.

Implementation. We build our experiments on top of the open-sourced code (https://github.com/salesforce/CodeT5) and recipes provided by Wang et al. (2021b).

RoBERTa. RoBERTa (Liu et al., 2019) is an encoder-only model trained with masked language modeling on code. All hyperparameters are consistent with the training method in the source code of Wang et al. (2021b).

PLBART. PLBART (Ahmad et al., 2021) is an encoder-decoder model based on the BART (Lewis et al., 2020) architecture. PLBART supports both understanding and generation tasks. All hyperparameters are consistent with the training method in the source code of Wang et al. (2021b).

CodeBERT. CodeBERT (Feng et al., 2020) is a model trained on masked language modeling and replaced token detection. CodeBERT is a bimodal pretrained Transformer with 12 layers for programming language and natural language. All hyperparameters are consistent with the training method in the source code of Wang et al. (2021b).

CodeT5. CodeT5 is an encoder-decoder framework with the same architecture as T5 (Raffel et al., 2020).
It aims to derive generic representations for programming language and natural language via pre-training on unlabeled source code. CodeT5-small has 6 encoder layers, 6 decoder layers, 8 attention heads, 512-dimensional hidden states, and 60M parameters. The other models have 12 encoder layers, 12 decoder layers, 12 attention heads, 768-dimensional hidden states, and 220M parameters. All hyperparameters are consistent with the training method in the source code of Wang et al. (2021b).

Training. The pre-trained models mentioned above are applied to this downstream task. We add GFSA directly on top of the self-attention. We finetune the baselines and the GFSA models for 10 epochs with a batch size of 16, using an early stopping strategy with a patience of 2. The models generate binary labels from unigram sequences at the decoder for the defect detection task. We use accuracy to evaluate code defect detection. All models are trained on a single NVIDIA RTX A6000 48GB GPU.

J.2 CASE STUDY

In Listing 1, we show one case of a defective code snippet in QEMU (https://www.qemu.org) that CodeT5-base does not predict correctly but CodeT5-base+GFSA does. The commit message for this case (https://github.com/qemu/qemu/commit/b13ce26d3e8c6682044ae84920f2417b30ce356b) is as follows: "Needed for changing cpu_has_work() argument type to CPUState, used in h_cede()." h_cede() is the hypercall that asks the hypervisor to shut down the CPU. Previously, this hypercall simply passed the CPU ID, so the hypervisor did not know what state the CPU was in. This change allows the hypervisor to know whether the CPU is actually performing work: if the CPU is performing a task, the hypervisor waits for the CPU to complete it. In this context, accurately predicting defects like the one above is very important, and applying GFSA to CodeT5-base helps in terms of performance improvement.

@@ -204,7 +204,7 @@ static target_ulong put_tce_emu(sPAPRTCETable *tcet, target_ulong ioba,
-static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
+static target_ulong h_put_tce(PowerPCCPU *cpu, sPAPREnvironment *spapr,
                               target_ulong opcode, target_ulong *args)
 {
     target_ulong liobn = args[0];
     target_ulong ioba = args[1];
     target_ulong tce = args[2];
     VIOsPAPRDevice *dev = spapr_vio_find_by_reg(spapr->vio_bus, liobn);
     VIOsPAPR_RTCE *rtce;
     if (!dev) {
         hcall_dprintf("LIOBN 0x" TARGET_FMT_lx " does not exist\n", liobn);
         return H_PARAMETER;
     }
     ioba &= ~(SPAPR_VIO_TCE_PAGE_SIZE - 1);
 #ifdef DEBUG_TCE
     fprintf(stderr, "spapr_vio_put_tce on %s ioba 0x" TARGET_FMT_lx " TCE 0x" TARGET_FMT_lx "\n", dev->qdev.id, ioba, tce);
 #endif
     if (ioba >= dev->rtce_window_size) {
         hcall_dprintf("Out-of-bounds IOBA 0x" TARGET_FMT_lx "\n", ioba);
         return H_PARAMETER;
     }
     rtce = dev->rtce_table + (ioba >> SPAPR_VIO_TCE_PAGE_SHIFT);
     rtce->tce = tce;
     return H_SUCCESS;
 }

Listing 1: An example commit for a defect in the Devign dataset

K CODE CLONE DETECTION

K.1 DETAILED EXPERIMENTAL SETTINGS

Dataset. Code clone detection aims to measure the similarity between two code snippets and predict whether they have the same functionality. We experiment with the Java data provided by Wang et al. (2020).

Implementation. We build our experiments on top of the open-sourced code (https://github.com/salesforce/CodeT5) and recipes provided by Wang et al. (2021b).
Training. We finetune both CodeT5 and CodeT5+GFSA for one epoch with a batch size of 16, using early stopping with a patience of 2. CodeT5 and CodeT5+GFSA encode the source code and use the resulting representation to calculate the similarity of two code snippets. We use the F1 score to evaluate this task. All models are trained on a single NVIDIA RTX A6000 48GB GPU.

K.2 EXPERIMENT RESULT

Table 17 compares CodeT5 with and without GFSA. With GFSA, the CodeT5 models improve their performance; for example, CodeT5-small+GFSA provides a 0.61% improvement over CodeT5-small.

Table 17: Results on the code clone detection task
Method | Clone F1
CodeT5-small (Wang et al., 2021b) | 94.36
CodeT5-small + GFSA | 94.94 (↑ 0.61%)
CodeT5-base (Wang et al., 2021b) | 94.31
CodeT5-base + GFSA | 94.92 (↑ 0.64%)

L TIME COMPLEXITY AND EMPIRICAL RUNTIME ANALYSIS

Time Complexity. The time complexity of the original self-attention is O(n²d). Our GFSA additionally computes the high-order term, so its time complexity is O(n²d + n³). If n is smaller than d, the time complexity approaches O(n²d), the complexity of the original self-attention.

Empirical Runtime Analysis. We report the training time of various methods with GFSA in Tables 18 to 23. In general, the training time with GFSA is slightly longer than that of the existing methods. For example, the per-epoch training time of the Transformer for the automatic speech recognition task increases from 190.5 seconds to 191.6 seconds on the LibriSpeech 100h dataset, an increase of only about 1 second. Instead of computing higher-order polynomial terms exactly, our GFSA approximates them, so the increase in runtime is small.

Table 18: Training time (seconds per epoch) on GLUE tasks. s denotes seconds. Avg denotes the average training time across all tasks.
Method | #Params | CoLA | SST-2 | MRPC | QQP | STS-B | MNLI-m/mm | QNLI | RTE | Avg
BERT_BASE (Devlin et al., 2019) | 110M | 17s | 182s | 17s | 1483s | 24s | 2004s | 580s | 18s | 541s
BERT_BASE + GFSA | 110M | 19s | 192s | 19s | 1571s | 25s | 2147s | 621s | 20s | 577s
ALBERT_BASE (Lan et al., 2019) | 11M | 15s | 188s | 20s | 1506s | 25s | 2072s | 612s | 19s | 557s
ALBERT_BASE + GFSA | 11M | 16s | 197s | 21s | 1604s | 26s | 2219s | 659s | 21s | 595s
RoBERTa_BASE (Liu et al., 2019) | 125M | 17s | 190s | 18s | 1492s | 25s | 2012s | 593s | 18s | 546s
RoBERTa_BASE + GFSA | 125M | 19s | 200s | 19s | 1580s | 26s | 2151s | 634s | 20s | 581s

Table 19: Training time (seconds per epoch) on causal language modeling tasks
Method | #Params | PTB | WikiText-2 | WikiText-103 | Avg
GPT2 (Radford et al., 2019) | 117M | 89.1s | 195.7s | 9638.4s | 3307.8s
GPT2 + GFSA | 117M | 160.3s | 354.2s | 17424.6s | 5979.7s
Table 20: Training time (seconds per epoch) on ImageNet-1k
Backbone | Method | #Layers | #Params | #FLOPs | Throughput | Runtime
DeiT | DeiT-S | 12 | 22.0M | 4.57G | 856.07 | 551s
DeiT | DeiT-S + GFSA | 12 | 22.0M | 4.57G | 614.54 | 814s
DeiT | DeiT-S | 24 | 43.3M | 9.09G | 423.68 | 1508s
DeiT | DeiT-S + GFSA | 24 | 43.3M | 9.10G | 314.75 | 1798s
CaiT | CaiT-S | 24 | 46.9M | 9.34G | 574.66 | 1530s
CaiT | CaiT-S + GFSA | 24 | 47.0M | 9.34G | 406.96 | 1624s
Swin | Swin-S | 24 | 49.6M | 8.75G | 912.38 | 1897s
Swin | Swin-S + GFSA | 24 | 49.6M | 8.75G | 714.60 | 1970s

Table 21: Training time (seconds per epoch) on graph classification tasks
Method | ZINC | PCQM4M | PCQM4Mv2
Graphormer (Ying et al., 2021) | 9s | 740s | 817s
Graphormer + GFSA | 9s | 896s | 955s

Table 22: Training time (seconds per epoch) on LibriSpeech datasets
Method | LibriSpeech 100h | LibriSpeech 960h
Transformer | 190.5s | 3049.3s
Transformer + GFSA | 191.6s | 3398.4s
Branchformer (Peng et al., 2022) | 248.5s | 4990.1s
Branchformer + GFSA | 254.3s | 4999.3s

Table 23: Training time (seconds per epoch) on the code defect prediction task
Method | Runtime
RoBERTa (Liu et al., 2019) | 543.96s
RoBERTa + GFSA | 537.79s
CodeBERT (Feng et al., 2020) | 555.28s
CodeBERT + GFSA | 561.43s
PLBART (Ahmad et al., 2021) | 467.80s
PLBART + GFSA | 470.19s
CodeT5-small (Wang et al., 2021b) | 301.11s
CodeT5-small + GFSA | 309.04s
CodeT5-base (Wang et al., 2021b) | 362.28s
CodeT5-base + GFSA | 373.22s

M INFERENCE TIME

We report the inference time of various methods with GFSA in Tables 24 to 29.

Table 24: Inference time on GLUE tasks. s denotes seconds. Avg denotes the average inference time across all tasks.
Method | #Params | CoLA | SST-2 | MRPC | QQP | STS-B | MNLI-m/mm | QNLI | RTE | Avg
BERT_BASE (Devlin et al., 2019) | 110M | 1.0s | 1.4s | 1.2s | 48.7s | 1.9s | 15.5s | 10.1s | 1.2s | 10.0s
BERT_BASE + GFSA | 110M | 1.1s | 1.4s | 1.2s | 52.3s | 2.0s | 16.8s | 11.0s | 1.3s | 11.0s
ALBERT_BASE (Lan et al., 2019) | 11M | 1.1s | 1.6s | 1.4s | 58.4s | 2.2s | 18.4s | 12.1s | 1.3s | 12.0s
ALBERT_BASE + GFSA | 11M | 1.2s | 1.7s | 1.4s | 62.1s | 2.3s | 19.7s | 13.1s | 1.4s | 13.0s
RoBERTa_BASE (Liu et al., 2019) | 125M | 1.0s | 1.4s | 1.1s | 47.0s | 1.9s | 15.0s | 9.9s | 1.2s | 10.0s
RoBERTa_BASE + GFSA | 125M | 1.1s | 1.4s | 1.2s | 50.4s | 2.0s | 16.3s | 10.8s | 1.2s | 11.0s

Table 25: Inference time on causal language modeling tasks
Method | #Params | PTB | WikiText-2 | WikiText-103 | Avg
GPT2 (Radford et al., 2019) | 117M | 3.2s | 7.4s | 7.4s | 6.0s
GPT2 + GFSA | 117M | 5.5s | 12.9s | 12.9s | 10.4s

Table 26: Inference time on ImageNet-1k
Backbone | Method | #Layers | Inference Time
DeiT | DeiT-S | 12 | 52s
DeiT | DeiT-S + GFSA | 12 | 53s
DeiT | DeiT-S | 24 | 68s
DeiT | DeiT-S + GFSA | 24 | 69s
CaiT | CaiT-S | 24 | 105s
CaiT | CaiT-S + GFSA | 24 | 107s
Swin | Swin-S | 24 | 17s
Swin | Swin-S + GFSA | 24 | 17s

Table 27: Inference time on graph classification tasks
Method | ZINC | PCQM4M | PCQM4Mv2
Graphormer (Ying et al., 2021) | 8s | 99s | 31s
Graphormer + GFSA | 8s | 117s | 29s

Table 28: Inference time on LibriSpeech datasets
Method | LibriSpeech 100h | LibriSpeech 960h
Transformer | 328.1s | 323.7s
Transformer + GFSA | 329.5s | 343.3s
Branchformer (Peng et al., 2022) | 299.4s | 328.7s
Branchformer + GFSA | 305.5s | 354.1s

Table 29: Inference time on the code defect prediction task
Method | Inference Time
RoBERTa (Liu et al., 2019) | 22.4s
RoBERTa + GFSA | 23.9s
CodeBERT (Feng et al., 2020) | 23.8s
CodeBERT + GFSA | 24.1s
PLBART (Ahmad et al., 2021) | 37.7s
PLBART + GFSA | 39.3s
CodeT5-small (Wang et al., 2021b) | 78.2s
CodeT5-small + GFSA | 82.5s
CodeT5-base (Wang et al., 2021b) | 83.2s
CodeT5-base + GFSA | 88.5s
N FREQUENCY ANALYSES IN THE SINGULAR VALUE DOMAIN

Graph signal processing (GSP) (Sandryhaila & Moura, 2013; 2014) can be understood as a generalized concept of discrete signal processing (DSP); in other words, DSP is a special case of GSP where a line graph with n nodes is used, and therefore the graph Fourier transform (GFT) of the line graph is identical to the discrete Fourier transform. In the definition of the GFT, we assume that the graph shift operator (GSO) S is diagonalizable. Considering the eigendecomposition of the GSO, S = V⊺ΛV with eigenvectors V, we can write the graph filter output as follows:

y = Σ_{k=0}^{K} w_k S^k x = Σ_{k=0}^{K} V⊺ w_k Λ^k V x = V⊺ (Σ_{k=0}^{K} w_k Λ^k) V x = V⊺ g(Λ) V x,   (15)

where x ∈ R^n is a 1-dimensional graph signal, Λ is a diagonal matrix of eigenvalues, and w_k ∈ [−∞, ∞] is a coefficient. However, one can use the singular value decomposition, S = V⊺ΛU, instead of the eigendecomposition when the GSO is not diagonalizable but symmetrically normalized (Maskey et al., 2023). Both the singular value decomposition and the eigendecomposition project the original signal onto a set of basis vectors, but they use different basis sets. In the singular value decomposition, we sort the basis vectors in ascending order of their eigenvalues, i.e., Λ, and perform frequency-domain-like analyses (Zou et al., 2022; Maskey et al., 2023).

Since the self-attention matrix's row-wise sum is always 1, we have Ā = D⁻¹A = (1/n)A, where n is the number of tokens. Maskey et al. (2023) define the symmetrically normalized adjacency matrix (SNA) as D_in^{-1/2} A D_out^{-1/2}. Since the degree of every node is n in the self-attention matrix, we have D_in^{-1/2} A D_out^{-1/2} = D^{-1/2} A D^{-1/2} = (1/√n) A (1/√n) = (1/n) A = Ā. Therefore, the self-attention matrix is a special case of SNAs.

O MATRIX POLYNOMIAL VS. GRAPH FOURIER TRANSFORM

There are two paradigms for implementing graph filters: i) matrix polynomials, which do not require diagonalizability, and ii) the graph Fourier transform, which uses the eigendecomposition for diagonalizable adjacency matrices or the Jordan decomposition for non-diagonalizable adjacency matrices. The two paradigms have their own weaknesses: i) the matrix polynomial approach requires explicit matrix multiplications, and ii) the graph Fourier transform approach requires expensive spectral decompositions. The matrix polynomial is preferred when the number of matrix multiplications is small; otherwise, the graph Fourier transform approach may be better, since the matrix multiplication can be simplified after the eigendecomposition or the Jordan decomposition. Of the two, we use the matrix polynomial approach with only three non-zero coefficients {w_0, w_1, w_K}, since it does not require a complicated spectral decomposition. Because we do not rely on any explicit spectral decomposition but on the matrix polynomial, any adjacency matrix can be used.
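To illustrate the equivalence of the two paradigms discussed above, here is a minimal sketch of ours (assuming a small symmetric, hence diagonalizable, graph shift operator; it is not tied to the GFSA implementation):

import numpy as np

rng = np.random.default_rng(0)
n, w = 8, [0.2, 0.5, 0.3]                    # three filter coefficients w_0, w_1, w_2
S = rng.random((n, n))
S = (S + S.T) / 2                            # a symmetric (diagonalizable) graph shift operator
x = rng.random(n)                            # a 1-dimensional graph signal

# i) matrix polynomial: y = w_0 x + w_1 S x + w_2 S^2 x
y_poly = sum(wk * np.linalg.matrix_power(S, k) @ x for k, wk in enumerate(w))

# ii) graph Fourier transform: filter in the spectral domain with g(lam) = sum_k w_k lam^k
lam, V = np.linalg.eigh(S)                   # S = V diag(lam) V^T (NumPy's column convention)
g = sum(wk * lam**k for k, wk in enumerate(w))
y_gft = V @ (g * (V.T @ x))

print(np.allclose(y_poly, y_gft))            # True: both paradigms produce the same output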
P APPROXIMATION OF Ā^K IN EQUATION 4

The general formulation of the first-order Taylor approximation at a point a is:

f(x) ≃ f(a) + f′(a)(x − a).   (16)

We approximate f(K) = Ā^K with the first-order Taylor approximation at the point a = 1:

f(K) = Ā^K ≃ f(1) + f′(1)(K − 1).   (17)

Inspired by Brouwer et al. (2019), which approximates the derivative of the hidden states of RNNs as the difference between hidden states, we approximate f′(1), the derivative of Ā^K evaluated at K = 1, with the difference term f(2) − f(1) = Ā² − Ā. Then the approximated Ā^K becomes:

f(K) = Ā^K ≃ f(1) + (Ā² − Ā)(K − 1) = Ā + (K − 1)(Ā² − Ā).   (18)

This approximation has two advantages:
1. Approximating Ā^K with Ā and Ā² provides a simpler computation that significantly reduces the required computational resources and time (see the theoretical analysis in Section 3.2).
2. When approximating the derivative of Ā^K with a difference term for the graph filter design, our filter can reduce to GPR-GNN and GREAD, two popular GNN methods that prevent oversmoothing in graph convolutional networks; both are special cases of our design (see the discussion in Section 3.3). In other words, our method is more general than those two GNNs.

Q GFSA IN SELECTED LAYERS: A STRATEGY FOR MODEL EFFICIENCY

Because GFSA requires more computation than the original self-attention, the runtime increases when GFSA is used. While the experiments in the main paper apply GFSA to all layers of the Transformer, here we report a strategy that alleviates the computational burden by applying GFSA only to some layers; specifically, GFSA is used only on the even-numbered layers (see the sketch after Table 34). In Tables 30 to 34, the results show that this strategy reduces the increase in runtime while maintaining performance compared to using GFSA in all layers.

Table 30: Comparison of performance using GFSA on all layers vs. GFSA_even on even layers for GLUE tasks
Method | #Params | CoLA | SST-2 | MRPC | QQP | STS-B | MNLI-m/mm | QNLI | RTE | Avg
BERT_BASE (Devlin et al., 2019) | 110M | 56.79 | 93.81 | 88.70 | 88.32 | 88.16 | 84.96/84.15 | 91.63 | 66.06 | 82.51
BERT_BASE + GFSA | 110M | 59.56 | 94.15 | 90.60 | 88.46 | 88.33 | 85.12/85.06 | 91.95 | 68.95 | 83.58
BERT_BASE + GFSA_even | 110M | 58.80 | 93.69 | 90.50 | 88.47 | 88.27 | 85.13/85.02 | 91.65 | 70.76 | 83.59

Table 31: Comparison of training time (seconds per epoch) using GFSA on all layers vs. GFSA_even on even layers for GLUE tasks
Method | #Params | CoLA | SST-2 | MRPC | QQP | STS-B | MNLI-m/mm | QNLI | RTE | Avg
BERT_BASE (Devlin et al., 2019) | 110M | 17s | 182s | 17s | 1483s | 24s | 2004s | 580s | 18s | 541s
BERT_BASE + GFSA | 110M | 19s | 192s | 19s | 1571s | 25s | 2147s | 621s | 20s | 577s
BERT_BASE + GFSA_even | 110M | 17s | 185s | 18s | 1506s | 24s | 2061s | 595s | 18s | 553s

Table 32: Comparison of performance using GFSA on all layers vs. GFSA_even on even layers for causal language modeling tasks
Method | #Params | PTB | WikiText-2 | WikiText-103 | Avg
GPT2 (Radford et al., 2019) | 117M | 19.513 | 20.966 | 15.939 | 18.806
GPT2 + GFSA | 117M | 19.450 | 20.923 | 15.919 | 18.764
GPT2 + GFSA_even | 117M | 19.453 | 20.926 | 15.923 | 18.767

Table 33: Comparison of training time (seconds per epoch) using GFSA on all layers vs. GFSA_even on even layers for causal language modeling tasks
Method | #Params | PTB | WikiText-2 | WikiText-103 | Avg
GPT2 (Radford et al., 2019) | 117M | 89.1s | 195.7s | 9638.4s | 3307.8s
GPT2 + GFSA | 117M | 160.3s | 354.2s | 17424.6s | 5979.7s
GPT2 + GFSA_even | 117M | 127.4s | 279.1s | 13761.4s | 4722.6s

Table 34: Comparison of using GFSA on all layers vs. GFSA_even on even layers for ImageNet-1k
Method | Input Size | #Layers | #Params | Top-1 Acc | Runtime
DeiT-S | 224 | 12 | 43M | 79.8 | 551s
DeiT-S + GFSA | 224 | 12 | 43M | 81.1 | 814s
DeiT-S + GFSA_even | 224 | 12 | 43M | 81.0 | 595s
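For reference, the selective-application strategy can be sketched as a small wrapper around the attention matrices of a stack of layers. The following is our own illustration, not the released implementation; the GFSA module mirrors Algorithm 1, while the per-layer attention matrices are stand-ins generated at random, and "every second layer" is our reading of the even-numbered-layer rule:

import torch
import torch.nn as nn

class GFSA(nn.Module):
    """GFSA re-weighting of an attention matrix (Algorithm 1), one weight triple per head."""
    def __init__(self, n_heads, K=3):
        super().__init__()
        self.K = K
        self.w_0 = nn.Parameter(torch.zeros(n_heads))
        self.w_1 = nn.Parameter(torch.ones(n_heads))
        self.w_K = nn.Parameter(torch.zeros(n_heads))

    def forward(self, att):                        # att: (batch, heads, n, n)
        n = att.size(-1)
        I = torch.eye(n, device=att.device)[None, None]
        att_K = att + (self.K - 1) * (att @ att - att)
        w = lambda p: p[None, :, None, None]
        return w(self.w_0) * I + w(self.w_1) * att + w(self.w_K) * att_K

# Toy 12-layer loop that applies GFSA only on every second layer (GFSA_even strategy).
n_layers, heads, n = 12, 4, 16
gfsa = GFSA(heads, K=3)
att = torch.softmax(torch.randn(2, heads, n, n), dim=-1)   # stand-in attention matrix
for i in range(n_layers):
    layer_att = gfsa(att) if i % 2 == 1 else att
    # ... the rest of the block (value aggregation, FFN) would consume layer_att ...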