Graph Convolutions Enrich the Self-Attention in Transformers!

arXiv:2312.04234v1 [cs.LG] 7 Dec 2023

Hyowon Wi∗, Yonsei University, wihyowon@yonsei.ac.kr
Jeongwhan Choi∗, Yonsei University, jeongwhan.choi@yonsei.ac.kr
Yehjin Shin, Yonsei University, yehjin.shin@yonsei.ac.kr
Jayoung Kim, Yonsei University, jayoung.kim@yonsei.ac.kr
Kookjin Lee, Arizona State University, kookjin.lee@asu.edu
Nathaniel Trask, University of Pennsylvania, ntrask@seas.upenn.edu
Noseong Park†, Yonsei University, noseong@yonsei.ac.kr

∗ Equal Contribution. † Corresponding Author.

ABSTRACT

Transformers, renowned for their self-attention mechanism, have achieved state-of-the-art performance across various tasks in natural language processing, computer vision, time-series modeling, etc. However, one of the challenges with deep Transformer models is the oversmoothing problem, where representations across layers converge to indistinguishable values, leading to significant performance degradation. We interpret the original self-attention as a simple graph filter and redesign it from a graph signal processing (GSP) perspective. We propose graph-filter-based self-attention (GFSA) to learn a general yet effective one, whose complexity, however, is slightly larger than that of the original self-attention mechanism. We demonstrate that GFSA improves the performance of Transformers in various fields, including computer vision, natural language processing, graph pattern classification, speech recognition, and code classification.

1 INTRODUCTION

Transformers are arguably one of the most notable achievements in the field of deep learning. They now show state-of-the-art performance in various fields, ranging from computer vision to natural language processing to graph pattern classification to speech recognition, and so forth (Vaswani et al., 2017; Devlin et al., 2019; Radford et al., 2018; 2019; Dosovitskiy et al., 2021; Touvron et al., 2021a; Zhou et al., 2021a; Liu et al., 2021; Gulati et al., 2020; Latif et al., 2023; Ying et al., 2021; Rampášek et al., 2022). Recently, several studies have been conducted to better understand them (Gong et al., 2021; Ali et al., 2023; Wang et al., 2022); there is a common agreement among researchers that the self-attention is one of the keys leading to this success. However, several studies also point out potential limitations of the self-attention (Zhou et al., 2021a; Dong et al., 2021; Gong et al., 2021; Guo et al., 2023). For instance, Shi et al. (2022) revealed an analogy between the self-attention and the residual graph convolutional network (GCN), showing that BERT also suffers from a notorious problem of GCNs, called oversmoothing, i.e., tokens' latent representations become similar to each other at deeper layers. In every self-attention layer, value vectors are aggregated in a weighted-average manner since each row-wise sum of the attention matrix is always 1. Although each self-attention layer has its own attention matrix, this aggregation method causes the oversmoothing problem, not only in Transformers but also in graph neural networks (Oono & Suzuki, 2020; Cai & Wang, 2020; Wang et al., 2021a; Rusch et al., 2023; Keriven, 2022; Zhou et al., 2021a; Gong et al., 2021; Yan et al., 2022; Noci et al., 2022; Shi et al., 2022; Wang et al., 2022; Ali et al., 2023; Wu et al., 2023a;b).
However, we confine our discussion to the oversmoothing of Transformers (cf. Section 2). Inspired by these observations, we redesign the self-attention from the perspective of graph signal processing (GSP). However, performing graph convolutions in the self-attention layer may incur non-trivial computational overheads. Therefore, our key design point is to learn a general but effective graph filter with minimal overheads. In general, a graph filter on a graph G is represented by a polynomial expression based on its adjacency or Laplacian matrix — in this regard, the existing self-attention mechanism can be understood as the simplest graph filter with Ā only, where Ā ∈ [0, 1]^{n×n} denotes a learned attention matrix and n is the number of input tokens. Our proposed graph filter consists of an identity term and two matrix polynomial terms, Ā and Ā^K — one could design better filters with more polynomial terms, but we avoid this since Transformers already require very large computation. The K-th power, Ā^K, may also require high computation when the number of tokens is large. To avoid this, we further approximate Ā^K using an element-wise first-order Taylor approximation. Therefore, one can consider our proposed graph filter to be only one step more complicated than the filter used by the original self-attention mechanism. However, its efficacy is tremendous for various Transformers in various fields (cf. Figure 1). Our proposed filter enriches the self-attention with more diverse frequency information (see Figure 2(a)) — low (resp. high) frequency signals on G mean that neighboring nodes have similar (resp. different) values. Therefore, our method is able to not only effectively address the oversmoothing problem but also learn better latent representations for downstream tasks.

There exist a couple of prior works that enrich the self-attention mechanism with high-frequency information (Wang et al., 2022; Bai et al., 2022). In comparison with them, our proposed graph filter is distinctive in the following aspects: i) our proposed filter is more effective and shows better performance with comparable computational overheads, ii) our proposed filter is well-aligned with recent advancements in the GCN community — in other words, some graph filters used by recent advanced GCN methods are special cases of our proposed graph filter, which is not the case for prior works, and iii) other methods were typically studied for certain domains only, whereas we test our method in 6 domains — for instance, DiversePatch (Gong et al., 2021) works only for Vision Transformers.

[Figure 1: radar chart comparing backbones vs. backbones + GFSA on automatic speech recognition (LibriSpeech 960h), causal language modeling (PTB), image classification (ImageNet-1k), natural language understanding (CoLA), code classification (Devign), and graph classification (ZINC).]

Figure 1: Our proposed GFSA performs better than selected Transformers in various domains. We achieve these results with only tens to hundreds of additional parameters to Transformers. We replace the self-attention layer of selected Transformers in various fields with our proposed graph filter-based layer without changing other parts. Therefore, the accuracy increases in them are solely due to our proposed graph filter-based self-attention.
These enriched Transformers increase the model performance by 1.63% for image classification, 6.25% for natural language understanding, 0.31% for causal language modeling, 4.03% for graph classification, 4.76% for speech recognition, and 2.40% for code classification (see Figure 1). We summarize our core contributions as follows:

• We provide a novel perspective on self-attention as a graph filter. This perspective allows us to design a more effective self-attention that can address the oversmoothing problem.
• We propose a graph filter-based self-attention (GFSA) mechanism that is general yet effective. GFSA learns a graph filter with an identity term and two polynomial terms, which is more effective than the simple self-attention mechanism.
• We demonstrate that GFSA can significantly improve the performance of Transformers on a variety of tasks. GFSA achieves improved results on computer vision, natural language processing, graph pattern classification, speech recognition, and code classification.

2 BACKGROUND & RELATED WORK

2.1 SELF-ATTENTION IN TRANSFORMERS

The core building block of the Transformer architecture is the self-attention mechanism, which enables the model to learn attention patterns over its input tokens (Vaswani et al., 2017). The self-attention mechanism, denoted as SA : R^{n×d} → R^{n×d}, can be expressed as follows:

    SA(X) = softmax( (X W_qry)(X W_key)^⊤ / √d ) X W_val = Ā X W_val,    (1)

where X ∈ R^{n×d} is the input feature matrix, W_key ∈ R^{d×d}, W_qry ∈ R^{d×d}, and W_val ∈ R^{d×d} are the key, query, and value trainable parameters, respectively, and d is the dimension of each token. The self-attention mechanism allows the model to weigh the importance of each token in the input sequence relative to the others, enabling the model to better capture long-range contextual information. The Transformer architecture includes multiple layers, each with a multi-head self-attention layer followed by a position-wise feed-forward layer.

2.2 SELF-ATTENTION AND GRAPH CONVOLUTIONAL FILTER

The self-attention matrix used in Transformers can be seen as a normalized adjacency matrix of a graph (Shi et al., 2022; Guo et al., 2023; Maskey et al., 2023). A weighted graph G with adjacency matrix A can be constructed by using the input tokens as n nodes and setting the edge weight between node i and node j to exp((X W_qry)_i^⊤ (X W_key)_j). We can then rewrite the (i, j)-th entry of the self-attention matrix Ā as

    Ā_{i,j} = exp((X W_qry)_i^⊤ (X W_key)_j) / Σ_{k=1}^{n} exp((X W_qry)_i^⊤ (X W_key)_k).

This allows Ā to be interpreted as a normalized adjacency matrix; in other words, Ā = D^{-1} A, where D = diag(d_1, d_2, ..., d_n) and d_i = Σ_j A_{i,j}.

Our new attention method is designed on top of graph signal processing (GSP), which has a close connection to discrete signal processing (DSP) (Sandryhaila & Moura, 2013; 2014). In DSP, a discrete signal of length n can be represented by a vector x ∈ R^n. Let g ∈ R^n be a filter that we want to apply to x. The convolution x ∗ g can be written as follows:

    y_i = Σ_{j=1}^{n} x_j g_{i−j},    (2)

where the index i refers to the i-th element of each vector. GSP can be understood as a generalized concept of DSP: signals are defined on the nodes of a graph, and the graph's structure influences signal processing operations.
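As a concrete illustration of Equation 2, the following minimal NumPy sketch evaluates the convolution sum directly and checks it against numpy.convolve. It uses 0-based indexing and treats out-of-range filter taps as zero, which is an assumption about boundary handling rather than something specified in the paper.

    import numpy as np

    n = 8
    rng = np.random.default_rng(0)
    x = rng.standard_normal(n)   # discrete signal
    g = rng.standard_normal(n)   # filter

    # y_i = sum_j x_j * g_{i-j}, with g taken as zero outside its support
    y = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if 0 <= i - j < n:
                y[i] += x[j] * g[i - j]

    # The same values appear as the first n entries of the full linear convolution.
    assert np.allclose(y, np.convolve(x, g, mode="full")[:n])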
In addition, a graph convolution filter on a graph with n nodes can be written with a shift operator S as follows:

    y = Σ_{k=0}^{K} w_k S^k x = H x,    (3)

where x ∈ R^n is a 1-dimensional graph signal, K is the maximum order of the polynomial, and w_k ∈ [−∞, ∞] is a coefficient. S is an n × n matrix whose (i, j)-th element is non-zero if and only if there is an edge from node i to j. Two representative choices of S are the adjacency and Laplacian matrices. Any linear and shift-invariant graph filter H can be written as Σ_{k=0}^{K} w_k S^k with a large enough value of K, which is called a matrix polynomial (Marques et al., 2020). We note that this graph filtering operation can be extended to d-dimensional cases as in Equation 1. Inspired by (Zou et al., 2022; Maskey et al., 2023), we rely on the singular value domain analysis to understand the low/high-pass filter characteristics of filters (cf. Figure 2). See more discussion in Appendix N.

In the context of self-attention within Transformers, the core part of the self-attention in Equation 1, i.e., ĀX, can be considered as a d-dimensional graph filter with Ā only, i.e., H = Ā. Our goal in this paper is to design a simple (for computational purposes) yet effective form of H.

[Figure 2 panels: (a) Filter response, (b) Cosine similarity across layer index, (c) Singular values, each comparing DeiT and DeiT + GFSA.]
Figure 2: Filter frequency response, cosine similarity, and singular values on ImageNet-1k for DeiT-S and DeiT-S + GFSA. Details and more visualizations are in Appendix D.

2.3 OVERSMOOTHING IN GCNS AND TRANSFORMERS

Oversmoothing is a phenomenon observed in deep learning models, especially in GCNs (Kipf & Welling, 2017; Veličković et al., 2018). As information is aggregated over multiple layers for multiple nodes (tokens), latent representations tend to become similar to each other, leading to a loss of distinctiveness in the representations (Oono & Suzuki, 2020; Zhou et al., 2020; Rusch et al., 2023). Surprisingly, an oversmoothing-like phenomenon is also observed in Transformers (Wang et al., 2022; Shi et al., 2022). Unlike CNNs, Transformers cannot benefit from simply deepening layers after a certain depth. Earlier studies hypothesize that this may be due to issues such as attention or feature collapse, or due to uniformity among patches or tokens (Zhou et al., 2021a; Gong et al., 2021; Yan et al., 2022). Dong et al. (2021) also point out that the output of a pure Transformer, i.e., an attention mechanism without skip connections or MLPs, tends to converge to a rank-1 matrix. This analysis is followed by Noci et al. (2022), who suggest that rank collapse incurs vanishing gradients of attention queries and keys. In this context, the self-attention acts as a low-pass filter, since it computes a weighted average of the value vectors of tokens. Wang et al. (2022, Theorem 1) also reveal that the self-attention is a low-pass filter, continuously reducing high-frequency information. This nature contributes to the oversmoothing phenomenon, as unique high-frequency features are lost in deeper layers of the network, further worsening the uniformity of token representations. Therefore, we extend the term "oversmoothing" to describe the degeneration challenge observed in Transformers.
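To make the low-pass behavior above concrete, the following self-contained PyTorch sketch builds a softmax attention matrix from random tokens and random projections (toy sizes and random weights are stand-ins, not a trained model), checks that it coincides with the degree-normalized adjacency D^{-1}A from Section 2.2, and shows that applying it repeatedly drives the average pairwise cosine similarity of token representations toward 1.

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    n, d = 16, 32                                    # tokens, hidden size (toy values)
    X = torch.randn(n, d)
    W_qry = torch.randn(d, d) / d ** 0.5             # scaled so scores stay in a sane range
    W_key = torch.randn(d, d) / d ** 0.5

    # Attention matrix as in Equation 1.
    scores = (X @ W_qry) @ (X @ W_key).T / d ** 0.5
    A_bar = F.softmax(scores, dim=-1)

    # Equivalently, the degree-normalized adjacency D^{-1} A with A = exp(scores).
    A = scores.exp()
    assert torch.allclose(A_bar, A / A.sum(dim=-1, keepdim=True), atol=1e-6)

    def mean_cosine_similarity(H):
        # Average pairwise cosine similarity between token representations.
        Hn = F.normalize(H, dim=-1)
        sim = Hn @ Hn.T
        return sim[~torch.eye(n, dtype=torch.bool)].mean().item()

    # Repeatedly applying the row-stochastic A_bar smooths the tokens (low-pass behavior).
    H = X.clone()
    for layer in range(6):
        H = A_bar @ H
        print(f"layer {layer + 1}: mean cosine similarity = {mean_cosine_similarity(H):.3f}")

The printed similarity rises toward 1 within a few applications, which is exactly the degeneration that the term "oversmoothing" describes here.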
There have been many empirical countermeasures proposed for Vision Transformers (ViTs), such as patch diversification (Zhou et al., 2021b; Gong et al., 2021), rank collapse alleviation (Zhou et al., 2021a; Zhang et al., 2021), and training stabilization (Touvron et al., 2021a; Zhai et al., 2023). Similar alleviating methods have also been proposed in the field of NLP, such as unsupervised learning (Chen et al., 2023), to resolve the oversmoothing and token uniformity (or information diffusion) problems (Dong et al., 2021; Yan et al., 2022). There are studies on utilizing high-frequency information via frequency domain analyses (Wang et al., 2022; Bai et al., 2022), but they are not designed on top of graph filtering perspectives. This paper addresses the oversmoothing problem with graph filters, since the self-attention mechanism is a basic graph filtering operation as shown in the previous subsection.

3 PROPOSED METHOD

3.1 GRAPH FILTER-BASED SELF-ATTENTION LAYERS

Let Ā ∈ [0, 1]^{n×n}, where n is the number of tokens in the input to the self-attention layer, be a self-attention matrix. Since Transformers use multi-head self-attention, there are multiple such matrices. For simplicity but without loss of generality, we discuss only one head in one layer. From the perspective of GSP, Ā corresponds to a special case where i) nodes are completely connected, and ii) Ā is normalized. Ā can be used as a shift operator. When K is large, however, it is expensive to calculate Ā^K. We therefore use the following element-wise first-order Taylor approximation:

    Ā^K ≈ Ā + (K − 1)(Ā² − Ā),    (4)

where Ā² − Ā is an approximated derivative at the position of K = 1 (see details in Appendix P). In the end, we propose to use the following graph filter, where the two lowest-order terms and one high-order term of the matrix polynomial are used:

    H := w_0 I + w_1 Ā + w_K (Ā + (K − 1)(Ā² − Ā)),    (5)

where w_0, w_1, w_K are coefficients and K is a hyperparameter. The coefficients can be learnable weights, and we learn them with gradient descent algorithms. Our proposed graph filter-based self-attention (GFSA) is defined with the graph filter H as follows:

    GFSA(X) = H X W_val.    (6)

We replace the original self-attention layer in various Transformers with the proposed graph filter-based layer without changing other parts. Therefore, GFSA can be plugged into any Transformer that relies on the self-attention.

3.2 THEORETICAL ANALYSIS

In this subsection, we provide a theorem that gives an upper bound on the error introduced by approximating the power of a matrix with a first-order Taylor expansion. The theorem specifically analyzes the error for the matrix Ā^K when approximated using a first-order Taylor expansion.

Theorem 3.1 (Error bound for first-order Taylor approximation). Define the error term as the difference between the exact value and the first-order Taylor approximation of Ā^K:

    E_K = ‖Ā^K − (Ā + (K − 1)(Ā² − Ā))‖,    (7)

where ‖·‖ denotes the L∞ norm. Then, the error E_K can be bounded as follows:

    E_K ≤ 2K.    (8)

The error bound provides an upper limit on the difference between the true value of Ā^K and its approximation using the given formula, when the row-sum of Ā is 1 due to the softmax. The proof of Theorem 3.1 is in Appendix A. Computing high-order matrix powers can be computationally intensive, especially for large matrices such as the self-attention in natural language processing, where the number of tokens (words) can be tens of thousands.
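As a small worked illustration of Equation 4 (on a hypothetical 2 × 2 row-stochastic matrix, not one taken from a trained model), take K = 3:

    \bar{A} = \begin{pmatrix} 0.9 & 0.1 \\ 0.1 & 0.9 \end{pmatrix}, \quad
    \bar{A}^2 = \begin{pmatrix} 0.82 & 0.18 \\ 0.18 & 0.82 \end{pmatrix}, \quad
    \bar{A}^3 = \begin{pmatrix} 0.756 & 0.244 \\ 0.244 & 0.756 \end{pmatrix},

    \bar{A} + (K-1)\,(\bar{A}^2 - \bar{A}) = \begin{pmatrix} 0.74 & 0.26 \\ 0.26 & 0.74 \end{pmatrix}, \quad
    E_3 = \left\| \bar{A}^3 - \begin{pmatrix} 0.74 & 0.26 \\ 0.26 & 0.74 \end{pmatrix} \right\| = 0.032 \le 2K = 6.

For this well-behaved matrix the approximation error is far below the worst-case bound of Theorem 3.1.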
The first-order Taylor approximation provides a simpler computation that can significantly reduce the required computational resources and time. Theorem 3.1 ensures that even with this simplification, the error is bounded and predictable.

3.3 DISCUSSION

This subsection discusses GFSA in terms of its comparison with other models and its mitigation of the oversmoothing problem.

Comparison to Transformers. In the field of computer vision, there has been recent research on adjusting the frequency response of ViTs. HAT (Bai et al., 2022) improves ViT by directly increasing the high-frequency components of images through adversarial training, i.e., modifying inputs. Wang et al. (2022) re-balance the low- and high-frequency components of the attention map and the feature map, respectively. In the NLP domain, Shi et al. (2022) redesign BERT with multiple fusions, inspired by JKNet (Xu et al., 2018).

Comparison to GCNs. Comparisons to GCNs that can be interpreted as graph filters (Kipf & Welling, 2017; Defferrard et al., 2016; Gasteiger et al., 2019) are inevitable. GFSA without a high-order term is analogous to ChebNet (Defferrard et al., 2016) with K = 1. In addition, GFSA reduces to the vanilla GCN (Kipf & Welling, 2017) when K = 1, w_0 = 0, w_1 = 1. GPR-GNN (Chien et al., 2021), which approximates graph convolutions using the monomial basis, is identical to GFSA if GFSA only considers terms up to order 1 and learns their coefficients. When we use only a high-order term and w_K is learned to be a negative value, GFSA can become similar to the reaction-diffusion layer of GREAD (Choi et al., 2023), ĀX + β(Ā − Ā²)X, depending on the higher-order terms.

How to alleviate the oversmoothing problem? The key to the behavior of our proposed filter is the coefficients {w_0, w_1, w_K} — note that in the self-attention of Transformers, w_0 = w_K = 0 and w_1 = 1. Since our method can learn any appropriate values for them for a downstream task, it can reduce to low-pass-only, high-pass-only, or combined filters. In particular, our filter reduces to GPR-GNN and GREAD, two popular methods for preventing oversmoothing in graph convolutional networks, under certain learned settings of {w_0, w_1, w_K}. We use Ā² − Ā to approximate Ā^K (see Equation 4). As mentioned earlier, the low/high-pass filter characteristics are decided by {w_0, w_1, w_K}. Therefore, the approximated Ā^K via (K − 1)(Ā² − Ā) and the coefficients {w_0, w_1, w_K} can alleviate the oversmoothing problem of the self-attention by constituting an appropriate graph filter for a downstream task.

4 EXPERIMENTS

In this section, we demonstrate the effectiveness of GFSA through a series of experiments. These experiments encompass various tasks: i) language understanding and causal language modeling, ii) image classification, iii) graph classification, iv) speech recognition, and v) code classification. We replace the self-attention of the base Transformers in those fields with our GFSA. Our modification adds only tens to hundreds of parameters, which are negligible in comparison with the original size of the base models. We summarize the runtime of base models vs. base models with GFSA in Appendix L.

4.1 EXPERIMENTS ON LANGUAGE MODELS

To investigate the efficacy of GFSA, we present experiments on natural language understanding and causal language modeling tasks in the subsections below.

4.1.1 NATURAL LANGUAGE UNDERSTANDING

Experimental Setting.
We integrate GFSA into three pre-trained language models: BERT, ALBERT, and RoBERTa. We evaluate them on the GLUE benchmark, which includes three categories of natural language understanding tasks: i) single-sentence tasks CoLA and SST-2; ii) similarity and paraphrasing tasks MRPC, QQP, and STS-B; iii) natural language inference tasks MNLI, QNLI, and RTE. For the MNLI task, we experiment on both the matched (MNLI-m) and mismatched (MNLI-mm) versions. Following Devlin et al. (2019), we report Matthews correlation for CoLA, F1 scores for QQP and MRPC, Spearman correlations for STS-B, and accuracy scores for the other tasks. For each task, we select the best hyperparameters for GFSA, and the other hyperparameters are fixed. We compare our GFSA with ContraNorm (Guo et al., 2023), one of the related methods that address oversmoothing. We finetune ContraNorm with the hyperparameters recommended in its original paper. We initialize with the pretrained language models and finetune with the added GFSA for 5 epochs. The detailed experimental settings are in Appendix E.1.

Results. The results are shown in Table 1. When GFSA is plugged into the backbones, the average performance scores improve across all models over the pure backbones. This indicates that GFSA is effective both in larger models like BERT and RoBERTa and in relatively smaller models like ALBERT. It is worth mentioning that for RoBERTa finetuned on the CoLA dataset, there is a large-margin increase from 60.34% to 64.11%, i.e., a 3.77% improvement with only 144 additional parameters. When compared to ContraNorm, GFSA shows a greater performance improvement on average. Figure 4 in Appendix D shows that these performance enhancements can be attributed to addressing the oversmoothing issue through the designed graph filter.

4.1.2 CAUSAL LANGUAGE MODELING

Experimental Setting. We also validate the effectiveness of GFSA on causal language modeling problems. We finetune GPT2 on the following three datasets: Penn Treebank (PTB) (Marcus et al., 1993), WikiText-2, and WikiText-103 (Merity et al., 2016). Following the evaluation method in (Yao et al., 2022), we finetune models for 15 epochs on PTB, 4 epochs on WikiText-103, and 10 epochs on WikiText-2, and report the perplexity for the sensitivity metric, i.e., higher perplexity means higher sensitivity. The detailed experimental settings are in Appendix F.1.

Table 1: Results comparison on the validation set of GLUE tasks. Avg denotes the average performance over all tasks.
Method | #Params | CoLA | SST-2 | MRPC | QQP | STS-B | MNLI-m/mm | QNLI | RTE | Avg
BERT_BASE (Devlin et al., 2019) | 110M | 56.79 | 93.81 | 88.70 | 88.32 | 88.16 | 84.96/84.15 | 91.63 | 66.06 | 82.51
BERT_BASE + ContraNorm | 110M | 59.89 | 93.92 | 89.88 | 88.51 | 88.36 | 85.11/84.50 | 91.84 | 69.31 | 83.48
BERT_BASE + GFSA | 110M | 59.56 | 94.15 | 90.60 | 88.46 | 88.33 | 85.12/85.06 | 91.95 | 68.95 | 83.58
ALBERT_BASE (Lan et al., 2019) | 11M | 57.86 | 92.32 | 91.80 | 85.30 | 90.37 | 85.37/84.37 | 91.76 | 76.90 | 84.01
ALBERT_BASE + ContraNorm | 11M | 57.45 | 93.00 | 92.83 | 87.78 | 90.55 | 85.06/84.57 | 92.28 | 78.70 | 84.69
ALBERT_BASE + GFSA | 11M | 60.21 | 93.23 | 92.47 | 87.79 | 90.63 | 85.29/84.92 | 92.17 | 78.70 | 85.05
RoBERTa_BASE (Liu et al., 2019) | 125M | 60.34 | 94.84 | 92.28 | 88.86 | 89.99 | 87.94/87.30 | 92.57 | 78.70 | 85.87
RoBERTa_BASE + ContraNorm | 125M | 63.06 | 95.41 | 93.17 | 88.91 | 90.34 | 87.88/87.40 | 92.82 | 80.51 | 86.61
RoBERTa_BASE + GFSA | 125M | 64.11 | 95.41 | 93.52 | 89.09 | 90.35 | 87.99/87.54 | 92.97 | 80.14 | 86.79

Table 2: Results comparison on GPT-2 finetuned with GFSA (perplexity, lower is better)

Method | #Params | PTB | WikiText-2 | WikiText-103 | Avg
GPT2 (Radford et al., 2019) | 117M | 19.513 | 20.966 | 15.939 | 18.806
GPT2 + GFSA | 117M | 19.450 | 20.923 | 15.919 | 18.764

Results. Table 2 shows the perplexity on PTB, WikiText-2, and WikiText-103. Across all datasets, GPT2 with GFSA consistently outperforms the vanilla GPT2. Our GFSA improves the average perplexity from 18.806 to 18.764. Note that the performance improvements are made with only 144 additional learnable parameters for 12 layers with 12 heads, which are negligible in comparison with the size of GPT2.

4.2 EXPERIMENTS ON VISION TRANSFORMERS

Experimental Setting. We aim to demonstrate the efficacy of our GFSA across a spectrum of ViT backbones with different depth settings and training modes. We choose DeiT (Touvron et al., 2021a), CaiT (Touvron et al., 2021b), and Swin (Liu et al., 2021) as backbones, and the models are trained from scratch. When training the 12-layer DeiT, we follow the same training recipe, hyperparameters, and data augmentation as (Touvron et al., 2021a). When training the 24-layer DeiT, we follow the settings of (Gong et al., 2021) and set the dropout rate to 0.2. For CaiT, we apply our proposed GFSA only to the patch embedding layer. For detailed experimental settings, see Appendix G.1.

Results. All experimental evaluations are summarized in Table 3. We compare various state-of-the-art models on the ImageNet-1k benchmark. We select them from three classes: CNN only, CNN + Transformer, and pure Transformer. In the Transformer category, we only test lighter models with comparable numbers of parameters, such as ViT-S and DeiT-S. The results show that the proposed GFSA successfully enhances DeiT, CaiT, and Swin across all depth settings and training methods. GFSA adds fewer than 72 additional parameters to the 12-layer DeiT while improving the top-1 accuracy by 1.63%. Compared to existing techniques, the improvement by GFSA already surpasses DeepViT-24B (0.6%) (Zhou et al., 2021a), LayerScale (0.7%) (Touvron et al., 2021b), LateInsertion (0.6%) (Touvron et al., 2021b), and HAT (1.38%) (Bai et al., 2022). To sum up, we observe that both shallow and deep ViTs benefit from GFSA: i) the filter response plot shows that GFSA can preserve higher-frequency representations (cf. Figure 2 (a)), and ii) Figure 2 (b) shows that GFSA can mitigate the increase in the cosine similarity of representations as the layers get deeper. We also show experiments with the same ContraNorm setting in Appendix G.4.

4.3 EXPERIMENTS ON AUTOMATIC SPEECH RECOGNITION

Experimental Setting.
We conduct automatic speech recognition (ASR) experiments on the LibriSpeech dataset (Panayotov et al., 2015) (http://www.openslr.org/12), which consists of audio recordings paired with their transcriptions. For implementation, we follow the recipes of SpeechBrain (Ravanelli et al., 2021) and describe the detailed settings in Appendix I.1. For our experiments, we use Branchformer (Peng et al., 2022), one of the state-of-the-art models for speech recognition, as well as a pure Transformer.

Table 3: Comparison with state-of-the-art models on the ImageNet-1k dataset. The number in (↑) indicates the performance improvement over the base model.

Category | Method | Input Size | #Layers | #Params | Top-1 Acc
CNN | ResNet-152 (He et al., 2016) | 224 | 152 | 230M | 78.1
CNN | DenseNet-201 (Huang et al., 2017) | 224 | 201 | 77M | 77.6
CNN + Transformer | CVT-21 (Wu et al., 2021) | 224 | 21 | 32M | 82.5
CNN + Transformer | Refiner (Zhou et al., 2021b) | 224 | 16 | 86M | 81.2
Transformer | ViT-S/16 (Dosovitskiy et al., 2021) | 224 | 12 | 49M | 78.1
Transformer | ViT-B/16 (Dosovitskiy et al., 2021) | 224 | 12 | 86M | 79.8
Transformer | DeiT-S (Touvron et al., 2021a) | 224 | 12 | 22M | 79.8
Transformer | DeiT-S Distilled (Touvron et al., 2021a) | 224 | 12 | 22M | 81.2
Transformer | DeiT-S + LayerScale (Touvron et al., 2021b) | 224 | 12 | 22M | 80.5
Transformer | DeiT-S + LateInsertion (Touvron et al., 2021b) | 224 | 12 | 22M | 80.5
Transformer | DeiT-S + ClassAttention (Touvron et al., 2021b) | 224 | 12 | 22M | 80.6
Transformer | DeiT-S + AttnScale (Wang et al., 2022) | 224 | 12 | 22M | 80.7
Transformer | DeiT-S + FeatScale (Wang et al., 2022) | 224 | 12 | 22M | 80.9
Transformer | DeiT-S + HAT (Bai et al., 2022) | 224 | 12 | 22M | 80.9
Transformer | DeiT-S + Diverse (Chen et al., 2022) | 224 | 12 | 22M | 80.6
Transformer | DeiT-S + ContraNorm (Guo et al., 2023) | 224 | 12 | 22M | 80.4
Transformer | Swin-S (Liu et al., 2021) | 224 | 12 | 50M | 82.9
Transformer | T2T-ViT-24 (Yuan et al., 2021) | 224 | 24 | 64M | 82.3
Transformer | DeepViT-24B (Zhou et al., 2021a) | 224 | 24 | 36M | 80.1
Transformer | DeiT-S (Touvron et al., 2021a) | 224 | 24 | 43M | 80.5
Transformer | CaiT-S (Touvron et al., 2021b) | 224 | 24 | 47M | 82.6
Transformer | DeiT-S + DiversePatch (Gong et al., 2021) | 224 | 24 | 44M | 82.2
Transformer | DeiT-S + LayerScale (Touvron et al., 2021b) | 224 | 12 | 22M | 82.4
Transformer | DeiT-S + AttnScale (Wang et al., 2022) | 224 | 24 | 44M | 81.1
Transformer | DeiT-S + FeatScale (Wang et al., 2022) | 224 | 24 | 44M | 81.3
Transformer | DeiT-S + ContraNorm (Guo et al., 2023) | 224 | 24 | 43M | 80.7
GFSA | DeiT-S + GFSA | 224 | 12 | 43M | 81.1 (↑ 1.3)
GFSA | DeiT-S + GFSA | 224 | 24 | 43M | 81.5 (↑ 1.0)
GFSA | CaiT-S + GFSA | 224 | 24 | 43M | 82.8 (↑ 0.2)
GFSA | Swin-S + GFSA | 224 | 12 | 50M | 83.0 (↑ 0.1)

Table 4: Results for ASR training on LibriSpeech 100h and 960h with GFSA (WER, lower is better)

Method | #Params | 100h test-clean WER | 100h test-other WER | 960h test-clean WER | 960h test-other WER
Transformer | 71.5M | 11.02 | 25.42 | 2.42 | 5.50
Transformer + GFSA | 71.5M | 10.30 | 24.30 | 2.31 | 5.49
Branchformer (Peng et al., 2022) | 109.8M | 9.63 | 22.43 | 2.13 | 5.00
Branchformer + GFSA | 109.8M | 9.60 | 22.25 | 2.11 | 4.94

Results. Table 4 compares word error rates (WERs) on LibriSpeech 100h and 960h. For LibriSpeech 100h, Transformer+GFSA achieves 10.30/24.30 on the test-clean/other sets, where the test-clean WER is a 6.53% improvement over the Transformer model. For LibriSpeech 960h, Transformer+GFSA shows a test-clean WER of 2.31, a 4.55% improvement over the Transformer. Figure 7 in Appendix I.2 depicts the learning curves of train loss and valid loss when using GFSA, showing the effectiveness of our proposed filter.
Table 5: Experimental results and number of parameters on ZINC

Method | #Params | MAE
Graphormer | 500K | 0.1240
Graphormer + GFSA | 500K | 0.1189

Table 6: Experimental results and number of parameters on PCQM4M and PCQM4Mv2

Method | #Params | PCQM4M Train | PCQM4M Validate | PCQM4Mv2 Train | PCQM4Mv2 Validate
Graphormer | 48.3M | 0.0535 | 0.1286 | 0.0250 | 0.0862
Graphormer + GFSA | 48.3M | 0.0312 | 0.1193 | 0.0249 | 0.0860

4.4 EXPERIMENTS ON GRAPH TRANSFORMERS

Experimental Setting. To confirm the efficacy of GFSA on graph Transformers, we conduct experiments on a commonly used graph-level prediction dataset (i.e., ZINC) from the benchmarking-GNN leaderboards (Dwivedi et al., 2023) and the recent OGB-LSC quantum chemistry regression challenge (i.e., PCQM4M-LSC) (Hu et al., 2021), which is currently the biggest graph-level prediction dataset and contains more than 3.8M graphs in total. We choose Graphormer (Ying et al., 2021) as the backbone architecture and use its framework (https://github.com/microsoft/Graphormer) for implementation. Mean absolute errors (MAEs) are reported for both regression tasks. For detailed experimental settings, see Appendix H.1.

Results. Tables 5 and 6 show that Graphormer with GFSA consistently outperforms the vanilla model across all datasets. As can be seen from the train MAE values in Table 6, using GFSA allows the model to learn more effectively during the training process. Notably, plugging in GFSA improves the performance by a large margin, e.g., a 7.20% relative decline in validation MAE on PCQM4M. Due to space constraints, results with standard deviations are included in Tables 14 and 15 in Appendix H.2.

4.5 EXPERIMENTS ON CODE CLASSIFICATION

Experimental Setting. We conduct a code defect detection task based on the Devign dataset provided by Zhou et al. (2019). We use RoBERTa (Liu et al., 2019), CodeBERT (Feng et al., 2020), PLBART (Ahmad et al., 2021), and CodeT5 (Wang et al., 2021b) as our backbone models. The detailed settings are in Appendix J.1.

Table 7: Code defect accuracy for models with GFSA plugged in. The number in (↑) indicates the improvement rate (%) over the base model.

Method | Accuracy
RoBERTa (Liu et al., 2019) | 62.88
RoBERTa + GFSA | 64.39 (↑ 2.40%)
CodeBERT (Feng et al., 2020) | 64.31
CodeBERT + GFSA | 64.49 (↑ 0.12%)
PLBART (Ahmad et al., 2021) | 62.63
PLBART + GFSA | 62.96 (↑ 0.52%)
CodeT5-small (Wang et al., 2021b) | 63.25
CodeT5-small + GFSA | 63.69 (↑ 0.70%)
CodeT5-base (Wang et al., 2021b) | 63.51
CodeT5-base + GFSA | 64.75 (↑ 1.95%)

Results. Table 7 shows the accuracy of all models; the models with GFSA perform better than the base models. The biggest improvement is 2.40% for RoBERTa. In the case of CodeT5-base, using GFSA shows an accuracy of 64.75, an improvement of 1.95% from 63.51. CodeT5-small+GFSA has only about 100 additional parameters compared to CodeT5-small with 60M parameters and, even more impressively, it surpasses the accuracy of CodeT5-base. In Appendix J.2, we include case studies for this task. We also report the results of the code clone detection task in Appendix K. In Table 17, CodeT5-small+GFSA outperforms CodeT5-small with an F1 of 94.36.

4.6 RUNTIME OVERHEADS

Our GFSA does not significantly increase training time. We summarize the runtime of base models vs. base models with GFSA in Appendix L. On the GLUE tasks in natural language understanding, BERT with GFSA improves the average performance from 82.51% to 83.58% (see Table 1) with an overhead of less than 3 minutes in total training time. Considering the performance improvements, we think the increases in training time are negligible.
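A rough way to gauge this overhead in isolation is the following micro-benchmark sketch. It times plain attention aggregation against the GFSA filter of Equation 5 on random matrices; the token count, head dimension, and coefficient values are illustrative assumptions, and this is not the measurement protocol of Appendix L.

    import time
    import torch

    def bench(fn, iters=100):
        # Simple wall-clock timing helper.
        start = time.perf_counter()
        for _ in range(iters):
            fn()
        return (time.perf_counter() - start) / iters

    n, d, K = 197, 64, 3                        # e.g., a ViT-S-like token count and head dimension
    att = torch.softmax(torch.randn(n, n), dim=-1)
    V = torch.randn(n, d)
    w0, w1, wK = 0.0, 1.0, 0.1                  # illustrative coefficients
    I = torch.eye(n)

    plain = lambda: att @ V
    gfsa = lambda: (w0 * I + w1 * att + wK * (att + (K - 1) * (att @ att - att))) @ V

    print(f"plain attention: {bench(plain) * 1e6:.1f} us/iter")
    print(f"GFSA attention:  {bench(gfsa) * 1e6:.1f} us/iter")

The extra cost is dominated by the single additional n × n matrix product for Ā², which is the trade-off discussed above.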
5 FINAL REMARKS

Our proposed graph filter-based self-attention (GFSA) mechanism achieves high performance with improvements on a variety of tasks in computer vision, natural language processing, graph pattern classification, speech recognition, and code classification. GFSA is a simple yet effective method that enriches the self-attention in Transformers with more diverse frequency information. This enables GFSA to address the oversmoothing problem and learn better latent representations for downstream tasks. At the same time, GFSA does not bring significant overheads to those Transformers' empirical runtime complexity. One could use more complicated graph filters to enhance accuracy further, but our goal is to find a balance between accuracy enhancements and overheads in runtime complexity. We believe that GFSA is a promising new direction for improving Transformers. It is simple to implement and can be used in conjunction with other techniques to further improve the performance of Transformers. We hope that our work will inspire more research on graph filter-based self-attention and its applications.

ETHICAL STATEMENTS

In terms of the broader impact of this research on society, we do not foresee strongly negative consequences. However, this paper may have implications for the carbon footprint and accessibility of learning algorithms. The computation required for machine learning research is rapidly growing, resulting in a larger carbon footprint (Schwartz et al., 2020). Our study improves performance while increasing runtime only very slightly. Nevertheless, in future research, it will also be important to study and improve our GFSA by taking carbon footprints into account. GFSA improves the performance of existing Transformer-based models, which can have many positive impacts on society through services that utilize natural language processing, computer vision, and speech recognition. However, it will also be important to improve GFSA by considering other dimensions of AI, such as robustness to adversarial examples, fairness, and explainability.

REPRODUCIBILITY STATEMENT

To ensure the reproducibility and completeness of this paper, we include an Appendix with 17 sections. Appendix A provides the complete proof of Theorem 3.1. Appendix B contains the empirical study on the error related to Theorem 3.1. Appendix C provides PyTorch-style pseudocode for our GFSA method, which helps to implement GFSA in any Transformer that uses a pure self-attention. All experiments in the paper are reproducible with the additional implementation details provided in Appendices E to K.

REFERENCES

Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. Unified pre-training for program understanding and generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2655–2668, 2021.

Ameen Ali, Tomer Galanti, and Lior Wolf. Centered self-attention layers. arXiv preprint arXiv:2306.01610, 2023.

Jiawang Bai, Li Yuan, Shu-Tao Xia, Shuicheng Yan, Zhifeng Li, and Wei Liu. Improving vision transformers by revisiting high-frequency components. In European Conference on Computer Vision, pp. 1–18. Springer, 2022.

Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. The fifth pascal recognizing textual entailment challenge.
TAC, 7:8, 2009. 10 Graph Convolutions Enrich the Self-Attention in Transformers! Edward De Brouwer, Jaak Simm, Adam Arany, and Yves Moreau. GRU-ODE-Bayes: Continuous modeling of sporadically-observed time series. In Advances in Neural Information Processing Systems (NeurIPS), 2019. Chen Cai and Yusu Wang. A note on over-smoothing for graph neural networks. arXiv preprint arXiv:2006.13318, 2020. Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055, 2017. Nuo Chen, Linjun Shou, Ming Gong, Jian Pei, Bowen Cao, Jianhui Chang, Daxin Jiang, and Jia Li. Alleviating over-smoothing for unsupervised sentence representation. arXiv preprint arXiv:2305.06154, 2023. Tianlong Chen, Zhenyu Zhang, Yu Cheng, Ahmed Awadallah, and Zhangyang Wang. The principle of diversity: Training stronger vision transformers calls for reducing all levels of redundancy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12020–12030, 2022. Zihan Chen, Hongbo Zhang, Xiaoji Zhang, and Leqi Zhao. Quora question pairs, 2018. Eli Chien, Jianhao Peng, Pan Li, and Olgica Milenkovic. Adaptive universal generalized PageRank graph neural network. In Proceedings of the International Conference on Learning Representations (ICLR), 2021. Jeongwhan Choi, Seoyoung Hong, Noseong Park, and Sung-Bae Cho. Gread: Graph neural reaction-diffusion networks. In International Conference on Machine Learning (ICML), pp. 5722–5747. PMLR, 2023. Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems (NeurIPS), 2016. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https: //aclanthology.org/N19-1423. Bill Dolan and Chris Brockett. Automatically constructing a corpus of sentential paraphrases. In Third International Workshop on Paraphrasing (IWP2005), 2005. Yihe Dong, Jean-Baptiste Cordonnier, and Andreas Loukas. Attention is not all you need: Pure attention loses rank doubly exponentially with depth. In International Conference on Machine Learning (ICML), pp. 2793–2803. PMLR, 2021. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations (ICLR), 2021. Vijay Prakash Dwivedi, Chaitanya K Joshi, Anh Tuan Luu, Thomas Laurent, Yoshua Bengio, and Xavier Bresson. Benchmarking graph neural networks. Journal of Machine Learning Research, 24(43):1–48, 2023. Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. CodeBERT: A pre-trained model for programming and natural languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 1536–1547, 2020. 
Johannes Gasteiger, Stefan Weißenberger, and Stephan Günnemann. Diffusion improves graph learning. In Advances in Neural Information Processing Systems (NeurIPS), volume 32, 2019. 11 Graph Convolutions Enrich the Self-Attention in Transformers! Chengyue Gong, Dilin Wang, Meng Li, Vikas Chandra, and Qiang Liu. Vision transformers with patch diversification. arXiv preprint arXiv:2104.12753, 2021. Alex Graves and Navdeep Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the 31st International Conference on Machine Learning (ICML), volume 32, pp. 1764–1772. PMLR, 22–24 Jun 2014. Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, et al. Conformer: Convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100, 2020. Xiaojun Guo, Yifei Wang, Tianqi Du, and Yisen Wang. ContraNorm: A contrastive learning perspective on oversmoothing and beyond. In Proceedings of the International Conference on Learning Representations (ICLR), 2023. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016. Weihua Hu, Matthias Fey, Hongyu Ren, Maho Nakata, Yuxiao Dong, and Jure Leskovec. OGBLSC: A large-scale challenge for machine learning on graphs. arXiv preprint arXiv:2103.09430, 2021. Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708, 2017. John J Irwin, Teague Sterling, Michael M Mysinger, Erin S Bolstad, and Ryan G Coleman. Zinc: a free tool to discover chemistry for biology. Journal of chemical information and modeling, 52 (7):1757–1768, 2012. Nicolas Keriven. Not too little, not too much: a theoretical analysis of graph (over) smoothing. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, pp. 2268–2281, 2022. Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In Proceedings of the International Conference on Learning Representations (ICLR), 2017. Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019. Siddique Latif, Aun Zaidi, Heriberto Cuayahuitl, Fahad Shamshad, Moazzam Shoukat, and Junaid Qadir. Transformers in speech processing: A survey. arXiv preprint arXiv:2303.11607, 2023. Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. BART: Denoising sequence-to-sequence pretraining for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7871–7880, 2020. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019. Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 10012–10022, 2021. 
Ilya Loshchilov and Frank Hutter. arXiv:1711.05101, 2017. Decoupled weight decay regularization. arXiv preprint Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of english: The penn treebank. 1993. 12 Graph Convolutions Enrich the Self-Attention in Transformers! Antonio G. Marques, Santiago Segarra, and Gonzalo Mateos. Signal processing on directed graphs: The role of edge directionality when processing and learning from network data. IEEE Signal Processing Magazine, 37(6):99–116, 2020. Sohir Maskey, Raffaele Paolino, Aras Bacho, and Gitta Kutyniok. A fractional graph laplacian approach to oversmoothing. In Advances in Neural Information Processing Systems (NeurIPS), 2023. Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016. Lorenzo Noci, Sotiris Anagnostidis, Luca Biggio, Antonio Orvieto, Sidak Pal Singh, and Aurelien Lucchi. Signal propagation in transformers: Theoretical perspectives and the role of rank collapse. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, pp. 27198–27211, 2022. Kenta Oono and Taiji Suzuki. Graph neural networks exponentially lose expressive power for node classification. In Proceedings of the International Conference on Learning Representations (ICLR), 2020. Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 5206–5210. IEEE, 2015. Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le. Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech 2019, 2019. Yifan Peng, Siddharth Dalmia, Ian Lane, and Shinji Watanabe. Branchformer: Parallel mlp-attention architectures to capture local and global context for speech recognition and understanding. In International Conference on Machine Learning (ICML), pp. 17627–17643. PMLR, 2022. Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020. Ladislav Rampášek, Michael Galkin, Vijay Prakash Dwivedi, Anh Tuan Luu, Guy Wolf, and Dominique Beaini. Recipe for a general, powerful, scalable graph transformer. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, pp. 14501–14515, 2022. Mirco Ravanelli, Titouan Parcollet, Peter Plantinga, Aku Rouhe, Samuele Cornell, Loren Lugosch, Cem Subakan, Nauman Dawalatabad, Abdelwahab Heba, Jianyuan Zhong, Ju-Chieh Chou, SungLin Yeh, Szu-Wei Fu, Chien-Feng Liao, Elena Rastorgueva, François Grondin, William Aris, Hwidong Na, Yan Gao, Renato De Mori, and Yoshua Bengio. SpeechBrain: A general-purpose speech toolkit, 2021. arXiv:2106.04624. T. Konstantin Rusch, Michael M. Bronstein, and Siddhartha Mishra. A survey on oversmoothing in graph neural networks. arXiv preprint arXiv: Arxiv-2303.10993, 2023. Aliaksei Sandryhaila and José MF Moura. 
Discrete signal processing on graphs. IEEE transactions on signal processing, 61(7):1644–1656, 2013. Aliaksei Sandryhaila and Jose MF Moura. Discrete signal processing on graphs: Frequency analysis. IEEE Transactions on Signal Processing, 62(12):3042–3054, 2014. Roy Schwartz, Jesse Dodge, Noah A Smith, and Oren Etzioni. Green AI. Communications of the ACM, 63(12):54–63, 2020. 13 Graph Convolutions Enrich the Self-Attention in Transformers! Han Shi, Jiahui Gao, Hang Xu, Xiaodan Liang, Zhenguo Li, Lingpeng Kong, Stephen Lee, and James T Kwok. Revisiting over-smoothing in bert from the perspective of graph. In Proceedings of the International Conference on Learning Representations (ICLR), 2022. Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631–1642, 2013. Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning (ICML), pp. 10347–10357. PMLR, 2021a. Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 32–42, 2021b. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems (NeurIPS), volume 30, 2017. Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph Attention Networks. In Proceedings of the International Conference on Learning Representations (ICLR), 2018. Guangtao Wang, Rex Ying, Jing Huang, and Jure Leskovec. Multi-hop attention graph neural network. In IJCAI, 2021a. Peihao Wang, Wenqing Zheng, Tianlong Chen, and Zhangyang Wang. Anti-oversmoothing in deep vision transformers via the fourier domain analysis: From theory to practice. In Proceedings of the International Conference on Learning Representations (ICLR), 2022. Wei Wang, Ming Yan, and Chen Wu. Multi-granularity hierarchical attention fusion networks for reading comprehension and question answering. arXiv preprint arXiv:1811.11934, 2018. Wenhan Wang, Ge Li, Bo Ma, Xin Xia, and Zhi Jin. Detecting code clones with graph neural network and flow-augmented abstract syntax tree. In 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 261–271, 2020. doi: 10.1109/ SANER48275.2020.9054857. Yue Wang, Weishi Wang, Shafiq Joty, and Steven CH Hoi. CodeT5: Identifier-aware unified pretrained encoder-decoder models for code understanding and generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 8696–8708, 2021b. Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625–641, 2019. Ross Wightman. Pytorch image models. pytorch-image-models, 2019. https://github.com/rwightman/ Adina Williams, Nikita Nangia, and Samuel R Bowman. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426, 2017. 
Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, and Lei Zhang. CvT: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 22–31, 2021. Xinyi Wu, Amir Ajorlou, Zihui Wu, and Ali Jadbabaie. Demystifying oversmoothing in attentionbased graph neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2023a. Xinyi Wu, Zhengdao Chen, William Wei Wang, and Ali Jadbabaie. A non-asymptotic analysis of oversmoothing in graph neural networks. In Proceedings of the International Conference on Learning Representations (ICLR), 2023b. 14 Graph Convolutions Enrich the Self-Attention in Transformers! Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. Representation learning on graphs with jumping knowledge networks. In International Conference on Machine Learning (ICML), pp. 5453–5462, 2018. Hanqi Yan, Lin Gui, Wenjie Li, and Yulan He. Addressing token uniformity in transformers via singular value transformation. In Uncertainty in Artificial Intelligence, pp. 2181–2191. PMLR, 2022. Zhewei Yao, Xiaoxia Wu, Conglong Li, Connor Holmes, Minjia Zhang, Cheng Li, and Yuxiong He. Random-ltd: Random and layerwise token dropping brings efficient training for large-scale transformers. arXiv preprint arXiv:2211.11586, 2022. Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, and Tie-Yan Liu. Do transformers really perform badly for graph representation? In Advances in Neural Information Processing Systems (NeurIPS), volume 34, pp. 28877–28888, 2021. Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zi-Hang Jiang, Francis EH Tay, Jiashi Feng, and Shuicheng Yan. Tokens-to-token vit: Training vision transformers from scratch on imagenet. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 558–567, 2021. Shuangfei Zhai, Tatiana Likhomanenko, Etai Littwin, Dan Busbridge, Jason Ramapuram, Yizhe Zhang, Jiatao Gu, and Joshua M. Susskind. Stabilizing transformer training by preventing attention entropy collapse. In Proceedings of the 40th International Conference on Machine Learning (ICML), volume 202, pp. 40770–40803. PMLR, 2023. Aston Zhang, Alvin Chan, Yi Tay, Jie Fu, Shuohang Wang, Shuai Zhang, Huajie Shao, Shuochao Yao, and Roy Ka-Wei Lee. On orthogonality constraints for transformers. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp. 375–382, 2021. Ying Zhang, Mohammad Pezeshki, Philémon Brakel, Saizheng Zhang, César Laurent, Yoshua Bengio, and Aaron Courville. Towards end-to-end speech recognition with deep convolutional neural networks. Interspeech 2016, 2016. Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Zihang Jiang, Qibin Hou, and Jiashi Feng. DeepViT: Towards deeper vision transformer. arXiv preprint arXiv:2103.11886, 2021a. Daquan Zhou, Yujun Shi, Bingyi Kang, Weihao Yu, Zihang Jiang, Yuan Li, Xiaojie Jin, Qibin Hou, and Jiashi Feng. Refiner: Refining self-attention for vision transformers. arXiv preprint arXiv:2106.03714, 2021b. Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. Graph neural networks: A review of methods and applications. AI open, 1:57–81, 2020. Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, and Yang Liu. 
Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. In Advances in Neural Information Processing Systems (NeurIPS), volume 32, 2019.

Chunya Zou, Andi Han, Lequan Lin, and Junbin Gao. A simple yet effective svd-gcn for directed graphs. arXiv preprint arXiv:2205.09335, 2022.

Appendix

Table of Contents

A Proof of Theorem 3.1
B Empirical Error Bound
C Implementation of GFSA
D Oversmoothing and Additional Visualizations
E Natural Language Understanding: E.1 Detailed Experimental Settings, E.2 Sensitivity to K
F Causal Language Modeling: F.1 Detailed Experimental Settings, F.2 Sensitivity to K
G Image Classification: G.1 Detailed Experimental Settings, G.2 FLOPs & Throughput, G.3 Sensitivity to K, G.4 Additional Experiments with Guo et al. (2023)'s setting
H Graph Classification: H.1 Detailed Experimental Settings, H.2 Experimental results with standard deviation
I Automatic Speech Recognition: I.1 Detailed Experimental Settings, I.2 Training Curve
J Code Defect Detection: J.1 Detailed Experimental Settings, J.2 Case Study
K Code Clone Detection: K.1 Detailed Experimental Settings, K.2 Experiment Result
L Time Complexity and Empirical Runtime Analysis
M Inference Time
N Frequency Analyses in the Singular Value Domain
O Matrix Polynomial vs. Graph Fourier Transform
P Approximation of Ā^K in Equation 4
Q GFSA in Selected Layers: A Strategy for Model Efficiency

A PROOF OF THEOREM 3.1

Proof. We begin by observing that the L∞ norm of Ā² is bounded by:

    ‖Ā²‖ ≤ ‖Ā‖².    (9)

By the induction principle, it follows that for any positive integer k,

    ‖Ā^k‖ ≤ ‖Ā‖^k.    (10)

Given that Ā is a matrix normalized with softmax: i) all the elements of Ā lie within [0, 1], and ii) the sum of the elements in each row of Ā is equal to 1 (potentially less than 1 if an attention mask is used). We can then calculate the upper bound of the L∞ norm of Ā as follows:

    ‖Ā‖ = max_i Σ_j |Ā_{i,j}| ≤ max_i 1 = 1.    (11)

Now, considering the error term E_K as given by Theorem 3.1 and applying the triangle inequality for matrix norms:

    E_K = ‖Ā^K − (Ā + (K − 1)(Ā² − Ā))‖    (12)
        ≤ ‖Ā^K‖ + ‖Ā‖ + (K − 1)(‖Ā²‖ + ‖Ā‖).    (13)

From the prior expressions, we deduce that ‖Ā^K‖ ≤ ‖Ā‖^K ≤ 1^K = 1. Thus, the upper bound of the error term E_K becomes:

    E_K ≤ 1 + 1 + (K − 1)(1 + 1) = 2K.    (14)
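A minimal numerical check of this bound is sketched below, assuming a random softmax-normalized Ā rather than attention taken from a trained model:

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    n = 64
    A_bar = F.softmax(torch.randn(n, n), dim=-1)         # row-stochastic attention matrix

    def linf_norm(M):
        # Matrix L-infinity norm: maximum absolute row sum.
        return M.abs().sum(dim=-1).max().item()

    for K in (2, 4, 6, 8, 10):
        exact = torch.matrix_power(A_bar, K)
        approx = A_bar + (K - 1) * (A_bar @ A_bar - A_bar)   # Equation 4
        E_K = linf_norm(exact - approx)
        assert E_K <= 2 * K                                  # Theorem 3.1
        print(f"K={K:2d}  E_K={E_K:.4f}  bound={2 * K}")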
B EMPIRICAL ERROR BOUND

In this section, we empirically compare the theoretical error bound in Theorem 3.1 with the actual errors. Figure 3 shows the theoretical upper bound of E_K and the actual E_K for the attention of the last layer in a trained pure backbone. DeiT-S and Graphormer are trained on ImageNet-1k and ZINC, respectively, and BERT is finetuned on STS-B. For all models, the actual E_K is always smaller than the theoretical error bound.

Figure 3: Theoretical and actual values of E_K for (a) DeiT-S, (b) BERT, and (c) Graphormer.

C IMPLEMENTATION OF GFSA

The pseudocode of our GFSA is shown in Algorithm 1. For implementation, w_0 and w_1 can optionally be set as hyperparameters.

Algorithm 1 PyTorch-style pseudocode for GFSA

# h: number of attention heads, n: number of tokens
w_0 = torch.zeros(h)
w_1 = torch.ones(h)
w_K = torch.zeros(h)
I = torch.eye(n)[None, None, ...]

def GFSA(att, K):
    # att: original self-attention matrix of shape (batch, heads, n, n)
    # att_K: approximated high-order term, Ā^K ≈ Ā + (K-1)(Ā² − Ā)
    att_K = att + (K - 1) * (torch.matmul(att, att) - att)
    # gf_att: GFSA attention, a per-head weighted sum of I, Ā, and the high-order term
    gf_att = (w_0[None, :, None, None] * I
              + w_1[None, :, None, None] * att
              + w_K[None, :, None, None] * att_K)
    return gf_att

D OVERSMOOTHING AND ADDITIONAL VISUALIZATIONS

In Figure 2, we show visualizations of the oversmoothing characteristics of DeiT. We also provide visualizations in other domains: Figure 4 shows the filter response, cosine similarity, and singular values of BERT finetuned on the STS-B dataset of the GLUE tasks, and Figure 5 shows the same for Graphormer finetuned on ZINC.

To characterize self-attention, we first analyze the filter response of self-attention in the frequency domain. We follow the method used by Wang et al. (2022) for spectral visualization of the self-attention matrix. As shown in Figure 2 (a), DeiT has a near-zero magnitude for the high frequencies, which is characteristic of a low-pass filter and is likely to result in oversmoothing when applied multiple times. We follow the calculation method of Guo et al. (2023) for cosine similarity. As shown in Figure 2 (b), the increase in similarity as the layers of the model get deeper is related to the oversmoothing problem. To further analyze this issue, we also consider dimensionality collapse in Transformer-based models and plot the singular value distribution of the features in the last block. As shown in Figure 2 (c), insignificant, near-zero values dominate the feature distribution. As the layers get deeper, the similarity of features increases and dimensional collapse occurs. The same oversmoothing problem appears in BERT and Graphormer, as shown in Figures 4 and 5.

Figure 4: Filter frequency response, cosine similarity, and singular values on STS-B for BERT and BERT+GFSA.
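As a rough illustration of how these diagnostics can be computed (our own sketch, not the authors' visualization code; it assumes a single matrix of token features taken from one block, with random features standing in for real activations), the mean pairwise cosine similarity and the normalized singular values can be obtained as follows:

import torch

def oversmoothing_diagnostics(X):
    # X: token features from one Transformer block, shape (n_tokens, d)
    Xn = torch.nn.functional.normalize(X, dim=-1)
    cos = Xn @ Xn.T                                   # pairwise cosine similarities
    n = X.shape[0]
    mean_cos = (cos.sum() - n) / (n * (n - 1))        # average over off-diagonal pairs
    sv = torch.linalg.svdvals(X)                      # singular values, in descending order
    return mean_cos.item(), (sv / sv.max()).tolist()  # normalized singular values

# Example with DeiT-S-like shapes (197 tokens, 384-dim features) as a stand-in
mean_cos, normalized_sv = oversmoothing_diagnostics(torch.randn(197, 384))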
Figure 5: Filter frequency response, cosine similarity, and singular values on ZINC for Graphormer and Graphormer+GFSA.

E NATURAL LANGUAGE UNDERSTANDING

E.1 DETAILED EXPERIMENTAL SETTINGS

Dataset. The benchmark datasets we used are listed below.

CoLA The Corpus of Linguistic Acceptability (Warstadt et al., 2019) consists of English acceptability judgments drawn from books and journal articles. The target task is binary classification: each sentence is judged to be grammatically acceptable or not.

SST-2 The Stanford Sentiment Treebank (Socher et al., 2013) is a dataset in which each sentence is sourced from movie reviews and accompanied by human annotations of its sentiment. The target task is binary sentiment classification for a single sentence.

MRPC The Microsoft Research Paraphrase Corpus (Dolan & Brockett, 2005) is a corpus of sentence pairs automatically extracted from online news sources and annotated by humans. The target is to determine whether the sentences in each pair are semantically equivalent.

QQP The Quora Question Pairs (Chen et al., 2018) dataset is a collection of question pairs from the community question-answering website Quora. The target is to determine whether the questions in each pair are semantically equivalent.

STS-B The Semantic Textual Similarity Benchmark (Cer et al., 2017) is a collection of sentence pairs drawn from news headlines, video and image captions, and natural language inference data with human annotations. The target is a regression task to predict a similarity score from 0 to 5.

MNLI The Multi-Genre Natural Language Inference Corpus (Williams et al., 2017) is a crowdsourced collection of sentence pairs with textual entailment annotations. Given a premise sentence and a hypothesis sentence, the task is to predict whether the premise entails the hypothesis (entailment), contradicts it (contradiction), or neither (neutral). The standard test set consists of private labels from the authors and evaluates both the matched (in-domain) and mismatched (cross-domain) sections.

QNLI The Stanford Question Answering (Wang et al., 2018) dataset is a question-answering dataset consisting of question-paragraph pairs, where one of the sentences in the paragraph contains the answer to the corresponding question written by an annotator. The task is to determine whether the context sentence contains the answer to the question.

RTE The Recognizing Textual Entailment (Bentivogli et al., 2009) dataset comes from a series of annual textual entailment challenges. The target task is binary entailment classification.

BERT. BERT (Devlin et al., 2019) consists of 12 layers, 12 heads, a hidden size of 768, a maximum sequence length of 512, and an MLP dimension of 3072.
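For reference, the following sketch (ours; it relies only on the public HuggingFace datasets/transformers APIs rather than on the authors' training scripts, and uses CoLA as an example task) shows how a GLUE task and a BERT-base checkpoint with the above configuration can be loaded:

from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load one GLUE task (CoLA) and a 12-layer, 12-head, 768-dimensional BERT-base model.
dataset = load_dataset("glue", "cola")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

encoded = dataset["train"].map(lambda ex: tokenizer(ex["sentence"], truncation=True), batched=True)
print(model.config.num_hidden_layers, model.config.num_attention_heads, model.config.hidden_size)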
Figure 6: Sensitivity results for various K with BERT_BASE finetuned on GLUE tasks: (a) CoLA, (b) SST-2, (c) MRPC, (d) QQP, (e) STS-B, (f) MNLI-m, (g) MNLI-mm, (h) QNLI, and (i) RTE.

ALBERT. ALBERT (Lan et al., 2019) consists of 12 layers, 12 heads, a hidden size of 768, a maximum sequence length of 512, an embedding dimension of 128, and an MLP dimension of 3072.

RoBERTa. RoBERTa (Liu et al., 2019) consists of 12 layers, 12 heads, a hidden size of 768, a maximum sequence length of 514, and an MLP dimension of 3072.

Training. For implementation, we adopt the HuggingFace framework. We train all models for 5 epochs with a batch size of 32. Linear learning rate decay is used, and the initial learning rate is set to 2 × 10−5. We use the AdamW (Loshchilov & Hutter, 2017) optimizer, and weight decay is set to 0. All models are trained on a single NVIDIA RTX A5000 24GB GPU.

E.2 SENSITIVITY TO K

In this section, we explore the influence of the polynomial order K in our GFSA, conducting experiments on BERT_BASE finetuned on the GLUE tasks. We search for values of K from 2 to 10, and the results are presented in Figure 6. An optimal K exists for each dataset, and the performance of models using GFSA is generally robust to changes in K.

F CAUSAL LANGUAGE MODELING

F.1 DETAILED EXPERIMENTAL SETTINGS

Dataset. The benchmark datasets we used are listed below.

PTB The Penn Treebank (Marcus et al., 1993) dataset is a collection of text documents that have been extensively annotated with linguistic information, primarily syntactic and grammatical structures.

WikiText The WikiText (Merity et al., 2016) dataset is a collection of over 100 million tokens extracted from the set of verified good and featured articles on Wikipedia. Compared to the preprocessed version of the Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger.

GPT2. GPT-2 (Radford et al., 2019) is a Transformer pretrained on a very large corpus of English data in a self-supervised fashion without any human labeling. Inputs and labels are generated automatically from the raw text, and the model is trained to guess the next word in a sentence. For implementation, we adopt the HuggingFace framework (https://github.com/huggingface/transformers). For all experiments, GPT-2 has 12 layers with 12 attention heads, a hidden size of 768, and a maximum sequence length of 1024, resulting in a total of 117 million parameters.

Training. We finetune GPT-2 with a batch size of 4, a learning rate of 5 × 10−5, and linear learning rate decay using the AdamW (Loshchilov & Hutter, 2017) optimizer. We also apply dropout with probability 0.1. Following Yao et al. (2022), we train models for 15 epochs on PTB, 4 epochs on WikiText-103, and 10 epochs on WikiText-2. We evaluate with perplexity, a commonly used metric for language models, particularly in language modeling and text generation tasks; perplexity measures how well a language model can predict a sequence of words in a given text or test set.
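Concretely, for a test sequence of N tokens, the perplexity reported here is the exponentiated average negative log-likelihood under the model p_θ (the standard definition, stated for completeness):

\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_{\theta}(w_i \mid w_{<i})\right).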
All experiments are conducted on a single NVIDIA RTX 3090 24GB GPU.

F.2 SENSITIVITY TO K

We conduct a sensitivity study on K for GPT-2 across all datasets, and the results are presented in Table 8. For PTB and WikiText-2, GFSA exhibits the best performance when K is high, typically around 8 or 9. For WikiText-103, however, GFSA achieves the best perplexity when K is small, specifically when K is 3 or 4.

Table 8: Results comparison on GPT-2 finetuned with GFSA
Method | #Params | PTB | WikiText-2 | WikiText-103
GPT2 (Radford et al., 2019) | 117M | 19.513 | 20.966 | 15.939
GPT2 + GFSA (K = 2) | 117M | 19.459 | 20.929 | 15.920
GPT2 + GFSA (K = 3) | 117M | 19.456 | 20.927 | 15.919
GPT2 + GFSA (K = 4) | 117M | 19.453 | 20.927 | 15.919
GPT2 + GFSA (K = 5) | 117M | 19.452 | 20.925 | 15.920
GPT2 + GFSA (K = 6) | 117M | 19.451 | 20.925 | 15.920
GPT2 + GFSA (K = 7) | 117M | 19.450 | 20.925 | 15.921
GPT2 + GFSA (K = 8) | 117M | 19.450 | 20.924 | 15.921
GPT2 + GFSA (K = 9) | 117M | 19.450 | 20.923 | 15.921

G IMAGE CLASSIFICATION

G.1 DETAILED EXPERIMENTAL SETTINGS

Our code is implemented on top of the timm library (Wightman, 2019). Our training recipe is the same as the experimental setting of Wang et al. (2022), which follows the training recipes of Touvron et al. (2021a) and Touvron et al. (2021b). To apply our GFSA to existing base models such as DeiT, CaiT, and Swin, we consider a range of K between 2 and 5. For 12-layer DeiT, we follow the same hyperparameters as Wang et al. (2022). We set the dropout rate to 0 and 0.2 for 12-layer and 24-layer DeiT, respectively. For CaiT, we apply our GFSA only to the patch embedding layers. All other hyperparameters are kept consistent with the original papers of DeiT (Touvron et al., 2021a), CaiT (Touvron et al., 2021b), and Swin (Liu et al., 2021). All models are trained on NVIDIA RTX 3090 24GB GPUs.

G.2 FLOPS & THROUGHPUT

In Table 9, we report the number of FLOPs and the throughput. With GFSA plugged in, the FLOP count is either identical or only marginally higher: DeiT-S with 24 layers shows a slight FLOP increase, while the remaining settings have the same number of FLOPs. Throughput tends to decrease because calculating the high-order term is an additional cost.

Table 9: Experimental evaluation of GFSA plugged into DeiT-S, CaiT-S, and Swin-S
Backbone | Method | Input Size | #Layers | #Params | #FLOPs | Throughput | Top-1 Acc
DeiT | DeiT-S | 224 | 12 | 22.0M | 4.57G | 856.07 | 79.8
DeiT | DeiT-S + GFSA | 224 | 12 | 22.0M | 4.57G | 614.54 | 81.1 (↑ 1.3)
DeiT | DeiT-S | 224 | 24 | 43.3M | 9.09G | 423.68 | 80.5
DeiT | DeiT-S + GFSA | 224 | 24 | 43.3M | 9.10G | 314.75 | 81.5 (↑ 1.0)
CaiT | CaiT-S | 224 | 24 | 46.9M | 9.34G | 574.66 | 82.6
CaiT | CaiT-S + GFSA | 224 | 24 | 47.0M | 9.34G | 406.96 | 82.8 (↑ 0.2)
Swin | Swin-S | 224 | 24 | 49.6M | 8.75G | 912.38 | 82.9
Swin | Swin-S + GFSA | 224 | 24 | 49.6M | 8.75G | 714.60 | 83.0 (↑ 0.1)
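As a reference for how such throughput numbers can be reproduced (our own sketch, not the authors' benchmark script; it assumes a CUDA device, a batch size of 64, and the timm model name deit_small_patch16_224), one can time forward passes as follows:

import time
import torch
import timm

model = timm.create_model("deit_small_patch16_224", pretrained=False).cuda().eval()
x = torch.randn(64, 3, 224, 224, device="cuda")
with torch.no_grad():
    for _ in range(10):            # warm-up iterations
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(50):            # timed iterations
        model(x)
    torch.cuda.synchronize()
print(f"{64 * 50 / (time.time() - start):.1f} images/s")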
G.3 SENSITIVITY TO K

We also perform a sensitivity analysis for K. Tables 10 and 11 show the results for DeiT-S and CaiT-S with GFSA plugged in. For 12-layer DeiT-S, GFSA reaches its highest top-1 accuracy of 81.12 when K = 3. With K = 2, the performance is worse than the original DeiT-S, but with K of 3 or higher, the performance exceeds the original DeiT-S and, most surprisingly, even the 24-layer DeiT-S. CaiT-S shows the highest performance of 82.84 when K = 4: its accuracy is slightly lower than that of the original CaiT-S when K = 2, but it starts to exceed the accuracy of CaiT-S when K is 3 or higher.

Table 10: Sensitivity to K for 12-layer DeiT-S + GFSA
K | 2 | 3 | 4 | 5
Top-1 Acc (%) | 79.27 | 81.12 | 80.86 | 81.07

Table 11: Sensitivity to K for 24-layer CaiT-S + GFSA
K | 2 | 3 | 4
Top-1 Acc (%) | 82.54 | 82.65 | 82.84

G.4 ADDITIONAL EXPERIMENTS WITH GUO ET AL. (2023)'S SETTING

To make a fair comparison with ContraNorm (Guo et al., 2023), one of the related studies that mitigates oversmoothing, we run additional experiments that match their experimental setup.

Experimental Setting. We follow the training recipe used by Guo et al. (2023), which is a slightly modified version of Touvron et al. (2021a)'s recipe. Guo et al. (2023) use the AdamW optimizer with cosine learning rate decay. We select DeiT-T and DeiT-S for ImageNet-100 and ImageNet-1k, respectively; "T" and "S" denote the tiny and small model sizes. For all experiments, the image size is set to 224×224. We train each model for 300 epochs with a batch size of 1024. For ContraNorm, we train with their recommended hyperparameters. All models are trained on 4 NVIDIA RTX A6000 48GB GPUs.

Experimental Results. In Table 12, DeiT-T and DeiT-S with GFSA outperform vanilla DeiT-T and DeiT-S in all layer settings. On ImageNet-100, GFSA improves the performance of 12-layer DeiT-T by 1.52%, and the largest gain is a 4.88% improvement on 16-layer DeiT-T. This shows that the effect of GFSA is larger than that of ContraNorm. For ImageNet-1k, with 16 layers, GFSA raises the performance of DeiT-S to 80.83%, a 3.23% improvement.

Table 12: Experimental results on ImageNet-100 and ImageNet-1k
Dataset | Method | #Layers=12 | #Layers=16 | #Layers=24
ImageNet-100 | DeiT-T | 76.52 | 75.34 | 76.76
ImageNet-100 | DeiT-T + ContraNorm | 77.03 | 78.72 | 78.12
ImageNet-100 | DeiT-T + GFSA | 77.68 | 79.02 | 78.64
ImageNet-1k | DeiT-S | 77.32 | 78.25 | 77.69
ImageNet-1k | DeiT-S + ContraNorm | 77.80 | 79.04 | 78.67
ImageNet-1k | DeiT-S + GFSA | 79.86 | 80.83 | 79.15

Sensitivity to K. In Table 13, we report a sensitivity analysis for K under this setting. For ImageNet-100, the performance of GFSA generally improves when K is 4 or 5. On the other hand, GFSA performs better at lower K for the 16- and 24-layer settings on ImageNet-1k.

Table 13: Sensitivity to K on ImageNet-100 and ImageNet-1k
Dataset | Method | K | #Layers=12 | #Layers=16 | #Layers=24
ImageNet-100 | DeiT-T + GFSA | 2 | 76.92 | 78.14 | 78.40
ImageNet-100 | DeiT-T + GFSA | 3 | 77.41 | 77.76 | 78.09
ImageNet-100 | DeiT-T + GFSA | 4 | 77.01 | 79.02 | 78.64
ImageNet-100 | DeiT-T + GFSA | 5 | 77.68 | 78.14 | 78.64
ImageNet-1k | DeiT-S + GFSA | 2 | 79.84 | 80.83 | 79.15
ImageNet-1k | DeiT-S + GFSA | 3 | 79.85 | 79.39 | 79.07
ImageNet-1k | DeiT-S + GFSA | 4 | 79.86 | 79.44 | 79.10

H GRAPH CLASSIFICATION

H.1 DETAILED EXPERIMENTAL SETTINGS

Dataset. The benchmark datasets we used are listed below.

ZINC ZINC (Irwin et al., 2012) is the most popular real-world molecular dataset for graph property regression of constrained solubility, an important chemical property for designing generative GNNs for molecules. Uniform sampling is adopted for data splitting. We use the small-scale ZINC subset.

PCQM4M PCQM4M-LSC (Hu et al., 2021) consists of 2D molecular graphs; the task is to predict the DFT (density functional theory)-calculated HOMO-LUMO energy gap of molecules given their graphs, one of the most practically relevant quantum chemical properties in molecular science.
PCQM4M-LSC is unprecedentedly large in scale compared to other labeled graph-level prediction datasets, containing more than 3.8M graphs. We use the PCQM4M and PCQM4Mv2 large-scale datasets.

Graphormer. Following Graphormer (Ying et al., 2021), we use Graphormer for PCQM4M and Graphormer_SLIM for ZINC. Graphormer consists of 12 encoder layers, an encoder embedding dimension of 768, and an MLP dimension of 768; it employs 32 encoder heads with a hidden dimension of 24 per head. Graphormer_SLIM consists of 12 encoder layers, an encoder embedding dimension of 80, and an MLP dimension of 80; it employs 8 encoder heads with a hidden dimension of 10 per head.

Training. We adopt the training recipe of Graphormer (Ying et al., 2021). We use the AdamW (Loshchilov & Hutter, 2017) optimizer with coefficients 0.9 and 0.999 for the running averages of the gradient and its square, and use the mean absolute error (MAE) as the loss function. We use polynomial learning rate decay, with the initial learning rate set to 2 × 10−4 and the end learning rate set to 1 × 10−9. For ZINC, we set the batch size to 256, the maximum number of epochs to 10k, and the warm-up steps to 40k. For PCQM4M and PCQM4Mv2, we set the batch size to 1024, the maximum number of epochs to 300, and the warm-up steps to 60k. All models are trained on a single NVIDIA RTX 3090 24GB GPU. We conduct experiments with 4 different seeds.

H.2 EXPERIMENTAL RESULTS WITH STANDARD DEVIATION

We conducted experiments following the experimental environments of Graphormer (Ying et al., 2021) using 4 different seeds. Due to space constraints, only the mean values are reported in Tables 5 and 6. In this section, we report the means and standard deviations in Tables 14 and 15.

Table 14: Experimental results and number of parameters on ZINC
Method | #Params | MAE
Graphormer | 500K | 0.1240±0.006
Graphormer + GFSA | 500K | 0.1189±0.002

Table 15: Experimental results and number of parameters on PCQM4M and PCQM4Mv2
Method | #Params | PCQM4M Train | PCQM4M Validate | PCQM4Mv2 Train | PCQM4Mv2 Validate
Graphormer | 48.3M | 0.0535±0.038 | 0.1286±0.016 | 0.0250±0.000 | 0.0862±0.000
Graphormer + GFSA | 48.3M | 0.0312±0.001 | 0.1193±0.000 | 0.0249±0.000 | 0.0860±0.000

I AUTOMATIC SPEECH RECOGNITION

I.1 DETAILED EXPERIMENTAL SETTINGS

Dataset. We conduct experiments on the LibriSpeech dataset (Panayotov et al., 2015) (http://www.openslr.org/12), which consists of audio recordings paired with their transcriptions. LibriSpeech contains approximately 1,000 hours of read English speech sampled at 16 kHz. We keep the original 16,000 Hz sampling rate and compute 80-dimensional log-Mel filterbanks over a 25 ms sliding window with a 10 ms stride. The filterbank features are then normalized to zero mean and unit variance per input sequence. For implementation, we follow the recipes of SpeechBrain (Ravanelli et al., 2021).

Evaluation Metric. The word error rate (WER, %) is derived from the Levenshtein distance and compares a reference to a hypothesized word-level transcription. It is calculated by summing the number of word insertions, deletions, and substitutions and dividing by the total number of words in the reference transcription.
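Written out, with S, D, and I denoting the numbers of substituted, deleted, and inserted words and N the number of words in the reference transcription (the standard definition, stated for completeness):

\mathrm{WER} = \frac{S + D + I}{N} \times 100\%.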
Vanilla Transformer. We use a vanilla Transformer to apply our GFSA, implemented with the SpeechBrain (Ravanelli et al., 2021) framework. The vanilla Transformer consists of i) a 1D convolution to perform striding, ii) a Transformer encoder with 12 layers, 4 heads, an embedding dimension of 512, an MLP dimension of 2048, and post-LayerNorm, iii) a decoder with 6 layers, 4 heads, an embedding dimension of 512, an MLP dimension of 2048, and joint beam search, and iv) an external Transformer language model with 12 layers, 12 heads, an embedding dimension of 768, and an MLP dimension of 3072.

Branchformer. We also plug our GFSA into Branchformer (Peng et al., 2022), one of the state-of-the-art models. Branchformer has two parallel branches, one capturing global interactions using attention and the other capturing more localized context using a convolutional gating MLP. The Branchformer architecture for speech recognition consists of i) a 1D convolution to perform striding, ii) a Branchformer encoder with 18 layers, 8 heads, an embedding dimension of 512, and an MLP dimension of 3072, iii) a decoder with 6 layers, 8 heads, an embedding dimension of 512, a convolutional spatial gating unit (CSGU) dimension of 3072, and joint beam search, and iv) an external Transformer language model with 12 layers, 12 heads, an embedding dimension of 768, and an MLP dimension of 3072.

Training. We follow a training recipe from SpeechBrain (Ravanelli et al., 2021). The standard LibriSpeech validation sets (dev-clean and dev-other) are used to tune all parameters and select the best models; the test sets (test-clean and test-other) are used only to report the final WER performance. We train the pure Transformer for 100 epochs and the Branchformer for 120 epochs with a batch size of 16. We use SpecAugment (Park et al., 2019) for data augmentation on all models; SpecAugment applies time and frequency masking as well as time warping to the input spectrum. For Branchformer, we use the AdamW (Loshchilov & Hutter, 2017) optimizer with coefficients 0.9 and 0.98 for the running averages of the gradient and its square. The learning rate and weight decay in all models are 0.0008 and 0.01, respectively. We use a connectionist temporal classification (CTC) loss (Graves & Jaitly, 2014; Zhang et al., 2016). We also apply dropout with probability 0.1 and label smoothing with weight 0.1 to mitigate overfitting. We fix the random seed to 74443 in all experiments. All models are trained on a single NVIDIA RTX A6000 48GB GPU.

Hyperparameters. In Table 16, we describe the main hyperparameters used in the automatic speech recognition task. For Transformer+GFSA and Branchformer+GFSA, we also report the best K hyperparameter.
Table 16: Main hyperparameters used in ASR

Transformer. Encoder: Transformer (12 layers); Decoder: Transformer (6 layers) + (CTC/ATT joint) beam search + TransformerLM; Augmentation: SpecAugment; Features: 40 fbanks; Pretraining: no; Dropout: 0.1; Batchnorm: yes; Number of epochs: 100; Batch size: 32; Learning rate: 0.0008; LR scheduler: Noam; Optimizer: Adam; Loss: CTC + KLdiv (label smoothing loss); CTC weight: 0.3.

Transformer+GFSA. Identical to the Transformer setting above, with K: 2.

Branchformer. Encoder: Branchformer; Decoder: Transformer (6 layers) + (CTC/ATT joint) beam search + TransformerLM; Augmentation: SpecAugment; Features: 40 fbanks; Pretraining: no; Dropout: 0.1; Batchnorm: yes; Number of epochs: 120; Batch size: 16; Learning rate: 0.0008; LR scheduler: Noam; Optimizer: AdamW with coefficients 0.9 and 0.98; Loss: CTC + KLdiv (label smoothing loss); CTC weight: 0.3.

Branchformer+GFSA. Identical to the Branchformer setting above, with K: 3.

I.2 TRAINING CURVE

We compare the training and validation loss curves on LibriSpeech 100h in Figure 7. The training loss curve with GFSA is lower than that of the pure Transformer, and GFSA stabilizes the loss curve slightly earlier.

Figure 7: Training curves on LibriSpeech 100h: (a) training loss and (b) validation loss for Transformer and Transformer+GFSA.

J CODE DEFECT DETECTION

J.1 DETAILED EXPERIMENTAL SETTINGS

Dataset. We use the Devign dataset provided by Zhou et al. (2019), a binary classification task that evaluates whether a C-language code snippet is vulnerable to software systems or not.

Implementation. We build our experiments on top of the open-sourced code (https://github.com/salesforce/CodeT5) and recipes provided by Wang et al. (2021b).

RoBERTa. RoBERTa (Liu et al., 2019) is an encoder-only model trained with masked language modeling on code. All hyperparameters are consistent with the training method in the source code of Wang et al. (2021b).

PLBART. PLBART (Ahmad et al., 2021) is an encoder-decoder model based on the BART (Lewis et al., 2020) architecture. PLBART supports both understanding and generation tasks. All hyperparameters are consistent with the training method in the source code of Wang et al. (2021b).

CodeBERT. CodeBERT (Feng et al., 2020) is a model trained on masked language modeling and replaced token detection. CodeBERT is a bimodal pretrained Transformer with 12 layers for programming language and natural language. All hyperparameters are consistent with the training method in the source code of Wang et al. (2021b).

CodeT5. CodeT5 is an encoder-decoder framework with the same architecture as T5 (Raffel et al., 2020).
It aims to derive generic representations for programming language and natural language via pre-training on unlabeled source code. CodeT5-small has 6 encoder layers, 6 decoder layers, 8 attention heads, 512-dimensional hidden states, and 60M parameters. The other models have 12 encoder layers, 12 decoder layers, 12 attention heads, 768-dimensional hidden states, and 220M parameters. All hyperparameters are consistent with the training method in the source code of Wang et al. (2021b).

Training. The pre-trained models mentioned above are applied to this downstream task. We add GFSA directly on top of the self-attention. We finetune the baselines and the GFSA models for 10 epochs with a batch size of 16, using an early stopping strategy with a patience of 2. The models generate binary labels from unigram sequences at the decoder for the defect detection task. We use accuracy to evaluate code defect detection. All models are trained on a single NVIDIA RTX A6000 48GB GPU.

J.2 CASE STUDY

In Listing 1, we show one case of a defective code snippet in QEMU (https://www.qemu.org) that CodeT5-base does not predict correctly but CodeT5-base+GFSA does. The commit message for this case (https://github.com/qemu/qemu/commit/b13ce26d3e8c6682044ae84920f2417b30ce356b) is as follows: "Needed for changing cpu_has_work() argument type to CPUState, used in h_cede()." h_cede() is the hypercall that asks the hypervisor to shut down the CPU. Previously, this hypercall simply passed the CPU ID, so the hypervisor did not know what state the CPU was in. This change allows the hypervisor to know whether the CPU is actually performing work: if the CPU is performing a task, the hypervisor waits for the CPU to complete it. In this context, accurately predicting defects like the one above is very important, and applying GFSA to CodeT5-base helps in terms of performance improvement.

@@ -204,7 +204,7 @@ static target_ulong put_tce_emu(sPAPRTCETable *tcet, target_ulong ioba,
-static target_ulong h_put_tce(CPUPPCState *env, sPAPREnvironment *spapr,
+static target_ulong h_put_tce(PowerPCCPU *cpu, sPAPREnvironment *spapr,
                               target_ulong opcode, target_ulong *args)
 {
     target_ulong liobn = args[0];
     target_ulong ioba = args[1];
     target_ulong tce = args[2];
     VIOsPAPRDevice *dev = spapr_vio_find_by_reg(spapr->vio_bus, liobn);
     VIOsPAPR_RTCE *rtce;
     if (!dev) {
         hcall_dprintf("LIOBN 0x" TARGET_FMT_lx " does not exist\n", liobn);
         return H_PARAMETER;
     }
     ioba &= ~(SPAPR_VIO_TCE_PAGE_SIZE - 1);
 #ifdef DEBUG_TCE
     fprintf(stderr, "spapr_vio_put_tce on %s ioba 0x" TARGET_FMT_lx " TCE 0x" TARGET_FMT_lx "\n", dev->qdev.id, ioba, tce);
 #endif
     if (ioba >= dev->rtce_window_size) {
         hcall_dprintf("Out-of-bounds IOBA 0x" TARGET_FMT_lx "\n", ioba);
         return H_PARAMETER;
     }
     rtce = dev->rtce_table + (ioba >> SPAPR_VIO_TCE_PAGE_SHIFT);
     rtce->tce = tce;
     return H_SUCCESS;
 }

Listing 1: An example commit for a defect in the Devign dataset

K CODE CLONE DETECTION

K.1 DETAILED EXPERIMENTAL SETTINGS

Dataset. Code clone detection aims to measure the similarity between two code snippets and predict whether they have the same functionality. We experiment with the Java data provided by Wang et al. (2020).

Implementation. We build our experiments on top of the open-sourced code (https://github.com/salesforce/CodeT5) and recipes provided by Wang et al. (2021b).
Training. We finetune both CodeT5 and CodeT5+GFSA for one epoch with a batch size of 16, using early stopping with a patience of 2. CodeT5 and CodeT5+GFSA encode the source code and use the resulting representation to calculate the similarity of two code snippets. We use the F1 score to evaluate this task. All models are trained on a single NVIDIA RTX A6000 48GB GPU.

K.2 EXPERIMENT RESULT

Table 17 compares CodeT5 with and without GFSA. With GFSA, the CodeT5 models improve their performance; for example, CodeT5-small+GFSA provides a 0.61% improvement over CodeT5-small.

Table 17: Results on the code clone detection task
Method | Clone F1
CodeT5-small (Wang et al., 2021b) | 94.36
CodeT5-small + GFSA | 94.94 (↑ 0.61%)
CodeT5-base (Wang et al., 2021b) | 94.31
CodeT5-base + GFSA | 94.92 (↑ 0.64%)

L TIME COMPLEXITY AND EMPIRICAL RUNTIME ANALYSIS

Time Complexity. The time complexity of the original self-attention is O(n²d). Our GFSA additionally computes the high-order term, so its time complexity is O(n²d + n³). If n is smaller than d, the time complexity approaches O(n²d), the complexity of the original self-attention.

Empirical Runtime Analysis. We report the training time of various methods with GFSA in Tables 18 to 23. In general, the training time with GFSA is slightly longer than that of the existing methods. For example, the per-epoch training time of the Transformer for the automatic speech recognition task increases from 190.5 seconds to 191.6 seconds on the LibriSpeech 100h dataset, an increase of only about 1 second. Instead of computing higher-order polynomial terms exactly, our GFSA approximates them, so the increase in runtime is small.

Table 18: Training time (seconds per epoch) on GLUE tasks. s denotes seconds. Avg denotes the average training time across all tasks.
Method | #Params | CoLA | SST-2 | MRPC | QQP | STS-B | MNLI-m/mm | QNLI | RTE | Avg
BERT_BASE (Devlin et al., 2019) | 110M | 17s | 182s | 17s | 1483s | 24s | 2004s | 580s | 18s | 541s
BERT_BASE + GFSA | 110M | 19s | 192s | 19s | 1571s | 25s | 2147s | 621s | 20s | 577s
ALBERT_BASE (Lan et al., 2019) | 11M | 15s | 188s | 20s | 1506s | 25s | 2072s | 612s | 19s | 557s
ALBERT_BASE + GFSA | 11M | 16s | 197s | 21s | 1604s | 26s | 2219s | 659s | 21s | 595s
RoBERTa_BASE (Liu et al., 2019) | 125M | 17s | 190s | 18s | 1492s | 25s | 2012s | 593s | 18s | 546s
RoBERTa_BASE + GFSA | 125M | 19s | 200s | 19s | 1580s | 26s | 2151s | 634s | 20s | 581s

Table 19: Training time (seconds per epoch) on causal language modeling tasks
Method | #Params | PTB | WikiText-2 | WikiText-103 | Avg
GPT2 (Radford et al., 2019) | 117M | 89.1s | 195.7s | 9638.4s | 3307.8s
GPT2 + GFSA | 117M | 160.3s | 354.2s | 17424.6s | 5979.7s
Table 20: Training time (seconds per epoch) on ImageNet-1k
Backbone | Method | #Layers | #Params | #FLOPs | Throughput | Runtime
DeiT | DeiT-S | 12 | 22.0M | 4.57G | 856.07 | 551s
DeiT | DeiT-S + GFSA | 12 | 22.0M | 4.57G | 614.54 | 814s
DeiT | DeiT-S | 24 | 43.3M | 9.09G | 423.68 | 1508s
DeiT | DeiT-S + GFSA | 24 | 43.3M | 9.10G | 314.75 | 1798s
CaiT | CaiT-S | 24 | 46.9M | 9.34G | 574.66 | 1530s
CaiT | CaiT-S + GFSA | 24 | 47.0M | 9.34G | 406.96 | 1624s
Swin | Swin-S | 24 | 49.6M | 8.75G | 912.38 | 1897s
Swin | Swin-S + GFSA | 24 | 49.6M | 8.75G | 714.60 | 1970s

Table 21: Training time (seconds per epoch) on graph classification tasks
Method | ZINC | PCQM4M | PCQM4Mv2
Graphormer (Ying et al., 2021) | 9s | 740s | 817s
Graphormer + GFSA | 9s | 896s | 955s

Table 22: Training time (seconds per epoch) on LibriSpeech datasets
Method | LibriSpeech 100h | LibriSpeech 960h
Transformer | 190.5s | 3049.3s
Transformer + GFSA | 191.6s | 3398.4s
Branchformer (Peng et al., 2022) | 248.5s | 4990.1s
Branchformer + GFSA | 254.3s | 4999.3s

Table 23: Training time (seconds per epoch) on the code defect prediction task
Method | Runtime
RoBERTa (Liu et al., 2019) | 543.96s
RoBERTa + GFSA | 537.79s
CodeBERT (Feng et al., 2020) | 555.28s
CodeBERT + GFSA | 561.43s
PLBART (Ahmad et al., 2021) | 467.80s
PLBART + GFSA | 470.19s
CodeT5-small (Wang et al., 2021b) | 301.11s
CodeT5-small + GFSA | 309.04s
CodeT5-base (Wang et al., 2021b) | 362.28s
CodeT5-base + GFSA | 373.22s

M INFERENCE TIME

We report the inference time of various methods with GFSA in Tables 24 to 29.

Table 24: Inference time on GLUE tasks. s denotes seconds. Avg denotes the average inference time across all tasks.
Method | #Params | CoLA | SST-2 | MRPC | QQP | STS-B | MNLI-m/mm | QNLI | RTE | Avg
BERT_BASE (Devlin et al., 2019) | 110M | 1.0s | 1.4s | 1.2s | 48.7s | 1.9s | 15.5s | 10.1s | 1.2s | 10.0s
BERT_BASE + GFSA | 110M | 1.1s | 1.4s | 1.2s | 52.3s | 2.0s | 16.8s | 11.0s | 1.3s | 11.0s
ALBERT_BASE (Lan et al., 2019) | 11M | 1.1s | 1.6s | 1.4s | 58.4s | 2.2s | 18.4s | 12.1s | 1.3s | 12.0s
ALBERT_BASE + GFSA | 11M | 1.2s | 1.7s | 1.4s | 62.1s | 2.3s | 19.7s | 13.1s | 1.4s | 13.0s
RoBERTa_BASE (Liu et al., 2019) | 125M | 1.0s | 1.4s | 1.1s | 47.0s | 1.9s | 15.0s | 9.9s | 1.2s | 10.0s
RoBERTa_BASE + GFSA | 125M | 1.1s | 1.4s | 1.2s | 50.4s | 2.0s | 16.3s | 10.8s | 1.2s | 11.0s

Table 25: Inference time on causal language modeling tasks
Method | #Params | PTB | WikiText-2 | WikiText-103 | Avg
GPT2 (Radford et al., 2019) | 117M | 3.2s | 7.4s | 7.4s | 6.0s
GPT2 + GFSA | 117M | 5.5s | 12.9s | 12.9s | 10.4s

Table 26: Inference time on ImageNet-1k
Backbone | Method | #Layers | Inference Time
DeiT | DeiT-S | 12 | 52s
DeiT | DeiT-S + GFSA | 12 | 53s
DeiT | DeiT-S | 24 | 68s
DeiT | DeiT-S + GFSA | 24 | 69s
CaiT | CaiT-S | 24 | 105s
CaiT | CaiT-S + GFSA | 24 | 107s
Swin | Swin-S | 24 | 17s
Swin | Swin-S + GFSA | 24 | 17s

Table 27: Inference time on graph classification tasks
Method | ZINC | PCQM4M | PCQM4Mv2
Graphormer (Ying et al., 2021) | 8s | 99s | 31s
Graphormer + GFSA | 8s | 117s | 29s

Table 28: Inference time on LibriSpeech datasets
Method | LibriSpeech 100h | LibriSpeech 960h
Transformer | 328.1s | 323.7s
Transformer + GFSA | 329.5s | 343.3s
Branchformer (Peng et al., 2022) | 299.4s | 328.7s
Branchformer + GFSA | 305.5s | 354.1s

Table 29: Inference time on the code defect prediction task
Method | Inference Time
RoBERTa (Liu et al., 2019) | 22.4s
RoBERTa + GFSA | 23.9s
CodeBERT (Feng et al., 2020) | 23.8s
CodeBERT + GFSA | 24.1s
PLBART (Ahmad et al., 2021) | 37.7s
PLBART + GFSA | 39.3s
CodeT5-small (Wang et al., 2021b) | 78.2s
CodeT5-small + GFSA | 82.5s
CodeT5-base (Wang et al., 2021b) | 83.2s
CodeT5-base + GFSA | 88.5s
N FREQUENCY ANALYSES IN THE SINGULAR VALUE DOMAIN

Graph signal processing (GSP) (Sandryhaila & Moura, 2013; 2014) can be understood as a generalized concept of discrete signal processing (DSP); in other words, DSP is a special case of GSP where a line graph with n nodes is used, and therefore the graph Fourier transform (GFT) of the line graph is identical to the discrete Fourier transform. In the definition of the GFT, we assume that the graph shift operator (GSO) S is diagonalizable. Considering the eigendecomposition of the GSO, S = V⊺ΛV with eigenvectors V, we can write the graph filter output as follows:

y = Σ_{k=0}^{K} w_k S^k x = Σ_{k=0}^{K} V⊺ w_k Λ^k V x = V⊺ (Σ_{k=0}^{K} w_k Λ^k) V x = V⊺ g(Λ) V x,   (15)

where x ∈ R^n is a 1-dimensional graph signal, Λ is a diagonal matrix of eigenvalues, and w_k ∈ [−∞, ∞] is a coefficient. However, one can use the singular value decomposition, S = V⊺ΛU, instead of the eigendecomposition when the GSO is not diagonalizable but symmetrically normalized (Maskey et al., 2023). Both the singular value decomposition and the eigendecomposition project the original signal onto a set of basis vectors, but they use different basis sets. In the singular value decomposition, we sort the basis vectors in ascending order of their eigenvalues, i.e., Λ, and perform frequency-domain-like analyses (Zou et al., 2022; Maskey et al., 2023).

Since the self-attention matrix's row-wise sum is always 1, we have Ā = D⁻¹A = (1/n)A, where n is the number of tokens. Maskey et al. (2023) define the symmetrically normalized adjacency matrix (SNA) as D_in^{-1/2} A D_out^{-1/2}. Since the degree of every node is n in the self-attention matrix, we have D_in^{-1/2} A D_out^{-1/2} = D^{-1/2} A D^{-1/2} = (1/√n) A (1/√n) = (1/n) A = Ā. Therefore, the self-attention matrix is a special case of SNAs.

O MATRIX POLYNOMIAL VS. GRAPH FOURIER TRANSFORM

There are two paradigms for implementing graph filters: i) matrix polynomials, which do not require diagonalizability, and ii) the graph Fourier transform, which uses the eigendecomposition for diagonalizable adjacency matrices or the Jordan decomposition for non-diagonalizable adjacency matrices. The two paradigms have their own weaknesses: i) the matrix polynomial approach requires explicit matrix multiplications, and ii) the graph Fourier transform approach requires expensive spectral decompositions. The matrix polynomial is preferred when the number of matrix multiplications is small; otherwise, the graph Fourier transform approach may be better, since the matrix multiplication can be simplified after the eigendecomposition or the Jordan decomposition. Of the two, we use the matrix polynomial approach with only three non-zero coefficients {w_0, w_1, w_K}, since it does not require a complicated spectral decomposition. Because we do not rely on any explicit spectral decomposition but on the matrix polynomial, any adjacency matrix can be used.
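To illustrate the equivalence of the two paradigms discussed above, here is a minimal sketch of ours (assuming a small symmetric, hence diagonalizable, graph shift operator; it is not tied to the GFSA implementation):

import numpy as np

rng = np.random.default_rng(0)
n, w = 8, [0.2, 0.5, 0.3]                    # three filter coefficients w_0, w_1, w_2
S = rng.random((n, n))
S = (S + S.T) / 2                            # a symmetric (diagonalizable) graph shift operator
x = rng.random(n)                            # a 1-dimensional graph signal

# i) matrix polynomial: y = w_0 x + w_1 S x + w_2 S^2 x
y_poly = sum(wk * np.linalg.matrix_power(S, k) @ x for k, wk in enumerate(w))

# ii) graph Fourier transform: filter in the spectral domain with g(lam) = sum_k w_k lam^k
lam, V = np.linalg.eigh(S)                   # S = V diag(lam) V^T (NumPy's column convention)
g = sum(wk * lam**k for k, wk in enumerate(w))
y_gft = V @ (g * (V.T @ x))

print(np.allclose(y_poly, y_gft))            # True: both paradigms produce the same output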
P APPROXIMATION OF Ā^K IN EQUATION 4

The general formulation of the first-order Taylor approximation at a point a is:

f(x) ≃ f(a) + f′(a)(x − a).   (16)

We approximate f(K) = Ā^K with the first-order Taylor approximation at the point a = 1:

f(K) = Ā^K ≃ f(1) + f′(1)(K − 1).   (17)

Inspired by Brouwer et al. (2019), which approximates the derivative of the hidden states of RNNs as the difference between hidden states, we approximate f′(1), the derivative of Ā^K evaluated at K = 1, with the difference term f(2) − f(1) = Ā² − Ā. Then the approximated Ā^K becomes:

f(K) = Ā^K ≃ f(1) + (Ā² − Ā)(K − 1) = Ā + (K − 1)(Ā² − Ā).   (18)

This approximation has two advantages:
1. Approximating Ā^K with Ā and Ā² provides a simpler computation that significantly reduces the required computational resources and time (see the theoretical analysis in Section 3.2).
2. When approximating the derivative of Ā^K with a difference term for the graph filter design, our filter can reduce to GPR-GNN and GREAD, two popular GNN methods that prevent oversmoothing in graph convolutional networks; both are special cases of our design (see the discussion in Section 3.3). In other words, our method is more general than those two GNNs.

Q GFSA IN SELECTED LAYERS: A STRATEGY FOR MODEL EFFICIENCY

Because GFSA requires more computation than the original self-attention, the runtime increases when GFSA is used. While the experiments in the main paper apply GFSA to all layers of the Transformer, here we report a strategy that alleviates the computational burden by applying GFSA only to some layers; specifically, GFSA is used only on the even-numbered layers (see the sketch after Table 34). In Tables 30 to 34, the results show that this strategy reduces the increase in runtime while maintaining performance compared to using GFSA in all layers.

Table 30: Comparison of performance using GFSA on all layers vs. GFSA_even on even layers for GLUE tasks
Method | #Params | CoLA | SST-2 | MRPC | QQP | STS-B | MNLI-m/mm | QNLI | RTE | Avg
BERT_BASE (Devlin et al., 2019) | 110M | 56.79 | 93.81 | 88.70 | 88.32 | 88.16 | 84.96/84.15 | 91.63 | 66.06 | 82.51
BERT_BASE + GFSA | 110M | 59.56 | 94.15 | 90.60 | 88.46 | 88.33 | 85.12/85.06 | 91.95 | 68.95 | 83.58
BERT_BASE + GFSA_even | 110M | 58.80 | 93.69 | 90.50 | 88.47 | 88.27 | 85.13/85.02 | 91.65 | 70.76 | 83.59

Table 31: Comparison of training time (seconds per epoch) using GFSA on all layers vs. GFSA_even on even layers for GLUE tasks
Method | #Params | CoLA | SST-2 | MRPC | QQP | STS-B | MNLI-m/mm | QNLI | RTE | Avg
BERT_BASE (Devlin et al., 2019) | 110M | 17s | 182s | 17s | 1483s | 24s | 2004s | 580s | 18s | 541s
BERT_BASE + GFSA | 110M | 19s | 192s | 19s | 1571s | 25s | 2147s | 621s | 20s | 577s
BERT_BASE + GFSA_even | 110M | 17s | 185s | 18s | 1506s | 24s | 2061s | 595s | 18s | 553s

Table 32: Comparison of performance using GFSA on all layers vs. GFSA_even on even layers for causal language modeling tasks
Method | #Params | PTB | WikiText-2 | WikiText-103 | Avg
GPT2 (Radford et al., 2019) | 117M | 19.513 | 20.966 | 15.939 | 18.806
GPT2 + GFSA | 117M | 19.450 | 20.923 | 15.919 | 18.764
GPT2 + GFSA_even | 117M | 19.453 | 20.926 | 15.923 | 18.767

Table 33: Comparison of training time (seconds per epoch) using GFSA on all layers vs. GFSA_even on even layers for causal language modeling tasks
Method | #Params | PTB | WikiText-2 | WikiText-103 | Avg
GPT2 (Radford et al., 2019) | 117M | 89.1s | 195.7s | 9638.4s | 3307.8s
GPT2 + GFSA | 117M | 160.3s | 354.2s | 17424.6s | 5979.7s
GPT2 + GFSA_even | 117M | 127.4s | 279.1s | 13761.4s | 4722.6s

Table 34: Comparison of using GFSA on all layers vs. GFSA_even on even layers for ImageNet-1k
Method | Input Size | #Layers | #Params | Top-1 Acc | Runtime
DeiT-S | 224 | 12 | 43M | 79.8 | 551s
DeiT-S + GFSA | 224 | 12 | 43M | 81.1 | 814s
DeiT-S + GFSA_even | 224 | 12 | 43M | 81.0 | 595s
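For reference, the selective-application strategy can be sketched as a small wrapper around the attention matrices of a stack of layers. The following is our own illustration, not the released implementation; the GFSA module mirrors Algorithm 1, while the per-layer attention matrices are stand-ins generated at random, and "every second layer" is our reading of the even-numbered-layer rule:

import torch
import torch.nn as nn

class GFSA(nn.Module):
    """GFSA re-weighting of an attention matrix (Algorithm 1), one weight triple per head."""
    def __init__(self, n_heads, K=3):
        super().__init__()
        self.K = K
        self.w_0 = nn.Parameter(torch.zeros(n_heads))
        self.w_1 = nn.Parameter(torch.ones(n_heads))
        self.w_K = nn.Parameter(torch.zeros(n_heads))

    def forward(self, att):                        # att: (batch, heads, n, n)
        n = att.size(-1)
        I = torch.eye(n, device=att.device)[None, None]
        att_K = att + (self.K - 1) * (att @ att - att)
        w = lambda p: p[None, :, None, None]
        return w(self.w_0) * I + w(self.w_1) * att + w(self.w_K) * att_K

# Toy 12-layer loop that applies GFSA only on every second layer (GFSA_even strategy).
n_layers, heads, n = 12, 4, 16
gfsa = GFSA(heads, K=3)
att = torch.softmax(torch.randn(2, heads, n, n), dim=-1)   # stand-in attention matrix
for i in range(n_layers):
    layer_att = gfsa(att) if i % 2 == 1 else att
    # ... the rest of the block (value aggregation, FFN) would consume layer_att ...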