E QUIVARIANT S CALAR F IELDS FOR M OLECULAR
D OCKING WITH FAST F OURIER T RANSFORMS
Bowen Jing1 , Tommi Jaakkola1 , Bonnie Berger1 2
CSAIL, Massachusetts Institute of Technology
2
Dept. of Mathematics, Massachusetts Institute of Technology
bjing@mit.edu, {tommi, bab}@csail.mit.edu

arXiv:2312.04323v1 [q-bio.BM] 7 Dec 2023

1

A BSTRACT
Molecular docking is critical to structure-based virtual screening, yet the throughput of such workflows is limited by the expensive optimization of scoring functions involved in most docking algorithms. We explore how machine learning
can accelerate this process by learning a scoring function with a functional form
that allows for more rapid optimization. Specifically, we define the scoring function to be the cross-correlation of multi-channel ligand and protein scalar fields
parameterized by equivariant graph neural networks, enabling rapid optimization
over rigid-body degrees of freedom with fast Fourier transforms. The runtime of
our approach can be amortized at several levels of abstraction, and is particularly
favorable for virtual screening settings with a common binding pocket. We benchmark our scoring functions on two simplified docking-related tasks: decoy pose
scoring and rigid conformer docking. Our method attains similar but faster performance on crystal structures compared to the widely-used Vina and Gnina scoring
functions, and is more robust on computationally predicted structures. Code is
available at https://github.com/bjing2016/scalar-fields.

1

I NTRODUCTION

Proteins are the macromolecular machines that drive almost all biological processes, and much of
early-stage drug discovery focuses on finding molecules which bind to and modulate their activity.
Molecular docking—the computational task of predicting the binding pose of a small molecule to
a protein target—is an important step in this pipeline. Traditionally, molecular docking has been
formulated as an optimization problem over a scoring function designed to be a computational proxy
for the free energy (Torres et al., 2019; Fan et al., 2019). Such scoring functions are typically
a sum of pairwise interaction terms between atoms with physically-inspired functional forms and
empirically tuned weights (Quiroga & Villarreal, 2016). While these terms are simple and hence
fast to evaluate, exhaustive sampling or optimization over the space of ligand poses is difficult and
leads to the significant runtime of docking software.
ML-based scoring functions for docking have been an active area of research, ranging in sophistication from random forests to deep neural networks (Yang et al., 2022; Crampon et al., 2022). These
efforts have largely sought to more accurately model the free energy based on a docked pose, which
is important for downstream identification of binders versus non-binders (virtual screening). However, they have not addressed nor reduced the computational cost required to produce these poses
in the first place. Hence, independently of the accuracy of these workflows, molecular docking for
large-scale structure-based virtual screening remains computationally challenging, especially with
the growing availability of large billion-compound databases such as ZINC (Tingle et al., 2023).
In this work, we explore a different paradigm and motivation for machine learning scoring functions,
with the specific aim of accelerating scoring and optimization of ligand poses for high-throughput
molecular docking. To do so, we forego the physics-inspired functional form of commonly used
scoring functions, and instead frame the problem as that of learning scalar fields independently associated with the 3D structure of the protein and ligand, respectively. We then define the score to
be the cross-correlation between the overlapping scalar fields when oriented according to the ligand
pose. While seemingly more complex than existing scoring functions, these cross-correlations can
1

A

B

ϕP

ϕL ⋆ ϕP

C

ESF

⋆
ESF

ϕL
Figure 1: Overview of the scalar field-based scoring function and docking procedure. The translational FFT procedure is shown here; the rotational FFT is similar, albeit harder to visualize. (A) The
protein pocket and ligand conformer are independently passed through equivariant scalar field networks (ESFs) to produce scalar fields. (B) The fields are cross-correlated to produce heatmaps over
ligand translations. (C) The ligand coordinates are translated to the argmax of the heatmap. Additional scalar field visualizations are in Appendix C.

be rapidly evaluated over a large number of ligand poses simultaneously using Fast Fourier Transforms (FFT) over both the translational space R3 and the rotational space SO(3). This property
allows for significant speedups in the optimization over these degrees of freedom.
Scalar fields representing molecules in 3D space have been previously parameterized as neural fields
(Zhong et al., 2019), i.e., neural networks taking coordinates of a query point as input and producing
field values as output. However, our scalar fields must be defined relative to the ligand (or protein)
structure—represented as a graph embedded in 3D space—and be SE(3) equivariant in order to
yield an invariant scoring function. To satisfy these requirements, we introduce equivariant scalar
field networks (ESFs), which parameterize the scalar fields with message passing E3NNs (Thomas
et al., 2018; Geiger & Smidt, 2022). Unlike neural fields, these networks yield a compact representation of the scalar field in a single forward pass and are automatically SE(3)-equivariant.
Contrasting with existing ML scoring functions, the computational cost of our method can be amortized at several levels of abstraction, significantly accelerating runtimes for optimized workflows.
For example, unlike methods that require one neural network forward pass per pose, our network is
evaluated once per protein structure or ligand conformer independently. Post-amortization, we attain
translational and rotational optimization runtimes as fast as 160 µs and 650 µs, respectively, with
FFTs. Such throughputs, when combined with effective sampling and optimization, could make
docking of very large compound libraries feasible with only modest resources.
Empirically, we evaluate our method on two simplified docking-related tasks: (1) decoy pose scoring
and (2) rigid conformer docking. On both tasks, our scoring function is competitive with—but faster
than—the scoring functions of Gnina (Ragoza et al., 2017; McNutt et al., 2021) and Vina (Trott
& Olson, 2010) on PDBBind crystal structures and is significantly better on ESMFold structures.
We then demonstrate the further advantages of runtime amortization on the virtual screening-like
setup of the PDE10A test set (Tosstorff et al., 2022), where—since there is only one unique protein
structure—our method obtains a 50x speedup in total inference time at no loss of accuracy.
To summarize, our contributions are:
• We are the first to propose learning protein-ligand scoring functions based on crosscorrelations of scalar fields in order to accelerate pose optimization in molecular docking.
• We formulate a neural network parameterization and training procedure for learning equivariant scalar fields for molecules.
• We demonstrate that the performance and runtime of our scoring function holds promise
when evaluated on docking-related tasks.

2

2

BACKGROUND

Molecular docking. The two key components of a molecular docking algorithm are (1) one or
more scoring functions for ligand poses, and (2) a search, sampling, or optimization procedure.
There is considerable variation in the design of these components and how they interact with each
other, ranging from exhaustive enumeration and filtering (Shoichet et al., 1992; Meng et al., 1992)
to genetic, gradient-based, or MCMC optimization algorithms (Trott & Olson, 2010; Morris et al.,
1998; McNutt et al., 2021). We refer to reviews elsewhere (Ferreira et al., 2015; Torres et al., 2019;
Fan et al., 2019) for comprehensive details. These algorithms have undergone decades of development and have been given rise to well-established software packages in academia and industry, such
as AutoDock (Morris & Lim-Wilby, 2008), Vina (Trott & Olson, 2010) and Glide (Halgren et al.,
2004). In many of these, the scoring function is designed not only to identify the binding pose, but
also to predict the binding affinity or activity of the ligand (Su et al., 2018). In this work, however,
we focus on learning and evaluating scoring functions for the rapid prediction of binding poses.
ML methods in docking. For over a decade, ML methods have been extensively explored to improve scoring functions for already-docked ligand poses, i.e., for prediction of activity and affinity
in structural-based virtual screens (Li et al., 2021; Yang et al., 2022; Crampon et al., 2022). On the
other hand, developing ML scoring functions as the direct optimization objective has required more
care due the enormous number of function evaluations involved. MedusaNet (Jiang et al., 2020) and
Gnina (Ragoza et al., 2017; McNutt et al., 2021) proposed to sparsely use CNNs for guidance and
re-ranking (respectively) in combination with a traditional scoring function. DeepDock (MéndezLucio et al., 2021) used a hypernetwork to predict complex-specific parameters of a simple statistical
potential. Recently, geometric deep learning models have explored entirely different paradigms for
docking via direct prediction of the binding pose (Stärk et al., 2022; Zhang et al., 2022; Lu et al.,
2022) or via a generative model over ligand poses (Corso et al., 2023).
FFT methods in docking. Methods based on fast Fourier transforms have been widely applied for
the related problem of protein-protein docking. Katchalski-Katzir et al. (1992) first proposed using
FFTs over the translational space R3 to rapidly evaluate poses using scalar fields that encode the
shape complementarity of the two proteins. Later works extended this method to rotational degrees
of freedom (Ritchie & Kemp, 2000; Ritchie et al., 2008; Padhorny et al., 2016) and additional scoring
terms, such as pairwise electrostatic potentials and solvent accessibility (Gabb et al., 1997; Mandell
et al., 2001; Chen & Weng, 2002). Today, FFT methods are a routine step in established proteinprotein docking programs such as PIPER (Kozakov et al., 2006), ClusPro (Kozakov et al., 2017),
and HDOCK (Yan et al., 2020), where they enable the evaluation of billions of poses, typically as an
initial screening step before further evaluation and refinement with a more accurate scoring function.
In contrast, FFT methods have been significantly less studied for protein-ligand docking. While
a few works have explored this direction (Padhorny et al., 2018; Ding et al., 2020; Nguyen et al.,
2018), these algorithms have not been widely adopted nor been incorporated into established docking software. A key limitation is that protein-ligand scoring functions are typically more complicated
than protein-protein scoring functions and cannot be easily expressed as a cross-correlation between
scalar fields (Ding et al., 2020). To our knowledge, no prior works have explored the possibility of
overcoming this limitation by learning cross-correlation based scoring functions.

3

M ETHOD

3.1

E QUIVARIANT S CALAR F IELDS

We consider the inputs to a molecular docking problem to be a pair of protein structure and ligand
molecule, encoded as a featurized graphs GP , GL , and with the protein structure associated with
P
3×NP
alpha carbon coordinates XP = [xP
. The molecular docking problem is to
1 , . . . xNP ] ∈ R
L
L
L
find the ligand atomic coordinates X = [x1 , . . . xNL ] ∈ R3×NL of the true binding pose. To this
end, our aim is to parameterize and learn (multi-channel) scalar fields ϕP := ϕ(x; GP , XP ) and
ϕL := ϕL (x; GL , XL ) associated with the protein and ligand structures, respectively, such that the
scoring function evaluated on any pose XL ∈ R3NL is given by
XZ
P
P
L
L
L
3
E(XP , XL ) =
ϕP
(1)
c (x; G , X )ϕc (x; G , X ) d x
c

R3

3

where ϕc refers to the cth channel of the scalar field. While neural fields that directly learn functions R3 → R have been previously developed as encodings of molecular structures (Zhong et al.,
2019), such a formulation is unsuitable here as the field must be defined relative to the variable-sized
structure graphs GP , GL and transform appropriately with rigid-body motions of their coordinates.
Instead, we propose to parameterize the scalar field as a sum of contributions from each ligand
atom or protein alpha-carbon, where each contribution is defined by its coefficients in a spherical
harmonic expansion centered at that atom (or alpha-carbon) coordinate in 3D space. To do so, we
choose a set Rj : R+ → R of radial basis functions (e.g., Gaussian RBFs) in 1D and let Ymℓ be the
real spherical harmonics. Then we define


X
x − xn
ϕc (x; G, X) =
Acnjℓm (G, X)Rj (∥x − xn ∥)Yℓm
(2)
∥x − xn ∥
n,j,ℓ,m

where here (and elsewhere) we drop the superscripts L, P for common definitions. Given some
constraints on how the vector of coefficients Acnjℓm transforms under SE(3), this parameterization
of the scalar field satisfies the following important properties:
Proposition 1. Suppose the scoring function is parameterized
as in Equation 2 and for any R ∈
P
ℓ
ℓ
SO(3), t ∈ R3 we have Acnjℓm (G, R.X + t) = m′ Dmm
′ (R)Acnjℓm′ (G, X) where D (R) are
the (real) Wigner D-matrices, i.e., irreducible representations of SO(3). Then for any g ∈ SE(3),
1. The scalar field transforms equivariantly: ϕc (x; G, g.X) = ϕc (g −1 .x; G, X).
2. The scoring function is invariant: E(g.XP , g.XL ) = E(XP , XL ).
See Appendix A for the proof. We choose to parameterize Acnjℓm (G, R.X) with E3NN graph
neural networks (Thomas et al., 2018; Geiger & Smidt, 2022), which are specifically designed to
satisfy these equivariance properties and produce all coefficients in a single forward pass. The core
of our method consists of the training of two such equivariant scalar field networks (ESFs), one for
the ligand and one for the protein, which then parameterize their respective scalar fields. While
the second property (invariance of the scoring function) is technically the only one required by
the problem symmetries, the first property ensures that different ligand poses related by rigid-body
transformations can be evaluated via transformations of the scalar field itself (without re-evaluating
the neural network) and is thus essential to our method.
Next, we show how this parameterization enables ligand poses related by rigid body motions to some
reference pose to be rapidly evaluated with fast Fourier transforms (all derivations in Appendix A).
There are actually two ways to do so: we can evaluate the score of all poses generated by translations
of the reference pose, or via rotations around some fixed point (which we always choose to be the
center of mass of ligand). These correspond to FFTs over R3 and SO(3), respectively.
3.2

FFT OVER T RANSLATIONS

We first consider the space of poses generated by translations. Given some reference pose XL , the
score as a function of the translation is just the cross-correlation of the fields ϕL and ϕP :
XZ
X
P
L
L
3
P
E(X , X + t) =
ϕP
(ϕL
(3)
c (x)ϕc (x − t) d x =
c ⋆ ϕc )(t)
c

R3

c

where we have dropped the dependence on G, X for cleaner notation and applied Proposition 1. By
the convolution theorem, these cross-correlations may be evaluated using Fourier transforms:
n
 P o
1
P
ϕL
F −1 F [ϕL
(4)
c ⋆ ϕc =
c ] · F ϕc
3/2
(2π)
Hence, in order to simultaneously evaluate all possible translations of the reference pose, we need
to compute the Fourier transforms of the protein and ligand scalar fields. One naive way of doing so
would be to explicitly evaluate Equation 2 at an evenly-spaced grid of points spanning the structure
and then apply a fast Fourier transform. However, this would be too costly, especially during training
time. Instead, we observe that the functional form allows us to immediately obtain the Fourier
transform via the expansion coefficients Acnjℓm :
X
X
X
(−i)ℓ
Acnjℓm Hℓ [Rj ](∥k∥)Yℓm (k/∥k∥)
(5)
F [ϕc ] (k) =
e−ik·xn
n

ℓ

m,n

4

where now Yℓm must refer to the complex spherical harmonics and the coefficients must be transformed correspondingly, and
r Z ∞
2
Hℓ [Rj ](k) =
jℓ (kr)Rj (r)r2 dr
(6)
π 0
is the ℓth order spherical Bessel transform of the radial basis functions. Conceptually, this expression
corresponds to first evaluating the Fourier transform of each atom’s scalar field in a coordinate system centered at its location, and then translating it via multiplication with e−ik·xn in Fourier space.
Importantly, Hℓ [Rj ] and Yℓm can be precomputed and cached at a grid of points independently of
any specific structure, such that only the translation terms and expansion coefficients need to be
computed for every new example.
3.3

FFT OVER ROTATIONS

We next consider the space of poses generated by rotations. Suppose that given some reference pose
XL , the protein and ligand scalar fields are both expanded around some common coordinate system
origin using the complex spherical harmonics and a set of global radial basis functions Sj (r):
X
ϕc (x) =
Bcjℓm Sj (∥x∥)Yℓm (x/∥x∥)
(7)
j,ℓ,m

We seek to simultaneously evaluate the score of poses generated via rigid rotations of the ligand,
which (thanks again to Proposition 1) is given by the rotational cross-correlation
XZ
L
−1
E(XP , R.XL ) =
ϕP
x) d3 x
(8)
c (x)ϕc (R
c

R3

Cross-correlations of this form have been previously studied for rapid alignment of crystallographic
densities (Kovacs & Wriggers, 2002) and of signals on the sphere in astrophysics (Wandelt & Górski,
2001). It turns out that they can also be evaluated in terms of Fourier sums:
Z
X
L
−1
ℓ
ϕP
x) d3 x =
dℓmh dℓhn Imn
ei(mξ+hη+nω)
(9)
c (x)ϕc (R
R3

ℓ,m,h,n

where ξ, η, ω are related to the the Euler angles of the rotation R, dℓ is the (constant) Wigner Dmatrix for a rotation of π/2 around the y-axis, and
Z ∞
X
ℓ
P
L
Imn
=
Bcjℓm
Bckℓn
Gjk where Gjk =
Sj (r)Sk (r)r2 dr
(10)
0

j,k

Thus the main task is to compute the complex coefficients Bcjℓm of the ligand and protein scalar
fields, respectively. This is not immediate as the fields are defined using expansions in “local” radial
and spherical harmonic bases, i.e., with respect to the individual atom positions as opposed to the
coordinate system origin. Furthermore, since we cannot (in practice) use a complete set of radial
or angular basis functions, it is generally not possible to express the ligand or protein scalar field
as defined in Equation 2 using the form in Equation 7. Instead, we propose to find the coefficients
Bcjℓm that give the best approximation to the true scalar fields, in the sense of least squared error.
Specifically, suppose that R ∈ RNgrid ×Nlocal are the values of Nlocal real local basis functions (i.e.,
different origins, RBFs, and spherical harmonics) evaluated at Ngrid grid points and A ∈ RNlocal is
the vector of coefficients defining the scalar field ϕc . Similarly define S ∈ RNgrid ×Nglobal using the
real versions of the global basis functions. We seek to find the least-squares solution B ∈ RNglobal to
the overdetermined system of equations RA = SB, which is given by
B = (ST S)−1 ST RA

(11)

Notably, this is simply a linear transformation of the local coefficients Acnjℓm . Thus, if we can precompute the inverse Gram matrix of the global bases (ST S)−1 and the inner product of the global
and local bases ST R, then for any new scalar field ϕc the real global coefficients are immediately
available via a linear transformation. The desired complex coefficients can then be easily obtained
via a change of bases. At first glance, this still appears challenging due to the continuous space of
possible atomic or alpha-carbon positions, but an appropriate discretization makes the precomputation relatively inexpensive without a significant loss of fidelity.
5

3.4

T RAINING AND I NFERENCE

We now study how the rapid cross-correlation procedures presented thus far are used in training and
inference. For a given training example with protein structure XP , the scoring function E(XP , XL )
⋆
should ideally attain a maximum at the true ligand pose XL = XL . We equate this task to that
of learning an energy based
the log-likelihood of the true pose under the model
 model to maximize

likelihood p(XL ) ∝ exp E(XP , XL ) . However, as is typically the case for energy-based models,
directly optimizing this objective is difficult due to the intractable partition function.
Instead, following Corso et al. (2023), we conceptually decompose the ligand pose XL to be a tuple
XL = (XC , R, t) consisting of a zero-mean conformer XC , a rotation R, and a translation t, from
which the pose coordinates are obtained: XL = R.XC + t. Then consider the following conditional
log-likelihoods:
Z


log p(t | XC , R) = E(XP , XL ) − log
exp E(XP , R.XC + t′ ) d3 t′
(12a)
3
ZR


log p(R | XC , t) = E(XP , XL ) − log
exp E XP − t, R′ .XC dR′
(12b)
SO(3)

We observe that these integrands are precisely the cross-correlations in Equations 3 and 8, respectively, and can be quickly evaluated and summed for all values of t′ and R′ using fast Fourier
transforms. Thus, the integrals—which are the marginal likelihoods p(XC , R) and p(XC , t)—are
tractable and the conditional log-likelihoods can be directly optimized in order to train the neural
network. Although neither technically corresponds to the joint log-likelihood of the pose, we find
that these training objectives work well in practice and optimize their sum in our training procedure.
At inference time, a rigid protein structure XP is given and the high-level task is to score or optimize
candidate ligand poses XL . A large variety of possible workflows can be imagined; however, for
proof of concept and for our experiments in Section 4 we describe and focus on the following
relatively simple inference workflows (presented in greater detail in Appendix B):
• Translational FFT (TF). Given a conformer XC , we conduct a grid-based search over R
and use FFT to optimize t in order to find the best pose (XC , R, t). To do so, we compute
the Fourier coefficients (Equation 5) of the protein XP once and for each possible ligand
orientation R.XC . We then use translational cross-correlations (Equation 3) to find the best
translation t for each R and return the highest scoring combination.
• Rotational FFT (RF). Given a conformer XC , we conduct a grid-based search over t and
P
use FFT to optimize R. To do so, we compute the global expansion coefficients Bcjℓm
of
L
the protein X − t relative to each possible ligand position t and once for the ligand XC
relative to its (zero) center of mass (Equation 11). We then use rotational cross-correlations
(Equation 8) to find the best orientation R for each t and return the highest scoring combination.
• Translational scoring (TS). Here we instead are given a list of poses (XC , R, t) and wish
to score them. Because the values of R nor t may not satisfy a grid structure, we cannot
use the FFT methods. Nevertheless, we can compute the (translational) Fourier coefficients of the protein XP and for each unique oriented conformer R.XC of the ligand using
Equation 5. We then evaluate
XZ
L
C
−ik·t 3
F[ϕP
d k
(13)
E(XP , R.XC + t) =
c ](k) · F[ϕc ( · ; R.X )](k) · e
c

R3

Since the Fourier transform is an orthogonal operator on functional space, this is equal to
the real-space cross-correlation.
• Rotational scoring (RS). Analogously, we can score a list of poses (XC , R, t) using the
global spherical expansions Bcjℓm . We obtain the real expansion coefficients of the protein
relative to each t and for each ligand conformer XC using Equation 11. The score for
(XC , R, t) is then given by the rotational cross-correlation
X
P
L
ℓ
E(XP , R.XC + t) =
Bcjℓm
(XP − t)Bckℓn
(XC )Dmn
(R)Gjk
(14)
c,j,k,ℓ,m,n
ℓ
where Gjk is as defined in Equation 10 and Dmn
are the real Wigner D-matrices.

6

Table 1: Typical runtimes of the computations involved in inference-time scoring and optimization
procedures, measured on PDBBind with one V100 GPU. The three sets of rows delineate computations that are protein-specific, ligand-specific, or involve both protein and ligand, respectively.
Frequency
Per protein structure
,→ Per translation
Per ligand conformer
,→ Per rotation
Per conformer × rotation
Per conformer × translation
Per pose

Computation

TF

RF

TS

RS

Runtime

Coefficients Acnjℓm
FFT coefficients
Global expansion Bcjℓm

✓
✓

✓

✓
✓

✓

65 ms
7.0 ms
80 ms

Coefficients Acnjℓm
Global expansion Bcjℓm
FFT coefficients

✓

Translational FFT
Rotational FFT
Translational scoring
Rotational scoring

✓

✓
✓
✓

✓

✓
✓

✓
✓

✓
✓
✓
✓

4.3 ms
17 ms
1.6 ms
160 µs
650 µs
1.0 µs
8.2 µs

The runtime of these workflows can vary significantly depending on the parameters, i.e., number
of proteins, ligands, conformers, rotations, and translations, with amortizations possible at several
levels. Table 1 provides a summary of the computations in each workflow, their frequencies, and
typical runtimes. We highlight that the RF workflow is well-suited for virtual screening since the
precomputations for the protein and ligand translations within a pocket can be amortized across all
ligands. Furthermore, if the ligands are drawn from a shared library, their coefficients can also be
precomputed independent of any protein, leaving only the rotational FFT as the cost per ligandprotein pair. See Appendix F for further discussion on runtime amortization.

4

E XPERIMENTS

We train and test our model on the PDBBind dataset (Liu et al., 2017) with splits as defined by Stärk
et al. (2022). We train two variants of our model: ESF and ESF-N, where the latter is trained with
rotational and translational noise injected into the examples to increase model robustness. All other
model and inference-time hyperparameters are discussed in Appendix E. For the test set, we consider
both the co-crystal structures in the PDBBind test split and their counterpart ESMFold complexes as
prepared by Corso et al. (2023). We also collect a test set of 77 crystal structures (none of which are
in PDBBind) of phosphodiesterase 10A (PDE10A) with different ligands bound to the same pocket
(Tosstorff et al., 2022). This industrially-sourced dataset is representative of a real-world use case
for molecular docking and benchmarks the benefits of runtime amortization with our approach.
To evaluate our method against baselines, we note that a scoring function by itself is not directly
comparable to complete docking programs, which also include tightly integrated conformer search,
pose clustering, and local refinement algorithms. Here, however, we focus on the development of the
scoring function itself independently of these other components. Thus, we consider two simplified
settings for evaluating our model: (1) scoring decoy poses with the aim of identifying the best pose
among them, and (2) docking rigid conformers to a given pocket, similar to the re-docking setup
in Stärk et al. (2022). The first setting focuses on evaluating only the quality of the scoring function
itself, whereas the second is a simplified version of a typical docking setting that circumvents some
of the confounding factors while still allowing the benchmarking of FFT-accelerated optimization.
For the baselines, we select Gnina (McNutt et al., 2021) as the representative traditional docking
software, which runs parallel MCMC chains to collect pose candidates that are then refined and
re-ranked to produce the final prediction. Among the scoring functions supported by Gnina, we
evaluate its namesake CNN (Ragoza et al., 2017) scoring function and the traditional scoring function of Vina (Trott & Olson, 2010). For recent deep learning baselines, we adapt TANKBind (Lu
et al., 2022) and DiffDock (Corso et al., 2023) for decoy-pose scoring. Specifically, with DiffDock
we use the provided confidence model—which is natively trained as an <2 Å RMSD classifier—to
score all poses. With TANKBind, we use the predicted distance map and ground truth conformer
interatomic distances to calculate Lgeneration as the pose score. We do not evaluate these methods on
pocket-level conformer docking as they cannot be easily adapted for this task.
7

Table 2: Decoy scoring results (median over test complexes). All RMSDs are heavy-atom symmetry aware. The best results from our method (ESF) are underlined if not bolded. † These methods
have been adapted for the pose scoring task; see main text.
Crystal structures
Method

ESMFold structures

Time per

%
%
Top
Top
Top
Top
<2 Å
<2 Å
AUROC RMSD Rank <2 Å AUROC RMSD Rank <2 Å

Pose

Complex

Vina
Gnina

0.93
0.90

0.54
0.59

2
3

91
83

0.86
0.84

2.43
2.19

419
1110

43
46

3.4 ms
13.0 ms

110 s
426 s

TANKBind†
DiffDock†

0.69
0.96

4.01
0.66

6811
3

10
87

0.64
0.89

4.22
2.01

8538
143

9
50

1.1 µs
62 ms

62 ms
2041 s

ESF-TS
ESF-RS
ESF-N-TS
ESF-N-RS

0.87
0.87
0.92
0.92

0.59
0.63
0.69
0.75

3
3
4
5

87
85
81
80

0.82
0.82
0.87
0.87

1.38
1.75
1.64
1.74

24
22
22
26

57
53
54
53

1.0 µs
8.2 µs
1.0 µs
8.2 µs

3.2 s
5.7 s
3.2 s
5.7 s

4.1

S CORING D ECOYS

For each PDBBind test complex, we generate 323 − 1 = 32767 decoy poses by sampling 31 translational, rotational, and torsional perturbations to the ground truth pose and considering all their
possible combinations. On median, the RMSD of the closest decoy is 0.4 Å, and 1.6% of all poses
(n = 526.5) are below 2 Å RMSD (Appendix D.1). We then score all poses using the baselines
and with our method in both TS (Equation 13) and RS (Equation 14) modes. The quality of each
scoring function is evaluated with the AUROC when used as a <2Å RMSD classifier, the RMSD of
the top-ranked pose (Top RMSD), the rank of the lowest-RMSD pose (Top Rank), and the fraction
of complexes for which the top-ranked pose is under 2 Å RMSD.
As shown in Table 2, our method is competitive with the Gnina and Vina baselines on crystal
structures and better on ESMFold structures. This improved robustness is expected since the interaction terms in traditional scoring functions are primarily mediated by sidechain atoms, which
are imperfectly predicted by ESMFold, whereas our scalar fields only indirectly depend on the
sidechains via residue-level coefficients. Our method matches and outperforms the DiffDock confidence model on crystal and ESMFold structures, respectively, in terms of identifying the best poses,
although DiffDock obtains better <2 Å RMSD AUROC. Finally, our method substantially ourperforms TANKBind in all metrics. Overall, the performance of our method is superior in the TS mode
relative to RS, likely due to the spatially coarser representation of the scalar fields in the global
spherical harmonic expansion (i.e., Equation 7) relative to the grid-based Cartesian expansion.
In terms of runtime per pose, our method is faster than Vina by several orders of magnitude, with
even greater acceleration compared to the CNN-based Gnina and the E3NN-based DiffDock, both of
which require one neural network evaluation for every pose. The runtime improvement per complex
is more tempered since the different proteins and ligand in every complex limit the opportunity for
amortization. In fact, of the total runtime per complex in Table 2, only 1% (TS) to 5% (RS) is due
to the pose scoring itself, with the rest due to preprocessing that must be done for every new protein
and ligand independently. Similar to our method, TANKBind requires only a single neural network
evaluation per complex and thus has comparable per-pose runtime, but enjoys faster per-complex
runtime due to the lack of such preprocessing.
4.2

D OCKING C ONFORMERS

We consider the task of pocket-level docking where all methods are given as input the ground-truth
conformer in a random orientation. Following common practice (McNutt et al., 2021), we aim to
provide 4 Å of translational uncertainty around the true ligand pose in order to define the binding
pocket. To do so, we provide Gnina with a bounding box with 4 Å of padding around the true pose,
and provide our method with a cube of side length 8 Å as the search space for t (with a random grid
offset). For PDE10A, we define the pocket using the pose of the first listed complex (PDB 5SFS)
and cross-dock to that protein structure. In all docking runs, we deactivate all torsion angles so that
Gnina docks the provided conformer to the pocket.
8

Table 3: Rigid conformer docking results (median over test complexes). All RMSDs are heavyatom symmetry aware. The median RMSD of our method (ESF) is lower-bounded at 0.5–0.6 Å by
the resolution of the search grid (Appendix G.2). The runtime is shown as an average per complex,
excluding / including pre-computations that can be amortized.
PDBBind test
Crystal
Method

%
<2 Å

Vina
Gnina
ESF-TF
ESF-RF
ESF-N-TF
ESF-N-RF

ESMFold

PDE10A

Med.
RMSD

%
<2 Å

Runtime

%
<2 Å

Med.
RMSD

Med.
RMSD

Runtime

79
77

0.32
0.33

24
28

6.1
5.9

20 s
23 s

74
73

0.75
0.77

6.1 s
6.0 s

70
71
72
73

1.13
0.97
1.10
1.00

31
32
46
47

4.6
4.4
2.9
3.0

0.8 s / 8.3 s
0.5 s / 67 s
0.7 s / 8.2 s
0.5 s / 68 s

67
73
64
70

1.20
0.82
1.11
1.00

1.0 s / 7.1 s
0.5 s / 1.5 s
1.0 s / 7.2 s
0.5 s / 1.5 s

As shown in Table 3, the baseline scoring functions obtain excellent success rates on the PDBBind
crystal structures (79%). Our method is slightly weaker but also obtains high success rates (73%).
The performance decrease in terms of Median RMSD is somewhat larger, likely due to the coarse
search grid over non-FFT degrees of freedom (Appendix G.2) and the lack of any refinement steps
(which are an integral part of Gnina) in our pipeline. On ESMFold structures, however, our method
obtains nearly twice the success rate (47% vs 28%) of the baseline scoring functions. Unlike in
decoy scoring, noisy training noticeably contributes to the performance on ESMFold structures, and
the RF procedure generally outperforms TF, likely due to the relatively finer effective search grid
in rotational cross-correlations (Appendix G.2).
Because of the nature of the PDBBind workflow, the total runtime is comparable to or slower than the
baselines when precomputations are taken into account. However, in terms of the pose optimization
itself, our method is significantly faster than the Gnina baselines, despite performing a brute force
search over the non-FFT degrees of freedom. While it is also possible to trade-off performance and
runtime by changing various Gnina settings from their default values, our method expands the Pareto
front currently available with the Gnina pipeline (Appendix G.3; Figure 8). This favorable tradeoff
affirms the practical value-add of our method in the context of existing approaches.
To more concretely demonstrate the runtime improvements of our method with amortization, we
then dock the conformers in the PDE10A dataset. Our method again has similar accuracy to the
baselines (Table 3); however, because of the common pocket, all protein-level quantities are computed only once and the total runtime is significantly accelerated. For the RF procedure in particular,
the computation of protein global coefficients on the translational grid is by far the most expensive
step (Table 9), and the remaining ligand precomputations are very cheap. The amortization of these
coefficients leads to a 45x speedup in the overall runtime (67 s → 1.5 s). (The runtime for Gnina
is also accelerated, although to a lesser extent, due to the smaller ligand size.) As the number of
ligands increases further, the total runtime per complex of our method would further decrease.

5

C ONCLUSION

We have proposed a machine-learned based scoring function for accelerating pose optimization
in molecular docking. Different from existing scoring functions, the score is defined as a crosscorrelation between scalar fields, which enables the use of FFTs for rapid search and optimization. We have formulated a novel parameterization for such scalar fields with equivariant neural
networks, as well as training and inference procedures with opportunities for significant runtime
amortization. Our scoring function shows comparable performance but improved runtime on two
simplified docking-related tasks relative to standard optimization procedures and scoring functions.
Thus, our methodology holds promise when integrated with other components into a full docking
pipeline. These integrations may include multi-resolution search, refinement with traditional scoring
functions, and architectural adaptations for conformational (i.e., torsional) degrees of freedom—all
potential directions of future work.

9

ACKNOWLEDGMENTS
We thank Hannes Stärk, Samuel Sledzieski, Ruochi Zhang, Michael Brocidiacono, Gabriele Corso,
Xiang Fu, Felix Faltings, Ameya Daigavane, and Mario Geiger for helpful feedback and discussions.
This work was supported by the NIH NIGMS under grant #1R35GM141861 and a Department of
Energy Computational Science Graduate Fellowship.

R EFERENCES
Rong Chen and Zhiping Weng. Docking unbound proteins using shape complementarity, desolvation, and electrostatics. Proteins: Structure, Function, and Bioinformatics, 47(3):281–294, 2002.
Gabriele Corso, Hannes Stärk, Bowen Jing, Regina Barzilay, and Tommi S Jaakkola. Diffdock:
Diffusion steps, twists, and turns for molecular docking. In The Eleventh International Conference
on Learning Representations, 2023.
Kevin Crampon, Alexis Giorkallos, Myrtille Deldossi, Stéphanie Baud, and Luiz Angelo Steffenel.
Machine-learning methods for ligand–protein molecular docking. Drug discovery today, 27(1):
151–164, 2022.
Xinqiang Ding, Yujin Wu, Yanming Wang, Jonah Z Vilseck, and Charles L Brooks III. Accelerated
cdocker with gpus, parallel simulated annealing, and fast fourier transforms. Journal of chemical
theory and computation, 16(6):3910–3919, 2020.
Jiyu Fan, Ailing Fu, and Le Zhang. Progress in molecular docking. Quantitative Biology, 7:83–89,
2019.
Leonardo G Ferreira, Ricardo N Dos Santos, Glaucius Oliva, and Adriano D Andricopulo. Molecular docking and structure-based drug design strategies. Molecules, 20(7):13384–13421, 2015.
Henry A Gabb, Richard M Jackson, and Michael JE Sternberg. Modelling protein docking using
shape complementarity, electrostatics and biochemical information. Journal of molecular biology,
272(1):106–120, 1997.
Mario Geiger and Tess Smidt. e3nn: Euclidean neural networks. arXiv preprint arXiv:2207.09453,
2022.
Thomas A Halgren, Robert B Murphy, Richard A Friesner, Hege S Beard, Leah L Frye, W Thomas
Pollard, and Jay L Banks. Glide: a new approach for rapid, accurate docking and scoring. 2.
enrichment factors in database screening. Journal of medicinal chemistry, 2004.
Huaipan Jiang, Mengran Fan, Jian Wang, Anup Sarma, Shruti Mohanty, Nikolay V Dokholyan,
Mehrdad Mahdavi, and Mahmut T Kandemir. Guiding conventional protein–ligand docking software with convolutional neural networks. Journal of Chemical Information and Modeling, 60
(10):4594–4602, 2020.
Bowen Jing, Gabriele Corso, Jeffrey Chang, Regina Barzilay, and Tommi Jaakkola. Torsional diffusion for molecular conformer generation. arXiv preprint arXiv:2206.01729, 2022.
Ephraim Katchalski-Katzir, Isaac Shariv, Miriam Eisenstein, Asher A Friesem, Claude Aflalo, and
Ilya A Vakser. Molecular surface recognition: determination of geometric fit between proteins
and their ligands by correlation techniques. Proceedings of the National Academy of Sciences, 89
(6):2195–2199, 1992.
Julio A Kovacs and Willy Wriggers. Fast rotational matching. Acta Crystallographica Section D:
Biological Crystallography, 58(8):1282–1286, 2002.
Dima Kozakov, Ryan Brenke, Stephen R Comeau, and Sandor Vajda. Piper: an fft-based protein
docking program with pairwise potentials. Proteins: Structure, Function, and Bioinformatics, 65
(2):392–406, 2006.
10

Dima Kozakov, David R Hall, Bing Xia, Kathryn A Porter, Dzmitry Padhorny, Christine Yueh,
Dmitri Beglov, and Sandor Vajda. The cluspro web server for protein–protein docking. Nature
protocols, 12(2):255–278, 2017.
Hongjian Li, Kam-Heung Sze, Gang Lu, and Pedro J Ballester. Machine-learning scoring functions
for structure-based virtual screening. Wiley Interdisciplinary Reviews: Computational Molecular
Science, 11(1):e1478, 2021.
Zhihai Liu, Minyi Su, Li Han, Jie Liu, Qifan Yang, Yan Li, and Renxiao Wang. Forging the basis
for developing protein–ligand interaction scoring functions. Accounts of Chemical Research, 50
(2):302–309, 2017.
Wei Lu, Qifeng Wu, Jixian Zhang, Jiahua Rao, Chengtao Li, and Shuangjia Zheng. Tankbind:
Trigonometry-aware neural networks for drug-protein binding structure prediction. Advances in
neural information processing systems, 2022.
Jeffrey G Mandell, Victoria A Roberts, Michael E Pique, Vladimir Kotlovyi, Julie C Mitchell, Erik
Nelson, Igor Tsigelny, and Lynn F Ten Eyck. Protein docking using continuum electrostatics and
geometric fit. Protein engineering, 14(2):105–113, 2001.
Andrew T McNutt, Paul Francoeur, Rishal Aggarwal, Tomohide Masuda, Rocco Meli, Matthew
Ragoza, Jocelyn Sunseri, and David Ryan Koes. Gnina 1.0: molecular docking with deep learning. Journal of cheminformatics, 13(1):1–20, 2021.
Oscar Méndez-Lucio, Mazen Ahmad, Ehecatl Antonio del Rio-Chanona, and Jörg Kurt Wegner.
A geometric deep learning approach to predict binding conformations of bioactive molecules.
Nature Machine Intelligence, 3(12):1033–1039, 2021.
Elaine C Meng, Brian K Shoichet, and Irwin D Kuntz. Automated docking with grid-based energy
evaluation. Journal of computational chemistry, 13(4):505–524, 1992.
Garrett M Morris and Marguerita Lim-Wilby. Molecular docking. In Molecular modeling of proteins, pp. 365–382. Springer, 2008.
Garrett M Morris, David S Goodsell, Robert S Halliday, Ruth Huey, William E Hart, Richard K
Belew, and Arthur J Olson. Automated docking using a lamarckian genetic algorithm and an
empirical binding free energy function. Journal of computational chemistry, 19(14):1639–1662,
1998.
Trung Hai Nguyen, Huan-Xiang Zhou, and David DL Minh. Using the fast fourier transform in
binding free energy calculations. Journal of computational chemistry, 39(11):621–636, 2018.
Dzmitry Padhorny, Andrey Kazennov, Brandon S Zerbe, Kathryn A Porter, Bing Xia, Scott E
Mottarella, Yaroslav Kholodov, David W Ritchie, Sandor Vajda, and Dima Kozakov. Protein–
protein docking by fast generalized fourier transforms on 5d rotational manifolds. Proceedings of
the National Academy of Sciences, 113(30):E4286–E4293, 2016.
Dzmitry Padhorny, David R Hall, Hanieh Mirzaei, Artem B Mamonov, Mohammad Moghadasi,
Andrey Alekseenko, Dmitri Beglov, and Dima Kozakov. Protein–ligand docking using fft based
sampling: D3r case study. Journal of computer-aided molecular design, 32:225–230, 2018.
Rodrigo Quiroga and Marcos A Villarreal. Vinardo: A scoring function based on autodock vina
improves scoring, docking, and virtual screening. PloS one, 11(5):e0155183, 2016.
Matthew Ragoza, Joshua Hochuli, Elisa Idrobo, Jocelyn Sunseri, and David Ryan Koes. Protein–
ligand scoring with convolutional neural networks. Journal of chemical information and modeling, 57(4):942–957, 2017.
David W Ritchie and Graham JL Kemp. Protein docking using spherical polar fourier correlations.
Proteins: Structure, Function, and Bioinformatics, 39(2):178–194, 2000.
David W Ritchie, Dima Kozakov, and Sandor Vajda. Accelerating and focusing protein–protein
docking correlations using multi-dimensional rotational fft generating functions. Bioinformatics,
24(17):1865–1873, 2008.
11

Brian K Shoichet, Irwin D Kuntz, and Dale L Bodian. Molecular docking using shape descriptors.
Journal of computational chemistry, 13(3):380–397, 1992.
Hannes Stärk, Octavian Ganea, Lagnajit Pattanaik, Regina Barzilay, and Tommi Jaakkola. Equibind:
Geometric deep learning for drug binding structure prediction. In International Conference on
Machine Learning, pp. 20503–20521. PMLR, 2022.
Minyi Su, Qifan Yang, Yu Du, Guoqin Feng, Zhihai Liu, Yan Li, and Renxiao Wang. Comparative
assessment of scoring functions: the casf-2016 update. Journal of chemical information and
modeling, 59(2):895–913, 2018.
Nathaniel Thomas, Tess Smidt, Steven Kearnes, Lusann Yang, Li Li, Kai Kohlhoff, and Patrick
Riley. Tensor field networks: Rotation-and translation-equivariant neural networks for 3d point
clouds. arXiv preprint, 2018.
Benjamin I Tingle, Khanh G Tang, Mar Castanon, John J Gutierrez, Munkhzul Khurelbaatar, Chinzorig Dandarchuluun, Yurii S Moroz, and John J Irwin. Zinc-22– a free multi-billion-scale
database of tangible compounds for ligand discovery. Journal of Chemical Information and Modeling, 63(4):1166–1176, 2023.
Pedro HM Torres, Ana CR Sodero, Paula Jofily, and Floriano P Silva-Jr. Key topics in molecular
docking for drug design. International journal of molecular sciences, 20(18):4574, 2019.
Andreas Tosstorff, Markus G Rudolph, Jason C Cole, Michael Reutlinger, Christian Kramer, Hervé
Schaffhauser, Agnès Nilly, Alexander Flohr, and Bernd Kuhn. A high quality, industrial data
set for binding affinity prediction: performance comparison in different early drug discovery
scenarios. Journal of Computer-Aided Molecular Design, 36(10):753–765, 2022.
Oleg Trott and Arthur J Olson. Autodock vina: improving the speed and accuracy of docking with
a new scoring function, efficient optimization, and multithreading. Journal of computational
chemistry, 31(2):455–461, 2010.
Benjamin D Wandelt and Krzysztof M Górski. Fast convolution on the sphere. Physical review D,
63(12):123002, 2001.
Wikipedia. Hankel transform — Wikipedia, the free encyclopedia, 2023. URL https:
//en.wikipedia.org/wiki/Hankel_transform#Fourier_transform_in_
three_dimensions.
Yumeng Yan, Huanyu Tao, Jiahua He, and Sheng-You Huang. The hdock server for integrated
protein–protein docking. Nature protocols, 15(5):1829–1852, 2020.
Chao Yang, Eric Anthony Chen, and Yingkai Zhang. Protein–ligand docking in the machinelearning era. Molecules, 27(14):4568, 2022.
Anna Yershova, Swati Jain, Steven M Lavalle, and Julie C Mitchell. Generating uniform incremental
grids on so (3) using the hopf fibration. The International journal of robotics research, 29(7):801–
812, 2010.
Yangtian Zhang, Huiyu Cai, Chence Shi, Bozitao Zhong, and Jian Tang. E3bind: An end-to-end
equivariant network for protein-ligand docking. arXiv preprint arXiv:2210.06069, 2022.
Ellen D Zhong, Tristan Bepler, Joseph H Davis, and Bonnie Berger. Reconstructing continuous distributions of 3d protein structure from cryo-em images. In International Conference on Learning
Representations, 2019.

12

A

M ATHEMATICAL D ETAILS

A.1

P ROOF OF P ROPOSITION 1

Proposition 1. Suppose the scoring function is parameterized
as in Equation 2 and for any R ∈
P
ℓ
ℓ
SO(3), t ∈ R3 we have Acnjℓm (G, R.X + t) = m′ Dmm
′ (R)Acnjℓm′ (G, X) where D (R) are
the (real) Wigner D-matrices, i.e., irreducible representations of SO(3). Then for any g ∈ SE(3),
1. The scalar field transforms equivariantly: ϕc (x; G, g.X) = ϕc (g −1 .x; G, X).
2. The scoring function is invariant: E(g.XP , g.XL ) = E(XP , XL ).
Proof. Let the action of g = (R, t) ∈ SE(3) be written as g : x 7→ Rx + t and hence g −1 : x 7→
RT (x − t). We first note that ∥x − g.xn ∥ = ∥g −1 .x − xn ∥ and RT (x − g.xn ) = g −1 .x − xn . Then


X
x − g.xn
m
ϕc (x; G, g.X) =
Acnjℓm (G, R.X + t)Rj (∥x − g.xn ∥)Yℓ
∥x − g.xn ∥
n,j,ℓ,m


X
X
x − g.xn
ℓ
m
=
Acnjℓm′ (G, X)Rj (∥x − g.xn ∥)
Dmm′ (R)Yℓ
∥x − g.xn ∥
m
n,j,ℓ,m′


X
′
RT (x − g.xn )
=
Acnjℓm′ (G, X)Rj (∥x − g.xn ∥)Yℓm
∥x − g.xn ∥
n,j,ℓ,m′


X
g −1 .x − xn
−1
m′
′
=
Acnjℓm (G, X)Rj (∥g .x − xn ∥)Yℓ
∥g −1 .x − xn ∥
′
n,j,ℓ,m

= ϕc (g −1 .x; G, X)
Next,
E(g.xP , g.xL ) =

XZ
c

=

XZ
c

=

P
P
L
L
L
3
ϕP
c (x; G , g.X )ϕc (x; G , g.X ) d x

R3

−1
−1
ϕP
x; GP , XP )ϕL
x; GL , XL ) d3 x
c (g
c (g

R3

XZ
c

′
P
P
L ′
L
L
3 ′
ϕP
c (x ; G , X )ϕc (x ; G , X ) d x

R3

where the last line has substitution x′ = g −1 x with g volume preserving on R3 .
A.2

D ERIVATIONS

In this section we describe the derivations for the various equations presented in the main text. We
use the following convention for the (one-dimensional) Fourier transform and its inverse:
Z
1
F[f ](k) = √
e−ikx f (x) dx
(15a)
2π
Z
1
F −1 [f ](x) = √
eikx f (k) dk
(15b)
2π
Equation 5 It is well known (Wikipedia, 2023) that given a function over R3 with complex spherical harmonic expansion
X
f (r) =
fℓ,m (∥r∥)Yℓm (r/∥r∥)
(16)
ℓ,m

its Fourier transform is given by
f (k) =

X

(−i)ℓ Fℓ,m (∥k∥)Yℓm (k/∥k∥)

ℓ,m

13

(17)

where

Z ∞
√
1
Fℓ,m (k) = √
rfℓ,m (r)Jℓ+1/2 (kr) r dr
(18)
k 0
with Jℓ the ℓth -orderp
Bessel function of the first kind. Relating these to the spherical Bessel functions
jℓ via Jℓ+1/2 (x) = 2x/πjℓ (x), we obtain
r Z ∞
2
Fℓ,m (k) =
fℓ,m (r)jℓ (kr) r2 dr := Hℓ [fℓ,m ](k)
(19)
π 0
which is the form of Equation 6. To apply this to our scalar fields, we define the translation operator
Tr [f ](x) = f (x − r) and note its composition with the Fourier transform
(F ◦ Tr )[f ] = e−ik·r F[f ]
(20)
We then decompose the form of our scalar fields (Equation 2) into contributions from zero-origin
spherical harmonic expansions
X
ϕc (x) =
Txn [ϕcn ](x)
(21a)
n

ϕcn (x) =

XX

Acnjℓm Rj (∥x∥) Yℓm (x/∥x∥)

(21b)

j

ℓ,m

|

{z

}

ϕcnℓm (∥x∥)

Hence, the Fourier transform of each contribution is
X
F[ϕcn ](k/∥k∥) =
(−i)ℓ Hℓ [ϕcnℓm ](∥k∥)Yℓm (k/∥k∥)

(22)

ℓ,m

Equation 5 is then obtained via Equation 20 and the linearity of the Fourier and spherical Bessel
transforms.
Equation 9 We source (with some modifications) the derivation from Kovacs & Wriggers (2002).
We consider the cross-correlation
Z
c(R) =

ϕ(x)ψ(R−1 x) d3 x

(23)

R3
L
L
which is the same as Equation 8 with ϕ = ϕP
c and ψ = ϕc since ϕc is a real field. Expanding in
m
complex spherical harmonics Yℓ and radial bases Sj :

ϕ(x) =

X

Φjℓm Sj (∥x∥)Yℓm (x/∥x∥)

X

Ψjℓm Sj (∥x∥)Yℓm (x/∥x∥)

(24)

[Sj · Sj ′ ](∥x∥)[Yℓm · Yℓn′ ](x/∥x∥) d3 x

(25a)

ψ(x) =

j,ℓ,m

j,ℓ,m

We then obtain
X

c(R) =

ℓ
Dnm
′ (R)Φjℓm Ψj ′ ℓ′ m′

Z
R3

j,j ′ ,ℓ,ℓ′ ,m,n,m′

X

=

ℓ
Dnm
′ (R)Φjℓm Ψj ′ ℓ′ m′

j,j ′ ,ℓ,ℓ′ ,m,n,m′

=

X
ℓ,m,m′

ℓ
Dmm
′ (R)

X

Z ∞
|0

Z
[Yℓm · Yℓn′ ](r̂) dr̂
[Sj · Sj ′ ](r) r2 dr
S2
{z
}|
{z
}
Gjj ′

(25b)

δℓℓ′ δmn

Φjℓm Ψj ′ ℓm′ Gjj ′

(25c)

j,j ′

|

{z

ℓ
Imm
′

}

Now to evaluate the complex Wigner D-matrix, we adopt the extrinsic zyz convention for Euler
angles (applied right-to-left) and note that any rotation (ϕ, θ, ψ) can be decomposed as
R(ϕ, θ, ψ) = Rz (ϕ − π/2)Ry (π/2)Rz (π − θ)Ry (π/2)Rz (ψ − π/2)
(26)
| {z }
| {z }
| {z }
η

ξ

ω

Next, one can easily check (using the standard spherical harmonics) that the Wigner D-matrix for a
ℓ
rotation about the z-axis is diagonal and given by Dmn
(Rz (ω)) = δmn e−inω . Hence,
ℓ
Dmn
(R(ϕ, θ, ψ)) = e−imξ dℓmh e−hη dℓhn e−iωn
(27)
ℓ
ℓ
where d = D (Ry (π/2)) are constant and real. Complex conjugation then gives Equation 9.

14

Equation 12 The conditional likelihood is
log p(t | XC , R) = log

p(XC , R, t)
p(XC , R)

(28a)

= log p(XC , R, t) − log

Z

p(XC , R, t′ ) d3 t′

(28b)


exp E(XP , R.XC + t′ ) d3 t′

(28c)

R3

= log E(XP , XL ) − log

Z
R3

Similarly,
log p(R | XC , t) = log

p(XC , R, t)
p(XC , t)

(29a)

= log p(XC , R, t) − log

Z

p(XC , R′ , t) dR′

(29b)


exp E(XP , R.XC + t) dR′

(29c)

SO(3)

= log E(XP , XL ) − log

Z
SO(3)

Finally, we move t to the protein coordinates (invoking the invariance of the score E) to obtain a
form consistent with the rotational cross-correlations (Equation 8).
Equation 13 Given a pose XL = R.XC + t, we evaluate
XZ
L
C
3
E(XP , R.XC + t) =
ϕP
c (x)ϕc (x; R.X + t) d x

(30)

R3

c

The functional inner product is equivalent in Fourier space:
XZ
L
C
3
E(XP , R.XC + t) =
F[ϕP
c ](k) · F[ϕc ( · ; R.X + t)](k) d k

(31)

R3

c

Then with the translation operator T defined previously,
C
C
ϕL
c (x; R.X + t) = Tt [ϕ( · ; R.X )](x)

(32a)

C
−ik·t
C
F[ϕL
F[ϕL
c ( · ; R.X + t)](k) = e
c ( · ; R.X )](k)

(32b)

We then substitute into Equation 31 to obtain Equation 13.
P
Equation 14 Given a pose XL = R.XC + t, we assume that the field ϕP
c ( · ; X − t) and
C
L
ϕc ( · ; X ) are written in the real global spherical harmonic expansion:
X
P
P
ϕP
Bcjℓm
Sj (∥x∥)Yℓm (x/∥x∥)
(33a)
c (x; X − t) =
j,ℓ,m
C
ϕL
c (x; X ) =

X

L
Bcjℓm
Sj (∥x∥)Yℓm (x/∥x∥)

(33b)

j,ℓ,m

Then, analogously to Equation 25,
E(XP , R.XC + t) = E(XP − t, R.XC )
XZ
P
L
−1
=
ϕP
x; XC ) d3 x
c (x; X − t)ϕc (R
c

=

(34a)
(34b)

R3

X

ℓ
Dmm
′ (R)

X

P
L
Bcjℓm
Bcj
′ ℓm′ Gjj ′

(34c)

j,j ′

c,ℓ,m,m′

Complex conjugation has been omitted because the coefficients and D-functions are now real.

15

B

A LGORITHMIC D ETAILS

Below, we present in detail the four inference procedures introduced in Section 3.4. The three
blocks of computations are color-coded corresponding to protein preprocessing (green), ligand preprocessing (blue), and the core computation (red) and labelled with typical runtimes from Table 1
(unlabelled lines have negligible runtime). The various loop levels make clear that depending on
the workflow, the protein and ligand processing precomputations can be amortized and approaches
a negligible fraction of the total runtime. Note, however, that for readability we have presented
the algorithms assuming that all possible combinations (i.e., of proteins, ligand conformers, rotations, and translations) are of interest; if this is not true (for example in PDBBind, or in any typical
pose-scoring setting), then the full benefits of amortization may not be fully realized.
Algorithm 1: T RANSLATIONAL FFT
L
L
P
Input: Proteins {(GP
i , Xi )}, conformers {(Gh , Xh )}
P
L
Output: Docked poses (Xi , Xih ) ∀i, h
P
foreach (GP
// protein preprocessing
i , Xi ) do
P
P
P
Compute coefficients AP
// 65 ms
i = {Acjnℓm (Gi , Xi )} with neural network ;
P
Compute Fourier-space field values F[ϕP ]i using AP
,
x
;
//
7.0 ms
i
i
L
foreach (GL
// ligand preprocessing
h , Xh ) do
L
L
L
Compute coefficients AL
// 4.3 ms
h = {Acjnℓm (Gh , Xh )} with neural network ;
foreach Rk ∈ {R}grid ⊂ SO(3) do
ℓ
Compute rotated coefficients AL
h,k using D (Rk );
L
L
Compute Fourier-space field values F[ϕ ]h,k using AL
// 1.6 ms
h,k , Rk Xh ;
P
foreach (GP
i , Xi ) do
L
foreach (Gh , XL
h ) do
foreach Rk ∈ {R}grid ⊂ SO(3) do
L
Compute E(XP
i , Rk Xh + t), ∀t using FFT;
⋆ ⋆
L
Ek , tk ← {max, arg max}t E(XP
i , Rk Xh + t) ;
k ⋆ ← arg maxk Ek⋆ ;
L
⋆
XL
ih ← Rk⋆ Xh + tk⋆ ;

// pose optimization

// 160 µs

Algorithm 2: ROTATIONAL FFT
P
L
L
Input: Proteins {(GP
i , Xi )}, conformers {(Gh , Xh )}
P
L
Output: Docked poses (Xi , Xih ) ∀i, h
P
foreach (GP
// protein preprocessing
i , Xi ) do
P
P
P
Compute coefficients AP
// 65 ms
i = {Acjnℓm (Gi , Xi )} with neural network ;
for tk ∈ {t}grid ⊂ R3 do
P
P
Compute global expansion BP
// 80 ms
i,k = {Bcjℓm } from Ai , Xi − tk ;
L
foreach (GL
// ligand preprocessing
h , Xh ) do
L
L
L
Compute coefficients AL
=
{A
(G
,
X
)}
with
neural
network ;
// 4.3 ms
h
cjnℓm
h
h
L
L
Compute global expansion BL
=
{B
}
from
A
,
X
;
// 17 ms
cjℓm
h
h
h
P
foreach (GP
// pose optimization
i , Xi ) do
L
foreach (Gh , XL
)
do
h
foreach tk ∈ {t}grid ⊂ R3 do
L
Compute E(XP
// 650 µs
i − tk , R.Xh ), ∀R using FFT ;
⋆
⋆
L
Ek , Rk ← {max, arg max}R E(XP
−
t
,
R.X
+
t)
;
k
i
h
k ⋆ ← arg maxk Ek⋆ ;
⋆
L
XL
ih ← Rk⋆ Xh + tk⋆ ;

16

Algorithm 3: T RANSLATIONAL S CORING
P
L
L
Input: Proteins {(GP
i , Xi )}, conformers {(Gh , Xh )}, rotations {Rk }, translations {tℓ }
P
L
Output: Scores E(Xi , Rk Xh + tℓ ) ∀i, h, k, ℓ
P
foreach (GP
// protein preprocessing
i , Xi ) do
P
P
P
// 65 ms
Compute coefficients AP
i = {Acjnℓm (Gi , Xi )} with neural network ;
P
Compute Fourier-space field values F[ϕP ]i using AP
,
x
;
//
7.0 ms
i
i
L
foreach (GL
// ligand preprocessing
h , Xh ) do
L
L
L
Compute coefficients AL
=
{A
(G
,
X
)}
with
neural
network ;
// 4.3 ms
h
cjnℓm
h
h
foreach Rk do
ℓ
Compute rotated coefficients AL
h,k using D (Rk );
L
Compute Fourier-space field values F[ϕL ]h,k using AL
// 1.6 ms
h,k , Rk Xh ;
P
foreach (GP
i , Xi ) do
L
foreach (GL
h , Xh ) do
foreach Rk do
foreach tℓ do
L
Compute E(XP
i , Rk Xh + tℓ ) using Equation 13;

// scoring

// 1.0 µs

Algorithm 4: ROTATIONAL S CORING
P
L
L
Input: Proteins {(GP
i , Xi )}, conformers {(Gh , Xh )}, rotations {Rk }, translations {tℓ }
P
L
Output: Scores E(Xi , Rk Xh + tℓ ) ∀i, h, k, ℓ
P
foreach (GP
// protein preprocessing
i , Xi ) do
P
P
P
Compute coefficients AP
=
{A
(G
,
X
)}
with
neural
network ;
// 65 ms
i
i
i
cjnℓm
3
for tk ∈ {t}grid ⊂ R do
P
P
Compute global expansion BP
// 80 ms
i,k = {Bcjℓm } from Ai , Xi − tk ;
L
foreach (GL
// ligand preprocessing
h , Xh ) do
L
L
L
Compute coefficients AL
// 4.3 ms
h = {Acjnℓm (Gh , Xh )} with neural network ;
L
L
Compute global expansion BL
=
{B
}
from
A
,
X
;
// 17 ms
cjℓm
h
h
h
P
foreach (GP
i , Xi ) do
L
foreach (Gh , XL
h ) do
foreach Rk do
foreach tℓ do
L
Compute E(XP
i , Rk Xh + tℓ ) using Equation 14;

17

// scoring

// 8.2 µs

C

L EARNED S CALAR F IELDS

6qqw

6jap

6np2

Figure 2: Visualizations of learned scalar fields. All five channels of the ESF-N learned scalar
fields ϕL (top row) and ϕP (bottom row) are shown on the xy-plane passing through the center of
mass of the ligand, with a box diameter of 20 Å. Positive values of the field are in blue and negative
values in red. At left, the ligand and pocket structures are shown looking down the z-axis. Note that
as the fields are only 2D slices, not all 3D features visible in the structures are visible in the fields.

18

6uvp

6oxq

6jsn

6moa

Figure 3: Visualizations of learned scalar fields, continued.

19

D

E XPERIMENTAL D ETAILS

D.1

D ECOY S ET
⋆

Given a zero-mean ground-truth ligand pose XL , we generate 323 − 1 = 32767 decoy poses via
the following procedure.
• Sample 31 translational pertubations: ti ∼ N (0, I3 ), i = 1 . . . 31 and set t0 = 0, with
units in Å.
• Sample 31 rotational perturbations: Rj = FromRotvec(rj ), rj ∼ N (0, 0.5I3 ), j =
1 . . . 31 and set R0 = I3 .
• Sample 31 noisy conformers XC
k , k = 1 . . . 31 by sampling torsional updates ∆τk ∼
NT (0, (π/2)Im ) where NT is a wrapped normal distribution (Jing et al., 2022) and m is
the number of torsion angles. The torsional updates are applied to the smaller side of the
L⋆
molecule. Set XC
.
0 =X
⋆

C
L
L
• Set XL
.
ijk = Rj Xk + ti , i, j, k = 0 . . . 31 and discard X000 = X

PDB ID 6A73 is excluded from the procedure due to the high level of graph symmetry and significant
runtime for computing RMSDs for all decoys. Summary statistics for the decoy sets of the remaining
362 PDB IDs are presented in Figure 4.

1.0
0.8

0

5

10

15

20

CDF

0.6
0.4

0.002

0.2
0.000
0.0 0.2 0.4 0.6 0.8 1.0

0.0

0

RMSD

5

10

0

2
15

20

Figure 4: Decoy set statistics. Top left: histogram of RMSDs across all decoys sets (12 M total).
Bottom left: histogram of minimum RMSDs among the decoy sets. All sets have a pose less than
RMSD <1 Å from the true pose. Right: cumulative density function of RMSDs in each decoy set.
Bottom right inset: all decoy sets have at least 23 poses with RMSD <2 Å.
D.2

RUNTIME M EASUREMENTS

All runtime measurements were performed on a machine with 64 Intel Xeon Gold 6130 CPUs and
8 Nvidia Tesla V100 GPUs. Gnina was run with default thread count settings. All of our processes
were run on a single V100 GPU. For our method, we performed runtime analysis using CUDA
events to remove the effects of asynchronous CUDA execution. Script loading, model loading, and
algorithmic-level precomputations (which, if necessary, can be cached on disk) were excluded from
the analysis. For Gnina, we attempted to remove similar overhead by timing single-pose scoringonly runs as representative of constant overhead costs. We report conformer docking runtimes in
Table 3 using the PDBBind crystal structures; ESMFold runtimes are marginally shorter. Typical
20

runtimes reported in Table 1 and Appendix B are obtained from timing runs with our method across
the entire PDBBind test set.
D.3

DATASETS

As noted previously, we use train, validation, and test splits from Stärk et al. (2022). However, due
to RDKit parsing issues with Gnina-docked poses, the following 30 complexes are excluded (leaving
333 remaining) from all rigid conformer docking comparisons against Gnina: 6HZB, 6E4C, 6PKA,
6E3P, 6OXT, 6OY0, 6HZA, 6E6W, 6OXX, 6HZD, 6K05, 6NRH, 6OXW, 6RTN, 6D3Z, 6HLE,
6PY0, 6OXS, 6E3O, 6HZC, 6Q38, 6E7M, 6OIE, 6D3Y, 6D40, 6UHU, 6CJP, 6E3N, 6Q4Q, 6D3X.
Scoring comparisons include all test complexes except 6A73, for which decoy poses could not be
generated.
We download the 77 PDB IDs provided in Tosstorff et al. (2022) from the PDB to form the PDE10A
dataset, keeping the A chain of each assymetric unit and the Ligand of Interest (LOI) interacting
with it. We then align all ligands to the crystal structure of 5SFS using the procedure described
in Corso et al. (2023) for aligning ESMFold structures, except transforming the ligand rather than
the protein. This constitutes the construction of a cross-docking dataset due to the use of the same
pocket for all ligands. Due to RDKit parsing errors with the Gnina-docked poses, the following 7
PDB IDs are excluded from all comparisons: 5SFA, 5SED, 5SFO, 5SEV, 5SF9, 5SDX, 5SFC. The
remaining 70 ligands are shown superimposed on the 5SFS pocket in Figure 5.

Figure 5: PDE10A ligands aligned on 5SFS

21

E

H YPERPARAMETERS

Model hyperparameters The main model hyperparameters are the aspects of the scalar field parameterization (Equation 2) which dictate the number of coefficients Acnjℓm predicted by the neural
network. Intuitively, these determine the expressivity or “intrinsic resolution” of the scalar field
which can be parameterized by the ESF neural network. These hyperparameters and their search
spaces are detailed in Table 4, with full supporting experimental results in Appendix G.1.
Our data featurization and E3NN neural network architecture are largely adapted from the default
settings of Corso et al. (2023). Both the protein and ligand neural networks operate on all heavy
atom nodes; however, in the protein network only the alpha-carbons nodes emit Acnjℓm coefficients
that contribute to the scalar field. For simplicity, we also omit ESM features.
Table 4: Training-time model hyperparameters and their search spaces. The parameters corresponding to experiments in the main text are in bold.
Parameter

Search Space

# scalar field channels
# radial basis functions
RBF cutoff radius
Order of spherical harmonics

3, 5, 8
3, 5, 8
3 Å, 5 Å, 8 Å
1, 2, 3

Inference hyperparameters The primary inference-time hyperparameters control the “search resolution” over the FFT and grid-search degrees of freedom. For the grid-search, the resolution is controlled directly by the selection of grid points. For the FFT, the resolution is effectively controlled
by frequency domain representation of the scalar field. Specifically,
• In the translational case, this resolution is expressed in terms of a sampling resolution
which corresponds to the grid of frequency vectors at which the translational Fourier coefficients (Equation 5) are evaluated. To generate these frequency vectors, we use a 40 Å
cubical domain with periodic boundary conditions.
• In the rotational case, this resolution is given by the maximum order of spherical harmonics in the global spherical basis (Equation 7) used for evaluating rotational crosscorrelations.
The search space for these resolution hyperpameters is detailed in Table 5. Increasing the search
resolution improves accuracy but at the cost of core optimization runtime. The default settings
chosen for the main experiments are based on a qualitative assessment of the best tradeoff. Figure 6
visualizes this tradeoff, with full supporting experimental results in Appendix G.2.
Table 5: Inference-time model hyperparameters and their search spaces. The search grid points over
SO(3) are generated according to Zhong et al. (2019); Yershova et al. (2010). The translational
search grid points are defined by evenly spacing the stated number of points along each dimension
of a 8 Å cube. The parameters corresponding to experiments in the main text are in bold.
Procedure

Parameter

Search Space

TF

Translational signal resolution
Rotataional search grid points

0.5 Å, 1 Å
576, 4608

RF

Rotational signal order
Translational search grid points

10, 25, 50
73 , 93 , 133

22

Rot. grid pts.

Translational Optimization (TF)
Median RMSD (Å)

Ligand preprocessing
runtime (ms)

Cross-correlation
runtime (ms)

4608

1.1

7196

715

576

1.53

1.5

100

122

1

0.5
1
0.5
1
Translational cross-correlation resolution (Å)

0.5

930

928

Trans. grid pts.

Rotational Optimization (RF)
Median RMSD (Å)

Protein preprocessing
runtime (seconds)

Cross-correlation
runtime (ms)

133

1.05

0.9

197

200

994

1430

93

1.16

1.0

0.98

64

67

63

332

476

1486

73

1.25

1.15

1.16

30

30

31

157

224

704

10

25

50
10
25
50
10
Rotational cross-correlation resolution ( max)

25

50

Figure 6: Inference-time hyperparameters and their effect on performance and average runtime.
In the TF procedure (top), the protein preprocessing time is roughly constant and is not shown;
similarly in the RF procedure (bottom) the ligand preprcessing time is constant. The selected default
settings are boxed. All numbers are taken from full results on PDBBind in Appendix G.2

23

F

RUNTIME A MORTIZATION

In this section we discuss in greater detail the practical implications of runtime amortization for
various common workflows in molecular docking, focusing on rigid conformer docking as a proxy
task. Conceptually, the total runtime of a docking workflow is given by




# unique
# unique ligand
Cprot ·
+ Clig
+ Cpair · (# complexes)
(35)
proteins
conformers
where Cprot , Clig , Cpair refer to the protein precomputations, ligand precomputations, and crosscorrelation (i.e., pose optimization) runtimes, respectively. If the number of complexes grows faster
than either the number of unique proteins or ligand conformers individually, then the runtime contributions from the latter can be amortized across the complexes—meaning that the unit cost per
complex decreases as their number increases.
To illustrate this amortization, suppose that we have N protein pockets (for simplicity we assume
each is on a seperate protein) and M rigid conformers and wish to dock all M N complexes. Then,
following the exposition and typical runtimes provided in Section 3.4, we can compute (Table 6) the
total docking cost of this workflow for the TF procedure:
(72 ms) · M + (7400 ms) · N + (740 ms) · M N

(36)

Similarly, the total cost under the RF procedure is:
(58400 ms) · M + (21 ms) · N + (470 ms) · M N

(37)

Note that Cprot ≪ Clig for the TF procedure, whereas Cprot ≫ Clig for the RF procedure. This is because that the grid search over rotations in TF involves a new set of translational Fourier coefficients
for each rotation of the ligand, while the protein is kept fixed; on the other hand, the grid search
over translations in RF requires a new global spherical expansion Bnjℓm of the protein around each
grid point, but only once for the ligand around its center of mass. This discrepancy has significant
implications for the suitability of each procedure depending on the behavior of M, N :
• In the virtual screening-like setting, we keep a fixed protein (M = 1) and let the number
of ligands increase (N → ∞). Hence, the TF procedure is unsuitable, as the cost per
ligand (and thus per complex) is a constant ≈8 seconds, approximately on the same order as
traditional docking methods. On the other hand, the cost per ligand under the RF procedure
begins at nearly ≈60 s with a single ligand, but asymptotically approaches ≈500 ms as the
number of ligands increases (Figure 7, left).
• In the inverse-screening-like setting, we keep a fixed ligand (N = 1) and let the number of
proteins increase (M → ∞). Now the RF procedure is unsuitable, as the cost per protein
is a constant ≈60 s, which is far too high to be useful. On the under hand, the cost per
protein under the TF procedure begins at ≈8 s and asymptotically approaches ≈800 ms as
the number of proteins increases (Figure 7, right).
• In the general setting, we divide both Equation 36 and 37 by M N to find that when
M ≫ N (i.e., more proteins than ligands), the TF procedure is more efficient, and when
M ≪ N (more ligands than proteins), the RF procedure is more efficient. However, both
procedures converge to similar per-complex runtimes (740 ms vs 470 ms) in the limit of
M, N → ∞.

24

Table 6: Computing the runtime of the TF and RF procedures as a function of the number of
proteins M and ligands N . T and R are the number of points used in the grid search over rotations
and translations for TF and RF, respectively; they are fixed constants on the order of ≈1000 (we
use T = 729, R = 4608 by default). The rows in red contribute the greatest preprocessing runtime.
Translational Optimization (TF)
Computation

Count Runtime

Protein scalar field
M×
Protein FFT coeffs
M×
Ligand scalar field
N×
Ligand FFT coeffs
RN ×
Cross-correlation + RM N ×
=

Runtime per complex (s)

Rotational Optimization (RF)
Computation

Count Runtime

Protein scalar field
M×
Protein global coeffs
TM×
Ligand scalar field
N×
Ligand global coeffs
N×
Cross-correlation
+ TMN×

65 ms
7.0 ms
4.3 ms
1.6 ms
160 µs

=

(Equation 36)

(Equation 37)

Virtual screening (M = 1)

Inverse screening (N = 1)

101
102
N (# ligands)

101
M (# proteins)

60
50
40
30
20
10
0 0
10

60
TF
50
RF
40
30
20
10
0
103 100

65 ms
80 ms
4.3 ms
17 ms
650 µs

102

Figure 7: Amortized conformer docking runtimes per protein-ligand complex in the virtual screening
(left) and inverse screening (right) settings.

25

G

A DDITIONAL R ESULTS

G.1

M ODEL H YPERPARAMETERS

In Table 7, we explore the impact of varying model hyperparameters that determine the expressivity
of the scalar field from their default values. We use the ESF-N model and evaluate on the rigid
conformer docking task on PDBBind crystal structures. As the results illustrate, increasing the
expressivity of the scalar field can bring additional improvements in performance. On the other
hand, decreasing the expressivity degrades the performance of our method to the extent that it is no
longer competitive with the baselines (Table 3).
Table 7: Rigid conformer docking results for varying model hyperparameters. Runtimes are shown
averaged over complexes, excluding / including pre-computations that can be amortized.
TF
RF
Method

%
<2 Å

Med.
RMSD

Runtime

%
<2 Å

Med.
RMSD

Runtime

Baseline
# channels = 3
# channels = 8
# RBFs = 3
# RBFs = 8
RBF cutoff = 3 Å
RBF cutoff = 8 Å
Spherical harmonics ℓmax = 1
Spherical harmonics ℓmax = 3

72
70
71
63
77
72
76
62
80

1.10
1.07
1.15
1.27
1.07
1.16
1.16
1.35
0.99

0.7 s / 8.2 s
0.8 s / 6.5 s
0.9 s / 12 s
0.9 s / 8.0 s
0.8 s / 9.3 s
0.8 s / 8.4 s
0.8 s / 8.4 s
0.9 s / 7.6 s
0.9 s / 10 s

73
71
70
64
76
72
71
62
77

1.00
0.98
1.09
1.13
0.95
1.00
1.08
1.23
0.87

0.5 s / 68 s
0.4 s / 70 s
0.7 s / 71 s
0.5 s / 70 s
0.5 s / 74 s
0.5 s / 68 s
0.5 s / 69 s
0.5 s / 62 s
0.5 s / 81 s

G.2

I NFERENCE H YPERPARAMETERS

In Tables 8–11 below, we explore the impact of inference-time hyperparameters on the performance
and runtime of our method on the rigid conformer docking task. We use the ESF-N model variant
and experiment with the PDBBind crystal test set and PDE10A test set. As discussed in Appendix E,
for the TF procedure, we adjust (1) the number of grid points over SO(3) (576 or 4608); and (2)
the spatial resolution for translational cross-correlation (either 1 Å or 0.5 Å). For the RF procedure,
we adjust (1) the number of translational grid points in the 8 Å cube (73 , 93 , or 133 ); and (2) the
resolution for the rotational cross-correlation, expressed in terms of the maximal spherical harmonic
order (10, 25, or 50). The rows corresponding to results in the main Table 3 are bolded.
In all rows, the effective number of poses searched over via both degrees of freedom is computed.
To provide an idea of the impact of discretization, we compute the median RMSD of the closest
grid point to the ground-truth pose (decomposed into rotational and translational contributions).
This serves as a hard lower bound for the median RMSD of the output docked pose. In the TF
procedure, increasing the resolution is memory-intensive; thus, the RF procedure is more effective
at leveraging FFT to conduct fine-grained search over the accelerated degree of freedom. The default
reported performance is attained with a translational offset of 0.4 Å and a rotational offset of 0.16
Å. While performance improves with smaller grid offsets, the returns are rapidly dimishing.
The runtime of the method (averaged over 333 PDBBind complexes and 70 PDE10A complexes)
is reported and color-coded according to Appendix B: protein preprocessing (green), ligand preprocessing (blue), and the pose optimization (red). The effect of the non-FFT grid resolution is also
color-coded, i.e., in TF the explicit enumeration over SO(3) grid points directly scales the ligand
preprocessing, whereas in RF the enumeration over R3 scales the protein preprocessing. As the
tables show, the preprocessing of these explicit grid points contributes to the majority of the nonamortizeable runtime. In general, the SO(3) grid / ligand preprocessing in TF is less expensive,
however, it cannot be amortized when moving from PDBBind to PDE10A (where the ligands are
still distinct). On the other hand, the R3 grid / protein preprocessing time in RF is significantly
reduced (very roughly on the order of 70-fold, as expected) in PDE10A compared to PDBBind.

26

Table 8: PDBBind TF
Grid offset

Runtime (ms)

Trans.
resol.

SO(3)
grid pts.

Effective
# poses

Tr.

Rot.

All

Med.
RMSD

%
<2Å

Prot.
prep.

Lig.
prep.

Opt.

1 Å
1 Å
0.5 Å

576
4608
576

420k
3.4M
2.8M

0.52
0.50
0.25

0.84
0.42
0.80

0.98
0.67
0.84

1.53
1.10
1.50

63
72
64

65
72
70

931
7196
928

100
715
123

Table 9: PDBBind RF
Grid offset

Runtime (ms)

Trans.
grid pts.

SO(3)
ℓmax

Effective
# poses

Tr.

Rot.

All

Med.
RMSD

%
<2Å

Prot.
prep.

Lig.
prep.

Opt.

73
73
73
93
93
93
133
133

10
25
50
10
25
50
10
25

3.2M
45M
353M
6.8M
97M
751M
20M
291M

0.65
0.67
0.65
0.49
0.50
0.51
0.33
0.33

0.38
0.15
0.08
0.36
0.15
0.08
0.37
0.15

0.80
0.70
0.67
0.64
0.53
0.52
0.51
0.37

1.25
1.15
1.16
1.16
1.00
0.98
1.05
0.90

70
69
70
73
73
74
74
72

30k
31k
32k
64k
67k
63k
198k
200k

85
87
85
85
87
84
85
86

158
225
704
333
476
1487
995
1430

Table 10: PDE10A TF
Grid offset

Runtime (ms)

Trans.
resol.

SO(3)
grid pts.

Effective
# poses

Tr.

Rot.

All

Med.
RMSD

%
<2Å

1 Å
1 Å
0.5 Å
0.5 Å

576
4608
576
4608

420k
3.4M
2.8M
23M

0.51
0.50
0.26
0.26

0.88
0.48
0.89
0.44

1.00
0.69
0.93
0.51

1.85
1.11
2.05
1.00

56
64
50
73

Prot.
prep.

Lig.
prep.

Opt.

22
21
20
20

761
6159
756
6147

89
736
106
892

Table 11: PDE10A RF
Grid offset

Runtime (ms)

Trans.
grid pts.

SO(3)
ℓmax

Effective
# poses

Tr.

Rot.

All

Med.
RMSD

%
<2Å

Prot.
prep.

Lig.
prep.

Opt.

73
73
73
93
93
93
133
133

10
25
50
10
25
50
10
25

3.2M
45M
353M
6.8M
97M
751M
20M
291M

0.72
0.57
0.65
0.46
0.48
0.49
0.34
0.33

0.38
0.16
0.08
0.39
0.16
0.09
0.41
0.16

0.83
0.59
0.65
0.63
0.51
0.50
0.55
0.36

1.60
1.21
1.30
1.05
1.00
0.99
1.17
0.96

54
63
64
64
70
64
64
69

476
549
635
1014
946
943
2798
2912

44
42
59
42
43
42
42
45

161
227
718
327
465
1483
986
1469

27

G.3

P ERFORMANCE -RUNTIME T RADEOFF

In Figure 8, we further investigate the tradeoff between speed and performance offered by our
method compared to Gnina (with the Vina scoring function). While in the main results (Table 3) we run Gnina using all default settings, it is possible to reduce the runtime (and performance) by adjusting these settings. In particular, we explore setting --max_mc_steps and
--minimize_iters to 5 independently and in combination. Together with the default runs
and the --score_only runs, these trace out a Pareto frontier representing the tradeoff between
runtime per complex and <2 Å RMSD success rate. With the default settings, Gnina outperforms
all variants of our method on the PDBBind crystal and PDE10A test sets. However, Figure 8 shows
that we can reach previously inaccessible regions in the accuracy v.s. runtime tradeoff landscape.

% < 2 A RMSD

0.8

0.7
0.6
0.5
0.4
0.3
0.2
0.1
0.0

0.6
0.4
0.5

0.2

0.0

0.0

0

0
2
5
10
15
PDBBind Runtime (s)

4
20

0

2
4
PDE10A Runtime (s)

6

Figure 8: Tradeoff between speed and accuracy using our method compared to Gnina on PDBBind
crystal structures (left) and PDE10A (right). In both cases, variants of our method (blue dots) enable
possibilities not reachable with Gnina (orange curve).

28