Finding Interpretable Class-Specific Patterns through Efficient Neural Search
Nils Philipp Walter1 , Jonas Fischer2 , Jilles Vreeken1
1

arXiv:2312.04311v1 [cs.LG] 7 Dec 2023

2

CISPA Helmholtz Center for Information Security,
Harvard T.H. Chan School of Public Health, Department of Biostatistics
nils.walter@cispa.de, jfischer@hsph.harvard.edu, vreeken@cispa.de

Abstract
Discovering patterns in data that best describe the differences
between classes allows to hypothesize and reason about classspecific mechanisms. In molecular biology, for example, this
bears promise of advancing the understanding of cellular processes differing between tissues or diseases, which could lead
to novel treatments. To be useful in practice, methods that
tackle the problem of finding such differential patterns have
to be readily interpretable by domain experts, and scalable to
the extremely high-dimensional data.
In this work, we propose a novel, inherently interpretable binary neural network architecture D IFF NAPS that extracts differential patterns from data. D IFF NAPS is scalable to hundreds of thousands of features and robust to noise, thus overcoming the limitations of current state-of-the-art methods in
large-scale applications such as in biology. We show on synthetic and real world data, including three biological applications, that, unlike its competitors, D IFF NAPS consistently
yields accurate, succinct, and interpretable class descriptions.

1

Introduction

Machine learning can be broadly categorized into predictive and discovery-based approaches. Predictive tasks, such
as object detection, protein folding (Jumper et al. 2021) and
fusion reactor control (Degrave et al. 2022), are aimed at
maximizing performance. Mastering such a given task often requires learning deep and intricate models from which
it is hard up to impossible to understand how it arrived at a
decision. In data-driven discovery, the goal is to find interpretable relations, called patterns, in the data that best describe observed classes. That is, the focus is on interpretability rather than maximizing performance. Discovery-based
approaches are in especially high demand in biology, where
the complex gene-regulatory dynamics and their differences
between tissues or across diseases remain unclear, but, when
elucidated, can offer new avenues for treatment and prevention. Here, symbolic explanations are essential for domain
experts, for example, patterns of gene expression that are
associated with cancer subtypes, to be able to directly understand and act on these patterns.
Although there exist massive amounts of highdimensional data, such as genetic human variation or
gene expression data, most existing approaches are not applicable as they either do not scale or are limited to pair-wise

interactions. Here, we suggest a novel neural network learning approach that follows the paradigm of neuro-symbolic
learning: leverage the predictive power of, and efficient
frameworks for neural networks, while constraining the
models such that learned patterns are fully interpretable.
In particular, we learn a modified NN architecture that in
the forward pass leverages binary weights and activations
to achieve symbolically interpretable intermediate features,
while leveraging efficient continuous optimization during
backpropagation (Fischer and Vreeken 2021).
To learn patterns that differentiate classes, such as healthy
and tumor tissue, we build an architecture that is comprised
of both a binary autoencoder and a separate classification
head, which we call D IFF NAPS. We propose a multi-task
objective to jointly optimize reconstruction and classification, driving learned patterns to differentiate between classes
through a bottleneck in the autoencoder (see Fig. 1). We additionally introduce regularizers that improve optimization
and emphasize interpretability of learned patterns.
We empirically evaluate D IFF NAPS on synthetic and realworld data, comparing against baseline approaches such as
classification trees, but also recent proposals such as rule
lists, statistical and compression-based pattern mining, and
neuro-symbolic learning. We show that D IFF NAPS faithfully
reconstructs patterns relevant for distinguishing between
classes, is robust to noise, and easily scales to hundreds of
thousands of features, which makes it unique among existing
work. We consider three high-dimensional biological applications, including breast cancer genomics, on which D IFF NAPS finds meaningful patterns that hold promise for giving
domain experts insight in the drivers of these diseases.

2

Related Work

Finding class-specific descriptions is at the core of
discovery-oriented approaches in machine learning and data
mining. A text-book example—and still widely used in
practice—is the decision tree, which yields an interpretable
decision path leading to a classification.
In data mining, emerging pattern mining (Dong and Li
1999; Garcı́a-Vico et al. 2018) and subgroup discovery
(Klösgen 1995; Atzmueller 2015) are classic methods that
aim to discover the conditions under which the class labels assume an exceptional distribution. Emerging pattern
mining seeks to find every such condition, which results

Figure 1: Left: The architecture of D IFF NAPS consists of a binarized autoencoder and a classifier attached to the hidden layer.
The neurons in the hidden layer encode patterns and are active if the corresponding pattern is present in the data. Right: The
table shows the parameters of D IFF NAPS. In the forward pass, the continuous weights W E are stochastically binarized (WbE ),
while the classifier weights W C are kept continuous. The bias b in the hidden layer is ceiled to bTd . To extract the differential
patterns (bottom right) per class P 1 and P 2 , both matrices, W E and W C , are deterministically binarized using the thresholds,
τE and τC . A pattern, encoded by a neuron, is given by the index set of all 1 in the corresponding row of the weight matrix
W E . For the differential patterns, the binarized classifier weight matrix functions as a multiplexer to assign patterns to classes.
in extremely many, highly redundant, and mostly spurious
results. Subgroup discovery yields the top-k patterns with
the strongest association with the target. While this circumvents the pattern explosion, the results are still redundant
(Van Leeuwen and Knobbe 2012). In contrast, we are interested in succinct and non-redundant descriptions.
Statistically significant pattern mining (Llinares-López
et al. 2015; Pellegrina, Riondato, and Vandin 2019) aims to
discover patterns that have statistically significantly different
distributions between classes. These methods tend to suffer
from the pattern explosion. That is, even on small data they
often find tens of thousands redundant patterns, partially due
to lack of multiple hypothesis test correction.
Pattern set mining (Bringmann and Zimmermann 2007;
Budhathoki and Vreeken 2015; Hedderich et al. 2022)
solves this by asking for a non-redundant set of classspecific patterns that together describe the data well. These
methods work well on small data, but as they are based on
combinatorial-search heuristics that are (at least) quadratic
in the number of features, they are mostly inapplicable to
high-dimensional data.
Rule-based classification (Lakkaraju, Bach, and Leskovec
2016; Dash, Gunluk, and Wei 2018; Chen and Rudin 2018;
Proença and van Leeuwen 2020; Hüllermeier, Fürnkranz,
and Loza Mencia 2020; McTavish et al. 2022; Huynh,
Fürnkranz, and Beck 2023; Lin et al. 2022) aims to find interpretable classification rules of the form if X1 = 1∧X5 =
1 then Y = 0. While such results are interpretable, these
methods primarily focus on prediction rather than description and, hence, miss out on important details. Additionally,
most are based on combinatorial optimization which prevents them from scaling to high-dimensional datasets.
Neuro-symbolic classification (Wang et al. 2020, 2021;
Kusters et al. 2022; Dierckx, Veroneze, and Nijssen 2023)
has been proposed to overcome these computational limitations. These approaches design neural architectures from

which, after training, symbolic classification rules can be extracted. Their optimization aside, in spirit these methods are
similar to traditional rule-based classifiers as they focus on
classification accuracy rather than complete rule discovery.
In contrast, D IFF NAPS combines data reconstruction with
classification to discover the human-interpretable explanations relevant for the classes present in the dataset.

3

Method

In this section, we introduce D IFF NAPS, a fully interpretable
binary neural network-based approach for finding patterns
that describe the differences between classes in (very) highdimensional data. We start by giving the intuition.

3.1

D IFF N APS in a Nutshell

Given a binary dataset and corresponding class labels, we
seek to find interpretable patterns, which succinctly and differentially describe the partitioning of the dataset induced
by the labels. That is, we want to find patterns that are more
prevalent in a class than in the rest of the data and, hence,
allow us to discriminate between classes.
To this end, we propose D IFF NAPS, a binary neuralnetwork architecture designed to find exactly such interpretable patterns. The architecture consists of a two-layer
binary autoencoder, combined with a classification head (see
Figure 1). The classification head is a fully connected layer,
attached to the hidden layer of the autoencoder.
During the forward pass, we interpret the continuous
weights wij ∈ [0, 1] in the autoencoder as Bernoulli variables distributed as B(wij ) and binarize them stochastically.
Each neuron performs a dot product between the binary
weights and input. The neuron is active if the 1s in the input align with the 1s in the weight vector. Thus the weight
vector can be interpreted as a pattern and hence we refer to
the hidden layer as the pattern layer. Intuitively, the weights

are optimized such that the autoencoder—the set of encoded
patterns—reconstructs the data well.
To find differential patterns, we need to reward those patterns that are specific for a class. We achieve this by adding a
classification head, corresponding to a logistic regression on
the pattern layer. That is, we seek to classify samples based
on the presence and absence of patterns.
To find a good set of patterns, the network is trained using
a multi-task loss. The autoencoder is trained to minimize the
reconstruction error, while the classification head is trained
to minimize the classification error. As such, the network is
driven towards the learning of relevant patterns in the data
that are at the same time differential between classes.

3.2

D IFF N APS in Detail

if a neuron is active, the pattern encoded in that neuron is
used as a whole for the reconstruction. Consequentially, to
achieve a low reconstruction loss, the patterns formed during
optimization must succinctly describe the data.
To reward differential patterns, we connect a classifier
to the pattern layer with continuous weights W C that is
tasked to predict the label of a sample based on the presence and absence of patterns. The classifier is linear, and,
hence, highly interpretable. To extract differential patterns,
we binarize weight matrices W E and W C by thresholding
with τe and τc , respectively. As described above, the patterns
in the pattern layer are given by the index set of all i’s such
that WbE [i, j] = 1. The discretized classifier weights allow
us to assign patterns to their respective classes. For a formal
description of the pattern extraction, we refer to App. A.2.

Next, we discuss D IFF NAPS in detail. We first introduce notation, and then, in turn, discuss the architecture, how to extract differential patterns, how to carry out the forward pass,
the multitask loss, and how to backpropagate errors through
D IFF NAPS.

Forward Pass We denote the size of the hidden dimension of the autoencoder by h and the binary weights of the
encoder as WbE ∈ {0, 1}h×m . We define a linear layer without bias as fW (x) = W x. For a binary input x ∈ {0, 1}m ,
we compute the activations of the pattern layer as

Notation We consider labeled binary datasets (X, Y ) ∈
{0, 1}n×m × {1, ..., K}n of n samples, m features and K
classes. We write Xi,j to refer to the value of the j-th feature
of the i-th sample. We denote the partition of the dataset for
class k by X k = {Xi | Yi = k}.
A pattern p is a subset of feature indices p ⊆ {1...m}
and represents feature co-occurrences. A row Xi contains a
pattern p iff Xij = 1, ∀j ∈ p. The support supp(p) of a
pattern p is the number of rows that contain p, and analogue
supp k (p) is the support where additionally Yi = k. We have

z = fE (x) = λE (fWbE (x)) .

P(p | k) =

supp k (p)
nk

and P(k | p) =

supp k (p)
,
supp(p)

where nk is the number of samples where Yi = k.
We say a pattern pk is differential for class k if it both has
a higher support in X k than in X \ X k , and the probability
of class k is highest for records that contain pk . Formally, iff
k = arg max P(pk | k ′ ) = arg max P(k ′ | pk ) .
k′ ∈{1...K}

k′ ∈{1...K}

Our goal is to find a set P k of such patterns per class k.
Architecture The architecture of D IFF NAPS consists of a
binary autoencoder and a classification head attached to the
hidden layer. We graphically depict it in Fig. 1.
The encoding and decoding layers of the autoencoder
share a set of continuous weights W E , which are learned
during backpropagation. The forward pass uses a binarized
version of this weight matrix WbE . A hidden neuron j represents a pattern, and a feature i is part of the pattern corresponding to neuron j, iff WbE [i, j] = 1. The activation function of the encoder λE is a binary step function centered at a
learned bias term, which represents how many features need
to be present for the neuron to ”fire”—i.e. for the pattern to
be considered present in the sample. We refer to the hidden
layer as the pattern layer.
The decoding layer performs the transposed linear transformation of the encoding layer i.e. Wbd = (WbE )T . Hence,

where λE : R → {0, 1} is the binary step function as defined by Fischer and Vreeken (2021). To steer the encoded
patterns to be differential rather than merely descriptive, we
attach a classifier to the pattern layer. This classifier has continuous weights W C ∈ [0, 1]K×h and computes a linear
transformation followed by a softmax of the binary hidden
activations ŷ = softmax(fW C (z)). That is, its output depends only on the presence or absence of patterns.
To ensure interpretability, we use the transposed encoder
weights as weights of the decoder WbD = (WbE )T ∈
{0, 1}m×h . The reconstruction x̂ of the input x is given by
x̂ = fD (x) = λD (fWbD (z)) ,
where λD is the activation of the decoder as defined by Fischer and Vreeken (2021), clamping the input to the interval
[0, 1] and rounding it to the closest integer.
Objective Function Our objective function consists of
four terms: one for the autoencoder, one for the classification, and two regularization terms. To optimize the classifier, we use the cross-entropy loss between the predicted logits ŷ and the one-hot encoding of the ground truth label y:
PK
lc (y, ŷ) = k=1 yk log(ŷk ). As binary tabular data tends to
be sparse, i.e., the number of ones #1 and number of zeros
#0 are highly unbalanced, we use a sparsity-aware reconstruction loss (Fischer and Vreeken 2021) that weighs the
importance of reconstructing a 1 proportional to the sparsity
of the data. For a sample x ∈ {0, 1}m and reconstruction
x̂ ∈ {0, 1}m , the reconstruction loss is
le (x, x̂) =

m
X

((1 − xj )α + xj (1 − α))|xj − x̂j | ,

j=1
#1s
where α = #1s+#0s
is the sparsity of the data.
Our overall goal is to find a succinct description of the
classes in terms of class-specific patterns encoded by the

neurons in the hidden layer. To promote such patterns, we
adapt the L2 -regularizer to penalize long patterns i.e. rows
with a lot of 1s. This adapted regularizer is given by

2
m
h
X
X

rs (W ) =
Wi,j  .
i=1

j=1

Instead of considering each weight individually, we sum the
rows before squaring them. This penalizes a pattern as a
whole by imposing a quadratic cost on the length of the pattern. Hence, the regularizer tilts the optimization to prefer
shorter patterns. To further push the weights to a binary solution we employ a W-shaped regularizer (Bai, Wang, and
Liberty 2019; Dalleiger and Vreeken 2022), defined as
X
rb (W ) =
min{r(w), r(w − 1)} ,
w∈W

r(w) = κ∥w∥1 + λ∥w∥22 .
This regularizer is based on the elastic-net regularizer and
the hyperparameters κ and λ specify the trade-off between
the ridge and lasso penalty. For κ = λ = 1, the regularizer
is depicted in Figure 3 in the Appendix. Compared to rs ,
the W-shape regularizer is applied element-wise to push the
individual weights towards zero or one.
In the forward pass, we apply stochastic quantization
Wb [i, j] ∼ B(W [i, j]). If all W [i, j], for j = {1...m}, have
the same value, a sample of a row is binomially distributed
with p = W [i, j] and m trials. The expected value is then
mW [i, j]. Considering a minimum of two features for a neuron to fire, this means that when all W [i, j] drop below 1/m
the neuron is on expectation ‘dead’. To prevent regularizers from zeroing out a neuron by pushing Wij below this
threshold, we offset the weights by −1/m before applying
the regularizers. For the same reason, we set the gradients
Ph
for rs to zero if j=1 W [i, j] < 1 .
Given the parameters of the network θ = {W E , W C } the
loss function for a dataset (X, Y ) is given by
n
X
le (yi , ŷi )+λc lc (xi , x̂i )+rs (W E )+rb (θ) ,
L(X, Y ; θ) =
i=1

where λc is a parameter that weighs the classification loss.
Backward Pass We minimize this loss function using gradient descent. For this, we need to compute the partial
derivatives with respect to the weights of the network. To
be able to pass gradients through step-functions, we use the
straight-through-estimator (STE), which is commonly employed in binary neural networks (Bengio, Léonard, and
Courville 2013). For a particular layer, gu denotes the upstream gradient. For the derivatives with respect to the autoencoder, we follow the approach of Fischer and Vreeken
(2021). In particular, for encoding layer W E and input x
∂fWbE

:= gu x⊤ ,

∂fWbE

:= (W E )⊤ gu .
dx
The derivative through the activation function of the deD
coder λD is given by ∂λ
∂x := 1gu . For the activation function of the pattern layer, the STE above is inapplicable. In the
∂W E

case that features are wrongly reconstructed, the resulting
loss would propagate negative gradients through the STE,
even to inactive neurons. Hence, we adapt the gated STE,
which gates the gradient depending on whether a neuron was
active in the forward pass. The derivatives for bias b and input x are

∂λE
gu if λE (x) = 1
,
:=
0
if λE (x) = 0
∂b

∂λE
gu
if λE (x) = 1
.
:=
max (0, gu ) if λE (x) = 0
∂x
In quantized neural networks, it has been observed that
quantizing the classification layer has a negative impact on
performance (Choi et al. 2018; Liu et al. 2018; Hubara et al.
2017). Thus we do not quantize the weights of the classifier during training. Although the classifier is not quantized,
the classifications are transparent and interpretable, since the
classification head is similar to logistic regression and the
weights are constrained to be in the interval [0, 1].
Finally, after a round of backpropagation, all weights are
clipped to the interval [0, 1]. This enables stochastic binarization for the autoencoder and the classifier for the next
forward pass and to transparently interpret the contribution
of a pattern to a certain class. We clamp the bias at a maximum of −1, such that at least two features have to be present
for a neuron to become active.
This concludes the formal description of D IFF NAPS.
D IFF N APS in Practice To use D IFF NAPS in practice, we
need to choose the number h of hidden neurons and set λc .
For medium to high-dimensional data, setting the size of
the hidden layer lower than the dimensionality of the data,
m, creates an inductive bias towards differential patterns.
Since to achieve both a low reconstruction loss and low classification loss, the patterns in the hidden layer have to be predictive, i.e., high P(k | pk ), and due to the bottleneck, the
patterns must cover the partition well, i.e., high P(pk | k).
For low dimensional data, choosing a small hidden layer
results in an under-parameterized network that will underfit.
Choosing a larger hidden layer, thus having more parameters, outweighs the benefits of the bottleneck.
Parameter λc weighs the effect of the reconstruction and
classification losses. The magnitude of the reconstruction
loss varies strongly among different datasets. In practice, we
increase λc until the classification error saturates.

4

Experiments

We compare D IFF NAPS five state-of-the art methods on synthetic and real-world data. In particular, we compare to decision trees (C ART, Breiman 1984), significant pattern mining (SP U M AN T E Pellegrina, Riondato, and Vandin 2019),
MDL-based label-descriptive (P REMISE, Hedderich et al.
2022) and classification rule learning (C LASSY, Proença and
van Leeuwen 2020), and neuro-symbolic classification rule
learning (R LL, Wang et al. 2021).
We additionally considered top-k subgroup discovery
(Lemmerich and Becker 2018), difference description (Budhathoki and Vreeken 2015), falling rule lists (Chen and

Rudin 2018; Lin et al. 2022), optimal sparse decision
trees (McTavish et al. 2022), and class-specific BMF (Hess
and Morik 2017), but found these do not scale to, or do not
find patterns on non-trivial data.
P REMISE and SP U M AN T E consider only binary classes.
To allow fair comparison in a multiclass scenario, we run
them in a one-versus-all for each class and merge the results.
The hyperparameters for the predictive approaches are
tuned based on accuracy on a hold-out set. For SP U M AN T E,
we used the default parameters given by the authors. We fit
the hyperparameters of D IFF NAPS based on our loss function. The experiments for the neural approaches, i.e. D IFF NAPS and R LL, are executed on GPUs. For more on the experimental setup, we refer to Appendix A.3.

4.1

Synthetic Data

To evaluate all methods on data with known ground we first
consider synthetic data. We measure success in terms of soft
F1 (Hedderich et al. 2022), by which we avoid overly penalizing methods that recover only parts rather than exact
matches of ground truth patterns. The formal definition can
be found in Appendix A.4. Informally, the soft F1 score does
not require strict equality between a discovered pattern pd
and the corresponding ground truth pattern pg but uses a soft
equality, i.e., the Jaccard distance of pd and pg .
Data Generation In the experiments below we generate
synthetic data as follows. We start with an empty data matrix of n rows and m features. We sample 10 patterns per
class, uniformly at random (u.a.r.) across features, drawing
their length from U(5, 15). We sample 20 common patterns
u.a.r, but draw their length from U(0.01m, 0.025m) to maintain the density of the data. Per class, we generate equally
many rows. Per row, we plant u.a.r. two common and three
class-specific patterns. We then apply both additive noise by
flipping ten 0s to 1s, as well as destructive noise by flipping
1s due to a pattern to 0s with a probability of 2.5%. Finally,
we assign the class label such that P(k | p ∈ Pk ) = 0.9. Unless specified otherwise, we report the average results over
five independently drawn datasets.
Scalability in m First, we consider how well D IFF NAPS
scales to high dimensional data. We fix the number of
classes K to 2, the number of rows n to 10 000, and vary
m ∈ {100, 500, 1k, 5k, 10k, 15k, 20k, 25k, 50k, 100k}. To
reduce the overlap across patterns in low-dim. data m <
1000, we sample 5 patterns per class and no shared patterns.
We run all methods and report their results in Fig. 2a,b.
Except for P REMISE and SP U M AN T E, all terminate within
24 hours. P REMISE runs out of time for m > 20k.
SP U M AN T E runs out of memory for m < 1k and m > 10k.
We see in Fig. 2a that C LASSY is one order of magnitude
slower than D IFF NAPS, R LL, and C ART all of which perform on par in terms of runtime.
Next, we inspect the average F1-scores, which we show
in Fig. 2a. We see that C LASSY, R LL and C ART perform
poorly, as they recover only small parts of the ground truth
patterns, and that SP U M AN T E varies in performance due to
having to sub-sample the data. P REMISE achieves scores of
approx. 0.5 across all m. D IFF NAPS consistently recovers

ground truth well across many orders of magnitudes of m.
For very high dimensional data, performance slowly deteriorates but still outcompetes the state-of-the-art by a wide
margin.
Multi-class Next, we examine how well D IFF NAPS scales
to a large number of classes. To this end, we generate data as above, varying the number of classes K ∈
{2, 5, 10, 15, ..., 25, 50}, generating 1 000 rows per class,
setting m = 5 000. We give the results in Fig. 2c. R LL,
C LASSY, C ART, and SP U M AN T E fail to recover more than
a small subset of the ground truth. In addition SP U M AN T E
runs out of memory for more than 10 classes. P REMISE
achieves scores of around 50% up to 20 classes, but fails
to terminate for K ≥ 40, as running in a one-versus-all setting incurs high computational costs. In contrast, D IFF NAPS
stably performs best in this setting.
Robustness to Noise Finally, we evaluate how robust
methods are to noise. Here, we set K = 2, m = 5 000,
and n = 2 000. First, we consider additive noise by varying
the number of randomly added 1s per row, from 0 to 100. In
the interest of space, we postpone the Figure to App. Fig. 5.
We find that SP U M AN T E rapidly fails to discover meaningful results, and runs out of memory for a = 100. In contrast,
D IFF NAPS and P REMISE are robust across varying a, with
D IFF NAPS outperforming its competitors by a wide margin.
Second, we consider destructive noise by varying the
probability of flipping a 1 to a 0, from 0% to 60%. We show
the results in Fig. 2d. C ART, C LASSY, and R LL all obtain
F1 scores of near-zero, SP U M AN T E performs slightly better on average but shows a large variance in the performance
across repetitions. P REMISE is the best among competitors,
but its performance declines rapidly even for small amounts
of destructive noise. In contrast, D IFF NAPS is robust, its performance virtually unaffected up to 20% destructive noise,
i.e., up to a signal-to-noise ratio of 6dB.

4.2

Real-World Data

Next, we evaluate D IFF NAPS on five biological datasets. We
consider phenotypical Cardio data (Ulianova 2017), a Disease diagnosis (Patil and Rathod 2020) dataset, two highdimensional binarized gene expression datasets for breast
cancer, BRCA-N and BRCA-S, that we derived from The
Cancer Genome Atlas (TCGA) (see App. A.5), and a human genetic variation data set (The 1000 Genomes Project
Consortium 2015; Fischer and Vreeken 2020).
We consider the same competitors as before, except R LL
as it returns no patterns for any data but Cardio. To obtain
results with SP U M AN T E we had to restrict it to 250 samples
for Cardio, 4000 for Disease, 50 for both BRCA datasets. We
could not find any setting to make it work on Genomes. We
report running time for all methods in App. Tab. 2.
Quantitative Results As the ground truth is unknown,
we report the number of discovered patterns, their average
length, and the area under the curve of what percentage of
the data the patterns cover when we order them by probability of seeing a class given a pattern (see App. A.5). Intuitively, this corresponds to sensitivity (how much do we

10−2
102

103
104
105
Number of features |m|

(a) Runtime (lower is better)

C LASSY

SP U M AN T E

102

P REMISE

R LL

1
0.8
0.6
0.4
0.2
0

SoftF1

100

C ART

SoftF1

102

SoftF1

time (min)

D IFF NAPS

1
0.8
0.6
0.4
0.2
0
103
104
105
Number of features |m|

10 20 30 40 50
Number of classes K

(b) F1-score (higher is better)

(c) F1-score (higher is better)

1
0.8
0.6
0.4
0.2
0
0

0.2
0.4
Destructive noise

0.6

(d) F1-score (higher is better)

Figure 2: Scalability. We show runtime (a) and F1-scores (b) when varying m, resp. F1-scores when varying the number of
classes K (c) and varying the amount of destructive noise (d). Our method, D IFF NAPS can confidently handle a large number
of features with a negligible increase in runtime, outperforms all state-of-the-art competitors significantly in terms of F1.
D IFF NAPS (ours)

C ART

C LASSY

P REMISE

SP U M AN T E

Dataset

n

m

K

#P

|P |

AUC

#P

|P |

AUC

#P

|P |

AUC

#P

|P |

AUC

#P

|P |

AUC

Cardio
Disease
BRCA-N
BRCA-S
Genomes

68k
5k
222
187
2.5k

45
131
20k
20k
225k

2
41
2
4
6

14
838
146
1k
732

2
2
9
2
7

.56
.84
.91
.86
.77

7k
1
1
22
127

6
2
2
2
4

.62
.00
.00
.31
.46

10
25
3
2
7

2
2
1
1
2

.36
.11
.45
.23
.36

28
187
–
–
–

1
3
–
–
–

.51
.84
–
–
–

346
2k
4k
0
–

4
3
3
0
–

.56
.39
.95
0
–

Table 1: Real world data We give the number of samples (n), features (m), and classes (K), and per method the number
of discovered patterns (#P ), average length of discovered patterns (|P |) and area under the curve (AUC) of the sensitivityspecificity plot (see App. Fig. 6, Sec A.5). We aborted experiments taking longer than 24h or running out of memory (–).
cover) versus specificity (how specific are patterns for that
class). To filter spurious patterns we compute this measure
over patterns for which at least 1/k + 0.1 of their probability mass is assigned to one class. For example, for the
binary setting only those patterns that occur at least 60%
in the class, where 50% would correspond to independence
(random coin flip).
We report basic statistics and results in Tab. 1. We observe
that D IFF NAPS performs well across all datasets, obtaining
AUC scores that are either best by a wide margin (Disease,
BRCA-S, Genomes) or close second best (Cardio, BRCA-N).
Consistent with our synthetic data study, our competitors
yield mixed results; they do not scale to high dimensional
data (P REMISE, SP U M AN T E), or return prohibitively many
or unspecific patterns (SP U M AN T E, C ART, resp. C LASSY).
Regarding the length of discovered patterns, we observe that those by D IFF NAPS reflect the complexity of the
datasets: on Cardio and Disease, which contain complex,
information rich features, it finds smaller patterns, while for
the other datasets, that consist of low-level molecular information as features, it finds longer patterns to capture complex relationships. In contrast, C LASSY generally discovers only few, medium-length patterns across datasets, while
C ART recovers more complex relationships that are, however, less descriptive of the classes as measured by the AUC.

Qualitative Results Next, we analyse the results of D IFF NAPS in detail and show their relevance for biological re-

search on the breast cancer datasets.1
Differentiating Breast Cancer and Healthy Tissue Breast
cancer (BRCA) is the most common cancer and the leading cause of death from cancer among women in the
world (Lukasiewicz et al. 2021). The exact underlying gene
regulatory dynamics are actively researched.
We apply D IFF NAPS on BRCA-N and discover 146 differential patterns of gene co-expression for BRCA and adjacent normal tissue. To see if these capture relevant molecular
differences, we run a statistical gene set over-representation
analysis using KEGG (see App. A.5), a manually curated
gold standard for molecular interactions, reactions, and relations (Kanehisa et al. 2017).
We first do a pooled analysis over all genes identified by
any discovered pattern for a class, i.e., the union of features
in the respective patterns. We find that enriched pathways
for tumor tissue correspond to known cancer drivers, such
as MAPK and WNT signalling, while for the healthy tissue we find pathways linked to the regulation of lipolysis
in adipocytes as well as PPAR signalling, both of which are
known to be dysregulated in BRCA (Yang et al. 2018; Zhao
1
Human variation data, such as the Genomes dataset, is an ideal
application for D IFF NAPS as it is a high-dimensional resource of
binary data in which we can uncover potential genetic predispositions of individuals to diseases, thus allowing to advance early
detection and treatment. However, in the available data, the target
class is the population membership of the individual, which raises
ethical concerns for detailed analysis. Sadly, no further meta-data
is available to meaningfully split Genomes for differential analysis.

et al. 2022). In short, D IFF NAPS discovers patterns that together describe complex, cancer-related functions.
Investigating individual patterns, we find that while many
identify general pathways like above, others are enriched for
specific pathways, such as PPAR. This shows the discovered
patterns reveal details that can potentially be used for discovering alternative treatment targets for these pathways.
Differentiating Cancer Subtypes It is well known that breast
cancer is not one single disease, e.g. the Luminal A, Luminal B, HER2+, and the Triple Negative subtypes all show
distinct molecular behaviour, response to treatment, and patient survival. To investigate whether D IFF NAPS can elucidate differences between these subtypes, we run D IFF NAPS
on BRCA-S, a balanced dataset of primary BRCA tissue with
subtype label, and again analyse the discovered patterns using a gene set over-representation analysis in KEGG.
Starting with a pooled analysis, we find significantly enriched pathways that capture specifics of classes. Luminal
A, for example, is defined by a lack of HER2. For this
subtype, D IFF NAPS discovers patterns that are enriched for
(i.e. related to) dilated cardiomyopathy. This is a common
side-effect in Trastuzumab treatment, a drug which targets
and depletes HER2 in HER2 positive subtypes (Crone et al.
2002). Luminal B is Estrogen receptor positive, meaning it
expresses this receptor. For this subtype, we find patterns
that are significantly enriched for sphingolipid metabolism.
This is an important component for cell survival, proliferation, and promotion of cell migration and invasion in Estrogen receptor-positive BRCAs (Corsetto et al. 2023). These
metabolites are also targets of treatment, and the discovered
patterns could reveal insights leading to potential new therapeutic targets.
Promising Novel Patterns On both BRCA datasets, we find
highly class-specific patterns, with average log-odds of
P(p | k) against P(p | ¬k) of ≈ 8 resp. ≈ 5. Encouragingly, the above analysis above showed that many of these
patterns capture complex biological processes related with
BRCA progression or tumorigenes. More exciting perhaps
are those patterns for which the genes are not yet annotated
in a pathway but are strongly associated with BRCA or its
subtypes. We are looking forward to conducting an in-depth
analysis with oncologists, relating these patterns with more
fine-grained subtypes or treatment groups.

5

Discussion

Experiments show that D IFF NAPS finds succinct sets of differential patterns, scales to hundreds of thousands of features, large number of classes, and is robust to noise.
On synthetic data, we saw that existing methods fail to recover significant portions of the planted differential patterns.
Rule-based methods only recovered small subsets of incomplete patterns. SP U M AN T E suffers from memory problems,
and returns overly large, redundant results. P REMISE does
account for redundancy, which results in better performance,
but its combinatorial search does not scale well. R LL and
C ART scale very well, but show poor performance on synthetic data. Surprisingly, none of the existing approaches are
robust to destructive noise.

On real world data, we find D IFF NAPS is the only approach that scales well and retrieves high-quality patterns.
While other approaches show good performance on individual datasets, e.g. C ART on Cardio and P REMISE on Disease
data, they fail to do so in general. We also note that C ART
and SP U M AN T E tend to return thousands of patterns, which
undermines the goal of human interpretation.
D IFF NAPS fulfills the goal we set for this work and
presents itself as a suitable candidate to take on the challenge of high-dimensional pattern mining in applications
like genomics. As encouraging its ability in retrieving classdescriptive patterns at scale is, there is of course no free
lunch. For example, on low-dimensional data of up to a hundred features, D IFF NAPS has a harder time differentiating
classes and individual patterns and performs ‘only’ on par
with other approaches. For such low-dimensional regimes,
employing methods with guarantees, that are usually infeasible for large-scale data is still preferential.
Similar to most existing work, D IFF NAPS considers only
conjunctions of features as patterns. In many applications,
relations can be more complex, such as mutually exclusive
features. It would make for engaging future work to study
extensions of D IFF NAPS to capture such relations. In a case
study on breast cancer datasets, we show that D IFF NAPS
discovers patterns that capture class-relevant biological processes. The results are not only encouraging, but also contain
many patterns for which the genes are not yet annotated to
a pathway or process, or the function of individual genes
is still unknown. These results offer an exciting opportunity to investigate novel links between genes and diseases
in follow-up studies with domain experts.

6

Conclusion

We studied the problem of discovering differential patterns,
i.e., patterns that succinctly describe and differentiate between the classes present in the data. Existing methods
are often limited to binary classes, do not scale to highdimensional data, or retrieve uninformative pattern sets.
To tackle this problem, we proposed a novel neural network architecture D IFF NAPS consisting of a binary autoencoder and a classification head. With a flat, binary architecture, the learned intermediate layer captures symbolic patterns. For the optimization, we proposed a multi-task objective to jointly optimize the reconstruction and classification,
thus driving learning of patterns that both reconstruct the
data well and differentiate between classes.
On synthetic and real-world data, including biological
case studies on breast cancer, we show that D IFF NAPS
strikes a unique balance among existing work, scales to
high-dimensional data, is robust to noise, and accurately retrieves differential patterns that are highly interpretable.

Acknowledgements
The BRCA datasets were derived from data made available
by the TCGA Research Network.2
2

https://www.cancer.gov/tcga

References
Atzmueller, M. 2015. Subgroup discovery. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery,
5(1): 35–49.
Bai, Y.; Wang, Y.-X.; and Liberty, E. 2019. Quantized Neural Networks via Proximal Operators. In Proceedings of
the International Conference on Learning Representations
(ICLR). OpenReview.
Bengio, Y.; Léonard, N.; and Courville, A. 2013. Estimating or propagating gradients through stochastic neurons for
conditional computation. arXiv preprint arXiv:1308.3432.
Breiman, L. 1984. Classification and regression trees. Routledge.
Bringmann, B.; and Zimmermann, A. 2007. The Chosen Few: On identifying valuable patterns. In Proceedings
of the 7th IEEE International Conference on Data Mining
(ICDM), Omaha, NE, 63–72.
Budhathoki, K.; and Vreeken, J. 2015. The Difference and
the Norm – Characterising Similarities and Differences between Databases. In Proceedings of the European Conference on Machine Learning and Principles and Practice of
Knowledge Discovery in Databases (ECML PKDD), Porto,
Portugal. Springer.
Chen, C.; and Rudin, C. 2018. An optimization approach
to learning falling rule lists. In Proceedings of the International Conference on Artificial Intelligence and Statistics
(AISTATS), 604–612. PMLR.
Chen, E. Y.; Tan, C. M.; Kou, Y.; Duan, Q.; Wang, Z.;
Meirelles, G. V.; Clark, N. R.; and Ma’ayan, A. 2013. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinformatics, 14: 128.
Choi, J.; Chuang, P. I.-J.; Wang, Z.; Venkataramani, S.;
Srinivasan, V.; and Gopalakrishnan, K. 2018. Bridging
the accuracy gap for 2-bit quantized neural networks (qnn).
arXiv preprint arXiv:1807.06964.
Corsetto, P. A.; Zava, S.; Rizzo, A. M.; and Colombo, I.
2023. The Critical Impact of Sphingolipid Metabolism in
Breast Cancer Progression and Drug Response. Int J Mol
Sci, 24(3).
Crone, S. A.; Zhao, Y. Y.; Fan, L.; Gu, Y.; Minamisawa, S.;
Liu, Y.; Peterson, K. L.; Chen, J.; Kahn, R.; Condorelli, G.;
Ross, J.; Chien, K. R.; and Lee, K. F. 2002. ErbB2 is essential in the prevention of dilated cardiomyopathy. Nat Med,
8(5): 459–465.
Dalleiger, S.; and Vreeken, J. 2022. Efficiently factorizing
boolean matrices using proximal gradient descent. Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), 35: 4736–4748.
Dash, S.; Gunluk, O.; and Wei, D. 2018. Boolean decision rules via column generation. In Proceedings of the Annual Conference on Neural Information Processing Systems
(NeurIPS).
Degrave, J.; Felici, F.; Buchli, J.; Neunert, M.; Tracey, B.;
Carpanese, F.; Ewalds, T.; Hafner, R.; Abdolmaleki, A.;
de Las Casas, D.; Donner, C.; Fritz, L.; Galperti, C.; Huber, A.; Keeling, J.; Tsimpoukelli, M.; Kay, J.; Merle, A.;

Moret, J. M.; Noury, S.; Pesamosca, F.; Pfau, D.; Sauter, O.;
Sommariva, C.; Coda, S.; Duval, B.; Fasoli, A.; Kohli, P.;
Kavukcuoglu, K.; Hassabis, D.; and Riedmiller, M. 2022.
Magnetic control of tokamak plasmas through deep reinforcement learning. Nature, 602(7897): 414–419.
Dierckx, L.; Veroneze, R.; and Nijssen, S. 2023. RL-Net:
Interpretable Rule Learning with Neural Networks. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, 95–107. Springer.
Dong, G.; and Li, J. 1999. Efficient mining of emerging patterns: Discovering trends and differences. In Proceedings of
the ACM International Conference on Knowledge Discovery
and Data Mining (SIGKDD).
Fischer, J.; and Vreeken, J. 2020. Discovering Succinct Pattern Sets Expressing Co-Occurrence and Mutual Exclusivity. In Proceedings of the ACM International Conference
on Knowledge Discovery and Data Mining (SIGKDD), 813–
823.
Fischer, J.; and Vreeken, J. 2021. Differentiable pattern set
mining. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD),
383–392.
Garcı́a-Vico, A.; Carmona, C. J.; Martı́n, D.; Garcı́aBorroto, M.; and del Jesus, M. J. 2018. An overview of
emerging pattern mining in supervised descriptive rule discovery: taxonomy, empirical study, trends, and prospects.
Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(1): e1231.
Hedderich, M. A.; Fischer, J.; Klakow, D.; and Vreeken,
J. 2022. Label-descriptive patterns and their application
to characterizing classification errors. In Proceedings of
the International Conference on Machine Learning (ICML),
8691–8707. PMLR.
Hess, S.; and Morik, K. 2017. C-salt: Mining class-specific
alterations in boolean matrix factorization. In Proceedings
of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases
(ECML PKDD), 547–563. Springer.
Hubara, I.; Courbariaux, M.; Soudry, D.; El-Yaniv, R.; and
Bengio, Y. 2017. Quantized neural networks: Training neural networks with low precision weights and activations.
Journal of Machine Learning Research, 18(1): 6869–6898.
Hüllermeier, E.; Fürnkranz, J.; and Loza Mencia, E. 2020.
Conformal rule-based multi-label classification. In Proceedings of Advances in Artificial Intelligence (KI), 290–296.
Springer.
Huynh, V. Q. P.; Fürnkranz, J.; and Beck, F. 2023. Efficient
learning of large sets of locally optimal classification rules.
Machine Learning, 112(2): 571–610.
Jumper, J.; Evans, R.; Pritzel, A.; Green, T.; Figurnov, M.;
Ronneberger, O.; Tunyasuvunakool, K.; Bates, R.; Žı́dek,
A.; Potapenko, A.; Bridgland, A.; Meyer, C.; Kohl, S. A. A.;
Ballard, A. J.; Cowie, A.; Romera-Paredes, B.; Nikolov,
S.; Jain, R.; Adler, J.; Back, T.; Petersen, S.; Reiman, D.;
Clancy, E.; Zielinski, M.; Steinegger, M.; Pacholska, M.;
Berghammer, T.; Bodenstein, S.; Silver, D.; Vinyals, O.; Senior, A. W.; Kavukcuoglu, K.; Kohli, P.; and Hassabis, D.

2021. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873): 583–589.
Kanehisa, M.; Furumichi, M.; Tanabe, M.; Sato, Y.; and
Morishima, K. 2017. KEGG: new perspectives on genomes,
pathways, diseases and drugs. Nucleic Acids Res, 45(D1):
D353–D361.
Klösgen, W. 1995. Explora: A Multipattern and Multistrategy Discovery Assistant. In Proceedings of the ACM International Conference on Knowledge Discovery and Data
Mining (SIGKDD).
Kusters, R.; Kim, Y.; Collery, M.; Marie, C. d. S.; and Gupta,
S. 2022. Differentiable Rule Induction with Learned Relational Features. In Proceedings of the International Workshop on Neural-Symbolic Learning and Reasoning.
Lakkaraju, H.; Bach, S. H.; and Leskovec, J. 2016. Interpretable decision sets: A joint framework for description and
prediction. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD),
1675–1684.
Lemmerich, F.; and Becker, M. 2018. pysubgroup: Easyto-use subgroup discovery in python. In Proceedings of the
European Conference on Machine Learning and Principles
and Practice of Knowledge Discovery in Databases (ECML
PKDD), 658–662.
Lin, J.; Zhong, C.; Hu, D.; Rudin, C.; and Seltzer, M. 2022.
Generalized and Scalable Optimal Sparse Decision Trees.
arXiv:2006.08690.
Liu, Z.; Wu, B.; Luo, W.; Yang, X.; Liu, W.; and Cheng,
K.-T. 2018. Bi-real net: Enhancing the performance of 1bit cnns with improved representational capability and advanced training algorithm. In Proceedings of the European
Conference on Computer Vision (ECCV), 722–737.
Llinares-López, F.; Sugiyama, M.; Papaxanthos, L.; and
Borgwardt, K. 2015. Fast and memory-efficient significant
pattern mining via permutation testing. In Proceedings of
the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), 725–734.
Lukasiewicz, S.; Czeczelewski, M.; Forma, A.; Baj, J.;
Sitarz, R.; and Stanislawek, A. 2021. Breast CancerEpidemiology, Risk Factors, Classification, Prognostic
Markers, and Current Treatment Strategies-An Updated Review. Cancers (Basel), 13(17).
McTavish, H.; Zhong, C.; Achermann, R.; Karimalis, I.;
Chen, J.; Rudin, C.; and Seltzer, M. I. 2022. Fast Sparse Decision Tree Optimization via Reference Ensembles. In Proceedings of the AAAI Conference on Artificial Intelligence
(AAAI).
Patil, P.; and Rathod, P. 2020. Disease Symptom Prediction. https://www.kaggle.com/datasets/itachi9604/diseasesymptom-description-dataset.
Pellegrina, L.; Riondato, M.; and Vandin, F. 2019.
SPuManTE: Significant pattern mining with unconditional
testing. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD),
1528–1538.

Proença, H. M.; and van Leeuwen, M. 2020. Interpretable
multiclass classification by MDL-based rule lists. Information Sciences, 512: 1372–1393.
The 1000 Genomes Project Consortium. 2015. A global reference for human genetic variation. Nature, 526(7571): 68–
74.
Ulianova, S. 2017.
Cardiovascular Disease dataset.
https://www.kaggle.com/datasets/sulianova/cardiovasculardisease-dataset.
Van Leeuwen, M.; and Knobbe, A. 2012. Diverse subgroup
set discovery. Data Mining and Knowledge Discovery, 25:
208–242.
Wang, Z.; Zhang, W.; Liu, N.; and Wang, J. 2021. Scalable
rule-based representation learning for interpretable classification. In Proceedings of the Annual Conference on Neural
Information Processing Systems (NeurIPS), 30479–30491.
Wang, Z.; Zhang, W.; Ning, L.; and Wang, J. 2020. Transparent classification with multilayer logical perceptrons and
random binarization. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 6331–6339.
Wilks, C.; Zheng, S. C.; Chen, F. Y.; Charles, R.; Solomon,
B.; Ling, J. P.; Imada, E. L.; Zhang, D.; Joseph, L.; Leek,
J. T.; Jaffe, A. E.; Nellore, A.; Collado-Torres, L.; Hansen,
K. D.; and Langmead, B. 2021. recount3: summaries and
queries for large-scale RNA-seq expression and splicing.
Genome Biol, 22(1): 323.
Yang, D.; Li, Y.; Xing, L.; Tan, Y.; Sun, J.; Zeng, B.; Xiang, T.; Tan, J.; Ren, G.; and Wang, Y. 2018. Utilization
of adipocyte-derived lipids and enhanced intracellular trafficking of fatty acids contribute to breast cancer progression.
Cell Commun Signal, 16(1): 32.
Zhao, B.; Xin, Z.; Ren, P.; and Wu, H. 2022. The Role of
PPARs in Breast Cancer. Cells, 12(1).

A
A.1

Appendix

and R LL, were run on machines with NVIDIA DGX A100
and AMD Rome 7742 CPUs. The others were run on Intel
Xeon(R) Gold 6244 machines with 256GB RAM. Individual
experiments were stopped after 24 hours or if they exceeded
256GB of RAM.

Clamping Function

The clamping function used in section 3.2 is given by

a if x < a,
clamp(x, a, b) = x if a ≤ x ≤ b,

b if b < x .
The W -shaped regualrizer is plotted in Fig. 3 for λ = κ = 1.
After each epoch, λ and κ are increased using an exponential
scheduler (Dalleiger and Vreeken 2022).
2

rb (x)

1.5
1
0.5
0
−1

−0.5

0

0.5
w

1

1.5

2

Figure 3: W -shaped regularizer. To push the weights towards a binary solution during optimization, which allows
a better quantization, we employ the W -shaped regularizer
discussed in the main paper (here, λ = κ = 1).

A.2

A.4

Synthetic Data

Formal Definition of Soft F1 Score To avoid overpenalization of methods that only recover sub-parts of the
individual patterns, we adopt the soft F1 score from Hedderich et al. (2022). Instead of using a strict equality for
computing recall and precision, we resort to using Jaccard
distance. Formally, we define the soft F1 score as

Method Details

Extracting Differential Patterns The differential patterns are extracted in two steps: First, we extract all patterns
encoded in the pattern layer. Next, we assign the extracted
patterns to the corresponding differential pattern sets P k .
Given the trained continuous weights of the autoencoder
W E ∈ [0, 1]h×m and weights of the classifier W C ∈
[0, 1]K×h , we start by extracting all patterns P encoded in
the pattern layer. For this, we binarize the weight of the autoencoder with a fixed threshold τe . The pattern pi encoded
in the i-th neuron is given by pi = {j | W E [i, j] > τe , j ∈
{1...m}}, for i ∈ {1, ..., h}. The overall pattern set is given
by P = {pi | pi ̸= ∅, i ∈ {1, ..., h}}. Next, we assign
the differential patterns to the classes, for which we use a
threshold τc to binarize the weights of the classifier. The differential patterns P k for class k ∈ {1, ..., K} are given by
P k = {pi | W C [k, i] > τc , pi ∈ P } .
Informally, a pattern pi is assigned to the differential pattern set Pk , if it is connected to the output for the classification of class k. In practice, the thresholds can be chosen
by grid search and choosing the pair (τte , τtc ), for which the
discretized network achieves the lowest reconstruction and
classification error.

A.3

Hyperparameter Optimization Hyperparameters are optimized as follows. For D IFF NAPS, we fine-tune reconstruction loss and classification accuracy on a hold-out set. For
C ART, we set the maximal depth to 20 to facilitate reasonable pattern sizes while not harming performance, and optimize Gini impurity. For SP U M AN T E, we mine the top 1
million patterns using a significance threshold α = 0.05, a
correction term γ = 0.01, and set the sampling rate to 25%
to keep it from running out of memory. For C LASSY, we
use a beam width of 200 and a maximum search depth of
20, which provides a good tradeoff between pattern-length
and computational burden. For R LL, we optimized the number of hidden layers, the number of neurons in the hidden
layer, the learning rate, and weight decay based on the performance on a hold-out set.

SoftPrec (Pd , Pg ) =

1 X
|pd ∩ pg |
,
max
pg ∈Pg |pd ∪ pg |
|Pd |
pd ∈Pd

SoftRec (Pd , Pg ) =

1 X
|pd ∩ pg |
max
,
pd ∈Pd |pd ∪ pg |
|Pg |
pg ∈Pg

SoftF1 (Pd , Pg ) =

2 ∗ SoftPrec ∗ SoftRec
,
SoftPrec + SoftRec

where we denote the sets of ground truth resp. discovered
patterns by Pg and Pd .
Additional Results Here, we report additional statistics
and results for the experiment on synthetic data. In Fig. 4a
and Fig. 4b, we report the precision and recall for the scalability in m (Sec. 4.1). D IFF NAPS performs equally well with
regard to precision and recall. In contrast, our competitors
achieve a significantly higher recall than precision. This is
especially prevalent for P REMISE and C ART, which explains
the overall low F1-score in Fig. 2b.
In Fig. 5, we report the F1-score for varying numbers
of random additive features a ∈ {0, 10, 20, ..., 90, 100}.
On average, D IFF NAPS outperforms all competitors, while
P REMISE has an overall smaller variance. SP U M AN T E suffers from large variance and degrades after 50 random features and runs out of memory a = 100.

Experimental Details

Hardware for Experiments We implemented D IFF NAPS
in PyTorch, and use the publicly available implementations
of other methods. Those that leverage a GPU, i.e. D IFF NAPS

A.5

Real data

Computing AUCs As we do not have the ground truth for
real world data, we resort to evaluating the area under the

Runtime
Dataset

Rows

Columns

k

D IFF NAPS (ours)

C ART

C LASSY

P REMISE

SP U M AN T E

Cardio
Disease
BRCA-N
BRCA-S
Genomes

68k
5k
222
187
2.5k

45
131
20k
20k
225k

2
41
2m
4m
6

1m33s
1m40s
18s
58s
8m59s

1s
1s
1s
1s
28s

15s
14s
33m40s
26m08s
8h20m

10s
8s
–
–
–

43s
48s
3h45m
2h31m
–

Table 2: Runtime real world data. We specify the time taken to compute the results. We abort experiments that ran out of
memory or took longer than 24h (–).
T REE

C LASSY

SP UMAN T E

P REMISE

RRL

D IFF NAPS

1

1

0.8

0.8
Soft-F1

SoftPrec

D IFF NAPS

0.6
0.4
0.2

T REE

C LASSY

SP UMAN T E

P REMISE

RRL

0.6
0.4
0.2

0
102

103

104

105

0

10 20 30 40 50 60 70 80 90 100
a

Number of features |m|

(a) Soft-Precision, better ↑

Figure 5: Results for additive noise We report the soft F1
score on synthetic data with different amounts of random
additive features a ∈ {0, 10, 20, ..., 90, 100}.

SoftRec

1
0.8
0.6
0.4

P is then given by

0.2
0
102

103

104

105

Number of features |m|

(b) Soft-Recall, better ↑

Figure 4: Additional results for Scalability. We provide the
soft precision score (a) and soft recall score (b) on the synthetic data with an increasing number of features.

cov(P, X) =

[

cov(p, X) .

p∈P

With Pqk , we denote the set of patterns such that such that
P(k | pk ) > p and is not spurious. Then for a threshold p,
the corresponding value on the y-axis yp is computed as
K

yp =

1 X | cov(Pqk , X k )|
.
K
|X k |
k=1

curve of what percentage of the data the patterns cover when
we order them by the probability of seeing a class given a
pattern. This can be roughly translated into sensitivity (how
much of the dataset do we cover) versus specificity (how
specific is the pattern for that class). To filter spurious patterns, we only consider patterns with a predictive probability
1
K +0.1 (i.e., at least slightly more likely than chance). More
k (pk )
formally on the x-axis we plot P(k | pk ) = supp
supp(pk ) . If a
pattern pk is very specific to a class k, then P(k | pk ) = 1
and if it is unspecific P(k | pk ) = 0
On the y-axis, we plot how much of the dataset is covered
(explained) given all the patterns that pass the threshold p
1
and have a predictive probability K
+ 0.1 . The coverage
cov(p, X) of a pattern p for a dataset X is defined as
cov(p, X) = {Xi | p ⊂ Xi } .
With a slight abuse of notation, the coverage of a pattern set

That is, per class k, we compute how much of the corresponding partition is covered and take the mean of those individual coverages.
Sensitivity-Specificity Curves Figure 6 shows the
sensitivity-specificity curves for the AUCs in (Table 1).
Processing of BRCA Data We obtained re-aligned bulk
RNA-seq data of TCGA BRCA samples through recount3 (Wilks et al. 2021). We first filtering samples into
primary tumor and adjacent normal tissue samples and keeping only protein-coding genes with non-zero expression in at
least one sample. We then remove duplicate samples of individuals, keeping the one with highest sequencing depth.
Gene expression counts were log-TPM transformed.
To binarize the expression data, for each gene, we set samples that have expression larger than the upper quartile to 1,
all others to 0. If the upper quartile is 0, we set all non-zero
samples to 1.
For the normal vs tumor data (BRCA-N), as many of the
competitors are sensitive to class-imbalance, we kept only

C ART

C LASSY

SP U M AN T E

P REMISE

1

1

0.8

0.8

0.6
0.4

0.6
0.4

0.6
0.4

Coverage

1
0.8

Coverage

1
0.8

Coverage

1
0.8

Coverage

Coverage

D IFF NAPS

0.6
0.4

0.2

0.2

0.2

0.2

0

0

0

0

0.6
0.4
0.2
0

0 0.2 0.4 0.6 0.8 1

0 0.2 0.4 0.6 0.8 1

0 0.2 0.4 0.6 0.8 1

0 0.2 0.4 0.6 0.8 1

0 0.2 0.4 0.6 0.8 1

Specificity

Specificity

Specificity

Specificity

Specificity

(c) BRCA-N

(d) BRCA-S

(e) Genomes

(a) Cardio

(b) Disease

Figure 6: Real world data. Here, we plot the specificity-coverage curves used to compute the AUCs reported in Tab. 1. The
curves are computed as explained in Sec. A.5.
the matching samples, i.e., where individuals where both adjacent normal as well as primary tumor tissue was available.
For the BRCA subtype data (BRCA-S), we followed the
same simple binarization scheme, and kept at most 50 samples per subtype to keep the data roughly balanced, sampling
at random without replacement. The four subtypes—luminal
A, luminal B, HER2+, and triple negative—where defined
based on annotated receptor status of the Estrogen recpetor
(ER), the Progesterone receptor (PR), and the human epidermal growth factor receptor 2 (HER2) available from the recount data. In particular, we define luminal A as (ER+, PR+,
HER2-), luminal B as (ER+, PR-, HER2-), HER2+ as obvious, and triple negative as (ER-, PR-, HER2-). We removed
all samples that do not belong to any of these subtypes or
where receptor status was not available.
Analysis of BRCA Patterns To analyze pattern sets discovered by D IFF NAPS qualitatively in terms of whether they
represent reasonable biological functions specific to a label,
we compute gene set over-representation statistics for gene
relationships annotated in the Kyoto Encyclopedia of Genes
and Genomes (KEGG). KEGG serves as a gold standard
for known biological pathways and relationships, including
hand-drawn and manually curated cellular pathways. A gene
set over-representation analysis tests whether an overlap of
a given gene set (e.g., a pattern or union of patterns) with an
annotated pathway is more likely than chance, where the null
overlap statistic is computed using a background gene set
(here: the set of genes in the dataset). We use the ENRICH R
software package for the gene set over-representation analysis and report results as significant with a p-value cutoff
of .05 (Chen et al. 2013). For an overall assessment we
consider pathways that are found for the union of genes
across all patterns for a class. We also enriched pathways
for each pattern individually, many of the patterns, however,
were too small to be considered for enrichment or contained
genes for which no annotation is available in KEGG. To
obtain an estimate of the average log-odds ratios of likelihood of a pattern
setP describing a class, we compute

P
P (p|l)
1
ln
p∈P
|P |
P (p|¬k) , where ln is the natural logarithm
and we do not compute those terms where P (p | ¬k) = 0,
which leaves us with a lower bound of the log-odds.