Guided Reconstruction with Conditioned Diffusion Models for Unsupervised Anomaly Detection in Brain MRIs

arXiv:2312.04215v1 [eess.IV] 7 Dec 2023

Finn Behrendt
Hamburg University of Technology
Hamburg, Germany

Debayan Bhattacharya
Hamburg University of Technology
Hamburg, Germany

Lennart Maack
Hamburg University of Technology
Hamburg, Germany

Julia Krüger
Jung Diagnostics GmbH
Hamburg, Germany

Robin Mieling
Hamburg University of Technology
Hamburg, Germany
Roland Opfer
Jung Diagnostics GmbH
Hamburg, Germany

Alexander Schlaefer
Hamburg University of Technology
Hamburg, Germany

Abstract
Unsupervised anomaly detection in Brain MRIs aims to identify abnormalities as outliers from a
healthy training distribution. Reconstruction-based approaches that use generative models to learn to
reconstruct healthy brain anatomy are commonly used for this task. Diffusion models are an emerging
class of deep generative models that show great potential regarding reconstruction fidelity. However,
they face challenges in preserving intensity characteristics in the reconstructed images, limiting their
performance in anomaly detection. To address this challenge, we propose to condition the denoising
mechanism of diffusion models with additional information about the image to reconstruct coming
from a latent representation of the noise-free input image. This conditioning enables high-fidelity
reconstruction of healthy brain structures while aligning local intensity characteristics of input-reconstruction pairs. We evaluate our method’s reconstruction quality, domain adaptation features
and finally segmentation performance on publicly available data sets with various pathologies. Using
our proposed conditioning mechanism, we can reduce false-positive predictions and enable a more
precise delineation of anomalies, which significantly enhances the anomaly detection performance
compared to established state-of-the-art approaches to unsupervised anomaly detection in brain MRI.
Furthermore, our approach shows promise in domain adaptation across different MRI acquisitions
and simulated contrasts, a crucial property of general anomaly detection methods.
Keywords Unsupervised Anomaly Detection · Zero-Shot Segmentation · Brain MRI · Diffusion Models

1 Introduction

The interpretation of MRI scans is a critical task in medical imaging, providing valuable diagnostic information for
various neurological conditions Vernooij et al. (2007); Lundervold and Lundervold (2019). However, this process is
error-prone, time-consuming, and places a significant workload on available radiologists Bruno et al. (2015); McDonald
et al. (2015). To address these challenges and improve diagnostic efficiency, deep learning techniques like convolutional
neural networks (CNN) have shown great promise in assisting radiologists by automating certain aspects of the analysis
Lundervold and Lundervold (2019). A common task is the detection and delineation of pathological structures in
the MRI scans such as tumors Perkuhn et al. (2018), white matter lesions Moeskops et al. (2018) or Alzheimer’s
disease Islam and Zhang (2018). Supervised deep learning approaches have been proposed for these tasks, relying on

Behrendt et al. Conditioned Diffusion Models for UAD in Brain MRIs

P REPRINT

large-scale and balanced data sets for training, especially in MRI imaging, where there is considerable heterogeneity
across hospitals and scanners. However, gathering such data sets is a cumbersome and costly process. Therefore,
exploring alternative approaches that reduce the dependence on annotated data while still being able to detect and locate
anomalies holds great potential.
Unsupervised anomaly detection (UAD) in neuroimaging is an active research area with the potential to identify
abnormalities without relying on costly data annotation. In UAD methods, the goal is to learn the underlying data
distribution of healthy brain MRI scans and detect anomalies as outliers from that learned distribution. Reconstruction-based UAD, in particular, trains generative models to reconstruct healthy anatomy, enabling the identification of
anomalies through discrepancies between (unhealthy) inputs and pseudo-healthy reconstructions at test time Baur
et al. (2021); Kascenas et al. (2022); Chen et al. (2020); Pinaya et al. (2022b). This is based on the assumption that
anomalous structures (e.g. tumors) in the input image are replaced by an estimation of healthy anatomical structures
similar to the training distribution. This approach addresses limitations of supervised learning, such as the requirement
for large-scale and balanced annotated data sets Johnson and Khoshgoftaar (2019); Karimi et al. (2020); Ellis et al.
(2022), which is particularly crucial in neuroimaging where pathologies can exhibit complex and variable morphological
characteristics. Additionally, the ability of UAD methods to perform zero-shot detection and segmentation of previously
unseen pathologies makes them highly practical in a wide range of clinical scenarios.
Recent advancements have shown promise in utilizing denoising diffusion probabilistic models (DDPMs) Ho et al.
(2020) for UAD in neuroimaging Pinaya et al. (2022a); Wyatt et al. (2022); Graham et al. (2022). DDPMs generate
images by denoising images that are corrupted by artificial noise, leveraging a high-dimensional latent space to preserve
spatial context and achieve high-fidelity reconstructions. However, a significant challenge remains in accurately
reconstructing healthy brain anatomy that exhibits anatomical coherence (specific brain structures such as ventricles
should appear at the same location in both input and reconstruction) and aligned intensity characteristics with the input
image Behrendt et al. (2023). The forward and backward processes of DDPMs do not adequately capture the highly
variable local intensity distributions of MRI scans, leading to false positives in the residual map and impaired detection
performance, particularly when facing domain shifts at test time.
To address this challenge, we propose context-conditioned DDPMs (cDDPMs) for UAD in brain MRI. Our approach
involves training a DDPM to reconstruct healthy brain anatomy and incorporating a latent feature representation of
the noise-free input image into the denoising process. To this end, we utilize a CNN-based image encoder to obtain the
feature representation. This representation is then used to linearly transform the feature maps of the denoising Unet
to incorporate the information in the denoising process. While the dense feature representation is not suitable for
high-fidelity reconstruction, it captures local intensity information of the input image that is partially lost during the
forward process of DDPMs. Hence, our approach is designed to align the intensities of input and reconstruction, which
is considered an important property for reconstruction-based UAD.
To gauge the impact of our conditioning approach on the quality of reconstruction, we analyze the intensity-based,
structural, and perceptual similarity between input and reconstructions using healthy brain MRIs. Furthermore,
we explore the domain adaptation capabilities of our approach by evaluating the histogram alignment of input and
reconstruction for out-of-domain data sets, unseen during training, and additionally simulate different contrast levels.
Finally, we analyze the unsupervised anomaly segmentation performance on diverse data sets, encompassing various
pathologies that are unseen during training.
Our approach demonstrates superior or competitive UAD performance compared to recent state-of-the-art architectures
on all tested data sets. It effectively addresses the domain shift inherent in different MRI data sets, showcasing its
potential for identifying pathologies even in the absence of large-scale annotated data sets.
In summary, the main contributions of this work are:
• We propose conditioned DDPMs to incorporate additional information of the noise-free input image during
the denoising process of DDPMs.
• We systematically analyze the effect our conditioning approach has on the reconstructed images and domain
adaptation capabilities and thereby show its promising features for the UAD task.
• We demonstrate the effectiveness of our conditioning approach for the UAD task by outperforming state-of-the-art solutions and show its robustness to distribution shifts on various data sets.
This paper is organized as follows: In Section 2, we provide a review of relevant literature in the field of UAD in brain
MRI. In Section 3, we introduce DDPMs and subsequently explain our conditioning approach. In Section 4, we provide
details of the experimental setup. In Section 5, we present the results and subsequently discuss them in Section 6.
Finally, we provide a conclusion in Section 7.

Figure 1: Overview of our proposed approach. The encoder representations are learned along with the DDPM in the
main training stage to condition the denoising process. The feature maps of the denoising Unet are scaled and shifted
based on the projected encoder representation of the input image in each residual block. Each residual block consists
of two convolution operators (conv1, conv2), group normalization (Norm) and Sigmoid Linear Units (SiLU). During
evaluation, the residual map between unhealthy brain images and their healthy reconstructions is used for anomaly
detection.

2 Recent Work

Autoencoders (AE) have been the primary focus of recent research on reconstruction-based UAD in brain MRI. Although
these models exhibit potential in capturing the underlying healthy distribution, their effectiveness in UAD is limited by
their blurry reconstructions Baur et al. (2021). To overcome this limitation, researchers have focused on improving the
representations and reconstructions by adding skip connections with dropout Baur et al. (2020a), using multi-scale
features Baur et al. (2020b), or utilizing feature activation maps Silva-Rodríguez et al. (2022). Furthermore, online
outlier removal has been proposed for AEs Behrendt et al. (2022b). In parallel, Variational Autoencoders (VAE) have
been investigated for the UAD task Zimmerer et al. (2019a), focusing on enhancing the used context in 2D Zimmerer
et al. (2019b) and 3D Bengs et al. (2021); Behrendt et al. (2022a) or utilizing restoration methods Chen et al. (2020).
Also, Generative Adversarial Networks (GAN) have been proposed for UAD either as pure GAN Han et al. (2021);
Schlegl et al. (2019) or in combination with VAEs Baur et al. (2018) and VQ-VAEs Pinaya et al. (2022b).
While AEs with skip connections and a spatial latent space enable reconstructions of high fidelity, they tend to perform
a ’copy task’ which enables the reconstruction of unhealthy anatomy and therefore contradicts the UAD principle
Baur et al. (2021); Bercea et al. (2023b). Lately, Kascenas et al. (2022) have shown that AEs with skip connections
can be effectively used for UAD in brain MRI if they are regularized by an additional denoising task. Congruently,
DDPMs have shown promise in the field of UAD Wyatt et al. (2022); Pinaya et al. (2022a); Graham et al. (2022).
DDPMs provide high reconstruction fidelity, but due to the noising process, important information about the input
image can be lost. To address this, Behrendt et al. (2023) proposed patch-based DDPMs, that allow the use of parts
of the original image content to provide information for the reconstruction of the input image. However, using this
patching strategy increases complexity and computational effort and can lead to artifacts in regions of overlapping
patches. A more efficient approach is seen in conditioning the denoising process of DDPMs with knowledge of the input
image. Conditioned DDPMs have been successful in text-to-image synthesis tasks Rombach et al. (2022); Dhariwal
and Nichol (2021) and image-guided synthetic image generation Saharia et al. (2022); Wang et al. (2022); Wolleb
et al. (2022). However, in the specific case of UAD, the objective is not to generate new images or to transfer styles,
but to accurately estimate a given input image while ensuring that unhealthy anatomy is absent in the estimation.
Thus, directly conditioning DDPMs with information from the input image can pose a risk of reconstructing unhealthy
anatomy. Therefore, to achieve the UAD task, we develop a conditioning approach for DDPMs that can effectively
provide the denoising process with relevant context information of the individual input image without enabling the
reconstruction of unhealthy anatomy.

3 Methods

We propose cDDPMs where we use an image encoder network and embed the input image in a context vector $c \in \mathbb{R}^d$
to condition the denoising Unet on meaningful features of the input image. Our motivation is that the additional
information in c guides the generation process towards consistent intensity characteristics across the input image
and its reconstruction. Hence, by introducing the context vector c we aim to recover local intensity information that
is lost during the forward (noising) process of DDPMs. We utilize an image encoder with a dense latent space to
extract information regarding the coarse shape and local intensity information of the noise-free input image. This latent
representation can then be used to condition the denoising process and supplement the individual context of the input
image without providing detailed pixel-wise information that could be used to perform a copy task. A general depiction
of our approach is shown in Fig. 1.
3.1 DDPMs

DDPMs are generative models that learn the underlying data distribution of images $x \in \mathbb{R}^{H,W,C}$ with height H, width
W and C channels, given a training set. Training of DDPMs consists of two steps: the forward process, where an input
image $x_0$ is gradually transformed to Gaussian noise $x_T = \epsilon \sim \mathcal{N}(0, I)$, and the backward process, where the reversal of the
forward process is learned.
In the forward process, transforming $x_0$ to $x_T$ follows a predefined schedule $\beta_1, ..., \beta_T$, where intermediate versions
$x_t$ are derived as
$$x_t \sim q(x_t | x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t} x_0, (1 - \bar{\alpha}_t) I), \quad \text{with } \bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s).$$
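As a minimal sketch of the forward process, the following NumPy snippet computes $\bar{\alpha}_t$ from a schedule and draws a noised sample $x_t$ in closed form. The linear schedule values, the function names `make_alpha_bar` and `q_sample`, and the random stand-in image are assumptions for illustration. Gaussian noise is used here to match the equation, whereas the experiments in Section 4.4 use structured simplex noise.

```python
import numpy as np

def make_alpha_bar(betas):
    """Cumulative product of (1 - beta_s) for s = 1..t; index 0 holds alpha_bar_0 = 1."""
    alpha_bar = np.cumprod(1.0 - np.asarray(betas, dtype=np.float64))
    return np.concatenate(([1.0], alpha_bar))

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I) and return the noise too."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

T = 1000
betas = np.linspace(1e-4, 2e-2, T)   # assumed linear schedule, not taken from the paper
alpha_bar = make_alpha_bar(betas)

rng = np.random.default_rng(0)
x0 = rng.random((96, 96))            # random stand-in for one image slice
x_t, eps = q_sample(x0, 500, alpha_bar, rng)
```

With this indexing, `t = 0` returns the input unchanged and `t = T` yields an image that is almost pure noise, matching the edge cases of the forward process.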

The time step t controls the amount of added noise and is sampled from $t \sim \mathrm{Uniform}(1, ..., T)$. In the edge cases, the
image is transformed to pure noise ($t = T$) or no transformation is applied ($t = 0$). In the backward process, the
reconstructed image $x_0^{rec}$ is recovered from $x_t$ by
$$x_0^{rec} \sim p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} | x_t), \quad \text{with } p_\theta(x_{t-1} | x_t) = \mathcal{N}(\mu_\theta(x_t, t), \Sigma_\theta(x_t, t)).$$
Here, following Ho et al. (2020), $\mu_\theta$ is estimated by a Unet Ronneberger et al. (2015) with trainable parameters $\theta$,
and $\Sigma_\theta(t) = \Sigma(t) = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t I$ is fixed. Variational inference is used to achieve a tractable loss function and the
variational lower bound (VLB) is derived as
$$L_{VLB} = -\log(p_\theta(x_0)) + D_{KL}(q(x_{1:T} | x_0) \,||\, p_\theta(x_{1:T} | x_0)),$$
which can be reformulated to
$$L_{simple} = ||\epsilon - \epsilon_\theta(x_t, t)||^2$$
by applying simplifications and by conditioning the denoising step on $x_0$, as shown in Ho et al. (2020). In our work,
instead of predicting the noise $\epsilon$ we perform the equivalent task of directly estimating $x_0^{rec} = x_t - \epsilon$. Hence, we derive
our loss function as
$$L_{rec} = |x_0 - x_0^{rec}|.$$
To generate new images with DDPMs, typically, the backward step is applied in a step-wise fashion, to gradually
denoise a random noise vector. For the given UAD task, we do not aim for the generation of new images but to estimate
healthy brain anatomy given an input image. Therefore, we directly estimate $x_0^{rec}$ given $x_t$ at test time, as is done in
Behrendt et al. (2023). The time step $t_{test} < T$ controls the level of noise to remove from $x_t$ at test time. Optionally,
to become agnostic to the noise magnitude, we use an ensemble of different values $t_{test} = [250, 500, 750]$ and average
the reconstructions of each noise level, similar to Graham et al. (2022).
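The test-time procedure described above can be sketched as follows: the input is noised to each level in the ensemble, the clean image is estimated in a single step, and the estimates are averaged. The callable `denoise_fn` stands in for the trained denoising network and is replaced by a trivial placeholder here. The linear schedule and all function names are assumptions for illustration only.

```python
import numpy as np

def reconstruct(x0, denoise_fn, alpha_bar, t_values, rng):
    """Noise x0 to each level t, estimate the clean image in one step, and average."""
    recs = []
    for t in t_values:
        eps = rng.standard_normal(x0.shape)
        x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
        recs.append(denoise_fn(x_t, t))
    return np.mean(recs, axis=0)

# Toy setup: assumed linear schedule; index 0 holds alpha_bar_0 = 1.
T = 1000
alpha_bar = np.concatenate(([1.0], np.cumprod(1.0 - np.linspace(1e-4, 2e-2, T))))

def dummy_denoiser(x_t, t):
    # Placeholder for the trained Unet: it only clips values to [0, 1].
    return np.clip(x_t, 0.0, 1.0)

rng = np.random.default_rng(0)
x0 = rng.random((96, 96))                      # random stand-in for one image slice
x_rec = reconstruct(x0, dummy_denoiser, alpha_bar, [250, 500, 750], rng)
residual = np.abs(x0 - x_rec)                  # residual map used for anomaly scoring
```

Averaging over several noise levels trades additional forward passes for a reconstruction that depends less on one particular choice of $t_{test}$.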
3.2 Context-conditioned DDPMs

A general depiction of our conditioning approach is provided in Fig. 1. Formally, we condition the backward process of
DDPMs on a context vector c as follows
$$x_0^{rec} \sim p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} | x_t, c), \quad \text{with } p_\theta(x_{t-1} | x_t, c) = \mathcal{N}(\mu_\theta(x_t, t, c), \Sigma(t)).$$
We use an image encoder $F_{enc}$ to obtain a latent representation $c = F_{enc}(x_0)$ of the input image $x_0$, where $c \in \mathbb{R}^d$
with d as conditioning dimension.
To integrate the context vector c, we modify the denoising Unet of the DDPM. Specifically, we individually adapt the
features $f_i \in \mathbb{R}^{H_i, W_i, C_i}$ at each level of the denoising Unet based on c, where $H_i$, $W_i$ and $C_i$ are the respective feature
map dimensions. To achieve this, we adapt the time step conditioning of DDPMs as follows. First, the time step t is
encoded using a sinusoidal position embedding. Next, the encoded time step is projected to a vector $c_t \in \mathbb{R}^d$ by a multi-layer perceptron
(MLP I). Subsequently, we concatenate the context vector c and the time step vector $c_t$, resulting in a conditioning
vector $c' \in \mathbb{R}^{2 \cdot d}$. Finally, $c'$ is projected to $c'_{proj} \in \mathbb{R}^{2 \cdot C_i}$ by another multi-layer perceptron (MLP II) at each feature
level i. The vector $c'_{proj}$ is then split in half, where the first and last $C_i$ elements represent the scaling factor $\gamma$ and the
shift value $\beta$, respectively. Inspired by Perez et al. (2018), the variables $\gamma$ and $\beta$ are then used to transform the individual feature
maps as $f'_i = f_i * (\gamma + 1) + \beta$ in each residual block.
In this way, we can learn both the feature extraction $c = F_{enc}(x_0)$ and the individual feature adjustments of the denoising
Unet, based on the extracted context vector c, in an end-to-end fashion. Optionally, to achieve a meaningful starting
point for the calculation of the context vector $c = F_{enc}(x_0)$, we pre-train the feature extraction of the image encoder
$F_{enc}$, which is described in the next section.
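The conditioning steps above can be illustrated with a small NumPy sketch for a single feature level. It is not the released implementation: the two-layer MLPs, the random weights, the channel-first layout `(C_i, H_i, W_i)` and all function names are assumptions chosen to keep the example self-contained.

```python
import numpy as np

def sinusoidal_embedding(t, dim):
    """Sinusoidal position embedding of a scalar time step t (dim must be even)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    return np.concatenate([np.sin(t * freqs), np.cos(t * freqs)])

def mlp(x, w1, b1, w2, b2):
    """Two-layer perceptron with SiLU activation (assumed architecture)."""
    h = x @ w1 + b1
    h = h / (1.0 + np.exp(-h))            # SiLU: h * sigmoid(h)
    return h @ w2 + b2

def film(features, c, t, params_time, params_proj):
    """Scale and shift feature maps of shape (C_i, H_i, W_i) based on c and t."""
    d = c.shape[0]
    c_t = mlp(sinusoidal_embedding(t, d), *params_time)   # MLP I, shape (d,)
    c_prime = np.concatenate([c, c_t])                    # shape (2d,)
    c_proj = mlp(c_prime, *params_proj)                   # MLP II, shape (2 * C_i,)
    gamma, beta = np.split(c_proj, 2)                     # first / last C_i elements
    return features * (gamma[:, None, None] + 1.0) + beta[:, None, None]

def init_mlp(rng, n_in, n_hidden, n_out):
    """Random weights for illustration only."""
    return (rng.normal(0, 0.02, (n_in, n_hidden)), np.zeros(n_hidden),
            rng.normal(0, 0.02, (n_hidden, n_out)), np.zeros(n_out))

rng = np.random.default_rng(0)
d, C_i = 128, 256
c = rng.standard_normal(d)                 # stand-in for F_enc(x_0)
f_i = rng.standard_normal((C_i, 12, 12))   # feature maps at one Unet level
out = film(f_i, c, t=500,
           params_time=init_mlp(rng, d, d, d),
           params_proj=init_mlp(rng, 2 * d, 2 * d, 2 * C_i))
```

Because the scale enters as $\gamma + 1$, a projection that outputs zeros leaves the feature maps unchanged, so the conditioning starts from an identity transformation when the last projection layer is close to zero.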
3.3 Pre-Training

We utilize a generative pre-training strategy for $F_{enc}$. More precisely, we utilize masked pre-training where typically
transformer-based AEs are trained to reconstruct an image where a large fraction of patches are masked out He et al.
(2022). We utilize the SparK framework Tian et al. (2023), where sparse convolutions and hierarchical features are
used to enable the masked pre-training for CNNs. During pre-training, we utilize the same healthy training set as for
the main training task to learn the general feature representations that are required to capture important information of
the MRI scans. After the pre-training stage, we discard the decoder and only use $F_{enc}$ and fine-tune it along with the
denoising Unet during the main training stage of the cDDPM.

4 Experimental Setup

4.1 Data Sets

Following the principle of UAD, we train our models for the reconstruction task on healthy data only (IXI). At test time,
we evaluate the models’ anomaly detection ability on unhealthy test sets of various pathologies (BraTS21, ATLAS,
MSLUB, WMH).
4.1.1 Training Data

We use the publicly available IXI data set1 as our healthy reference data set for training. This data set includes 560
3D brain MRI scans, collected from three different medical facilities. Of these scans, 158 samples are set aside
for testing, while the remaining data is divided into 5 folds, each containing 358 training samples and 44 validation
samples for cross-validation.

1 https://brain-development.org/ixi-dataset/

4.1.2 Evaluation Data

For evaluation, we utilize four different publicly available data sets that contain different types of pathologies and the
corresponding manual expert annotations:
1. Multimodal Brain Tumor Segmentation Challenge 2021 (BraTS21) Baid et al. (2021); Menze et al. (2014);
Bakas et al. (2017)
2. Multiple sclerosis data set from the University Hospital of Ljubljana (MSLUB) Lesjak et al. (2018)
3. Anatomical Tracings of Lesions After Stroke v2.0 (ATLAS) Liew et al. (2022)
4. White Matter Hyperintensity (WMH) data set Kuijf et al. (2019)
The BraTS21 data set includes 2040 3D brain routine MRI scans of patients with glioma with a pathologically confirmed
diagnosis. Accompanying the MRI scans, annotations from expert neuroradiologists are provided for 1251 scans that
delineate tumor sub-regions as categorical masks. In this work, we fuse all sub-regions to obtain a binary segmentation
mask to evaluate the anomaly detection task. All scans are available as T1-weighted volumes with and without contrast
enhancement (T1-CE, T1) and T2-weighted or T2 fluid attenuated inversion recovery (T2, FLAIR) volumes. The 1251
annotated samples are split into an unhealthy validation set of 100 samples and an unhealthy test set of 1151 samples.
The MSLUB data set includes 3D brain MRI scans of 30 patients with multiple sclerosis (MS) lesions. For each patient,
along with the T1, T2 and FLAIR MRI scans, ground truth annotations derived from multi-rater
consensus are available. The data is split into an unhealthy validation set of 10 samples and an unhealthy test set of 20 samples.
The ATLAS data set consists of 655 T1-weighted MRI scans of stroke patients, collected from 44 research cohorts. The
stroke lesions are annotated by domain experts and binary segmentation masks are provided. The data is split into an
unhealthy validation set of 175 samples and an unhealthy test set of 480 samples.
The WMH data set consists of 60 MRI scans of patients with white matter hyperintensities from three different
institutions and scanner types. WMH segmentation masks are derived from the consensus of two expert radiologists.
The data set is split into an unhealthy validation set of 15 samples and an unhealthy test set of 45 samples.
Across the data sets, different weightings are available. For the BraTS21 and MSLUB data sets, multiple weightings are
available (T1, T2, FLAIR); for the ATLAS data set, only T1 images are available; and for the WMH data set, T1 and FLAIR images
exist. As our training data set contains T1 and T2 images of each patient, we train our models on both weightings
separately and evaluate the BraTS21 and MSLUB data sets on T2 images, while for ATLAS and WMH, T1 images are
used.
An overview of the data set sizes is provided in Table 1.
Table 1: Data set Information regarding the number of samples per data split. The IXI data set contains only healthy
brain MRI scans and is considered as training set and to test the overall reconstruction quality of healthy brain anatomy.
The remaining data sets are used to evaluate the domain adaptation and segmentation performance.
| Data set | Healthy Training Samples | Healthy Validation Samples | Healthy Test Samples |
|---|---|---|---|
| IXI | 358 | 44 | 158 |
| | | Unhealthy Validation Samples | Unhealthy Test Samples |
| BraTS21 | - | 100 | 1151 |
| MSLUB | - | 10 | 20 |
| ATLAS | - | 175 | 480 |
| WMH | - | 15 | 45 |

4.2 Pre-Processing

We pre-process the images according to established pre-processing strategies for UAD in brain MRI Baur et al. (2021).
First, we resample all MRI scans to the isotropic resolution of 1 mm × 1 mm × 1 mm using cubic spline interpolation.
Second, we register all MRI scans to the SRI24-Atlas. Third, we remove the skull from the MRI scans by skull stripping
with HD-BET Isensee et al. (2019). Subsequently, we cut black borders and perform N4 bias field correction. Finally,
we pad the images to a unified size of 192 × 192 × 160. To save computational resources, we reduce volume resolution
by a factor of two and remove the 15 top and bottom slices parallel to the transverse plane leading to a final resolution
of 96 × 96 × 50 voxels.
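The size arithmetic of the last two steps can be checked with a short sketch. Plain strided subsampling is used here as a stand-in for the resolution reduction, and the last array axis is assumed to index the transverse slices. Neither detail is stated in the text, and `reduce_volume` is a hypothetical helper name.

```python
import numpy as np

def reduce_volume(vol, factor=2, n_slices=15):
    """Subsample by striding, then drop the first and last transverse slices."""
    small = vol[::factor, ::factor, ::factor]
    return small[:, :, n_slices:-n_slices]

vol = np.zeros((192, 192, 160), dtype=np.float32)   # padded volume
reduced = reduce_volume(vol)
print(reduced.shape)  # (96, 96, 50)
```

Halving 192 × 192 × 160 gives 96 × 96 × 80, and removing 15 slices at each end of the third axis leaves 50 slices.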
4.3 Post-Processing

At test time, we derive a binary segmentation map from the residual map $R = |x_0 - x_0^{rec}|$, where regions of high
residuals indicate anomalies. To binarize R, we first apply the following post-processing steps that are commonly used
in the field of UAD in brain MRI Baur et al. (2021); Kascenas et al. (2022); Behrendt et al. (2023); Zimmerer et al.
(2019b). First, a median filter with a kernel size of K = 5 × 5 × 5 is applied to smooth the residual map and to remove
smaller false positives. Subsequently, we perform brain mask erosion for 3 iterations. This step is mainly applied to
filter out residuals that occur due to poor reconstructions at sharp edges near the brain mask Baur et al. (2021). We then
perform a greedy threshold search. Here, the test threshold is determined by searching for the best Dice score across
different thresholds on the unhealthy validation set, as proposed in Zimmerer et al. (2019b). After binarization, we
use connected component filtering to remove areas that include fewer than 7 voxels. We note that this post-processing
step aims to remove false-positive predictions that are much smaller than the anomalies considered in this study. We provide a
systematic comparison of the post-processing strategies for each data set in the supplemental material.
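The greedy threshold search can be sketched as below, assuming the residual maps have already been filtered and masked. The threshold grid, the toy validation data and the function names are assumptions for illustration. The median filter, mask erosion and connected component filtering are omitted.

```python
import numpy as np

def dice(pred, target):
    """Dice score of two boolean arrays; defined as 1.0 if both are empty."""
    inter = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    return 2.0 * inter / denom if denom > 0 else 1.0

def search_threshold(residuals, masks, thresholds):
    """Return the threshold with the best mean Dice over the validation volumes."""
    best_t, best_score = None, -1.0
    for t in thresholds:
        score = np.mean([dice(r > t, m) for r, m in zip(residuals, masks)])
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Toy validation set: low background residuals, high residuals inside the lesion.
rng = np.random.default_rng(0)
masks, residuals = [], []
for _ in range(3):
    m = np.zeros((8, 8, 8), dtype=bool)
    m[2:5, 2:5, 2:5] = True
    r = rng.random(m.shape) * 0.2      # background residuals in [0, 0.2)
    r[m] += 0.5                        # lesion residuals in [0.5, 0.7)
    masks.append(m)
    residuals.append(r)

best_t, best_dice = search_threshold(residuals, masks, np.linspace(0.0, 1.0, 21))
```

The selected threshold is then kept fixed and applied to the residual maps of the unhealthy test set.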
4.4 Implementation Details

In our study, we compare our proposed method, called cDDPM, with multiple established baselines for UAD in brain
MRI. The baselines include AE, VAE Baur et al. (2021), its sequential extension SVAE Behrendt et al. (2022a), and
denoising AEs DAE Kascenas et al. (2022). We also compare with simple thresholding Thresh Meissen et al. (2022)
and the GAN-based f-AnoGAN Schlegl et al. (2019). For a direct comparison, we also include DDPM Wyatt et al. (2022)
and pDDPM Behrendt et al. (2023) as counterparts to our proposed method.
We adapt the baseline implementations by tuning hyper-parameters based on the unhealthy validation set if required to
improve training stability and performance. We set $\beta_{VAE}$ to 0.001 for VAE and SVAE. For f-AnoGAN, we set the latent
size to 128 and the learning rate to 1e-4.
For DDPM, pDDPM and cDDPM, the following adaptations are applied. We use structured simplex noise, as it has
been shown to strongly improve the UAD performance on MRI images Wyatt et al. (2022). Furthermore, we uniformly
sample $t \in [1, T]$ with $T = 1000$ and either use a fixed value of $t_{test} = \frac{T}{2} = 500$, or an ensemble of different values
$t_{test} = [250, 500, 750]$ and average the individual reconstructions of each noise level at test time. The denoising
network for all DDPM-based approaches is a Unet similar to Dhariwal and Nichol (2021), with channel dimensions of
[128, 256, 256]. As encoder network $F_{enc}$, we utilize a ResNet backbone with a fully connected layer to match the
target dimension of $c \in \mathbb{R}^d$ with $d = 128$ as conditioning dimension. During pre-training, a mask-out ratio of 65 % is
used. For data augmentation, we utilize random blur (p=0.25), random bias (p=0.25), random gamma (p=0.5) and random ghosting (p=0.5)
from the torchio library Pérez-García et al. (2021). If not specified otherwise, all models are trained for a maximum of
1600 epochs on NVIDIA V100 (32GB) GPUs, using Adam as optimizer, a learning rate of 1e-5, and a batch size of
32. The best model checkpoint, as determined by performance on the healthy validation set, is used for testing. The
volumes are processed in a slice-wise manner, uniformly sampling slices with replacement during training and iterating
over all slices to reconstruct the full volume at test time. We implement all models in PyTorch (v0.10)2.
4.5 Experiments and Evaluation

To assess the reconstruction quality, the domain adaptation and the final UAD performance, we conduct various
experiments on different data sets, specified in the following.
4.5.1 Reconstruction Quality:

To evaluate the overall reconstruction quality, we utilize the held-out test set of the healthy IXI data set and calculate
similarity metrics between input and reconstruction. We consider the Structural Similarity Index Measure (SSIM) Wang
et al. (2004), the Peak Signal To Noise Ratio (PSNR) and the Learned Perceptual Image Patch Similarity (LPIPS) as
metrics to assess the reconstruction quality. For the feature-based LPIPS metric Zhang et al. (2018), features are extracted
by a ResNet-based network, pre-trained on 3D medical data Chen et al. (2019). Furthermore, we report the overall
l1-error for the healthy data set. Since, for UAD, only healthy anatomy should be estimated with high reconstruction
quality, it is of interest to consider the l1 error of healthy and unhealthy anatomy separately, given the unhealthy
evaluation data sets. Therefore, we calculate the l1-error for both healthy and unhealthy anatomy, as indicated by
the annotation masks, and calculate an l1-ratio as follows:
$$l1\text{-ratio} = \frac{l1_{unhealthy}}{l1_{healthy}}.$$

2 Code available at https://github.com/FinnBehrendt/Conditioned-Diffusion-Models-UAD

A higher value for the l1-ratio indicates that the model successfully reconstructs the healthy anatomy while struggling
to reconstruct the unhealthy parts of the input, and vice versa. This ratio serves as a metric to assess the model’s
performance in capturing the distinction between healthy and unhealthy anatomical structures.
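A minimal sketch of the l1-ratio for a single volume is given below. Restricting the healthy region to voxels inside a brain mask is an assumption made here so that the empty background does not dominate the healthy error. The function name and the toy data are likewise illustrative.

```python
import numpy as np

def l1_ratio(x, x_rec, anomaly_mask, brain_mask):
    """Mean absolute error inside the annotation divided by the mean
    absolute error in the remaining brain voxels."""
    err = np.abs(x - x_rec)
    unhealthy = anomaly_mask & brain_mask
    healthy = ~anomaly_mask & brain_mask
    return err[unhealthy].mean() / err[healthy].mean()

# Toy example: reconstruction error 0.5 in the lesion, 0.1 elsewhere in the brain.
x = np.zeros((8, 8, 8))
brain = np.ones_like(x, dtype=bool)
lesion = np.zeros_like(brain)
lesion[2:4, 2:4, 2:4] = True
x_rec = np.full_like(x, 0.1)
x_rec[lesion] = 0.5

ratio = l1_ratio(x, x_rec, lesion, brain)   # close to 5 for this toy case
```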
4.5.2 Domain Adaptation:

To investigate the domain adaptation ability of our proposed approach, we utilize both healthy, in-domain data from the
IXI data set and unhealthy, out-of-domain data from the BraTS21 data set. To evaluate the effect of the conditioning
mechanism, we utilize the IXI data set and simulate different levels of conditioning information and domain
shift to investigate the reconstructions qualitatively. To this end, we alter the available information of the image that is fed
to the image encoder to condition the cDDPM: we crop the conditioning image at a given width of 50%
and 100%, where 100% indicates that the full input image is used as the conditioning image. Furthermore, we simulate
different contrast levels $cl \in \{0.3, 0.7, 1.0, 1.5, 2.0\}$. The images of different contrast levels are derived by
raising the gray values to the power of the respective contrast level, i.e. $x_0^{cl=2} = x_0^2$. In addition to the qualitative evaluation of
reconstructed, simulated data, we quantitatively assess the domain shift across input and reconstruction by comparing
their intensity histograms. To this end, we first calculate and plot the histograms, partitioning the intensity
values into 500 bins and dividing the raw count by the total number of counts and the bin width. For a quantitative analysis
of the distance between the intensity distributions, we calculate the Kullback-Leibler Divergence (KLD) as follows:
$$KLD = \left[-\sum_i p_{input} \log(p_{input})\right] - \left[-\sum_i p_{reconstruction} \log(p_{reconstruction})\right]$$

where p = [p1 , p2 , . . . , pn ] represents each intensity distribution.
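The contrast simulation and the histogram comparison can be sketched as follows. The snippet evaluates the two bracketed terms exactly as written in the formula above. It assumes intensities scaled to [0, 1] and skips empty bins in the logarithm, neither of which is specified in the text. The function names are hypothetical.

```python
import numpy as np

def simulate_contrast(x, cl):
    """Raise gray values (assumed to lie in [0, 1]) to the power of the contrast level."""
    return np.power(x, cl)

def intensity_histogram(x, bins=500, value_range=(0.0, 1.0)):
    """Counts divided by the total count and the bin width."""
    hist, edges = np.histogram(x, bins=bins, range=value_range, density=True)
    return hist, edges

def neg_plogp_sum(p):
    """-sum_i p_i log(p_i), skipping empty bins."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def histogram_distance(x_input, x_rec, bins=500):
    """Difference of the two bracketed terms in the formula above."""
    p_in, _ = intensity_histogram(x_input, bins)
    p_rec, _ = intensity_histogram(x_rec, bins)
    return neg_plogp_sum(p_in) - neg_plogp_sum(p_rec)

rng = np.random.default_rng(0)
x = rng.random((96, 96))                   # random stand-in for an input slice
x_dark = simulate_contrast(x, 2.0)         # simulated contrast level cl = 2
d_same = histogram_distance(x, x)          # identical histograms
d_shift = histogram_distance(x, x_dark)    # shifted intensity distribution
```

Using `density=True` in `np.histogram` performs the normalization described in the text, so the histogram values integrate to one over the intensity range.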
4.5.3 Segmentation Performance:

To assess the segmentation performance for the UAD task, we utilize all unhealthy test sets. We report the average Dice
score across all predicted anomaly maps (DICE). The formula for the DICE score is given by:
$$DICE = \frac{2 \cdot |A \cap B|}{|A| + |B|}$$

where A and B are the predicted anomaly map and the ground truth annotation, respectively.
To obtain a metric that is independent of the chosen threshold, we additionally calculate the Area Under Precision-Recall
Curve (AUPRC) as follows:
$$AUPRC = \sum_r (R(r) - R(r - 1)) \cdot P(r).$$

Here, R(r) represents the recall at a given threshold or rank r, and P (r) represents the precision at the corresponding
recall R(r). The sum is taken over all thresholds or ranks r at which the precision and recall are computed.
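The AUPRC sum can be written directly in NumPy by treating every voxel as one rank, ordered by decreasing anomaly score, with $R(0) = 0$. This is a sketch of the formula rather than the evaluation code of the study. Tied scores are not merged into a single threshold here, and the function name is an assumption.

```python
import numpy as np

def auprc(scores, labels):
    """Sum over ranks r of (R(r) - R(r-1)) * P(r), with R(0) = 0."""
    scores = np.asarray(scores, dtype=float).ravel()
    labels = np.asarray(labels, dtype=bool).ravel()
    order = np.argsort(-scores, kind="stable")      # descending anomaly score
    hits = labels[order]
    tp = np.cumsum(hits)                            # true positives up to rank r
    precision = tp / np.arange(1, hits.size + 1)
    recall = tp / hits.sum()
    recall_prev = np.concatenate(([0.0], recall[:-1]))
    return float(np.sum((recall - recall_prev) * precision))

# Worked example: positives at ranks 1 and 3 give 0.5 * 1 + 0.5 * (2/3) = 5/6.
value = auprc([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```

Only ranks at which the recall increases contribute to the sum, so the result is the average of the precision values at the positions of the true anomalous voxels.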

4.5.4 Statistical Testing:

To conduct significance tests, we utilize the MLXtend library’s permutation test Raschka (2018) with 10,000 rounds of
permutations and a significance level of α = 5%. This test computes the mean difference of the considered scores of
two models for each permutation, and the resulting p-value is computed by counting the number of times the mean
differences were equal to or greater than the sample differences, divided by the total number of permutations.
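The procedure described above can be sketched in a few lines. This is an approximate permutation test written from the description in the text, not the MLXtend implementation. The two-sided absolute difference, the seed and the toy scores are assumptions.

```python
import numpy as np

def permutation_test(a, b, num_rounds=10000, seed=0):
    """Approximate permutation test on the absolute difference of means."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(num_rounds):
        perm = rng.permutation(pooled)              # random relabeling of the scores
        diff = abs(perm[:a.size].mean() - perm[a.size:].mean())
        count += int(diff >= observed)
    return count / num_rounds

# Toy Dice scores of two hypothetical models over ten test volumes each.
scores_a = [0.61, 0.63, 0.60, 0.64, 0.62, 0.65, 0.61, 0.63, 0.62, 0.64]
scores_b = [0.51, 0.53, 0.50, 0.54, 0.52, 0.55, 0.51, 0.53, 0.52, 0.54]
p_value = permutation_test(scores_a, scores_b)
significant = p_value < 0.05                        # significance level alpha = 5%
```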

5 Results

In this section, we first compare the overall reconstruction quality. Second, we evaluate the domain adaptation properties
of our approach and, lastly, we report the segmentation performance of all models.

Behrendt et al. Conditioned Diffusion Models for UAD in Brain MRIs

P REPRINT

Table 2: Comparison of the reconstruction quality of the different models with the best results highlighted in bold.
The asterisk * denotes superior performance with statistical significance compared to all baselines (p < 0.05). For
all metrics, the mean ± standard deviation across the different folds is reported. The arrows ↑ and ↓ indicate that
higher and lower values are favorable, respectively. The l1-ratio is derived by dividing the l1-error of unhealthy
anatomy by the l1-error of healthy anatomy. DDPM-based models are evaluated by ensembling different values for
ttest = [250, 500, 750].

| Model        | IXI (T2) SSIM ↑ | IXI (T2) PSNR ↑ | IXI (T2) LPIPS (e-3) ↓ | IXI (T2) l1-error (e-3) ↓ | BraTS21 (T2) l1-ratio ↑ | MSLUB (T2) l1-ratio ↑ | ATLAS (T1) l1-ratio ↑ | WMH (T1) l1-ratio ↑ |
|--------------|-----------------|-----------------|------------------------|---------------------------|-------------------------|-----------------------|-----------------------|---------------------|
| VAE          | 74.98 ± 0.54    | 23.38 ± 0.14    | 4.03 ± 0.50            | 32.32 ± 0.64              | 3.52 ± 0.08             | 2.92 ± 0.06           | 4.43 ± 0.03           | 2.36 ± 0.04         |
| SVAE         | 77.87 ± 0.15    | 23.94 ± 0.06    | 3.31 ± 0.24            | 29.08 ± 0.16              | 3.90 ± 0.05             | 3.13 ± 0.05           | 3.38 ± 0.11           | 2.07 ± 0.01         |
| AE           | 76.11 ± 0.27    | 23.41 ± 0.14    | 3.19 ± 0.54            | 31.67 ± 0.41              | 3.84 ± 0.17             | 3.26 ± 0.18           | 4.40 ± 0.07           | 2.36 ± 0.04         |
| DAE          | 98.69 ± 0.15*   | 36.69 ± 0.38*   | 0.14 ± 0.01            | 8.14 ± 0.17*              | 7.17 ± 0.63             | 2.69 ± 0.15           | 4.51 ± 0.15           | 2.99 ± 0.14         |
| DDPM         | 93.96 ± 0.37    | 31.79 ± 0.26    | 0.49 ± 0.14            | 14.29 ± 0.32              | 6.16 ± 0.53             | 3.37 ± 0.24           | 5.00 ± 0.23           | 3.16 ± 0.15         |
| pDDPM        | 96.62 ± 0.25    | 34.58 ± 0.39    | 0.09 ± 0.04*           | 9.70 ± 0.43               | 7.16 ± 0.15             | 4.34 ± 0.13           | 5.58 ± 0.28           | 3.00 ± 0.16         |
| cDDPM (Ours) | 96.80 ± 0.19    | 34.87 ± 0.23    | 0.11 ± 0.05            | 9.68 ± 0.16               | 7.43 ± 0.17             | 4.49 ± 0.18           | 5.69 ± 0.27           | 3.12 ± 0.08         |

Figure 2: Simulating the conditioning effect of the cDDPM for 50% and 100% noise in the input image. The first
block shows the input image that is fed to the DDPM or cDDPM. The second block shows the reconstructions
of the cDDPM for different conditioning inputs at a noise level of 50%. The third block compares the
reconstructions of cDDPMs and DDPMs at a noise level of 100%. From top to bottom, the contrast level
of the conditioning and input image is increased for all columns.

5.1 Reconstruction Quality

In Table 2, we compare the baseline models regarding their ability to reconstruct healthy anatomy, given the held-out
test set of the IXI data set. Overall, DAEs, pDDPMs and cDDPMs show high performance regarding the image-based
similarity metrics. In contrast, lower reconstruction quality is reported for the dense autoencoder-based baselines.
Comparing the DDPM-based approaches, both pDDPM and cDDPM outperform the baseline DDPM in terms of

Figure 3: Comparison of the histograms for input-reconstruction pairs of the healthy IXI (left) and the unhealthy
BraTS21 (right) data set with original and augmented contrast. The top row shows the baseline DDPM without
conditioning and the bottom row our proposed cDDPM with conditioning. The Kullback-Leibler divergence (KLD) for
both histograms is indicated within each plot, respectively (lower values are preferable). Both models are evaluated by
ensembling different values for ttest = [250, 500, 750].

Table 3: Comparison of the evaluated models with the best results highlighted in bold. The asterisk * denotes superior
performance with statistical significance compared to all baselines (p < 0.05). For all metrics, the mean ± standard
deviation across the different folds is reported. A checkmark at SSL denotes that the encoder is pre-trained by self-supervision. A checkmark at ENS denotes the ensembling of different values for ttest = [250, 500, 750]. Otherwise, a
fixed value of ttest = 500 is used for DDPM-based models.

| Model        | ENS | SSL | BraTS21 (T2) DICE [%] | BraTS21 (T2) AUPRC [%] | MSLUB (T2) DICE [%] | MSLUB (T2) AUPRC [%] | ATLAS (T1) DICE [%] | ATLAS (T1) AUPRC [%] | WMH (T1) DICE [%] | WMH (T1) AUPRC [%] |
|--------------|-----|-----|-----------------------|------------------------|---------------------|----------------------|---------------------|----------------------|-------------------|--------------------|
| Thresh       |     |     | 19.69                 | 20.27                  | 6.21                | 4.23                 | 4.41                | 1.71                 | 9.38              | 4.72               |
| AnoGAN       |     |     | 23.99±6.15            | 21.08±6.23             | 4.86±2.02           | 3.77±1.32            | 9.91±1.80           | 9.17±1.41            | 6.25±1.00         | 3.45±0.33          |
| VAE          |     |     | 28.81±1.26            | 25.72±1.54             | 6.16±0.58           | 4.46±0.20            | 15.08±0.28          | 15.27±0.30           | 5.23±0.82         | 4.16±0.13          |
| SVAE         |     |     | 31.93±0.42            | 30.30±0.52             | 6.04±0.26           | 4.81±0.12            | 10.49±0.67          | 10.06±0.55           | 6.64±0.01         | 3.11±0.04          |
| AE           |     |     | 31.51±1.94            | 28.80±1.92             | 7.23±0.90           | 5.71±0.90            | 14.91±0.33          | 14.76±0.46           | 4.53±0.36         | 4.16±0.19          |
| DAE          |     |     | 45.37±4.40            | 49.38±4.68             | 3.88±1.35           | 4.47±0.78            | 8.53±0.28           | 12.45±0.94           | 7.31±1.02         | 6.32±0.88          |
| DDPM         |     |     | 44.25±1.49            | 49.98±2.41             | 4.80±1.98           | 6.36±1.84            | 12.90±0.89          | 15.67±0.76           | 10.03±1.06        | 8.86±0.95          |
| DDPM         | ✓   |     | 44.50±2.20            | 50.73±3.09             | 6.46±2.05           | 6.31±1.40            | 14.67±0.86          | 17.56±1.12           | 9.63±1.06         | 8.12±1.17          |
| pDDPM        |     |     | 49.36±0.66            | 54.70±0.53             | 9.40±1.23           | 9.88±0.59            | 12.95±0.45          | 17.37±0.39           | 8.03±0.62         | 7.78±0.51          |
| pDDPM        | ✓   |     | 49.78±0.85            | 55.10±0.57             | 9.21±1.35           | 10.11±0.45           | 13.24±0.93          | 17.76±1.17           | 7.97±0.95         | 7.65±0.91          |
| cDDPM (Ours) |     |     | 51.34±1.68            | 56.84±2.21             | 10.71±1.52          | 10.13±1.19           | 19.06±1.27          | 20.98±1.14           | 9.94±0.55         | 9.28±0.42          |
| cDDPM (Ours) |     | ✓   | 52.35±0.95            | 58.14±1.47             | 11.09±0.87          | 10.85±1.02           | 18.95±1.94          | 20.86±1.53           | 9.92±1.26         | 9.23±1.08          |
| cDDPM (Ours) | ✓   | ✓   | 53.37±1.80*           | 58.84±1.76*            | 11.51±1.24          | 11.13±1.26           | 19.99±1.55*         | 22.21±1.47*          | 9.88±1.22         | 9.33±1.07          |

reconstruction quality with statistical significance (p < 0.05). We additionally provide an analysis of the unhealthy-to-healthy error ratio based on the unhealthy test sets. Notably, the l1-ratio is highest for cDDPM across all
data sets except for the WMH data set, where DDPM shows competitive performance.
5.2 Conditioning Effect

To evaluate the effect of the additional conditioning input on individual reconstructions, we simulate different
conditioning inputs that vary in the amount of image information they contain. Furthermore, we apply artificial contrast levels
to the input images to mimic strong domain shifts. For each conditioning input and contrast level, Fig. 2 shows the
reconstructions generated by cDDPMs at a noise level of ttest = 500 (50%). Moreover, we compare the
reconstructions of DDPMs and cDDPMs in the extreme case of pure noise as input (ttest = T = 1000, i.e., 100%). For
a noise level of 50%, we observe that the overall shape of the reconstruction is preserved across the different
conditioning masks, whereas local intensity information is captured only in regions that are covered by the conditioning
image. At a noise level of 100%, we observe that the reconstructions of cDDPMs coarsely follow the

Figure 4: Exemplary reconstruction and anomaly map taken from the BraTS21 data set. From top to bottom, cDDPM,
pDDPM, DDPM and DAE are compared.
content provided by the conditioning image, leading to a blurry reconstruction of the input image. In contrast, for
unconditioned DDPMs, only a very generic reconstruction can be obtained that shares low similarity with the given
input image.
5.3 Domain Adaptation

To evaluate the domain adaptation to different data sets, we consider the healthy IXI data set as the
in-domain data set and the unhealthy BraTS21 data set as the out-of-domain data set. Note that for the BraTS21 data set,
we only consider regions that have been annotated as healthy. This ensures that we evaluate domain shifts regarding
scanner and brain diversity and not domain shifts that arise from unhealthy structures in brain MRIs. In Fig. 3, we
provide the histograms of input and reconstructions of the healthy IXI data set (left) and the unhealthy BraTS21 data
set (right). We observe that DDPMs show substantial discrepancies between the intensity distributions of input and
reconstruction. The histograms deviate particularly for simulated contrast levels. In contrast, the intensity distribution
of images reconstructed by cDDPMs exhibits higher similarity with the input intensity distribution for both in-domain
and out-of-domain data. Quantitatively, for the IXI data set, the KLD of DDPMs is higher than that of cDDPMs by a factor
of 2.3, 4.0 and 17.0 for the original contrast, a contrast factor of 0.5 and a contrast factor of 2.0, respectively.
5.4 Segmentation Performance

Overall, improved performance is reported for cDDPMs compared to all baselines across all data sets, except for the
WMH data set, where the performance is on par with DDPMs. While the improvements for the cDDPM are statistically
significant for the BraTS21 and ATLAS data sets (p < 0.05), for the MSLUB and WMH data sets, no significant
difference can be observed. Furthermore, we report enhanced performance of cDDPMs when pre-training the encoder
(SSL checkmark in Table 3) and ensembling the reconstructions of different noise levels (ENS checkmark in Table 3)
for most data sets. Notably, the inference time of cDDPMs is reduced by ∼ 37% compared to pDDPMs and increased

[Figure 5 plots: DICE [%] over ttest (200 to 800) in four panels, BraTS21 (T2), MSLUB (T2), ATLAS (T1) and WMH (T1); curves: DDPM, pDDPM, cDDPM, cDDPMens.]

Figure 5: Comparison of different noise levels ttest regarding the DICE for the MSLUB (left) and BraTS21 (right) data
set. The superscript ens denotes the ensembling of reconstructions from different noise levels ttest in [250, 500, 750].
by ∼ 2% compared to DDPMs.
Comparing the reconstructions and residual maps in Fig. 4, we observe crisp reconstructions for both cDDPMs and
pDDPMs, whereas the reconstructions of DDPMs show missing details, particularly in regions where the tumor is
located in the input image. We observe that, while DAEs provide detailed reconstructions, they also reproduce unhealthy
anatomy. For cDDPMs, we observe aligned intensity information across input and reconstructions. Hence, the
residual map shows a higher contrast between normal and abnormal regions, which enables a better delineation of the
present pathology.
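To make the link between reconstruction and delineation concrete, the following Python sketch derives a residual map and a binary anomaly map from an input and its reconstruction. It assumes that the residual is the pixel-wise absolute difference and that a fixed threshold is applied; the function names, the threshold value and the toy image are hypothetical, and any post-processing used in the paper is omitted.

```python
def residual_map(image, reconstruction):
    """Pixel-wise absolute difference between input and reconstruction."""
    return [
        [abs(a - b) for a, b in zip(row_in, row_rec)]
        for row_in, row_rec in zip(image, reconstruction)
    ]


def binarize(residual, threshold):
    """Mark every pixel whose residual exceeds the threshold as anomalous."""
    return [[1 if value > threshold else 0 for value in row] for row in residual]


# Toy 2x2 example: only the bright lower-right pixel is replaced by a
# darker, pseudo-healthy value in the reconstruction.
image = [[0.5, 0.5], [0.5, 1.0]]
reconstruction = [[0.5, 0.5], [0.5, 0.25]]

residual = residual_map(image, reconstruction)
anomaly_mask = binarize(residual, threshold=0.5)  # threshold is an assumption
```

If the reconstruction mismatches the input intensities in healthy regions, those pixels also exceed the threshold, which is how intensity misalignment turns into false positives.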
Considering Fig. 5, it can be observed that the achieved DICE score is dependent on the noise level that is applied at
test time. By applying the ensembling strategy, this dependency is reduced and consistent performance is achieved
without specifying an individual noise level.
We supply a collection of exemplary residual maps for different baseline methods in Fig. 6 where across all examples, the
different baselines show residual maps that are conceptually similar. Nonetheless, our suggested cDDPM demonstrates
the lowest occurrences of false positives and the most pronounced contrast in the residual maps, particularly in
comparison to DDPMs and DAEs.
We provide additional ablation studies of the applied post-processing steps in the supplementary material.

6 Discussion

Unsupervised anomaly detection in neuroimaging has gained significant attention due to its potential to identify
abnormalities without the need for costly data annotation. Compared to supervised approaches that rely on annotated
data sets, UAD methods take a different approach by learning the underlying data distribution of healthy brain anatomy
and identifying anomalies as outliers from that distribution.
In this study, we focus on DDPMs for UAD in brain MRI. These models generate images by reconstructing an input
that is corrupted by noise, leveraging the high-dimensional latent space to achieve high-fidelity reconstructions with
preserved spatial context. However, while the overall brain structure is reconstructed well, we show that the forward
and backward processes of DDPMs do not capture the highly variable intensity characteristics of MRI scans sufficiently,
resulting in false positives in the residual map and reduced detection performance. This becomes especially prominent
in the presence of domain shifts at test time.
To address this challenge, we propose context-conditioned DDPMs (cDDPMs) for UAD in brain MRI. Here, we train a
DDPM to reconstruct healthy brain anatomy and incorporate a latent feature representation of the noise-free input
image, derived by an additional image encoder as a conditioning input to the denoising process. While the additional
feature representation is not suitable for high-fidelity reconstruction Baur et al. (2021); Bercea et al. (2023b), we show
that it can capture local intensity information of the image to reconstruct. We demonstrate that by incorporating the
feature representation of individual input images, our proposed cDDPMs reconstruct the brain MRIs with more accurate
intensity information compared to the unconditional DDPMs. Additionally, we observe enhanced domain adaptation
capabilities to both real and simulated intensity profiles with our conditioning mechanism. Finally, we demonstrate
that these appealing features of our approach are crucial properties to improve the segmentation performance in
reconstruction-based UAD.
Overall, we systematically evaluate the performance of our approach in terms of reconstruction quality, domain
adaptation, and segmentation performance based on five different data sets.

6.1 Reconstruction Quality

We compare the reconstruction quality of our method with baseline models on the healthy IXI data set in Table 2. For
the AE and (S)VAE, overall the worst reconstruction quality is reported. A reason for this is seen in the strict bottleneck
enforced by the dense latent space as it inhibits information flow Baur et al. (2021). In contrast, methods like DAE or
DDPMs that are not constrained by a dense latent space but by a noising strategy Kascenas et al. (2022), show improved
reconstruction performance. While pDDPMs and cDDPMs outperform the baseline DDPM in terms of reconstruction
quality, we observe that all models are outperformed by the baseline DAE. We note that the overall training objective of
the compared generative models is to reconstruct the image with high accuracy and hence copying the input image
would be a trivial solution. However, for the UAD task, it is crucial that the given input image is not solely copied
but that unhealthy anatomy is replaced by pseudo-healthy representatives. Hence, comparing only the reconstruction
quality of healthy anatomy does not necessarily reflect the usefulness for the UAD task. Therefore, we utilize the
l1-ratio, where high values indicate a better trade-off between the reconstruction of healthy and unhealthy anatomy
and vice versa. While DDPMs and particularly the cDDPM achieve a high l1-ratio, it becomes evident across all
unhealthy data sets that the DAE fails to generalize to different pathology types, which is a crucial property of UAD
methods. A reason for that is seen in the chosen noise type in the DAE, which is highly optimized to the BraTS21 data set,
mimicking the visual appearance of tumors Bercea et al. (2023b); Lagogiannis et al. (2023).
In summary, the cDDPM shows improved reconstruction quality compared to DDPMs while preserving a high l1-ratio.
This indicates that the conditioning mechanism effectively captures information from the input image for an improved
reconstruction without providing too much detailed information that would enable the cDDPM to solely copy the input
image.
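The l1-ratio defined in the caption of Table 2 can be sketched as follows. This is a hedged illustration: the function name is hypothetical, the images are flattened to lists, and using the mean absolute error over all pixels inside and outside the annotation (without, for example, excluding background) is an assumption.

```python
def l1_ratio(image, reconstruction, lesion_mask):
    """l1-error of unhealthy anatomy divided by the l1-error of healthy anatomy."""
    err_unhealthy, err_healthy = [], []
    for x, r, is_lesion in zip(image, reconstruction, lesion_mask):
        (err_unhealthy if is_lesion else err_healthy).append(abs(x - r))
    l1_unhealthy = sum(err_unhealthy) / len(err_unhealthy)
    l1_healthy = sum(err_healthy) / len(err_healthy)
    return l1_unhealthy / l1_healthy


# Toy example: the error inside the annotated lesion (0.5) is twice the
# error in the healthy region (0.25).
ratio = l1_ratio(
    image=[1.0, 1.0, 0.5, 0.5],
    reconstruction=[0.5, 0.5, 0.25, 0.75],
    lesion_mask=[1, 1, 0, 0],
)
```

A model that simply copies its input would reconstruct lesions as accurately as healthy tissue and therefore yield a ratio close to one, which is why higher values are preferable here.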
6.2 Domain Adaptation

We evaluate the domain adaptation capabilities of cDDPMs by simulating different contrast levels and conditioning
inputs in Fig. 2. The reconstructions show that while the overall shape is preserved across different conditioning
masks, meaningful reconstructions are achieved only in regions covered by the conditioning image. Particularly, the
conditioning image plays a critical role in capturing local intensity information. This demonstrates the ability of
cDDPMs to adapt to different contrast levels and to capture varying intensity information effectively. This becomes
even more evident when high noise levels are considered, where the only source of information concerning the given
input image is the conditioning image. Here, the reconstruction becomes totally dependent on the shape and intensity
characteristics of the conditioning image. The conditioning facilitates a blurred reconstruction of prominent local
intensity changes, suggesting that the conditioning mechanism allows the capture and reconstruction of local intensity
details from the input image. These findings indicate that cDDPMs effectively learn to balance information from
the noisy input image and the conditioning encoder features during training, adapting according to the level of noise
inherent in the input.
We explore the domain adaptation capabilities of cDDPMs in real-world scenarios where a different, out-of-domain
data set is used for testing. To assess the domain adaptation ability, we investigate the deviation between the intensity
distributions of the input and reconstruction by plotting histograms and calculating the Kullback-Leibler Divergence
(KLD) as a proxy in Fig. 3. Our findings reveal that cDDPMs exhibit improved performance in capturing and
estimating the intensity distribution. Particularly when simulating contrast levels considerably different from the
training distribution, cDDPMs demonstrate superior alignment of the histograms and lower KLD values. This analysis
highlights the potential of the conditioning mechanism in cDDPMs to effectively adapt to unseen variations in intensity

Figure 6: Exemplary residual maps from the BraTS21 data set (rows 1 and 2), the WMH data set (rows 3 and 4) and the
ATLAS data set (rows 5 and 6). From left to right, the input image, the residual maps and the ground truth are shown.

profiles and improve the coherence between input and reconstruction, contributing to improved domain adaptation in
real-world scenarios.
6.3 Segmentation Performance

Overall, cDDPMs demonstrate competitive or superior results compared to traditional autoencoder-based approaches,
as well as the baseline DDPM and pDDPM, as presented in Table 3. We conclude that the high reconstruction quality,
the accurate modeling of global and local intensity information, and the effective domain adaptation capabilities that are
attributed to our conditioning approach are crucial to improve the UAD segmentation performance. Furthermore, we
observe that pre-training the encoder Fenc slightly enhances the segmentation performance in most cases, indicating
that starting from an already learned representation space has the potential to improve the overall integration of the
conditioning features, compared to simultaneously training the parameters of both the DDPM and Fenc from scratch.
Comparing the cDDPM to the DAE reveals that, despite the DAEs’ superior reconstruction quality on healthy data,
they are outperformed by cDDPMs by a margin on the UAD task. A reason for this is seen in the DAEs’ ability to
reconstruct unhealthy anatomy, particularly for pathologies that differ from those in the BraTS21 data set and for which
the noise type is not optimized, as discussed in Subsection 6.1.
To further analyze the effectiveness of cDDPMs, we provide visual comparisons of reconstructions and residual maps
in Fig. 4. In comparison to pDDPMs, the reconstructions of cDDPMs follow the intensity information
of the respective input images more closely, resulting in improved contrast and intensity alignment between input and reconstruction
pairs. This leads to a higher contrast in the residual maps, which facilitates the delineation of anomalies such as tumors.
Furthermore, it becomes evident that, besides adding a data-dependent hyper-parameter, the patching strategy of pDDPMs introduces
subtle artifacts at patch borders. Thus, our results indicate that cDDPMs make use of the additional information more
effectively than pDDPMs. Furthermore, cDDPMs provide reduced complexity and inference time as there is no
need for a costly patching strategy, making them a practical and efficient solution for UAD in brain MRI.
In Fig. 5, we explore the impact of noise levels on the segmentation performance. We demonstrate that cDDPMs
outperform the baseline models across different noise levels for most data sets. However, we also observe that the
noise level serves as a crucial hyper-parameter. In general, high noise levels tend to result in more blurry and generic
reconstructions, whereas low noise levels enable sharper reconstructions, including unhealthy anatomy. Thus, selecting
an appropriate noise level is essential to achieve reasonable performance. However, the optimal value for the applied
noise depends on the evaluated data set, as shown in Fig. 5. We assume that the main reason for this dependency is the
different size of pathologies in the considered data sets, as also indicated in Bercea et al. (2023a). To address this
dependency, we apply different noise levels and average the resulting reconstructions. This effectively mitigates
the dependency on the noise level and enhances the model’s generalization abilities, which are vital for UAD methods.
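The ensembling step can be illustrated with a short Python sketch that averages reconstructions obtained at the noise levels ttest = [250, 500, 750]. The function `toy_reconstruct` is a hypothetical stand-in for a trained diffusion model and only serves to make the example runnable; all names are assumptions.

```python
def ensemble_reconstruction(image, reconstruct, t_values=(250, 500, 750)):
    """Average the reconstructions obtained at several test-time noise levels."""
    reconstructions = [reconstruct(image, t) for t in t_values]
    # Average pixel by pixel across the individual reconstructions.
    return [sum(pixels) / len(reconstructions) for pixels in zip(*reconstructions)]


def toy_reconstruct(image, t, total_steps=1000):
    """Hypothetical stand-in for a model: attenuates the input more as t grows."""
    keep = 1.0 - t / total_steps
    return [value * keep for value in image]


# With the toy stand-in, the three reconstructions keep 75%, 50% and 25% of
# the input, so their average keeps 50%.
ensembled = ensemble_reconstruction([1.0, 2.0], toy_reconstruct)
```

Averaging removes the need to pick a single noise level per data set, at the cost of one reconstruction pass per entry in `t_values`.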

6.4 Limitations and Future Work

Overall, while our approach demonstrates promising results, several limitations should be acknowledged. Our study
focuses primarily on brain MRI and may not generalize seamlessly to other imaging modalities or anatomical regions.
Further investigation and adaptation of our approach to different medical imaging domains or even industrial defect
detection would be beneficial.
Another potential avenue for improvement in our study is the inclusion of FLAIR (Fluid-Attenuated Inversion Recovery)
data, which could improve the overall performance Meissen et al. (2022) and provide valuable insights into white
matter abnormalities and lesions, especially in conditions like MS. Furthermore, it is important to acknowledge that
the available data sets for white matter hyperintensities and MS lesions, such as the MSLUB and WMH data sets, are
relatively small compared to the BraTS21 and ATLAS data sets. This limited sample size for WMH and MS lesions may
reduce the generalizability of our findings and the availability of a larger data set would provide a more comprehensive
representation of WMH and MS lesions, enabling more accurate and reliable evaluation.
Another avenue for future work is the incorporation of multi-scale image encodings into our conditioning mechanism.
Currently, our study does not utilize multi-scale analysis, which could be advantageous in capturing fine-grained details
and contextual information at different resolutions. By carefully integrating multi-scale image encodings without
allowing the model to copy the conditioning image, we see potential to enhance the performance of our cDDPMs in capturing
both global and local features of the input images.
To further enhance the utilized context, an additional direction for future work is to explore the use of 3D input for the
image encoder. We have shown that it improves the reconstruction quality for VAEs Bengs et al. (2021); Behrendt et al.
(2022a) and expect similar improvements for DDPMs. Currently, our approach operates on 2D slices of the MRI data,
which may limit the preservation of 3D context and spatial relationships between slices. By incorporating 3D input into
the image encoder, we can potentially capture and preserve the 3D structure and contextual information without the
need to train the full 3D DDPM.

7 Conclusion

In this work, we addressed the task of reconstruction-based UAD in brain MRI. To this end, we proposed cDDPMs where
we introduced a conditioning mechanism to DDPMs that incorporates feature representations of noise-free input images
to the denoising process. We have shown that this conditioning mechanism effectively addresses challenges of accurate
reconstruction, intensity capture, and domain adaptation and thus enables a more accurate delineation of pathologies
from the generated residual maps. As a consequence, our approach outperformed state-of-the-art architectures for UAD
in brain MRI, on various publicly available data sets. Our findings contribute to the development of effective UAD
methods in brain MRI and have practical implications for the detection and segmentation of pathologies in clinical
scenarios, where domain shifts are likely.

Acknowledgements
This work was partially funded by grant numbers KK5208101KS0 and ZF4026303TS9 and by the Free and Hanseatic
City of Hamburg (Interdisciplinary Graduate School) from University Medical Center Hamburg-Eppendorf.

References
Baid, U., Ghodasara, S., Mohan, S., Bilello, M., Calabrese, E., Colak, E., Farahani, K., Kalpathy-Cramer, J., Kitamura,
F.C., Pati, S., et al., 2021. The rsna-asnr-miccai brats 2021 benchmark on brain tumor segmentation and radiogenomic
classification. arXiv preprint arXiv:2107.02314 .
Bakas, S., Akbari, H., Sotiras, A., Bilello, M., Rozycki, M., Kirby, J.S., Freymann, J.B., Farahani, K., Davatzikos,
C., 2017. Advancing the cancer genome atlas glioma mri collections with expert segmentation labels and radiomic
features. Scientific data 4, 1–13.
Baur, C., Denner, S., Wiestler, B., Navab, N., Albarqouni, S., 2021. Autoencoders for unsupervised anomaly
segmentation in brain mr images: a comparative study. Med. Image Anal. , 101952.
Baur, C., Wiestler, B., Albarqouni, S., Navab, N., 2018. Deep autoencoding models for unsupervised anomaly
segmentation in brain mr images, in: MICCAI brainlesion workshop, Springer. pp. 161–169.
Baur, C., Wiestler, B., Albarqouni, S., Navab, N., 2020a. Bayesian skip-autoencoders for unsupervised hyperintense
anomaly detection in high resolution brain mri, in: IEEE ISBI, IEEE. pp. 1905–1909.
Baur, C., Wiestler, B., Albarqouni, S., Navab, N., 2020b. Scale-space autoencoders for unsupervised anomaly
segmentation in brain mri, in: Computer Assisted Radiology and Surgery, Springer. pp. 552–561.
Behrendt, F., Bengs, M., Bhattacharya, D., Krüger, J., Opfer, R., Schlaefer, A., 2022a. Capturing inter-slice dependencies
of 3d brain MRI-scans for unsupervised anomaly detection, in: Medical Imaging with Deep Learning. URL:
https://openreview.net/forum?id=db8wDgKH4p4.
Behrendt, F., Bengs, M., Rogge, F., Krüger, J., Opfer, R., Schlaefer, A., 2022b. Unsupervised anomaly detection
in 3d brain mri using deep learning with impured training data, in: 2022 IEEE 19th International Symposium on
Biomedical Imaging (ISBI), IEEE. pp. 1–4.
Behrendt, F., Bhattacharya, D., Krüger, J., Opfer, R., Schlaefer, A., 2023. Patched diffusion models for unsupervised
anomaly detection in brain mri. arXiv preprint arXiv:2303.03758 .
Bengs, M., Behrendt, F., Krüger, J., Opfer, R., Schlaefer, A., 2021. Three-dimensional deep learning with spatial erasing
for unsupervised anomaly segmentation in brain mri. Computer Assisted Radiology and Surgery 16, 1413–1423.
Bercea, C.I., Neumayr, M., Rueckert, D., Schnabel, J.A., 2023a. Mask, stitch, and re-sample: Enhancing robustness and
generalizability in anomaly detection through automatic diffusion models, in: ICML 3rd Workshop on Interpretable
Machine Learning in Healthcare (IMLH). URL: https://openreview.net/forum?id=kTpafpXrqa.
Bercea, C.I., Wiestler, B., Rueckert, D., Schnabel, J.A., 2023b. Generalizing unsupervised anomaly detection: Towards
unbiased pathology screening, in: Medical Imaging with Deep Learning. URL: https://openreview.net/forum?id=8ojx-Ld3yjR.
Bruno, M.A., Walker, E.A., Abujudeh, H.H., 2015. Understanding and confronting our mistakes: the epidemiology of
error in radiology and strategies for error reduction. Radiographics 35, 1668–1676.
Chen, S., Ma, K., Zheng, Y., 2019. Med3d: Transfer learning for 3d medical image analysis. arXiv preprint
arXiv:1904.00625 .
Chen, X., You, S., Tezcan, K.C., Konukoglu, E., 2020. Unsupervised lesion detection via image restoration with a
normative prior. Med. Image Anal. 64, 101713.
Dhariwal, P., Nichol, A., 2021. Diffusion models beat gans on image synthesis. NIPS 34, 8780–8794.
Ellis, R.J., Sander, R.M., Limon, A., 2022. Twelve key challenges in medical machine learning and solutions.
Intelligence-Based Medicine 6, 100068.
Graham, M.S., Pinaya, W.H., Tudosiu, P.D., Nachev, P., Ourselin, S., Cardoso, M.J., 2022. Denoising diffusion models
for out-of-distribution detection. arXiv preprint arXiv:2211.07740 .
Han, C., Rundo, L., Murao, K., Noguchi, T., Shimahara, Y., Milacski, Z.Á., Koshino, S., Sala, E., Nakayama, H.,
Satoh, S., 2021. Madgan: Unsupervised medical anomaly detection gan using multiple adjacent brain mri slice
reconstruction. BMC bioinformatics 22, 1–20.
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R., 2022. Masked autoencoders are scalable vision learners, in:
CVPR, pp. 16000–16009.
Ho, J., Jain, A., Abbeel, P., 2020. Denoising diffusion probabilistic models. NIPS 33, 6840–6851.
Isensee, F., Schell, M., Pflueger, I., Brugnara, G., Bonekamp, D., Neuberger, U., Wick, A., Schlemmer, H.P., Heiland,
S., Wick, W., et al., 2019. Automated brain extraction of multisequence mri using artificial neural networks. Human
brain mapping 40, 4952–4964.
Islam, J., Zhang, Y., 2018. Brain mri analysis for alzheimer’s disease diagnosis using an ensemble system of deep
convolutional neural networks. Brain informatics 5, 1–14.
Johnson, J.M., Khoshgoftaar, T.M., 2019. Survey on deep learning with class imbalance. Journal of Big Data 6, 1–54.
Karimi, D., Dou, H., Warfield, S.K., Gholipour, A., 2020. Deep learning with noisy labels: Exploring techniques and
remedies in medical image analysis. Med. Image Anal. 65, 101759.
Kascenas, A., Pugeault, N., O’Neil, A.Q., 2022. Denoising autoencoders for unsupervised anomaly detection in brain
mri, in: Medical Imaging with Deep Learning, PMLR.
Kuijf, H.J., Biesbroek, J.M., De Bresser, J., Heinen, R., Andermatt, S., Bento, M., Berseth, M., Belyaev, M., Cardoso,
M.J., Casamitjana, A., et al., 2019. Standardized assessment of automatic segmentation of white matter hyperintensities and results of the wmh segmentation challenge. IEEE transactions on medical imaging 38, 2556–2568.
Lagogiannis, I., Meissen, F., Kaissis, G., Rueckert, D., 2023. Unsupervised pathology detection: A deep dive into the
state of the art. IEEE Transactions on Medical Imaging , 1–1. doi:10.1109/TMI.2023.3298093.
Lesjak, Ž., Galimzianova, A., Koren, A., Lukin, M., Pernuš, F., Likar, B., Špiclin, Ž., 2018. A novel public mr image
dataset of multiple sclerosis patients with lesion segmentations based on multi-rater consensus. Neuroinformatics 16,
51–63.
Liew, S.L., Lo, B.P., Donnelly, M.R., Zavaliangos-Petropulu, A., Jeong, J.N., Barisano, G., Hutton, A., Simon, J.P.,
Juliano, J.M., Suri, A., et al., 2022. A large, curated, open-source stroke neuroimaging dataset to improve lesion
segmentation algorithms. Scientific data 9, 320.
Lundervold, A.S., Lundervold, A., 2019. An overview of deep learning in medical imaging focusing on mri.
Zeitschrift für Medizinische Physik 29, 102–127. URL: https://www.sciencedirect.com/science/article/pii/S0939388918301181, doi:10.1016/j.zemedi.2018.11.002. Special Issue: Deep Learning in
Medical Physics.
McDonald, R.J., Schwartz, K.M., Eckel, L.J., Diehn, F.E., Hunt, C.H., Bartholmai, B.J., Erickson, B.J., Kallmes, D.F.,
2015. The effects of changes in utilization and technological advancements of cross-sectional imaging on radiologist
workload. Academic radiology 22, 1191–1198.
Meissen, F., Kaissis, G., Rueckert, D., 2022. Challenging current semi-supervised anomaly segmentation methods for
brain mri, in: MICCAI brainlesion workshop, Springer. pp. 63–74.
Menze, B.H., Jakab, A., Bauer, S., Kalpathy-Cramer, J., Farahani, K., Kirby, J., Burren, Y., Porz, N., Slotboom, J.,
Wiest, R., et al., 2014. The multimodal brain tumor image segmentation benchmark (brats). IEEE transactions on
medical imaging 34, 1993–2024.
Moeskops, P., de Bresser, J., Kuijf, H.J., Mendrik, A.M., Biessels, G.J., Pluim, J.P., Išgum, I., 2018. Evaluation of a
deep learning approach for the segmentation of brain tissues and white matter hyperintensities of presumed vascular
origin in mri. NeuroImage: Clinical 17, 251–262.
Perez, E., Strub, F., De Vries, H., Dumoulin, V., Courville, A., 2018. Film: Visual reasoning with a general conditioning
layer, in: AAAI.
Perkuhn, M., Stavrinou, P., Thiele, F., Shakirin, G., Mohan, M., Garmpis, D., Kabbasch, C., Borggrefe, J., 2018.
Clinical evaluation of a multiparametric deep learning model for glioblastoma segmentation using heterogeneous
magnetic resonance imaging data from clinical routine. Investigative radiology 53, 647.
Pinaya, W.H., Graham, M.S., Gray, R., Da Costa, P.F., Tudosiu, P.D., Wright, P., Mah, Y.H., MacKinnon, A.D., Teo,
J.T., Jager, R., et al., 2022a. Fast unsupervised brain anomaly detection and segmentation with diffusion models.
arXiv preprint arXiv:2206.03461.
Pinaya, W.H., Tudosiu, P.D., Gray, R., Rees, G., Nachev, P., Ourselin, S., Cardoso, M.J., 2022b. Unsupervised brain
imaging 3d anomaly detection and segmentation with transformers. Med. Image Anal. 79, 102475.
Pérez-García, F., Sparks, R., Ourselin, S., 2021. Torchio: A python library for efficient loading, preprocessing,
augmentation and patch-based sampling of medical images in deep learning. Computer Methods and Programs in
Biomedicine 208, 106236.
Raschka, S., 2018. Mlxtend: Providing machine learning and data science utilities and extensions to python’s scientific
computing stack. The Journal of Open Source Software 3.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B., 2022. High-resolution image synthesis with latent
diffusion models, in: CVPR, pp. 10684–10695.
Ronneberger, O., Fischer, P., Brox, T., 2015. U-net: Convolutional networks for biomedical image segmentation, in:
Medical Image Computing and Computer Assisted Intervention, Springer. pp. 234–241.
Saharia, C., Chan, W., Chang, H., Lee, C., Ho, J., Salimans, T., Fleet, D., Norouzi, M., 2022. Palette: Image-to-image
diffusion models, in: ACM, pp. 1–10.
Schlegl, T., Seeböck, P., Waldstein, S.M., Langs, G., Schmidt-Erfurth, U., 2019. f-AnoGAN: Fast unsupervised anomaly
detection with generative adversarial networks. Med. Image Anal. 54, 30–44.
Silva-Rodríguez, J., Naranjo, V., Dolz, J., 2022. Constrained unsupervised anomaly segmentation. Med. Image Anal.
80, 102526.
Tian, K., Jiang, Y., Diao, Q., Lin, C., Wang, L., Yuan, Z., 2023. Designing bert for convolutional networks: Sparse and
hierarchical masked modeling. arXiv:2301.03580.
Vernooij, M.W., Ikram, M.A., Tanghe, H.L., Vincent, A.J., Hofman, A., Krestin, G.P., Niessen, W.J., Breteler, M.M.,
van der Lugt, A., 2007. Incidental findings on brain mri in the general population. New England Journal of Medicine
357, 1821–1828.
Wang, T., Zhang, T., Zhang, B., Ouyang, H., Chen, D., Chen, Q., Wen, F., 2022. Pretraining is all you need for
image-to-image translation. arXiv preprint arXiv:2205.12952.
Wang, Z., Bovik, A., Sheikh, H., Simoncelli, E., 2004. Image quality assessment: from error visibility to structural
similarity. IEEE Transactions on Image Processing 13, 600–612. doi:10.1109/TIP.2003.819861.
Wolleb, J., Bieder, F., Sandkühler, R., Cattin, P.C., 2022. Diffusion models for medical anomaly detection. arXiv
preprint arXiv:2203.04306.
Wyatt, J., Leach, A., Schmon, S.M., Willcocks, C.G., 2022. Anoddpm: Anomaly detection with denoising diffusion
probabilistic models using simplex noise, in: CVPR, pp. 650–656.
Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O., 2018. The unreasonable effectiveness of deep features as a
perceptual metric, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 586–595.
Zimmerer, D., Isensee, F., Petersen, J., Kohl, S., Maier-Hein, K., 2019a. Unsupervised anomaly localization using
variational auto-encoders, in: Shen, D., Liu, T., Peters, T.M., Staib, L.H., Essert, C., Zhou, S., Yap, P.T., Khan, A.
(Eds.), Medical Image Computing and Computer Assisted Intervention – MICCAI 2019, Springer International
Publishing, Cham. pp. 289–297.
Zimmerer, D., Kohl, S., Petersen, J., Isensee, F., Maier-Hein, K., 2019b. Context-encoding variational autoencoder for
unsupervised anomaly detection, in: Medical Imaging with Deep Learning.


Supplementary Material
Post-Processing Analysis
| Model | without CC | without MF | without BE | BraTS21 | MSLUB | ATLAS | WMH |
|-------|------------|------------|------------|---------|-------|-------|-----|
| DAE   | ✓ | ✗ | ✗ | 45.34 ± 3.94 (-0.03)  | 3.92 ± 1.19 (0.04)   | 8.53 ± 0.24 (0.00)   | 7.31 ± 0.91 (0.00)  |
| DAE   | ✗ | ✓ | ✗ | 39.75 ± 3.37 (-5.62)  | 3.94 ± 0.58 (0.06)   | 9.21 ± 0.32 (0.68)   | 7.01 ± 0.68 (-0.30) |
| DAE   | ✗ | ✗ | ✓ | 44.36 ± 3.88 (-1.01)  | 3.51 ± 0.89 (-0.37)  | 8.77 ± 0.28 (0.24)   | 6.80 ± 0.81 (-0.51) |
| DDPM  | ✓ | ✗ | ✗ | 44.42 ± 2.03 (-0.08)  | 6.05 ± 2.10 (-0.41)  | 14.65 ± 0.30 (-0.02) | 10.27 ± 0.96 (0.64) |
| DDPM  | ✗ | ✓ | ✗ | 36.56 ± 1.86 (-7.94)  | 8.05 ± 1.39 (1.59)   | 12.02 ± 0.45 (-2.65) | 8.80 ± 0.71 (-0.83) |
| DDPM  | ✗ | ✗ | ✓ | 43.81 ± 2.12 (-0.69)  | 5.70 ± 2.06 (-0.76)  | 15.11 ± 0.32 (0.44)  | 9.89 ± 1.00 (0.26)  |
| cDDPM | ✓ | ✗ | ✗ | 53.43 ± 1.49 (0.06)   | 10.70 ± 1.50 (-0.81) | 19.92 ± 1.45 (-0.07) | 9.86 ± 1.18 (-0.02) |
| cDDPM | ✗ | ✓ | ✗ | 40.59 ± 1.27 (-12.78) | 11.05 ± 1.04 (-0.46) | 16.56 ± 1.11 (-3.43) | 8.34 ± 0.76 (-1.54) |
| cDDPM | ✗ | ✗ | ✓ | 52.29 ± 1.70 (-1.08)  | 10.00 ± 1.48 (-1.51) | 20.41 ± 1.41 (0.42)  | 9.36 ± 1.15 (-0.52) |
| pDDPM | ✓ | ✗ | ✗ | 49.62 ± 0.84 (-0.16)  | 10.09 ± 0.76 (0.88)  | 13.35 ± 0.22 (0.11)  | 7.62 ± 0.80 (-0.35) |
| pDDPM | ✗ | ✓ | ✗ | 36.09 ± 0.35 (-13.69) | 10.40 ± 0.59 (1.19)  | 12.14 ± 0.21 (-1.10) | 7.33 ± 0.39 (-0.64) |
| pDDPM | ✗ | ✗ | ✓ | 47.76 ± 0.86 (-2.02)  | 8.13 ± 1.03 (-1.08)  | 13.71 ± 0.28 (0.47)  | 7.30 ± 0.95 (-0.67) |

Table 4: Post-processing analysis for all data sets. A checkmark indicates that Connected Component filtering (CC),
Median Filtering (MF) or Brain Eroding (BE) was excluded in the evaluation phase. For all models, the mean ± standard
deviation is provided. The absolute difference relative to the respective baseline model is given in parentheses.
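As a reading aid for the values in parentheses, the following minimal sketch recovers the baseline score implied by each row of the BraTS21 column. It assumes that each difference is computed as the ablated score minus the baseline score. The helper name `implied_baseline` is hypothetical and not part of the original evaluation code.

```python
# (score, difference) pairs copied from the BraTS21 column of Table 4,
# in the row order: without CC, without MF, without BE.
brats21 = {
    "DAE":   [(45.34, -0.03), (39.75, -5.62), (44.36, -1.01)],
    "DDPM":  [(44.42, -0.08), (36.56, -7.94), (43.81, -0.69)],
    "cDDPM": [(53.43, 0.06), (40.59, -12.78), (52.29, -1.08)],
    "pDDPM": [(49.62, -0.16), (36.09, -13.69), (47.76, -2.02)],
}


def implied_baseline(score, diff):
    """Baseline score, assuming diff = ablated score - baseline score."""
    return round(score - diff, 2)


for model, rows in brats21.items():
    # All three ablation rows of a model should point to the same baseline.
    baselines = sorted({implied_baseline(s, d) for s, d in rows})
    print(model, baselines)
```

If the assumption holds, the three rows of each model yield a single common baseline value, which is a quick consistency check on the reconstructed table.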
In Table 4, we analyze the applied post-processing steps by excluding them one at a time from the evaluation protocol.
Removing the median filter has by far the largest effect: on BraTS21, it lowers the scores of all models by 5.62 to 13.69
points, whereas removing connected component filtering or brain eroding changes them by at most 2.02 points. However,
no post-processing strategy works consistently for all models or data sets. For example, removing the median filter
increases the scores of DDPM and pDDPM on MSLUB by more than one point. This motivates further research and a
systematic study of the effect of different post-processing steps for UAD in brain MRI.
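To make the ablation concrete, the following is a minimal sketch of a post-processing chain in which each of the three steps can be switched off. It is not the evaluation code used for Table 4. The function name `postprocess`, the order of the steps, the kernel size, the number of erosion iterations, the minimum component size, and the threshold are all assumptions chosen for illustration. Only NumPy and `scipy.ndimage` are used.

```python
import numpy as np
from scipy import ndimage


def postprocess(residual, brain_mask, threshold, use_mf=True, use_be=True,
                use_cc=True, mf_size=3, be_iterations=2, min_size=5):
    """Turn a residual map into a binary anomaly mask (illustrative only).

    All default parameter values are assumptions, not values from the paper.
    """
    residual = np.asarray(residual, dtype=float)
    mask = np.asarray(brain_mask, dtype=bool)
    if use_mf:
        # Median filtering (MF) suppresses isolated high-residual pixels.
        residual = ndimage.median_filter(residual, size=mf_size)
    if use_be:
        # Brain eroding (BE) shrinks the mask to drop residuals at the border.
        mask = ndimage.binary_erosion(mask, iterations=be_iterations)
    segmentation = (residual > threshold) & mask
    if use_cc:
        # Connected component (CC) filtering removes regions below min_size.
        labels, _ = ndimage.label(segmentation)
        sizes = np.bincount(labels.ravel())
        keep = sizes >= min_size
        keep[0] = False  # label 0 is the background
        segmentation = keep[labels]
    return segmentation


# Synthetic 2D example: one lesion, one noisy pixel, one border artifact.
residual = np.zeros((32, 32))
residual[12:18, 12:18] = 1.0   # lesion-like block
residual[5, 5] = 1.0           # isolated noisy pixel
residual[2:4, 10:20] = 1.0     # strip at the edge of the brain mask
brain = np.zeros((32, 32), dtype=bool)
brain[2:30, 2:30] = True

for name, kwargs in [("all steps", {}),
                     ("without CC", {"use_cc": False}),
                     ("without MF", {"use_mf": False}),
                     ("without BE", {"use_be": False})]:
    seg = postprocess(residual, brain, threshold=0.5, **kwargs)
    print(f"{name}: {int(seg.sum())} pixels flagged")
```

Disabling one step at a time, as in the loop above, mirrors the structure of the ablation in Table 4. The synthetic input only illustrates which kind of false positive each step targets and says nothing about the magnitude of the effect on real data.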
