1

DiaPer: End-to-End Neural Diarization with
Perceiver-Based Attractors

arXiv:2312.04324v1 [eess.AS] 7 Dec 2023

Federico Landini, Mireia Diez, Themos Stafylakis, Lukáš Burget

Abstract—Until recently, the field of speaker diarization was
dominated by cascaded systems. Due to their limitations, mainly
regarding overlapped speech and cumbersome pipelines, endto-end models have gained great popularity lately. One of the
most successful models is end-to-end neural diarization with
encoder-decoder based attractors (EEND-EDA). In this work,
we replace the EDA module with a Perceiver-based one and
show its advantages over EEND-EDA; namely obtaining better
performance on the largely studied Callhome dataset, finding
the quantity of speakers in a conversation more accurately, and
running inference on almost half of the time on long recordings.
Furthermore, when exhaustively compared with other methods,
our model, DiaPer, reaches remarkable performance with a very
lightweight design. Besides, we perform comparisons with other
works and a cascaded baseline across more than ten public wideband datasets. Together with this publication, we release the code
of DiaPer as well as models trained on public and free data.
Index Terms—Speaker Diarization, End-to-End Neural Diarization, Perceiver, Attractor, DiaPer

I. I NTRODUCTION
N the last years, there has been a big change of paradigm
in the world of speaker diarization. Competitive systems
until a few years ago were cascaded or modular [1]–[3],
consisting of different sub-modules to handle voice/speech
activity detection (VAD/SAD), embedding extraction (usually x-vector) over uniform segmentation, clustering, optional
resegmentation and overlapped speech detection (OSD) and
handling. The main disadvantages of this framework are that
each sub-module is trained independently and optimized for
different objectives and that the full pipeline is complex since
a few steps need to be applied sequentially, propagating errors
from one step to the next one. Furthermore, OSD performance
is usually not satisfactory, resulting in high overlap-related
errors in cascaded systems.
Since the appearance of end-to-end models, the ecosystem
has changed substantially with new approaches constantly
appearing [4]. Neural-based diarization models can be separated into different categories: single-stage systems, which
comprise only one model, and two-stage systems, which have
two steps where one is a variant of end-to-end model and
the other is either based on clustering or on another model.
Single-stage systems, such as end-to-end neural diarization
(EEND) [5], where diarization is modeled as per-speaker perframe binary classification, are trained directly for the task.
While the training can be done in different steps, the inference
is performed in a single stage. These methods face difficulties

I

Federico Landini, Mireia Diez, and Lukáš Burget are with Brno University
of Technology and Themos Stafylakis is with Omilia and Athens University
of Economics and Business.

in recordings with several speakers [6]. Two-stage systems
can be separated into different classes. Models such as target
speaker voice activity detection [7] are trained in an endto-end manner but make use of an initialization provided
by an existing (usually cascaded) model which has to be
run priorly at inference time. Other two-stage systems run
EEND on short segments (where few speakers are expected)
and then perform clustering to join the decisions on short
segments. They are known as EEND vector clustering (EENDVC) and different variants have been proposed [8]–[10]. These
approaches present advantages in dealing with several speakers
(potentially an unlimited number of them) while having an
edge over clustering-based methods on dealing with overlapped speech segments as EEND models usually do. This
categorization is, however, not strict. Some systems do not
exactly qualify as “single” or “two” stage as they have a single
stage but include some iterative procedure [11], [12].
The simplicity of single-stage EEND systems (where diarization is modeled as per-speaker per-frame binary classification) has brought more attention to them and several variations
have been proposed based on this framework. The two main
extensions are self-attention EEND (SA-EEND) [13] (where
BiLSTM layers are replaced by SA ones) and EEND with
encoder-decoder attractors (EEND-EDA) [14] (which enables
handling variable numbers of speakers), but several others
have been proposed: some of them have been designed for
the online scenario [15], [16] or making use of multiple
microphones [17], [18]. The Conformer architecture [19] was
used to replace the self-attention layers of SA-EEND in [20]
and of EEND-EDA in [21].
The Perceiver [22] is a Transformer [23] variant that employs cross-attention to project the variable-size input onto a
fixed-size set of latent representations. These latents are transformed by iterative self-attention and cross-attention blocks.
By encoding the variable-size input into the fixed-size latent
space, the Perceiver reduces the quadratic complexity of the
Transformer to linear. In this work, we utilize the Perceiver
framework to encode speaker information into the latent space
and then derive attractors from them. Using Perceivers allows
us to handle a variable number of speakers per conversation
while addressing some of the limitations of EDA with a fully
non-autoregressive (and iteration-free) scheme. Moreover, we
evaluate our model, DiaPer, on a wide variety of scenarios.
The contributions of our work are:
• Replacement of encoder-decoder structure in EEND-EDA
by a Perceiver-based decoder.
• Analysis of DiaPer’s performance under different architectural choices.

2

Attractors
Latents

a1 a2 . . . aA
e1

x1

x2
x3
...
...

II. R ELATED W ORKS
Among the EEND variants that are capable of dealing
with multiple speakers the most standard one is still EENDEDA [14]. This approach employs long short-term memory
(LSTM) layers for encoding frame embeddings and decoding
attractors that represent the speakers in the conversation.
However, one of the limitations of this approach is the
LSTM-based encoder-decoder mechanism itself. In practice,
the frame-by-frame embeddings fed to the LSTM encoder are
shuffled, clearly removing the time information, and hindering
the capabilities of this approach. This is done due to the
difficulties LSTMs have to “remember” speakers appearing at
the beginning of the conversation, especially when processing
long sequences. In [24], an alternative is proposed where the
input of the LSTM encoder is not shuffled and the LSTM
decoder incorporates an attention mechanism. Instead of using
zero vectors as input for the decoder, the input is obtained as
a weighted sum of the encoder outputs, providing the decoder
with better cues. A similar idea is explored in [25] where
the decoder is fed with summary representations calculated
together with embeddings produced by the frame encoder.
Some works have explored non-autoregressive approaches
for obtaining attractors with attention-based schemes. The first
of these works replaces the LSTM-based encoder-decoder with
two layers of cross-attention decoder [26]. In this configuration, the attractors are transformed using the frame embeddings
as keys and values and the input attractors, used as queries
in the decoder, are obtained as the weighted average of the
frame embeddings using their predicted posterior activities as
weights. However, a set of initial attractors has to be fed into
the decoder before an initial set of predictions is produced. The
initial attractors are given by running k-means clustering on
the frame embeddings and clustering to the number of speakers
in the recording. It is shown that this method can improve by
running a few refinement iterations.
In [27], the LSTM-based encoder-decoder is also replaced
by a cross-attention decoder; however, the set of initial queries
that are transformed into attractors is not defined by the output
of the model but they are learnable parameters. The methods
in [26], [27] have only shown their capabilities in the twospeaker scenario where the number of speakers is known and
where the architecture can be crafted to handle that specific
quantity. The extension to more speakers is definitely possible
but follow-up works have not yet been published.
A combination of the aforementioned works is utilized
in [12], [28]. In [28], in the context of SA-EEND for two
speakers, the initial diarization outputs are used to estimate

Attractor
existence
probabilities

Perceiver-based decoder

...
...

xT

Log-Mel
fbanks

Frame encoder

Thorough comparison with EEND-EDA to show DiaPer’s
improvements.
• Proposed architecture that is more lightweight and efficient
at inference time, yet performs better than EEND-EDA.
• Exhaustive comparison with other works on several corpora.
• Clustering-based baseline (including VAD and OSD + overlap handling) results on a variety of datasets and built with
public tools.
• Release of models trained on free publicly available data.
• Public code: https://github.com/BUTSpeechFIT/DiaPer.
•

e2
e3

...
...
...

...
...
...
...
...
...

...

...

eT

...

Linear

{

Frame embeddings

σ

p1 p2 . . . pA
y1,1 y1,2 . . . y1,A
y2,1 y2,2 . . . y2,A

y3,1 y3,2 . . . y3,A

σ

...

... . . . ...

...

... . . . ...

...

... . . . ...

...

... . . . ...

yT,1 yT,2 . . . yT,A

Per-frame
per-speaker activities

Fig. 1. DiaPer diagram.

initial attractors and they are refined iteratively with crossattention decoders with a fixed set of queries (one for each
of the speakers) attending to frame embeddings. In [12], the
LSTM-based encoder-decoder is also replaced by layers of
cross-attention decoder and three of the initial queries are fixed
(but learned during training) and represent “silence”, “single
speaker” and “overlap” while the other S queries represent
each of the speakers in the recording. In the first pass, only
the fixed queries are used and then the initial speaker queries
are estimated from the frame embeddings, using the average of
carefully selected frames given the predicted posterior activities. The set of S + 3 attractors is refined through a few crossattention layers in order to produce the final attractors used to
obtain the speech activity posteriors. It should be noted that
the inference procedure with this method is more complicated
than in the original EEND-EDA due to the iterative procedure
to estimate first silence, single speaker and overlap attractors
and then each of the speakers iteratively.
In [12], and more recently in [29] (which is concurrent to
this work), results are presented with a flexible quantity of
speakers but the model relies on an autoregressive scheme
since the speakers are iteratively decoded in a second step.
All these approaches present similarities with a more
generic architecture: the Perceiver [22] which iteratively refines a set of latents (queries in cross-attention) informed by
an input sequence (keys and values in cross-attention) but in
a complete non-autoregressive framework.
The model we propose in this work generalizes some of
the ideas described above and directly tackles the problem of
handling several speakers using Perceivers to obtain attractors
in an EEND-based framework. We name this approach DiaPer:
end-to-end neural diarization with Perceiver-based attractors.
III. T HE M ODEL
DiaPer shares many facets with other EEND models, such
as defining diarization as a per-speaker-per-time-frame binary
classification problem. Given a sequence of observations (features) X ∈ RT ×F where T denotes the sequence length and F
the feature dimensionality, the model produces Ŷ ∈ (0, 1)T ×S

3

which represent the speech activity probabilities of the S
speakers for each time-frame. Just like with EEND-EDA, the
model is trained so that Ŷ matches to the reference labels
Y ∈ {0, 1}T ×S where yt,s = 1 if speaker s is active at time
t and silent otherwise. The main difference between EENDEDA and DiaPer is in how the attractors are obtained given
the frame embeddings. As shown in Figure 1, DiaPer makes
use of Perceivers to obtain the attractors instead of the LSTMbased encoder-decoder.
The main two modules in DiaPer are the frame encoder
and the attractor decoder. As shown in Figure 2 and proposed
in [13], the frame encoder receives the sequence of frame
features X and transforms them with a few chained selfattention layers E = F rameEncoder(X) to obtain the frame
embeddings E ∈ RT ×D . The attractor decoder receives the
frame embeddings and produces attractors A = P ercDec(E)
with A ∈ RA×D 1 which are in turn compared with the frame
embeddings to determine which speaker is active at each timeframe: Ŷ = σ(EP ercDec(E)⊤ ).
In other words, the frame encoder is in charge of transforming the initial input features into deeper and more contextualized representations from which (a) the attractors will be
estimated, and (b) the frame-wise activation of each speaker
will be determined. Several encoder layers are used to extract
such representations and, in a similar way as presented in [27],
each layer also includes frame-speaker activities conditioning.
As shown in Figure 2, intermediate attractors are calculated
given the frame embeddings of each frame encoder layer.
The intermediate attractors are then weighted by intermediate
frame activities and transformed into the frame embedding
space to produce the conditioning.
More formally, the F rameEncoder consists of
(0)

weights and biases of the position-wise feed-forward layer,
1 ∈ RT is an all-one vector, ReLU (·) is the rectified linear
(l)
(l)
(l)
unit activation function, Qh ∈ RD×d , Kh ∈ RD×d , Vh ∈
(l)
D×d
D×D
R
, Oh ∈ R
are the query, key, value and output
D
is
projection matrices for the hth head and lth layer, and d = H
the dimension of each head. LN stands for layer normalization,
MHSA stands for multi-head self-attention and, FF stands for
feed-forward layer.
The conditioning is defined as follows
Condition(E(l−1) ) = Ŷ
Ŷ

(l−1)

(l−1)

P ercDec(E(l−1) )Wc

(10)

= σ(E(l−1) P ercDec(E(l−1) )⊤ ),

(11)

where P ercDec is the Perceiver-based attractor decoder,
Wc ∈ RD×D is a learnable parameter that weights the effect
of the intermediate attractors on the frame embeddings.
The decoder makes use of a chain of a few Perceiver blocks
as depicted in Figure 3. The set of learnable latents is transformed by each block utilizing the frame embeddings as keys
and values. One could have an equal amount of latents and
attractors, in which case the latents are an initial representation
transformed by the blocks to obtain the attractors. In practice,
we observed that this leads to instability in the training and that
obtaining the attractors as the linear combination of a larger
set of (transformed) latents performed better. More formally,
L(0) = M HA(0) (L, E(L) , E(L) )

(12)

L(b) = P ercBlockb (L(b−1) , E(L) )
 L(b−1) Q(b) (E(L) K(b) )⊤ 
(b)
(b)
h
h
√
Ch = Sof tmax
(E(L) Vh )
d
(b)
(b)
CA(b) = M HA(b) (L(b−1) , E(L) , E(L) ) = [C1 . . . CH ]O(b)

(13)
(14)

(15)
(b−1)

P ercBlockb (L

,E

(L)

) = M HSA

(b)1

(M HSA

(b)2

(CA

(b)

))
(16)

(1)

P ercDec(E(L) ) = WP ercBlockb (L(B) , E(L) ),

E(0) = [e1 , . . . , eT ]

(2)

E(l) = F rEncLayerl (E(l−1) + Condition(E(l−1) ))

(3)

where L ∈ RL×D is the set of latents, B is the number
of Perceiver blocks in the decoder (with 1 ≤ b ≤ B), H
(b)
is the number of heads (with 1 ≤ h ≤ H), Qh ∈ RD×d ,
(b)
(b)
(b)
Kh ∈ RD×d , Vh ∈ RD×d , Oh ∈ RD×D are the query,
key, value and output projection matrices for the hth head and
D
bth layer, and d = H
is the dimension of each head. M HA
stands for multi-head cross-attention and W ∈ RA×L is the
matrix that linearly combines latents to obtain attractors.
DiaPer decodes always the same fixed number of attractors,
denoted by A. As mentioned above, the attractors are obtained
as a linear combination of the latents. Therefore, the original
latents are encouraged to represent information about the
speakers in a general manner so that these representations
can be transformed (through cross- and self-attention) given a
particular input sequence in order to capture the characteristics
of the speakers in the utterance. Furthermore, in order to
encourage the model to utilize all latents, an extra “entropy
term” Le is added to the loss so that the weights that define the
linear combination of latents do not become extreme values
(i.e. no latent has a very high weight, therefore making all
others very small), where

et

= Win xt + bin
(0)

(0)

where 1 ≤ l ≤ L, and L is the number of self-attention
layers (F rEncLayerl denoting the lth self-attention layer)
and Win ∈ RD×F and bin ∈ RD are the weights and biases
of the input transformation on the frames.
(l−1)

= LN (E(l−1) )

(l−1)

(l−1)

Ē

= LN (Ē

Ê

(l−1)

F F (Ê

(4)
(l−1)

+ M HSA(l) (Ē
(l−1)

) = ReLU (Ê

))

(5)

(l)
(l)⊤
(l)
(l)⊤
W1 + 1b1 )W2 + 1b2

(6)
(l−1)

(l)

Ch = Sof tmax

 Ē

(l−1) (l) ⊤ 
(l)
Qh (Ē
Kh )

√

d

(l−1)

(Ē

(l)

Vh )
(7)

(l)

(l−1)

M HSA (Ē

(l)
(l)
) = [C1 . . . CH ]O(l)
(l−1)
(l−1)

F rEncLayer(E

) = Ê

(8)
(l−1)

+ F F (Ê

)

(9)

where H is the number of heads (with 1 ≤ h ≤ H), W1 ∈
RD×Df f , W2 ∈ RDf f ×D , b1 ∈ RDf f , b2 ∈ RD are the
1 In practice, S = A.

Le =

A
X
a=1

(17)

mean(Sof tmax(wa ) ∗ log Sof tmax(wa )) (18)

4

E(l)

E(l)

E

LayerNorm
Position-wise feed-forward

Wc

Weighted
Attractors

E(3)

LayerNorm

Frame-speaker Y
activities

σ

Self-attention layer
E(2)

MHS A(Ē(l−1))

(l−1)

Attractors

Self-attention layer

Multi-head self-attention
Ē

Con dit ion(E(l−1))

Self-attention layer

(l−1)

Ê

Transformation W

Self-attention layer

E(4)

E(1)

(l−1)

E(0)

LayerNorm

Detached

E(l−1)

Self-attention layer

Self-attention layer
Latents

Linear Layer

E(l−1)

Perceiver-based decoder

X

Fig. 2. Scheme of frame encoder (middle), detail of self-attention layer (left) and conditioning scheme (right).

Frame
embeddings

Latents

K, V

MH Cross
Attention

Q

Q
K, V MH Cross
Attention
MH Self
Attention

Q
K, V

Perceiver
block

MH Self
Attention

Q
K, V

Perceiver
block

K, V

Perceiver
block

losses. The main idea is that using the frame embeddings
produced by the frame encoder, we calculate losses using
the intermediate attractors given by the latents after each
Perceiver block. Analogously, using the attractors produced
by the Perceiver-based decoder, we calculate losses using the
intermediate frame embeddings given after each layer in the
frame encoder. The averages of the intermediate losses over
frame encoder layers and over Perceiver blocks are summed to
the losses L̂d (Y, Ŷ) and L̂a (r, p) which use “final” attractors
and “final” frame embeddings. Then, Ld and La are obtained
as

Q
Weighted
average

Ld = L̂d (Y, Ŷ) +

Attractors

Fig. 3. Scheme of Perceiver decoder.

La = L̂a (r, p) +

and wa ∈ RL is the row of W corresponding to attractor a.
In standard scaled dot-product attention [23], the softmax
is applied on the time-axis to normalize the attention weights
along the sequence length before multiplying with the values.
In Perceiver, cross- and self-attention on the latents are intertwined. We observed slightly better performance if, when
doing cross-attention, the softmax was applied to normalize
across latents rather than along the sequence length, i.e. each
frame embedding is “probabilistically” assigned to each latent
using weights that sum up to one. This and other decisions
are compared in the experimental section.
As usual for EEND-based models, the diarization loss Ld
is calculated as
T

L̂d (Y, Ŷ) =

X
1
min
BCE(ytϕ , ŷt ),
T S ϕ∈perm(S) t

(19)

where considering all reference labels permutations denote
permutation invariant training (PIT) loss.
Like in EEND-EDA, to determine which attractors are valid,
an attractor existence loss L̂a is calculated as L̂a (r, p) =
BCE(r, p) using the same permutation given by L̂d .
L̂d and L̂a are enough to train the model, but inspired by
other works [27], [30], [31], we decided to introduce auxiliary

L−1

B−1

L−1

B−1

l
b
1 X
1 X
L̂d (Y, Ŷ ) +
L̂d (Y, Ŷ )
L−1
B−1
l=1
b=1
(20)

1 X
1 X
L̂a (r, pl ) +
L̂a (r, pb ),
L−1
B−1
l=1
b=1
(21)

where p = [p1 , . . . , pA ] are the attractor posterior existence
probabilities and r = [r1 , . . . , rA ] are the reference presence
labels ri ∈ {0, 1} for 1 ≤ i ≤ A. pl are the posteriors using
the frame embeddings of the lth frame encoder layer and pb
are the posteriors using the bth Perceiver block.
The final loss to be optimized is L = Ld + La + Le .
One of the major disadvantages when using a nonautoregressive decoder is that the number of elements to
decode (attractors in this case) has to be set in advance and
this imposes a limit on the architecture. However, unlike the
original versions of EEND, we do not focus on a scenario
with a specific quantity of speakers but rather set the model
to have a maximum number of attractors A large enough
to handle several scenarios. This is done in one way or
another in all methods that handle “flexible” amounts of
speakers, i.e. when running inference with EEND-EDA, it is
necessary to decode a specific maximum number of attractors.
DiaPer decodes always the same number of attractors and, like
in EEND-EDA [14], a linear layer plus sigmoid determine
which attractors are valid, i.e. correspond to a speaker in the
conversation.

5

IV. E XPERIMENTAL S ETUP
A. Data
1) Training data: One of the key aspects of training
end-to-end diarization models is the training data. Neural
models require large amounts of training data annotated for
diarization which, in practice, are scarce. The compromise
solution consists in generating training data artificially by
combining segments of speech from different recordings.
Simulated mixtures [5] have been shown to enable the training
of EEND models but they have some disadvantages, mainly
related to their lack of naturalness. Some works [32]–[34] have
explored alternatives that allow these models to obtain better
performance. In this work, we opt for simulated conversations
(SC) for which public recipes are available2 and for which the
advantages over mixtures have been shown for real conversations with two and more speakers [33], [34].
Following this approach, different sets of SC were generated. To train 8 kHz models, 10 sets were created, each with a
different number of speakers per SC (ranging from 1 to 10) and
each containing 2500 h of audio. Utterances from the following
sets were used: Switchboard-2 (phases I, II, III) [35]–[37],
Switchboard Cellular (parts 1 and 2) [38], [39], and NIST
Speaker Recognition Evaluation datasets (from years 2004,
2005, 2006, 2008) [40]–[47]. All the recordings are sampled
at 8 kHz and, out of 6381 speakers, 90% are used for creating
training data. The Kaldi ASpIRE VAD3 is used to obtain
time annotations (in turn used to produce reference diarization
labels). To augment the training data, we use 37 noises from
MUSAN [48] labeled as “background”. They are added to the
signal scaled with a signal-to-noise ratio selected randomly
from {5, 10, 15, 20} dB.
In order to train 16 kHz models, a similar strategy was followed to also generate SC with different amounts of speakers
ranging from 1 to 10 per conversation, all comprised of 2500 h
of audio. Instead of telephone conversations, utterances were
taken from LibriSpeech [49] which consists of 1000 hours of
read English speech from almost 2500 speakers. The same
VAD as described above was used to produce annotations and
equivalent background noises were used, but in 16 kHz.
2) Evaluation data: Different corpora were used to evaluate
the models. For telephone speech, we utilized the speaker segmentation data from 2000 NIST Speaker Recognition Evaluation [50] dataset, usually referred to as “Callhome” [51] which
has become the de facto telephone conversations evaluation set
for diarization containing recordings with different numbers
of speakers as shown in Table I. We report results using the
standard Callhome partition4 , denoting the partitions as CH1
and CH2. We also report results on the subset of 2-speaker
conversations to which we refer as CH1-2spk and CH2-2spk.
Results on Callhome consider all speech (including overlap
segments) for evaluation with a forgiveness collar of 0.25 s.
We also report results on the conversational telephone speech
(CTS) domain from the Third DIHARD Challenge [52], which
consists of previously unpublished telephone conversations

TABLE I
I NFORMATION PER LIST FOR C ALLHOME PARTS 1 AND 2.
No. speakers

2

3

4

5

6

7

# Hours (2-spk)

CH1
CH2

155
148

61
74

23
20

5
5

3
3

2
0

8.70 (3.19)
8.55 (2.97)

from the Fisher collection. The development and evaluation
sets in the “full” set consist of 61 2-speaker 10-minute recordings each. Originally 8 kHz signals, they were upsampled to
16 kHz for the challenge and downsampled to 8 kHz to be used
in this work. As usual on DIHARD, all speech is evaluated
with a collar of 0 s.
Besides telephone conversations, we compared the models
on a variety of wide-band datasets. As the models we evaluate
are trained on single-channel data, when the datasets contain
microphone array data, we mix all channels in the microphone
array (far-field) or headsets (near-field). Training sets (or
development, if train sets are not available) are utilized for
fine-tuning. The databases considered are:
• AISHELL-4 [53], using the train/evaluation split provided.
• AliMeeting [54], using the train/eval/test split provided.
Unlike in the M2MET Challenge, oracle VAD is not used.
• AMI [55], [56], using the full-corpus-ASR partition into
train/dev/test and the diarization annotations of the “only
words” setup described in [57]5 .
• CHiME6 [58], using the official partition and annotations
from CHiME7 challenge [59] into train/dev/eval.
• DIHARD2 [60], using the official partition.
• DIHARD3 [52], using the official “full” partition in order
to have a more distinct corpus wrt DIHARD 2.
• DipCo [61], using the official partition and annotations from
CHiME7 challenge [59] into dev/eval.
• Mixer6 [62], using the official partition and annotations
from CHiME7 challenge [59] into train/dev/eval but, given
that the train part has only one speaker per recording, we
only consider the dev and eval parts.
• MSDWild
[63], using the official partition into
few.train/many.val/few.val as train/dev/test following
other published results.
• RAMC [64], using the official partition.
• VoxConverse [65], using the official partition into dev/test
and latest annotations6 .
More information about each dataset can be found in Table II.
The choice of forgiveness collar for calculating DER corresponds to the least forgiving choice (i.e. collar of 0 s) except
in cases where a challenge or the authors proposed differently.
In no case is used any kind of oracle information (such as
VAD) in order to have full pipeline comparisons.
B. Models
As the main baseline for this work, we utilize end-toend neural diarization with encoder-decoder attractors (EENDEDA) [14] which is the most popular EEND approach that can
handle multiple speakers. The architecture used was exactly

2 https://github.com/BUTSpeechFIT/EEND dataprep
3 http://kaldi-asr.org/models/m4

5 https://github.com/BUTSpeechFIT/AMI-diarization-setup

4 Sets listed in https://github.com/BUTSpeechFIT/CALLHOME sublists

6 Version 0.3 in https://github.com/joonson/voxconverse/tree/master

6

TABLE II
I NFORMATION ABOUT THE NUMBER OF FILES , THE MINIMUM AND MAXIMUM NUMBER OF SPEAKERS PER RECORDING AND THE NUMBER OF HOURS PER
PARTITION AS WELL AS EVALUATION COLLAR , TYPES OF MICROPHONE AND CHARACTERISTICS OF EACH EVALUATION DATASET.
Dataset

train
#files #spk

AISHELL-4
191
AliMeeting
209
AMI
136
14
CHiME6
DIHARD2
–
DIHARD3 full
–
–
DipCo
Mixer6
243
MSDWild
2476
RAMC
289
VoxConverse
–

3-7
2-4
3-5
4
–
–
–
1
2-7
2
–

#h
107.53
111.36
80.67
35.68
–
–
–
183.09
66.1
149.65
–

development
#files #spk # h

test
#files #spk

#h

–
8
18
2
192
254
5
59
177
19
216

20
60
16
4
194
259
5
23
490
43
232

12.72
10.78
9.06
10.05
22.49
33.01
2.6
6.02
9.85
20.64
43.53

–
–
2-4
4.2
4
9.67
4
4.46
1-10 23.81
1-10 34.15
4
2.73
2 44.02
3-10 4.1
2
9.89
1-20 20.3

5-7
2-4
3-4
4
1-9
1-9
4
2
2-4
2
1-21

the same as that described in [14] and we used our PyTorch
implementation7 . 15 consecutive frames of 23-dimensional
log Mel-filterbanks (computed over 25 ms every 10 ms) are
stacked to produce 345-dimensional features every 100ms.
These are transformed by the frame encoder, comprised of 4
self-attention encoder blocks (with 4 attention heads each) into
a sequence of 256-dimensional embeddings. These are then
shuffled in time and fed into the LSTM-based encoder-decoder
module that decodes attractors, which are deemed as valid if
their existence probability is above a certain threshold. A linear
layer followed by the sigmoid function is used to obtain speech
activity probabilities for each speaker (represented by a valid
attractor) at each time step (represented by an embedding).
Part of the setup for DiaPer is shared with the baseline,
namely the input features, the frame encoder configuration (except in experiments where the number of layers was changed),
and the mechanism for determining attractor existence.
Following standard practice with EEND models, the training
scheme consists in training the model first on synthetic training
data and then performing fine-tuning (FT) using a small
development set of real data of the same domain as the
test set. In the experiments with more than two speakers, a
model initially trained on synthetic data with two speakers per
recording is adapted to a synthetic set with a variable number
of speakers and finally fine-tuned to a development set.
As clustering-based baseline, we utilize a VBx-based [57]
system in two flavors: 8 kHz and 16 kHz. Two VADs were
used: Kaldi ASpIRE8 and pyannote’s. The best one of the
two was chosen for each dataset based on performance
on the development set. To handle overlap, the OSD from
pyannote [66] is run and the second speakers are assigned
heuristically [67] (closest in time speaker). For results on
AMI, Callhome and DIHARD 2, the hyperparameters of VBx
were the same as those used in [57]. For the other sets,
discriminative VBx (DVBx) [68] was used to find optimal
hyperparameters automatically.
C. Training
Most trainings were run on a single GPU. The batch size
was set to 32 with 200000 minibatch updates of warm-up
7 https://github.com/BUTSpeechFIT/EEND
8 http://kaldi-asr.org/models/m4

DER
Microphone
collar (s)
0
0
0
0.25
0
0
0.25
0.25
0.25
0
0.25

array
array & headset
array & headset
array
varied
varied
array
varied
varied
mobile phone
varied

Characteristics
Discussions in Mandarin in different rooms
Meetings in Mandarin in different rooms
Meetings in English in different rooms
Dinner parties in home environments
Wide variety of domains
Wide variety of domains
Dinner party sessions in the same room
Interviews and calls in English
Videos of daily casual conversations
Phone calls in Mandarin
Wide variety of videos (different languages)

respectively. Following [14], the Adam optimizer [69] was
used and scheduled with noam [23]. For a few trainings with
a variable number of speakers where 4 GPUs were used, the
batchsize and warm-up steps were adapted accordingly. Other
hyperparameters (i.e. dropout, learning rate) can be seen in the
training configuration files shared in the repository.
For FT on a development set, the Adam optimizer was used.
Both EEND-EDA and DiaPer were fine-tuned with learning
rate 10−5 for Callhome 2 speakers due to the low amount
of development data and with 10−4 for whole Callhome and
DIHARD 3 CTS. For all the other datasets, DiaPer was finetuned on the train set using learning rate 10−6 until the
performance on the development set stopped improving (or,
in case there was no official training set available, FT on the
development set til not further improvement on the test set).
During training (with 2-speaker SC), adaptation (with a
variable number of speakers SC), and FT (with in-domain
data), batches were formed by sequences of 600 Mel-filterbank
outputs, corresponding to 1 minute, unless specified otherwise
(i.e. the analysis in Section V-E). These sequences are randomly selected from the generated SC9 . During inference, the
full recordings are fed to the network one at a time. In all
cases, when evaluating a given epoch, the checkpoints of the
previous 10 epochs are averaged to run the inference.
To compare EEND-EDA and DiaPer on equal ground, we
train both models for the same number of epochs, evaluate
them after regular intervals and choose the best performing on
the development set. For comparisons on 2-speaker scenarios
of Callhome, each model is trained for 100 epochs on telephony SC. Every 10 epochs, the parameters of the 10 previous
checkpoints are averaged and performance is evaluated on
CH1-2spk set to determine the best one. The performance of
such model is reported on CH2-2spk set and DIHARD3 CTS
full eval before and after FT.
When doing adaptation to more speakers for comparison on
Callhome, the best performing 2-speaker model as described
above is selected as initialization. The adaptation to a SC set
with different amounts of speakers per recording is run for
75 epochs. The parameters of 10 models are averaged every
5 epochs and performance is evaluated on CH1 to determine
9 The acute reader will notice that it might not be possible to see as many
as 10 speakers in 1 minute, this is addressed in the experimental section.

7

the best one. The performance of such model is reported on
CH2. This model is also used as initialization when doing FT
to a development set. To avoid selecting results on the test set,
all fine-tunings are run for 20 epochs and the parameters of
the last 10 epochs are averaged to produce the final model.
For comparisons on the variety of wide-band sets, three
variants of DiaPer are trained. An 8 kHz model following a
similar approach as described above: trained for 100 epochs
on SC of 2 speakers created with telephony speech and then
adapted to the SC with 1-10 speakers set for 100 epochs. The
16 kHz is trained in the same manner but using SC generated
from LibriSpeech. Two flavors of this “wide-band DiaPer” are
used, one with 10 attractors and another with 20 attractors to
analyze the impact on datasets with several speakers. For the
comparisons on wide-band sets, results are also shown without
and with FT.
D. Metrics
Diarization performance is evaluated in terms of diarization
error rate (DER) as defined by NIST [70] and using dscore10 .
During inference time, the model outputs are thresholded at
0.5 to determine speech activities. For evaluation sets where
a forgiveness collar is used when calculating DER, a median
filter with window 11 is applied as post-processing over the
speech activities. If the forgiveness collar is 0 s, no filtering
is applied and, instead of running the inference with 10
frames subsampling in the frame encoder, 5 frames only
are subsampled as this provides a better resolution in the
output. However, due to the high memory consumption when
processing very long files, for CHiME6 a subsampling of 15
frames had to be used. To analyze the models’ quality in terms
of finding the correct number of speakers, confusion matrices
for correct/predicted numbers of speakers are presented for SC
with 10 recordings for each quantity of speakers from 1 to 10.
V. E XPERIMENTS
A. Selection of parameters
In order to shed some light on the influence of different
aspects of the architecture in DiaPer, we present first a comparison of the performance when varying some key elements. We
start from the best configuration we found, namely: 3 Perceiver
blocks in the attractor decoder, 128 latents, 4 self-attention
layers in the frame-encoder and 128-dimensional latents, frame
embeddings and attractors. This configuration is marked with
a gray background in the comparisons. The models are trained
on 2-speaker SC and no FT is applied.
Table III shows the impact of the number of Perceiver
blocks in the attractor decoder. Out of the configurations
explored, having 3 blocks presents the best performance.
Table IV shows how the number of latents can affect the
performance. Differences are small for all amounts equal to or
below 256, even with as few as 8. Nevertheless, given that the
number of parameters is very similar for any configuration,
we keep 128 latents as having more could ease the task when
more speakers appear in a recording.
10 https://github.com/nryant/dscore

TABLE III
C OMPARISON ON CH1-2 SPK WHEN VARYING THE NUMBER OF P ERCEIVER
BLOCKS IN THE ATTRACTOR DECODER .
# Blocks

1

2

3

4

5

DER (%)
8.27 8.41 7.96 8.44 8.09
# Parameters (M) 3.1
3.7
4.3
4.9
5.5

TABLE IV
C OMPARISON ON CH1-2 SPK WHEN VARYING THE NUMBER OF LATENTS .
# Latents

128

256

512

DER (%)
8.15 8.14 8.29 8.10 7.96
# Parameters (M) 4.29 4.29 4.29 4.30 4.31

8

16

32

64

8.10
4.32

8.54
4.36

TABLE V
C OMPARISON ON CH1-2 SPK WHEN VARYING THE NUMBER OF LAYERS IN
FRAME ENCODER .
# Layers

3

4

5

6

DER (%)
8.18 7.96 8.33 8.31
4.3
4.9
5.5
# Parameters (M) 3.7

Table V presents a comparison when varying the number of
layers in the frame encoder. Standard SA-EEND and EENDEDA use 4 and some works have used 6 layers. In the case of
DiaPer, we do not observe large differences in the performance
and obtain the best performance with 4.
Finally, Table VI shows the impact of the model dimensions on the performance. Increasing the dimensionality of
latents, frame embeddings and attractors further than 128 does
not show improvements in terms of DER but increases the
number of model parameters significantly. Figure 4 shows
performance throughout the epochs for the development set. It
is clear how more dimensions allow for a faster convergence;
however, more than 128 do not provide more gains in terms
of final performance. In addition, more dimensions make
the training less stable: using 512 would always lead to
instability. Configurations with less than 128 dimensions (64
and 32) can improve further and after 200 epochs reduce
the DER by about 1 point but still with worse final results
than other configurations. These findings show that reasonable
performances can be achieved even with more lightweight
versions of DiaPer.

Fig. 4. Performance on CH1-2spk for different model dimensions (latents,
frame embeddings and attractors).

8

TABLE VI
C OMPARISON ON CH1-2 SPK WHEN VARYING THE MODEL DIMENSION
( LATENTS , FRAME EMBEDDINGS AND ATTRACTORS ).
Dimensions

32

DER (%)
12.90
# Parameters (M)
0.7

64

128

256

384

9.30
1.6

7.96
8.16
8.52
4.3
12.9
26.6

TABLE VII
DER (%) ON CH1-2 SPK WITH DIFFERENT ABLATION COMPARISONS .
DiaPer

7.96

Without normalization of loss per #speakers
11.10
Without frame encoder conditioning
8.55
Without intermediate loss in frame encoder
8.53
Without intermediate loss in Perceiver blocks
8.43
Perceiver cross-attention across time (instead of latents)
8.07

B. Ablation analysis
Different decisions were made when developing DiaPer and
some have a big impact on the performance. Table VII presents
a comparison of DiaPer in the best configuration shown above
and when removing some of the operations performed during
training. The first one refers to the normalization of the loss
by the reference quantity of speakers, as shown in Eq. 19.
DiaPer always outputs A attractors and the loss is calculated
for all of them, even if only training with 2-speakers SC. If the
loss is not normalized by the amount of speakers, the model
tends to find less speech, increasing the missed speech rate
considerably.
Another ablation is with respect to the frame encoder
conditioning described in Figure 2. Similarly to [27], where the
scheme was introduced, removing it worsens the performance
by around 0.5 DER. Comparable degradation is observed by
removing the loss reinforcements in both frame encoder and
Perceiver blocks.
Finally, the attention normalization in the cross-attention
calculations inside the Perceiver blocks is performed across
latents in DiaPer. If done across time, as it is usually done,
slightly worsens the performance. We have also explored using
across-time normalization in half of the heads and acrosslatents in the other half but the performance was not better
than using across-latents in all heads.
While publications always focus on the positive aspects of
the models, we believe there is substantial value in sharing
those options that were explored and did not provide gains.
Among them were:
• use absolute positional encoding when feeding the frame
embeddings into the attractor decoder (no improvement).
• use specaugment for data augmentation (no improvement).
• following [26], [71], add a speaker recognition loss to
reinforce speaker discriminative attractors (slightly worse
results).
• following [72], include an LSTM-based mechanism to
model output speaker activities through time (worse performance).
• model silence with a specific attractor (worse performance).
• length normalize frame embeddings and attractors before
performing dot-product to effectively compute cosine sim-

(a) CH Part 2 (2 speakers)

(b) DH3 CTS full eval

Fig. 5. DER (%) for telephone recordings of Callhome and DIHARD 3
conversational telephone speech (CTS) with 2 speakers.

ilarity (worse performance).
use cross-attention to compare frame embeddings and attractors instead of dot-product (worse performance).
• as analyzed in [72]–[74], use power set encoding to model
the diarization problem instead of per-frame per-speakers
activities (worse performance). In particular, we believe that
the reason for this approach not to work with DiaPer is
that, when handling many speakers, the number of classes
in the power set is too high and most of them are not
well represented. This approach has much more potential
in limited quantity of speakers scenarios as shown in [74].
Implementations of most of these variants can be found in our
public implementation in https://github.com/BUTSpeechFIT/
DiaPer to enable others to easily revisit them.
•

C. Two-speaker telephone conversations
Even though DiaPer is specifically designed for the scenario
with multiple speakers, as it is common practice, in this section
we first present results for the 2-speaker telephone scenario.
It should be noted that both EEND-EDA and DiaPer, when
trained only with 2-speaker SC learn to only output activities
for 2 speakers, even if they are prepared to handle a variable
number of them. Figure 5 compares the performance on two
sets before and after FT to the in-domain development set.
Both EEND-EDA and DiaPer were trained on the same data
with 5 different seeds to produce the error bars. Results show
that DiaPer can reach significantly better performance on both
datasets, both with and without FT.
Figure 6 presents a comparison between EEND-EDA and
DiaPer inference times. Although DiaPer is slower for very
short recordings, it can run considerably faster when processing several-minute recordings. This speed-up is given only by
the Perceiver-based attractor decoder (instead of the LSTMbased of EEND-EDA) since the rest of the model is the same.
Even more, these results correspond to an input downsampling
factor of 10 but if more precision was used, the length of
the frame-embedding sequences increases which would show
further advantage for DiaPer for the same recording lengths.
Notably, EEND-EDA has 6.4 million parameters while DiaPer
has only 4.6 million showing that the model not only runs
faster when processing long sequences but also makes more
efficient use of the parameters.
Table VIII presents an exhaustive comparison with all
competitive systems at the time of publication under the same
conditions: all speech is evaluated and no oracle information

9

Fig. 7. DER (%) for CH Part 2 with varying number of speakers.
TABLE IX
DER (%) COMPARISON ON CH2. F OR EACH METHOD (EEND-EDA AND
D IA P ER ), SELECTING THE BEST MODEL ON CH1 OUT OF THE 5 RUNS .
Fig. 6. Inference time for EEND-EDA and DiaPer for recordings from 1
minute to 1 hour running 5 times each inference with a downsampling factor
of 10. In black is the percentage of time taken by DiaPer wrt EEND-EDA.
Ran on Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz.

TABLE VIII
DER (%) COMPARISON ON CH2-2 SPK WITH OTHER METHODS . F OR OUR
RESULTS , WE SELECTED THE MODEL WITH THE BEST PERFORMANCE ON
CH1 OUT OF THE 5 RUNS . T YPE CAN BE CLUSTERING (C), 1- STAGE (1-S),
OR 2- STAGE (2-S) SYSTEM . (I) STANDS FOR ITERATIVE , MEANING THERE
IS AN ITERATIVE PROCESS AT INFERENCE TIME .
#Param. Data
(Million) (kHour)

No
FT

With
FT

9

N/A

9.92

6.4
4
4.2
??
12.8
16.1
≈6.7
5.7
10.9
8.5
8.5
11.6
11.6

2.4
2.5
4.7
2
2.4
16
15.5
2.5
2.5
2.5
2.5
2.4
24.7

–
9.65
–
–
–
–
–
8.81
8.52
13.8
–
–
–

8.07
7.18
6.82
6.7
7.37
7.04
6.91
7.77
7.12
7.58
7.36
6.79
5.69

4.2
8
2.4
5.5
19.4

–
–
–
–
–

7.18
6.46
7.83
7.1
5.73

2.5
2.5

8.77
8.05

7.96
7.5111

System

Type Code

VAD + VBx + OSD

C

✓

17.9

1-S (I)
1-S (I)
1-S
1-S (I)
1-S
1-S (I)
1-S
1-S
1-S
1-S (I)
1-S (I)
1-S (I)
1-S (I)

✓

EEND-EDA [14]
EEND-EDA Confor. [32]
CB-EEND [20]
DIVE [11]
RX-EEND [30]
EDA-TS-VAD [75]
EEND-OLA [72]
EEND-NA [27]
EEND-NA-deep [27]
EEND-IAAE [28] (it=2)
EEND-IAAE [28] (it=5)
AED-EEND [12]
AED-EEND-EE [29]

✓
✓

EEND-VC [76]
2-S
WavLM + EEND-VC [77] 2-S
EEND-NAA [26]
2-S (I)
Graph-PIT-EEND-VC [78] 2-S
EEND-OLA + SOAP [72] 2-S

✓

≈8
≈840
8
≈5.5
15.6

EEND-EDA
DiaPer

✓
✓

6.4
4.6

1-S (I)
1-S

✓

is used. Data refers to the number of hours of data for
supervision. For end-to-end models, it can be real or synthetic
data and for the clustering-based baseline, it consists of all
data used to train the x-vector extractor, VAD and OSD.
Methods are divided into groups depending on if they are
single or two-stage. Even though DiaPer does not present the
best performance among all approaches, it reaches competitive
results with fewer parameters and even without FT.

11 It is worth mentioning that out of the 5 runs, the best DER on Part 2 was
7.38 but that did not correspond to the lowest DER on Part 1. Analogously,
for EEND-EDA it was 7.78.

System

All

2-spk

3-spk

4-spk

5-spk

6-spk

EEND-EDA
+ FT CH1

16.70
15.29

8.99
7.54

13.84
14.01

24.57
20.84

33.10
33.34

46.25
41.36

DiaPer
+ FT CH1

14.86
13.60

9.10
7.39

12.70
12.08

19.18
19.62

29.52
30.25

41.81
28.84

D. Multiple-speakers telephone conversations
Figure 7 presents the comparison for recordings with multiple amounts of speakers where EEND-EDA and DiaPer
are trained on the same data. Once again, DiaPer presents
significant advantages over EEND-EDA both before and after
fine-tuning to the development set. Table IX shows the DER
for different numbers of speakers per conversation where gains
are observed in almost all cases. The largest differences are
for recordings with more speakers, suggesting the superiority
of DiaPer in handling such situations.
Finally, Table X shows the comparison of DER components.
It can be observed that without fine-tuning DiaPer does not
improve the confusion error of EEND-EDA but rather missed
and false alarm (FA) speech. A closer look at the inherent VAD
and OSD performances of the two models allows us to see
that DiaPer improves considerably the OSD recall with similar
OSD precision. Therefore, most of the improvement is related
to more accurate overlapped speech detection. Nevertheless,
it should be pointed out that precision and recall slightly
above 50% are still very low. There is clearly large room for
improving the performance in this aspect.
EEND-EDA has been shown to have problems handling
several speakers (i.e. not being able to find more than the
quantity seen in training and significantly miscalculating the
number of speakers when more than 3 are present in a
conversation) [6], [79]. To compare DiaPer’s performance
in this sense we trained 5 of both such models with the
same procedure and evaluated them on a set of 100 SC with
10 recordings for each number of speakers from 1 to 10.
Confusion matrices between the number of real (reference)
speakers and the number found by the system were calculated
for each model. The averages of such confusion matrices
for the 5 DiaPer and 5 EEND-EDA models are presented in
Figure 8. Although both EEND-EDA and DiaPer are trained
on the same data with only up to 7 speakers per SC (matrices
above), EEND-EDA is able to find more speakers. Yet, DiaPer
is considerably more accurate for SC with up to 6 speakers.
When both EEND-EDA and DiaPer are trained with up to 10

10

TABLE X
C OMPARISON ON CH2. F OR EACH METHOD , SELECTING THE BEST MODEL
ON CH1 OUT OF THE 5 RUNS . DER AND ITS THREE COMPONENTS AND
PRECISION AND RECALL FOR VAD AND OSD PERFORMANCE .
System

DER
(%)

Miss
(%)

FA
(%)

Conf.
VAD
OSD
(%) P (%) R (%) P (%) R (%)

EEND-EDA 16.70 7.08 4.88
+ FT CH1 15.29 8.24 2.61

4.73
4.44

93.3
95.8

97.6
94.5

50.0
63.8

41.9
38.3

DiaPer
14.86 6.16 3.90
+ FT CH1 13.60 7.80 2.06

4.80
3.74

93.1
95.4

98.1
95.3

51.5
64.1

52.1
44.8

(a) EEND-EDA

(b) DiaPer

(a) 1 minute, 50 epochs

(b) 1 minute, 100 epochs

(c) 4 minutes, 50 epochs

(d) 4 minutes, 100 epochs

Fig. 9. Confusion matrices for DiaPer adapted to telephony SC with 1 to 10
speakers per recording using different sequence lengths to create the batches:
1 minute (top) and 4 minutes (bottom).

(c) EEND-EDA

(d) DiaPer

Fig. 8. Confusion matrix average of five models evaluated on SC when
adapted for 50 epochs with 2-7 speakers (above) and 1-10 speakers (below).

speakers per SC (matrices below), we can see that DiaPer is
still considerably more accurate. However, its performance is
limited when the number of speakers is 8 or more.
One element to consider is that all the models above
were trained and adapted using batches of 1-minute-long
sequences. It is less likely for 10 speakers in a simulated
conversation to be heard in only one minute. For this reason,
we also performed adaptation of one model using 4-minutelong sequences. While sequences of 1 minute have on average
3.6 speakers, sequences of 4 minutes have 5.2, allowing the
model to see higher quantities of speakers per training sample
during training. A comparison is presented in Figure 9 after
50 and 100 epochs training with 1 and 4 minutes sequences.
A slight advantage is observed when using 4 minutes after 50
epochs but such advantage increases after 100 epochs.
Finally, Table XI presents comparisons with other publications on Callhome Part 2 using all recordings. Again,
all speech is evaluated and no oracle information is used.
For these comparisons, we utilize one of the models trained
seeing SC up to 7 speakers (since Callhome does not contain
recordings with more speakers). Results show that even if DiaPer has a competitive performance, many methods can reach
considerably better results. The main advantage of DiaPer is its
lightweight nature, having the least number of parameters in

comparison with all other methods. Exploring larger versions
of DiaPer (i.e. increasing the model dimension) which could
lead to better performance in multi-speaker scenarios is left
for future research.
Many previous works present comparisons with clusteringbased methods. Although such methods do not deal with
overlap intrinsically, it is possible to run an overlapped speech
detector and assign second speakers heuristically in order to
present a more fair comparison. Interestingly, when utilizing a
few years old VAD, VBx and OSD systems, and therefore not
highly overtuned, the results are still on par with many end-toend models showing the relevance of these types of systems
even at current time.
E. Wide-band scenarios
Most works on end-to-end models focus on the telephone
scenario and use Callhome (which is a paid dataset) as
benchmark. We believe that this is partly because synthetic
data (needed for training such models) match this condition
quite well. However, there are many wide-band scenarios of
interest when performing diarization and only few works have
analyzed their systems on a wide variety of them [10], [74].
Following this direction, and pursuing a more democratic field,
in this section we use DiaPer on a wide variety of corpora
(most of which are of public and free access) and show the
performance for the same model (before and after FT) across
domains.
Since most of the scenarios present many speakers per
conversation, all DiaPer models were adapted to the set of 110 speakers per recording using sequences of 4 minutes. The
12 It is worth mentioning that out of the 5 runs, the best DER on Part 2 was
13.16 but that did not correspond to the lowest DER on Part 1.
13 It is worth mentioning that out of the 5 runs, the best DER on Part 2 was
23.81 but that did not correspond to the lowest DER on Part 1.

11

TABLE XI
DER COMPARISON ON CH2 WITH OTHER METHODS . F OR OUR RESULTS ,
WE SELECTED THE MODEL WITH THE BEST PERFORMANCE ON CH1 OUT
OF THE 5 RUNS . T YPE CAN BE CLUSTERING (C), 1- STAGE (1-S) OR
2- STAGE (2-S) SYSTEM . (I) STANDS FOR ITERATIVE , MEANING THERE IS
AN ITERATIVE PROCESS AT INFERENCE TIME .
#Param. Data
(Million) (kHour)

No
FT

With
FT

System

Type Code

VAD + VBx + OSD

C

✓

17.9

9

N/A 13.63

EEND-EDA [14]
EDA-TS-VAD [75]
EEND-OLA [72]
AED-EEND [12]
AED-EEND-EE [29]

1-S (I)
1-S (I)
1-S
1-S (I)
1-S (I)

✓
✓

6.4
16.1
6.7
11.6
11.6

15.5
16
15.5
15.5
24.7

–
–
–
–
–

15.29
11.18
12.57
14.22
10.08

EEND-VC [76]
EEND-GLA [79]
WavLM + EEND-VC [77]
Graph-PIT-EEND-VC [78]
EEND-OLA + SOAP [72]
EEND-VC MS-VBx [9]

2-S
2-S
2-S
2-S
2-S
2-S

✓

≈8
10.7
≈840
≈5.5
15.6
≈840

4.2
15.5
8
5.5
19.4
5.5

–
–
–
–
–
–

12.49
11.84
10.35
13.5
10.14
10.4

EEND-EDA
DiaPer

1-S (I)
1-S

6.4
4.6

15
15

16.70 15.29
14.86 13.6012

✓
✓
✓

Scoring with collar 0 s
VAD + VBx + OSD

C

✓

17.9

9

N/A 26.18

pyannote 2.1 [10]

2-S

✓

23.6

2.9

32.4 29.3

EEND-EDA
DiaPer

1-S (I)
1-S

✓
✓

6.4
4.6

2.5
2.5

28.73 25.77
27.84 24.1613

8 kHz model was trained on telephony SC and two 16 kHz
models were used. Both wide-band models were trained on
LibriSpeech-based SC where one model had 10 attractors (like
the 8 kHz model) and another had 20 attractors to allow for
more speakers. All models are evaluated without and with
FT. For corpora where a multi-speaker train set is available,
the train set is used for FT until no more improvements are
observed on the development set. If no train set is available,
the dev set is used for FT until the performance on the test
set does not improve further. Therefore, results on these latter
corpora should be taken with a grain of salt.
Looking at the results, in some cases, there was overfitting
when performing FT on the development set (since those sets
did not have a train set). In DipCo, this is most likely due to
the limited amount of data. In VoxConverse, the distribution of
the number of speakers per recording is skewed towards more
speakers in the test set and FT on the dev set makes the model
find fewer speakers than without FT. Even more, recordings
with more speakers are longer, making the overall error higher
after FT on the test set. As for AliMeeting near mix, DiaPer
(20att) has slightly worse performance on the test set but the
decision to stop the FT was made observing the performance
on the dev set, for which there were improvements.
In comparison with the best results published at the time
of writing, DiaPer performs considerably worse in most of
the scenarios. However, it should be noted that in many cases
the best results correspond to systems submitted to challenges
which usually consist of the fusion of a few carefully tuned
models. DiaPer, like any end-to-end system, is very sensitive to
the type of training data. This is highly noticeable in the high
errors before fine-tuning for all far-field scenarios: AISHELL-

4, AliMeeting far mix, AMI mix array, CHiME6 and DipCo;
and relatively lower errors for exclusively close-talk scenarios:
AliMeeting near mix, AMI mix headset, Mixer6 and in the
comparison between DIHARD 2 and DIHARD 3 full where
the latter contains a large portion of telephone conversations.
All SC (used to train the models) are generated with speech
captured from short distances (telephone for the 8 kHz system
and LibriSpeech for the 16 kHz ones). Using reverberation
could improve the situation, but it has not been explored so far
in this context. Not having enough amount of data matching
the testing scenario is a strong drawback for the fine-tuning of
end-to-end models as observed with DipCo and VoxConverse.
Conversely, Mixer6 and RAMC with large amounts of FT data
and relatively simple setups are among the scenarios with the
largest relative improvement given by the FT. Even if in most
cases the performance is not on par with other approaches,
DiaPer’s final performance is very competitive for MSDWild
and RAMC.
The main goal of this comparison was to present a unified
framework evaluated across different corpora. More tailored
models could be trained if we used SC with specific numbers
of speakers per recording (matching the evaluation data).
Likewise, the output post-processing (subsampling and median
filter) could be adapted for each dataset. This should definitely
result in better performance and is left for future work.
We can also see that even a standard cascaded system
can reach competitive results on a few datasets. This shows
the importance and relevance of these systems as baselines
nowadays even when end-to-end solutions are the most studied
in the community.
Regarding the comparison between 8 kHz and 16 kHz DiaPers, in most cases, the latter reaches better performance
both without and with FT. Even though the 8 kHz model was
trained with more conversational data, this does not provide
advantages over the 16 kHz model trained on LibriSpeechbased SC. However, the effect of FT is in most cases considerably large, reducing the differences between 8 kHz and
16 kHz models. Creating synthetic training data that resembles
real ones remains an open challenge for most scenarios.
With respect to the number of attractors in the model, we
can observe that overall having more of them is beneficial.
This is actually not a drawback for DiaPer since the quantity
of attractors does not impact severely on the number of
parameters or computations. It is left for future work to explore
the effect of larger numbers of attractors (i.e. using 40 or 80).
VI. C ONCLUSIONS
In this work, we have presented DiaPer, a new variant of
EEND models that makes use of Perceivers for modeling
speaker attractors. A detailed analysis of the architectural
decisions was presented, including ablations. In a thorough
comparison on telephone conversations, we showed performance gains wrt EEND-EDA, the most widespread end-to-end
model that handles multiple speakers.
We also presented results on several wide-band datasets
comparing the performance with a standard cascaded system
and with the best-published results at the time of writing.

12

DipCo mix

Mixer6 mix

MSDWild

RAMC

VoxConverse

22.23
41.31
36.41

84.01
78.51
68.54

27.86
49.92
34.10

20.49
38.24
24.08

56.19
64.33
45.09

38.09
19.11
14.69

18.81
34.62
17.99

18.33
32.61
20.90

6.69
32.37
31.63

VAD+VBx+OSD 16 15.84
DiaPer
16 48.21
DiaPer+FT
16 41.43
DiaPer (20att)
16 47.86
DiaPer (20att)+FT 16 31.30

28.84
38.67
32.60
34.35
26.27

22.61 34.61
28.19 57.07
27.82 49.75
23.90 52.29
24.44 50.97

22.42
36.36
32.94
35.08
30.49

70.42
78.25
70.77
77.51
69.94

26.67
43.75
32.97
44.51
31.23

20.28
34.21
24.12
34.82
22.77

49.22
48.26
Overfit
43.37
Overfit

35.60
21.03
13.41
18.51
10.99

16.86
35.69
15.46
25.07
14.59

18.19
38.05
21.11
32.08
18.69

6.12
23.20
Overfit
22.10
Overfit

Best
published
results

16.76[80] 23.8[10] —
14.0[10] 23.3[74] —
13.2[74] 23.5[90] —

DIHARD 3
full

23.49 34.14
33.70 54.69
28.93 50.49

DIHARD 2

AliMeeting
near mix

29.60
45.40
31.60

CHiME6
mix

AliMeeting
far mix

8 14.46
8 49.29
8 42.66

AMI mix
headset

AISHELL-4

VAD+VBx+OSD
DiaPer
DiaPer+FT

AMI mix
array

System

SR (kHz)

TABLE XII
DER (%) COMPARISON ON A VARIETY OF TEST SETS . OVERLAPS ARE EVALUATED AND ORACLE VAD IS NOT USED . SR STANDS FOR SAMPLING RATE .
U NDERLINED RESULTS DENOTE SINGLE SYSTEMS AND OVERLINED RESULTS CORRESPOND TO FUSIONS OR MORE COMPLEX MODELS .

22.2[10] 18.0[74] 32.46[81] 26.4[9]
17.32[82] 22.36[83] 7.27[83] 21.96[63] 19.90[64] 4.0[84]
22.0[74] 16.95[82] 27.25[83] 26.88[85] 16.94[86] 22.04[87] 6.14[87] 33.6[88] 14.37[25] 4.39[89]
19.53[82] 13.00[29] 25.11[87] 24.64[29] 16.76[82] 16.36[87] 5.65[81] 16.0[74] —
4.35[91]

Even though DiaPer attains competitive performance in some
domains, it is considerably worse in others.
Several aspects are left to study in the future such as changes
in the frame encoder, where it seems that the self-attention
layers have reached a limit and which present the major
hardware bottleneck when handling very long recordings.
Furthermore, the frame-encoder and Perceiver blocks could be
coupled more tightly to improve the quality of representations
(frame embeddings and attractors) simultaneously.
While DiaPer presents a relatively lightweight end-to-end
solution, one avenue for yet more compact models could be
parameter sharing: some of the blocks in the architecture could
have tied parameters in order to obtain similar results with
fewer parameters.
Finally, even if some works have appeared in this direction,
how to define proper training sets for end-to-end models is
still a very under-explored topic and we believe that further
analyses are necessary to bridge the gap in performance
between narrow-band and wide-band corpora.
With the aim of facilitating reproducible research, we release the code that implements DiaPer as well as models
trained on public and free data.
ACKNOWLEDGMENTS
We thank the members of the diarization sub-group in
BUT for valuable suggestions, especially Anna Silnova for
feedback on the manuscript, and Dominik Klement for advice
to run DVBx. We also thank Marc Delcroix, Zhengyang Chen,
Shota Horiguchi, Zhihao Du, and Chin-Yi Cheng for sharing
details about the number of parameters/hours of training data
in some models (and all other authors for having shared that
information in their work already).
The work was supported by the Czech Ministry of Interior
project No. VJ01010108 ”ROZKAZ”, Czech National Science
Foundation (GACR) project NEUREM3 No. 19-26934X, and
Horizon 2020 Marie Sklodowska-Curie grant ESPERANTO,
No. 101007666. Computing on IT4I supercomputer was supported by the Czech Ministry of Education, Youth and Sports
through the e-INFRA CZ (IDs 90140 and 90254).

R EFERENCES
[1] G. Sell et al., “Diarization is Hard: Some Experiences and Lessons
Learned for the JHU Team in the Inaugural DIHARD Challenge.” in
Interspeech, 2018, pp. 2808–2812.
[2] F. Landini et al., “BUT System for the Second DIHARD Speech
Diarization Challenge,” in ICASSP. IEEE, 2020.
[3] T. J. Park et al., “Auto-tuning spectral clustering for speaker diarization
using normalized maximum eigengap,” IEEE Signal Processing Letters,
vol. 27, pp. 381–385, 2019.
[4] ——, “A review of speaker diarization: Recent advances with deep
learning,” Computer Speech & Language, vol. 72, p. 101317, 2022.
[5] Y. Fujita et al., “End-to-End Neural Speaker Diarization with
Permutation-Free Objectives,” in Proc. Interspeech, 2019.
[6] S. Horiguchi et al., “Encoder-decoder based attractors for end-to-end
neural diarization,” IEEE/ACM Transactions on Audio, Speech, and
Language Processing, vol. 30, 2022.
[7] I. Medennikov et al., “Target-Speaker Voice Activity Detection:
A Novel Approach for Multi-Speaker Diarization in a Dinner Party
Scenario,” in Proc. Interspeech, 2020, pp. 274–278. [Online]. Available:
http://dx.doi.org/10.21437/Interspeech.2020-1602
[8] K. Kinoshita et al., “Integrating end-to-end neural and clustering-based
diarization: Getting the best of both worlds,” in ICASSP. IEEE, 2021.
[9] M. Delcroix et al., “Multi-Stream Extension of Variational Bayesian
HMM Clustering (MS-VBx) for Combined End-to-End and Vector
Clustering-based Diarization,” in Proc. INTERSPEECH, 2023.
[10] H. Bredin, “pyannote.audio 2.1 speaker diarization pipeline: principle,
benchmark, and recipe,” in Proc. INTERSPEECH, 2023.
[11] N. Zeghidour et al., “DIVE: End-to-end speech diarization via iterative
speaker embedding,” in ASRU. IEEE, 2021.
[12] Z. Chen et al., “Attention-based Encoder-Decoder Network for End-toEnd Neural Speaker Diarization with Target Speaker Attractor,” in Proc.
INTERSPEECH, 2023, pp. 3552–3556.
[13] Y. Fujita et al., “End-to-end neural speaker diarization with selfattention,” in ASRU. IEEE, 2019, pp. 296–303.
[14] S. Horiguchi et al., “End-to-End Speaker Diarization for an Unknown
Number of Speakers with Encoder-Decoder Based Attractors,” Interspeech, 2020.
[15] E. Han et al., “BW-EDA-EEND: Streaming end-to-end neural speaker
diarization for a variable number of speakers,” in ICASSP. IEEE, 2021.
[16] Y. Xue et al., “Online end-to-end neural diarization with speaker-tracing
buffer,” in SLT. IEEE, 2021.
[17] S. Horiguchi et al., “Multi-channel end-to-end neural diarization with
distributed microphones,” in ICASSP. IEEE, 2022.
[18] ——, “Mutual Learning of Single-and Multi-Channel End-to-End Neural Diarization,” in SLT. IEEE, 2023.
[19] A. Gulati et al., “Conformer: Convolution-augmented Transformer for
Speech Recognition,” in Proc. Interspeech, 2020.
[20] Y. C. Liu et al., “End-to-End Neural Diarization: From Transformer to
Conformer,” in Proc. Interspeech, 2021.
[21] T.-Y. Leung et al., “Robust End-to-End Speaker Diarization with Conformer and Additive Margin Penalty,” in Proc. Interspeech, 2021.

13

[22] A. Jaegle et al., “Perceiver: General perception with iterative attention,”
in International conference on machine learning. PMLR, 2021.
[23] A. Vaswani et al., “Attention is all you need,” Advances in neural
information processing systems, vol. 30, 2017.
[24] Z. Pan et al., “Towards End-to-end Speaker Diarization in the Wild,”
arXiv preprint arXiv:2211.01299, 2022.
[25] S. J. Broughton et al., “Improving End-to-End Neural Diarization Using
Conversational Summary Representations,” in Interspeech, 2023.
[26] M. Rybicka et al., “End-to-end neural speaker diarization with an
iterative refinement of non-autoregressive attention-based attractors,” in
Proc. Interspeech, vol. 2022, 2022, pp. 5090–5094.
[27] Y. Fujita et al., “Neural Diarization with Non-Autoregressive Intermediate Attractors,” in ICASSP. IEEE, 2023.
[28] F. Hao et al., “End-to-end neural speaker diarization with an iterative
adaptive attractor estimation,” Neural Networks, vol. 166, pp. 566–
578, 2023. [Online]. Available: https://www.sciencedirect.com/science/
article/pii/S089360802300401X
[29] Z. Chen et al., “Attention-based Encoder-Decoder End-to-End
Neural Diarization with Embedding Enhancer,” arXiv preprint
arXiv:2309.06672, 2023.
[30] Y. Yu et al., “Auxiliary loss of transformer with residual connection for
end-to-end speaker diarization,” in ICASSP. IEEE, 2022.
[31] Y.-R. Jeoung et al., “Improving Transformer-Based End-to-End Speaker
Diarization by Assigning Auxiliary Losses to Attention Heads,” in
ICASSP. IEEE, 2023.
[32] N. Yamashita et al., “Improving the Naturalness of Simulated Conversations for End-to-End Neural Diarization,” in Proc. The Speaker and
Language Recognition Workshop (Odyssey), 2022.
[33] F. Landini et al., “From Simulated Mixtures to Simulated Conversations
as Training Data for End-to-End Neural Diarization,” in Interspeech,
2022.
[34] ——, “Multi-Speaker and Wide-Band Simulated Conversations as Training Data for End-to-End Neural Diarization,” in ICASSP. IEEE, 2023.
[35] D. Graff et al., “Switchboard-2 phase I, LDC98S75,” 1998.
[36] ——, “Switchboard-2 phase II, LDC99S79,” Web Download. Philadelphia: LDC, 1999.
[37] ——, “Switchboard-2 phase III, LDC2002S06,” Web Download.
Philadelphia: LDC, 2002.
[38] ——, “Switchboard Cellular Part 1 audio LDC2001S13,” Web Download. Philadelphia: LDC, 2001.
[39] ——, “Switchboard Cellular Part 2 audio LDC2004S07,” Web Download. Philadelphia: LDC, 2004.
[40] N. M. I. Group, “2004 NIST SRE LDC2006S44,” 2006.
[41] ——, “2005 NIST SRE Training Data LDC2011S01,” 2006.
[42] ——, “2005 NIST SRE Test Data LDC2011S04,” 2011.
[43] ——, “2006 NIST SRE Evaluation Test Set Part 1 LDC2011S10,” 2011.
[44] ——, “2006 NIST SRE training Set LDC2011S09,” 2011.
[45] ——, “2006 NIST SRE Evaluation Test Set Part 2 LDC2012S01,” 2012.
[46] ——, “2008 NIST SRE Training Set Part 1 LDC2011S05,” 2011.
[47] ——, “2008 NIST SRE Test Set LDC2011S08,” 2011.
[48] D. Snyder et al., “Musan: A music, speech, and noise corpus,” arXiv
preprint arXiv:1510.08484, 2015.
[49] V. Panayotov et al., “Librispeech: an asr corpus based on public domain
audio books,” in ICASSP. IEEE, 2015.
[50] M. Przybocki et al., “NIST SRE LDC2001S97,” Philadelphia, New
Jersey: Linguistic Data Consortium, 2001.
[51] “NIST SRE 2000 Evaluation Plan,” https://www.nist.gov/sites/default/
files/documents/2017/09/26/spk-2000-plan-v1.0.htm .pdf.
[52] N. Ryant et al., “The Third DIHARD Diarization Challenge,” in Proc.
Interspeech, 2021, pp. 3570–3574.
[53] Y. Fu et al., “AISHELL-4: An Open Source Dataset for Speech Enhancement, Separation, Recognition and Speaker Diarization in Conference
Scenario,” in Proc. Interspeech, 2021.
[54] F. Yu et al., “M2MeT: The ICASSP 2022 multi-channel multi-party
meeting transcription challenge,” in ICASSP. IEEE, 2022.
[55] J. Carletta et al., “The AMI meeting corpus: A pre-announcement,” in
International workshop on machine learning for multimodal interaction.
Springer, 2006, pp. 28–39.
[56] W. Kraaij et al., “The AMI meeting corpus,” in Proc. International
Conference on Methods and Techniques in Behavioral Research, 2005.
[57] F. Landini et al., “Bayesian HMM Clustering of x-vector Sequences
(VBx) in Speaker Diarization: Theory, Implementation and Analysis on
Standard Tasks,” Computer Speech & Language, vol. 71, 2022.
[58] S. Watanabe et al., “CHiME-6 Challenge: Tackling Multispeaker Speech
Recognition for Unsegmented Recordings,” in Proc. 6th International
Workshop on Speech Processing in Everyday Environments, 2020.

[59] S. Cornell et al., “The CHiME-7 DASR Challenge: Distant Meeting Transcription with Multiple Devices in Diverse Scenarios,” arXiv
preprint arXiv:2306.13734, 2023.
[60] N. Ryant et al., “Second DIHARD challenge evaluation plan,” Linguistic
Data Consortium, Tech. Rep, 2019.
[61] M. Van Segbroeck et al., “DiPCo–Dinner Party Corpus,” arXiv preprint
arXiv:1909.13447, 2019.
[62] L. Brandschain et al., “The Mixer 6 corpus: Resources for cross-channel
and text independent speaker recognition,” in Proc. of LREC, 2010.
[63] T. Liu et al., “MSDWild: Multi-modal Speaker Diarization Dataset in
the Wild,” in Proc. Interspeech, 2022.
[64] Z. Yang et al., “Open Source MagicData-RAMC: A Rich Annotated Mandarin Conversational(RAMC) Speech Dataset,” in Interspeech,
2022.
[65] J. S. Chung et al., “Spot the Conversation: Speaker Diarisation in the
Wild,” in Proc. Interspeech, 2020, pp. 299–303.
[66] H. Bredin et al., “pyannote.audio: neural building blocks for speaker
diarization,” in IEEE ICASSP, 2020.
[67] S. Otterson et al., “Efficient use of overlap information in speaker
diarization,” in ASRU. IEEE, 2007, pp. 683–686.
[68] D. Klement et al., “Discriminative Training of VBx Diarization,” arXiv
preprint arXiv:2310.02732, 2023.
[69] D. P. Kingma et al., “Adam: A method for stochastic optimization,”
arXiv preprint arXiv:1412.6980, 2014.
[70] “NIST Rich Transcription Evaluations,” https://www.nist.gov/itl/iad/mig/
rich-transcription-evaluation, version: md-eval-v22.pl.
[71] S. Maiti et al., “End-to-end diarization for variable number of speakers
with local-global networks and discriminative speaker embeddings,” in
ICASSP. IEEE, 2021, pp. 7183–7187.
[72] J. Wang et al., “TOLD: a Novel Two-Stage Overlap-Aware Framework
for Speaker Diarization,” in ICASSP. IEEE, 2023.
[73] Z. Du et al., “Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information,” arXiv preprint
arXiv:2111.13694, 2021.
[74] A. Plaquet et al., “Powerset multi-class cross entropy loss for neural
speaker diarization,” in Proc. INTERSPEECH, 2023.
[75] D. Wang et al., “Target speaker voice activity detection with transformers
and its integration with end-to-end neural diarization,” in ICASSP.
IEEE, 2023.
[76] K. Kinoshita et al., “Advances in Integration of End-to-End Neural and
Clustering-Based Diarization for Real Conversational Speech,” in Proc.
Interspeech, 2021, pp. 3565–3569.
[77] S. Chen et al., “Wavlm: Large-scale self-supervised pre-training for full
stack speech processing,” IEEE Journal of Selected Topics in Signal
Processing, vol. 16, no. 6, pp. 1505–1518, 2022.
[78] K. Kinoshita et al., “Utterance-by-utterance overlap-aware neural diarization with Graph-PIT,” in Proc. Interspeech, 2022.
[79] S. Horiguchi et al., “Towards neural diarization for unlimited numbers
of speakers using global and local attractors,” in ASRU. IEEE, 2021.
[80] Y. Chen et al., “Interrelate Training and Searching: A Unified Online
Clustering Framework for Speaker Diarization,” in Interspeech, 2022.
[81] N. Kamo et al., “NTT Multi-Speaker ASR System for the DASR Task
of CHiME-7 Challenge,” CHiME-7 Challenge, 2023.
[82] M.-K. He et al., “ANSD-MA-MSE: Adaptive Neural Speaker Diarization Using Memory-Aware Multi-speaker Embedding,” IEEE/ACM
Transactions on Audio, Speech, and Language Processing, 2023.
[83] L. Ye et al., “The IACAS-Thinkit System for CHiME-7 Challenge,”
CHiME-7 Challenge, 2023.
[84] S. Baroudi et al., “pyannote. audio speaker diarization pipeline at
VoxSRC 2023,” The VoxCeleb Speaker Recognition Challenge, 2023.
[85] S. Horiguchi et al., “End-to-end speaker diarization as post-processing,”
in ICASSP. IEEE, 2021, pp. 7188–7192.
[86] ——, “The Hitachi-JHU DIHARD III system: Competitive end-to-end
neural diarization and x-vector clustering systems combined by DOVERLap,” arXiv preprint arXiv:2102.01363, 2021.
[87] R. Wang et al., “The USTC-NERCSLIP Systems for the CHiME-7
DASR Challenge,” arXiv preprint arXiv:2308.14638, 2023.
[88] T. Liu et al., “BER: Balanced Error Rate For Speaker Diarization,” arXiv
preprint arXiv:2211.04304, 2022.
[89] D. Karamyan et al., “The Krisp Diarization system for the VoxCeleb
Speaker Recognition Challenge 2023,” The VoxCeleb Speaker Recognition Challenge 2023 (VoxSRC-23), 2023.
[90] D. Raj et al., “GPU-accelerated Guided Source Separation for Meeting
Transcription,” in Proc. INTERSPEECH, 2023.
[91] D. Wang et al., “Profile-Error-Tolerant Target-Speaker Voice Activity
Detection,” arXiv preprint arXiv:2309.12521, 2023.