TLCE: Transfer-Learning Based Classifier Ensembles for Few-Shot Class-Incremental Learning

Shuangmei Wang^a, Yang Cao^a, Tieru Wu^{a,b,*}

a Jilin University, No. 2699 Qianjin Street, Changchun, 130012, China
b Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, MOE, China
* Corresponding author.

Abstract

Few-shot class-incremental learning (FSCIL) struggles to incrementally recognize novel classes from few examples without catastrophic forgetting of old classes or overfitting to new classes. We propose TLCE, which ensembles multiple pre-trained models to improve the separation of novel and old classes. TLCE minimizes interference between old and new classes by mapping old-class images to quasi-orthogonal prototypes through episodic training. It then ensembles diverse pre-trained models to better adapt to novel classes despite data imbalance. Extensive experiments on various datasets demonstrate that our transfer-learning ensemble approach outperforms state-of-the-art FSCIL methods.

Keywords: Few-Shot Learning, Class-Incremental Learning, Ensemble Learning

1. Introduction

Deep learning has sparked substantial advancements in various computer vision tasks, driven mainly by the emergence of large-scale datasets and powerful GPU computing devices. However, deep learning-based methods exhibit limitations in recognizing classes that were not included in their training. This has motivated significant research on Class-Incremental Learning (CIL), which focuses on dynamically updating a model using only the new samples of each additional task while preserving knowledge about previously learned classes. On the other hand, obtaining and annotating a sufficient quantity of data samples is both complex and expensive, so several studies investigate CIL in situations where data availability is limited. Specifically, researchers have explored few-shot class-incremental learning (FSCIL), which aims to continuously learn new classes from only a limited number of target samples. Two issues arise as a consequence: the potential for catastrophic forgetting of previously learned classes and the risk of overfitting to new concepts. Furthermore, Constrained Few-Shot Class-Incremental Learning (C-FSCIL) [1] introduces explicit constraints related to memory and computational capacity: the computational cost of learning a new class must remain constant, and the model's memory usage may grow at most linearly as additional classes are introduced. To address these issues, recent studies [2, 3, 4] emphasize the acquisition of transferable features by first training with the cross-entropy (CE) loss in the base session and then freezing the backbone to facilitate adaptation to new classes. C-FSCIL [1] employs meta-learning to map input images to quasi-orthogonal prototypes so that the prototypes of different classes interfere with each other as little as possible. Although C-FSCIL has demonstrated superior performance, we find a prediction bias arising from class imbalance and data imbalance.
We also observe that the process of assigning hyperdimensional quasi-orthogonal vectors to each class demands a substantial number of samples and iterations, which presents a challenge when allocating prototypes to novel classes that possess only a limited number of samples. In this paper, we propose TLCE, a transfer-learning based few-shot class-incremental learning method that ensembles classifiers memorizing different knowledge. One main inspiration is that pretraining a deep network on the base dataset and transferring its knowledge to the novel classes [5, 6] has been shown to be a strong baseline for few-shot classification. On the other hand, keeping the interference between the new classes and the old classes small is key. Hence, we leverage the advantages offered by both kinds of classifiers through ensemble learning. Firstly, we employ meta-learning to train a robust hyperdimensional network (RHD) following C-FSCIL, which allows us to effectively map input images to quasi-orthogonal prototypes for the base classes. Secondly, we integrate cosine similarity and cross-entropy loss to train a transferable knowledge network (TKN). Finally, we compute the prototype, i.e., the average of features, for each class. The classification of a test sample is simply determined by finding its nearest prototype under a weighted integration that combines the two similarity relationships. Compared to C-FSCIL, our TLCE adopts the similar idea of assigning quasi-orthogonal prototypes to base classes to ensure minimal interference. The key difference is that, through classifier ensembles, we attempt to perform equally well on all classes regardless of the training sequence. We conduct extensive comparisons with state-of-the-art few-shot class-incremental classification methods on miniImageNet [7] and CIFAR100 [8], and the results demonstrate the superiority of our TLCE. Ablation studies on different ensembles, i.e., different weights between the robust hyperdimensional network and the transferable knowledge network, also show the necessity of ensembling the two classifiers for better results. In summary, our contributions are as follows:

1. We propose TLCE, transfer-learning based classifier ensembles that improve novel class set separation while maintaining base class set separation.

2. Without additional training or expensive computation, the proposed method efficiently explores the comprehensive relation between prototypes and test features, improving novel class set separation and maintaining base class set separation.

3. We conduct extensive experiments on various datasets, and the results show that our efficient method outperforms state-of-the-art few-shot class-incremental classification methods.

2. Related Work

Few-Shot Learning. FSL seeks to develop neural models for new categories using only a small number of labeled samples. Meta-learning [9] is extensively utilized to accomplish few-shot classification. The core idea is to use the episodic training paradigm to learn generalizable classifiers or feature extractors for the base-class data in an optimization-based framework [10, 11, 12], or to learn a distance function that measures the similarity among feature embeddings through metric learning [13, 14, 15, 16]. On the other hand, pretraining classifiers or image encoders on the base dataset and then adapting them to the novel classes via transfer learning [5, 6] has been shown to be a strong baseline for few-shot classification. Based on the meta-learned feature extractor or the pretrained deep image model, we can perform nearest-neighbor (NN) based classification, which has proven to be a simple and effective approach for FSL. Specifically, the prediction is determined by measuring the similarity or distance between the test feature and the prototypes of the labeled novel features. Due to the limited number of samples, the prototypes computed from the few-shot novel class data may not represent the underlying data distribution. Several methods [17, 18, 19, 20] have recently been proposed to perform data calibration and obtain better samples or prototypes for the novel classes. Inspired by these representative few-shot methods, we attempt to leverage different training paradigms to acquire diverse models for computing target prototypes in few-shot class-incremental learning tasks.
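As a concrete illustration of the nearest-neighbor baseline described above, the following is a minimal sketch of prototype-based classification with L2-normalized features, in the spirit of [5, 6, 45]; the array shapes and the helper name are illustrative assumptions rather than any specific method's released code.

```python
import numpy as np

def nearest_prototype_predict(support_feats, support_labels, query_feats):
    """Classify each query by its nearest class prototype (mean of support features).
    Features are L2-normalized, so ranking by Euclidean distance is equivalent to
    ranking by cosine similarity."""
    classes = np.unique(support_labels)
    protos = np.stack([support_feats[support_labels == c].mean(axis=0) for c in classes])
    protos /= np.linalg.norm(protos, axis=1, keepdims=True)
    queries = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    dists = np.linalg.norm(queries[:, None, :] - protos[None, :, :], axis=-1)
    return classes[dists.argmin(axis=1)]
```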
Class Incremental Learning. CIL aims to build a universal classifier over all seen classes from a stream of labeled training sets. Current CIL algorithms can be roughly divided into three categories. The first category utilizes former data for rehearsal, which enables the model to review former instances and overcome forgetting [21, 22, 23]. The second category estimates the importance of each parameter and keeps the important ones static [24, 25, 26]. Other methods design algorithms to maintain the model's knowledge and discriminability: knowledge distillation-based methods build a mapping between the old and new models [27, 28, 29], while several methods aim to find biases and rectify them toward an oracle model [30, 31, 19]. FSCIL can be seen as a particular case of CIL, so we can learn from some of the above methods.

Few-Shot Class-Incremental Learning. FSCIL introduces few-shot scenarios, where only a few labeled samples are available, into the task of class-incremental learning. To achieve FSCIL, many works attempt to solve the problems of catastrophic forgetting and serious overfitting from different perspectives. TOPIC [32] employs a neural gas network to preserve the topology of the feature manifold from a cognitive-inspired perspective. SKD [33] and ERDIL [34] use knowledge distillation to balance the preservation of old knowledge and the adaptation to new knowledge. Feature-space based methods focus on obtaining compactly clustered features and maintaining generalization for future incremental classes [35, 36, 37]. From the perspective of parameter space, WaRP [38] combines the advantages of F2M [4], which finds flat minima of the loss function, and FSLL [39], which fine-tunes parameters: it pushes most of the previous knowledge compactly into only a few important parameters so that more parameters can be fine-tuned during incremental sessions. From the perspective of hybrid approaches, some works combine episodic training [3, 40, 1], ensemble learning [41, 42], and so on. C-FSCIL [1] maps input images to quasi-orthogonal prototypes through episodic training so that the prototypes of different classes encounter little interference. However, achieving quasi-orthogonality among all class prototypes is difficult when novel classes have only a limited number of labeled samples. MCNet [41] trains multiple embedding networks with diverse network architectures to enhance model diversity and enable them to memorize different knowledge effectively.
Similar to the above methods, our method is based on ensemble learning, but we train two networks with a shared architecture using different loss functions and training procedures. [42] enhances the expressive ability of the extracted features through multi-stage pre-training and uses a meta-learning process to extract meta-features as complementary features. Note that a model that generalizes well to novel classes is one with no overlap among the novel class sets and no interference with the base classes. In contrast to these methods, we ensemble a robust hyperdimensional (HD) network for the base classes and a transferable knowledge network for the novel classes from a whole new perspective.

3. Method

In this section, we propose an FSCIL method based on model ensembles. An ideal FSCIL model should ensure that newly added categories do not interfere with the old ones and maintain a distinct separation between them. The motivations mentioned above prompt us to solve the aforementioned problems by combining a robust hyperdimensional memory-augmented neural network and a transferable knowledge model through an ensemble. Firstly, we draw inspiration from [43, 1] and employ episodic training to map the base dataset to quasi-orthogonal prototypes, thereby minimizing the interference of base classes during incremental sessions. Secondly, we pretrain a model from scratch in a standard supervised way to obtain a transferable knowledge space. Finally, we integrate an explicit memory (EM) into the aforementioned embedding networks, so that the EM stores the embeddings of labeled data samples as class prototypes. During testing, we use nearest-prototype classification based on similarity, thereby meeting the classification requirements for all seen classes. Note that training only takes place in the base session, so we only need to compute the new class prototypes with the aforementioned models and update the EM. Figure 1 illustrates the framework of our method. In the following, we provide the technical details of the proposed method for few-shot class-incremental classification.

[Figure 1: An illustration of the proposed method pipeline. F is the backbone network and G is a projection layer. The RHD and TKN share the same architecture; different network parameters are obtained by using different training methods and loss functions. In the incremental sessions, the RHD and TKN parameters are frozen.]

3.1. Problem Statement

FSCIL learns continuously from a sequential stream of tasks. Suppose we have a stream of labeled training sets $D_1, D_2, \ldots, D_T$, where $D_t = \{(x_i, y_i)\}_{i=1}^{|D_t|}$. In the $t$-th task, the set of labels is denoted as $C^t$, where $C^i \cap C^j = \emptyset$ for all $i \neq j$, and the total number of classes in this task is $|C^t|$. $D_1$, which contains ample data, is called the base session, while each $D_t$ with $t > 1$ is a limited training set involving new classes (called an incremental session). We follow the conventional few-shot class-incremental learning setting, i.e., we build a series of $N$-way $K$-shot datasets $D_t = \{(x_i, y_i)\}_{i=1}^{N \times K}$, where $N$ is the number of novel classes and $K$ is the number of support samples in each novel session. In session $t$, the model only has access to the dataset $D_t$. After training with $D_t$, the model needs to recognize all encountered classes in $\cup_{s \le t} C^s$.
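To make the evaluation protocol concrete, the following is a minimal sketch of how the session stream described above can be organized; the helper name and the way classes are grouped are illustrative assumptions, not part of the paper's released code, although the 60/40 base/novel split and the 5-way 5-shot sessions match the setup reported in Section 4.1.

```python
import random
from collections import defaultdict

def build_fscil_sessions(dataset, base_classes, novel_classes, n_way=5, k_shot=5, seed=0):
    """Split a labeled dataset into one base session D_1 and several N-way K-shot
    incremental sessions D_t (t > 1), following the protocol in Sec. 3.1."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in dataset:
        by_class[y].append(x)

    # Base session D_1: all training data of the base classes.
    sessions = [[(x, y) for y in base_classes for x in by_class[y]]]

    # Incremental sessions: disjoint groups of N novel classes,
    # each contributing only K labeled support samples.
    for start in range(0, len(novel_classes), n_way):
        group = novel_classes[start:start + n_way]
        sessions.append([(x, y) for y in group for x in rng.sample(by_class[y], k_shot)])
    return sessions

# After learning from D_t, the model is evaluated on the union of all classes
# seen in sessions 1..t (base classes plus every novel class introduced so far).
```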
3.1.1. Robust Hyperdimensional Network (RHD)

Due to the "curse" of dimensionality, a randomly selected vector has a high probability of being quasi-orthogonal to other random vectors. As a result, representing a novel class with such a vector not only contributes incrementally to previous learning but also causes minimal interference. Hence, we follow C-FSCIL [1] and build an RHD network during the base session. Our method comprises three primary components: a backbone network, an extra projection layer, and a fully connected layer. The backbone network maps samples from the input domain $X$ to a feature space. To construct an embedding network that utilizes a high-dimensional distributed representation, the backbone network is joined with a projection layer. We then have

$\mu_1 = F_{\theta_1}(x), \quad \mu_2 = G_{\theta_2}(\mu_1)$,  (1)

where $\mu_1 \in \mathbb{R}^{d_f}$ is the intermediate feature of input $x$, $d_f$ is the dimension of the feature space, $\mu_2 \in \mathbb{R}^d$ is the output feature computed from the intermediate feature $\mu_1$, and $\theta_1$, $\theta_2$ are the learnable parameters of the backbone network and the projection layer, respectively. Firstly, we jointly train both $F_{\theta_1}$ and $G_{\theta_2}$ from scratch by standard supervised classification on the base session data to derive powerful embeddings for the downstream base learner. The empirical risk to minimize can be formulated as

$\min_{\theta_1, \theta_2} \mathcal{L}_{ce}(W^T \mu_2, y)$,  (2)

where $\mathcal{L}_{ce}(\cdot, \cdot)$ is the cross-entropy (CE) loss and $W$ contains the learnable parameters of the fully connected layer.
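As a rough illustration of Eqs. (1)-(2), the sketch below wires a backbone and a projection layer together and pretrains them with cross-entropy on the base session. The ResNet-12 backbone, learning rate, batch size, and epoch count echo the implementation details in Section 4.2, but the feature dimensions, the `backbone` argument, and the separate `classifier` module are placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingNet(nn.Module):
    """Backbone F_theta1 followed by a projection G_theta2, as in Eq. (1)."""
    def __init__(self, backbone: nn.Module, feat_dim: int = 640, embed_dim: int = 512):
        super().__init__()
        self.backbone = backbone                          # F_theta1: image -> mu_1
        self.projection = nn.Linear(feat_dim, embed_dim)  # G_theta2: mu_1 -> mu_2

    def forward(self, x):
        return self.projection(self.backbone(x))          # mu_2

def pretrain_base_session(model, classifier, loader, epochs=120, lr=0.01):
    """Supervised pretraining on the base session, minimizing Eq. (2)."""
    params = list(model.parameters()) + list(classifier.parameters())
    opt = torch.optim.SGD(params, lr=lr, momentum=0.9)
    for _ in range(epochs):
        for images, labels in loader:
            logits = classifier(model(images))            # W^T mu_2
            loss = F.cross_entropy(logits, labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
```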
Lastly, we build on top of the meta-learning setup to allocate nearly quasi-orthogonal vectors to the various image classes, so that these vectors are positioned far away from each other in the hyperdimensional space. We replace the fully connected layer with the EM and build a series of $|D^1|$-way $K$-shot tasks, where $|D^1|$ is the number of base classes and $K$ is the number of support samples in each task. In every task, the projection layer produces a support vector for every training input. To represent each class, we calculate the average of all support vectors belonging to that class, thereby generating a single prototype vector per class; the prototypes of all classes are saved in the EM. Specifically, the prototype for a given class $i$ is determined as

$p_i^R = \frac{1}{|S_i|} \sum_{x \in S_i} G_{\theta_2}(F_{\theta_1}(x))$,  (3)

where $S_i$ is the set of all samples from class $i$ and $|S_i|$ is the number of such samples. Given a query sample $q$ and the prototypes, we compute the cosine similarity for class $i$ as

$S_i^R = \cos(\tanh(G_{\theta_2}(F_{\theta_1}(q))), \tanh(p_i^R))$,  (4)

where $\tanh(\cdot)$ is the hyperbolic tangent function and $\cos(\cdot, \cdot)$ is the cosine similarity. In hyperdimensional memory-augmented neural networks [43], the hyperbolic tangent has demonstrated its usefulness as a nonlinear function that regulates the norms of the activated prototypes and the embedding outputs. Additionally, cosine similarity tackles the norm and bias problems commonly encountered in FSCIL by emphasizing the angle between activated prototypes and embedding outputs while disregarding their norms [44]. Given the cosine similarity score $S_i^R$ for every class $i$, we utilize a soft absolute (softabs) sharpening function to enhance this attention vector, resulting in quasi-orthogonal vectors [43]. The softabs attention function is defined as

$h(S_i^R) = \frac{\epsilon(S_i^R)}{\sum_{j=1}^{|D^1|} \epsilon(S_j^R)}$,  (5)

where $\epsilon(\cdot)$ is the sharpening function

$\epsilon(c) = \frac{1}{1 + e^{-\beta(c - 0.5)}} + \frac{1}{1 + e^{-\beta(-c - 0.5)}}$.  (6)

The sharpening function includes a stiffness parameter $\beta$, which is set to 10 as in [43].
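The prototype memory and the softabs attention of Eqs. (3)-(6) can be sketched as follows. This is an illustrative re-implementation that assumes 2-D feature tensors produced by the embedding network above, not the released C-FSCIL code.

```python
import torch
import torch.nn.functional as F

def class_prototypes(features, labels, num_classes):
    """Eq. (3): average the embeddings of each class into one prototype vector."""
    protos = torch.zeros(num_classes, features.size(1))
    for c in range(num_classes):
        protos[c] = features[labels == c].mean(dim=0)
    return protos

def cosine_scores(query_embed, prototypes):
    """Eq. (4): cosine similarity between the tanh-squashed query and prototypes."""
    q = torch.tanh(query_embed).reshape(1, -1)   # tanh regulates the norms
    p = torch.tanh(prototypes)                   # (num_classes, d)
    return F.cosine_similarity(q, p, dim=1)      # one score per class

def softabs_attention(scores, beta=10.0):
    """Eqs. (5)-(6): soft absolute sharpening, normalized over the base classes."""
    eps = torch.sigmoid(beta * (scores - 0.5)) + torch.sigmoid(beta * (-scores - 0.5))
    return eps / eps.sum()
```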
3.1.2. Transferable Knowledge Network (TKN)

It is difficult to ensure quasi-orthogonality among the prototypes of all classes when novel classes have only a small number of labeled samples. Inspired by transfer-learning based few-shot methods, we explore various transferable models. The most straightforward approach is to use a model pre-trained from scratch with standard supervised classification; we employ this model as the baseline for our analysis. SimpleShot [45] demonstrates that nearest-neighbor classification, where features are simply L2-normalized and compared by Euclidean distance, can obtain competitive results in few-shot classification tasks. After L2 normalization, the squared Euclidean distance is an affine function of cosine similarity, so the two induce the same ranking. Using cosine similarity as the distance metric has two implications: 1) during training, it focuses on the angles between normalized features rather than the absolute distances within the latent feature space, and 2) the normalized weight parameters of the fully connected layer can be interpreted as the centroids or centers of each category [36]. We therefore combine cosine similarity with the cross-entropy loss to train a more transferable network. To simplify the calculation of cosine similarity in the final fully connected layer, we set the bias to zero. The data prediction procedure can then be written as

$\mu_2 = G_{\theta_2}(\mu_1) = G_{\theta_2}(F_{\theta_1}(x)), \quad \|W_i\| = \|\mu_2\| = 1$,  (7)

$w_i = W_i^T \mu_2 = \|W_i\| \|\mu_2\| \cos(\theta_i) = \cos(\theta_i)$,  (8)

where $w_i$ is the cosine similarity between the feature $\mu_2$ and the weight vector $W_i$ of class $i$. The loss function is given by

$L = -\frac{1}{T} \sum_{j=1}^{T} \log\!\left(\frac{e^{y_j}}{\sum_{i=1}^{|C^1|} e^{w_i}}\right) = -\frac{1}{T} \sum_{j=1}^{T} \log\!\left(\frac{e^{\|W_j\|\|\mu_2\|\cos(\theta_j)}}{\sum_{i=1}^{|C^1|} e^{\|W_i\|\|\mu_2\|\cos(\theta_i)}}\right) = -\frac{1}{T} \sum_{j=1}^{T} \log\!\left(\frac{e^{\cos(\theta_j)}}{\sum_{i=1}^{|C^1|} e^{\cos(\theta_i)}}\right)$,  (9)

where $T$ is the number of training images and $y_j$ denotes the cosine similarity toward the ground-truth class of image $j$.

3.2. Incremental Test

By employing the incremental-frozen framework, we reduce the storage requirements: we only preserve the prototypes of all encountered classes and update the explicit memory (EM) when new classes are introduced. In this way, we effectively manage the limitations imposed by memory and computational capacity. Firstly, we use the robust hyperdimensional network and the transferable knowledge network to calculate the prototypes $P^R$ and $P^T$. Once the prototypes of the novel classes are obtained, we promptly update the EM. Note that the EM does not update the prototypes of the old classes, as the RHD and TKN remain fixed in the subsequent sessions. All prototypes of the classes that have appeared so far are then saved in the EM. Finally, we derive the final classification result by evaluating the similarity between the test sample and each prototype. Suppose we have a test sample $q$. According to Eq. (4), we compute separate similarities $S^R$ and $S^T$ for the RHD and TKN classifiers, respectively. We then combine the classifiers through weighted integration of both scores to obtain the final score

$S = \lambda S^R + (1 - \lambda) S^T$,  (10)

where $\lambda \in [0, 1]$ is a hyperparameter. This approach allows us to leverage the strengths of multiple classifiers and intelligently merge their outputs, leading to a more accurate final result.
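The sketch below shows how the weighted integration of Eq. (10) could be applied at test time. The default λ = 0.8 mirrors the value reported in Section 4.2, while the separate prototype memories for the two branches and the treatment of the TKN branch (plain cosine similarity, without the tanh squashing of Eq. (4)) are simplifying assumptions of this illustration.

```python
import torch
import torch.nn.functional as F

def ensemble_predict(query_image, rhd, tkn, protos_rhd, protos_tkn, lam=0.8):
    """Nearest-prototype prediction with the weighted score of Eq. (10).

    rhd, tkn               : the two frozen embedding networks (RHD and TKN).
    protos_rhd, protos_tkn : (num_seen_classes, d) prototype matrices stored in the EM.
    lam                    : the ensemble weight lambda in [0, 1].
    """
    with torch.no_grad():
        # RHD branch follows Eq. (4) with tanh squashing; the TKN branch uses
        # plain cosine similarity here, which is an assumption of this sketch.
        s_r = F.cosine_similarity(torch.tanh(rhd(query_image)), torch.tanh(protos_rhd), dim=1)
        s_t = F.cosine_similarity(tkn(query_image), protos_tkn, dim=1)
        score = lam * s_r + (1.0 - lam) * s_t   # Eq. (10)
    return int(score.argmax())
```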
Table 1: Quantitative comparison on the test set of miniImageNet in the 5-way 5-shot FSCIL setting. Columns 1-9 report the accuracy (%) in each session. "Average Acc." is the average performance over all sessions. "Final Improv." is the improvement of our method in the last session. † The results of [1] are obtained with its released code.

Method          |   1     2     3     4     5     6     7     8     9   | Average Acc. | Final Improv.
iCaRL [27]      | 61.31 46.32 42.94 37.63 30.49 24.00 20.89 18.80 17.21 |    33.29     |    +35.79
EEIL [28]       | 61.31 46.58 44.00 37.29 33.14 27.12 24.10 21.57 19.58 |    34.97     |    +33.42
NCM [46]        | 61.31 47.80 39.30 31.90 25.70 21.40 18.70 17.20 14.17 |    30.83     |    +38.83
TOPIC [32]      | 61.31 50.09 45.17 41.16 37.48 35.52 32.19 29.46 24.42 |    39.64     |    +28.58
SKD [33]        | 61.31 58.00 53.00 50.00 48.00 45.00 42.00 40.00 39.00 |    48.48     |    +14.00
SPPR [2]        | 61.45 63.80 59.53 55.53 52.50 49.60 46.69 43.79 41.92 |    52.76     |    +11.08
F2M [4]         | 72.05 67.47 63.16 59.70 56.71 53.77 51.11 49.21 47.84 |    57.89     |     +5.16
CEC [3]         | 72.00 66.83 62.97 59.43 56.70 53.73 51.19 49.24 47.63 |    57.75     |     +5.37
C-FSCIL† [1]    | 76.38 70.77 66.17 62.67 59.17 56.20 53.27 51.09 48.93 |    60.52     |     +4.07
Ours (baseline) | 76.88 71.83 67.46 64.39 61.29 58.47 55.79 53.81 52.30 |    62.47     |     +0.70
Ours            | 77.05 71.72 67.51 64.40 61.55 58.76 56.27 54.35 53.00 |    62.73     |

Table 2: Quantitative comparison on the test set of CIFAR100 in the 5-way 5-shot FSCIL setting. Columns 1-9 report the accuracy (%) in each session. "Average Acc." is the average performance over all sessions. "Final Improv." is the improvement of our method in the last session. † The results of [1] are obtained with its released code.

Method          |   1     2     3     4     5     6     7     8     9   | Average Acc. | Final Improv.
iCaRL [27]      | 64.10 53.28 41.69 34.13 27.93 25.06 20.41 15.48 13.73 |    32.87     |    +38.98
EEIL [28]       | 64.10 53.11 43.71 35.15 28.96 24.98 21.01 17.26 15.85 |    33.79     |    +36.86
NCM [46]        | 64.10 53.05 43.96 36.97 31.61 26.73 21.23 16.78 13.54 |    34.22     |    +39.17
TOPIC [32]      | 64.10 55.88 47.07 45.16 40.11 36.38 33.96 31.55 29.37 |    42.62     |    +23.34
SKD [33]        | 64.10 57.00 50.01 46.00 44.00 42.00 39.00 37.00 35.00 |    46.01     |    +17.71
SPPR [2]        | 64.10 65.86 61.36 57.45 53.69 50.75 48.58 45.66 43.25 |    54.52     |     +9.46
F2M [4]         | 71.45 68.10 64.43 60.80 57.76 55.26 53.53 51.57 49.35 |    59.14     |     +3.36
CEC [3]         | 73.07 68.88 65.26 61.19 58.09 55.57 53.22 51.34 49.14 |    59.53     |     +3.57
C-FSCIL† [1]    | 77.22 71.92 67.16 63.01 59.21 56.13 53.44 51.05 48.94 |    60.90     |     +3.77
Ours (baseline) | 77.33 72.49 68.30 64.25 60.89 58.18 55.68 53.65 51.63 |    62.49     |     +1.08
Ours            | 77.82 73.00 69.11 65.17 62.29 59.48 56.96 54.97 52.71 |    63.50     |

4. Experiments

In this section, we conduct quantitative comparisons between our TLCE and state-of-the-art few-shot class-incremental learning methods on two representative datasets. We also perform ablation studies to evaluate the design choices and the different hyperparameters of our method.

4.1. Datasets

We evaluate our proposed method on two benchmark datasets for few-shot class-incremental learning: miniImageNet [7] and CIFAR100 [8]. The miniImageNet [7] dataset contains 100 classes, each with 500 training images and 100 testing images. CIFAR100 [8] is a challenging dataset with 60,000 images of size 32 × 32, divided into 100 classes; each class has 500 training images and 100 testing images. Following the split used in [32], we select 60 base classes and 40 novel classes for both CIFAR100 and miniImageNet. The 40 novel classes are further divided into eight incremental sessions. In each session, we learn in a 5-way 5-shot manner, i.e., training on 5 classes with 5 images per class.

4.2. Implementation Details

For miniImageNet and CIFAR100, we use ResNet-12 following C-FSCIL [1]. We train the TKN with the SGD optimizer, where the learning rate is 0.01, the batch size is 128, and the number of epochs is 120. The RHD network is pretrained as in the C-FSCIL [1] work. Each image in the dataset is represented as a 512-dimensional feature vector. The hyperparameter λ is set to 0.8 for both the miniImageNet and CIFAR100 datasets.

4.3. Comparison and Evaluation

To evaluate the effectiveness of our TLCE, we first conduct quantitative comparisons with several representative and state-of-the-art few-shot class-incremental learning methods. However, it is important to note that an overall improvement does not necessarily imply an improvement in both base and novel performance individually. We therefore conduct a further analysis of model performance from both the base and novel perspectives to delve deeper into the performance improvement. Furthermore, our method offers the advantage of requiring no additional training and consuming minimal storage space.

Quantitative comparisons. Since numerous efforts have been devoted to few-shot class-incremental learning, we mainly compare our TLCE with representative and state-of-the-art works. The compared methods include CIL methods [27, 28, 46] and FSCIL methods [32, 33, 2, 4, 3, 1]. For C-FSCIL [1], we only compare with its basic version and do not consider its variant that requires additional training during incremental sessions. For our method, we report our best results with λ set to 0.8. Tables 1 and 2 show the quantitative comparison results on the two datasets. It can be seen that our best results outperform the other methods. In particular, we consider different transferable knowledge models. For the baseline, we train the model with standard supervised classification.
For TLCE, we integrate the cosine metric with cross-entropy to train the model. It can be seen that the latter significantly enhances the performance of the ensemble classifiers.

From Tables 1 and 2, it can be observed that C-FSCIL performs effectively in the first five incremental sessions, while its advantage becomes slight in the last four incremental sessions. We further analyze the accuracy on the base and novel classes separately. According to the results shown in Figure 2, we observe only a slight decrease in the base performance, which indicates that C-FSCIL can resist knowledge forgetting. However, its novel performance in the subsequent incremental sessions is poor. In contrast, an ideal FSCIL classifier should perform equally well on both novel and base classes. For our method TLCE, although there is a decrease on the base classes, there is a significant improvement in the novel and weighted performance. In the ablation study, we perform further experiments and analysis with different λ values to reveal which balance between the RHD and TKN is more suitable for each dataset.

[Figure 2: The weighted (a), base (b), and novel (c) performance on miniImageNet.]

4.4. Ablation Study

In this section, we perform ablation studies to verify the design choices of our method and the effectiveness of its different modules. First, we conduct experiments with different values of the hyperparameter λ to see how the RHD and TKN affect the final results. Then, we study the effectiveness of different ensemble classifiers.

Effect of the hyperparameter λ. Different λ values correspond to different degrees of RHD and TKN applied to the input data. From the results in Table 3, it can be found that when the TKN does not contribute (λ = 0.0), the result is lower; with the TKN added to the ensemble, the accuracy first rises and then falls as λ increases, peaking at λ = 0.8. This indicates the importance of the TKN.

Table 3: The ablation study on the value of the hyperparameter λ. Accuracy (%) on the test sets of miniImageNet and CIFAR100 in the last session is measured.

λ            |  0.0   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
miniImageNet | 48.93 49.21 49.57 50.14 50.83 51.53 52.29 52.78 53.00 51.17 44.87
CIFAR100     | 48.94 49.02 49.74 50.30 50.80 51.46 52.42 52.61 52.71 50.80 44.96
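A sweep like the one in Table 3 amounts to a simple grid search over λ; in the sketch below, the `evaluate_last_session` callback is a placeholder for a full pass over the last-session test set (for example, using the `ensemble_predict` helper sketched earlier).

```python
def sweep_lambda(evaluate_last_session, lambdas=None):
    """Evaluate last-session accuracy for each candidate ensemble weight (cf. Table 3)."""
    if lambdas is None:
        lambdas = [round(0.1 * i, 1) for i in range(11)]   # 0.0, 0.1, ..., 1.0
    results = {lam: evaluate_last_session(lam) for lam in lambdas}
    best = max(results, key=results.get)
    return results, best
```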
Effect of different ensemble classifiers. We conduct experiments on miniImageNet to verify the effectiveness of the ensemble classifiers. Specifically, we train the TKN as a standard supervised classifier and use it as the baseline. The results in Table 4 show that the ensemble classifiers lead to better performance. Furthermore, we find that integrating the cosine metric with cross-entropy leads to a further enhancement of model performance. Hence, we adopt the latter approach for TKN training in our classifier.

Table 4: The effect of various components of the ensemble classifiers. Accuracy (%) on the test set of miniImageNet is measured; columns 1-9 are the sessions.

Cross Entropy | Cosine | RHD |   1     2     3     4     5     6     7     8     9
      ✓       |        |     | 71.10 64.14 59.14 55.92 53.44 50.86 47.97 46.23 44.85
      ✓       |   ✓    |     | 70.55 63.35 58.63 55.53 53.09 50.39 47.72 46.04 44.87
              |        |  ✓  | 76.38 70.77 66.17 62.67 59.17 56.20 53.27 51.09 48.93
      ✓       |        |  ✓  | 76.88 71.83 67.46 64.39 61.29 58.47 55.79 53.81 52.30
      ✓       |   ✓    |  ✓  | 77.05 71.72 67.51 64.40 61.55 58.76 56.27 54.35 53.00

5. Conclusion

In this paper, we propose a simple yet effective framework, named TLCE, for few-shot class-incremental learning. Without any retraining or expensive computation during incremental sessions, our transfer-learning based ensemble of classifiers efficiently alleviates the issues of catastrophic forgetting and overfitting. Extensive experiments show that our method outperforms state-of-the-art methods. Investigating a more transferable network is worth exploring in the future, and finding a more general way to combine the classifiers is also an interesting direction for future work.

References

[1] M. Hersche, G. Karunaratne, G. Cherubini, L. Benini, A. Sebastian, A. Rahimi, Constrained few-shot class-incremental learning, in: IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 9057-9067.
[2] K. Zhu, Y. Cao, W. Zhai, J. Cheng, Z.-J. Zha, Self-promoted prototype refinement for few-shot class-incremental learning, in: IEEE Conf. Comput. Vis. Pattern Recog., 2021, pp. 6801-6810.
[3] C. Zhang, N. Song, G. Lin, Y. Zheng, P. Pan, Y. Xu, Few-shot incremental learning with continually evolved classifiers, in: IEEE Conf. Comput. Vis. Pattern Recog., 2021, pp. 12455-12464.
[4] G. Shi, J. Chen, W. Zhang, L.-M. Zhan, X.-M. Wu, Overcoming catastrophic forgetting in incremental few-shot learning by finding flat minima, Adv. Neural Inform. Process. Syst. 34 (2021) 6747-6761.
[5] W.-Y. Chen, Y.-C. Liu, Z. Kira, Y.-C. F. Wang, J.-B. Huang, A closer look at few-shot classification, in: Int. Conf. Learn. Represent., 2019.
[6] Y. Tian, Y. Wang, D. Krishnan, J. B. Tenenbaum, P. Isola, Rethinking few-shot image classification: a good embedding is all you need?, in: Eur. Conf. Comput. Vis., Springer, 2020, pp. 266-282.
[7] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., ImageNet large scale visual recognition challenge, Int. J. Comput. Vis. 115 (2015) 211-252.
[8] A. Krizhevsky, G. Hinton, et al., Learning multiple layers of features from tiny images (2009).
[9] T. Hospedales, A. Antoniou, P. Micaelli, A. Storkey, Meta-learning in neural networks: A survey, IEEE Trans. Pattern Anal. Mach. Intell. 44 (2021) 5149-5169.
[10] C. Finn, P. Abbeel, S. Levine, Model-agnostic meta-learning for fast adaptation of deep networks, in: Int. Conf. Mach. Learn., PMLR, 2017, pp. 1126-1135.
[11] M. A. Jamal, G.-J. Qi, Task agnostic meta-learning for few-shot learning, in: IEEE Conf. Comput. Vis. Pattern Recog., 2019.
[12] A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, R. Hadsell, Meta-learning with latent embedding optimization, in: Int. Conf. Learn. Represent., 2019.
[13] G. Koch, R. Zemel, R. Salakhutdinov, et al., Siamese neural networks for one-shot image recognition, in: ICML Deep Learning Workshop, 2015.
[14] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al., Matching networks for one shot learning, Adv. Neural Inform. Process. Syst. 29 (2016).
[15] J. Snell, K. Swersky, R. Zemel, Prototypical networks for few-shot learning, Adv. Neural Inform. Process. Syst. 30 (2017).
[16] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, T. M. Hospedales, Learning to compare: Relation network for few-shot learning, in: IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 1199-1208.
[17] S. Yang, L. Liu, M. Xu, Free lunch for few-shot learning: Distribution calibration, in: Int. Conf. Learn. Represent., 2021.
[18] Y. Guo, R. Du, X. Li, J. Xie, Z. Ma, Y. Dong, Learning calibrated class centers for few-shot classification by pair-wise similarity, IEEE Trans. Image Process. 31 (2022) 4543-4555.
[19] J. Xu, X. Luo, X. Pan, W. Pei, Y. Li, Z. Xu, Alleviating the sample selection bias in few-shot learning by removing projection to the centroid, in: Adv. Neural Inform. Process. Syst., 2022.
[20] S. Wang, R. Ma, T. Wu, Y. Cao, P3DC-Shot: Prior-driven discrete data calibration for nearest-neighbor few-shot classification, Image and Vision Computing (2023) 104736.
[21] A. Iscen, J. Zhang, S. Lazebnik, C. Schmid, Memory-efficient incremental learning through feature adaptation, in: Eur. Conf. Comput. Vis., Springer, 2020, pp. 699-715.
[22] F. Zhu, X.-Y. Zhang, C. Wang, F. Yin, C.-L. Liu, Prototype augmentation and self-supervision for incremental learning, in: IEEE Conf. Comput. Vis. Pattern Recog., 2021, pp. 5871-5880.
[23] G. Petit, A. Popescu, H. Schindler, D. Picard, B. Delezoide, FeTrIL: Feature translation for exemplar-free class-incremental learning, in: IEEE Winter Conf. Appl. Comput. Vis., 2023, pp. 3911-3920.
[24] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al., Overcoming catastrophic forgetting in neural networks, Proc. Natl. Acad. Sci. 114 (2017) 3521-3526.
[25] A. Chaudhry, P. K. Dokania, T. Ajanthan, P. H. Torr, Riemannian walk for incremental learning: Understanding forgetting and intransigence, in: Eur. Conf. Comput. Vis., 2018, pp. 532-547.
[26] J. Lee, H. G. Hong, D. Joo, J. Kim, Continual learning with extended Kronecker-factored approximate curvature, in: IEEE Conf. Comput. Vis. Pattern Recog., 2020, pp. 9001-9010.
[27] S.-A. Rebuffi, A. Kolesnikov, G. Sperl, C. H. Lampert, iCaRL: Incremental classifier and representation learning, in: IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 2001-2010.
[28] F. M. Castro, M. J. Marin-Jimenez, N. Guil, C. Schmid, K. Alahari, End-to-end incremental learning, in: Eur. Conf. Comput. Vis., 2018.
[29] Q. Gao, C. Zhao, B. Ghanem, J. Zhang, R-DFCIL: Relation-guided representation learning for data-free class incremental learning, in: Eur. Conf. Comput. Vis., Springer, 2022, pp. 423-439.
[30] L. Yu, B. Twardowski, X. Liu, L. Herranz, K. Wang, Y. Cheng, S. Jui, J. v. d. Weijer, Semantic drift compensation for class-incremental learning, in: IEEE Conf. Comput. Vis. Pattern Recog., 2020, pp. 6982-6991.
[31] Y. Liu, B. Schiele, Q. Sun, Adaptive aggregation networks for class-incremental learning, in: Eur. Conf. Comput. Vis., 2021, pp. 2544-2553.
[32] X. Tao, X. Hong, X. Chang, S. Dong, X. Wei, Y. Gong, Few-shot class-incremental learning, in: IEEE Conf. Comput. Vis. Pattern Recog., 2020, pp. 12180-12189. doi:10.1109/CVPR42600.2020.01220.
[33] A. Cheraghian, S. Rahman, P. Fang, S. K. Roy, L. Petersson, M. Harandi, Semantic-aware knowledge distillation for few-shot class-incremental learning, in: IEEE Conf. Comput. Vis. Pattern Recog., 2021, pp. 2534-2543.
[34] S. Dong, X. Hong, X. Tao, X. Chang, X. Wei, Y. Gong, Few-shot class-incremental learning via relation knowledge distillation, in: AAAI, volume 35, 2021, pp. 1255-1263.
[35] D.-W. Zhou, F.-Y. Wang, H.-J. Ye, L. Ma, S. Pu, D.-C. Zhan, Forward compatible few-shot class-incremental learning, in: IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 9046-9056.
[36] C. Peng, K. Zhao, T. Wang, M. Li, B. C. Lovell, Few-shot class-incremental learning from an open-set perspective, in: Eur. Conf. Comput. Vis., Springer, 2022, pp. 382-397.
[37] Z. Song, Y. Zhao, Y. Shi, P. Peng, L. Yuan, Y. Tian, Learning with fantasy: Semantic-aware virtual contrastive constraint for few-shot class-incremental learning, IEEE Conf. Comput. Vis. Pattern Recog. (2023).
[38] D.-Y. Kim, D.-J. Han, J. Seo, J. Moon, Warping the space: Weight space rotation for class-incremental few-shot learning, in: Int. Conf. Learn. Represent., 2023.
[39] P. Mazumder, P. Singh, P. Rai, Few-shot lifelong learning, in: AAAI, volume 35, 2021, pp. 2337-2345.
[40] Z. Chi, L. Gu, H. Liu, Y. Wang, Y. Yu, J. Tang, MetaFSCIL: A meta-learning approach for few-shot class incremental learning, in: IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 14166-14175.
[41] Z. Ji, Z. Hou, X. Liu, Y. Pang, X. Li, Memorizing complementation network for few-shot class-incremental learning, IEEE Trans. Image Process. (2023).
[42] X. Xu, Z. Wang, Z. Fu, W. Guo, Z. Chi, D. Li, Flexible few-shot class-incremental learning with prototype container, Neural Comput. Appl. 35 (2023) 10875-10889.
[43] G. Karunaratne, M. Schmuck, M. Le Gallo, G. Cherubini, L. Benini, A. Sebastian, A. Rahimi, Robust high-dimensional memory-augmented neural networks, Nat. Commun. 12 (2021) 2468.
[44] T. Lesort, T. George, I. Rish, Continual learning in deep networks: an analysis of the last layer, arXiv preprint arXiv:2106.01834 (2021).
[45] Y. Wang, W.-L. Chao, K. Q. Weinberger, L. van der Maaten, SimpleShot: Revisiting nearest-neighbor classification for few-shot learning, arXiv preprint arXiv:1911.04623 (2019).
[46] S. Hou, X. Pan, C. C. Loy, Z. Wang, D. Lin, Learning a unified classifier incrementally via rebalancing, in: IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 831-839.