Noname manuscript No.
(will be inserted by the editor)

Factor-Assisted Federated Learning for Personalized
Optimization with Heterogeneous Data

arXiv:2312.04281v1 [stat.ML] 7 Dec 2023

Feifei Wang · Huiyun Tang · Yang Li

Received: date / Accepted: date

Abstract Federated learning is an emerging distributed machine learning framework aiming at protecting data privacy. Data heterogeneity is one of the core challenges in federated
learning, which could severely degrade the convergence rate and prediction performance
of deep neural networks. To address this issue, we develop a novel personalized federated
learning framework for heterogeneous data, which we refer to as FedSplit. This modeling
framework is motivated by the finding that, data in different clients contain both common
knowledge and personalized knowledge. Then the hidden elements in each neural layer can
be split into the shared and personalized groups. With this decomposition, a novel objective
function is established and optimized. We demonstrate FedSplit enjoyers a faster convergence speed than the standard federated learning method both theoretically and empirically.
The generalization bound of the FedSplit method is also studied. To practically implement
the proposed method on real datasets, factor analysis is introduced to facilitate the decoupling of hidden elements. This leads to a practically implemented model for FedSplit and
we further refer to as FedFac. We demonstrated by simulation studies that, using factor
analysis can well recover the underlying shared/personalized decomposition. The superior
prediction performance of FedFac is further verified empirically by comparison with various
state-of-the-art federated learning methods on several real datasets.
Keywords Deep Learning · Factor Analysis · Federated Learning · Heterogeneity ·
Personalization
Feifei Wang
Center for Applied Statistics, Renmin University of China, Beijing, China
School of Statistics, Renmin University of China, Beijing, China
Huiyun Tang
School of Statistics, Renmin University of China, Beijing, China
Yang Li
Center for Applied Statistics, Renmin University of China, Beijing, China
School of Statistics, Renmin University of China, Beijing, China
Correspondence to: Yang Li E-mail: yang.li@ruc.edu.cn

2

Feifei Wang et al.

1 Introduction
Federated learning (FL) is a novel distributed machine learning approach [1]. The standard
FL algorithm involves a set of clients with locally stored datasets. All clients collaborate
to train a global model under the supervision of a central server. In this work, we focus on
horizontal federated learning [2], which means datasets on different clients share the same
feature space but different in samples. In horizontal federated learning (we omit “horizontal”
for simplicity hereafter), one of the core challenges is the heterogeneity issue because data
in different clients are generated separately and never mixed up [3–5]. Data heterogeneity violates the frequently used independent and identically distributed (IID) assumption
in distributed optimization, which is important for the stochastic gradient descent (SGD)
type algorithms [1]. When modeling heterogeneous data in federated learning, the SGD algorithms could generate biased estimates and hurt the training convergence and prediction
performance [3, 6–8]. In addition, the estimation and prediction performance could also vary
greatly across clients with heterogeneous data [9, 10] and thus hurt the generalization ability
of FL models.
There exist various studies to address the heterogeneity challenge, which can be roughly
divided into two streams. The first stream of studies try to extend the standard FedAvg
method [1] to allow for more stable performance and more robust convergence in heterogeneous settings [7, 9–11]. However, these works still produce a single global model and
can only deal with certain degree of heterogeneity [12]. Another stream of studies focus on
personalizing the global model to be more customized to individual clients [3, 4, 12, 13].
As personalized models are preferred in many practical applications such as healthcare, finance, and AI services, this stream of studies have gained increasing popularity in recent
years [14–17]. More thorough discussion for related works can be found in Section 2.
The past literature mainly focuses on manipulating models. However in this work, we try
to utilize the characteristics of heterogeneous data itself to solve the heterogeneous problem.
Our proposed method is based on one common phenomenon. That is, data in different clients
contain both common knowledge and personalized knowledge. Accordingly, to model the
heterogeneous data in different clients, the hidden elements in each neural layer of the deep
neural networks (DNNs) might also contain the shared and personalized groups. To illustrate
this idea, we first design a heterogeneous setting using MNIST data. Specifically, we divide
the dataset into ten categories by the digit labels. Each digit category is further divided into
ten groups. The resulting 100 sub-datasets constitute 100 clients with heterogeneity (i.e.,
the non-IID case). We also consider the IID case for comparison, where the MNIST data
are randomly split into 100 sub-datasets. In each case, we separately train a multi-layer
perception model with one hidden layer of 200 units for digit classification for each client.
Figure 1 shows the heatmaps of outputs from each neuron in each client. As shown, nearly
all neurons perform similarly for all clients in the IID case. However, there exist neurons
learning quite varied representations across clients in the non-IID case. Specifically, the top
neurons in the right panel of Figure 1 have large output values for most clients, which reflect
common representations. The outputs of other neurons only behave large for some specific
clients, which reflect client-specific representations. To further quantify this phenomenon,
we compute the entropy of outputs generated by each neuron across all clients using the
k nearest-neighbor method [18]. The bigger the entropy, the greater the dispersion degree.
Figure 2 compares the distributions of entropy in the IID and non-IID cases. As shown,
the IID case has most entropy values near or below zero, meaning the representations of
neurons across all clients in the IID case are quite concentrated. In the non-IID case, the
entropy values have two peaks, one near zero and the other near two. This finding indicates

Factor-Assisted Federated Learning for Personalized Optimization with Heterogeneous Data

3

that some neurons extract similar representations and behave concentrated, while some other
neurons generate heterogeneous representations and behave dispersive.

Fig. 1: The heatmaps of outputs generated by each neuron in each client in the IID and non-IID cases.

Fig. 2: The histograms of entropy of outputs generated by each neuron across all clients in the IID and nonIID cases.

The above results motivate us to decompose neurons into client-shared ones and clientspecific ones according to their representation ability. Assume under some certain splitting
method, we can split the hidden elements of each layer (i.e., channels for convolution layers
or neurons for fully connected layers) into the client-shared group and the client-specific
group. See Figure 3 for an illustration, where the light blue circles represent the shared
hidden elements and circles in other colors represent the personalized hidden elements of
each client. Then parameters corresponding to the client-shared group are updated by all
clients, which are in accordance with those in the standard FedAvg method. However, the
parameters corresponding to the client-specific group are not synchronized with the shared
ones in the updating procedure. They are only updated locally and not averaged by the
server. By this way, the resulting models are more customized for individual clients. We
refer to this method as FedSplit. We provide theoretical analysis that guarantees the linear

4

Feifei Wang et al.

convergence and generalization property of the FedSplit method, which is trained via the
parallel gradient descend algorithm.

Fig. 3: The illustration of neurons decomposition in DNNs under the FedSplit framework. The light blue
circles represent the shared elements, which are updated by all clients, while circles in other colors represent
the client-specific elements, which are only updated locally.

In complicated learning scenarios with image or text, it is not easy to perform such
decomposition directly on the neurons. To tackle this problem, we adopt the statistical technique of factor analysis [19] to accomplish this decomposition, which we refer to as FedFac
in the subsequent analysis. Specifically, we regard the hidden layers of DNN as higher-level
representations of the raw data. Then, factor analysis is applied on the hidden elements of
each layer to split them into client-shared ones and client-specific ones. Factor analysis is a
classic statistical method, which can discover the inter-dependence structure among several
variables. In this work, we apply factor analysis to find the inter-dependence structure among
hidden elements within each layer, based on which the hidden elements would be split into
the client-shared group and client-specific group. After decomposition, the client-shared elements would be updated using all client information, while the client-specific ones would
only be updated for specific clients. Two specific solutions (static or dynamic) are provided
to implement the decomposition of client-shared elements and client-specific elements using
factor analysis. By various simulation experiments, we find both solutions can well recover
the true underlying decomposition. Finally, we empirically demonstrate the benefits of FedFac in several real-world datasets when compared with various state-of-the-art FL models.
This paper is organized as follows. Section 2 reviews the related literature. Section 3 introduces the FedSplit methodology with decomposition of client-shared and client-specific
elements. The theoretical properties of convergence and generalization ability are also provided. Section 4 discusses the implementation of FedSplit via factor analysis (i.e., the FedFac method). Its performance is verified through simulation studies. Section 5 presents extensive experiments comparing performance of FedFac against some state-of-the-art federated learning approaches. Section 6 concludes the paper with a brief discussion.

Factor-Assisted Federated Learning for Personalized Optimization with Heterogeneous Data

5

2 Related Work
2.1 Heterogeneity adjusted global models
There are two mainstreams under this strategy. One is to handle the non-IID problem by
modifying the standard FedAvg model [1]. Some important modifications include: adopting
the strategy of learning rate decay [6], choosing server-side adaptive optimizers, normalizing the local model updates when making averages [20], and correcting the local gradient
updates towards the global true optimum [7, 21]. Another line of works develop new algorithms to train the global model with heterogeneity. [10] regarded the distribution of the
whole data in the federated system as a convex combination of multi-source distributions,
and then solved the federated learning problem with min-max optimization. The generalization ability of the obtained model is lower bounded under other unfavorable distributions.
[3] introduced a proximal term to the local objective function, so as to penalize the local
updates which are far away from the global model. [7] and [21] introduced control variables
to correct the updating directions of local gradients on the client side, thereby reducing the
variance between local gradients and enabling the solutions of local models to converge
towards the optimal solution of the global model.
These heterogeneity adjusted global models alleviate the challenges brought by data heterogeneity. However there still exist some shortcomings. On one hand, these methods are
only verified to be effective under some certain “bounded-non-IID” conditions. On the other
hand, they inherit the limitations of the resulted global model. First, the global model is unable to capture individual differentiated information of each client, harming the inference or
prediction performance when deployed to clients. Second, it requires all participating clients
to reach a consensus for collaborative training, which is unrealistic in practical complex IoT
(Internet of Things) applications. Moreover, the global model could perform quite differently across clients, which results in unfairness in terms of the prediction performance [4].
Therefore some researchers have questioned global models. For example, [22] judged the
practicality of a global model that is beyond the typical use of the clients. [23] suggested
that for many tasks, some clients may not be able to benefit from participating in federated
learning because the global shared model is not as accurate as the local models trained by
themselves.

2.2 Personalization models
Given that a single global model cannot flexibly adapt to the diverse needs of various clients,
researchers have attempted to personalize the learning process considering heterogeneity in
devices, data, and models. This leads to personalized federated learning, which we call PFL
for short. PFL aims to collaboratively learn personalized models for each client, thus inherently fitting the heterogeneity situation. PFL can be summarized into two main categories:
one is to personalize the global model directly, and the other is to develop specific models for all clients through analyzing and mining the interrelationships between data or tasks
across clients. The classic methods in the first category include: fine-tuning [15, 23, 24] and
meta-learning [25]. The main idea is to first train a single global model in the federated
learning framework, and then conduct additional training of the global model on each client
to obtain a locally adaptive model. Note that the performance of the personalized model
depends on the generalization ability of the global model. Then many methods under this
scheme aim to improve the performance of the global model under data heterogeneity. The

6

Feifei Wang et al.

second category includes: federated multi-task learning [26–28], interpolation of local and
global models [29–31], clustered federated learning [30, 32, 33], and model regularization
[22, 34]. These methods strive to utilize relevant information between clients to construct
similar personalized models for related clients and thus improve the performance of customized models. See the comprehensive surveys in [4] and [12] for more details.

2.3 Split-personalization models
Among the personalization models, split-personalization [12] is especially designed for
deep neural networks. It is an extension of split learning in the context of personalized federated learning. It is first introduced by [35], in which a neural network is divided into the
base layers and personalized layers. Then the base layers are trained by the central server.
[36] proposed to update the base layers and personalized layers for different number of
epochs. They also theoretically proved that, the shared data representations learned by the
base layers could converge to the ground-truth representations. [37] concentrated on training the base layers with the head randomly initialized and never updated, then the head was
fine-tuned for personalization. [38] suggested local representation learning finished independently by individual clients and then the head layers trained globally. The strategy to
exclude some DNN layers from averaging is also employed by [39] to mitigate distribution
shift in the feature space. The existing split-personalization methods are mainly layer-wise.
That is, they allow different layers to have different behaviours across clients, but the hidden parameters within the same layer should be aligned across clients. Compared with the
previous split-personalization methods, the idea of splitting neuron-wise seems more flexible, which is inline with the emerging paradigm of split channel-wise. In this regard, [40]
strived to personalize the FL model through dividing the DNN into two sub-models by splitting each layer from bottom to top, and then constrained the two sub-models to align with
each other. However, they lacked of discussion about how to split channels in the FL model.
[41] extracted smaller sub-models from the global model for partial training so as to fit for
the heterogeneous capabilities of clients. As for decomposition, they utilized a rolling window method to select the client-specific nodes. The rolling window advances in each round
and loops over all node indices in sequence progressively. By this way, they in fact did not
consider the varied representation capability of nodes.
In this work, we develop a more refined decoupling strategy than the existing splitpersonalization methods. Although we also target on decoupling for the DNN structure,
our proposed FedSplit method is distinct from the aforementioned methods. Specifically,
we focus on developing a DNN framework that is capable of capturing the client-specific
representations and common representations, so that both the representations could be incorporated into the final shared prediction model to customize the global model. Furthermore,
we provide theoretical analysis that guarantees the linear convergence and generalization
property of the FedSplit method trained with parallel gradient descend algorithm. However,
the previous split-personalization methods usually lack theoretical guarantee. To our best
knowledge, we make the first step to bridge the theoretical gap between the empirical success of personalizing FL models and splitting DNN layers vertically. Last, we propose to
use factor analysis for decomposition of the two types of hidden elements that generate discrepant representations. The good performance of using factor analysis has been verified
through both simulation studies and empirical experiments.

Factor-Assisted Federated Learning for Personalized Optimization with Heterogeneous Data

7

3 The FedSplit Methodology
3.1 Problem formulation
Consider a FL system with C clients and N samples. For each client c ∈ [C ], we can observe
nc observations generated according to a local distribution DcP
over X × Y, where X ∈ Rd
is the input domain, Y ∈ R is the output domain, and N = c∈[C ] nc . Denote Dc as the
set of the nc observations. For any model h ∈ H with h : X 7→ Y, the loss function is
defined as ℓ : Y × Y → R. Usually, h is a deep neural network. The goal of the standard FL
method FedAvg [1] is to fit a shared global model, which has good averaged performance on
all clients. However, in a heterogeneous setting where the data distributions Dc vary across
the clients, such a global model could not generalize well on individual clients. To address
this issue, it is natural to fit separate model hc ∈ H for data distribution Dc on client c.
Denote LDc (hc ) = E(x,y)∼Dc [ℓ(hc (x), y )] as the true risk of hc at distribution Dc . Then
the objective function for heterogeneous federated learning is:
∀1 ≤ c ≤ C,

min LDc (hc ) .

hc ∈H

(1)

Although the C clients have different local distributions, there may exist common knowledge as well as personalized knowledge among the heterogeneous data stored in different
clients; recall the motivating example discussed in Section 1. Motivated by this idea, we aim
to distinguish between the two types of knowledge. Then we use the common knowledge
in cooperative training of the shared global model, while use the personalized knowledge
to improve the learning tasks on specific clients. Note that data are often unstructured in
complicated learning scenarios. It is difficult to operate decomposition on data directly. To
solve this problem, we regard the parameters associated with the hidden elements in deep
neural networks as a substitute, and propose the FedSplit method.
Concretely, assume the model of research interest is a L-layer deep neural network with
layers numbered from 0 (input) to L (output). Each layer contains ml , l ∈ [L] hidden elements with a Lipschitz nonlinearity function σ : R → R. Denote W to be the parameter set
in this neural network with p dimension. According to the work of [42], training the neural network is equivalent to linear regression using non-linear neural tangent feature space
(NTFs) as the layer width approaches infinity. For each data point x ∈ X, the corresponding feature is given by ϕ(x) = ∇W h(x; W) ∈ Rp . Notebly ϕ(x) could vary across clients
due to the heterogeneity of local distributions. Then the feature space should contain both
client-shared and client-specific information. To separate this two types of information, we
focus on the parameter set W, which can be regarded as the affine transformation of feature ϕ(x). However, the parameter set W is often very huge, it is difficult to decompose
W directly. Note that for DNN models, parameters of one layer are incorporated in the
next layer, thus a compromise is to decompose the parameter set layer-wise. Specifically, let
Λ ⊂ {1, · · · , L − 1} denote the set of layer indices needing decomposition. For the l-th layer
with l ∈ Λ, denote the parameter set for client c in this layer by wcl , which can be decomposed into two parts: (1) the parameter set wls shared by all clients, and (2) the personalized
p
p
parameter set wcl
for client c. Then we have wcl = {wls , wcl
} for c ∈ [C ] and l ∈ Λ.
For the j th layer with j ∈
/ Λ, decomposition is not needed and the parameters are shared by
all clients. Then we have wcj = wjs for c ∈ [C ]. Define Wc as the set of all parameters
for client c, which includes both shared parameters and personalized parameters. Then the

8

Feifei Wang et al.

objective function of FedSplit is
min

W1 ,...,WC

C
X

qc LDc (hc (Wc , xc )),

(2)

c=1

P
where qc ≥ 0 and c qc = 1 denote the weights for each client when aggregating the local
training updates. Based on (2), the empirical objective function is
min N −1

W1 ,...,WC

nc
C X
X

ℓ(hc (Wc , xcj ), ycj ).

(3)

c=1 j =1

3.2 Model estimation and prediction
In this section, we discuss how to estimate the objective function (3). We consider a twolayer fully connected ReLU activated neural network for illustration. Note that the optimization and prediction strategies provided here are also applicable to other DNNs. Assume the
true decomposition of parameters is already known. Suppose each of C clients possesses
n training examples. Assume the FedSplit method is run for T communication rounds. In
each round, all clients are available with each of them training their local models for G steps.
Denote x ∈ Rd as the input. Since we consider a two-layer neural network as an example,
let Wc denote the weight matrix of the first layer with wr ∈ Rd , r ∈ [m] representing
the weight vector of the rth hidden node, and m being the total number of hidden nodes.
Let a = (a1 , . . . , am )⊤ be the output weight vector. Define σ (·) to be the ReLU activation
function, i.e., σ (z ) = z if z ≥ 0 and σ (z ) = 0 otherwise. Recall that we decompose the
parameters of the first layer Wc into the shared ones Ws and client-specified ones Wcp ,
i.e., Wc = (Wcp , Ws ). To be more specific, assume we divide the hidden nodes of the first
p
layer into m1 personalized ones and m2 shared ones, and denote wc,q
as the q -th (q ∈ [m1 ])
s
personalized node and wr as the r-th (r ∈ [m2 ]) shared node. Then the neural network to
be estimated is:
h(Wc , a, x) = √

1
m1

m1
X
q =1

aq

C
X
c=1

p⊤
σ (wc,q
x)I{x ∈ Dc } + √

1
m2

m2
X

ar σ (wrs⊤ x).

(4)

r =1

We can also understand model (4) from representation learning. Denote the clientspecific representation by ψcp : Rd → Rm1 , the common representation by ψ s : Rd → Rm2 ,
and the shared classifier by f : Rm → Y. Then we have ψcp (x) = σ (Wcp x) , ψ s (x) =
σ (Ws x), and f (v) = a⊤ v, ∀v ∈ Rm corresponding to (4). The personalized model for
the c-th client is the composition of the client’s shared head parameters and the common
and unique representation: hc (x) = (f ◦ (ψ s + ψcp )) (x) = a⊤ (σ (Wcp x) + σ (Ws x)),
where the symbol “+” means concatenation. Obviously, model (4) focuses on incorporating
the client-specific representation with the client-shared representation to customize the local
model for heterogeneous clients. This makes it remarkably distinct from a variety of personalized FL methods tailored for client-specific head learning or common representation
learning [24, 36–38].
In general, model (4) can be optimized in a similar way with FedAvg but with different
operations for client-shared parameters and client-specific parameters. Specifically, in each
global communication round, the shared local updates of clients are aggregated by taking a

Factor-Assisted Federated Learning for Personalized Optimization with Heterogeneous Data

9

simple average, while the personalized local updates are excluded from the averaging step.
Below, we present the estimation procedure of (4) in details.
Consider the quadratic loss function. Denote Sc ⊂ [N ] to be the index set of samples on
client c satisfying Si ∩ Sj = ∅ for i ̸= j . The local empirical objective function lc (Wc ) can
be derived as follows:
lc (Wc ) =

1 X
(h(Wc , a, xi ) − yi )2 .
2n
i∈Sc

The global empirical objective function then becomes:
C

L(W1 , · · · , WC ) =

1 X
C

lc (Wc ).

(5)

c=1

To minimize (5), we use the parallel gradient descent (GD) method. The local updating and
global updating formulas are derived as follows.
p
s
Local update. Let Wc,g
/Wc,g
represent the personalized/shared parameters of the cp
s
th client in the g -th local update step. Then in each of G local steps, Wc,g
and Wc,g
are
updated by gradient descent, where
p
p
Wc,g
+1 = Wc,g − ηl

∂lc (Wc )
∂lc (Wc )
s
s
, Wc,g
,
+1 = Wc,g − ηl
p
s
∂Wc,g
∂Wc,g

p
s
where ηl is the local learning rate. Let wc,g,q
/wc,g,r
represent the parameters of the q -th/rth personalized/shared neuron on the c-th client in the g -th local update step. The gradients
are computed as follows,

n
o
X
∂lc (Wc )
1
p⊤
= √
xi ≥ 0 ,
(h(Wc , a, xi ) − yi ) aq xi I wc,g,q
p
n m1
∂wc,g,q
i∈Sc

n
o
X
∂lc (Wc )
1
s⊤
xi ≥ 0 .
= √
(h(Wc , a, xi ) − yi ) ar xi I wc,g,r
s
∂wc,g,r
n m2
i∈Sc

After G local steps, the shared local updates are sent to the server for aggregation, and
the personalized parameters are locally updated as:
Wcp (t + 1) = Wcp (t) + ∆Wcp (t).

Here Wcp (t) denotes the personalized parameters of the c-th client in the t-th global upp
date round, and wc,q
(t) is the corresponding parameters of the q -th personalized neuron.
p
∆Wc (t) is the accumulated updates of the personalized parameters on the c-th client in the
p
t-th round, with ∆wc,q
(t) is the corresponding accumulated updates of the q -th personalized
neuron which takes the following form:

X X
η
(g )
p
p
∆wc,q
(t) := aq √ l
yi − hc (t)i xi I{wc,g,q
(t)⊤ xi ≥ 0}.
m1

(g )

g∈[G] i∈Sc

Here hc (t) ∈ RM is the local model’s outputs on the c-th client in the g -th local update
(g )
(g )
step of the t-th communication round, and hc (t)i is the i-th entry of hc (t).

10

Feifei Wang et al.

Global update. Denote ∆Wcs (t) as the accumulated updates of the shared parameters
on the c-th client in the t-th round. The shared local updates of all clients are aggregated as
follows:
X ∆Wcs (t)
∆Ws (t) =
.
c∈[C ]

C

Let ηg denote the global learning rate, Ws (t) denote the shared parameters in the t-th global
update round, and wrs (t) be the corresponding parameters of the r-th shared neuron. Then
the shared parameters are globally updated as:
Ws (t + 1) = Ws (t) + ηg ∆Ws (t),

where for ∀c ∈ [C ], r ∈ [m2 ], we have
∆wrs (t) :=



ar X X ηl X
(g )
s
(t)⊤ xi ≥ 0}.
yi − hc (t)i xi I{wc,g,r
√
C
m2
c∈[C ] g∈[G]

i∈Sc

Last, we discuss the prediction issue in federated learning. Practically, new clients often
occur in the prediction stage. To accomplish the prediction task for new clients, we provide
solutions for two typical scenarios. The first scenario is that the new clients can join the
federated learning procedure. For these new clients, the personalized parameters Wcp could
be obtained by themselves through local training, and we just need to transfer the shared
parameters Ws to the new clients to estimate the personalized models. This solution is also
consistent with the practice in [39]. We thus refer to this prediction strategy as LocalTrain.
The second scenario is that the new clients cannot complete local training for some reasons,
such as a lack of samples and insufficient computing power. In this case, assume the new
clients have access to the personalized models of the existing clients [28, 43]. Then the new
clients can generate their predictions using the existing personalized models, and the outputs
are ensemble together to gain the final prediction. We thus refer to this prediction strategy
as Ensemble.

3.3 Theoretical properties
In this section, we study the convergence and generalization property of model (4) under
a two-layer fully connected ReLU activated neural network model. To this end, we adopt
the neural tangent kernel (NTK) technique [42], which is commonly used for convergence
analysis of FL models [39, 44, 45]. NTK is defined as the inner product space of the gradient
of paired data points (also known as the Gram matrix). It describes the evolution of DNN
during gradient descent training. Under the framework of NTK, the training of an infinitewidth over-parameterized neural networks can be understood as a simple kernel gradient
descent algorithm, and the kernel is fixed and only depends on the network structure and activation function. Therefore, the limit probability distribution to which the gradient descent
converges can be seen as a stochastic process. Here we follow [44] to study the theoretical
properties of our proposed method. We first give the following assumption.
Assumption 1 (Non-parallel data) For i ∈ {1, . . . , N }, the data points (xi , yi ) satisfy
∥xi ∥2 ≤ 1 and |yi | ≤ M for some constant C0 . For any i ̸= j , we have xi ∦ xj .

Factor-Assisted Federated Learning for Personalized Optimization with Heterogeneous Data

11

The above assumption is essential to ensure the positive definiteness of the NTK/Gram matrix. This assumption is easy to satisfy because two inputs are hardly parallel in the context
of horizontal federated learning. To facilitate the analysis of learning dynamics, we define
the shared and personalized neural tangent kernels, which focus on cross-client and innerclient information respectively.
Definition 1 (Neural tangent kernel) The shared neural tangent kernel Hs∞ ∈ RN ×N
and personalized tangent kernel Hp∞ ∈ RN ×N are given by
h

n

h

n

oi

⊤
⊤
⊤
Hs∞
i,j := Ew∼N (0,I) xi xj I w xi ≥ 0, w xj ≥ 0

,

o
i
⊤
⊤
⊤
Hp∞
i,j := Ew∼N (0,I) xi xj I w xi ≥ 0, w xj ≥ 0 I{i ∈ Sc , j ∈ Sc , ∀c ∈ [C ]} .
Denote λs and λp as the smallest eigenvalues of Hs∞ and Hp∞ , respectively. According to
[46], we have λs > 0. Since Hp∞ is the diagonal matrix composed of the C block matrices
with size n × n on the diagonal of Hs∞ , it is easily to infer that λp ≥ λs .
We then study the convergence property of model (4) trained using the multi-step parallel GD. In the traditional NTK theory, GD achieves zero training loss on shallow neural
networks for regression tasks, and the convergence rate of infinite-width over-parameterized
neural networks depends on the smallest eigenvalue of the induced kernel in the training
evolution [42, 46–48]. However, the analysis becomes much more complicated in our personalized federated learning. For one thing, we involve two types of parameters with different behaviors; for another, the movement of parameters are no longer directly determined
by the local gradients. Following the common practice in NTK, we start from investigating
the prediction trajectory of the neural network h(Wc , a, x). Denote h(t) ∈ Rn to be the
aggregated prediction vector of the personalized models after the t-th global round. Then
the prediction errors are decomposed as
∥y − h(t + 1)∥22 = ∥y − h(t) − (h(t + 1) − h(t)) ∥22

= ∥y − h(t)∥22 − 2 (y − h(t))⊤ (h(t + 1) − h(t)) + ∥h(t + 1) − h(t)∥22 .
As show, the prediction errors in the (t + 1)-th round are composed of those in the previous
round (i.e., y − h(t)) and the difference between the predictions in adjacent rounds (i.e.,
h(t + 1) − h(t)). To better understand the learning dynamics, we separate the personalized/shared neurons into two sets, respectively. One set contains neurons with the activation
pattern changing over time and the other set contains neurons with activation pattern remaining the same. Then we could analyze (y − h(t))⊤ (h(t + 1) − h(t)) through investigating
the personalized and shared Gram matrices, whose definitions are given as follows.
Definition 2 (Gram matrix) The shared Gram matrix H(t, g, c)s ∈ RN ×N and personalized Gram matrix H(t, g, c)p ∈ RN ×N of the c-th client in the g -th local update step of the
t-th communication round are given by
H(t, g, c)si,j =
H(t, g, c)pi,j =

m2
1 X

m2

1
m1

r =1
m1
X
q =1

s
⊤
s
⊤
x⊤
i xj I{wc,g,r (t) xj ≥ 0}I{wr (t) xi ≥ 0},

p
⊤
p
⊤
x⊤
i xj I{wc,g,q (t) xj ≥ 0}I{wc,q (t) xi ≥ 0}I {i ∈ Sc } .

12

Feifei Wang et al.

Combining the Sc columns of H(t, g, c)s for all c ∈ [C ], we get the shared Gram matrix
H(t, g )s ∈ RN ×N for the g -th local update step of the t-th communication round. Similarly,
H(t, g )p ∈ RN ×N is the personalized Gram matrix for the g -th local update step of the t-th
communication round. Under appropriate assumptions as listed in Theorem 1, the evolving
H(t, g )p /H(t, g )s are close to their original Gram matrices H(t)p /H(t)s , the latter of which
are close to Hp∞ /Hs∞ . On the other hand, by leveraging concentration properties at initialization, the difference between the prediction errors in local steps and those in the global
step can be bounded. With these techniques, the term (y − h(t))⊤ (h(t + 1) − h(t)) can be
bounded by ∥y − h(t)∥22 . According to the classic NTK theory, the norm of the gradients
can be controlled when the neural network is sufficiently wide. Thus the movement of the
two kinds of parameters could be upper bounded. As a result, the term ∥h(t + 1) − h(t)∥22
is also controlled by the prediction errors. Consequently, the model (4) could benefit from a
linear learning rate, and the results are presented in the following theorem.
Theorem 1 (Convergence Rate) Suppose Assumption 1 holds. Let λs = λmin (Hs∞ ),
γ s = λmax (Hs∞ ) /λmin (Hs∞ ), λp = λmin (Hp∞ ), γ p = λmax (Hp∞ ) /λmin (Hp∞ ),
ρ1 = ηg ηl G/C , ρ2 = ηl G. Fix m1 = Ω ((λp )−4 C −2 )N 4 log(N/δ ) and m2 = Ω (λs )−4

p
N 4 log(N/δ ) . Furthermore, we i.i.d. initialize wc,q
and wrs ∼ MN(0, I), and sample
aq , ar from {−1, 1} uniformly at random for c ∈ [C ], q ∈ [m1 ], r ∈ [m2 ], then with probability at least 1 − δ , training network
 (4)s using
the GD algorithm with global step size
λp
ηg = O(1) and local step size ηl = O γ sλγ p+GN
converges linearly as
2
∥h(t) − y∥22 ≤



1
1 − (ρ1 λs + ρ2 λp )
2

t

∥h(0) − y∥22 .

The detailed proof of Theorem 1 can be found in Supplementary Materials B. This
theorem shows that the convergence guarantees of model (4) are governed by the spectral
properties of both the shared NTK and personalized NTK. According to [44], the convergence rate of the two-layer
with

 fully connected ReLU activated neural network trained
FedAvg is O 1 − ρ′ λs /2 , where ρ′ = ηg ηl′ G/C and ηl′ = O λs /(γ s GN 2 ) . We have
ρ′ ≈ ρ1 when the total sample size N is large. Notably, the convergence rate of FedSplit is
O (1 − ρ1 λs /2 − ρ2 λp /2), where the additional term ρ2 λp /2 is introduced due to layer partition. Therefore, FedSplit enjoys faster convergence rate than FedAvg. It implies that, in our
FedSplit method, the update in the customized model is not only determined by the averaged
gradient directions, but also by the direct gradient direction from heterogeneous clients, so
that the parameters are optimized towards the most favorable path for a specific client. In addition, our proposed FedSplit model enjoys linear convergence rate under the neural network
setting, without assumptions on the convexity of objective functions or distribution of data.
On the contrary, existing methods have to rely on the assumptions of convexity and smoothness for the objective functions to develop their convergence guarantees. These assumptions
are not realistic for non-linear neural networks. For example,
[34] demonstrated
that, their


method pFedMe could obtain a sublinear speed of O 1/(T GC )2/3 for L-smoothed nonconvex objectives (Assumption 1), with bounded variance of gradients (Assumption 2) and
bounded gradient diversity (Assumption 3) between the global model and local models.
Except the common assumptions on smoothness and gradients, [25] also required the loss
functions
be bounded
and twice continuously differentiable, and its convergence speed was


3/2
O 1/(T )
.

Factor-Assisted Federated Learning for Personalized Optimization with Heterogeneous Data

13

Last, we study the upper bound for the generalization error of the FedSplit method.
To achieve this goal, additional conditions on the training data distribution and test data
distribution are required. The details are summarized in Theorem 2.
Theorem 2 (Generalization Bounds) Fix failure probability δ ∈ (0, 1). Suppose the trainC
ing data S = {(xi , yi )}N
i=1 in the FL system are i.i.d. samples from ⊗c=1 Dc , such that with
s∞
p∞
probability at least 1−δ/3, we have λmin (H
+ H ) ≥ λ > 0. Let λs = λmin (Hs∞ ),
p
p∞
λ = λmin (H ), ρ1 = ηg ηl G/C , and ρ2 = ηl
G. Fix α = O(λ poly(log N, log(1/δ ))/N ),
m1 = Ω α−2 poly N, log(1/δ ), (λp )−1 , λ−1
, and m2 = Ω α−2 (poly (N, log(1/δ ),

p
s −1 −1
(λ ) , λ
. Let the neural network (4) be initialized with wc,q
and wrs , which are i.i.d.
2
sampled from MN(0, α I). Let aq , ar be sampled from {−1, 1} uniformlyfor c ∈ [C ], q ∈
[m1 ], r ∈ [m2 ]. Assume we train model (4) in the FL framework for T ≥ Ω (ρ1 λs + ρ2 λp )−1
poly(log(N/δ ))) iterations. Consider the loss function ℓ : R × R → [0, 1] that is 1-Lipschitz
in its first argument. Then with probability at least 1 − δ over the random initialization
on Ws (0) ∈ Rd×m2 , Wp (0) ∈ Rd×Cm1 and the training samples, the population loss
[ℓ(h(W, a, x), y )] is upper bounded by
(h) := E(x,y)∼⊗C
L⊗C
c=1 Dc
c=1 Dc
q
p
(h) ≤ 2 (C/n)y⊤ (Hs∞ + Hp∞ )−1 y + O( log(N/(λδ ))/(2N )).
L⊗C
c=1 Dc
The detailed proof of Theorem 2 can be found in Supplementary Materials C. The intuitions behind Theorem 2 are that, if the proposed model (4) satisfies the listed conditions,
then its generalization bound depends on a data-dependent complexity measure (i.e., the
first term on the right side of the inequality) and is unrelated with the network itself. The
complexity measure is determined by both the cross-client information and inner-client information, as well as the number of clients C and sample size on each client n. Given the
total training sample size N in the FL system, increasing the number of clients C or decreasing the number of local samples n on each client would lead to larger generalization
error. Intuitively, a large client number on a fixed amount of data means it is more difficult
to develop personalized models that could generalize well for all clients.

4 Factor-Assisted Decomposition
4.1 Decomposition with factor analysis
A key problem to optimize the objective function (3) is to decompose the whole parameter set into client-shared ones and client-specific ones. To achieve this goal, we apply the
factor analysis technique [19]. Factor analysis is a classic statistical model and is widely
used in fields such as sociology, political science, and economics [49–51]. It assumes there
exists inter-dependence among a large set of variables, which can be characterized by several “common latent factors”. In addition, these common latent factors often have different
influencing levels to the variables. Then variables influenced much by the common latent
factors can be regarded as a shared group. In our problem, we treat the hidden elements in
each layer as variables. Then factor analysis is conducted to find the common latent factors
among the hidden elements. The hidden elements scoring high on the common factors would
be regarded as client-shared ones, whereas the others would be regarded as client-specific
ones. We then discuss the factor-assisted decomposition procedure in detail.
For the l-th (l ∈ Λ) convolutional layer, assume it has dl kernels with a fixed size s1 ×s2 .
Then the depth of the feature map in the (l + 1)th layer is dl . Let Kl denote the set of dl

14

Feifei Wang et al.

kernels. Assume there exists inter-dependence among Kl , which can be characterized by Gl
latent factors Fl = (F1 , · · · , FGl )⊤ with Gl < dl . Then the factor model is
Kl = Al Fl + ϵl ,

(6)

where Al = (aij ) ∈ Rdl ×Gl denotes the loading matrix, and ϵl = (ϵ1 , · · · , ϵdl )⊤ represents
the information that cannot be explained by Gl factors.
The factor analysis is conducted on the server side. Before that, the received intermediate
updates tensors (weights or gradients) from clients should be reshaped and combined to form
an input matrix for the factor model, which are shown in Figure 4. Assume the server selects
K ≤ N active clients for the current communication round, denoted by St . Then for each
client, the intermediate updates tensors with size dl−1 × s1 × s2 of each of the dl kernel
is flattened to form a weight vector zclj ∈ Rdl−1 s1 s2 with c ∈ St and j ∈ [dl ]. Second,
the flattened updates vectors of K active clients are then concatenated client-wise to form
⊤
∈ Rdl−1 s1 s2 K . Third, the dl concatenated
a longer vector χlj = z1lj , z2lj , · · · , zKlj
vectors are combined to form a matrix, which is denoted by Zl = (χl1 , χl2 , · · · , χldl ) ∈
R(dl−1 s1 s2 K )×dl . After these steps, Zl is the input for factor analysis on the l-th layer.

Fig. 4: Data preparation for factor analysis.

We follow the common practice to estimate the loading matrix Al , as well as choosing
the appropriate number of latent factors Gl [19]. First, we normalize each column of the
input matrix and still denote Zl as the normalized one. Then we compute the correlation
matrix of Kl as Rl = Z⊤
l Zl . Assume γ1 ≥ γ2 ≥ · · · ≥ γdl ≥ 0 are the eigenvalues of
Rl and uj (j = 1, · · · , dl ) is the corresponding eigenvector. To determine the appropriate
number
latent factors Gl , we compute a cumulative variability contribution rate as rlm =
Pm of P
dl
j =1 γj /
j =1 γj for 1 ⩽ m ⩽ dl . Then define Gl = min{m : rlm ⩾ κl }, where κl is
a pre-specified threshold. Denote Σϵl as the covariance matrix of special factors. According
∗
to classic theory of factor model, we have Rl = Al A⊤
l + Σϵl . Let Rl = Rl − Σϵl . Assume
∗
∗ ∗
∗
∗
the top positive Gl eigenvalues of Rl are γ1 , γ2 , · · · , γGl , and uj (j = 1, · · · , Gl ) is the
corresponding eigenvector. Then Al can be estimated as
⊤
p
q
∗ u∗
bl =
A
γ1∗ u∗1 , · · · , γG
.
(7)
G
l
l

Factor-Assisted Federated Learning for Personalized Optimization with Heterogeneous Data

15

b l using
Since Σϵl is unknown, we execute the estimation process literately. First we obtain A
b lA
b⊤
bϵl = Rl − A
(7). Then we compute Σ
.
The
process
is
iterated
until
the
difference
l
b l as
bϵl between consecutive two iterations is very small. Here we could initialize A
of Σ
⊤
√
√
γ1 u1 , · · · , γGl uGl .
Based on the estimation results of Al , we split the hidden elements
in the l-th layer as
P
follows. According to the property of factor analysis, we have dml =1 a2jm = 1 for the j th
PG l
2
kernel. Then we define a new measure νlj = m
=1 ajm to indicate the variation proportion
of the j th kernel that can be explained by Gl common factors. Then a higher νlj indicates the
j th kernel could be more explained by the common factors, and thus more likely to be shared
by all clients. This idea can be understood from another perspective. Assume there exists a
kernel j ′ with all aj ′ m = 0. Since this kernel has zero loadings on all common factors, it
is immune to general
and behaves as an absolutely client-specific one. Then we
P knowledge
2
l
can rewrite νlj = G
m=1 (ajm − 0) , which is the Euclidean distance between the j th kernel
′
and the j th kernel. Thus the j th kernel with smaller νlj suggests it is more similar to the j ′ th
kernel and behaves more client-specific. To split the kernels, let ζlj ∈ {0, 1} be an indicator
variable, where ζlj = 1 denotes the j th kernel in the l-th layer to be client-shared and ζlj = 0
otherwise. By using a threshold τl , define ζlj = 1 if vlj ≥ τl and ζlj = 0 if vlj < τl . In
this way, each element in the hidden layer can be categorized into the client-shared group
I sl = {j : ζlj = 1, 1 ≤ j ≤ dl } and client-specific group Ipl = {j : ζlj = 0, 1 ≤ j ≤ dl }.
All convolution layers and the dense layer can be split sequentially.
With the factor analysis, we can split the whole parameter set into the client-shared ones
and client-specific ones. Then the objective function (3) of FedSplit can be optimized. We
refer to the FedSplit method with factor analysis used for decomposition as FedFac, which
is a practically implemented version of FedSplit. The framework of FedFac is summarized
in Figure 5.

Fig. 5: The overall framework of FedFac.

16

Feifei Wang et al.

4.2 Static and dynamic algorithms
To implement factor analysis to decompose the hidden elements into client-shared group and
client-specific group, we develop two algorithms, i.e., the static FedFac algorithm and the
dynamic FedFac algorithm. In short, the static FedFac algorithm only conducts factor analysis once and then fixes the decomposition. Specifically, assume a certain number of clients
can first train local models based on their private datasets. Then the server gathers the local
parameters of specific layers and then conducts factor analysis on the aggregated weight
matrix. As a result, we could discriminate the shared hidden elements (neurons or channels)
from personalized ones based on common and heterogeneous knowledge. Then the decomposition of parameters is fixed during the whole FL procedure. We call this algorithm the
static FedFac (Alg. 1) since factor analysis is only conducted once in the initialization step.
This static algorithm is convenient to execute. In addition, since the decomposition of hidden elements is fixed during the training procedure, our theoretical analysis of FedSplit is
applicable.

Algorithm 1: The Static FedFac Algorithm
Data: T, ηg , K, Λ, τl , κl , (l
 ∈ Λ)
Result: Wc = W s , Wcp for each c ∈ [C]
Initialization: The server collects the parameters of l-th layer for l ∈ Λ from each local model,
conducts factor analysis on the aggregated weight matrix to decide parameters partition I sl and
I pl for each l ∈ Λ ;
p
s }}, a ∈
Let W s = {wa , {wlj
/ Λ, l ∈ Λ, j ∈ I sl ; and W p = {wlj
}, l ∈ Λ, j ∈ I pl ;
p
s
Initialize Wc = {W (0), Wc (0)}, c ∈ [C] ;
for each round t = 1, 2, · · · , T do
St ← (random set of K clients);
broadcasts W s (t) to clients in St ;
for each client k ∈ St do 

∆Wks (t), ∆Wkp (t) ← ClientUpdate k, W s (t), Wkp (t) ;
p
p
p
Wk (t + 1) = Wk (t) + ∆Wk (t)
P
n ∆Wks (t)
∆W s (t) = k∈St Pk
;
n
k∈St

k

W s (t + 1) = W s (t) + ηg ∆W s (t)

Client Update
Data: client index c, local epoches G, learning rate ηl and batchsize B
Result: Wc
download W s from server;
Wc = {W s , Wcp };
B ← ( split Dc into batches of size B);
for each local epoch g from 1 to G do
for batch b ∈ B do
Wc ← Wc − η∇ℓ(Wc ; b)

However, by using the static method, we have no chance to adjust the initial decomposition if it is far away from optimal. To address this issue, we further develop a dynamic
version of FedFac (Alg. 2). The algorithm of dynamic FedFac conducts factor analysis in
each communication round so as to modify decomposition based on the newest common
knowledge the FL system has learned. Specifically, a subset of clients are selected at each
communication round to perform local updates based on their personalized models. Then
the updated intermediate parameters are sent to the server, where the interested parameters

Factor-Assisted Federated Learning for Personalized Optimization with Heterogeneous Data

17

are reshaped and factor analysis is performed to decide parameter decomposition. After that,
the shared parameters are averaged to form a global update, whereas the personalized parameters are updated by each client. Compared with the static method, the dynamic method
is more flexible.

Algorithm 2: The Dynamic FedFac Algorithm
Data: T, ηg , K, Λ, τl , κl , (l
 ∈ Λ)
Result: Wc = W s , Wcp for each c ∈ [N ]
Initialization: The server collects the parameters of l-th layer for l ∈ Λ from each local model,
conducts factor analysis on the aggregated weight matrix to decide parameters partition I sl and
I pl for each l ∈ Λ ;
p
s }}, a ∈
Let W s = {wa , {wlj
/ Λ, l ∈ Λ, j ∈ I sl ; and W p = {wlj
}, l ∈ Λ, j ∈ I pl ;
p
s
Initialize Wc = {W (0), Wc (0)}, c ∈ [N ] ;
for each round t = 1, 2, · · · , T do
St ← (random set of K clients);
broadcasts W s (t) to clients in St ;
for each client k ∈ St do

∆Wk (t) ← ClientUpdate k, W s (t), Wkp (t) ;
for each layer l ∈ Λ do
conducts factor analysis to decide parameters partition I sl (t) and I pl (t)
s }}, for k ∈ S , a ∈
∆Wks (t) = {∆wka , {∆wklj
/ Λ, l ∈ Λ, j ∈ I sl (t) ;
t
P
nk ∆Wks (t)
s
;
∆W (t) = k∈St P
n
k∈St

k

W s (t + 1) = W s (t) + ηg ∆W s (t);
p
p
Wkp (t + 1) = {wklj
(t) + ∆wklj
(t)}, for k ∈ St , l ∈ Λ, j ∈ I pl (t) ;

Client Update
Data: client index c, local epoches G, learning rate ηl and batchsize B
Result: Wc
download W s from server;
Wc = {W s , Wcp };
B ← ( split Dc into batches of size B);
for each local epoch g from 1 to G do
for batch b ∈ B do
Wc ← Wc − η∇ℓ(Wc ; b)

4.3 Effectiveness of using factor analysis
In this section, we investigate whether using factor analysis can recover the true underlying
decomposition. To study this problem, we design a simulation study to compare the performance of different decomposition methods. Below, we first present the simulation setup and
then present the corresponding results.
4.3.1 Simulation Setup
We first present the data generation process, which can generate heterogeneous data suitable
for FedFac. For c ∈ [C ], denote xc = (xpc , xsc ) ∈ Rd , where xpc ∈ Rd1 represents the personalized covariates and xsc ∈ Rd2 represents the shared covariates. Let α = d2 /d indicate the

18

Feifei Wang et al.

shared proportion of xc . Assume xsc ∼ MN(0, I) and xpc ∼ MN(µc , Σ), where MN(·) represents the multi-normal distribution and µc ∼ MN(0, I), Σ = (0.5|i−j| )d1 ×d1 . Let Ws =


′
′
′ ′
(wsp )′ , (wss )′ ∈ Rd×m2 denote the shared parameters, and Wcp = (wcpp ) , (wcps ) ∈
Rd×m1 denote the personalized parameters for the c-th client. Denote p = m2 /m, m =
m1 + m2 . Then p represents the shared proportion of model parameters. Denote a =
(a1 , · · · , am )⊤ ∈ Rm . Then the true personalized model of each client is assumed as follows:
h∗c (xc ) =

m1
X



′

′



pp
ps
aq σ (wcq
) xpc + (wcq
) xsc +

q =1

mX
1 +m2



′

′



ar σ (wrsp ) xpc + (wrss ) xsc .

r =1+m1

(8)
We fix Ws and a for all clients, with wrsp ∼ U (−0.1, 0.1), wrss ∼ U (−1, 1) for r ∈
[m1 + 1 : m1 + m2 ], and a ∼ MN(0, I). The personalized parameters are set client-specific
pp
ps
with wcq
∼ MN(µc , I) and wcq
∼ U (−0.1, 0.1) for q ∈ [m1 ], for the c-th client. Assume
∗
∗
∗
yc is generated as yc = hc (xc ) + ϵ where ϵ ∼ N (0, c2 ). Then to generate a binary variable
yc , we consider the sigmoid function f (x). Specifically, we set yc = 1 if f (yc∗ ) ∈ (0.5, 1];
otherwise yc = 0.
In the simulation study, we set C = 100, d = 100, and m = 200. Then the MLP with
one hidden layer of m neurons is used to fit the generated data. For comparison purpose, we
consider five decomposition methods. They are, respectively: (1) the true decomposition,
(2) the oracle decomposition, where factor analysis is conducted on the true parameters,
(3) the estimated decomposition using static FedFac, (4) the estimated decomposition using
dynamic FedFac, and (5) the random decomposition. In the true decomposition method,
we know the true indices of personalized and shared hidden elements, which are stated in
(8). In the oracle decomposition method, we conduct decomposition using factor analysis
based on the true parameters of all clients, i.e., Wc = (Wcp , Ws ) ∈ Rd×m , c ∈ [C ], and then
execute FL training under the proposed framework. In the estimated decomposition methods
(static FedFac or dynamic FedFac), we neither know the true decomposition nor the true
parameters. It is the same case we encountered in practice. In the random decomposition,
we split shared/personalized neurons by random guess. Note that in the true and oracle
methods, the structure of DNN stays the same throughout the FL training. Therefore, we
only average the parameters of the shared neurons in each communication round.
4.3.2 Simulation results
We consider different experimental settings by changing the values of p (the shared proportion of model parameters) and α (the shared proportion of covariates). Figure 6 plots the
testing accuracies using different decomposition methods. As shown, the first four decomposition methods can generate nearly identical results, while the random decomposition is
much inferior. This finding suggests that, the estimated decomposition methods (both static
FedFac and dynamic FedFac) used in practice can well approximate the true decomposition
with high accuracy.
Recall that by using the static method, the decomposition of client-shared elements
and client-specific elements are fixed during the model training procedure. However, the
dynamic method behaves more flexible which allows the decomposition to change. Then
one question arises: whether the dynamic method can reach convergence? In other words,
whether the client-shared elements and client-specific elements can stay unchanged in the
end? To investigate this question, we compute the proportion of neurons which do not

Factor-Assisted Federated Learning for Personalized Optimization with Heterogeneous Data

19

Fig. 6: Test accuracies under different decomposition methods. We vary p in the top three figures by fixing
α = 0.4; and vary α in the bottom three figures by fixing p = 0.5. The smaller the value of p, the more
heterogeneous of model. The larger the value of α, the more identical of covariates.

change states between consecutive rounds when using dynamic FedFac. Figure 7 shows the
corresponding results under different experimental settings. As shown, by using dynamic
FedFac, nearly all neurons will reach their final stable states during the model training procedure. These findings demonstrate the effectiveness and feasibility of dynamic FedFac in
parameter decomposition. In conclusion, dynamic FedFac utilizes more information than
static FedFac to decompose the hidden elements and behaves more flexible. It can also reach
a static decomposition situation in the model training procedure. Therefore, compared with
static FedFac, we recommend to use dynamic FedFac in real applications. Our experiments
in real datasets also reveal that dynamic FedFac achieves better prediction results than static
FedFac, the details of which can be found in the next section.

Fig. 7: The proportion of neurons which do not change states between consecutive rounds using dynamic
FedFac.

20

Feifei Wang et al.

5 Experiments of FedFac on Real Data
5.1 Experimental setup
We conduct various experiments on a wide range of learning tasks and neural networks.
Below, we first introduce the datasets, models, and FL approaches used for comparison.
Datasets. We test the performance of FedFac on four real datasets involving three machine learning problems: (1) FEMNIST, which is used for handwritten digit recognition, (2)
Shakespeare, which is used for next-character prediction, and (3) CIFAR10 and CIFAR100,
which are used for complex image classification [52]. Among these datasets, FEMNIST is
a commonly used dataset in federated learning, whose data points on each client are generated from a specific writer. The Shakespeare dataset is naturally heterogeneous with different
clients representing different speaking roles [1]. For CIFAR10 and CIFAR100, we follow
the common practice and consider different levels of label heterogeneity. For CIFAR10, we
follow the work [53] and assign data points of the same label to different clients according to
a symmetric Dirichlet distribution with parameter π . The Dirichlet parameter π indicates the
non-IID degree with smaller π representing more highly skewed local distributions. For CIFAR100, we adopt the pathological non-IID partition as in [1], and the heterogeneity degree
is controlled by varying the number of classes S per client.
Models. Based on the four datasets, we consider a range of classical neural network models, including ResNet, VGG, and LSTM. The detailed description of neural network models
and the corresponding used datasets are summarized in App.D. For simplicity, all models are trained with the ADAM optimizer [54] with an initial learning rate of 10−3 and
β = (0.9, 0.999). Table 1 summarizes the datasets and the corresponding used deep neural
networks.
Dataset

Task

Clients Total samples

FEMNIST handwritten digit recognition 1079
Shakespeare Next-Character Prediction
778
CIFAR10
Image classification
100
CIFAR100
Image classification
100

245,114
4,101,468
60,000
60,000

Model
2-layer CNN + 2-layer FC
2-layer LSTM + 2-layer FC
ResNet-18
VGG-16

Table 1: Summary of datasets and models.

Competitors. We compare FedFac against two global FL methods, including: (1) FedAvg
[1], (2) FedProx [3], as well as four personalized FL methods, including: (1) FedPer [35],
(2) LG-FedAvg [38], (3) FedEM [28], and (4) FedRep [36]. Specifically, the penalization
parameter µ in FedProx is tuned by grid search on the set {1, 0.5, 0.1, 0.01, 0.001}. For
FedPer and FedRep, we keep the last layer of the DNN models private. For LG-FedAvg, the
number of DNN layers spread across the local and global models is tuned from 1 to L − 1.
As for FedEM, we set the number of components to be 3 as suggested by the authors. For
all the FL methods, we randomly split the dataset in each client to be training (80%) and
testing (20%). All experiments are examined in the partial participation case with C denoting
the client participation rate. The number of local epochs between each two communication
rounds is tuned on the set {1, 5, 20}.

Factor-Assisted Federated Learning for Personalized Optimization with Heterogeneous Data

21

Tuning Parameter Setup The FedFac method involves three tuning parameters: the layers
Λ to be split, the personalization threshold τl , and the variability threshold κl used in estimating Al . We tune the parameter Λ through the set {1, . . . , L − 1}. As for τl , define
ν l = {νl1 , · · · , νldl }. Then the tuning set of τl contains the 25%, 50%, 75% quantiles of
ν l , along with the max(ν l ) + 0.1, which represents all parameters are client-specific and
denoted by +∞. Note that a larger value of τl indicates more personalized parameters.
Then by considering different values of τl , the FedFac method can cover FedAvg as well
as many other split-personalization methods (e.g., FedPer and LG-FedAvg). This implies
FedFac is a more general FL framework. For the parameter κl , we consider a tuning set
{0.5, 0.75, 0.85, 0.9, 0.95}. All the parameters are tuned by grid search within their tuning
sets.

5.2 Comparison results
We evaluate the model performance on the local test dataset unseen at training for all FL
approaches. Specifically, the averaged weighted accuracy across all clients in the FL system
is computed by using weights proportional to the local dataset sizes. Table 2 summarizes
the best averaged accuracy measured across 10 consecutive rounds after convergence is
achieved setting C = 0.1. It is observed that, FedFac significantly outperforms all competing methods in almost all settings. Dynamic FedFac always performs better than its static
counterpart, suggesting the benefit of adjusting the initial decomposition. Besides, we find
the decomposition of parameters would reach a stable state in dynamic FedFac. Thus we use
dynamic FedFac for the following analysis and omit “dynamic” to save space. The results
on CIFAR10 and CIFAR100 show that, FedFac is robust to different levels of heterogeneity.
In summary, these results demonstrate the benefits and applicability of FedFac.

FedFac
Static Dynamic

Λ

Dataset

Setting

FedAvg FedProx FedPer LG-FedAvg FedEM FedRep

CIFAR10

π = 0.1
π = 0.5
π=1

0.8033
0.7039
0.6849

0.8058
0.7196
0.7007

0.7624
0.7130
0.6870

0.7710
0.5544
0.4702

0.5722
0.5708
0.5331

0.6694 0.8307
0.3932 0.7483
0.3072 0.7047

0.8307
0.7487
0.7153

{16,17}
{17}
{17}

CIFAR100

S=5
S = 10
S = 30

0.0710
0.3011
0.4513

0.2093
0.3313
0.4658

0.6773
0.6643
0.5994

0.6586
0.2435
0.2243

0.5224
0.5189
0.6183

0.2153 0.7138
0.1725 0.6897
0.1107 0.6107

0.7136
0.6932
0.6275

{15}
{15}
{15}

FEMNIST

—-

0.7680

0.8107

0.7123

0.7653

0.7949

0.6965 0.8139

0.8301

{2}

Shakespeare

—-

0.4996

0.4992

0.5014

0.3636

0.3533

0.4095 0.4983

0.5029

{3}

Table 2: Average test accuracies on different datasets. The decomposed layer l using dynamic FedFac is also
reported in the last column.

We further investigate the local performance of different clients, whose detailed results
are shown in Figure 8. It is clear that, FedFac has gained improvement in terms of the local
accuracy for most clients. In addition, note that the FedFac method can achieve relatively
small variance of client-side accuracies. It implies the advantage of FedFac in reducing
the heterogeneity of model performance across different clients. In other words, the fairness
across clients in terms of local performances in different clients can be improved by FedFac.
Last, we focus on the prediction performance on new clients. To this end, we randomly
select 10% of the total clients as new clients, and prevent them from model training. Then we

22

Feifei Wang et al.

Fig. 8: Local performance across all clients on varied datasets. Different models are ordered by their median
local performance. The orange box represents our proposed FedFac method.

apply the two solutions (LocalTrain and Ensemble) discussed in Section 3.2 to predict labels
in the new clients. As for the competitors, we apply their original generalization methods for
new clients. We do not include FedPer because it did not mention the generalization method.
For FedAvg and FedProx, we just apply the estimated global models on the new clients. The
obtained global models with fine-tune are also considered as baseline, which are denoted
by FedAvg+ and FedProx+. The method LG-FedAvg applies a similar ensemble approach
for new clients; while FedEM applies a similar LocalTrain approach. The experiment is
conducted on CIFAR10 for illustration purpose. Table 3 shows the prediction performance
achieved by FedFac and its competitors. As shown, the FedFac method with Ensemble always performs better than that with LocalTrain. It also outperforms all its competitors in
the cases π = 0.5 and 1. In the extreme heterogeneity case with π = 0.1, FedFac with
Ensemble can still achieve similar performances with the fine-tune methods FedAvg+ and
FedProx+.

Setting

FedAvg FedAvg+ FedProx FedProx+ LG-FedAvg FedEM FedRep

π = 0.1
π = 0.5
π=1

0.4004
0.5953
0.6847

0.6500
0.6604
0.6953

0.2813
0.4948
0.6636

0.6520
0.6799
0.7003

0.2046
0.3109
0.3770

0.3701
0.5807
0.5193

0.3970
0.3319
0.2660

FedFac
LocalTrain Ensemble
0.6113
0.5750
0.7034

0.6410
0.7154
0.7160

Table 3: The test accuracies of new clients on CIFAR10.

5.3 The tuning procedure of hyper-parameters
Recall that we have three hyper-parameters (Λ, τl , κl ). Our preliminary exploration shows
Λ has the largest influence on the classification performance of FedFac, which is followed
by τl and κl . Therefore, to avoid time consuming in grid search, we suggest the following

Factor-Assisted Federated Learning for Personalized Optimization with Heterogeneous Data

23

parameter tuning procedure. That is, tune Λ first, then tune τl with the optimal Λ fixed, and
finally tune κl with the optimal Λ and τl fixed. To illustrate this tuning procedure, we take
CIFAR100 as an example, which is estimated by VGG-16.
Table 4 summarizes the test accuracies under different Λ, with τl = 50 and κl = 0.85
fixed. We find that, the largest Λ results in the best test accuracy for different setups of S .
Recall Λ determines the layers to conduct decomposition for client-shared parameters and
client-specific parameters. This finding suggests that, for more complicated neural networks
dealing with challenging tasks (i.e., CIFAR100 with VGG-16), conducting decomposition
on deeper layers results in better model performance than that on shallow layers. This is
because the deeper layers usually express high-level representations, which may behave
more helpful to complicated tasks. On the contrary, for simple DNN structures tackling
easier tasks (i.e., the handwritten digit recognition), our results show that, shallow layers are
more preferable and should be taken into consideration.
By Table 4, we have chosen the best Λ for CIFAR100, i.e., Λ = {15}. In the next step,
we fix Λ = {15}, κl = 0.85, and then select the optimal τl . The corresponding results are
shown in Table 5. We find that, the optimal τl is 95% for S = 5 and S = 10. For S = 30,
the optimal τl is the 90% quantile of ν l . Recall S controls the heterogeneity level of data.
The smaller the S , the larger heterogeneity of the data. Therefor, these results suggest that,
the optimal value of τl is related with the heterogeneity level of data. Larger τl is more
beneficial when data have higher level of heterogeneity among different clients. This finding
also confirms our intuitive understanding that, high heterogeneity inherently encourages
more personalization and less sharing. Finally, with the optimal Λ and τl fixed, we can
select κl . Note that κl is used to determine the number of common factors in factor analysis.
Table 6 shows the tuning results of κl for the three cases.
Λ

{1}

{3}

{4}

{5}

{6}

{7}

{8}

S=5
S = 10
S = 30
Λ

0.1043
0.2967
0.4433
{9}

0.1178
0.3506
0.4601
{10}

0.0938
0.3678
0.4618
{11}

0.1075
0.4050
0.4580
{12}

0.1078
0.4219
0.4683
{13}

0.1397
0.4180
0.5032
{14}

0.1773
0.4986
0.5219
{15}

S=5
S = 10
S = 30

0.1418
0.5258
0.5412

0.1516
0.5743
0.5535

0.2514
0.5167
0.5711

0.1563
0.3887
0.5725

0.0718
0.2275
0.3943

0.1876
0.3912
0.5615

0.6822
0.6561
0.6100

Table 4: Tuning results of Λ on CIFAR100 with τl = 50, κl = 0.85.

τl

25%

50%

75%

90%

95%

S=5
S = 10
S = 30

0.6020
0.2739
0.5747

0.6792
0.4254
0.6066

0.6883
0.4814
0.6189

0.6940
0.6788
0.6260

0.7136
0.6932
0.6072

Table 5: Tuning results of τl on CIFAR100 with Λ = {15}, κl = 0.85.

24

Feifei Wang et al.
κl
S=5
S = 10
S = 30

0.5

0.75

0.85

0.9

0.95

0.6921
0.6846
0.6145

0.6973
0.6932
0.6260

0.7031
0.6778
0.6134

0.7136
0.6847
0.6090

0.7084
0.6893
0.6103

Table 6: Tuning results of κl on CIFAR100 with Λ = {15} and according optimal τl in Table 5.

5.4 The computational issue
We first investigate the convergence rate of FedFac using the dynamic method. Figure 9
presents the training loss curves of FedFac and FedAvg in different datasets. Compared
with FedAvg, we find the loss of FedFac goes down faster in all datasets with different
heterogeneity settings, indicating that FedFac has a faster convergence rate than FedAvg.
This result is consistent with our theoretical analysis in Theorem 1. Moreover, compared to
FedAvg, FedFac behaves smoother and more stable during the learning process. We then
focus on the influence of heterogeneity levels. We can find that, the higher heterogeneity of
data across clients, the larger difference between the loss curves of FedFac and FedAvg. This
finding suggests that, the convergence rate of FedFac is much faster than that of FedAvg in
more heterogeneous situations. In particular, the two loss curves nearly coincide for nearly
IID datasets (i.e., CIFAR10 with π = 1 and Shakespeare), which implies the convergence
rate of FedFac is almost the same as FedAvg when data have nearly no heterogeneity.

Fig. 9: The curves of training loss obtained by FedFac and FedAvg on different datasets.

We also explore the computational complexity of FedFac. To this end, we take the running time consumed by FedAvg in the training process as baseline. Then we report the ratio
between the running time consumed by a particular method (e.g., FedFac) with FedAvg.
Table 7 presents the detailed results. As shown, FedFac indeed consumes more time than
FedAvg because of the implementation of factor analysis. However, compared with other
competitors addressing data heterogeneity, FedFac is still competitive.

Factor-Assisted Federated Learning for Personalized Optimization with Heterogeneous Data
Dataset

Setting

FedEM

FedRep FedFac

CIFAR10

π = 0.1 4.2073 1.0038
π = 0.5 4.7352 1.1357
π = 1 4.5255 0.9828

FedProx FedPer LG-FedAvg
1.0032
1.1337
0.9996

2.8738
2.8508
2.8091

0.6683 0.8559
0.8679 1.1798
0.7292 1.0845

CIFAR100

S=5
S = 10
S = 30

1.2945 0.6967
1.3165 0.9118
1.5052 0.7143

0.6884
0.7055
0.5686

12.2676 2.7761 3.3879
9.3704 1.8907 3.2842
8.1725 1.9346 2.4466

FEMNIST

—-

2.2718 1.1752

1.1045

9.3726

Shakespeare

—-

3.5878 0.9052

0.9460

10.4129 0.7915 1.2856

25

0.6964 1.2147

Table 7: The ratio of training time of all methods relative to that of FedAvg.

5.5 Ablation studies
We conduct ablation studies from two perspectives. First, we aim to investigate the effect of
client-shared parameters and client-specific parameters. To this end, we mask each group of
parameters in the l-th layer when updating the local models, and then record the corresponding performance. Here, the mask operation means replacing the targeted group of parameters
by random values. For fair comparison, we set τl = 50%, which means the two groups have
equal size of parameters. Panel A in Table 8 presents the performance when masking the
shared parameters or personalized parameters in different experimental settings. As shown,
in most settings especially the heterogeneity level is high, masking personalized parameters
would lead to worse classification performance than masking the shared parameters. This
finding demonstrates the usefulness of personalized parameters. Second, we aim to investigate the effect of decomposition by factor analysis. To this end, we randomly select the
shared and personalized parameters for the layers to be split while keeping the other experiment settings consistent with FedFac. The results are shown in Panel B of Table 8. For
clear comparison, the corresponding results by FedFac in Table 8 are also reported. As
shown, FedFac can always achieve better performance than random split, which verifies the
usefulness of factor analysis.
CIFAR10 CIFAR10 CIFAR10 CIFAR100 CIFAR100 CIFAR100
FEMNIST Shakespeare
π = 0.1 π = 0.5 π = 1
S=5
S = 10
S = 30

Setting

Panel A: Client-shared V.S. Client-specific
mask-shared
mask-personalized

0.8154
0.7225

0.7417
0.7043

0.7092
0.7037

0.6794
0.6578

0.6618
0.6448

0.6052
0.5846

0.8256
0.4793

0.5010
0.4993

0.8301
0.8113

0.5029
0.4990

Panel B: Factor Analysis V.S. Random Split
factor analysis
random split

0.8307
0.8140

0.7487
0.7251

0.7153
0.7037

0.7138
0.6831

0.6932
0.6808

0.6275
0.6060

Table 8: The performance evaluated by Acc of two settings in ablation studies.

5.6 Interpretability
We visualize the featuremaps generated by the personalized and shared filters obtained by
FedFac. For comparison purpose, we also visualize the representations of corresponding filters in FedAvg. For illustration, we take the fifth convolutional layer (the first convolutional

26

Feifei Wang et al.

module) of ResNet-18 trained on CIFAR10 with π = 0.5 for example, since the feature
maps outputted by shallow layers are less abstract compared with deeper layers (which usually output pigment matrix). We set τ5 = 50% so that half of channels in this layer are
personalized. As shown in Figure 10, the personalized filters in FedFac can extract more
clear and specific features for the original Image A, while the corresponding shared filters
in FedAvg generate much more noise in feature representations. It again demonstrates the
advantage of personalized filters in FedFac. The shared filters in FedFac behave similarly
with those in FedAvg. In addition, we find the two types of filters (personalized v.s. shared)
in FedFac generate quite different representations. The personalized filters probably more
focus on extracting target-specific features, whereas the shared filters tend to summarize the
overall features of the original figure. These results justify our motivation to separate the
client-shared representations and client-specific representations.

Fig. 10: Representations of one image generated by the fifth convolution layers of ResNet-18.

6 Conclusion and Discussion
In this work, we propose a novel personalization approach called FedSplit to tackle the statistical heterogeneity problem in federated learning. We are motivated by a common finding that, data in different clients could contain both common knowledge and personalized
knowledge. Thus the hidden elements within the same layer should be also decomposed
into client-shared ones and client-specific ones. With this decomposition, the parameters
corresponding to the client-shared group are updated by all clients, while the parameters
corresponding to the client-specific group are only updated locally and not averaged by the
server. By this way, the resulting models are more customized for individual clients. We
demonstrate FedSplit enjoys a faster convergence speed than the standard FedAvg method.
The generalization bound of FedSplit is also analyzed. To practically decompose the two
groups of information, the factor analysis technique is applied. This leads to the FedFac
method, which is a practically implemented version of FedSplit. We demonstrate by simulation studies that, using factor analysis can well recover the underlying shared/personalized
decomposition. Extensive experiments also show that, FedFac can improve model performance than various state-of-the-art federated learning algorithms.
The FedFac method also has some limitations, which inspire further study in the future.
First, some automatical hyper-parameter tuning strategy such as choosing the network layer
for decomposition is worth of consideration. Second, some privacy-protecting techniques
(e.g., differential privacy) can be leveraged to protect the proposed method from potential
disclosure of sensitive information. Last, our theoretical analysis is applicable for the static
FedFac method. The theoretical properties of using dynamic FedFac for decomposition are
worth of further investigation.

Factor-Assisted Federated Learning for Personalized Optimization with Heterogeneous Data

27

References
1. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera
y Arcas. Communication-efficient learning of deep networks from decentralized data.
In Artificial Intelligence and Statistics, pages 1273–1282. PMLR, 2017.
2. Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. Federated Machine Learning:
Concept and Applications. ACM Transactions on Intelligent Systems and Technology
(TIST), 10(2):1–19, 2019.
3. Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine,
37(3):50–60, 2020.
4. Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis,
Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel
Cummings, et al. Advances and open problems in federated learning. Foundations and
Trends® in Machine Learning, 14(1–2):1–210, 2021.
5. Qinbin Li, Zeyi Wen, Zhaomin Wu, Sixu Hu, Naibo Wang, Yuan Li, Xu Liu, and Bingsheng He. A survey on federated learning systems: Vision, hype and reality for data
privacy and protection. IEEE Transactions on Knowledge and Data Engineering, 2021.
6. Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. On the
Convergence of FedAvg on Non-IID Data, 2019.
7. Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian
Stich, and Ananda Theertha Suresh. Scaffold: Stochastic Controlled Averaging for Federated Learning. In International conference on machine learning, pages 5132–5143.
PMLR, 2020.
8. Xinwei Zhang, Mingyi Hong, Sairaj Dhople, Wotao Yin, and Yang Liu. Fedpd: A
federated learning framework with adaptivity to non-iid data. IEEE Transactions on
Signal Processing, 69:6055–6070, 2021.
9. Tian Li, Maziar Sanjabi, Ahmad Beirami, and Virginia Smith. Fair Resource Allocation
in Federated Learning, 2019.
10. Mehryar Mohri, Gary Sivek, and Ananda Theertha Suresh. Agnostic Federated Learning. In International Conference on Machine Learning, pages 4615–4625. PMLR,
2019.
11. Durmus Alp Emre Acar, Yue Zhao, Ramon Matas Navarro, Matthew Mattina, Paul N.
Whatmough, and Venkatesh Saligrama. Federated learning based on dynamic regularization, 2021.
12. Alysa Ziying Tan, Han Yu, Lizhen Cui, and Qiang Yang. Towards Personalized Federated Learning. IEEE Transactions on Neural Networks and Learning Systems, 2022.
13. Yutao Huang, Lingyang Chu, Zirui Zhou, Lanjun Wang, Jiangchuan Liu, Jian Pei, and
Yong Zhang. Personalized Cross-Silo Federated Learning on Non-IID Data. In AAAI,
pages 7865–7873, 2021.
14. Daliang Li and Junpu Wang. FedMD: Heterogenous Federated Learning via Model
Distillation, 2019.
15. Kangkang Wang, Rajiv Mathews, ChloÃ© Kiddon, Hubert Eichner, FranÃ§oise Beaufays, and Daniel Ramage. Federated Evaluation of On-device Personalization, 2019.
16. Suyu Ge, Fangzhao Wu, Chuhan Wu, Tao Qi, Yongfeng Huang, and Xing Xie. FedNER:
Privacy-preserving Medical Named Entity Recognition with Federated Learning, 2020.
17. Jiaming Pei, Kaiyang Zhong, Mian Ahmad Jan, and Jinhai Li. Personalized Federated
Learning Framework for Network Traffic Anomaly Detection. Computer Networks,
209:108906, 2022.

28

Feifei Wang et al.

18. Lyudmyla F Kozachenko and Nikolai N Leonenko. Sample Estimate of the Entropy of
A Random Vector. Problemy Peredachi Informatsii, 23(2):9–16, 1987.
19. Harry H Harman. Modern Factor Analysis. University of Chicago press, 1976.
20. Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H Vincent Poor. Tackling the
Objective Inconsistency Problem in Heterogeneous Federated Optimization. Advances
in Neural Information Processing Systems, 33:7611–7623, 2020.
21. Aritra Mitra, Rayana Jaafar, George J Pappas, and Hamed Hassani. Linear Convergence
in Federated Learning: Tackling Client Heterogeneity and Sparse Gradients. Advances
in Neural Information Processing Systems, 34:14606–14619, 2021.
22. Filip Hanzely and Peter Richtárik. Federated Learning of A Mixture of Global and
Local Models. arXiv preprint arXiv:2002.05516, 2020.
23. Tao Yu, Eugene Bagdasaryan, and Vitaly Shmatikov. Salvaging Federated Learning by
Local Adaptation. arXiv preprint arXiv:2002.04758, 2022.
24. Liam Collins, Hamed Hassani, Aryan Mokhtari, and Sanjay Shakkottai. Fedavg with
fine tuning: Local updates lead to representation learning. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information
Processing Systems, 2022.
25. Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. Personalized Federated Learning with Theoretical Guarantees: A Model-Agnostic Meta-Learning Approach. Advances in Neural Information Processing Systems, 33:3557–3568, 2020.
26. Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet S Talwalkar. Federated
Multi-Task Learning. Advances in Neural Information Processing Systems, 30, 2017.
27. Luca Corinzia, Ami Beuret, and Joachim M. Buhmann. Variational Federated MultiTask Learning, 2019.
28. Othmane Marfoq, Giovanni Neglia, Aurélien Bellet, Laetitia Kameni, and Richard Vidal. Federated Multi-Task Learning under a Mixture of Distributions. Advances in
Neural Information Processing Systems, 34:15434–15447, 2021.
29. Yuyang Deng, Mohammad Mahdi Kamani, and Mehrdad Mahdavi. Adaptive personalized federated learning, 2020.
30. Yishay Mansour, Mehryar Mohri, Jae Ro, and Ananda Theertha Suresh. Three Approaches for Personalization with Applications to Federated Learning, 2020.
31. Othmane Marfoq, Giovanni Neglia, Richard Vidal, and Laetitia Kameni. Personalized
federated learning through local memorization. In International Conference on Machine Learning, pages 15070–15092. PMLR, 2022.
32. Avishek Ghosh, Jichan Chung, Dong Yin, and Kannan Ramchandran. An Efficient
Framework for Clustered Federated Learning. Advances in Neural Information Processing Systems, 33:19586–19597, 2020.
33. Felix Sattler, Klaus-Robert Müller, and Wojciech Samek. Clustered Federated Learning: Model-Agnostic Distributed Multitask Optimization under Privacy Constraints.
IEEE Transactions on Neural Networks and Learning Systems, 32(8):3710–3722, 2020.
34. Canh T. Dinh, Nguyen Tran, and Josh Nguyen. Personalized Federated Learning with
Moreau Envelopes. Advances in Neural Information Processing Systems, 33:21394–
21405, 2020.
35. Manoj Ghuhan Arivazhagan, Vinay Aggarwal, Aaditya Kumar Singh, and Sunav
Choudhary. Federated learning with personalization layers, 2019.
36. Liam Collins, Hamed Hassani, Aryan Mokhtari, and Sanjay Shakkottai. Exploiting
Shared Representations for Personalized Federated Learning. In International Conference on Machine Learning, pages 2089–2099. PMLR, 2021.

Factor-Assisted Federated Learning for Personalized Optimization with Heterogeneous Data

29

37. Jaehoon Oh, SangMook Kim, and Se-Young Yun. FedBABU: Toward enhanced representation for federated image classification. In International Conference on Learning
Representations, 2022.
38. Paul Pu Liang, Terrance Liu, Liu Ziyin, Nicholas B. Allen, Randy P. Auerbach, David
Brent, Ruslan Salakhutdinov, and Louis-Philippe Morency. Think locally, act globally:
Federated learning with local and global representations, 2020.
39. Xiaoxiao Li, Meirui Jiang, Xiaofei Zhang, Michael Kamp, and Qi Dou. FedBN: Federated Learning on Non-IID Features via Local Batch Normalization, 2021.
40. Yiqing Shen, Yuyin Zhou, and Lequan Yu. Cd2-pfed: Cyclic Distillation-guided Channel Decoupling for Model Personalization in Federated Learning. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10041–
10050, 2022.
41. Samiul Alam, Luyang Liu, Ming Yan, and Mi Zhang. Fedrolex: Model-Heterogeneous
Federated Learning with Rolling Sub-Model Extraction. Advances in Neural Information Processing Systems, 35:29677–29690, 2022.
42. Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural Tangent Kernel: Convergence and Generalization in Neural Networks. Advances in Neural Information Processing Systems, 31, 2018.
43. Michael Kamp, Jonas Fischer, and Jilles Vreeken. Federated learning from small
datasets. In The Eleventh International Conference on Learning Representations, 2023.
44. Baihe Huang, Xiaoxiao Li, Zhao Song, and Xin Yang. Fl-ntk: A neural tangent kernelbased framework for federated learning analysis. In International Conference on Machine Learning, pages 4423–4434. PMLR, 2021.
45. Kai Yue, Richeng Jin, Ryan Pilgrim, Chau-Wai Wong, Dror Baron, and Huaiyu Dai.
Neural Tangent Kernel Empowered Federated Learning. In International Conference
on Machine Learning, pages 25783–25803. PMLR, 2022.
46. Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably
optimizes over-parameterized neural networks, 2018.
47. Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-Grained
Analysis of Optimization and Generalization for Overparameterized TwoLayer Neural
Networks. In International Conference on Machine Learning, pages 322–332. PMLR,
2019.
48. Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient Descent
Finds Global Minima of Deep Neural Networks. In International Conference on Machine Learning, pages 1675–1685. PMLR, 2019.
49. Emil OW Kirkegaard. Inequality across US Counties: An S Factor Analysis. Open
Quantitative Sociology & Political Science, 2016.
50. Malek Abduljaber. A dimension reduction method application to a political science
question: Using exploratory factor analysis to generate the dimensionality of political ideology in the arab world. Journal of Information & Knowledge Management,
19(01):2040002, 2020.
51. Malin Song and Qianjiao Xie. Evaluation of Urban Competitiveness of the Huaihe River
Eco-Economic Belt Based on Dynamic Factor Analysis. Computational Economics,
58(3):615–639, 2021.
52. Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny
images. 2009.
53. Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. Measuring the effects of nonidentical data distribution for federated visual classification, 2019.
54. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2014.