Stochastic-Constrained Stochastic Optimization with Markovian Data

Yeongjong Kim  kimyj@kaist.ac.kr
Department of Mathematical Sciences, KAIST, Daejeon 34141, South Korea

Dabeen Lee  dabeenl@kaist.ac.kr
Department of Industrial and Systems Engineering, KAIST, Daejeon 34141, South Korea

Abstract

This paper considers stochastic-constrained stochastic optimization where the stochastic constraint is to satisfy that the expectation of a random function is below a certain threshold. In particular, we study the setting where data samples are drawn from a Markov chain and thus are not independent and identically distributed. We generalize the drift-plus-penalty framework, a primal-dual stochastic gradient method developed for the i.i.d. case, to the Markov chain sampling setting. We propose two variants of drift-plus-penalty; one is for the case when the mixing time of the underlying Markov chain is known while the other is for the case of unknown mixing time. In fact, our algorithms apply to a more general setting of constrained online convex optimization where the sequence of constraint functions follows a Markov chain. Both algorithms are adaptive in that the first works without knowledge of the time horizon while the second uses AdaGrad-style algorithm parameters, which is of independent interest. We demonstrate the effectiveness of our proposed methods through numerical experiments on classification with fairness constraints.

Keywords: stochastic-constrained stochastic optimization, constrained online convex optimization, Markov chain stochastic gradient descent, drift-plus-penalty, classification with fairness constraints

©2023 Yeongjong Kim and Dabeen Lee. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/.

1 Introduction

In this paper, we consider the stochastic-constrained stochastic optimization (SCSO) problem

min_{x∈X} E_{ξ∼µ}[f(x, ξ)]  s.t.  E_{ξ∼µ}[g(x, ξ)] ≤ 0    (SCSO)

where the expectation is taken with respect to the random parameter ξ over a stationary distribution µ, f and g are convex loss and constraint functions, and X is a compact domain. This formulation of SCSO has applications in stochastic programming with a risk constraint (Rockafellar and Uryasev, 2000), chance-constrained programming (Nemirovski and Shapiro, 2007), portfolio optimization (Dentcheva and Ruszczynski, 2003), sparse matrix completion (Akhtar et al., 2021), semi-supervised learning (Chapelle et al., 2010), classification with fairness constraints (Zafar et al., 2019; Celis et al., 2019; Donini et al., 2018), Neyman-Pearson classification (Scott and Nowak, 2005; Rigollet and Tong, 2011), ranking with fairness constraints (Celis et al., 2018), recommendation systems with fairness constraints (Yao and Huang, 2017), scheduling in distributed data centers (Yu et al., 2017), safe reinforcement learning (García et al., 2015), and stochastic simple bilevel optimization (Jalilzadeh et al., 2023; Cao et al., 2023).

Stochastic approximation (SA) algorithms are prevalent solution methods for SCSO. Basically, we run gradient-based algorithms with an oracle providing i.i.d. samples of f(x, ξ), g(x, ξ), ∇f(x, ξ), ∇g(x, ξ) for a given solution x. Lan and Zhou (2020) proposed the cooperative stochastic approximation scheme for SCSO, which is a stochastic extension of Polyak's subgradient method developed for constrained optimization (Polyak, 1967).
Xiao (2019) developed the penalized stochastic gradient method that takes the square of the constraint function as a penalty term. Lin et al. (2020) developed a level set-based algorithm for SCSO. Akhtar et al. (2021) studied an augmented Lagrangian-based stochastic primal-dual algorithm, which was inspired by the primal-dual framework for online convex optimization with long-term constraints due to Mahdavi et al. (2012). Furthermore, motivated by recent success in adaptive gradient algorithms, Yan and Xu (2022) considered an adaptive primal-dual stochastic gradient method for SCSO. Zhang et al. (2023) proposed a stochastic variant of the proximal method of multipliers. Zhang et al. (2022) studied another stochastic proximal method of multipliers based on linearization.

SA algorithms for other related problem settings are as follows. Yu et al. (2017) studied online convex optimization with stochastic constraints where the constraint functions are stochastic i.i.d. while the objective functions are arbitrary, for which they proposed the drift-plus-penalty (DPP) algorithm. DPP applies to SCSO given that i.i.d. samples of g(x, ξ), ∇g(x, ξ), and ∇f(x, ξ) are available. Wei et al. (2020) developed an extension of DPP, and Lee et al. (2023) provided a projection-free algorithm for the setting. Moreover, SCSO can be formulated as a stochastic saddle-point problem by taking the Lagrangian if certain constraint qualifications hold. Nemirovski et al. (2009) and Juditsky et al. (2011) developed stochastic mirror-prox algorithms for general stochastic saddle-point problems. Zhao (2022) devised an accelerated stochastic framework for convex-concave saddle-point problems, and Yazdandoost Hamedani et al. (2023) devised a randomized adaptive primal-dual method.

The aforementioned solution methods for SCSO require i.i.d. data samples from the stationary distribution µ when running SA algorithms. However, there are several application scenarios in which sampling from the stationary distribution µ independently and identically is difficult. For example, federated learning serves as an alternative to traditional machine learning systems that require data centralization, with the purpose of improving data privacy (Zhao et al., 2018). The basic framework is that data is stored on individual local devices while the training is governed by a central server. Another related setting is distributed optimization over sensor networks (Rabbat and Nowak, 2004; Lopes and Sayed, 2007; Johansson et al., 2007) and multi-agent systems (Johansson et al., 2008). For these applications, an enormous amount of data is distributed over distinct nodes of a network, for which data exchange and message passing occur only between immediate neighboring nodes.

One resolution approach for such application scenarios is Markov chain stochastic gradient descent (SGD) (Johansson et al., 2010; Ram et al., 2009a). Basically, Markov chain SGD takes a random walk over a network of local data centers and updates solutions based on data acquired from the data centers visited. Here, as the data is generated by a Markov random walk, there is inherent dependence and bias between data samples. More generally, the framework can be viewed as a variant of the Markov chain Monte Carlo method (Andrieu et al., 2003).
That said, Markov chain SGD can be applied to other application domains where data is collected from a Markov process, such as decentralized learning (Yang et al., 2021), robust estimation (Poljak and Tsypkin, 1980; Sarkar and Rakhlin, 2019), and reinforcement learning (Nagaraj et al., 2020; Kowshik et al., 2021).

Following the Markov incremental subgradient methods due to Johansson et al. (2010) and Ram et al. (2009a) designed for distributed optimization problems, Markov chain SGD with data sampled from a general Markov process has been studied. Duchi et al. (2012) developed Markov chain SGD with data sampled from an ergodic process. Later, Sun et al. (2018) studied Markov chain SGD for convex and nonconvex problems when the underlying Markov chain is nonreversible. Doan et al. (2020) proposed an accelerated version of Markov chain SGD for both convex and nonconvex settings. Dorfman and Levy (2022) considered the setting where the mixing time of the underlying Markov chain is unknown, for which they developed Markov chain SGD based on the multi-level Monte Carlo estimation scheme (Giles, 2015; Blanchet and Glynn, 2015). Roy et al. (2022) studied nonconvex problems where the transition of the underlying Markov chain is state-dependent, motivated by strategic classification and reinforcement learning. Recent works (Doan, 2023; Even, 2023) established some performance guarantees of Markov chain SGD under minimal assumptions.

Applications of SCSO naturally motivate and necessitate algorithmic frameworks that can handle data sets with inherent dependence and bias. Portfolio optimization in finance takes time series data. Machine learning with fairness constraints processes heterogeneous data sets from disjoint sources. Scheduling in distributed data centers will benefit from distributed optimization technologies. Safe and constrained reinforcement learning can be formulated as SCSO for which the training data is obtained from trajectories of the underlying Markov decision process. Despite this immediate need, no prior work exists for solving SCSO with non-i.i.d. data samples. Motivated by this, the objective of this paper is to develop stochastic approximation algorithms that run with data sampled from a Markov chain.

1.1 Our Contributions

This paper initiates the study of stochastic approximation algorithms for stochastic-constrained stochastic optimization (SCSO) using non-i.i.d. data samples. Inspired by recent advances in Markov chain SGD, we develop primal-dual stochastic gradient methods using data sampled from a Markov chain, which can be viewed as primal-dual variants of Markov chain SGD. Specifically, we extend the drift-plus-penalty algorithm by Yu et al. (2017) that was originally developed for the i.i.d. setting. We adopt the approach of ergodic mirror descent by Duchi et al. (2012) for the case of known mixing time and the framework of Dorfman and Levy (2022) for the setting where the mixing time is unknown.

Our key technical contribution is to develop two variants of the drift-plus-penalty algorithm that can take a sequence of dependent constraint functions. These two algorithms solve constrained online convex optimization where the objective functions can be arbitrary and the constraint functions are generated from a Markov chain. One of them is for the case of known mixing time, while the other is for the case when the mixing time is unknown. We provide regret and constraint violation bounds for the algorithms, which delineate how their performance depends on the mixing time.
Based on the regret and constraint violation guarantees, we analyze the optimality gap and feasibility gap for SCSO. The connection between the constrained online convex optimization formulation and SCSO for our Markovian setting is not as immediate as in the i.i.d. setting because the expectation in (SCSO) is taken with respect to the stationary distribution of the underlying Markov chain.

What follows is a more detailed description of our contributions. Our results are also summarized in Table 1.

|                      | Algorithm 1                                            | Algorithm 1 under Slater's condition | Algorithm 2                               |
| Oblivious to τmix    | ✗                                                      | ✗                                    | ✔                                         |
| Regret               | Õ(τmix^{1−β} T^{1−β})                                  | Õ(√(τmix T))                         | adaptive                                  |
| Constraint violation | O(τmix^{β/2} T^{(β+1)/2})                              | Õ(√(τmix T))                         | adaptive                                  |
| Optimality gap       | Õ(τmix^{1−β} T^{−β} + τmix^{β/2} T^{−(1−β)/2})         | Õ(√(τmix/T))                         | Õ(τmix^{1−β} T^{−β})                      |
| Feasibility gap      | Õ(τmix^{β/2} T^{−(1−β)/2})                             | Õ(√(τmix/T))                         | Õ(τmix^{(2β+1)/4} T^{−(1−β)/2})           |

Table 1: Bounds on regret, constraint violation, optimality gap, and feasibility gap under Algorithms 1 and 2 (β ∈ (0, 1/2] is a predetermined algorithm parameter that controls the balance between regret and constraint violation)

• In Section 3, we consider the case when the mixing time of the underlying Markov chain is known, for which we develop Algorithm 1, a variant of the drift-plus-penalty algorithm. We first prove that for online convex optimization with Markovian constraints, the regret of Algorithm 1 is Õ(τmix^{1−β} T^{1−β}) and the constraint violation is O(τmix^{β/2} T^{(β+1)/2}), where τmix is the mixing time, T is the length of the horizon, β can be chosen to be any number in (0, 1/2], and Õ hides a poly log T factor. If we further assume that (SCSO) satisfies Slater's constraint qualification, then the regret and constraint violation of Algorithm 1 can both be bounded by Õ(√(τmix T)). We remark that Algorithm 1 is adaptive in that its parameters are chosen without knowledge of T, unlike the vanilla drift-plus-penalty algorithm. These results generalize the work of Yu et al. (2017).

• Based on the regret and constraint violation analysis for Algorithm 1, we show that the averaging of the sequence of solutions generated by Algorithm 1 guarantees that the optimality gap is bounded by Õ(τmix^{1−β} T^{−β} + τmix^{β/2} T^{−(1−β)/2}) while the feasibility gap is bounded by Õ(τmix^{β/2} T^{−(1−β)/2}). If we further assume that (SCSO) satisfies Slater's constraint qualification, then the optimality gap and feasibility gap of Algorithm 1 can both be bounded by Õ(√(τmix/T)).

• In Section 4, we propose Algorithm 2, another variant of the drift-plus-penalty algorithm, for the setting where the mixing time is unknown. The parameters of Algorithm 2 are set in an adaptive fashion, as in the AdaGrad method (Duchi et al., 2011; Levy, 2017). Then we apply Algorithm 2 to constrained online convex optimization where the objective and constraint functions are given by the multi-level Monte Carlo estimation scheme as in Dorfman and Levy (2022). We provide adaptive regret and constraint violation bounds for Algorithm 2. We note that Algorithm 2 is the first AdaGrad-style adaptive variant of the drift-plus-penalty algorithm. In fact, Algorithm 2 applies to online convex optimization with adversarial constraints (Neely and Yu, 2017) and provides adaptive performance guarantees. We include this result in Appendix C.
• Combining the estimation accuracy bounds on the multi-level Monte Carlo method by Dorfman and Levy (2022) and our adaptive regret and constraint violation bounds, we prove that the optimality gap under Algorithm 2 is bounded by Õ(τmix^{1−β} T^{−β}) while the feasibility gap is bounded by Õ(τmix^{(2β+1)/4} T^{−(1−β)/2}).

• In Section 5, we provide numerical results from experiments on a classification problem with fairness constraints. Specifically, we take the logistic regression formulation proposed in Zafar et al. (2019). The numerical results on random problem instances demonstrate the efficacy of our proposed algorithmic frameworks for solving SCSO with Markovian data.

The main component of our analysis is to provide bounds on the terms

E[ ∑_{t=1}^T Qt gt(x) ],  E[ ∑_{t=1}^T (Qt/Vt) gt(x) ],  E[Qt],  E[Qt/Vt]

under our Markovian data regime. All these terms involve the virtual queue size Qt, and therefore, controlling the virtual queue size is crucial to guarantee a fast convergence rate. We include our proofs of the main theorems in Sections 6 and 7. We include some of the known lemmas and tools for analyzing the drift-plus-penalty method due to Yu et al. (2017) in Appendices A and D. In fact, Appendix A states these results for any adaptive version of the drift-plus-penalty algorithm, which uses time-varying parameters Vt and αt.

1.2 Related Work

Although we have listed and explained important lines of previous work that motivate this paper, we supplement the list by mentioning a few more relevant results in stochastic approximation algorithms and Markov chain stochastic gradient methods.

Stochastic Approximation for Stochastic Optimization  Starting from the seminal paper by Robbins and Monro (1951), stochastic approximation algorithms, also known as stochastic gradient methods, have been a central topic of research in the domains of machine learning, operations research, and optimization (Ermoliev, 1983; Pflug, 1996; Ruszczyński and Syski, 1986; Bottou et al., 2018). There already exist numerous works on stochastic approximation algorithms for stochastic optimization. In particular, for SCSO or stochastic optimization with expectation constraints, various stochastic approximation methods were proposed and studied by Lan and Zhou (2020); Xiao (2019); Lin et al. (2020); Akhtar et al. (2021); Yan and Xu (2022); Zhang et al. (2023, 2022), as discussed earlier.

We note that online convex optimization with stochastic constraints is a superclass of SCSO under the i.i.d. data regime. Mahdavi et al. (2012) and Jenatton et al. (2016) developed augmented Lagrangian-based primal-dual algorithms for online convex optimization with deterministic long-term constraints. It turns out that these algorithms and their analysis can be adapted to the setting of stochastic constraints (Akhtar et al., 2021). Later, Yu et al. (2017) and Wei et al. (2020) proposed the drift-plus-penalty algorithm that achieves better regret and constraint violation guarantees under Slater's condition. Yuan and Lamperski (2018); Yi et al. (2021); Guo et al. (2022) also provided algorithms for online convex optimization with stochastic constraints.

Markov Chain Stochastic Gradient Descent  The asymptotic convergence of Markov chain SGD was studied in Bertsekas and Tsitsiklis (1996); Borkar (2008); Benveniste et al. (1990) based on ordinary differential equation methods.
Lopes and Sayed (2007); Johansson et al. (2007, 2010); Ram et al. (2009a) developed incremental subgradient methods for the distributed optimization setting where there is a network of data centers and an algorithm performs a random walk over the network to obtain data samples. These methods are also referred to as token algorithms. More recently, Mao et al. (2020) devised what they called the Walkman algorithm, which is a token algorithm based on an augmented Lagrangian method. Sun et al. (2022); Ayache et al. (2023) considered adaptive variants of token algorithms. Hendrikx (2023) developed a more general framework for token algorithms that allows multiple tokens for improving communication efficiency and may adopt existing optimization tools such as variance reduction and acceleration.

In addition to random walk-based token algorithms, there exist general Markov chain sampling frameworks for stochastic gradient methods. As mentioned earlier, Duchi et al. (2012); Sun et al. (2018); Doan et al. (2020); Dorfman and Levy (2022); Roy et al. (2022) developed Markov chain SGD methods for convex and nonconvex stochastic optimization problems. Moreover, Sun et al. (2020) developed a Markov chain sampling-based block coordinate descent method. Sun et al. (2023) proposed a decentralized variant of Markov chain SGD. Wang et al. (2022) considered the stability of Markov chain SGD and deduced its generalization bounds. Doan (2023) derived convergence guarantees on Markov chain SGD without a smoothness assumption, and Even (2023) studied convergence of Markov chain SGD without the bounded gradient assumption.

2 Preliminaries

In this section, we provide problem formulations for stochastic-constrained stochastic and online convex optimization under the Markovian data sampling regime. In addition, Section 2.1 gives the formal definition of the mixing time of a Markov chain. Section 2.4 describes a list of assumptions considered throughout the paper.

2.1 Markov Chain and Mixing Time

Given two probability distributions P, Q over the probability space (S, F), the total variation distance between them is defined as

∥P − Q∥_TV := sup_{A∈F} |P(A) − Q(A)|.

Let {ξt}_{t=1}^∞ be a time-homogeneous ergodic Markov chain with a finite state space S. For a distribution ν over (S, F), we denote by P^t(ν, ·) the conditional probability distribution of ξ_{t+1} given ξ1 ∼ ν. Since {ξt}_{t=1}^∞ is ergodic, it has a unique stationary distribution µ, i.e., P^t(µ, ·) = µ(·). In fact, the long-term distribution of the ergodic Markov chain converges to µ regardless of the initial distribution, which can be demonstrated as follows. It is known that P^t(ν, ·) for any t and ν satisfies ∥P^t(ν, ·) − µ∥_TV ≤ Cα^t for some α ∈ (0, 1) and C > 0 (Levin and Peres, 2017). Then we define the quantities dmix(t) and τmix(ϵ) as follows:

dmix(t) := sup_ν ∥P^t(ν, ·) − µ∥_TV,  τmix(ϵ) := inf{t ∈ N : dmix(t) ≤ ϵ}.

Moreover, following the convention, we define τmix as τmix := τmix(1/4), and we refer to τmix as the mixing time of the underlying Markov chain. It is known that dmix(l·τmix) ≤ 2^{−l} for every l ∈ N (Levin and Peres, 2017, Chapter 4), which implies that τmix(ϵ) ≤ ⌈log₂ ϵ^{−1}⌉ τmix. In particular, throughout the paper, we will use the quantity τ defined as

τ := τmix(1/T) = O(τmix log T)

where T is the length of the time horizon.

2.2 Stochastic-Constrained Online Convex Optimization with Markovian Data

Let X ⊂ R^d be a compact convex domain. Let {ft : X → R}_{t=1}^∞ be a sequence of arbitrary convex functions.
Another sequence of convex functions {gt : X → R}_{t=1}^∞ is assumed to follow an ergodic Markov chain. More precisely, there exists a time-homogeneous ergodic Markov chain {ξt}_{t=1}^∞ such that gt(x) = g(x, ξt). Let ḡ(x) = E_{ξ∼µ}[g(x, ξ)] for x ∈ X where the expectation is taken with respect to the stationary distribution µ of the Markov chain. Then the problem is to solve and compute

x* ∈ argmin_{x∈X} ∑_{t=1}^T ft(x)  s.t.  ḡ(x) ≤ 0.

However, the information about the functions {ft}_{t=1}^T and {gt}_{t=1}^T is revealed online. Basically, at each step t, we choose our decision xt ∈ X before observing ft, gt, after which we receive feedback about them. The stochastic setting studied in Yu et al. (2017) is that the constraint functions g1, ..., gT are i.i.d., which means that ξ1, ..., ξT are i.i.d., while we consider the case where the constraint functions g1, ..., gT follow an ergodic Markov chain and thus are dependent. That said, we may refer to the problem setting as online convex optimization with ergodic constraints.

To measure the performance of a learning algorithm for the online optimization problem, we adopt the regret and cumulative constraint violation definition due to Yu et al. (2017). The regret and cumulative constraint violation of an algorithm that generates solutions x1, ..., xT over T time steps are given by

Regret(T) = ∑_{t=1}^T ft(xt) − ∑_{t=1}^T ft(x*),  Violation(T) = ∑_{t=1}^T gt(xt).

We want these quantities to have a sublinear growth in T. Note that the benchmark solution x* satisfies the expectation constraint E_{ξ∼µ}[g(x*, ξ)] ≤ 0.

2.3 Stochastic-Constrained Stochastic Optimization with Markovian Data

Next, we consider the setting where the objective functions {ft}_{t=1}^∞ as well as the constraint functions {gt}_{t=1}^∞ are given by an ergodic Markov chain. Without loss of generality, we may assume that ft(x) = f(x, ξt) and gt(x) = g(x, ξt) for some time-homogeneous ergodic Markov chain {ξt}_{t=1}^∞ with a stationary distribution µ. As before, let f̄(x) = E_{ξ∼µ}[f(x, ξ)] and ḡ(x) = E_{ξ∼µ}[g(x, ξ)] for x ∈ X. Then the problem is to solve and compute

x# ∈ argmin_{x∈X} f̄(x)  s.t.  ḡ(x) ≤ 0.

Here, we do not have direct access to (f̄, ḡ), but we receive samples (ft, gt) which converge to (f̄, ḡ) in expectation. For the performance measure of a learning algorithm for this problem, we consider the following standard notions of optimality gap and constraint violation. For a solution x ∈ X,

Gap(x) = f̄(x) − f̄(x#),  Infeasibility(x) = ḡ(x).

We want these quantities to approach 0 as T grows. Hereinafter, we refer to Gap(x) and Infeasibility(x) as the optimality gap and the feasibility gap, respectively.

In the special case where {ξt}_{t=1}^∞ are i.i.d. samples drawn from µ, solving stochastic-constrained online convex optimization would provide a solution to stochastic-constrained stochastic optimization. One can argue that if ξ1, ..., ξT are i.i.d., then

Gap(x̄T) ≤ (1/T) E[Regret(T)],  Infeasibility(x̄T) ≤ (1/T) E[Violation(T)]

where x̄T denotes the simple average of x1, ..., xT:

x̄T = (1/T) ∑_{t=1}^T xt.

However, in contrast to the i.i.d. case, the above inequalities do not hold for the case of an arbitrary ergodic Markov chain. This is because the distribution of ξ_{t+1} conditioned on ξt is not equal to the stationary distribution µ. On the other hand, we will use the fact that the long-term distribution of an ergodic Markov chain converges to its stationary distribution.
Based on this observation, we first bound the expected regret and the expected constraint violation of the online problem, and then we use this to bound the optimality gap and the feasibility gap of the stochastic optimization problem.

2.4 Notations and Assumptions

We work over a norm ∥·∥ and its dual norm ∥·∥∗. We choose a convex mirror map Φ : C → R, where C ⊆ R^d is a convex domain containing X. We use the corresponding Bregman divergence defined as

D(x, y) = Φ(x) − Φ(y) − ∇Φ(y)^⊤(x − y).

Assumption 1  There is a constant R > 0 such that D(x, y) ≤ R² for any x, y ∈ X, and Φ is 2-strongly convex with respect to the norm ∥·∥, i.e., ∥x − y∥² ≤ D(x, y) for any x, y ∈ X. Moreover, f(·, ξ) and g(·, ξ) are differentiable for each ξ ∈ S.

We use notations Ft, Gt, Ht for t ∈ [T] given by

Ft := ∥∇ft(xt)∥∗,  Gt := ∥∇gt(xt)∥∗,  Ht := |gt(xt)|,

which are parameters used in our adaptive algorithm. Due to the stochasticity of ft, gt, these are random variables. We assume these quantities are bounded.

Assumption 2  There exist constants F, G, H > 0 such that

∥∇_x f(x, ξ)∥∗ ≤ F,  ∥∇_x g(x, ξ)∥∗ ≤ G,  |g(x, ξ)| ≤ H

for any x ∈ X and ξ ∈ S.

Assumptions 1 and 2 are common in the online convex optimization, stochastic optimization, and Markov chain SGD literature. The following assumption is referred to as Slater's condition or Slater's constraint qualification for constrained optimization.

Assumption 3 (Slater's condition)  There exists x̂ ∈ X such that ḡ(x̂) ≤ −ϵ for some ϵ > 0.

For online convex optimization with stochastic i.i.d. constraint functions, Slater's condition leads to improvement in regret and constraint violation (Yu et al., 2017). In Section 3, we will show that even for online convex optimization with ergodic constraint functions, assuming that Slater's condition holds results in an improvement in the regret and constraint violation of Algorithm 1.

Algorithm 1 Ergodic Drift-Plus-Penalty (EDPP)
Initialize: Initial iterates x1 ∈ X, Q1 = 0, and 0 < β ≤ 1/2.
for t = 1 to T do
    Observe ft and gt.
    Set the penalty parameter Vt and the step size parameter αt as Vt = (τmix t)^β, αt = τmix t.
    Primal update: Set xt+1 as
        xt+1 = argmin_{x∈X} {(Vt∇ft(xt) + Qt∇gt(xt))^⊤ x + αt D(x, xt)}.
    Dual update: Set Qt+1 as
        Qt+1 = [Qt + gt(xt) + ∇gt(xt)^⊤(xt+1 − xt)]₊, where [z]₊ = max{z, 0}.
end for

3 Known Mixing Time

We first focus on the setting where we have access to the mixing time τmix of the underlying Markov chain {ξt}_{t=1}^∞. Duchi et al. (2012) studied this case for stochastic convex minimization without a stochastic functional constraint, for which they modified the step size of the stochastic gradient descent method based on the mixing time parameter τmix. Inspired by this approach, we take and modify the drift-plus-penalty (DPP) algorithm due to Neely and Yu (2017); Yu et al. (2017) developed for stochastic-constrained online convex optimization. Based on DPP, we develop our algorithm by setting the algorithm parameters properly to adapt to the mixing time τmix.

3.1 Ergodic Drift-Plus-Penalty

The DPP algorithm has two parameters, V and α, where V is the penalty parameter and α determines the step size. Yu et al. (2017) set V = √T and α = T. In contrast to the vanilla DPP algorithm, our algorithm uses parameters

Vt = (τmix t)^β,  αt = τmix t

for iterations t = 1, ..., T, where T is the length of the horizon and β is another algorithm parameter that controls the balance between the regret and the constraint violation.
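To make the two updates concrete, the following is a minimal sketch of the EDPP loop for the Euclidean setup Φ(x) = ∥x∥², in which D(x, y) = ∥x − y∥² and the primal update reduces to a projected gradient step. The oracle names proj_X, grad_f, grad_g, and g_val are placeholders (not the paper's notation) for a projection onto X and for the sampled gradients and constraint values observed along the Markov chain.

```python
import numpy as np

def edpp(x1, T, tau_mix, beta, grad_f, grad_g, g_val, proj_X):
    """Sketch of Algorithm 1 (EDPP) with D(x, y) = ||x - y||^2.

    grad_f(x, t), grad_g(x, t), g_val(x, t) are hypothetical oracles
    returning nabla f_t(x), nabla g_t(x), g_t(x); proj_X projects onto X.
    """
    x, Q = np.asarray(x1, dtype=float), 0.0
    iterates = [x.copy()]
    for t in range(1, T + 1):
        V_t = (tau_mix * t) ** beta   # penalty parameter
        alpha_t = tau_mix * t         # step size parameter
        gf, gg, gv = grad_f(x, t), grad_g(x, t), g_val(x, t)
        # primal update: the minimizer of (V_t*gf + Q*gg)^T x + alpha_t*||x - x_t||^2
        # over X is the projection of x_t - (V_t*gf + Q*gg)/(2*alpha_t) onto X
        x_next = proj_X(x - (V_t * gf + Q * gg) / (2.0 * alpha_t))
        # dual update: virtual queue driven by the linearized constraint
        Q = max(Q + gv + float(gg @ (x_next - x)), 0.0)
        x = x_next
        iterates.append(x.copy())
    return np.mean(iterates[:-1], axis=0)  # averaged solution corresponding to x̄_T
```

The averaged iterate returned at the end corresponds to the solution x̄T analyzed in Theorem 2 below.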
Our algorithm, which we call ergodic drift-plus-penalty (EDPP), is described in Algorithm 1. First, the mixing time τmix is now part of the parameters Vt and αt, as in the ergodic mirror descent algorithm by Duchi et al. (2012). Second, Vt and αt are time-varying, so our algorithm is adaptive (Jenatton et al., 2016) and oblivious to the length of the time horizon T. One may wonder why we do not use T instead of t, i.e., V = (τmix T)^β and α = τmix T. In fact, our numerical results, which will be presented in Section 5, demonstrate that Algorithm 1 with the adaptive parameters Vt and αt outperforms the algorithm with fixed parameters V = (τmix T)^β and α = τmix T. However, the performance analysis of the vanilla DPP algorithm by Yu et al. (2017) does not immediately extend to such adaptive parameters. Hence, we prove that the DPP framework with time-varying parameters still achieves the desired regret and constraint violation guarantees.

Let us also briefly explain how the DPP framework initially developed by Yu et al. (2017) as well as our Algorithm 1 works. We may regard Qt as the size of a virtual queue at time t. Then we consider the associated quadratic Lyapunov term Lt = Qt²/2 and study the corresponding drift given by ∆t = L_{t+1} − Lt = (Q_{t+1}² − Qt²)/2. It is not difficult to see that

∆t ≤ Qt(gt(xt) + ∇gt(xt)^⊤(xt+1 − xt)) + (1/2)(Ht + GtR)²

holds (Lemma 19). Here, the upper bound on the drift ∆t has the term Qt∇gt(xt)^⊤(xt+1 − xt) that depends on the next iterate xt+1. Hence, by choosing xt+1 to minimize Qt∇gt(xt)^⊤(xt+1 − xt), we may attempt to control the drift. In fact, the primal update sets xt+1 to be the minimizer of

Qt∇gt(xt)^⊤(x − xt) (the drift term) + Vt∇ft(xt)^⊤(x − xt) (the penalty term) + αt D(x, xt)

over X. Consequently, at each iteration, we get to choose a solution that minimizes the drift term ∆t and a penalty term for controlling the objective simultaneously.

3.2 Performance Guarantees of Ergodic Drift-Plus-Penalty

First, we analyze the regret and constraint violation of EDPP for the constrained online convex optimization setting under the Markovian sampling regime, formulated in Section 2.2. Recall that τmix = τmix(1/4) and that the parameter τ is defined as τ = τmix(T^{−1}), which satisfies τ ≤ ⌈log₂ T⌉ τmix.

Theorem 1  Suppose that Assumptions 1 and 2 hold. Then for online convex optimization with ergodic constraints, Algorithm 1 achieves

E[Regret(T)] = O( τmix^{−β} τ T^{1−β} ),
E[Violation(T)] = O( τmix^{β/2} T^{(β+1)/2} + √((τ − 1)T) )

where the expectation is taken with respect to the randomness in running the algorithm.

One of the key components of the analysis for proving Theorem 1 is that we bound the expected size of the virtual queue Qt at time t as follows:

E[Qt] = O( τmix^{β/2} T^{(β+1)/2} + √((τ − 1)T) ).

Another key part is to analyze the term

∑_{t=1}^T E[Qt gt(x)]

where the function gt is not independent of the virtual queue size Qt under our Markovian sampling regime. In contrast, if g1, ..., gT were i.i.d., then E[Qt gt(x)] = E[Qt ḡ(x)] would hold. Instead, we relate the term to

∑_{t=1}^{T−τ+1} E[Qt g_{t+τ−1}(x)]

and use the intuition that the distribution of the ergodic Markov chain after τ steps is close to its stationary distribution. We provide a bound on the term in Lemma 11. Similarly, we also need to analyze the term

∑_{t=1}^T E[(Qt/Vt) gt(x)],

and an upper bound on the sum is given in Lemma 12.
Based on these observations, we provide an upper bound on the expected virtual queue size E[Qt], which is given in Lemma 13. Then we apply the performance analysis results of the general adaptive drift-plus-penalty method provided in Appendix A. The complete proof of Theorem 1 is given in Section 6.1.

Next, we analyze the performance of EDPP on the stochastic-constrained stochastic optimization problem under Markovian data sampling, whose formulation is given by (SCSO).

Theorem 2  Suppose that Assumptions 1 and 2 hold. Then for stochastic-constrained stochastic optimization (SCSO), Algorithm 1 guarantees that

E[Gap(x̄T)] = O( τ/(τmix^β T^β) + (τ − 1)/(τmix^{1−β/2} T^{(1−β)/2}) + (τ − 1)^{3/2}/(τmix T^{1/2}) + τ/T ),
E[Infeasibility(x̄T)] = O( τ/(τmix^{1−β/2} T^{(1−β)/2}) + (τ − 1)^{3/2}/(τmix T^{1/2}) + τ/T )

where x̄T = (1/T) ∑_{t=1}^T xt and the expectation is taken with respect to the randomness in running the algorithm.

Note that Theorem 1 implies

(1/T) E[Regret(T)] = O( τ/(τmix^β T^β) ),

but the bound on the optimality gap given in Theorem 2 has additional terms due to the difference between the stationary distribution of the ergodic Markov chain and the distribution of f_{t+1} conditioned on ft. In fact, under Markovian sampling, we have

∑_{t=1}^T E[ft(xt)] ≠ ∑_{t=1}^T E[f̄(xt)].

Here, to get around this issue, we also use the intuition that E[f̄(xt)] is close to E[f_{t+τ−1}(xt)] as the distribution of the Markov chain after τ steps is close to its stationary distribution. The proof of Theorem 2 is given in Section 6.2. Moreover, since τ = Õ(τmix), it follows that

E[Regret(T)] = Õ( τmix^{1−β} T^{1−β} ),  E[Violation(T)] = Õ( τmix^{β/2} T^{(β+1)/2} + √(τmix T) ),
E[Gap(x̄T)] = Õ( τmix^{1−β}/T^β + τmix^{β/2}/T^{(1−β)/2} + τmix/T ),
E[Infeasibility(x̄T)] = Õ( τmix^{β/2}/T^{(1−β)/2} + √τmix/√T + τmix/T )

where the Õ hides a log T factor. In particular, if we set β = 1/3, then we have E[Regret(T)] = Õ(τmix^{2/3} T^{2/3}), E[Violation(T)] = O(τmix^{1/6} T^{2/3} + τmix^{1/2} T^{1/2}), E[Gap(x̄T)] = Õ(τmix^{2/3} T^{−1/3} + τmix T^{−1}), and E[Infeasibility(x̄T)] = O(τmix^{1/6} T^{−1/3} + τmix^{1/2} T^{−1/2} + τmix T^{−1}).

Furthermore, observe that if {ξt}_{t=1}^∞ is a sequence of i.i.d. random variables, then we have τmix = τ = 1. In this case, by Theorems 1 and 2, Algorithm 1 guarantees that E[Regret(T)] = O(T^{1−β}), E[Violation(T)] = O(T^{(β+1)/2}), E[Gap(x̄T)] = O(T^{−β}), and E[Infeasibility(x̄T)] = O(T^{−(1−β)/2}) for any β ∈ (0, 1/2], which recovers the result of Jenatton et al. (2016).

If we further assume that Slater's constraint qualification holds, we can argue that we get better control on the size of E[Qt]. Note that the upper bound on the expected queue size given by Lemma 13 is E[Qt] = O(τmix^{β/2} t^{(β+1)/2}), which holds regardless of whether Slater's condition holds or not. On the other hand, we will argue that under Slater's condition, we have

E[Qt] = O( τ(τ + √t)/√τmix ) = Õ( √(τmix t) ).

This is consistent with Yu et al. (2017) as they proved that E[Qt] = O(√T) for the i.i.d. setting. In fact, our proof for bounding E[Qt] is more involved than the argument of Yu et al. (2017) because we use adaptive step sizes for Algorithm 1 and consider non-i.i.d. constraint functions. This leads to improvements as stated in the following result.

Theorem 3  Suppose that Assumptions 1–3 hold. Then for online convex optimization with ergodic constraints and stochastic-constrained stochastic optimization, Algorithm 1 with β = 1/2 guarantees
E[Regret(T)] = O( τ√T/√τmix ),  E[Violation(T)] = O( τ(τ + √T)/√τmix ),
E[Gap(x̄T)] = O( τ²/(τmix^{3/2} √T) + τ^{5/2}/(τmix^{3/2} T) ),
E[Infeasibility(x̄T)] = O( τ²/(τmix^{3/2} √T) + τ^{5/2}/(τmix^{3/2} T) + τ²/(τmix^{1/2} T^{3/2}) ),

where the expectation is taken with respect to the randomness in running the algorithm.

Our analysis takes into account the time-varying algorithm parameters Vt and αt as well as the fact that the functions are correlated according to a Markov chain. To consider this, we prove Lemmas 14 and 15 that lead to time-varying bounds on the expected virtual queue size. The complete proof of Theorem 3 is included in Section 6.3. Since τ = Õ(τmix),

E[Regret(T)] = Õ( √(τmix T) ),  E[Violation(T)] = Õ( √(τmix T) + τmix^{3/2} ),
E[Gap(x̄T)] = Õ( √(τmix/T) + τmix/T ),  E[Infeasibility(x̄T)] = Õ( √(τmix/T) + τmix/T + τmix^{3/2}/T^{3/2} ).

4 Unknown Mixing Time

Next, we study the setting where the mixing time τmix is not observable. Even if we do not know the mixing time of the underlying Markov chain, we would still want to provide a learning algorithm with performance guarantees of similar orders. To achieve this goal, we develop yet another variant of the drift-plus-penalty algorithm which incorporates the multi-level Monte Carlo (MLMC) gradient estimation scheme (Giles, 2015; Blanchet and Glynn, 2015; Dorfman and Levy, 2022). Dorfman and Levy (2022) first introduced the approach of combining stochastic gradient descent with the MLMC gradient estimation framework for stochastic optimization with no stochastic functional constraint. For the case of known mixing time, we may update the step size based on the mixing time τmix to achieve an optimal dependence on the parameter. When τmix is not known, Dorfman and Levy (2022) used AdaGrad-based adaptive step sizes (Duchi et al., 2011; Levy, 2017; Ward et al., 2019). We take this idea to develop an AdaGrad variant of the drift-plus-penalty algorithm for SCSO, described in Algorithm 2, that incorporates the MLMC gradient estimation framework. The AdaGrad version of DPP itself is of independent interest.

4.1 Multi-Level Monte Carlo Sampling

The idea behind the multi-level Monte Carlo estimation scheme is to obtain many consecutive samples from an ergodic Markov chain and take their average. At the same time, we may control the expected number of consecutive samples required for each time step by O(log T). More precisely, for each time step t, we sample Nt consecutive states ξt^{(1)}, ..., ξt^{(Nt)}, where Nt itself is a random variable given by

Nt = Ñt if Ñt ≤ T², and Nt = 1 otherwise,

and Ñt = 2^{Jt} with Jt ∼ Geom(1/2). Note that in our case the truncation threshold on Ñt is T², while it was set to T in Dorfman and Levy (2022). With this sampling strategy, we define Ft as the σ-field

Ft = σ( {N1, ..., Nt} ∪ ⋃_{s=1}^t {ξs^{(1)}, ..., ξs^{(Ns)}} ).

Let Et[·] denote the conditional expectation with respect to Ft, i.e., Et[·] = E[· | Ft]. Next, for t ≥ 1 and N ≥ 1, we define

ft^N(x) := (1/N) ∑_{i=1}^N f(x, ξt^{(i)}),  gt^N(x) := (1/N) ∑_{i=1}^N g(x, ξt^{(i)}).

Based on this, we define the MLMC estimators of f and g as follows:

(ft, gt) = (ft^1, gt^1) + Nt((ft^{Nt}, gt^{Nt}) − (ft^{Nt/2}, gt^{Nt/2})) if Nt > 1, and (ft, gt) = (ft^1, gt^1) otherwise.

Basically, the functions ft and gt are obtained after applying the MLMC estimation scheme to the underlying ergodic Markov chain.
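As a concrete illustration, here is a minimal sketch of the MLMC estimator for the function values at a fixed point x. The names draw_next, f, and g are hypothetical oracles (the next Markov-chain state and the sampled loss and constraint evaluations), and the gradient estimators are formed in exactly the same way; the geometric-sampling convention P(Jt = j) = 2^{−j} for j ≥ 1 is an assumption consistent with Jt ∼ Geom(1/2).

```python
import numpy as np

rng = np.random.default_rng(0)

def mlmc_pair(draw_next, f, g, x, T):
    """Sketch of the MLMC estimates of (f̄(x), ḡ(x)) for one time step t.

    draw_next() returns the next state of the underlying ergodic Markov
    chain; f(x, xi) and g(x, xi) evaluate the sampled loss and constraint.
    """
    J = int(rng.geometric(0.5))            # J_t ~ Geom(1/2), P(J = j) = 2^{-j}
    N = 2 ** J if 2 ** J <= T ** 2 else 1  # N_t, truncated at T^2
    xis = [draw_next() for _ in range(N)]  # N consecutive chain states
    f_vals = np.array([f(x, xi) for xi in xis])
    g_vals = np.array([g(x, xi) for xi in xis])
    f_est, g_est = f_vals[0], g_vals[0]    # (f_t^1, g_t^1)
    if N > 1:
        # correction term N * ((f^N, g^N) - (f^{N/2}, g^{N/2})),
        # where f^{N/2} averages the first N/2 states of the same trajectory
        f_est += N * (f_vals.mean() - f_vals[: N // 2].mean())
        g_est += N * (g_vals.mean() - g_vals[: N // 2].mean())
    return f_est, g_est
```

The first-half average plays the role of ft^{Nt/2}, so in expectation the telescoping levels reproduce the deep average ft^{2^{j_max}} while only O(log T) samples are consumed per step on average.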
One thing to note, however, is that ft and gt are not necessarily convex anymore (Dorfman and Levy, 2022). To remedy this issue, what we can argue instead is that E_{t−1}[ft] and E_{t−1}[gt] are convex. Based on (Dorfman and Levy, 2022, Lemma 3.1), we deduce the following lemma.

Lemma 4  Let j_max := max{j ∈ N : 2^j ≤ T²} = ⌊2 log₂ T⌋. Then for each t,

E_{t−1}[ft] = E_{t−1}[ft^{2^{j_max}}],  E_{t−1}[∇ft] = E_{t−1}[∇ft^{2^{j_max}}],
E_{t−1}[gt] = E_{t−1}[gt^{2^{j_max}}],  E_{t−1}[∇gt] = E_{t−1}[∇gt^{2^{j_max}}].

Moreover, we have

E[∥∇ft(xt)∥∗²] = Õ(F² τmix),  E[∥∇gt(xt)∥∗²] = Õ(G² τmix),  E[|gt(xt)|²] = Õ(H² τmix).

Lastly, the expected number of samples for time step t satisfies E[Nt] ≤ 2 log₂ T + 1.

The reason for setting the truncation threshold on Ñt to T² instead of T is to achieve high accuracy of estimation for ft, gt, ∇ft, and ∇gt that leads to the desired performance guarantees of Algorithm 2. To be specific, we use the following estimation bounds based on (Dorfman and Levy, 2022, Lemma A.6).

Lemma 5  There exists C(T) > 0 with

C(T) = O( log(T) log^{1/2}(τmix T² log(T)) )

such that

E_{t−1}[ |ft^{2^{j_max}}(x) − f̄(x)|² ] ≤ C(T)² τmix/T²,
E_{t−1}[ |gt^{2^{j_max}}(x) − ḡ(x)|² ] ≤ C(T)² τmix/T²,
E_{t−1}[ ∥∇ft^{2^{j_max}}(x) − ∇f̄(x)∥∗² ] ≤ C(T)² τmix/T²,
E_{t−1}[ ∥∇gt^{2^{j_max}}(x) − ∇ḡ(x)∥∗² ] ≤ C(T)² τmix/T²

hold for any x ∈ X that is measurable with respect to F_{t−1} and any t ∈ [T]. The complete proof of this lemma is given in Appendix E.

Algorithm 2 MLMC Adaptive Drift-Plus-Penalty (MDPP)
Initialize: Initial iterates x1 ∈ X, Q1 = 0 and parameters 0 < β ≤ 1/2, δ > 0.
for t = 1 to T do
    Observe ft and gt via the MLMC method.
    Set the penalty parameter Vt and the step size parameter αt as in (1).
    Primal update: Set xt+1 as
        xt+1 = argmin_{x∈X} {(Vt∇ft(xt) + Qt∇gt(xt))^⊤ x + αt D(x, xt)}.
    Dual update: Set Qt+1 as
        Qt+1 = [Qt + gt(xt) + ∇gt(xt)^⊤(xt+1 − xt)]₊.
end for

4.2 Adaptive Drift-Plus-Penalty

The second component of our algorithm for the unknown mixing time setting is the AdaGrad variant of the drift-plus-penalty algorithm. To develop AdaGrad-style step sizes, let us define the following sequence of parameters. For some positive constant δ > 0, we set

a0 = S0 = δ,  at := Ft²/4 + R²Gt² + Ht² + δ,  St := δ + ∑_{s=1}^t as

for t ≥ 1. Here, we may choose any positive number for δ. Recall that Ft = ∥∇ft(xt)∥∗, Gt = ∥∇gt(xt)∥∗, and Ht = |gt(xt)|. Based on these parameters, we set the algorithm parameters Vt and αt as follows:

Vt = S_{t−1}^β / R,  αt = S_{t−1} / R²    (1)

for some 0 < β ≤ 1/2. Here, the penalty parameter Vt and the step size parameter αt can be chosen without knowledge of the global upper bounds F, G, H on Ft, Gt, Ht.

Now we are ready to describe our algorithm, which we call the MLMC adaptive drift-plus-penalty (MDPP) algorithm. Here, MDPP is a combination of the AdaGrad-style adaptive drift-plus-penalty algorithm with the MLMC estimator presented in the previous subsection. More importantly, the algorithm is designed to solve constrained online convex optimization where the MLMC estimators f1, ..., fT are the objective loss functions and the MLMC estimators g1, ..., gT are the constraint functions. The following lemma provides an adaptive regret guarantee and an adaptive constraint violation bound for Algorithm 2 for the associated online convex optimization.

Lemma 6  Suppose that Assumptions 1 and 2 hold. Then for the constrained online convex optimization problem where the MLMC estimators f1, ..., fT are the objective loss functions and the MLMC estimators g1, ..., gT are the constraint functions, Algorithm 2 achieves the following.
For any x ∈ X with ḡ(x) ≤ 0, we have

E[ ∑_{t=1}^T ft(xt) − ∑_{t=1}^T ft(x) ]
= Õ( E[ST]^{1−β} + τmix^{1/2} E[ST]^{1/2−β} + τmix^{1/2} T^{1/4} E[ST]^{1/4−β/2} + τmix^{1/2} T^{1−β} ),

E[ ∑_{t=1}^T gt(xt) ]
= Õ( E[ST]^{1/2} + T^{1/4} E[ST]^{β/2+1/4} + (τmix^{1/2} + E[ST] + T^{1/2} E[ST]^{β+1/2}) / (τmix^{β/2+1/4} T^{β/2+1/2}) + (τmix^{1/2} + E[ST]^{1/2} + E[ST]^{β/2+1/4}) / (τmix^{β/2+1/4} T^{β/2+1/2}) + (log ST)² τmix^{β/2−1/4} T^{β/2+1/2} ).

Recall that the parameters Vt and αt depend on the parameter δ. We may choose any positive number for δ, but its choice does affect the performance of Algorithm 2. Although the bounds given in Lemma 6 do not exhibit an explicit dependence on δ, our proof of Lemma 6 in Section 7.2 reveals that increasing δ increases the objective gap E[∑_{t=1}^T ft(xt) − ∑_{t=1}^T ft(x)] and decreases the constraint violation E[∑_{t=1}^T gt(xt)]. Likewise, decreasing δ decreases the objective gap and increases the constraint violation.

Lemma 4 implies that E[ST] = Õ(τmix T). Plugging this bound on E[ST] into the adaptive performance guarantees given in Lemma 6, we deduce the following result.

Proposition 7  Suppose that Assumptions 1 and 2 hold. Then for the constrained online convex optimization problem where the MLMC estimators f1, ..., fT are the objective loss functions and the MLMC estimators g1, ..., gT are the constraint functions, Algorithm 2 achieves the following. For any x ∈ X with ḡ(x) ≤ 0, we have

E[ ∑_{t=1}^T ft(xt) − ∑_{t=1}^T ft(x) ] = Õ( τmix^{1−β} T^{1−β} ),
E[ ∑_{t=1}^T gt(xt) ] = Õ( τmix^{β/2+1/4} T^{β/2+1/2} + τmix^{3/4−β/2} T^{1/2−β/2} ).

We remark that the performance bounds given in Lemma 6 and Proposition 7 are not comparable to the regret and constraint violation bounds for stochastic-constrained stochastic optimization. When each pair of loss and constraint functions for stochastic-constrained stochastic optimization corresponds to a single data point, the associated regret and constraint violation measures are given by

Regret(T) = ∑_{t=1}^T ∑_{j=1}^{Nt} ft^{(j)}(xt) − ∑_{t=1}^T ∑_{j=1}^{Nt} ft^{(j)}(x*),
Violation(T) = ∑_{t=1}^T ∑_{j=1}^{Nt} gt^{(j)}(xt).

Here, ft^{(1)}, ..., ft^{(Nt)} are the Nt sampled functions from which we derive the MLMC estimator ft for t ∈ [T]. In contrast, Lemma 6 and Proposition 7 analyze the performance of Algorithm 2 on the sequence of the MLMC estimators. Despite this, it would be an interesting question to understand the performance of Algorithm 2 for the latter online convex optimization setting where each function pair corresponds to a single data point.

The proof of Lemma 6 is given in Section 7. One of the main components of the analysis is to provide adaptive bounds on the terms E[Qt] and E[Qt/Vt]. This is possible thanks to our subtle choice of parameters Vt and αt. It turns out that the AdaGrad-style analysis for drift-plus-penalty is more sophisticated than that for unconstrained stochastic gradient descent.

Finally, we state the following theorem providing upper bounds on the optimality gap and the feasibility gap under Algorithm 2 for SCSO. The argument is to use the results of Proposition 7 and the estimation error bounds due to Lemma 5. In contrast to the setting of Section 3, for which we had to rely on the mixing property of ergodic Markov chains directly, the MLMC estimators are already close to f̄ and ḡ.
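As an implementation note, the parameter schedule (1) driving Algorithm 2 can be maintained with a few lines of state. The following is a minimal sketch under that schedule; the class and method names are illustrative rather than the paper's notation, and F_t, G_t, H_t denote ∥∇ft(xt)∥∗, ∥∇gt(xt)∥∗, |gt(xt)| computed from the MLMC estimators at the current iterate.

```python
class AdaptiveDppParams:
    """Sketch of the AdaGrad-style schedule (1): V_t = S_{t-1}^beta / R,
    alpha_t = S_{t-1} / R^2, with S_t accumulated from observed quantities."""

    def __init__(self, R, beta, delta):
        self.R, self.beta, self.delta = R, beta, delta
        self.S = delta  # S_0 = delta

    def next_params(self):
        # V_t and alpha_t depend only on S_{t-1}, so they are available
        # before the t-th MLMC estimates are observed
        V_t = self.S ** self.beta / self.R
        alpha_t = self.S / self.R ** 2
        return V_t, alpha_t

    def update(self, F_t, G_t, H_t):
        # a_t = F_t^2/4 + R^2 G_t^2 + H_t^2 + delta, accumulated into S_t
        self.S += F_t ** 2 / 4 + self.R ** 2 * G_t ** 2 + H_t ** 2 + self.delta
```

Calling next_params before each primal-dual update and update after observing the new gradients reproduces the schedule without any knowledge of the global bounds F, G, H or of τmix.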
Theorem 8  Suppose that Assumptions 1 and 2 hold. Then for stochastic-constrained stochastic optimization (SCSO), Algorithm 2 guarantees that

E[Gap(x̄T)] = Õ( τmix^{1−β} / T^β ),
E[Infeasibility(x̄T)] = Õ( τmix^{(2β+1)/4} / T^{(1−β)/2} + τmix^{(3−2β)/4} / T^{(β+1)/2} ).

5 Numerical Experiments

We examine the performance of the ergodic drift-plus-penalty algorithm (Algorithm 1) for the known mixing time case and the MLMC adaptive drift-plus-penalty algorithm (Algorithm 2) for the unknown mixing time case on a linear classification problem with fairness constraints using synthetic data. We follow the experimental setup of Zafar et al. (2019) for creating data points and sensitive features (with ϕ = π/2) and for imposing fairness constraints. To make our Markov chain more meaningful and sufficiently complex, we experiment with a 3-state chain instead of the 2-state chain used by Dorfman and Levy (2022). The 3-state Markov chain is given by the following transition matrix

[ 1 − 2p    p         p      ]
[ p         1 − 2p    p      ]
[ p         p         1 − 2p ]

which has stationary distribution (1/3, 1/3, 1/3).

Theorem 9 (Levin and Peres, 2017, Theorems 12.4 and 12.5)  For an ergodic and reversible Markov chain with n states, whose transition matrix is P, let 1 = λ1 > λ2 ≥ ... ≥ λn be the eigenvalues of P and µmin be the minimum entry of the stationary distribution. Then, the mixing time of the chain satisfies

( |λ2| / (1 − max{|λ2|, |λn|}) ) log 2 ≤ τmix ≤ ( 1 / (1 − max{|λ2|, |λn|}) ) log(4/µmin).

Then it follows that the mixing time of the 3-state Markov chain satisfies

((1 − 3p)/(3p)) log 2 ≤ τmix ≤ (1/(3p)) log 12.

In our experiment, we used 1/(3p) as an approximation of the mixing time.

For each set of data points, we generate two clusters, each of which has 1,000 data points in R² sampled from a multivariate normal distribution. All data points from a cluster have the same label in {−1, 1}. We denote the index set corresponding to each state j ∈ {1, 2, 3} as Dj and the whole index set as D = D1 ∪ D2 ∪ D3. The data clusters are drawn in Figure 1. The color corresponding to each state and label is summarized in Table 2.

Figure 1: Data Points

| label \ state | 1      | 2      | 3     |
| 1             | purple | blue   | green |
| −1            | yellow | orange | red   |

Table 2: Colors for (State, Label) Pairs

We then generate a binary sensitive feature zi ∈ {0, 1} randomly for each data point xi ∈ R^d, e.g., gender. We want the binary sensitive feature of the data points to have low covariance with the results from our classifier. More details about how to create sensitive features are included in the supplement. We use logistic regression classifiers with the loss functions

fj(w, b) = (1/|Dj|) ∑_{i∈Dj} log(1 + e^{−yi(w^⊤xi + b)})

and constraint functions

gj(w, b) = (1/|Dj|) ∑_{i∈Dj} (zi − z̄)(w^⊤xi + b) − c,
hj(w, b) = −(1/|Dj|) ∑_{i∈Dj} (zi − z̄)(w^⊤xi + b) − c

for each state j, where z̄ = ∑_{i∈D} zi / |D| and c > 0. Then the stochastic-constrained stochastic optimization problem is to minimize the usual logistic regression loss function under fairness constraints, which was proposed by Zafar et al. (2019), as follows:

min_{(w,b)∈X} (1/|D|) ∑_{i∈D} log(1 + e^{−yi(w^⊤xi + b)})
s.t.  −c ≤ (1/|D|) ∑_{i∈D} (zi − z̄)(w^⊤xi + b) ≤ c.

Solving this problem with our framework can be viewed as a distributed optimization scheme of Ram et al. (2009b) and Johansson et al. (2010). Basically, there are agents 1, 2, and 3 sharing z̄, and agent i has data Di.
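As a sanity check on this approximation, the following sketch evaluates the eigenvalue-based bounds of Theorem 9 for the experimental transition matrix, here with p = 0.001 (the value used later in this section); it assumes only NumPy.

```python
import numpy as np

p = 0.001
P = np.array([[1 - 2 * p, p, p],
              [p, 1 - 2 * p, p],
              [p, p, 1 - 2 * p]])

# P is symmetric, so eigvalsh applies; eigenvalues are 1, 1-3p, 1-3p
eigvals = np.sort(np.linalg.eigvalsh(P))[::-1]
lam_star = max(abs(eigvals[1]), abs(eigvals[-1]))  # = 1 - 3p
mu_min = 1.0 / 3.0                                  # uniform stationary distribution

lower = (abs(eigvals[1]) / (1 - lam_star)) * np.log(2)   # ((1-3p)/3p) log 2
upper = (1 / (1 - lam_star)) * np.log(4 / mu_min)        # (1/3p) log 12
print(lower, 1 / (3 * p), upper)  # 1/(3p) lies between the two bounds
```

With p = 0.001 this prints roughly 230, 333, and 828, confirming that 1/(3p) is a reasonable proxy for τmix between the lower and upper bounds.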
After we update our parameters wt and bt, they are sent to agent i_t, and the agent sends us back the information ∇f_{i_t}(wt, bt), g_{i_t}(wt, bt), ∇g_{i_t}(wt, bt), h_{i_t}(wt, bt), ∇h_{i_t}(wt, bt). Here, we may impose that the sequence of selected agents gives rise to an ergodic Markov chain.

The experimental setup involves two constraints. Although we consider the single constraint setting in the paper for simplicity, our results can be easily extended to the case with multiple constraint functions ḡ1, ..., ḡn. For each ḡi, we obtain sampled functions g_{t,i} for t ∈ [T]. Then for each t, we update xt and {Q_{t,i}}_{i=1}^n as

xt+1 = argmin_{x∈X} { (Vt∇ft(xt) + ∑_{i=1}^n Q_{t,i}∇g_{t,i}(xt))^⊤ x + αt D(x, xt) },
Q_{t+1,i} = [Q_{t,i} + g_{t,i}(xt) + ∇g_{t,i}(xt)^⊤(xt+1 − xt)]₊,  i = 1, ..., n.

For Algorithm 1, we use the parameters Vt = (τmix t)^β and αt = τmix t as before. For Algorithm 2, we define the MLMC estimator g_{t,i} using {g_{t,i}^{(j)}}_{j=1}^{Nt} for each i ∈ [n] and define

a0 = S0 = δ,  at = Ft²/4 + ∑_{i=1}^n R²G_{t,i}² + ∑_{i=1}^n H_{t,i}²,  St = δ + ∑_{s=1}^t as,

where G_{t,i} = ∥∇g_{t,i}(xt)∥∗ and H_{t,i} = |g_{t,i}(xt)|. The parameters Vt and αt are defined as in (1).

We compare our DPP-based algorithms with some existing algorithms developed for stochastic-constrained stochastic optimization with i.i.d. data. The list of algorithms that we tested is given as follows.

• PD: Primal-dual method by Mahdavi et al. (2012).
• PD2: Primal-dual method by Jenatton et al. (2016).
• DPP: Drift-plus-penalty algorithm by Yu et al. (2017).
• EDPP-t: Ergodic drift-plus-penalty (Algorithm 1).
• EDPP-T: Modification of Algorithm 1 with non-adaptive parameters Vt = √(τmix T) and αt = τmix T.
• MDPP: MLMC adaptive drift-plus-penalty (Algorithm 2).

For MDPP, we observed that the MLMC estimator usually has a high variance in practice, making experimental results unstable. Hence, we truncated MLMC sampling so that the number of samples per iteration is at most 2⁴. We chose δ = F²/4 + 2R²G² + 2H², for which we computed the constants F, G, H > 0 such that Ft ≤ F, G_{t,i} ≤ G, H_{t,i} ≤ H. We set p = 0.001 and c = 0.5, with which we ran the list of algorithms with the same initial parameters and sequence of states. We first ran MDPP for 25,000 iterations, which created 101,034 samples. The results on the optimality gap are summarized in Figure 2, and the results on the infeasibility are presented in Figure 3 and Table 3. The results on the regret and the cumulative constraint violations are shown in Figures 4 and 5, respectively.

Figure 2: Optimality Gap (Left), Enlarged Figure around 5,000 - 30,000 Samples (Right)

As shown in the figures, our algorithms (EDPP-T, EDPP-t, MDPP) outperform the other algorithms in terms of the optimality gap. DPP also shows a good optimality gap, but it ends with a positive constraint 1 infeasibility. In contrast, EDPP-T, EDPP-t, and MDPP all end with a negative constraint infeasibility. Note that after 20,000 samples, EDPP-T achieves the smallest optimality gap, followed by EDPP-t and MDPP.

Figure 3: Constraint 1 Infeasibility (Left), Constraint 2 Infeasibility (Right)

| Algorithm | Final value of constraint 1 infeasibility |
| PD        | −0.0533 |
| PD2       | −0.4997 |
| DPP-T     | 0.0622  |
| DPP-t     | 0.0800  |
| EDPP-T    | 0.2042  |
| EDPP-t    | −0.0904 |
| MDPP      | −0.1536 |

Table 3: Final Values of Constraint 1 Infeasibility
However, Figure 3 shows that EDPP-T incurs a significantly higher infeasibility for constraint 1, given by

(1/|D|) ∑_{i∈D} (zi − z̄)(w^⊤xi + b) ≤ c,

than the other algorithms. In contrast, EDPP-t outperforms DPP-T and DPP-t in terms of both the optimality gap and the infeasibility measure. It is also interesting to see that DPP-T and DPP-t behave similarly, while DPP-t performs better than DPP-T.

Figure 4 shows the regret values under various algorithms for the online convex optimization setting where each pair (ft, gt) of functions corresponds to one data sample. Accordingly, we excluded MDPP as it requires multiple samples in one round. In addition, we excluded PD2, which exhibits much higher regret values than the other algorithms, to focus on and better present the performance of the other algorithms. Figure 5 shows that EDPP-T incurs a positive cumulative constraint 1 violation, while the other algorithms result in a negative cumulative constraint violation.

Figure 4: Regret

Figure 5: Constraint 1 Cumulative Violation (Left), Constraint 2 Cumulative Violation (Right)

We may check from Figures 4 and 5 that EDPP-t performs the best for online convex optimization with ergodic constraints.

6 Analysis of Ergodic Drift-Plus-Penalty for the Known Mixing Time Case

This section presents the proofs of Theorems 1 to 3 given in Section 3. Section 6.1 contains the proof of Theorem 1, which provides regret and constraint violation bounds for the formulation of online convex optimization with ergodic constraints. Then Section 6.2 presents the proof of Theorem 2 that gives bounds on the optimality gap and the feasibility gap for SCSO. In Section 6.3, we prove Theorem 3 for the case where Slater's condition is satisfied.

As before, throughout this section, we denote by Ft the σ-field generated by the information accumulated up to time step t. That is,

Ft = σ({ξ1, ..., ξt}).

Moreover, Et[·] in this section refers to the conditional expectation with respect to Ft, i.e., Et[·] = E[· | Ft] (in Sections 4 and 7, Ft denotes the σ-field generated up to time step t under the MLMC estimation scheme). Furthermore, P^s_{[t]} for s > t denotes the probability measure of ξs conditional on Ft.

6.1 Ergodic Drift-Plus-Penalty for Online Convex Optimization with Ergodic Constraints

The three important components of our analysis are Lemma 11, which bounds the term E[∑_{t=1}^T Qt gt(x)]; Lemma 12, which provides an upper bound on the term E[∑_{t=1}^T (Qt/Vt) gt(x)]; and Lemma 13, which derives an upper bound on the expected queue size E[Qt]. Then, plugging the deduced bounds into the lemmas in Appendix A analyzing the general template of adaptive drift-plus-penalty, we prove Theorem 1.

By the update rule and the convexity of gt, we deduce the following straightforward bound on |Q_{t+1} − Qt|.

Lemma 10  For t ≥ 1, we have −H − GR ≤ Q_{t+1} − Qt ≤ H.

Proof  Note that we have

Q_{t+1} = [Qt + gt(xt) + ∇gt(xt)^⊤(xt+1 − xt)]₊ ≤ [Qt + gt(xt+1)]₊ ≤ Qt + H.

For the lower bound,

Q_{t+1} ≥ Qt + gt(xt) + ∇gt(xt)^⊤(xt+1 − xt) ≥ Qt − |gt(xt)| − ∥∇gt(xt)∥∗ ∥xt+1 − xt∥ ≥ Qt − H − GR,

as required.

As Q1 = 0, it follows that Qt ≤ (t − 1)H. Recall that we denote τ = τmix(T^{−1}). The next lemma provides a bound on the term ∑_{t=1}^T E[Qt gt(x)]. As mentioned in Section 3.1, we need to take into account that g1, ..., gT are not i.i.d. while they are generated by an ergodic Markov chain.
Lemma 11  For any x ∈ X such that ḡ(x) ≤ 0,

E[ ∑_{t=1}^T Qt gt(x) ] ≤ H(H + GR)(τ − 1)T + (2H/T) ∑_{t=1}^T E[Qt].

Proof  If T < τ, then

∑_{t=1}^T Qt gt(x) ≤ ∑_{t=1}^T (t − 1)H² = (T(T − 1)/2) H² < H²(τ − 1)T,

and the statement follows. If T ≥ τ, then

E[ ∑_{t=1}^T Qt gt(x) ]
= ∑_{t=1}^{τ−1} E[Qt gt(x)] + ∑_{t=1}^{T−τ+1} E[Q_{t+τ−1} g_{t+τ−1}(x)]
≤ ∑_{t=1}^{τ−1} (t − 1)H² + ∑_{t=1}^{T−τ+1} E[(Q_{t+τ−1} − Qt) g_{t+τ−1}(x)] + ∑_{t=1}^{T−τ+1} E[Qt g_{t+τ−1}(x)]
≤ ((τ − 1)(τ − 2)/2) H² + (T − τ + 1)(τ − 1)H(H + GR) + ∑_{t=1}^{T−τ+1} E[Qt E_{t−1}[g_{t+τ−1}(x) − ḡ(x)]]
≤ ((τ − 1)(2T − τ)/2) H(H + GR) + ∑_{t=1}^{T−τ+1} E[Qt E_{t−1}[g_{t+τ−1}(x) − ḡ(x)]]

where the second inequality holds because (Q_{t+τ−1} − Qt) g_{t+τ−1}(x) ≤ (τ − 1)(H + GR)H due to Lemma 10, ḡ(x) ≤ 0, and Qt is F_{t−1}-measurable. Here, to bound E_{t−1}[g_{t+τ−1}(x) − ḡ(x)], we consider

E_{t−1}[g_{t+τ−1}(x) − ḡ(x)] = E_{t−1}[g(x, ξ_{t+τ−1}) − g(x, ξ)]
≤ ∫_A H (dP^{t+τ−1}_{[t−1]} − dµ) − ∫_{A^c} H (dP^{t+τ−1}_{[t−1]} − dµ)
≤ 2H ∥P^{t+τ−1}_{[t−1]} − µ∥_TV

where P^{t+τ−1}_{[t−1]} is the probability measure of ξ_{t+τ−1} conditional on F_{t−1}, ξ ∼ µ, and A is an appropriate measurable set. By the definition of τ = τmix(T^{−1}), it follows that ∥P^{t+τ−1}_{[t−1]} − µ∥_TV ≤ 1/T, which implies that E_{t−1}[g_{t+τ−1}(x) − ḡ(x)] ≤ 2H/T. From this, we deduce the desired statement.

Next we provide an upper bound on the term ∑_{t=1}^T E[(Qt/Vt) gt(x)], as we mentioned in Section 3.1.

Lemma 12  For any x ∈ X such that ḡ(x) ≤ 0,

E[ ∑_{t=1}^T (Qt/Vt) gt(x) ] ≤ (2H(H + GR)(τ + 1) / (τmix^β (1 − β))) (T + 1)^{1−β}.

Proof  Consider the case T < τ first. Note that

∑_{t=1}^T (Qt/Vt) gt(x) ≤ (H²/τmix^β) ∑_{t=1}^T t^{1−β} ≤ (H²/(τmix^β (2 − β))) (T + 1)^{2−β} ≤ (2H²(τ + 1)/(τmix^β (1 − β))) (T + 1)^{1−β}

where the second inequality is due to Corollary 29 and the third inequality is from T < τ. Next we consider the case T ≥ τ. Note that

E[ ∑_{t=1}^T (Qt/Vt) gt(x) ] = ∑_{t=1}^{τ−1} E[(Qt/Vt) gt(x)] + ∑_{t=1}^{T−τ+1} E[(Q_{t+τ−1}/V_{t+τ−1}) g_{t+τ−1}(x)].

Here the first term on the right-hand side can be bounded as

∑_{t=1}^{τ−1} E[(Qt/Vt) gt(x)] ≤ (H²/τmix^β) ∑_{t=1}^{τ−1} t^{1−β} ≤ (H²/(τmix^β (2 − β))) τ^{2−β} ≤ (H²τ/(τmix^β (1 − β))) (T + 1)^{1−β}

as above. Moreover, the second term can be bounded as follows:

∑_{t=1}^{T−τ+1} E[(Q_{t+τ−1}/V_{t+τ−1}) g_{t+τ−1}(x)]
≤ ∑_{t=1}^{T−τ+1} E[((Q_{t+τ−1} − Qt)/V_{t+τ−1}) g_{t+τ−1}(x)] + ∑_{t=1}^{T−τ+1} E[(Qt/V_{t+τ−1}) g_{t+τ−1}(x)]
≤ (τ − 1)H(H + GR) ∑_{t=1}^{T−τ+1} 1/(τmix(t + τ − 1))^β + ∑_{t=1}^{T−τ+1} (1/V_{t+τ−1}) E[Qt E_{t−1}[g_{t+τ−1}(x) − ḡ(x)]]
≤ ((τ − 1)H(H + GR)/(τmix^β (1 − β))) (T^{1−β} − (τ − 1)^{1−β}) + (2H/T) ∑_{t=1}^{T−τ+1} E[Qt/V_{t+τ−1}]
≤ (H(H + GR)τ/(τmix^β (1 − β))) (T + 1)^{1−β} + (2H/T) ∑_{t=1}^T E[Qt/Vt]
≤ (H(H + GR)τ/(τmix^β (1 − β))) (T + 1)^{1−β} + (2H²/(τmix^β T)) ∑_{t=1}^T t^{1−β}
≤ (H(H + GR)τ/(τmix^β (1 − β))) (T + 1)^{1−β} + (2H²/(τmix^β T (2 − β))) (T + 1)^{2−β}
≤ (H(H + GR)(τ + 1)/(τmix^β (1 − β))) (T + 1)^{1−β},

where the second inequality is from Lemma 10 and ḡ(x) ≤ 0, the third inequality holds due to Corollary 29 and the choice of τ, the sixth inequality comes from Corollary 29, and the last inequality holds because T(2 − β) > (T + 1)(1 − β).

Based on Lemmas 11 and 12, we can now prove the first part of Theorem 1, which upper bounds the regret of Algorithm 1 for online convex optimization with ergodic constraints.

Proof [The first part of Theorem 1]  Lemma 21 implies that

E[ ∑_{t=1}^T ft(xt) − ∑_{t=1}^T ft(x*) ] ≤ (α_T/V_T) R² + (F²/4) ∑_{t=1}^T Vt/αt + ((H + GR)²/2) ∑_{t=1}^T 1/Vt + E[ ∑_{t=1}^T (Qt/Vt) gt(x*) ].

Next, using Lemma 12 with x* satisfying ḡ(x*) ≤ 0 and plugging in our choice of αt and Vt, we deduce the following:

E[ ∑_{t=1}^T ft(xt) − ∑_{t=1}^T ft(x*) ]
≤ R²(τmix T)^{1−β} + (F² τmix^{β−1}/(4β)) T^β + ((H + GR)² τmix^{−β}/(2(1 − β))) T^{1−β} + (2H(H + GR)(τ + 1)/(τmix^β (1 − β))) (T + 1)^{1−β}
= O( τmix^{−β} τ T^{1−β} ),

as required.

Next, we prove the second part of Theorem 1, which gives an upper bound on constraint violation under Algorithm 1. The following lemma provides a time-varying bound on the expected virtual queue size.

Lemma 13  For t ∈ [T + 1], E[Qt] is bounded above by

(3/2)√(2I(t − 1)) + ((β + 3)/(β + 1)) √(2FR τmix^β t^{β+1}/(1 + β)) + (3R/2)√(2τmix(t − 1)) + (3H/2)√(2(τ − 1)(t − 1)) + 4H

where I = H² + G²R² + F²/4.

Proof  To prove the lemma, we argue by induction. Note that the statement of the lemma trivially holds when t = 1 because Q1 = 0. Suppose the statement holds for t ≤ s for some s ≥ 1. What remains is to provide an upper bound on E[Q_{s+1}]. Note that

E[Qt(gt(xt) + ∇gt(xt)^⊤(x − xt))] = E[Qt E_{t−1}[gt(xt) + ∇gt(xt)^⊤(x − xt)]] ≤ E[Qt E_{t−1}[gt(x)]] = E[Qt gt(x)].

Then Lemma 24 together with Jensen's inequality implies the following:

E[Q_{s+1}] ≤ √( 2s(H² + R²G²) + 2RF ∑_{t=1}^s Vt + 2R² αs + 2 ∑_{t=1}^s E[Qt gt(x)] ).

Moreover, ∑_{t=1}^s E[Qt gt(x)] can be upper bounded based on Lemma 11. Then it follows that

E[Q_{s+1}]
≤ √( 2Is + 2FR τmix^β (s + 1)^{β+1}/(1 + β) + 2R² τmix s + 2H(H + GR)(τ − 1)s + (4H/s) ∑_{t=1}^s E[Qt] )
≤ √(2Is) + √(2FR τmix^β (s + 1)^{β+1}/(1 + β)) + R√(2τmix s) + H√(2(τ − 1)s) + 2√( (H/s) ∑_{t=1}^s E[Qt] )
≤ √(2Is) + √(2FR τmix^β (s + 1)^{β+1}/(1 + β)) + R√(2τmix s) + H√(2(τ − 1)s) + 2H + (1/(2s)) ∑_{t=1}^s E[Qt]

for s ≤ T, where the last step uses the AM-GM inequality 2√((H/s) X) ≤ 2H + X/(2s). By the induction hypothesis, E[Qt] for any 1 ≤ t ≤ s is upper bounded by

(3/2)√(2I(t − 1)) + ((β + 3)/(β + 1)) √(2FR τmix^β t^{β+1}/(1 + β)) + (3R/2)√(2τmix(t − 1)) + (3H/2)√(2(τ − 1)(t − 1)) + 4H
≤ (3/2)√(2Is) + ((β + 3)/(β + 1)) √(2FR τmix^β (s + 1)^{β+1}/(1 + β)) + (3R/2)√(2τmix s) + (3H/2)√(2(τ − 1)s) + 4H.

This leads to the desired upper bound on E[Q_{s+1}].

We are now ready to prove the second part of Theorem 1, which proves the constraint violation bound of Algorithm 1.

Proof [The second part of Theorem 1]  Combining Lemmas 22 and 13, we deduce that

E[ ∑_{t=1}^T gt(xt) ] ≤ E[Q_{T+1}] + (FG/2) ∑_{t=1}^T Vt/αt + (G²/2) ∑_{t=1}^T E[Qt]/αt = O( τmix^{β/2} T^{β/2+1/2} + √((τ − 1)T) ),

as required.

6.2 Ergodic Drift-Plus-Penalty for Stochastic-Constrained Stochastic Optimization

In this section, we provide our formal proof of Theorem 2. To better organize and present the result, we divide the analysis into two parts, one of which is for bounding the optimality gap while the other is for bounding the feasibility gap. The main idea is to use the performance bounds given in Theorem 1 and relate them to the optimality gap and the feasibility gap. The relation between them is not as clear as in the i.i.d. setting, but we use the mixing property of ergodic Markov chains.

Proof [The first part of Theorem 2]  For T ≥ τ = τmix(T^{−1}), we have the following:

∑_{t=1}^T (f̄(xt) − f̄(x))
= ∑_{t=1}^{T−τ+1} (f̄(xt) − f̄(x) − f_{t+τ−1}(xt) + f_{t+τ−1}(x))
+ ∑_{t=1}^{T−τ+1} (f_{t+τ−1}(xt) − f_{t+τ−1}(x_{t+τ−1}))
+ ∑_{t=τ}^T (ft(xt) − ft(x))
+ ∑_{t=T−τ+2}^T (f̄(xt) − f̄(x)).

We consider the four parts of the right-hand side separately.
Here, the first part satisfies   Et−1 f¯(xt ) − f¯(x) − ft+τ −1 (xt ) + ft+τ −1 (x) Z   −1 = (f (xt , ξ) − f (x, ξ)) dµ(ξ) − dPt+τ (ξ) [t−1] Z t+τ −1 ≤ F R |dµ(ξ) − dP[t−1] (ξ)| −1 ≤ 2F R∥µ − Pt+τ [t−1] ∥T V ≤ 2F R . T P −τ +1 ¯ Hence, it follows that Tt=1 E[f (xt ) − f¯(x) − ft+τ −1 (xt ) + ft+τ −1 (x)] ≤ 2F R. To bound the second part, we consider the following. E[ft+τ −1 (xt ) − ft+τ −1 (xt+τ −1 )] = ≤ ≤ t+τ −2 X s=t t+τ −2 X s=t t+τ −2 X s=t = E[ft+τ −1 (xs ) − ft+τ −1 (xs+1 )] E[F ∥xs − xs+1 ∥] F E[Vs F + Qs G] 2αs −2 t+τ −2 β−1 t+τ X F 2 τmix F G X E[Qs ] sβ−1 + 2 2τmix s=t s s=t t+τ −2 β−1   F 2 τmix F G X E[Qs ] β β ≤ (t + τ − 2) − (t − 1) + 2β 2τmix s=t s 29 Yeongjong Kim and Dabeen Lee where the last inequality holds due to Corollary 29. Here, we consider the following to provide an upper bound on the second term on the right-most side. t+τ −2 X s=t  p p E[Qs ] 2I(t + τ − 2) − 2I(t − 1) ≤3 s s  q β q 2(β + 3) 2F Rτmix β+1 β+1 + (t + τ − 2) − (t − 1) (β + 1)2 β+1 p  p + 3R 2τmix (t + τ − 2) − 2τmix (t − 1)  p p 2(τ − 1)(t + τ − 2) − 2(τ − 1)(t − 1) + 3H   t+τ −2 + 4H log . σ(t − 1) Here, we handle the above terms in the following manner. Note that T −τ X+1 q (t + τ − 2)β+1 −  q β+1 (t − 1) t=1 β+1 β+1 β+1 = (T − 1) 2 + · · · + (T − τ + 1) 2 − (τ − 2) 2 − · · · − 1 ≤ (τ − 1)T β/2+1/2 holds and that T −τ X+1 p (t + τ − 2) −  p 1 1 1 (t − 1) = (T − 1) 2 + · · · + (T − τ + 1) 2 − (τ − 2) 2 − · · · − 1 t=1 ≤ (τ − 1)T 1/2 . The resulting terms form dominant terms, which means that T −τ X+1 E[ft+τ −1 (xt ) − ft+τ −1 (xt+τ −1 )] t=1   β/2−1 −1 = O τmix (τ − 1)T β/2+1/2 + τmix (τ − 1)3/2 T 1/2 . Moreover, by Theorem 1 and the triangular inequality, the third part is bounded as E " T X #   −β (ft (xt ) − ft (x)) = O τmix τ T 1−β + (τ − 1) . t=τ The fourth part can be bounded as follows. " E T X # f¯(xt ) − f¯(x) t=T −τ +2 30  ≤ RF (τ − 1). Stochastic-Constrained Stochastic Optimization with Markovian Data Combining the bounds on the four parts, we can conclude that for any x ∈ X such that ḡ(x) ≤ 0, " T # X  E f¯(xt ) − f¯(x) t=1   β/2−1 −β −1 = O τmix τ T 1−β + τmix (τ − 1)T β/2+1/2 + τmix (τ − 1)3/2 T 1/2 + τ , which implies " T # X   1 E f¯(x̄T ) − f¯(x) ≤ E f¯(xt ) − f¯(x) T  t=1 =O τ β τmix β/2 (τ − 1)3/2 τ τ − 1 τmix + + + (1−β)/2 1/2 τmix T T τmix T Tβ ! , as required. Next we prove the feasibility gap bound of Algorithm 1, which is given as the second part of Theorem 2. Proof [Proof of the second part of Theorem 2] For T ≥ τ = τmix (T −1 ), we have the following, T X ḡ(xt ) = t=1 T −τ X+1 t=1 T X + (ḡ(xt ) − gt+τ −1 (xt )) + T −τ X+1 (gt+τ −1 (xt ) − gt+τ −1 (xt+τ −1 )) t=1 gt (xt ) + t=τ T X ḡ(xt ) t=T −τ +2 where the right-hand side consists of four terms. We separately upper bound the four parts of the right-hand side. As before, the first part satisfies Z   t+τ −1 Et−1 [ḡ(xt ) − gt+τ −1 (xt )] = g(xt , ξ) dµ(ξ) − dP[t−1] (ξ) Z −1 ≤ H |dµ(ξ) − dPt+τ [t−1] (ξ)| −1 ≤ 2H∥µ − Pt+τ [t−1] ∥T V ≤ This implies that PT −τ +1 t=1 2H . T E[ḡ(xt ) − gt+τ −1 (xt )] ≤ 2H. Next, the second part satisfies E[gt+τ −1 (xt ) − gt+τ −1 (xt+τ −1 )] = ≤ t+τ −2 X s=t t+τ −2 X s=t 31 E[gt+τ −1 (xs ) − gt+τ −1 (xs+1 )] E[G∥xs − xs+1 ∥] Yeongjong Kim and Dabeen Lee As in the proof of the first part of Theorem 2, we deduce that T −τ X+1   β/2−1 −1 E[gt+τ −1 (xt ) − gt+τ −1 (xt )] ≤ O τmix (τ − 1)T β/2+1/2 + τmix (τ − 1)3/2 T 1/2 . t=1 Moreover, it follows from Theorem 1 and the triangular inequality that the third part can be bounded as " T #   X p β/2 E gt (xt ) = O τmix T β/2+1/2 + (τ − 1)T + τ − 1 . 
t=τ +2 Lastly, the following provides an upper bound on the fourth part. " T # X E ḡ(xt ) ≤ H(τ − 1). t=T −τ +2 Combining the bounds on the four parts, we have that " T #   X β/2 β/2−1 −1 (τ − 1)3/2 T 1/2 + τ , E ḡ(xt ) ≤ O τmix T β/2+1/2 + τmix (τ − 1)T β/2+1/2 + τmix t=1 implying in turn that " T # X 1 ḡ(xt ) = O E [ḡ(x̄T )] ≤ E T t=1 β/2−1 τmix τ (τ − 1)3/2 τ + + T T (1−β)/2 τmix T 1/2 ! , as required. 6.3 Ergodic Drift-Plus-Penalty under Slater’s Condition In this section, we prove Theorem 3, which analyzes the performance of Theorem 3 under Slater’s constraint qualification stated in Assumption 3. We will see that Slater’s condition does lead to improvement. Basically, we deduce a reduction on E[Qt ]. To argue this, we follow the proof path of Yu et al. (2017). We first present a lemma, which is analogous to (Yu et al., 2017, Lemma 5). The difference is that we use a time-varying parameter θ(t) which allows time-varying algorithm parameters Vt and αt . In fact, its proof is similar to that of (Yu et al., 2017, Lemma 5), so we defer it to the appendix (Appendix D). Lemma 14 Let {Zt , t ≥ 0} be a discrete-time stochastic process adapted to a filtration {Wt , t ≥ 0} with Z0 = 0 and W0 = {∅, S}. Suppose there exists a positive integer t0 , real numbers 0 < ζ ≤ δmax , and a non-decreasing function θ(t) > 0 such that |Zt+1 − Zt | ≤ δmax , ( t0 δmax , E[Zt+t0 − Zt |Wt ] ≤ −t0 ζ, 32 if Zt < θ(t) if Zt ≥ θ(t) Stochastic-Constrained Stochastic Optimization with Markovian Data for all positive integer t. Then, 4δ 2 E[Zt ] ≤ θ(t) + t0 δmax + t0 max log ζ  2 8δmax ζ2  for all positive integer t. Here, the important part is that although θ(t) is time-varying, the parameter t0 is some fixed value that does not depend on t. That is why the condition in Lemma 14 can be recursively applied. Lemma 14 implies that if the stochastic process given by {Qt , t ≥ 0} where Q0 = 0 satisfies the drift condition as in Lemma 14 with appropriate parameters δmax , t0 , and ζ, then the expected queue size E [Qt ] can be bounded. Moreover, by properly setting θ(t) as well as the parameters, we may control the size of E [Qt ]. The next lemma shows that the stochastic process given by {Qt , t ≥ 0} indeed satisfies the desired drift condition. While proving the lemma, we need to consider the term Et−1 [Qi gi (x̂)] for i ≥ t where x̂ is the solution satisfying Slater’s constraint qualification, i.e. ḡ(x̂) ≤ −ϵ. Here, we need to relate ḡ(x̂) ≤ −ϵ and the term Et−1 [Qi gi (x̂)], from which we deduce the drift condition. Although gi (x̂) is not necessarily negative, we again use the property of ergodic Markov chains that the distribution of gi (x̂) conditional on Ft−1 gets close to the stationary distribution for a sufficiently large i. Lemma 15 Let T ≥ 4H/ϵ, t0 > 2τ = 2τmix (T −1 ), and θt0 (t) =  2 ϵ(t0 − τ )2 (H + GR) + 4τ (t0 + t − 1)(H + GR)2 ϵ(t0 − 2τ )   2 β + 2t0 (H 2 + G2 R2 ) + 2RF τmix t0 (t + t0 − 1)β . ϵ(t0 − 2τ ) Then Algorithm 1 under Assumption 3 satisfies |Qt+1 − Qt | ≤ H + GR, ( (H + GR)t0 , Et−1 [Qt+t0 − Qt ] ≤ −ϵt0 /4, if Qt < θt0 (t) if Qt ≥ θt0 (t). Proof |Qt+1 − Qt | ≤ H + GR follows directly from Lemma 10. For the second part, note that by Jensen’s inequality, we have Et−1 [Qt+t0 ]2 ≤ Et−1 [Q2t+t0 ]. Moreover, observe that θt0 (t) ≥ ϵt0 2ϵ(t0 − τ )2 (H + GR) ≥ (t0 − τ )(H + GR) ≥ , ϵ(t0 − 2τ ) 2 33 Yeongjong Kim and Dabeen Lee where the last inequality holds because t0 > 2τ and H > ϵ. This means that Qt > ϵt0 /4 if Qt > θ(t0 ). 
Therefore, it is sufficient to show that if Qt ≥ θ(t0 ), Et−1 [Q2t+t0 ] ≤   ϵt0 2 Qt − . 4 Recall that By Lemma 23 and the convexity of gt , we have for any i ≥ 1, ∆i ≤ H 2 + G2 R2 + Qi gi (x̂) + Vi RF + αi (D(x̂, xi ) − D(x̂, xi+1 )) . Thus, we have t+t 0 −1 X Et−1 [∆i ] i=t 2 2 2 ≤ t0 (H + G R ) + + αt D(x̂, xt ) + t+t 0 −1 X β Et−1 [Qi gi (x̂)] + RF τmix t+t 0 −1 X i=t t+t −2 0 X iβ i=t (αi+1 − αi )D(x̂, xi+1 ) − αt+t0 −1 D(x̂, xt+t0 ) i=t ≤ t0 (H 2 β t0 (t + t0 − 1)β + αt+t0 −1 R2 + + G R ) + RF τmix 2 2 t+t 0 −1 X Et−1 [Qi gi (x̂)]. i=t We factor the last term as follows. t+t 0 −1 X Et−1 [Qi gi (x̂)] i=t t+τ −2 X = Et−1 [Qi gi (x̂)] + i=t t+t 0 −τ X Et−1 [(Qi+τ −1 − Qi )gi+τ −1 (x̂)] i=t Et−1 [Qi (gi+τ −1 (x̂) − ḡ(x̂))] + + ≤H t+t 0 −τ X 2 i=t t+τ −3 X t+t 0 −τ X Et−1 [Qi ḡ(x̂)] i=t  i + (t0 − τ + 1)(τ − 1)H(H + GR) +  t+t 0 −τ X 2H −ϵ Et−1 [Qi ] T i=t i=t−1 where the inequality is due to Z Et−1 [ḡ(xt ) − gt+τ −1 (xt )] ≤ H t+τ −1 −1 |dµ(ξ) − dP[t−1] (ξ)| ≤ 2H∥µ − Pt+τ [t−1] ∥T V ≤ 34 2H . T Stochastic-Constrained Stochastic Optimization with Markovian Data Then, since T ≥ 4H/ϵ and thus 2H/T ≤ ϵ/2, it follows that t+t 0 −1 X Et−1 [Qi gi (x̂)] i=t t+t0 −τ ϵ X ≤ (t0 + t − 2)(τ − 1)H(H + GR) − Et−1 [Qi ] 2 i=t tX 0 −τ ϵ (Qt − i(H + GR)) 2 i=0  ϵ (t0 − τ )Qt − (t0 − τ )2 (H + GR) . ≤ (t0 + t − 2)(τ − 1)H(H + GR) − 2 ≤ (t0 + t − 2)(τ − 1)H(H + GR) − As a result, we obtain Et−1 [Q2t+t0 ] = Q2t + 2 t+t 0 −1 X Et−1 [∆i ] i=t ≤ Q2t − ϵ(t0 − τ )Qt + ϵ(t0 − τ )2 (H + GR) + 2(t0 + t − 2)(τ − 1)H(H + GR) β t0 (t + t0 − 1)β + 2τmix (t + t0 − 1)R2 + 2t0 (H 2 + G2 R2 ) + 2RF τmix ≤ Q2t − ϵ(t0 − τ )Qt + ϵ(t0 − τ )2 (H + GR) + 4τ (t0 + t − 1)(H + GR)2 β + 2t0 (H 2 + G2 R2 ) + 2RF τmix t0 (t + t0 − 1)β If Qt satisfies ϵ (t0 − 2τ )Qt ≥ ϵ(t0 − τ )2 (H + GR) + 4τ (t0 + t − 1)(H + GR)2 2 β t0 (t + t0 − 1)β , + 2t0 (H 2 + G2 R2 ) + 2RF τmix which is equivalent to Qt ≥ θt0 (t), then ϵ Et−1 [Q2t+t0 ] ≤ Q2t − t0 Qt ≤ 2  ϵt0 Qt − 4 2 as required. Having prepared Lemmas 14 and 15, we can derive refined bounds on the expected virtual queue size. Based on this, we are ready to prove Theorem 3. Proof [Proof of Theorem 3] As the result is trivial when T ≤ 4H/ϵ where H and ϵ are constants, we may assume without loss of generality that T ≥ 4H/ϵ. Moreover, we will use Lemma 14 for many values of t0 greater than 2τ . Setting δmax = H + GR, ζ = ϵ/4, and θ = θt0 where θt0 is defined as in Lemma 15, Lemma 14 gives us that 16(H + GR)2 E[Qt+1 ] ≤ θt0 (t) + t0 (H + GR) + t0 log ϵ 35  128(H + GR)2 ϵ2  . Yeongjong Kim and Dabeen Lee √ In particular, as this inequality holds for any t0 > 2τ , the inequality with t0 = 2τ + ⌈ τmix t⌉ holds. Recall that in Lemma 15, we set θt0 (t) =  2 ϵ(t0 − τ )2 (H + GR) + 4τ (t0 + t − 1)(H + GR)2 ϵ(t0 − 2τ )   2 β + 2t0 (H 2 + G2 R2 ) + 2RF τmix t0 (t + t0 − 1)β . ϵ(t0 − 2τ ) Moreover, √ √ ϵ⌈ τmix t⌉ ϵ τmix t ϵ(t0 − 2τ ) θt0 (t) = θt0 (t) ≥ θt0 (t). 2 2 2 This implies the following. 
√ ϵ τmix t θt0 (t) 2 √ √ ≤ ϵ(τ + ⌈ τmix t⌉)2 (H + GR) + 4τ (2τ + ⌈ τmix t⌉ + t − 1)(H + GR)2 √ √ √ β + 2(2τ + ⌈ τmix t⌉)(H 2 + G2 R2 ) + 2RF τmix (2τ + ⌈ τmix t⌉)(t + 2τ + ⌈ τmix t⌉ − 1)β √ ≤ 2ϵ(H + GR)(τ 2 + 4τmix t) + 4τ (2τ + 2 τmix t + t)(H + GR)2 √ √ √ β + 4(τ + τmix t)(H 2 + G2 R2 ) + 4RF τmix (τ + τmix t)(t + 2τ + 2 τmix t)β ≤ 2ϵ(H + GR)(τ 2 + 4τmix t) + 4τ (2τ + τmix + 2t)(H + GR)2 | {z } | {z } + 4(τ + | √ (a) (b) τmix t)(H {z (c) 2 √ β + G R ) + 4RF τmix (τ + τmix t)(2t + 2τ + τmix )β } 2 2 | {z } (d) where the second inequality follows from (a + b)2 ≤ 2(a2 + b2 ), ⌈x⌉ ≤ 2x which holds for x ≥ 1 and the third inequality follows from AM-GM inequality. We see that term (c) grows more slowly than term (d) which is of order     √ √ β β O τmix (τ + t)β (τ + τmix t) = O τmix (τ β + tβ )(τ + τmix t) . Here, we used the known fact that (a + b)β ≤ aβ + bβ for a, b ≥ 0 and β < 1. Likewise, term (a) grows more slowly than term (b) which is of order O (τ (τ + t)) . Then it follows that     τ τ (τ + t) β β β √ √ θt0 (t) = O τmix (τ + t ) 1 + + τmix t τmix t    τ 1+β τ β τ τ β tβ τ2 τt   β β  β +√ = O τmix τ + τmix tβ + √ mix + √ mix + √ . | {z } | {z } τmix t τmix t τmix t τmix t  | {z } | {z } | {z } | {z } (a′ ) (b′ ) (c′ ) 36 (d′ ) (e′ ) (f ′ ) Stochastic-Constrained Stochastic Optimization with Markovian Data Here, term (c′ ) grows more slowly than term (e′ ), and term (f ′ ) dominates term (b′ ). Also, terms (a′ ) and (d′ ) are dominated by   r  τ = O (e′ ) + (f ′ ) , τ =O τ τmix which is due to the AM-GM inequality. Therefore,   τ (τ + t) √ θt0 (t) = O . τmix t Furthermore, since √ t0 = 2τ + ⌈ τmix t⌉ = O  τ (τ + t) √ τmix t  √ which holds because τ (τ + t) ≥ 2τ τ t, we have   τ (τ + t) E[Qt ] ≤ O √ , τmix t (2) which essentially gives rise to the desired refined bound on the expected virtual queue size. Having provided the bound on the expected queue size, we now proceed to provide bounds on the performance guarantees of Algorithm 1. First of all, note that the regret bound given by Theorem 1 still holds with β = 1/2, as we use the same algorithm parameters. Then we deduce that √ ! τ T . E [Regret(T )] = O √ τmix For constraint violation, we deduce from Lemma 22 and (2) that T T F G X Vt G2 X E[Qt ] E [Violation(T )] ≤ E[QT +1 ] + + 2 αt 2 αt t=1 t=1 ! T T X τ (τ + T + 1) 1 X τ (τ + t) β−1 β−1 √ p + τmix t + τmix t τmix t τmix (T + 1) t=1 t=1 =O  =O τ (τ + T ) √ τmix T  . Next, we consider the optimality gap. In the proof of Theorem 2, we argued that E[ft+τ −1 (xt ) − ft+τ −1 (xt+τ −1 )] ≤ t+τ −2 β−1   F 2 τmix F G X E[Qs ] (t + τ − 2)β − (t − 1)β + . 2β 2τmix s=t s Here, using (2) again, the second term on the right-hand side can be bounded as follows. 1 τmix t+τ −2 X s=t  E[Qs ] s    √ τ √     = O  3/2 λ((t − 1)−1/2 ) − λ((t + τ − 2)−1/2 ) + 3/2 ( t + τ − 2 − t − 1) {z } τmix | τmix ′′ τ2 (a ) 37 Yeongjong Kim and Dabeen Lee where ( x, λ(x) = 3/2, if x > 0 . if x = 0 Therefore, it follows that T −τ X+1 β−1   τmix (T − 1)β + · · · + (T − τ + 1)β 2β   3 τ2 −1/2 −1/2 + 3/2 +1 + · · · + (τ − 2) τmix 2 !  √ τ √ T − 1 + ··· + T − τ + 1 + 3/2 τmix E[ft+τ −1 (xt ) − ft+τ −1 (xt+τ −1 )] = O t=1 (3) if τ ≥ 2. Thus, we deduce T −τ X+1 E[ft+τ −1 (xt ) − ft+τ −1 (xt+τ −1 )] t=1  √  √ −3/2 −3/2 β−1 (τ − 1)T β + τ 2 τ − 1τmix + τ (τ − 1)τmix T = O τmix √ ! τ 5/2 τ 2 T =O + 3/2 . 3/2 τmix τmix We combine this result with the other parts of the proof of the first part of Theorem 2 to get " T # ! √ √ ! 
5/2 5/2 2 T 2 T X  τ τ τ τ −β 1−β +τ =O E f¯(xt ) − f¯(x) = O + 3/2 + τmix τ T + 3/2 3/2 3/2 τ τ τ τmix t=1 mix mix mix for any x ∈ X such that ḡ(x) ≤ 0 since β = 1/2. Similarly, we get T −τ X+1 E[gt+τ −1 (xt ) − gt+τ −1 (xt+τ −1 )] = O t=1 τ 5/2 3/2 + τmix √ ! τ2 T 3/2 . τmix As before, we combine this result with the other parts of the proof of the second part of Theorem 2. Then we get " T # ! ! √ √ X τ 5/2 τ 2 T τ (τ + T ) τ 5/2 τ 2 T τ2 E ḡ(xt ) = O + 3/2 + √ +τ =O + 3/2 + √ . 3/2 3/2 τ T τmix T mix τ τ τ τ t=1 mix mix mix mix Finally, we obtain that E[Gap(x̄T )] = O E[Infeasibility(x̄T )] = τ 5/2 τ2 + 3/2 3/2 √ τmix T τmix T τ 5/2 ! , τ2 τ2 + + √ 3/2 3/2 √ τmix T 3/2 τmix T τmix T as required. 38 ! , Stochastic-Constrained Stochastic Optimization with Markovian Data 7 Analysis of MLMC Adaptive Drift-Plus-Penalty for the Unknown Mixing Time Case This section presents a complete proof of Theorem 8. Section 7.1 provides bounds on several terms that involve the virtual queue size. In Section 7.2, we state the proof of Lemma 6, and lastly, we prove Theorem 8. 7.1 Controlling the Expected Virtual Queue Size In this section we provide upper bounds on terms E [Qt /Vt ] and E [Qt ] under the MLMC Adaptive Drift-Plus-Penalty algorithm (Algorithm 2). Lemma 16 gives a refined bound on the term E [gt (x)]. Using this, Lemma 17 derives an adaptive upper bound on the term E [Qt /Vt ], and Lemma 18 deduces an adaptive upper bound on the term E [Qt ]. Note that if x̄ ∈ X satisfies ḡ(x) ≤ 0, then gt (x) ≤ gt (x) − ḡ(x). Then we deduce h j i h j i max max Et−1 [gt (x)] ≤ Et−1 [gt (x) − ḡ(x)] = Et−1 gt2 (x) − ḡ(x) ≤ Et−1 |gt2 (x) − ḡ(x)| . where the equality is due to Lemma 4. Applying Lemma 5 together with Jensen’s inequality on the right-hand side, we obtain   1/2 1/2 Et−1 [gt (x)] ≤ C(T )τmix T −1 = Õ τmix T −1 . (4) In fact, we may derive a tighter bound on the term Et−1 [gt (x)] as follows. Lemma 16 For any x ∈ X such that ḡ(x) ≤ 0, 1/4 Et−1 [gt (x)] ≤ D(T )τmix T −1/2 , where D(T ) = p C(T )H(4 log2 T + 3) Proof Note that gt (x) ≤ |gt (x)| ≤ (2Nt + 1)H by the definition of gt . This implies that Et−1 [gt (x)] ≤ Et−1 [(2Nt + 1)H] = (4 log2 T + 3)H, where the equality follows from Lemma 4 and the fact that Nt is independent of Ft−1 . Moreover, as we argued above, we have h j i q  j  max 1/2 Et−1 [gt (x)] ≤ Et−1 |gt2 (x) − ḡ(x)| ≤ Et−1 |gt2 max (x) − ḡ(x)|2 ≤ C(T )τmix T −1 , where the last inequality follows from Jensen’s inequality and the equality follows from Lemma 5. By taking the geometric mean of these two upper bounds, we get that Et−1 [gt (x)] ≤ p 1/4 C(T )H(4 log2 T + 3)τmix T −1/2 , as required. 39 Yeongjong Kim and Dabeen Lee Next, Lemma 24 implies that Qs+1 is at most v v v u s u s u s p u X u X u X t2 RVt Ft + t2 (R2 G2t + Ht2 ) + 2R2 αs +t2 Qt (gt (xt ) + ∇gt (xt )⊤ (x − xt )). | {z } t=1 t=1 t=1 (c) {z } | {z } | (a) (b) Here, term (c) is equal to For term (a), we have (a)2 = 2 s X √ √ √ 2Ss−1 which is at most 2Ss , and term (b) is at most 2Ss . β St−1 Ft ≤ 2Ssβ t=1 s X v u s u X √ βt Ft ≤ 2Ss s Ft2 ≤ 2 sSsβ+1/2 , t=1 t=1 where the second inequality holds by the power mean inequality. Thus, s X √ Qt (gt (xt ) + ∇gt (xt )⊤ (x − xt )), Q2s+1 ≤ 2 sSsβ+1/2 + 4Ss + 2 (5) t=1 which in turn implies that √ Qs+1 ≤ 2 2Ss1/2 + √ v u s u X 1/4 β/2+1/4 2s Ss + t2 Qt (gt (xt ) + ∇gt (xt )⊤ (x − xt )) (6) t=1 Moreover, (6) implies that v u s u 2 X √ √ Qs+1 Qt 1/2−β 1/4 1/4−β/2 ≤ 2 2RSs + 2Rs Ss +t (gt (xt ) + ∇gt (xt )⊤ (x − xt )) Vs+1 Vs+1 Vt t=1 as Vt is non-decreasing. 
Taking the expectation, applying Jensen’s inequality, and using the fact that Vs+1 ≥ sδ, it follows that   √ √ Qs+1 E ≤ 2 2RE[Ss ]1/2−β + 2Rs1/4 E[Ss ]1/4−β/2 Vs+1 v u   s u 2R X Qt ⊤ t + E Et−1 [gt (xt ) + ∇gt (xt ) (x − xt )] . Vt (sδ)β t=1 Here, the last term on the right-hand side can be further upper bounded as follows. v v u   u   s s u 2R X u 2R X Qt Qt ⊤ (x − x )] ≤ t t E E [g (x ) + ∇g (x ) E E [g (x)] t−1 t t t t t t−1 t Vt Vt (sδ)β (sδ)β t=1 t=1 where we used the fact that h i Et−1 [gt (xt ) + ∇gt (xt )⊤ (x − xt )] = Et−1 g jmax (xt ) + ∇g jmax (xt )⊤ (x − xt )   ≤ Et−1 g jmax (x) = Et−1 [gt (x)] 40 (7) Stochastic-Constrained Stochastic Optimization with Markovian Data holds due to Lemma 4. Furthermore, v v u  u    s s X X u 2R u 2R Qt Qt 1/4 −1/2 t t Et−1 [gt (x)] ≤ E D(T )τmix T E β β Vt Vt (sδ) (sδ) t=1 t=1   s 2Rs 1 X Qt 1/4 −1/2 ≤ D(T )τmix T + E β 2s Vt (sδ) t=1   s 2RD(T ) 1/4 1/2−β 1 X Qt ≤ τmix T + E . 2s Vt δβ t=1 where the second inequality follows from the AM-GM inequality and the last equality follows from s ≤ T . Thus, we have   Qs+1 E Vs+1   s √ √ 2RD(T ) 1/4 1/2−β 1 X Qt 1/2−β 1/4 1/4−β/2 ≤ 2 2RE[Ss ] + 2Rs E[Ss ] + τmix T + E (8) β 2s Vt δ t=1   s √ √ 2RD(T ) 1/4 1/2−β 1 X Qt 1/2−β 1/4 1/4−β/2 + 2Rs E[ST ] + ≤ 2 2RE[ST ] τmix T + E . β 2s Vt δ t=1 Now we can bound E [Qt /Vt ] as follow. Lemma 17 For t ∈ [T + 1],   √ √ 4RD(T ) 1/4 1/2−β Qt E τmix T . ≤ 4 2RE[ST ]1/2−β + 2 2RT 1/4 E[ST ]1/4−β/2 + Vt δβ Proof For t = 1, Q1 = 0 and the inequality of the lemma holds. Assume that it holds for t = 1, . . . , s with s ≤ T . Substituting the inequalities for t = 1, . . . , s into (8), we derive the inequality for t = s + 1, as required. Next, we deduce an upper bound on E [Qt ]. First, we observe that (6) and Jensen’s inequality imply the following. For s ∈ [T ], v u √ s u √ √ τmix X 1/2 1/4 β/2+1/4 t E [Qs+1 ] ≤ 2 2E[Ss ] + 2s E[Ss ] + 2C(T ) E[Qt ] T t=1 √ s √ √ τmix 1 X ≤ 2 2E[Ss ]1/2 + 2s1/4 E[Ss ]β/2+1/4 + sC(T ) + E[Qt ] T 2s (9) t=1 √ ≤ 2 2E[Ss ]1/2 + √ 1/2 2s1/4 E[Ss ]β/2+1/4 + C(T )τmix + s 1 X 2s E[Qt ], t=1 where the first inequality is implied by combining (6) and (7) and applying Lemma 16. Next we provide an adaptive upper bound on E [Qt ] 41 Yeongjong Kim and Dabeen Lee Lemma 18 For any t ∈ [T + 1], √ 1/2 E[Qt ] ≤ 4 2E[St−1 ] √ 5 2 1/2 + (t − 1)1/4 E[St−1 ]β/2+1/4 + 2C(T )τmix . 3 Proof We argue by induction. The inequality trivially holds when t = 1 as Q1 = 0. Suppose that the inequality holds for t = 1, 2, . . . , s. Then √ √ 1/2 E[Qs+1 ] ≤ 2 2E[Ss ]1/2 + 2s1/4 E[Ss ]β/2+1/4 + C(T )τmix  s  1 X √ 10 1/2 1/2 1/4 β/2+1/4 + 4 2E[St−1 ] + (t − 1) E[St−1 ] + 2C(T )τmix 2s 3 t=1 √ √ 1/2 ≤ 2 2E[Ss ]1/2 + 2s1/4 E[Ss ]β/2+1/4 + C(T )τmix ! √ √ 1 4 2 1/2 + 4 2sE[Ss ]1/2 + s5/4 E[Ss ]β/2+1/4 + 2sC(T )τmix 2s 3 √ √ 5 2 1/4 1/2 1/2 = 4 2E[Ss ] + s E[Ss ]β/2+1/4 + 2C(T )τmix , 3 where the first inequality is from (9) and the second inequality is by Corollary 29. 7.2 MLMC Adaptive Drift-Plus-Penalty for Stochastic-Constrained Stochastic Optimization Based on the bounds on Et−1 [gt (x)], E[Qt /Vt ], and E[Qt ], in this section, we prove Lemma 6 and Theorem 8. To better present the proof of Lemma 6, we divide the analysis into two parts, one for the regret and the other for the constraint violation. Then, using Lemma 6, we prove Theorem 8. Proof [Proof of the first part of lemma 6] We first bound " T # T X X E ft (xt ) − ft (x) . t=1 t=1 By Lemma 21, this is less than or equal to " T # " T # " T #   X Vt F 2 X (Ht + Gt R)2 X Qt αT 2 1 t E R +E +E +E gt (x) . 
VT 4αt 2 Vt Vt t=1 t=1 t=1 1−β Here, the first term is equal to RST1−β , and the second term −1 which is less or equal to RST satisfies " T # " T # " T #  h i  h i X Vt F 2 X β−1 X β−1 R t E = E St−1 Ft2 ≤ R St−1 at = O E STβ = O E STβ 4αt 4 t=1 t=1 t=1 where the inequality holds by Corollary 31. The third term satisfies " T # " T # " T #  h i X (Ht + Gt R)2 X R 2 G2 + H 2 X at 1−β t t E ≤E R ≤ E R = O E S . T β β 2Vt St−1 St−1 t=1 t=1 t=1 42 Stochastic-Constrained Stochastic Optimization with Markovian Data h i h i By Jensen’s inequality, we have E ST1−β ≤ E [ST ]1−β and E STβ ≤ E [ST ]β . The fourth term can be bounded as follows. Note that # " T   T X Qt X Qt gt (x) = Et−1 [gt (x)] . E E Vt Vt t=1 t=1 We have two upper bounds on Et−1 [gt (x)], one given in (4) and the other one due to Lemma 16. Let m(T ) be the minimum of the two upper bounds, i.e., n o 1/2 1/4 m(T ) = min C(T )τmix T −1 , D(T )τmix T −1/2 . Then it follows that T X   Qt E Et−1 [gt (x)] Vt t=1   T X √ √ 4RD(T ) 1/4 1/2−β 1/2−β 1/4 1/4−β/2 + 2 2RT E[ST ] + τmix T ≤ m(T ) 4 2RE[ST ] δβ t=1  T   √  4RD(T )2 X √ 1/2 −1 1/2 −β 1/2−β 1/4 1/4−β/2 4 2RE[ST ] C(T )τmix T + τmix T ≤ + 2 2RT E[ST ] δβ t=1   1/2 1/2 1/2 = Õ τmix E[ST ]1/2−β + τmix T 1/4 E[ST ]1/4−β/2 + τmix T 1−β where the first inequality follows from (4) and Lemmas 16 hand 17. Combining the bounds i PT PT on the four terms, we have proved the desired bound on E t=1 ft (xt ) − t=1 ft (x) . Proof [Proof of the second part of Lemma 6] By Lemma 22, we have T X gt (xt ) ≤ QT +1 + t=1 T X Gt t=1 2αt (Vt Ft + Qt Gt ). Moreover, by Lemma 18, we have   1/2 E[QT +1 ] ≤ Õ E[ST ]1/2 + T 1/4 E[ST ]β/2+1/4 + τmix . Next it follows from Corollary 31 that " T # " T β−1 #  h i   X Vt Ft Gt X St−1 at E ≤E = O E STβ = O E [ST ]β . 2αt 2 t=1 t=1 Furthermore, # " # " T    T 2 G2 X G2 X R I +δ 1 ST − I − δ t t E Qt ≤ E max Qt · ≤E + log max Qt 2αt 2St−1 2δ 2 δ t∈[T ] t∈[T ] t=1 t=1 43 Yeongjong Kim and Dabeen Lee where the second inequality is implied by Lemma 30. Here, the last term can be upper bounded using the AM-GM inequality as follows.   I +δ 1 ST − I − δ max Qt + log 2δ 2 δ t∈[T ]   maxt∈[T ] Q2t 1 I +δ 1 ST 2 ≤ + + log (Iτmix )β/2+1/4 T β/2+1/2 . β/2+1/4 β/2+1/2 2 2δ 2 δ 2(Iτmix ) T Let s = argmaxt∈[T ] Qt . Then by (5) and (7), s−1 h h ii X  2 √ β+1/2 +2 E Qs ≤ 4E[Ss−1 ] + 2 sE[Ss−1 ] E Qt Et−1 gt (xt ) + ∇gt (xt )⊤ (x − xt ) √ ≤ 4E[ST ] + 2 T E[ST ]β+1/2 + 2 t=1 s−1 X E[Qt Et−1 [gt (x)]]. t=1 Moreover, s−1 X E[Qt Et−1 [gt (x)]] t=1 1/2 ≤ C(T )τmix T −1 s−1 X E[Qt ] t=1 T X ! √ √ 5 2 1/2 1/4 β/2+1/4 1/2 (t − 1) E[ST ] + 2C(T )τmix 4 2E[ST ] + 3 t=1 ! √ √ 4 2 5/4 1/2 −1 1/2 1/2 β/2+1/4 ≤ C(T )τmix T 4 2T E[ST ] + T E[ST ] + 2C(T )τmix T 3   1/2 1/2 = Õ τmix E[ST ]1/2 + τmix T 1/4 E[ST ]β/2+1/4 + τmix , 1/2 ≤ C(T )τmix T −1 where the first inequality follows from (4), the second inequality follows from Lemma 18, and the third inequality follows from Corollary 29. Thus, we get " T # X E[ST ] + T 1/2 E[ST ]β+1/2 1/2 E gt (xt ) = Õ E[ST ]1/2 + T 1/4 E[ST ]β/2+1/4 + τmix + β/2+1/4 β/2+1/2 τmix T t=1 ! 1/2 τmix + E[ST ]1/2 + E[ST ]β/2+1/4 2 β/2−1/4 β/2+1/2 + + (log ST ) τmix T , β/2+1/4 β/2+1/2 τmix T as required. Proof [Proof of Theorem 8] As f¯ is convex, 1 f¯(x̄T ) − f¯(x# ) ≤ T T  X t=1 44  f¯(xt ) − f¯(x# ) . Stochastic-Constrained Stochastic Optimization with Markovian Data We can decompose the right-hand side as follows f¯(xt ) − f¯(x# ) f¯(xt ) − ft (xt ) ft (xt ) − ft (x# ) ft (x# ) − f¯(x# ) = + + . 
T T T T Here, Jensen’s inequality and Lemma 5 imply that q    1 1  ¯ 1/2 E |f (xt ) − ft (xt )| ≤ E |f¯(xt ) − ft (xt )|2 = Õ(τmix T −2 ), T Tq i   1 h ¯ # 1 1/2 E |f¯(x# ) − ft (x# )|2 = Õ(τmix T −2 ). E |f (x ) − ft (x# )| ≤ T T Moreover, we havePderived Proposition 7 which provides an upper bound on the expectation P of Tt=1 ft (xt ) − Tt=1 ft (x# ). Consequently, h i   E f¯(x̄T ) − f¯(x# ) ≤ Õ (τmix )1−β T −β . For the second part, Jensen’s inequality and Lemma 5 imply that 1 1p 1/2 E [|ḡ(xt ) − gt (xt )|] ≤ E [|ḡ(xt ) − gt (xt )|2 ] = Õ(τmix T −2 ) T T As a result, T E[ḡ(x̄T )] ≤ 1X E[ḡ(xt )] T 1 = T t=1 T X T 1X E[ḡ(xt ) − gt (xt )] + E[gt (xt )] T t=1 t=1  1/2 −2 β/2+1/4 β/2−1/2 3/4−β/2 −β/2−1/2 = Õ τmix T + τmix T + τmix T    β/2+1/4 β/2−1/2 3/4−β/2 −β/2−1/2 , = Õ τmix T + τmix T where the first inequality holds because ḡ is convex and the second equality follows from Proposition 7. Acknowledgments and Disclosure of Funding This research is supported, in part, by KAIST Starting Fund (KAIST-G04220016), FOUR Brain Korea 21 Program (NRF-5199990113928), and National Research Foundation of Korea (NRF-2022M3J6A1063021). 45 Yeongjong Kim and Dabeen Lee Algorithm 3 Adaptive Drift-Plus-Penalty Initialize: Initial iterates x1 ∈ X , Q1 = 0. for t = 1 to T do Observe ft and gt . Set penalty parameter Vt and step size parameter αt such that 0 ≤ Vt−1 ≤ Vt , 0 ≤ αt−1 ≤ αt , and 0 ≤ αt−1 /Vt−1 ≤ αt /Vt . Primal update: Set xt+1 as n o xt+1 = argmin (Vt ∇ft (xt ) + Qt ∇gt (xt ))⊤ x + αt D(x, xt ) x∈X Dual update: Set Qt+1 as h i Qt+1 = Qt + gt (xt ) + ∇gt (xt )⊤ (xt+1 − xt ) + end for Appendix A. Analysis of Adaptive Drift-Plus-Penalty Algorithm 3 is a general template for adaptive variants of the drift-plus-penalty algorithm whose parameters Vt and αt satisfy that {Vt }Tt=1 , {αt }Tt=1 , and {αt /Vt }Tt=1 are non-decreasing sequences of non-negative numbers. In this section, we analyze the general template of DPP given by Algorithm 3, based on which we deduce performance guarantees on Algorithms 1 and 2. Recall that ∆t = (Q2t+1 − Q2t )/2 is the Lyapunov drift term. The following lemma provides a bound on the drift term. Lemma 19 For t ≥ 1,   1 ∆t ≤ Qt gt (xt ) + ∇gt (xt )⊤ (xt+1 − xt ) + (Ht + Gt R)2 . 2  Proof As Qt+1 = max Qt + gt (xt ) + ∇gt (xt )⊤ (xt+1 − xt ), 0 , we have that Q2t+1 ≤ 2 Qt + gt (xt ) + ∇gt (xt )⊤ (xt+1 − xt ) . Hence, it follows that Q2t+1 Q2t − 2 2   1 2 ≤ Qt gt (xt ) + ∇gt (xt )⊤ (xt+1 − xt ) + gt (xt ) + ∇gt (xt )⊤ (xt+1 − xt ) 2   1 ⊤ ≤ Qt gt (xt ) + ∇gt (xt ) (xt+1 − xt ) + (Ht + Gt R)2 2 ∆t = where the last inequality is from Assumption 2. Since our drift-plus-penalty algorithm is a mirror descent version, we need the following lemma, which is obtained by substituting y = xt , x∗ = xt+1 , and f (x) = (Vt ∇ft (xt ) + Qt ∇gt (xt ))⊤ x into (Wei et al., 2020, Lemma 2.1). 46 Stochastic-Constrained Stochastic Optimization with Markovian Data Lemma 20 (Wei et al., 2020, Equation (22)) For any x ∈ X and t ≥ 1, (Vt ∇ft (xt ) + Qt ∇gt (xt ))⊤ (xt+1 − xt ) + αt D(xt+1 , xt ) ≤ (Vt ∇ft (xt ) + Qt ∇gt (xt ))⊤ (x − xt ) + αt D(x, xt ) − αt D(x, xt+1 ). Recall that ft and gt for the known mixing time case in Section 3 correspond to one sample and are assumed to be convex. In contrast, ft and gt for the unknown mixing time setting in Section 4 come from the MLMC estimation scheme with multiple data samples and thus are not necessarily convex. Nevertheless, we use the fact that Et−1 [ft ] and Et−1 [gt ] are convex. Based on this, we deduce the following lemma. 
Lemma 21 Suppose that Et−1 [ft ] = Et−1 [fˆt ] and Et−1 [gt ] = Et−1 [ĝt ] where fˆt and ĝt are convex functions and that Et−1 [∇ft ] = Et−1 [∇fˆt ] and Et−1 [∇gt ] = Et−1 [∇ĝt ]. Then for any x ∈ X, " T # " T #   T X X X Qt αT 2 E ft (xt ) − ft (x) ≤ E gt (x) + E R V VT t=1 t=1 t=1 t # " T # " T X (Ht + Gt R)2 X Vt F 2 1 t + E . +E 4αt 2 Vt t=1 t=1 Proof Dividing both sides of the inequality given in Theorem 20 by Vt and addding ft (xt ) + Qt gt (xt )/Vt to both sides, we get  ⊤ Qt Qt αt ft (xt ) + gt (xt ) + ∇ft (xt ) + ∇gt (xt ) (xt+1 − xt ) + D(xt+1 , xt ) Vt Vt Vt  ⊤ Qt Qt αt αt ≤ ft (xt ) + gt (xt ) + ∇ft (xt ) + ∇gt (xt ) (x − xt ) + D(x, xt ) − D(x, xt+1 ). Vt Vt Vt Vt Here, the left-hand side is bounded below by ft (xt ) + ∇ft (xt )⊤ (xt+1 − xt ) + αt ∆t 1 D(xt+1 , xt ) + − (Ht + Gt R)2 Vt Vt 2Vt by Lemma 19. Moreover, we have αt αt ∇ft (xt )⊤ (xt+1 − xt ) + D(xt+1 , xt ) ≥ −Ft ∥xt+1 − xt ∥ + ∥xt+1 − xt ∥2 Vt Vt   2 Vt Ft αt Vt Ft 2 =− + ∥xt+1 − xt ∥ − 4αt Vt 2αt 2 Vt Ft ≥− 4αt where the first inequality holds by 2-strong convexity of Φ with respect to ∥ · ∥. Hence, we deduce that Vt Ft2 ∆t 1 + − (Ht + Gt R)2 4αt Vt 2Vt  ⊤ Qt αt αt Qt (x − xt ) + D(x, xt ) − D(x, xt+1 ). ≤ ft (xt ) + gt (xt ) + ∇ft (xt ) + ∇gt (xt ) Vt Vt Vt Vt ft (xt ) − 47 Yeongjong Kim and Dabeen Lee For the right-hand side of this inequality, consider T X αt   T X α1 αT αt αt−1 (D(x, xt ) − D(x, xt+1 )) = D(x, xt ) + − − D(x, xt+1 ) D(x, xt ) V V1 Vt Vt−1 VT t=1 t t=2   T α1 2 X 2 αt αt−1 ≤ R + R − V1 Vt Vt−1 t=2 αT 2 = R VT where the inequality holds since {αt /Vt }Tt=1 is a non-decreasing sequence. Furthermore, T X ∆t Vt t=1 T T Q2 1X 1 2 Q2 1X 2 (Qt+1 − Q2t ) = − 1 + T +1 + Qt 2 Vt 2V1 2VT 2 = t=1  t=2 1 Vt−1 − 1 Vt  ≥− Q21 = 0, 2V1 where the inequality holds since the sequence {Vt }Tt=1 is non-negative and non-decreasing. Combining these inequalities, we deduce that T X ft (xt ) − t=1 ≤ T  X ft (xt ) + ∇ft (xt )⊤ (x − xt )  t=1 T X Q  t=1 Vt t (10) T T t=1 t=1 αT 2 X Vt Ft2 1 X (Ht + Gt R)2 gt (xt ) + ∇gt (xt )⊤ (x − xt ) + R + + . VT 4αt 2 Vt  Recall that Et−1 [ft ] = Et−1 [fˆt ] and Et−1 [gt ] = Et−1 [ĝt ] where fˆt and ĝt are convex and that Et−1 [∇ft ] = Et−1 [∇fˆt ] and Et−1 [∇gt ] = Et−1 [∇ĝt ]. Then it follows that Et−1 [ft (xt )] + Et−1 [∇ft (xt )]⊤ (x − xt ) = Et−1 [fˆt (xt )] + Et−1 [∇fˆt (xt )]⊤ (x − xt ) ≤ Et−1 [fˆt (x)] = Et−1 [ft (x)] where the second inequality holds because fˆt is convex and xt is Ft−1 -measurable. Likewise, we deduce that Et−1 [gt (xt )] + Et−1 [∇gt (xt )]⊤ (x − xt ) ≤ Et−1 [gt (x)]. Taking the expectations of both sides of (10), we obtain the inequality of this lemma, as required. Next, we state a lemma that will be useful to provide an upper bound on the constraint violation. Lemma 22 Algorithm 3 achieves ∥xt+1 − xt ∥ ≤ 1 (Vt Ft + Qt Gt ), 2αt T X gt (xt ) ≤ QT +1 + t=1 48 T X Gt t=1 2αt (Vt Ft + Qt Gt ). Stochastic-Constrained Stochastic Optimization with Markovian Data Proof As Qt+1 ≥ Qt + gt (xt ) + ∇gt (xt )⊤ (xt+1 − xt ), it follows that gt (xt ) ≤ Qt+1 − Qt − ∇gt (xt )⊤ (xt+1 − xt ) ≤ Qt+1 − Qt + Gt ∥xt+1 − xt ∥. On the other hand, if we set x = xt for the inequaity of Lemma 20, we get ⊤ αt D(xt+1 , xt ) + αt D(xt , xt+1 ) ≤ Vt ∇ft (xt ) + Qt ∇gt (xt ) (xt − xt+1 ). Here, the left-hand side is greater or equal to 2αt ∥xt+1 − xt ∥2 while the right-hand side is less or equal to (Vt Ft + Qt Gt )∥xt+1 − xt ∥. Therefore, it follows that ∥xt+1 − xt ∥ ≤ 1 (Vt Ft + Qt Gt ), 2αt (11) which implies T X t=1 gt (xt ) ≤ QT +1 + T X Gt t=1 2αt (Vt Ft + Qt Gt ), as required. 
To bound the constraint violation, we still need to bound the virtual queue size QT +1 . We also need the following lemma. Lemma 23 For any x ∈ X ,   ∆t ≤ Ht2 + R2 G2t + Qt gt (xt ) + ∇gt (xt )⊤ (x − xt ) + Vt RFt + αt (D(x, xt ) − D(x, xt+1 ). Proof By Lemma 19,   1 ∆t ≤ Qt gt (xt ) + ∇gt (xt )⊤ (xt+1 − xt ) + (Ht + RGt )2   2 ⊤ ≤ Qt gt (xt ) + ∇gt (xt ) (xt+1 − xt ) + Ht2 + R2 G2t where the last inequality comes from the fact that (A + B)2 ≤ 2(A2 + B 2 ). Here, using Lemma 20, the right-hand side can be further bounded above as follows. Qt (gt (xt ) + ∇gt (xt )⊤ (xt+1 − xt )) + Ht2 + R2 G2t  ≤ Qt gt (xt ) + ∇gt (xt )⊤ (x − xt ) + Vt ∇ft (xt )⊤ (x − xt ) − Vt ∇ft (xt )⊤ (xt+1 − xt )  − αt D(xt+1 , xt ) + αt D(x, xt ) − D(x, xt+1 ) + Ht2 + R2 G2t . Moreover, it follows from the Cauchy-Schwarz inequality that Vt ∇ft (xt )⊤ (x − xt ) − Vt ∇ft (xt )⊤ (xt+1 − xt ) = Vt ∇ft (xt )⊤ (x − xt+1 ) ≤ Vt RFt by Cauchy-Schwarz inequality. Then we have proved the lemma, as desired. Based on Lemma 23, we may provide the following bound on the virtual queue size. 49 Yeongjong Kim and Dabeen Lee Lemma 24 For any x ∈ X , s ∈ [T ], v u s s X u X  Qs+1 ≤ t2 Ht2 + R2 G2t + Vt RFt + 2 Qt (gt (xt ) + ∇gt (xt )⊤ (x − xt )) + 2R2 αs . t=1 t=1 Proof From Lemma 23, we get s Q2s+1 X = ∆t 2 t=1 ≤ s X    Ht2 + R2 G2t + Qt gt (xt ) + ∇gt (xt )⊤ (x − xt ) + Vt RFt t=1 + α1 D(x, x1 ) + s X D(x, xt )(αt − αt−1 ) − αs D(x, xt+1 ). t=2 Since αt is non-decreasing and D(x, xt ) ≤ R2 , it follows that α1 D(x, x1 ) + s X D(x, xt )(αt − αt−1 ) − αs D(x, xt+1 ) ≤ R2 αs . t=2 This implies the desired bound on Qs+1 . Appendix B. Sum of Sequences In this section, we consider some series of numbers and provide bounds on their partial sums ∞ to make Psour paper self-contained. Given a sequence {xt }t=1 of numbers, we use notation Xs := t=1 xt to denote its partial sums. Lemma 25 If f : R+ → R+ is continuous and non-increasing, then T X Z XT f (Xt )xt ≤ x1 f (X1 ) + Z XT f (x)dx ≤ X1 t=1 f (x)dx 0 for any nonnegative x1 , . . . , xT . Proof By considering the area between f (x) and the x-axis over the interval [Xt−1 , Xt ] of R Xt length xt , we have f (Xt )xt ≤ Xt−1 f (x)dx since f is non-increasing. Then it follows that T X f (Xt )xt = x1 f (X1 ) + t=1 T X Z XT f (Xt )xt ≤ x1 f (X1 ) + f (x)dx ≤ X1 t=2 Z XT f (x)dx, 0 as required. As a consequence of Lemma 25, we deduce the following list of bounds on series. 50 Stochastic-Constrained Stochastic Optimization with Markovian Data Corollary 26 (Auer et al., 2002) T X p x √ t ≤ 2 XT . Xt t=1 Corollary 27 T X xt t=1 Xt ≤ 1 + logX1 (XT ). Next we consider the following. Lemma 28 If f : R+ → R+ is continuous and non-decreasing, then T X Z XT +1 f (Xt )xt+1 ≤ f (x)dx X1 t=1 for any nonnegative x1 , . . . , xT . Proof By considering the area between f (x) and the x-axis over the interval [Xt , Xt+1 ] of RX length xt+1 , we have f (Xt )xt+1 ≤ Xtt+1 f (x)dx since f is non-decreasing. Then it follows that T X f (Xt )xt+1 = x2 f (X1 ) + t=1 T X Z XT +1 f (Xt )xt+1 ≤ x2 f (X1 ) + Z XT +1 f (x)dx ≤ X2 t=2 f (x)dx, X1 as required. Lemma 28 implies the following. Corollary 29 For q > 0, T X t=1 tq ≤  1 (T + 1)q+1 − 1 . q+1 Moreover, when each xt is bounded by some fixed constant, we can deduce the following result. For ease of P notation, we start a sequence with x0 and, with abuse of notation, define partial sum Xs = st=0 xt with x0 = X0 = δ. Lemma 30 If 0 ≤ xt ≤ C for t = 0, . . . 
, T for some fixed constant C and f : R++ → R+ is continuous and non-increasing, then T X Z max{δ,XT −C} f (Xt−1 )xt ≤ Cf (δ) + f (x)dx. δ t=1 Proof If we consider the area between f (x) and the x-axis over the interval [Xt−1 , Xt ], we R Xt have f (Xt−1 )xt ≥ Xt−1 f (x)dx in which the inequality direction is the opposite of what we 51 Yeongjong Kim and Dabeen Lee want. However, since xt ≤ C, we can use the idea of translation by C in the x-axis direction in the following way. Let ( f (δ), x ∈ (−∞, δ], f˜(x) = f (x), x ∈ (δ, ∞). Then the graph of the translation f˜(x − C) is above the squares of height f (Xt−1 ) on the interval [Xt−1 , Xt ]. Thus, ( Z XT T X (XT − δ)f (δ) ≤ Cf (δ), if XT ≤ δ + C, ˜ R XT −C f (Xt−1 )xt ≤ f (x − C)dx ≤ Cf (δ) + δ f (x)dx, if XT > δ + C X0 t=1 which implies the desired statement of this lemma. As a corollary of Lemma 30 with f (x) = x−γ , we deduce the following inequality. Corollary 31 If xt ≤ C, 0 < γ ̸= 1, then T X −γ Xt−1 xt ≤ Cδ −γ + t=1  1 max{δ, XT − C}1−γ − δ 1−γ . 1−γ Appendix C. Online Convex Optimization with Adversarial Losses and Constraints In this section, we show the performance of Algorithm 2, which is an AdaGrad-style variant of the drift-plus-penalty algorithm for online convex optimization with adversarial loss and constraint functions. To deal with adversarial constraint functions, we set the parameters Vt and αt differently as follows. F2 at := t + R2 G2t + Ht2 , 4 St := t X as s=1 and then set parameters as Stβ St , αt = 2 , (12) R R for some 0 < β ≤ 1/2. One distinction from (4) is the presence of additional parameter δ, and another difference is that Vt and αt are defined with St , not St−1 . We now assume that the convex constraint functions g1 , . . . , gT as well as the convex loss functions f1 , . . . , fT are chosen adversarially. Following Neely and Yu (2017), we set the benchmark x◦ as an optimal solution to Vt = min T X ft (x) subject to gt (x) ≤ 0 for t = 1, . . . , T . t=1 Then the goal is to obtain upper bounds on Regret(T ) = T X t=1 ft (xt ) − T X ◦ ft (x ), t=1 Violation(T ) = T X t=1 52 gt (xt ) Stochastic-Constrained Stochastic Optimization with Markovian Data in sublinear orders of T by properly choosing our inputs xt . For this, we need the following theorem. Theorem 32 Algorithm 2 with Vt and α set as in (12) guarantees that     1/2 β/2+1/4 Regret(T ) = O ST1−β , Violation(T ) = O ST + T 1/4 ST . Proof Applying Lemma 21 with fˆt = ft , ĝt = gt , and x = x◦ , we obtain T T t=1 t=1 αT 2 X Vt Ft2 1 X (Ht + Gt R)2 Regret(T ) ≤ R + + . VT 4αt 2 Vt The first term of the right hand side is RST1−β = O(ST1−β ), and the second term satisfies T X Vt F 2 t t=1 4αt = T T t=1 t=1   X β−1 R R X β−1 2 St at ≤ STβ = O STβ , S t Ft ≤ R 4 β where the last inequality follows from Lemma 25. The third term satisfies T X (Ht + Gt R)2 t=1 2Vt ≤R T X R 2 G2 + H 2 t t=1 t Stβ ≤R T X at ≤ β t=1 St   R ST1−β = O ST1−β . 1−β Combining these two inequalities, we get   Regret(T ) = O ST1−β . Next, we prove the second part of the theorem. By Lemma 22, we have T X gt (xt ) ≤ QT +1 + t=1 T X Gt t=1 2αt (Vt Ft + Qt Gt ). If we apply Lemma 24 with x = x◦ , we obtain v u T u X  QT +1 ≤ t2 Ht2 + R2 G2t + Vt RFt + 2R2 αT t=1 v v u T u T u X u X p ≤ t2 RVt Ft + t2 (R2 G2t + Ht2 ) + 2R2 αT . | {z } t=1 t=1 (c) | {z } | {z } (a) (b) ◦ where the√first inequality follows from the convexity of √ gt and gt (x ) ≤ 0. Here, term (c) is equal to 2ST , and term (b) is less than or equal to 2ST . 
For term (a), we have v u T T T u X X X √ β+1/2 β β β (a)2 = 2 St Ft ≤ 2ST Ft ≤ 2ST tT Ft2 ≤ 4 T ST , t=1 t=1 t=1 53 Yeongjong Kim and Dabeen Lee where the second inequality holds by the power mean inequality. Thus,   √ 1/2 β/2+1/4 1/2 β/2+1/4 QT +1 ≤ (a) + (b) + (c) ≤ 2 2ST + 2T 1/4 ST = O ST + T 1/4 ST . (13) We also have that T X Vt Ft Gt t=1 2αt = T X Stβ−1 RFt Gt /2 t=1 ≤ ≤ T X t=1 T X Stβ−1 (Ft2 /4 + R2 G2t )/2 Stβ−1 at /2 ≤ t=1 STβ 2β   = O STβ , where the last inequality follows from Lemma 25. Lastly, T X G2 T T X √ G2t 1/2 X G2t 1/4 β/2+1/4 Qt ≤ 2 St + T St 2αt αt αt t t=1 t=1 ≤ √ 2 t=1 T X at + T 1/4 1/2 T X at 3/4−β/2 t=1 St t=1 St p T 1/4 β/2+1/4 ≤ 2 2ST + S β/2 + 1/4 T   1/2 β/2+1/4 = O ST + T 1/4 ST , where the first inequality follows from (13) and the last inequality follows from Lemma 25. Combining the results, we get   1/2 β/2+1/4 Violation(T ) = O ST + T 1/4 ST , as required. Appendix D. Proof of the Time-Varying Drift Lemma In this section, we prove Lemma 14 for the case of time-varying parameter θt0 (t). We closely follow the proof of (Yu et al., 2017, Lemma 5). 2 ) and ρ = 1 − ζ 2 /(8δ 2 ) = 1 − rt ζ/2. Then Lemma 33 Let r = ζ/(4t0 δmax 0 max h i ert0 δmax E erZ(t) ≤ erθ(t) 1−ρ for all t ≥ 0. 54 Stochastic-Constrained Stochastic Optimization with Markovian Data Proof Since 0 < ζ < δmax , we have 0 < ρ < 1 < erδmax . Define η(t) = Z(t + t0 ) − Z(t). Note that |η(t)| ≤ t0 δmax for all t ≥ 0 which implies that |rη(t)| ≤ ζ/(4δmax ) ≤ 1. Then,     1 rZ(t+t0 ) rZ(t) rη(t) rZ(t) 2 2 2 rZ(t) 1 + rη(t) + rt0 ζ e =e e ≤e 1 + rη(t) + 2r t0 δmax = e (14) 2 where the inequality follows from the fact that ex ≤ 1 + x + 2x2 for |x| ≤ 1, |rη(t)| ≤ 1, and 2 ). |η(t)| ≤ t0 δmax while the equality follows by substituting r = ζ/(4t0 δmax Next, we consider the cases Z(t) ≥ θ(t) and Z(t) < θ(t), separately. First, consider the case Z(t) ≥ θ(t). Taking the conditional expectation of each side of (14) gives us the following.   h i 1 rZ(t+t0 ) rZ(t) E e | Z(t) ≤ E e (1 + rη(t) + rt0 ζ) | Z(t) 2   1 ≤ erZ(t) 1 − rt0 ζ + rt0 ζ 2   rt0 ζ = erZ(t) 1 − 2 = ρerZ(t) where the inequality follows from the fact that E[Z(t + t0 ) − Z(t)|F(t)] ≤ −t0 ζ when Z(t) ≥ θ(t) while the second equality follows from the fact that ρ = 1 − rt0 ζ/2. Likewise, for the case Z(t) < θ(t), we deduce that h i h i h i E erZ(t+t0 ) | Z(t) = E erZ(t) erη(t) | Z(t) = erZ(t) E erη(t) | Z(t) ≤ ert0 δmax erZ(t) , where the inequality follows from the fact that η(t) ≤ t0 δmax . Putting the two cases together, we deduce that h i E erZ(t+t0 ) h i h i = P(Z(t) ≥ θ(t))E erZ(t+t0 ) | Z(t) ≥ θ(t) + P(Z(t) < θ(t))E erZ(t+t0 ) | Z(t) < θ(t) h i h i ≤ ρE erZ(t) | Z(t) ≥ θ(t) P(Z(t) ≥ θ(t)) + ert0 δmax E erZ(t) | Z(t) < θ(t) P(Z(t) < θ(t)) h i   h i = ρE erZ(t) + ert0 δmax − ρ E erZ(t) | Z(t) < θ(t) P(Z(t) < θ(t)) h i   ≤ ρE erZ(t) + ert0 δmax − ρ erθ(t) h i ≤ ρE erZ(t) + ert0 δmax erθ(t) where the first inequality follows from the analysis of the two separate cases and the second inequality follows from the fact that ert0 δmax > ρ. Then we argue by induction to prove the statement of this lemma. We first consider the base case t ∈ {0, 1, . . . , t0 }. Since Z(t) ≤ tδmax for all t ≥ 0, it follows that E[erZ(t) ] ≤ ertδmax ≤ ert0 δmax ≤ 55 ert0 δmax rθ(t) e 1−ρ Yeongjong Kim and Dabeen Lee for all t ∈ {1, . . . , t0 }, where the last inequality follows because erθ(t) /(1 − ρ) ≥ 1. Next we assume that the inequality holds for all t ∈ {0, 1, . . . 
, s} with some s ≥ t0 and consider iteration t = s + 1.Note that h i h i E erZ(s+1) ≤ ρE erZ(s+1−t0 ) + ert0 δmax erθ(s+1−t0 ) ≤ρ ert0 δmax rθ(s+1−t0 ) e + ert0 δmax erθ(s+1−t0 ) 1−ρ ert0 δmax rθ(s+1−t0 ) e 1−ρ ert0 δmax rθ(s+1) e ≤ 1−ρ ≤ where the second inequality comes from the induction hypothesis by noting that 0 ≤ τ + 1 − t0 ≤ τ while the last inequality follows from the fact that θ(t) is non-decreasing. Based on this lemma, we prove Lemma 14. Proof [Proof of Lemma 14] Note that erx is convex in x when r > 0. By Jensen’s inequality, erE[Z(t)] ≤ E[erZ(t) ] ≤ er(θ(t)+t0 δmax ) 1−ρ where the inequality is implied by Lemma 33. Taking logarithm on both sides and dividing by r yields that E[Z(t)] ≤ θ(t) + t0 δmax + 2  1   8δmax  1 4δ 2 log = θ(t) + t0 δmax + t0 max log , 2 r 1−ρ ζ ζ where the equality holds because recalling that r = 4t0 δζ2 2 max and ρ = 1 − 8δζ2 max . Appendix E. Properties of the MLMC Estimator In this section, we prove Lemmas 4 and 5. Lemma 34 (Dorfman and Levy, 2022, Lemma A.6) Let h : X × S → Rk for some k ≥ 1. Suppose that there exists some constant L > 0 such that ∥h(x, ξ)∥ ≤ L for every (x, ξ) ∈ X × S, where the norm ∥ · ∥ satisfies ∥ · ∥ ≤ η∥ · ∥2 for some η > 0. We denote by N h̄(x) := Eξ∼µ [h(x, ξ)], hN t (x) := 1 X (i) h(x, ξt ). N i=1 56 Stochastic-Constrained Stochastic Optimization with Markovian Data Suppose that x is Ft−1 measurable and N ≤ A for some A ∈ N. If 2τmix ⌈2 log A⌉ ≤ N , then r  p τmix ⌈2 log A⌉  1 + log(τmix ⌈2 log A⌉N ) N 6Lητmix ⌈2 log A⌉ 4Lη + + , N N   576L2 η 2 τmix ⌈2 log A⌉ 2 Et−1 ∥hN ≤ (1 + log(τmix ⌈2 log A⌉N )) t (x) − h̄(x)∥ N 2 ⌈2 log A⌉2 72L2 η 2 τmix 8L2 η 2 + + . N2 N Et−1 ∥hN t (x) − h̄(x)∥ ≤ 12Lη   We point out that (Dorfman and Levy, 2022, Lemma A.6) was originally stated for the ℓ2 norm, but the statement holds for any norm over a finite-dimensional vector space assuming that ∥ · ∥ ≤ η∥ · ∥2 for some fixed constant η > 0 and O hides the dependence on η. In the following lemma, we simplify the upper bounds of Lemma 34. Lemma 35 Let h : X × S → Rk for some k ≥ 1. Suppose that there exists some constant L > 0 such that ∥h(x, ξ)∥ ≤ L for every (x, ξ) ∈ X × S, where the norm ∥ · ∥ satisfies ∥ · ∥ ≤ η∥ · ∥2 for some η > 0. We denote by N h̄(x) := Eξ∼µ [h(x, ξ)], hN t (x) := 1 X (i) h(x, ξt ). N i=1 Suppose that x is Ft−1 measurable and N ≤ A for some A ∈ N. Then r  p τmix ⌈2 log A⌉  2 + log(τmix ⌈2 log A⌉N ) , N 2 max(1, η)2 τ  N  576L ⌈2 log A⌉ mix Et−1 ∥ht (x) − h̄(x)∥2 ≤ (2 + log(τmix ⌈2 log A⌉N )). N   Et−1 ∥hN t (x) − h̄(x)∥ ≤ 12L max(1, η) Proof First, we consider the case where 2τmix ⌈2 log A⌉ > N . Then by the triangle inequality, r  N  2τmix ⌈2 log A⌉ Et−1 ∥ht (x) − h̄(x)∥ ≤ 2L < 2L rN  p τmix ⌈2 log A⌉  ≤ 12L max(1, η) 2 + log(τmix ⌈2 log A⌉N ) , N  N  2τmix ⌈2 log A⌉ Et−1 ∥ht (x) − h̄(x)∥2 ≤ 4L2 < 4L2 N 576L2 max(1, η)2 τmix ⌈2 log A⌉ ≤ (2 + log(τmix ⌈2 log A⌉N )). N 57 Yeongjong Kim and Dabeen Lee Next, we consider the case where 2τmix ⌈2 log A⌉ ≤ N . By Lemma 34, r  p  N  τmix ⌈2 log A⌉  Et−1 ∥ht (x) − h̄(x)∥ ≤ 12Lη 1 + log(τmix ⌈2 log A⌉N ) N 6Lητmix ⌈2 log A⌉ 4Lη + + , N N   576L2 η 2 τmix ⌈2 log A⌉ 2 (1 + log(τmix ⌈2 log A⌉N )) Et−1 ∥hN ≤ t (x) − h̄(x)∥ N 2 ⌈2 log A⌉2 72L2 η 2 τmix 8L2 η 2 + + . N2 N Since τmix ⌈2 log A⌉/N ≤ 21 , we deduce that 6Lητmix ⌈2 log A⌉ 4Lη 12Lητmix ⌈2 log A⌉ + ≤ ≤ 12Lη N N N r τmix ⌈2 log A⌉ N and that 2 ⌈2 log A⌉2 72L2 η 2 τmix 8L2 η 2 72L2 η 2 τmix ⌈2 log A⌉ 8L2 η 2 τmix ⌈2 log A⌉ + ≤ + N2 N N N 576L2 η 2 τmix ⌈2 log A⌉ ≤ . 
N As a result, we obtain r  p τmix ⌈2 log A⌉  2 + log(τmix ⌈2 log A⌉N ) Nr  p τmix ⌈2 log A⌉  2 + log(τmix ⌈2 log A⌉N ) , ≤ 12L max(1, η) N 2 2  N  576L η τmix ⌈2 log A⌉ (2 + log(τmix ⌈2 log A⌉N )) Et−1 ∥ht (x) − h̄(x)∥2 ≤ N 576L2 max(1, η)2 τmix ⌈2 log A⌉ ≤ (2 + log(τmix ⌈2 log A⌉N )), N   Et−1 ∥hN t (x) − h̄(x)∥ ≤ 12Lη as required. Proof [Proof of Lemma 4] We first argue that h j i max Et−1 [ft (x)] = Et−1 ft2 (x) for any x and t. Note that max h j i h j i  1  jX j−1 max Et−1 [ft ] = Et−1 ft + P (Jt = j) 2j Et−1 ft2 − ft2 = Et−1 ft2 j=1 as P (Jt = j) = 1/2j Similarly, we can show that h j i h i jmax max , Et−1 [gt ] = Et−1 gt2 , Et−1 [∇ft ] = Et−1 ∇ft2 58 h i jmax Et−1 [∇gt ] = Et−1 ∇gt2 Stochastic-Constrained Stochastic Optimization with Markovian Data holds for any x and t. For the second part, we have i h i h 2 + 2H 2 E |gt |2 ≤ 2E gt − gt1 since ∥x + y∥2 ≤ (∥x∥ + ∥y∥)2 ≤ 2∥x∥2 + 2∥y∥2 and gt1 ≤ H. Note that   jX   max max i jX h j j−1 2 j j−1 2 2 = P(Jt = j)22j E gt2 − gt2 = 2j E gt2 − gt2 E gt − gt1 j=1 j=1 because P (Jt = j) = 1/2j . Here we can bound the right-most side based on the following.       τ  2 2 2 mix 2j 2j−1 2j 2j−1 E gt − gt ≤ 2E gt − ḡ + 2E gt − ḡ = Õ 2j where the last inequality follows from Lemma 35. Then it follows that i h 2 ≤ Õ(jmax τmix ) = Õ(τmix ) E gt − gt1 where the last equality holds because jmax = O(log T ). For the last part, E[Nt ] = 1 + jX max P(Jt = j)(2j − 1) ≤ 1 + jmax ≤ 1 + 2 log2 T, j=1 as required. Proof [Proof of Lemma 5] By Assumption 2, we can apply Lemma 35 to gt , ∇gt , ∇ft . Assumptions 1, 2 implies that |ft (xt )| ≤ J for some J > 0. Thus, we can apply Lemma 35 to ft as well. Let L = max(F, G, H, J), A = T 2 , and ∥ · ∥∗ ≤ η∥ · ∥2 for some η > 0. By Lemma 35, the statement follows for q C(T ) = 576L2 max(1, η)2 ⌈4 log T ⌉(2 + log(τmix ⌈4 log T ⌉2jmax )). Since 2jmax = O(T 2 ), the order of C(T ) follows. References Z. Akhtar, A. S. Bedi, and K. Rajawat. Conservative stochastic optimization with expectation constraints. IEEE Transactions on Signal Processing, 69:3190–3205, 2021. doi: 10.1109/ TSP.2021.3082467. C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan. An introduction to mcmc for machine learning. Machine Learning, 50(1):5–43, 2003. doi: 10.1023/A:1020281327116. URL https://doi.org/10.1023/A:1020281327116. 59 Yeongjong Kim and Dabeen Lee P. Auer, N. Cesa-Bianchi, and C. Gentile. Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences, 64(1):48–75, 2002. ISSN 0022-0000. doi: https://doi.org/10.1006/jcss.2001.1795. URL https://www.sciencedirect.com/ science/article/pii/S0022000001917957. G. Ayache, V. Dassari, and S. E. Rouayheb. Walk for learning: A random walk approach for federated learning from heterogeneous data. IEEE J.Sel. A. Commun., 41(4):929–940, apr 2023. ISSN 0733-8716. doi: 10.1109/JSAC.2023.3244250. URL https://doi.org/10. 1109/JSAC.2023.3244250. A. Benveniste, P. Priouret, and M. Métivier. Adaptive Algorithms and Stochastic Approximations. Springer-Verlag, Berlin, Heidelberg, 1990. ISBN 0387528946. D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1st edition, 1996. ISBN 1886529108. J. H. Blanchet and P. W. Glynn. Unbiased monte carlo for optimization and functions of expectations via multi-level randomization. In 2015 Winter Simulation Conference (WSC), pages 3656–3667, 2015. doi: 10.1109/WSC.2015.7408524. V. S. Borkar. Stochastic Approximation A Dynamical Systems Viewpoint. Texts and Readings in Mathematics ; 48. 
Hindustan Book Agency, Gurgaon, 1st ed. 2008. edition, 2008. ISBN 93-86279-38-X. L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018. doi: 10.1137/16M1080173. URL https: //doi.org/10.1137/16M1080173. J. Cao, R. Jiang, N. Abolfazli, E. Y. Hamedani, and A. Mokhtari. Projection-free methods for stochastic simple bilevel optimization with convex lower-level problem, 2023. L. E. Celis, D. Straszak, and N. K. Vishnoi. Ranking with Fairness Constraints. In I. Chatzigiannakis, C. Kaklamanis, D. Marx, and D. Sannella, editors, 45th International Colloquium on Automata, Languages, and Programming (ICALP 2018), volume 107 of Leibniz International Proceedings in Informatics (LIPIcs), pages 28:1–28:15, Dagstuhl, Germany, 2018. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik. ISBN 978-3-95977076-7. doi: 10.4230/LIPIcs.ICALP.2018.28. URL http://drops.dagstuhl.de/opus/ volltexte/2018/9032. L. E. Celis, L. Huang, V. Keswani, and N. K. Vishnoi. Classification with fairness constraints: A meta-algorithm with provable guarantees. In Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* ’19, page 319–328, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450361255. doi: 10.1145/3287560. 3287586. URL https://doi.org/10.1145/3287560.3287586. O. Chapelle, B. Schlkopf, and A. Zien. Semi-Supervised Learning. The MIT Press, 1st edition, 2010. ISBN 0262514125. 60 Stochastic-Constrained Stochastic Optimization with Markovian Data D. Dentcheva and A. Ruszczynski. Optimization with stochastic dominance constraints. SIAM Journal on Optimization, 14(2):548–566, 2003. doi: 10.1137/S1052623402420528. URL https://doi.org/10.1137/S1052623402420528. T. T. Doan. Finite-time analysis of markov gradient descent. IEEE Transactions on Automatic Control, 68(4):2140–2153, 2023. doi: 10.1109/TAC.2022.3172593. T. T. Doan, L. M. Nguyen, N. H. Pham, and J. Romberg. Convergence rates of accelerated markov gradient descent with applications in reinforcement learning, 2020. M. Donini, L. Oneto, S. Ben-David, J. Shawe-Taylor, and M. Pontil. Empirical risk minimization under fairness constraints. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, page 2796–2806, Red Hook, NY, USA, 2018. Curran Associates Inc. R. Dorfman and K. Y. Levy. Adapting to mixing time in stochastic optimization with Markovian data. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 5429–5446. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/dorfman22a.html. J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(61):2121–2159, 2011. URL http://jmlr.org/papers/v12/duchi11a.html. J. C. Duchi, A. Agarwal, M. Johansson, and M. I. Jordan. Ergodic mirror descent. SIAM Journal on Optimization, 22(4):1549–1578, 2012. doi: 10.1137/110836043. URL https: //doi.org/10.1137/110836043. Y. Ermoliev. stochastic quasigradient methods and their application to system optimization †. Stochastics, 9(1-2):1–36, 1983. doi: 10.1080/17442508308833246. M. Even. Stochastic gradient descent under Markovian sampling schemes. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. 
Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 9412–9439. PMLR, 23–29 Jul 2023. URL https: //proceedings.mlr.press/v202/even23a.html. J. Garcı́a, Fern, and o Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(42):1437–1480, 2015. URL http://jmlr.org/ papers/v16/garcia15a.html. M. B. Giles. Multilevel monte carlo methods. Acta Numerica, 24:259–328, 2015. doi: 10.1017/S096249291500001X. H. Guo, X. Liu, H. Wei, and L. Ying. Online convex optimization with hard constraints: Towards the best of two worlds and beyond. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=rwdpFgfVpvN. 61 Yeongjong Kim and Dabeen Lee H. Hendrikx. A principled framework for the design and analysis of token algorithms. In F. Ruiz, J. Dy, and J.-W. van de Meent, editors, Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pages 470–489. PMLR, 25–27 Apr 2023. URL https://proceedings. mlr.press/v206/hendrikx23a.html. A. Jalilzadeh, F. Yousefian, and M. Ebrahimi. Stochastic approximation for estimating the price of stability in stochastic nash games, 2023. R. Jenatton, J. Huang, and C. Archambeau. Adaptive algorithms for online convex optimization with long-term constraints. In M. F. Balcan and K. Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 402–411, New York, New York, USA, 20–22 Jun 2016. PMLR. URL https://proceedings.mlr.press/v48/jenatton16.html. B. Johansson, M. Rabi, and M. Johansson. A simple peer-to-peer algorithm for distributed optimization in sensor networks. In 2007 46th IEEE Conference on Decision and Control, pages 4705–4710, 2007. doi: 10.1109/CDC.2007.4434888. B. Johansson, A. Speranzon, M. Johansson, and K. H. Johansson. On decentralized negotiation of optimal consensus. Automatica, 44(4):1175–1179, 2008. ISSN 0005-1098. doi: https://doi.org/10.1016/j.automatica.2007.09.003. URL https://www.sciencedirect. com/science/article/pii/S0005109807003962. B. Johansson, M. Rabi, and M. Johansson. A randomized incremental subgradient method for distributed optimization in networked systems. SIAM Journal on Optimization, 20(3): 1157–1170, 2010. doi: 10.1137/08073038X. URL https://doi.org/10.1137/08073038X. A. Juditsky, A. Nemirovski, and C. Tauvel. Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems, 1(1):17–58, 2011. doi: 10.1287/10-SSY011. URL https://doi.org/10.1287/10-SSY011. S. Kowshik, D. Nagaraj, P. Jain, and P. Netrapalli. Streaming linear system identification with reverse experience replay. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 30140–30152. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/ paper_files/paper/2021/file/fd2c5e4680d9a01dba3aada5ece22270-Paper.pdf. G. Lan and Z. Zhou. Algorithms for stochastic optimization with function or expectation constraints. Computational Optimization and Applications, 76(2):461–498, 2020. doi: 10.1007/s10589-020-00179-x. URL https://doi.org/10.1007/s10589-020-00179-x. D. Lee, N. Ho-Nguyen, and D. Lee. 
D. Levin and Y. Peres. Markov Chains and Mixing Times. American Mathematical Society, 2017. ISBN 9781470429621. URL https://books.google.com/books?id=f208DwAAQBAJ.
K. Levy. Online to offline conversions, universality and adaptive minibatch sizes. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/ce5140df15d046a66883807d18d0264b-Paper.pdf.
Q. Lin, S. Nadarajah, N. Soheili, and T. Yang. A data efficient and feasible level set method for stochastic convex optimization with expectation constraints. Journal of Machine Learning Research, 21(143):1–45, 2020. URL http://jmlr.org/papers/v21/19-1022.html.
C. G. Lopes and A. H. Sayed. Incremental adaptive strategies over distributed networks. IEEE Transactions on Signal Processing, 55(8):4064–4077, 2007. doi: 10.1109/TSP.2007.896034.
M. Mahdavi, R. Jin, and T. Yang. Trading regret for efficiency: Online convex optimization with long term constraints. Journal of Machine Learning Research, 13(81):2503–2528, 2012. URL http://jmlr.org/papers/v13/mahdavi12a.html.
X. Mao, K. Yuan, Y. Hu, Y. Gu, A. H. Sayed, and W. Yin. Walkman: A communication-efficient random-walk algorithm for decentralized optimization. IEEE Transactions on Signal Processing, 68:2513–2528, 2020. doi: 10.1109/TSP.2020.2983167.
D. Nagaraj, X. Wu, G. Bresler, P. Jain, and P. Netrapalli. Least squares regression with Markovian data: Fundamental limits and algorithms. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 16666–16676. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/c22abfa379f38b5b0411bc11fa9bf92f-Paper.pdf.
M. J. Neely and H. Yu. Online convex optimization with time-varying constraints, 2017. URL https://arxiv.org/abs/1702.04783.
A. Nemirovski and A. Shapiro. Convex approximations of chance constrained programs. SIAM Journal on Optimization, 17(4):969–996, 2007. doi: 10.1137/050622328. URL https://doi.org/10.1137/050622328.
A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009. doi: 10.1137/070704277. URL https://doi.org/10.1137/070704277.
G. Pflug. Optimization of Stochastic Models: The Interface Between Simulation and Optimization. Kluwer Academic Publishers, 1996.
B. Poljak and J. Tsypkin. Robust identification. Automatica, 16(1):53–63, 1980. ISSN 0005-1098. doi: 10.1016/0005-1098(80)90086-2. URL https://www.sciencedirect.com/science/article/pii/0005109880900862.
B. Polyak. A general method of solving extremum problems. Doklady Akademii Nauk SSSR, 174(1):33–36, 1967.
M. Rabbat and R. Nowak. Distributed optimization in sensor networks. In Third International Symposium on Information Processing in Sensor Networks, 2004. IPSN 2004, pages 20–27, 2004. doi: 10.1145/984622.984626.
S. S. Ram, A. Nedić, and V. V. Veeravalli. Incremental stochastic subgradient algorithms for convex optimization. SIAM Journal on Optimization, 20(2):691–717, 2009a. doi: 10.1137/080726380. URL https://doi.org/10.1137/080726380.
P. Rigollet and X. Tong. Neyman-Pearson classification, convexity and stochastic constraints. Journal of Machine Learning Research, 12(86):2831–2855, 2011. URL http://jmlr.org/papers/v12/rigollet11a.html.
H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951. doi: 10.1214/aoms/1177729586. URL https://doi.org/10.1214/aoms/1177729586.
R. Rockafellar and S. Uryasev. Optimization of conditional value-at-risk. Journal of Risk, 2:21–41, 2000.
A. Roy, K. Balasubramanian, and S. Ghadimi. Constrained stochastic nonconvex optimization with state-dependent Markov data. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 23256–23270. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/93b11b5128ced940120f41ce9b216f39-Paper-Conference.pdf.
A. Ruszczyński and W. Syski. A method of aggregate stochastic subgradients with on-line stepsize rules for convex stochastic programming problems, pages 113–131. Springer Berlin Heidelberg, Berlin, Heidelberg, 1986. ISBN 978-3-642-00927-3. doi: 10.1007/BFb0121128. URL https://doi.org/10.1007/BFb0121128.
T. Sarkar and A. Rakhlin. Near optimal finite time identification of arbitrary linear dynamical systems. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5610–5618. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.press/v97/sarkar19a.html.
C. Scott and R. Nowak. A Neyman-Pearson approach to statistical learning. IEEE Transactions on Information Theory, 51(11):3806–3819, 2005. doi: 10.1109/TIT.2005.856955.
T. Sun, Y. Sun, and W. Yin. On Markov chain gradient descent. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/file/1371bccec2447b5aa6d96d2a540fb401-Paper.pdf.
T. Sun, Y. Sun, Y. Xu, and W. Yin. Markov chain block coordinate descent. Computational Optimization and Applications, 75(1):35–61, 2020.
T. Sun, D. Li, and B. Wang. Adaptive random walk gradient descent for decentralized optimization. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 20790–20809. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/sun22b.html.
T. Sun, D. Li, and B. Wang. On the decentralized stochastic gradient descent with Markov chain sampling. IEEE Transactions on Signal Processing, 71:2895–2909, 2023. doi: 10.1109/TSP.2023.3297053.
P. Wang, Y. Lei, Y. Ying, and D.-X. Zhou. Stability and generalization for Markov chain stochastic gradient methods. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 37735–37748. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/f61538f83b0f19f9306d9d801c15f41c-Paper-Conference.pdf.
R. Ward, X. Wu, and L. Bottou. AdaGrad stepsizes: Sharp convergence over nonconvex landscapes. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 6677–6686. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.press/v97/ward19a.html.
X. Wei, H. Yu, and M. J. Neely. Online primal-dual mirror descent under stochastic constraints. Proc. ACM Meas. Anal. Comput. Syst., 4(2), June 2020. doi: 10.1145/3392157. URL https://doi.org/10.1145/3392157.
X. Xiao. Penalized stochastic gradient methods for stochastic convex optimization with expectation constraints. Technical report, September 2019. URL https://optimization-online.org/?p=15973.
Y. Yan and Y. Xu. Adaptive primal-dual stochastic gradient method for expectation-constrained convex stochastic programs. Mathematical Programming Computation, 14(2):319–363, 2022. doi: 10.1007/s12532-021-00214-w. URL https://doi.org/10.1007/s12532-021-00214-w.
Z. Yang, Y. Lei, P. Wang, T. Yang, and Y. Ying. Simple stochastic and online gradient descent algorithms for pairwise learning. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 20160–20171. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/a87d27f712df362cd22c7a8ef823e987-Paper.pdf.
S. Yao and B. Huang. Beyond parity: Fairness objectives for collaborative filtering. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/e6384711491713d29bc63fc5eeb5ba4f-Paper.pdf.
E. Yazdandoost Hamedani, A. Jalilzadeh, and N. S. Aybat. Randomized primal-dual methods with adaptive step sizes. In F. Ruiz, J. Dy, and J.-W. van de Meent, editors, Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pages 11185–11212. PMLR, 25–27 Apr 2023. URL https://proceedings.mlr.press/v206/yazdandoost-hamedani23a.html.
X. Yi, X. Li, T. Yang, L. Xie, T. Chai, and K. Johansson. Regret and cumulative constraint violation analysis for online convex optimization with long term constraints. In M. Meila and T. Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 11998–12008. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/yi21b.html.
H. Yu, M. Neely, and X. Wei. Online convex optimization with stochastic constraints. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/da0d1111d2dc5d489242e60ebcbaf988-Paper.pdf.
J. Yuan and A. Lamperski. Online convex optimization for cumulative constraints. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/9cb9ed4f35cf7c2f295cc2bc6f732a84-Paper.pdf.
M. B. Zafar, I. Valera, M. Gomez-Rodriguez, and K. P. Gummadi. Fairness constraints: A flexible approach for fair classification. Journal of Machine Learning Research, 20(75):1–42, 2019. URL http://jmlr.org/papers/v20/18-262.html.
L. Zhang, Y. Zhang, J. Wu, and X. Xiao. Solving stochastic optimization with expectation constraints efficiently by a stochastic augmented Lagrangian-type algorithm. INFORMS Journal on Computing, 34(6):2989–3006, 2022. doi: 10.1287/ijoc.2022.1228. URL https://doi.org/10.1287/ijoc.2022.1228.
L. Zhang, Y. Zhang, X. Xiao, and J. Wu. Stochastic approximation proximal method of multipliers for convex stochastic programming. Mathematics of Operations Research, 48(1):177–193, 2023. doi: 10.1287/moor.2022.1257. URL https://doi.org/10.1287/moor.2022.1257.
R. Zhao. Accelerated stochastic algorithms for convex-concave saddle-point problems. Mathematics of Operations Research, 47(2):1443–1473, 2022. doi: 10.1287/moor.2021.1175. URL https://doi.org/10.1287/moor.2021.1175.
Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra. Federated learning with non-IID data, 2018. doi: 10.48550/ARXIV.1806.00582. URL https://arxiv.org/abs/1806.00582.