Stochastic-Constrained Stochastic Optimization with Markovian Data

Yeongjong Kim  kimyj@kaist.ac.kr
Department of Mathematical Sciences, KAIST, Daejeon 34141, South Korea

Dabeen Lee  dabeenl@kaist.ac.kr
Department of Industrial and Systems Engineering, KAIST, Daejeon 34141, South Korea

Abstract

This paper considers stochastic-constrained stochastic optimization where the stochastic constraint is to satisfy that the expectation of a random function is below a certain threshold. In particular, we study the setting where data samples are drawn from a Markov chain and thus are not independent and identically distributed. We generalize the drift-plus-penalty framework, a primal-dual stochastic gradient method developed for the i.i.d. case, to the Markov chain sampling setting. We propose two variants of drift-plus-penalty; one is for the case when the mixing time of the underlying Markov chain is known while the other is for the case of unknown mixing time. In fact, our algorithms apply to a more general setting of constrained online convex optimization where the sequence of constraint functions follows a Markov chain. Both algorithms are adaptive in that the first works without knowledge of the time horizon while the second uses AdaGrad-style algorithm parameters, which is of independent interest. We demonstrate the effectiveness of our proposed methods through numerical experiments on classification with fairness constraints.

Keywords: stochastic-constrained stochastic optimization, constrained online convex optimization, Markov chain stochastic gradient descent, drift-plus-penalty, classification with fairness constraints

©2023 Yeongjong Kim and Dabeen Lee. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/.

1 Introduction

In this paper, we consider the stochastic-constrained stochastic optimization (SCSO) problem

min_{x∈X} E_{ξ∼µ}[f(x, ξ)]  s.t.  E_{ξ∼µ}[g(x, ξ)] ≤ 0    (SCSO)

where the expectation is taken with respect to the random parameter ξ over a stationary distribution µ, f and g are convex loss and constraint functions, and X is a compact domain. This formulation of SCSO has applications in stochastic programming with a risk constraint (Rockafellar and Uryasev, 2000), chance-constrained programming (Nemirovski and Shapiro, 2007), portfolio optimization (Dentcheva and Ruszczynski, 2003), sparse matrix completion (Akhtar et al., 2021), semi-supervised learning (Chapelle et al., 2010), classification with fairness constraints (Zafar et al., 2019; Celis et al., 2019; Donini et al., 2018), Neyman-Pearson classification (Scott and Nowak, 2005; Rigollet and Tong, 2011), ranking with fairness constraints (Celis et al., 2018), recommendation systems with fairness constraints (Yao and Huang, 2017), scheduling in distributed data centers (Yu et al., 2017), safe reinforcement learning (García et al., 2015), and stochastic simple bilevel optimization (Jalilzadeh et al., 2023; Cao et al., 2023).

Stochastic approximation (SA) algorithms are prevalent solution methods for SCSO. Basically, we run gradient-based algorithms with an oracle providing i.i.d. samples of f(x, ξ), g(x, ξ), ∇f(x, ξ), ∇g(x, ξ) for a given solution x. Lan and Zhou (2020) proposed the cooperative stochastic approximation scheme for SCSO, which is a stochastic extension of Polyak's subgradient method developed for constrained optimization (Polyak, 1967).
Xiao (2019) developed the penalized stochastic gradient method that takes the square of the constraint function as a penalty term. Lin et al. (2020) developed a level set-based algorithm for SCSO. Akhtar et al. (2021) studied an augmented Lagrangian-based stochastic primal-dual algorithm, which was inspired by the primal-dual framework for online convex optimization with long-term constraints due to Mahdavi et al. (2012). Furthermore, motivated by recent success in adaptive gradient algorithms, Yan and Xu (2022) considered an adaptive primal-dual stochastic gradient method for SCSO. Zhang et al. (2023) proposed a stochastic variant of the proximal method of multipliers. Zhang et al. (2022) studied another stochastic proximal method of multipliers based on linearization.

SA algorithms for other related problem settings are as follows. Yu et al. (2017) studied online convex optimization with stochastic constraints where the constraint functions are stochastic i.i.d. while the objective functions are arbitrary, for which they proposed the drift-plus-penalty (DPP) algorithm. DPP applies to SCSO given that i.i.d. samples of g(x, ξ), ∇g(x, ξ), and ∇f(x, ξ) are available. Wei et al. (2020) developed an extension of DPP, and Lee et al. (2023) provided a projection-free algorithm for the setting. Moreover, SCSO can be formulated as a stochastic saddle-point problem by taking the Lagrangian if certain constraint qualifications hold. Nemirovski et al. (2009) and Juditsky et al. (2011) developed stochastic mirror-prox algorithms for general stochastic saddle-point problems. Zhao (2022) devised an accelerated stochastic framework for convex-concave saddle-point problems, and Yazdandoost Hamedani et al. (2023) devised a randomized adaptive primal-dual method.

The aforementioned solution methods for SCSO require i.i.d. data samples from the stationary distribution µ when running SA algorithms. However, there are several application scenarios in which sampling from the stationary distribution µ independently and identically is difficult. For example, federated learning serves as an alternative to traditional machine learning systems that require data centralization, with the purpose of improving data privacy (Zhao et al., 2018). The basic framework is that data is stored on individual local devices while the training is governed by a central server. Another related setting is distributed optimization over sensor networks (Rabbat and Nowak, 2004; Lopes and Sayed, 2007; Johansson et al., 2007) and multi-agent systems (Johansson et al., 2008). For these applications, an enormous amount of data is distributed over distinct nodes of a network, for which data exchange and message passing occur only between immediate neighboring nodes.

One resolution approach for such application scenarios is Markov chain stochastic gradient descent (SGD) (Johansson et al., 2010; Ram et al., 2009a). Basically, Markov chain SGD takes a random walk over a network of local data centers and updates solutions based on data acquired from the data centers visited. Here, as the data is generated by a Markov random walk, there is inherent dependence and bias between data samples. More generally, the framework can be viewed as a variant of the Markov chain Monte Carlo method (Andrieu et al., 2003).
That said, Markov chain SGD can be applied to other application domains where data is collected from a Markov process, such as decentralized learning (Yang et al., 2021), robust estimation (Poljak and Tsypkin, 1980; Sarkar and Rakhlin, 2019), and reinforcement learning (Nagaraj et al., 2020; Kowshik et al., 2021).

Following the Markov incremental subgradient methods due to Johansson et al. (2010) and Ram et al. (2009a) designed for distributed optimization problems, Markov chain SGD with data sampled from a general Markov process has been studied. Duchi et al. (2012) developed Markov chain SGD with data sampled from an ergodic process. Later, Sun et al. (2018) studied Markov chain SGD for convex and nonconvex problems when the underlying Markov chain is nonreversible. Doan et al. (2020) proposed an accelerated version of Markov chain SGD for both convex and nonconvex settings. Dorfman and Levy (2022) considered the setting where the mixing time of the underlying Markov chain is unknown, for which they developed Markov chain SGD based on the multi-level Monte Carlo estimation scheme (Giles, 2015; Blanchet and Glynn, 2015). Roy et al. (2022) studied nonconvex problems where the transition of the underlying Markov chain is state-dependent, motivated by strategic classification and reinforcement learning. Recent works (Doan, 2023; Even, 2023) established some performance guarantees of Markov chain SGD under minimal assumptions.

Applications of SCSO naturally motivate and necessitate algorithmic frameworks that can handle data sets with inherent dependence and bias. Portfolio optimization in finance takes time series data. Machine learning with fairness constraints processes heterogeneous data sets from disjoint sources. Scheduling in distributed data centers will benefit from distributed optimization technologies. Safe and constrained reinforcement learning can be formulated as SCSO for which the training data is obtained from trajectories of the underlying Markov decision process. Despite this immediate need, no prior work exists for solving SCSO with non-i.i.d. data samples. Motivated by this, the objective of this paper is to develop stochastic approximation algorithms that run with data sampled from a Markov chain.

1.1 Our Contributions

This paper initiates the study of stochastic approximation algorithms for stochastic-constrained stochastic optimization (SCSO) using non-i.i.d. data samples. Inspired by recent advances in Markov chain SGD, we develop primal-dual stochastic gradient methods using data sampled from a Markov chain, which can be viewed as primal-dual variants of Markov chain SGD. Specifically, we extend the drift-plus-penalty algorithm by Yu et al. (2017) that was originally developed for the i.i.d. setting. We adopt the approach of ergodic mirror descent by Duchi et al. (2012) for the case of known mixing time and the framework of Dorfman and Levy (2022) for the setting where the mixing time is unknown.

Our key technical contribution is to develop two variants of the drift-plus-penalty algorithm that can take a sequence of dependent constraint functions. These two algorithms solve constrained online convex optimization where the objective functions can be arbitrary and the constraint functions are generated from a Markov chain. One of them is for the case of known mixing time, while the other is for the case when the mixing time is unknown. We provide regret and constraint violation bounds for the algorithms, which delineate how their performance depends on the mixing time.
Based on the regret and constraint violation guarantees, we analyze the optimality gap and feasibility gap for SCSO. The connection between the constrained online convex optimization formulation and SCSO for our Markovian setting is not as immediate as in the i.i.d. setting because the expectation in (SCSO) is taken with respect to the stationary distribution of the underlying Markov chain.

What follows is a more detailed description of our contributions. Our results are also summarized in Table 1.

|                      | Algorithm 1                                            | Algorithm 1 under Slater's condition | Algorithm 2                               |
| Oblivious to τmix    | ✗                                                      | ✗                                    | ✔                                         |
| Regret               | Õ(τmix^{1−β} T^{1−β})                                  | Õ(√(τmix T))                         | adaptive                                  |
| Constraint violation | O(τmix^{β/2} T^{(β+1)/2})                              | Õ(√(τmix T))                         | adaptive                                  |
| Optimality gap       | Õ(τmix^{1−β} T^{−β} + τmix^{β/2} T^{−(1−β)/2})         | Õ(√(τmix/T))                         | Õ(τmix^{1−β} T^{−β})                      |
| Feasibility gap      | Õ(τmix^{β/2} T^{−(1−β)/2})                             | Õ(√(τmix/T))                         | Õ(τmix^{(2β+1)/4} T^{−(1−β)/2})           |

Table 1: Bounds on regret, constraint violation, optimality gap, and feasibility gap under Algorithms 1 and 2 (β ∈ (0, 1/2] is a predetermined algorithm parameter that controls the balance between regret and constraint violation)

• In Section 3, we consider the case when the mixing time of the underlying Markov chain is known, for which we develop Algorithm 1, a variant of the drift-plus-penalty algorithm. We first prove that for online convex optimization with Markovian constraints, the regret of Algorithm 1 is Õ(τmix^{1−β} T^{1−β}) and the constraint violation is O(τmix^{β/2} T^{(β+1)/2}), where τmix is the mixing time, T is the length of the horizon, β can be chosen to be any number in (0, 1/2], and Õ hides a poly log T factor. If we further assume that (SCSO) satisfies Slater's constraint qualification, then the regret and constraint violation of Algorithm 1 can both be bounded by Õ(√(τmix T)). We remark that Algorithm 1 is adaptive in that its parameters are chosen without knowledge of T, unlike the vanilla drift-plus-penalty algorithm. These results generalize the work of Yu et al. (2017).

• Based on the regret and constraint violation analysis for Algorithm 1, we show that the averaging of the sequence of solutions generated by Algorithm 1 guarantees that the optimality gap is bounded by Õ(τmix^{1−β} T^{−β} + τmix^{β/2} T^{−(1−β)/2}) while the feasibility gap is bounded by Õ(τmix^{β/2} T^{−(1−β)/2}). If we further assume that (SCSO) satisfies Slater's constraint qualification, then the optimality gap and feasibility gap of Algorithm 1 can both be bounded by Õ(√(τmix/T)).

• In Section 4, we propose Algorithm 2, another variant of the drift-plus-penalty algorithm, for the setting where the mixing time is unknown. The parameters of Algorithm 2 are set in an adaptive fashion, as in the AdaGrad method (Duchi et al., 2011; Levy, 2017). Then we apply Algorithm 2 to constrained online convex optimization where the objective and constraint functions are given by the multi-level Monte Carlo estimation scheme as in Dorfman and Levy (2022). We provide adaptive regret and constraint violation bounds for Algorithm 2. We note that Algorithm 2 is the first AdaGrad-style adaptive variant of the drift-plus-penalty algorithm. In fact, Algorithm 2 applies to online convex optimization with adversarial constraints (Neely and Yu, 2017) and provides adaptive performance guarantees. We include this result in Appendix C.
• Combining the estimation accuracy bounds on the multi-level Monte Carlo method by Dorfman and Levy (2022) and our adaptive regret and constraint violation bounds, we prove that the optimality gap under Algorithm 2 is bounded by Õ(τmix^{1−β} T^{−β}) while the feasibility gap is bounded by Õ(τmix^{(2β+1)/4} T^{−(1−β)/2}).

• In Section 5, we provide numerical results from experiments on a classification problem with fairness constraints. Specifically, we take the logistic regression formulation proposed in Zafar et al. (2019). The numerical results on random problem instances demonstrate the efficacy of our proposed algorithmic frameworks for solving SCSO with Markovian data.

The main component of our analysis is to provide bounds on the terms

E[ ∑_{t=1}^T Qt gt(x) ],  E[ ∑_{t=1}^T (Qt/Vt) gt(x) ],  E[Qt],  E[Qt/Vt]

under our Markovian data regime. All these terms involve the virtual queue size Qt, and therefore, controlling the virtual queue size is crucial to guarantee a fast convergence rate. We include our proofs of the main theorems in Sections 6 and 7. We include some of the known lemmas and tools for analyzing the drift-plus-penalty method due to Yu et al. (2017) in Appendices A and D. In fact, Appendix A states these results for any adaptive version of the drift-plus-penalty algorithm, which uses time-varying parameters Vt and αt.

1.2 Related Work

Although we have listed and explained important lines of previous work that motivate this paper, we supplement the list by mentioning a few more relevant results in stochastic approximation algorithms and Markov chain stochastic gradient methods.

Stochastic Approximation for Stochastic Optimization  Starting from the seminal paper by Robbins and Monro (1951), stochastic approximation algorithms, also known as stochastic gradient methods, have been a central topic of research in the domains of machine learning, operations research, and optimization (Ermoliev, 1983; Pflug, 1996; Ruszczyński and Syski, 1986; Bottou et al., 2018). There already exist numerous works on stochastic approximation algorithms for stochastic optimization. In particular, for SCSO or stochastic optimization with expectation constraints, various stochastic approximation methods were proposed and studied by Lan and Zhou (2020); Xiao (2019); Lin et al. (2020); Akhtar et al. (2021); Yan and Xu (2022); Zhang et al. (2023, 2022), as discussed earlier.

We note that online convex optimization with stochastic constraints is a superclass of SCSO under the i.i.d. data regime. Mahdavi et al. (2012) and Jenatton et al. (2016) developed augmented Lagrangian-based primal-dual algorithms for online convex optimization with deterministic long-term constraints. It turns out that these algorithms and their analysis can be adapted to the setting of stochastic constraints (Akhtar et al., 2021). Later, Yu et al. (2017) and Wei et al. (2020) proposed the drift-plus-penalty algorithm that achieves better regret and constraint violation guarantees under Slater's condition. Yuan and Lamperski (2018); Yi et al. (2021); Guo et al. (2022) also provided algorithms for online convex optimization with stochastic constraints.

Markov Chain Stochastic Gradient Descent  The asymptotic convergence of Markov chain SGD was studied in Bertsekas and Tsitsiklis (1996); Borkar (2008); Benveniste et al. (1990) based on ordinary differential equation methods.
Lopes and Sayed (2007); Johansson et al. (2007, 2010); Ram et al. (2009a) developed incremental subgradient methods for the distributed optimization setting where there is a network of data centers and an algorithm performs a random walk over the network to obtain data samples. These methods are also referred to as token algorithms. More recently, Mao et al. (2020) devised what they called the Walkman algorithm, which is a token algorithm based on an augmented Lagrangian method. Sun et al. (2022); Ayache et al. (2023) considered adaptive variants of token algorithms. Hendrikx (2023) developed a more general framework for token algorithms that allows multiple tokens for improving communication efficiency and may adopt existing optimization tools such as variance reduction and acceleration.

In addition to random walk-based token algorithms, there exist general Markov chain sampling frameworks for stochastic gradient methods. As mentioned earlier, Duchi et al. (2012); Sun et al. (2018); Doan et al. (2020); Dorfman and Levy (2022); Roy et al. (2022) developed Markov chain SGD methods for convex and nonconvex stochastic optimization problems. Moreover, Sun et al. (2020) developed a Markov chain sampling-based block coordinate descent method. Sun et al. (2023) proposed a decentralized variant of Markov chain SGD. Wang et al. (2022) considered the stability of Markov chain SGD and deduced its generalization bounds. Doan (2023) derived convergence guarantees on Markov chain SGD without a smoothness assumption, and Even (2023) studied convergence of Markov chain SGD without the bounded gradient assumption.

2 Preliminaries

In this section, we provide problem formulations for stochastic-constrained stochastic and online convex optimization under the Markovian data sampling regime. In addition, Section 2.1 gives the formal definition of the mixing time of a Markov chain. Section 2.4 describes a list of assumptions considered throughout the paper.

2.1 Markov Chain and Mixing Time

Given two probability distributions P, Q over the probability space (S, F), the total variation distance between them is defined as

∥P − Q∥_TV := sup_{A∈F} |P(A) − Q(A)|.

Let {ξt}_{t=1}^∞ be a time-homogeneous ergodic Markov chain with a finite state space S. For a distribution ν over (S, F), we denote by P^t(ν, ·) the conditional probability distribution of ξ_{t+1} given ξ1 ∼ ν. Since {ξt}_{t=1}^∞ is ergodic, it has a unique stationary distribution µ, i.e., P^t(µ, ·) = µ(·). In fact, the long-term distribution of the ergodic Markov chain converges to µ regardless of the initial distribution, which can be demonstrated as follows. It is known that P^t(ν, ·) for any t and ν satisfies ∥P^t(ν, ·) − µ∥_TV ≤ Cα^t for some α ∈ (0, 1) and C > 0 (Levin and Peres, 2017). Then we define the quantities dmix(t) and τmix(ϵ) as follows:

dmix(t) := sup_ν ∥P^t(ν, ·) − µ∥_TV,  τmix(ϵ) := inf{t ∈ N : dmix(t) ≤ ϵ}.

Moreover, following the convention, we define τmix as τmix := τmix(1/4), and we refer to τmix as the mixing time of the underlying Markov chain. It is known that dmix(l·τmix) ≤ 2^{−l} for every l ∈ N (Levin and Peres, 2017, Chapter 4), which implies that τmix(ϵ) ≤ ⌈log₂ ϵ^{−1}⌉ τmix. In particular, throughout the paper, we will use the quantity τ defined as

τ := τmix(1/T) = O(τmix log T)

where T is the length of the time horizon.

2.2 Stochastic-Constrained Online Convex Optimization with Markovian Data

Let X ⊂ R^d be a compact convex domain. Let {ft : X → R}_{t=1}^∞ be a sequence of arbitrary convex functions.
Another sequence of convex functions {gt : X → R}_{t=1}^∞ is assumed to follow an ergodic Markov chain. More precisely, there exists a time-homogeneous ergodic Markov chain {ξt}_{t=1}^∞ such that gt(x) = g(x, ξt). Let ḡ(x) = E_{ξ∼µ}[g(x, ξ)] for x ∈ X where the expectation is taken with respect to the stationary distribution µ of the Markov chain. Then the problem is to solve and compute

x* ∈ argmin_{x∈X} ∑_{t=1}^T ft(x)  s.t.  ḡ(x) ≤ 0.

However, the information about the functions {ft}_{t=1}^T and {gt}_{t=1}^T is revealed online. Basically, at each step t, we choose our decision xt ∈ X before observing ft, gt, after which we receive feedback about them. The stochastic setting studied in Yu et al. (2017) is that the constraint functions g1, ..., gT are i.i.d., which means that ξ1, ..., ξT are i.i.d., while we consider the case where the constraint functions g1, ..., gT follow an ergodic Markov chain and thus are dependent. That said, we may refer to the problem setting as online convex optimization with ergodic constraints.

To measure the performance of a learning algorithm for the online optimization problem, we adopt the regret and cumulative constraint violation definition due to Yu et al. (2017). The regret and cumulative constraint violation of an algorithm that generates solutions x1, ..., xT over T time steps are given by

Regret(T) = ∑_{t=1}^T ft(xt) − ∑_{t=1}^T ft(x*),  Violation(T) = ∑_{t=1}^T gt(xt).

We want these quantities to have a sublinear growth in T. Note that the benchmark solution x* satisfies the expectation constraint E_{ξ∼µ}[g(x*, ξ)] ≤ 0.

2.3 Stochastic-Constrained Stochastic Optimization with Markovian Data

Next, we consider the setting where the objective functions {ft}_{t=1}^∞ as well as the constraint functions {gt}_{t=1}^∞ are given by an ergodic Markov chain. Without loss of generality, we may assume that ft(x) = f(x, ξt) and gt(x) = g(x, ξt) for some time-homogeneous ergodic Markov chain {ξt}_{t=1}^∞ with a stationary distribution µ. As before, let f̄(x) = E_{ξ∼µ}[f(x, ξ)] and ḡ(x) = E_{ξ∼µ}[g(x, ξ)] for x ∈ X. Then the problem is to solve and compute

x# ∈ argmin_{x∈X} f̄(x)  s.t.  ḡ(x) ≤ 0.

Here, we do not have direct access to (f̄, ḡ), but we receive samples (ft, gt) which converge to (f̄, ḡ) in expectation. For the performance measure of a learning algorithm for this problem, we consider the following standard notions of optimality gap and constraint violation. For a solution x ∈ X,

Gap(x) = f̄(x) − f̄(x#),  Infeasibility(x) = ḡ(x).

We want these quantities to approach 0 as T grows. Hereinafter, we refer to Gap(x) and Infeasibility(x) as the optimality gap and the feasibility gap, respectively.

In the special case where {ξt}_{t=1}^∞ are i.i.d. samples drawn from µ, solving stochastic-constrained online convex optimization would provide a solution to stochastic-constrained stochastic optimization. One can argue that if ξ1, ..., ξT are i.i.d., then

Gap(x̄T) ≤ (1/T) E[Regret(T)],  Infeasibility(x̄T) ≤ (1/T) E[Violation(T)]

where x̄T denotes the simple average of x1, ..., xT:

x̄T = (1/T) ∑_{t=1}^T xt.

However, in contrast to the i.i.d. case, the above inequalities do not hold for the case of an arbitrary ergodic Markov chain. This is because the distribution of ξ_{t+1} conditioned on ξt is not equal to the stationary distribution µ. On the other hand, we will use the fact that the long-term distribution of an ergodic Markov chain converges to its stationary distribution.
Based on this observation, we first bound the expected regret and the expected constraint violation of the online problem, and then we use this to bound the optimality gap and the feasibility gap of the stochastic optimization problem.

2.4 Notations and Assumptions

We work over a norm ∥·∥ and its dual norm ∥·∥∗. We choose a convex mirror map Φ : C → R, where C ⊆ R^d is a convex domain containing X. We use the corresponding Bregman divergence defined as

D(x, y) = Φ(x) − Φ(y) − ∇Φ(y)^⊤(x − y).

Assumption 1  There is a constant R > 0 such that D(x, y) ≤ R² for any x, y ∈ X, and Φ is 2-strongly convex with respect to the norm ∥·∥, i.e., ∥x − y∥² ≤ D(x, y) for any x, y ∈ X. Moreover, f(·, ξ) and g(·, ξ) are differentiable for each ξ ∈ S.

We use notations Ft, Gt, Ht for t ∈ [T] given by

Ft := ∥∇ft(xt)∥∗,  Gt := ∥∇gt(xt)∥∗,  Ht := |gt(xt)|,

which are parameters used in our adaptive algorithm. Due to the stochasticity of ft, gt, these are random variables. We assume these quantities are bounded.

Assumption 2  There exist constants F, G, H > 0 such that

∥∇_x f(x, ξ)∥∗ ≤ F,  ∥∇_x g(x, ξ)∥∗ ≤ G,  |g(x, ξ)| ≤ H

for any x ∈ X and ξ ∈ S.

Assumptions 1 and 2 are common in the online convex optimization, stochastic optimization, and Markov chain SGD literature. The following assumption is referred to as Slater's condition or Slater's constraint qualification for constrained optimization.

Assumption 3 (Slater's condition)  There exists x̂ ∈ X such that ḡ(x̂) ≤ −ϵ for some ϵ > 0.

For online convex optimization with stochastic i.i.d. constraint functions, Slater's condition leads to improvement in regret and constraint violation (Yu et al., 2017). In Section 3, we will show that even for online convex optimization with ergodic constraint functions, assuming that Slater's condition holds results in an improvement in the regret and constraint violation of Algorithm 1.

Algorithm 1 Ergodic Drift-Plus-Penalty (EDPP)
Initialize: Initial iterates x1 ∈ X, Q1 = 0, and 0 < β ≤ 1/2.
for t = 1 to T do
    Observe ft and gt.
    Set the penalty parameter Vt and the step size parameter αt as Vt = (τmix t)^β, αt = τmix t.
    Primal update: Set xt+1 as
        xt+1 = argmin_{x∈X} {(Vt∇ft(xt) + Qt∇gt(xt))^⊤ x + αt D(x, xt)}.
    Dual update: Set Qt+1 as
        Qt+1 = [Qt + gt(xt) + ∇gt(xt)^⊤(xt+1 − xt)]₊, where [z]₊ = max{z, 0}.
end for

3 Known Mixing Time

We first focus on the setting where we have access to the mixing time τmix of the underlying Markov chain {ξt}_{t=1}^∞. Duchi et al. (2012) studied this case for stochastic convex minimization without a stochastic functional constraint, for which they modified the step size of the stochastic gradient descent method based on the mixing time parameter τmix. Inspired by this approach, we take and modify the drift-plus-penalty (DPP) algorithm due to Neely and Yu (2017); Yu et al. (2017) developed for stochastic-constrained online convex optimization. Based on DPP, we develop our algorithm by setting the algorithm parameters properly to adapt to the mixing time τmix.

3.1 Ergodic Drift-Plus-Penalty

The DPP algorithm has two parameters, V and α, where V is the penalty parameter and α determines the step size. Yu et al. (2017) set V = √T and α = T. In contrast to the vanilla DPP algorithm, our algorithm uses parameters

Vt = (τmix t)^β,  αt = τmix t

for iterations t = 1, ..., T, where T is the length of the horizon and β is another algorithm parameter that controls the balance between the regret and the constraint violation.
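To make the two updates concrete, the following is a minimal sketch of the EDPP loop for the Euclidean setup Φ(x) = ∥x∥², in which D(x, y) = ∥x − y∥² and the primal update reduces to a projected gradient step. The oracle names proj_X, grad_f, grad_g, and g_val are placeholders (not the paper's notation) for a projection onto X and for the sampled gradients and constraint values observed along the Markov chain.

```python
import numpy as np

def edpp(x1, T, tau_mix, beta, grad_f, grad_g, g_val, proj_X):
    """Sketch of Algorithm 1 (EDPP) with D(x, y) = ||x - y||^2.

    grad_f(x, t), grad_g(x, t), g_val(x, t) are hypothetical oracles
    returning nabla f_t(x), nabla g_t(x), g_t(x); proj_X projects onto X.
    """
    x, Q = np.asarray(x1, dtype=float), 0.0
    iterates = [x.copy()]
    for t in range(1, T + 1):
        V_t = (tau_mix * t) ** beta   # penalty parameter
        alpha_t = tau_mix * t         # step size parameter
        gf, gg, gv = grad_f(x, t), grad_g(x, t), g_val(x, t)
        # primal update: the minimizer of (V_t*gf + Q*gg)^T x + alpha_t*||x - x_t||^2
        # over X is the projection of x_t - (V_t*gf + Q*gg)/(2*alpha_t) onto X
        x_next = proj_X(x - (V_t * gf + Q * gg) / (2.0 * alpha_t))
        # dual update: virtual queue driven by the linearized constraint
        Q = max(Q + gv + float(gg @ (x_next - x)), 0.0)
        x = x_next
        iterates.append(x.copy())
    return np.mean(iterates[:-1], axis=0)  # averaged solution corresponding to x̄_T
```

The averaged iterate returned at the end corresponds to the solution x̄T analyzed in Theorem 2 below.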
Our algorithm, which we call ergodic drift-plus-penalty (EDPP), is described in Algorithm 1. First, the mixing time τmix is now part of the parameters Vt and αt, as in the ergodic mirror descent algorithm by Duchi et al. (2012). Second, Vt and αt are time-varying, so our algorithm is adaptive (Jenatton et al., 2016) and oblivious to the length of the time horizon T. One may wonder why we do not use T instead of t, i.e., V = (τmix T)^β and α = τmix T. In fact, our numerical results, which will be presented in Section 5, demonstrate that Algorithm 1 with the adaptive parameters Vt and αt outperforms the algorithm with fixed parameters V = (τmix T)^β and α = τmix T. However, the performance analysis of the vanilla DPP algorithm by Yu et al. (2017) does not immediately extend to such adaptive parameters. Hence, we prove that the DPP framework with time-varying parameters still achieves the desired regret and constraint violation guarantees.

Let us also briefly explain how the DPP framework initially developed by Yu et al. (2017) as well as our Algorithm 1 works. We may regard Qt as the size of a virtual queue at time t. Then we consider the associated quadratic Lyapunov term Lt = Qt²/2 and study the corresponding drift given by ∆t = L_{t+1} − Lt = (Q_{t+1}² − Qt²)/2. It is not difficult to see that

∆t ≤ Qt(gt(xt) + ∇gt(xt)^⊤(xt+1 − xt)) + (1/2)(Ht + GtR)²

holds (Lemma 19). Here, the upper bound on the drift ∆t has the term Qt∇gt(xt)^⊤(xt+1 − xt) that depends on the next iterate xt+1. Hence, by choosing xt+1 to minimize Qt∇gt(xt)^⊤(xt+1 − xt), we may attempt to control the drift. In fact, the primal update sets xt+1 to be the minimizer of

Qt∇gt(xt)^⊤(x − xt) (the drift term) + Vt∇ft(xt)^⊤(x − xt) (the penalty term) + αt D(x, xt)

over X. Consequently, at each iteration, we get to choose a solution that minimizes the drift term ∆t and a penalty term for controlling the objective simultaneously.

3.2 Performance Guarantees of Ergodic Drift-Plus-Penalty

First, we analyze the regret and constraint violation of EDPP for the constrained online convex optimization setting under the Markovian sampling regime, formulated in Section 2.2. Recall that τmix = τmix(1/4) and that the parameter τ is defined as τ = τmix(T^{−1}), which satisfies τ ≤ ⌈log₂ T⌉ τmix.

Theorem 1  Suppose that Assumptions 1 and 2 hold. Then for online convex optimization with ergodic constraints, Algorithm 1 achieves

E[Regret(T)] = O( τmix^{−β} τ T^{1−β} ),
E[Violation(T)] = O( τmix^{β/2} T^{(β+1)/2} + √((τ − 1)T) )

where the expectation is taken with respect to the randomness in running the algorithm.

One of the key components of the analysis for proving Theorem 1 is that we bound the expected size of the virtual queue Qt at time t as follows:

E[Qt] = O( τmix^{β/2} T^{(β+1)/2} + √((τ − 1)T) ).

Another key part is to analyze the term

∑_{t=1}^T E[Qt gt(x)]

where the function gt is not independent of the virtual queue size Qt under our Markovian sampling regime. In contrast, if g1, ..., gT were i.i.d., then E[Qt gt(x)] = E[Qt ḡ(x)] would hold. Instead, we relate the term to

∑_{t=1}^{T−τ+1} E[Qt g_{t+τ−1}(x)]

and use the intuition that the distribution of the ergodic Markov chain after τ steps is close to its stationary distribution. We provide a bound on the term in Lemma 11. Similarly, we also need to analyze the term

∑_{t=1}^T E[(Qt/Vt) gt(x)],

and an upper bound on the sum is given in Lemma 12.
Based on these observations, we provide an upper bound on the expected virtual queue size E[Qt], which is given in Lemma 13. Then we apply the performance analysis results of the general adaptive drift-plus-penalty method provided in Appendix A. The complete proof of Theorem 1 is given in Section 6.1.

Next, we analyze the performance of EDPP on the stochastic-constrained stochastic optimization problem under Markovian data sampling, whose formulation is given by (SCSO).

Theorem 2  Suppose that Assumptions 1 and 2 hold. Then for stochastic-constrained stochastic optimization (SCSO), Algorithm 1 guarantees that

E[Gap(x̄T)] = O( τ/(τmix^β T^β) + (τ − 1)/(τmix^{1−β/2} T^{(1−β)/2}) + (τ − 1)^{3/2}/(τmix T^{1/2}) + τ/T ),
E[Infeasibility(x̄T)] = O( τ/(τmix^{1−β/2} T^{(1−β)/2}) + (τ − 1)^{3/2}/(τmix T^{1/2}) + τ/T )

where x̄T = (1/T) ∑_{t=1}^T xt and the expectation is taken with respect to the randomness in running the algorithm.

Note that Theorem 1 implies

(1/T) E[Regret(T)] = O( τ/(τmix^β T^β) ),

but the bound on the optimality gap given in Theorem 2 has additional terms due to the difference between the stationary distribution of the ergodic Markov chain and the distribution of f_{t+1} conditioned on ft. In fact, under Markovian sampling, we have

∑_{t=1}^T E[ft(xt)] ≠ ∑_{t=1}^T E[f̄(xt)].

Here, to get around this issue, we also use the intuition that E[f̄(xt)] is close to E[f_{t+τ−1}(xt)] as the distribution of the Markov chain after τ steps is close to its stationary distribution. The proof of Theorem 2 is given in Section 6.2. Moreover, since τ = Õ(τmix), it follows that

E[Regret(T)] = Õ( τmix^{1−β} T^{1−β} ),  E[Violation(T)] = Õ( τmix^{β/2} T^{(β+1)/2} + √(τmix T) ),
E[Gap(x̄T)] = Õ( τmix^{1−β}/T^β + τmix^{β/2}/T^{(1−β)/2} + τmix/T ),
E[Infeasibility(x̄T)] = Õ( τmix^{β/2}/T^{(1−β)/2} + √τmix/√T + τmix/T )

where the Õ hides a log T factor. In particular, if we set β = 1/3, then we have E[Regret(T)] = Õ(τmix^{2/3} T^{2/3}), E[Violation(T)] = O(τmix^{1/6} T^{2/3} + τmix^{1/2} T^{1/2}), E[Gap(x̄T)] = Õ(τmix^{2/3} T^{−1/3} + τmix T^{−1}), and E[Infeasibility(x̄T)] = O(τmix^{1/6} T^{−1/3} + τmix^{1/2} T^{−1/2} + τmix T^{−1}).

Furthermore, observe that if {ξt}_{t=1}^∞ is a sequence of i.i.d. random variables, then we have τmix = τ = 1. In this case, by Theorems 1 and 2, Algorithm 1 guarantees that E[Regret(T)] = O(T^{1−β}), E[Violation(T)] = O(T^{(β+1)/2}), E[Gap(x̄T)] = O(T^{−β}), and E[Infeasibility(x̄T)] = O(T^{−(1−β)/2}) for any β ∈ (0, 1/2], which recovers the result of Jenatton et al. (2016).

If we further assume that Slater's constraint qualification holds, we can argue that we get better control on the size of E[Qt]. Note that the upper bound on the expected queue size given by Lemma 13 is E[Qt] = O(τmix^{β/2} t^{(β+1)/2}), which holds regardless of whether Slater's condition holds or not. On the other hand, we will argue that under Slater's condition, we have

E[Qt] = O( τ(τ + √t)/√τmix ) = Õ( √(τmix t) ).

This is consistent with Yu et al. (2017) as they proved that E[Qt] = O(√T) for the i.i.d. setting. In fact, our proof for bounding E[Qt] is more involved than the argument of Yu et al. (2017) because we use adaptive step sizes for Algorithm 1 and consider non-i.i.d. constraint functions. This leads to improvements as stated in the following result.

Theorem 3  Suppose that Assumptions 1–3 hold. Then for online convex optimization with ergodic constraints and stochastic-constrained stochastic optimization, Algorithm 1 with β = 1/2 guarantees
E[Regret(T)] = O( τ√T/√τmix ),  E[Violation(T)] = O( τ(τ + √T)/√τmix ),
E[Gap(x̄T)] = O( τ²/(τmix^{3/2} √T) + τ^{5/2}/(τmix^{3/2} T) ),
E[Infeasibility(x̄T)] = O( τ²/(τmix^{3/2} √T) + τ^{5/2}/(τmix^{3/2} T) + τ²/(τmix^{1/2} T^{3/2}) ),

where the expectation is taken with respect to the randomness in running the algorithm.

Our analysis takes into account the time-varying algorithm parameters Vt and αt as well as the fact that the functions are correlated according to a Markov chain. To consider this, we prove Lemmas 14 and 15 that lead to time-varying bounds on the expected virtual queue size. The complete proof of Theorem 3 is included in Section 6.3. Since τ = Õ(τmix),

E[Regret(T)] = Õ( √(τmix T) ),  E[Violation(T)] = Õ( √(τmix T) + τmix^{3/2} ),
E[Gap(x̄T)] = Õ( √(τmix/T) + τmix/T ),  E[Infeasibility(x̄T)] = Õ( √(τmix/T) + τmix/T + τmix^{3/2}/T^{3/2} ).

4 Unknown Mixing Time

Next, we study the setting where the mixing time τmix is not observable. Even if we do not know the mixing time of the underlying Markov chain, we would still want to provide a learning algorithm with performance guarantees of similar orders. To achieve this goal, we develop yet another variant of the drift-plus-penalty algorithm which incorporates the multi-level Monte Carlo (MLMC) gradient estimation scheme (Giles, 2015; Blanchet and Glynn, 2015; Dorfman and Levy, 2022). Dorfman and Levy (2022) first introduced the approach of combining stochastic gradient descent with the MLMC gradient estimation framework for stochastic optimization with no stochastic functional constraint. For the case of known mixing time, we may update the step size based on the mixing time τmix to achieve an optimal dependence on the parameter. When τmix is not known, Dorfman and Levy (2022) used AdaGrad-based adaptive step sizes (Duchi et al., 2011; Levy, 2017; Ward et al., 2019). We take this idea to develop an AdaGrad variant of the drift-plus-penalty algorithm for SCSO, described in Algorithm 2, that incorporates the MLMC gradient estimation framework. The AdaGrad version of DPP itself is of independent interest.

4.1 Multi-Level Monte Carlo Sampling

The idea behind the multi-level Monte Carlo estimation scheme is to obtain many consecutive samples from an ergodic Markov chain and take their average. At the same time, we may control the expected number of consecutive samples required for each time step by O(log T). More precisely, for each time step t, we sample Nt consecutive states ξt^{(1)}, ..., ξt^{(Nt)}, where Nt itself is a random variable given by

Nt = Ñt if Ñt ≤ T², and Nt = 1 otherwise,

and Ñt = 2^{Jt} with Jt ∼ Geom(1/2). Note that in our case the truncation threshold on Ñt is T², while it was set to T in Dorfman and Levy (2022). With this sampling strategy, we define Ft as the σ-field

Ft = σ( {N1, ..., Nt} ∪ ⋃_{s=1}^t {ξs^{(1)}, ..., ξs^{(Ns)}} ).

Let Et[·] denote the conditional expectation with respect to Ft, i.e., Et[·] = E[· | Ft]. Next, for t ≥ 1 and N ≥ 1, we define

ft^N(x) := (1/N) ∑_{i=1}^N f(x, ξt^{(i)}),  gt^N(x) := (1/N) ∑_{i=1}^N g(x, ξt^{(i)}).

Based on this, we define the MLMC estimators of f and g as follows:

(ft, gt) = (ft^1, gt^1) + Nt((ft^{Nt}, gt^{Nt}) − (ft^{Nt/2}, gt^{Nt/2})) if Nt > 1, and (ft, gt) = (ft^1, gt^1) otherwise.

Basically, the functions ft and gt are obtained after applying the MLMC estimation scheme to the underlying ergodic Markov chain.
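As a concrete illustration, here is a minimal sketch of the MLMC estimator for the function values at a fixed point x. The names draw_next, f, and g are hypothetical oracles (the next Markov-chain state and the sampled loss and constraint evaluations), and the gradient estimators are formed in exactly the same way; the geometric-sampling convention P(Jt = j) = 2^{−j} for j ≥ 1 is an assumption consistent with Jt ∼ Geom(1/2).

```python
import numpy as np

rng = np.random.default_rng(0)

def mlmc_pair(draw_next, f, g, x, T):
    """Sketch of the MLMC estimates of (f̄(x), ḡ(x)) for one time step t.

    draw_next() returns the next state of the underlying ergodic Markov
    chain; f(x, xi) and g(x, xi) evaluate the sampled loss and constraint.
    """
    J = int(rng.geometric(0.5))            # J_t ~ Geom(1/2), P(J = j) = 2^{-j}
    N = 2 ** J if 2 ** J <= T ** 2 else 1  # N_t, truncated at T^2
    xis = [draw_next() for _ in range(N)]  # N consecutive chain states
    f_vals = np.array([f(x, xi) for xi in xis])
    g_vals = np.array([g(x, xi) for xi in xis])
    f_est, g_est = f_vals[0], g_vals[0]    # (f_t^1, g_t^1)
    if N > 1:
        # correction term N * ((f^N, g^N) - (f^{N/2}, g^{N/2})),
        # where f^{N/2} averages the first N/2 states of the same trajectory
        f_est += N * (f_vals.mean() - f_vals[: N // 2].mean())
        g_est += N * (g_vals.mean() - g_vals[: N // 2].mean())
    return f_est, g_est
```

The first-half average plays the role of ft^{Nt/2}, so in expectation the telescoping levels reproduce the deep average ft^{2^{j_max}} while only O(log T) samples are consumed per step on average.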
One thing to note, however, is that ft and gt are not necessarily convex anymore (Dorfman and Levy, 2022). To remedy this issue, what we can argue instead is that E_{t−1}[ft] and E_{t−1}[gt] are convex. Based on (Dorfman and Levy, 2022, Lemma 3.1), we deduce the following lemma.

Lemma 4  Let j_max := max{j ∈ N : 2^j ≤ T²} = ⌊2 log₂ T⌋. Then for each t,

E_{t−1}[ft] = E_{t−1}[ft^{2^{j_max}}],  E_{t−1}[∇ft] = E_{t−1}[∇ft^{2^{j_max}}],
E_{t−1}[gt] = E_{t−1}[gt^{2^{j_max}}],  E_{t−1}[∇gt] = E_{t−1}[∇gt^{2^{j_max}}].

Moreover, we have

E[∥∇ft(xt)∥∗²] = Õ(F² τmix),  E[∥∇gt(xt)∥∗²] = Õ(G² τmix),  E[|gt(xt)|²] = Õ(H² τmix).

Lastly, the expected number of samples for time step t satisfies E[Nt] ≤ 2 log₂ T + 1.

The reason for setting the truncation threshold on Ñt to T² instead of T is to achieve high accuracy of estimation for ft, gt, ∇ft, and ∇gt that leads to the desired performance guarantees of Algorithm 2. To be specific, we use the following estimation bounds based on (Dorfman and Levy, 2022, Lemma A.6).

Lemma 5  There exists C(T) > 0 with

C(T) = O( log(T) log^{1/2}(τmix T² log(T)) )

such that

E_{t−1}[ |ft^{2^{j_max}}(x) − f̄(x)|² ] ≤ C(T)² τmix/T²,
E_{t−1}[ |gt^{2^{j_max}}(x) − ḡ(x)|² ] ≤ C(T)² τmix/T²,
E_{t−1}[ ∥∇ft^{2^{j_max}}(x) − ∇f̄(x)∥∗² ] ≤ C(T)² τmix/T²,
E_{t−1}[ ∥∇gt^{2^{j_max}}(x) − ∇ḡ(x)∥∗² ] ≤ C(T)² τmix/T²

hold for any x ∈ X that is measurable with respect to F_{t−1} and any t ∈ [T]. The complete proof of this lemma is given in Appendix E.

Algorithm 2 MLMC Adaptive Drift-Plus-Penalty (MDPP)
Initialize: Initial iterates x1 ∈ X, Q1 = 0 and parameters 0 < β ≤ 1/2, δ > 0.
for t = 1 to T do
    Observe ft and gt via the MLMC method.
    Set the penalty parameter Vt and the step size parameter αt as in (1).
    Primal update: Set xt+1 as
        xt+1 = argmin_{x∈X} {(Vt∇ft(xt) + Qt∇gt(xt))^⊤ x + αt D(x, xt)}.
    Dual update: Set Qt+1 as
        Qt+1 = [Qt + gt(xt) + ∇gt(xt)^⊤(xt+1 − xt)]₊.
end for

4.2 Adaptive Drift-Plus-Penalty

The second component of our algorithm for the unknown mixing time setting is the AdaGrad variant of the drift-plus-penalty algorithm. To develop AdaGrad-style step sizes, let us define the following sequence of parameters. For some positive constant δ > 0, we set

a0 = S0 = δ,  at := Ft²/4 + R²Gt² + Ht² + δ,  St := δ + ∑_{s=1}^t as

for t ≥ 1. Here, we may choose any positive number for δ. Recall that Ft = ∥∇ft(xt)∥∗, Gt = ∥∇gt(xt)∥∗, and Ht = |gt(xt)|. Based on these parameters, we set the algorithm parameters Vt and αt as follows:

Vt = S_{t−1}^β / R,  αt = S_{t−1} / R²    (1)

for some 0 < β ≤ 1/2. Here, the penalty parameter Vt and the step size parameter αt can be chosen without knowledge of the global upper bounds F, G, H on Ft, Gt, Ht.

Now we are ready to describe our algorithm, which we call the MLMC adaptive drift-plus-penalty (MDPP) algorithm. Here, MDPP is a combination of the AdaGrad-style adaptive drift-plus-penalty algorithm with the MLMC estimator presented in the previous subsection. More importantly, the algorithm is designed to solve constrained online convex optimization where the MLMC estimators f1, ..., fT are the objective loss functions and the MLMC estimators g1, ..., gT are the constraint functions. The following lemma provides an adaptive regret guarantee and an adaptive constraint violation bound for Algorithm 2 for the associated online convex optimization.

Lemma 6  Suppose that Assumptions 1 and 2 hold. Then for the constrained online convex optimization problem where the MLMC estimators f1, ..., fT are the objective loss functions and the MLMC estimators g1, ..., gT are the constraint functions, Algorithm 2 achieves the following.
For any x ∈ X with ḡ(x) ≤ 0, we have

E[ ∑_{t=1}^T ft(xt) − ∑_{t=1}^T ft(x) ]
= Õ( E[ST]^{1−β} + τmix^{1/2} E[ST]^{1/2−β} + τmix^{1/2} T^{1/4} E[ST]^{1/4−β/2} + τmix^{1/2} T^{1−β} ),

E[ ∑_{t=1}^T gt(xt) ]
= Õ( E[ST]^{1/2} + T^{1/4} E[ST]^{β/2+1/4} + (τmix^{1/2} + E[ST] + T^{1/2} E[ST]^{β+1/2}) / (τmix^{β/2+1/4} T^{β/2+1/2}) + (τmix^{1/2} + E[ST]^{1/2} + E[ST]^{β/2+1/4}) / (τmix^{β/2+1/4} T^{β/2+1/2}) + (log ST)² τmix^{β/2−1/4} T^{β/2+1/2} ).

Recall that the parameters Vt and αt depend on the parameter δ. We may choose any positive number for δ, but its choice does affect the performance of Algorithm 2. Although the bounds given in Lemma 6 do not exhibit an explicit dependence on δ, our proof of Lemma 6 in Section 7.2 reveals that increasing δ increases the objective gap E[∑_{t=1}^T ft(xt) − ∑_{t=1}^T ft(x)] and decreases the constraint violation E[∑_{t=1}^T gt(xt)]. Likewise, decreasing δ decreases the objective gap and increases the constraint violation.

Lemma 4 implies that E[ST] = Õ(τmix T). Plugging this bound on E[ST] into the adaptive performance guarantees given in Lemma 6, we deduce the following result.

Proposition 7  Suppose that Assumptions 1 and 2 hold. Then for the constrained online convex optimization problem where the MLMC estimators f1, ..., fT are the objective loss functions and the MLMC estimators g1, ..., gT are the constraint functions, Algorithm 2 achieves the following. For any x ∈ X with ḡ(x) ≤ 0, we have

E[ ∑_{t=1}^T ft(xt) − ∑_{t=1}^T ft(x) ] = Õ( τmix^{1−β} T^{1−β} ),
E[ ∑_{t=1}^T gt(xt) ] = Õ( τmix^{β/2+1/4} T^{β/2+1/2} + τmix^{3/4−β/2} T^{1/2−β/2} ).

We remark that the performance bounds given in Lemma 6 and Proposition 7 are not comparable to the regret and constraint violation bounds for stochastic-constrained stochastic optimization. When each pair of loss and constraint functions for stochastic-constrained stochastic optimization corresponds to a single data point, the associated regret and constraint violation measures are given by

Regret(T) = ∑_{t=1}^T ∑_{j=1}^{Nt} ft^{(j)}(xt) − ∑_{t=1}^T ∑_{j=1}^{Nt} ft^{(j)}(x*),
Violation(T) = ∑_{t=1}^T ∑_{j=1}^{Nt} gt^{(j)}(xt).

Here, ft^{(1)}, ..., ft^{(Nt)} are the Nt sampled functions from which we derive the MLMC estimator ft for t ∈ [T]. In contrast, Lemma 6 and Proposition 7 analyze the performance of Algorithm 2 on the sequence of the MLMC estimators. Despite this, it would be an interesting question to understand the performance of Algorithm 2 for the latter online convex optimization setting where each function pair corresponds to a single data point.

The proof of Lemma 6 is given in Section 7. One of the main components of the analysis is to provide adaptive bounds on the terms E[Qt] and E[Qt/Vt]. This is possible thanks to our subtle choice of parameters Vt and αt. It turns out that the AdaGrad-style analysis for drift-plus-penalty is more sophisticated than that for unconstrained stochastic gradient descent.

Finally, we state the following theorem providing upper bounds on the optimality gap and the feasibility gap under Algorithm 2 for SCSO. The argument is to use the results of Proposition 7 and the estimation error bounds due to Lemma 5. In contrast to the setting of Section 3, for which we had to rely on the mixing property of ergodic Markov chains directly, the MLMC estimators are already close to f̄ and ḡ.
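As an implementation note, the parameter schedule (1) driving Algorithm 2 can be maintained with a few lines of state. The following is a minimal sketch under that schedule; the class and method names are illustrative rather than the paper's notation, and F_t, G_t, H_t denote ∥∇ft(xt)∥∗, ∥∇gt(xt)∥∗, |gt(xt)| computed from the MLMC estimators at the current iterate.

```python
class AdaptiveDppParams:
    """Sketch of the AdaGrad-style schedule (1): V_t = S_{t-1}^beta / R,
    alpha_t = S_{t-1} / R^2, with S_t accumulated from observed quantities."""

    def __init__(self, R, beta, delta):
        self.R, self.beta, self.delta = R, beta, delta
        self.S = delta  # S_0 = delta

    def next_params(self):
        # V_t and alpha_t depend only on S_{t-1}, so they are available
        # before the t-th MLMC estimates are observed
        V_t = self.S ** self.beta / self.R
        alpha_t = self.S / self.R ** 2
        return V_t, alpha_t

    def update(self, F_t, G_t, H_t):
        # a_t = F_t^2/4 + R^2 G_t^2 + H_t^2 + delta, accumulated into S_t
        self.S += F_t ** 2 / 4 + self.R ** 2 * G_t ** 2 + H_t ** 2 + self.delta
```

Calling next_params before each primal-dual update and update after observing the new gradients reproduces the schedule without any knowledge of the global bounds F, G, H or of τmix.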
Theorem 8  Suppose that Assumptions 1 and 2 hold. Then for stochastic-constrained stochastic optimization (SCSO), Algorithm 2 guarantees that

E[Gap(x̄T)] = Õ( τmix^{1−β} / T^β ),
E[Infeasibility(x̄T)] = Õ( τmix^{(2β+1)/4} / T^{(1−β)/2} + τmix^{(3−2β)/4} / T^{(β+1)/2} ).

5 Numerical Experiments

We examine the performance of the ergodic drift-plus-penalty algorithm (Algorithm 1) for the known mixing time case and the MLMC adaptive drift-plus-penalty algorithm (Algorithm 2) for the unknown mixing time case on a linear classification problem with fairness constraints using synthetic data. We follow the experimental setup of Zafar et al. (2019) for creating data points and sensitive features (with ϕ = π/2) and for imposing fairness constraints. To make our Markov chain more meaningful and sufficiently complex, we experiment with a 3-state chain instead of the 2-state chain used by Dorfman and Levy (2022). The 3-state Markov chain is given by the following transition matrix

[ 1 − 2p    p         p      ]
[ p         1 − 2p    p      ]
[ p         p         1 − 2p ]

which has stationary distribution (1/3, 1/3, 1/3).

Theorem 9 (Levin and Peres, 2017, Theorems 12.4 and 12.5)  For an ergodic and reversible Markov chain with n states, whose transition matrix is P, let 1 = λ1 > λ2 ≥ ... ≥ λn be the eigenvalues of P and µmin be the minimum entry of the stationary distribution. Then, the mixing time of the chain satisfies

( |λ2| / (1 − max{|λ2|, |λn|}) ) log 2 ≤ τmix ≤ ( 1 / (1 − max{|λ2|, |λn|}) ) log(4/µmin).

Then it follows that the mixing time of the 3-state Markov chain satisfies

((1 − 3p)/(3p)) log 2 ≤ τmix ≤ (1/(3p)) log 12.

In our experiment, we used 1/(3p) as an approximation of the mixing time.

For each set of data points, we generate two clusters, each of which has 1,000 data points in R² sampled from a multivariate normal distribution. All data points from a cluster have the same label in {−1, 1}. We denote the index set corresponding to each state j ∈ {1, 2, 3} as Dj and the whole index set as D = D1 ∪ D2 ∪ D3. The data clusters are drawn in Figure 1. The color corresponding to each state and label is summarized in Table 2.

Figure 1: Data Points

| label \ state | 1      | 2      | 3     |
| 1             | purple | blue   | green |
| −1            | yellow | orange | red   |

Table 2: Colors for (State, Label) Pairs

We then generate a binary sensitive feature zi ∈ {0, 1} randomly for each data point xi ∈ R^d, e.g., gender. We want the binary sensitive feature of the data points to have low covariance with the results from our classifier. More details about how to create sensitive features are included in the supplement. We use logistic regression classifiers with the loss functions

fj(w, b) = (1/|Dj|) ∑_{i∈Dj} log(1 + e^{−yi(w^⊤xi + b)})

and constraint functions

gj(w, b) = (1/|Dj|) ∑_{i∈Dj} (zi − z̄)(w^⊤xi + b) − c,
hj(w, b) = −(1/|Dj|) ∑_{i∈Dj} (zi − z̄)(w^⊤xi + b) − c

for each state j, where z̄ = ∑_{i∈D} zi / |D| and c > 0. Then the stochastic-constrained stochastic optimization problem is to minimize the usual logistic regression loss function under fairness constraints, which was proposed by Zafar et al. (2019), as follows:

min_{(w,b)∈X} (1/|D|) ∑_{i∈D} log(1 + e^{−yi(w^⊤xi + b)})
s.t.  −c ≤ (1/|D|) ∑_{i∈D} (zi − z̄)(w^⊤xi + b) ≤ c.

Solving this problem with our framework can be viewed as a distributed optimization scheme of Ram et al. (2009b) and Johansson et al. (2010). Basically, there are agents 1, 2, and 3 sharing z̄, and agent i has data Di.
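As a sanity check on this approximation, the following sketch evaluates the eigenvalue-based bounds of Theorem 9 for the experimental transition matrix, here with p = 0.001 (the value used later in this section); it assumes only NumPy.

```python
import numpy as np

p = 0.001
P = np.array([[1 - 2 * p, p, p],
              [p, 1 - 2 * p, p],
              [p, p, 1 - 2 * p]])

# P is symmetric, so eigvalsh applies; eigenvalues are 1, 1-3p, 1-3p
eigvals = np.sort(np.linalg.eigvalsh(P))[::-1]
lam_star = max(abs(eigvals[1]), abs(eigvals[-1]))  # = 1 - 3p
mu_min = 1.0 / 3.0                                  # uniform stationary distribution

lower = (abs(eigvals[1]) / (1 - lam_star)) * np.log(2)   # ((1-3p)/3p) log 2
upper = (1 / (1 - lam_star)) * np.log(4 / mu_min)        # (1/3p) log 12
print(lower, 1 / (3 * p), upper)  # 1/(3p) lies between the two bounds
```

With p = 0.001 this prints roughly 230, 333, and 828, confirming that 1/(3p) is a reasonable proxy for τmix between the lower and upper bounds.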
After we update our parameters wt and bt, they are sent to agent i_t, and the agent sends us back the information ∇f_{i_t}(wt, bt), g_{i_t}(wt, bt), ∇g_{i_t}(wt, bt), h_{i_t}(wt, bt), ∇h_{i_t}(wt, bt). Here, we may impose that the sequence of selected agents gives rise to an ergodic Markov chain.

The experimental setup involves two constraints. Although we consider the single constraint setting in the paper for simplicity, our results can be easily extended to the case with multiple constraint functions ḡ1, ..., ḡn. For each ḡi, we obtain sampled functions g_{t,i} for t ∈ [T]. Then for each t, we update xt and {Q_{t,i}}_{i=1}^n as

xt+1 = argmin_{x∈X} { (Vt∇ft(xt) + ∑_{i=1}^n Q_{t,i}∇g_{t,i}(xt))^⊤ x + αt D(x, xt) },
Q_{t+1,i} = [Q_{t,i} + g_{t,i}(xt) + ∇g_{t,i}(xt)^⊤(xt+1 − xt)]₊,  i = 1, ..., n.

For Algorithm 1, we use the parameters Vt = (τmix t)^β and αt = τmix t as before. For Algorithm 2, we define the MLMC estimator g_{t,i} using {g_{t,i}^{(j)}}_{j=1}^{Nt} for each i ∈ [n] and define

a0 = S0 = δ,  at = Ft²/4 + ∑_{i=1}^n R²G_{t,i}² + ∑_{i=1}^n H_{t,i}²,  St = δ + ∑_{s=1}^t as,

where G_{t,i} = ∥∇g_{t,i}(xt)∥∗ and H_{t,i} = |g_{t,i}(xt)|. The parameters Vt and αt are defined as in (1).

We compare our DPP-based algorithms with some existing algorithms developed for stochastic-constrained stochastic optimization with i.i.d. data. The list of algorithms that we tested is given as follows.

• PD: Primal-dual method by Mahdavi et al. (2012).
• PD2: Primal-dual method by Jenatton et al. (2016).
• DPP: Drift-plus-penalty algorithm by Yu et al. (2017).
• EDPP-t: Ergodic drift-plus-penalty (Algorithm 1).
• EDPP-T: Modification of Algorithm 1 with non-adaptive parameters Vt = √(τmix T) and αt = τmix T.
• MDPP: MLMC adaptive drift-plus-penalty (Algorithm 2).

For MDPP, we observed that the MLMC estimator usually has a high variance in practice, making experimental results unstable. Hence, we truncated MLMC sampling so that the number of samples per iteration is at most 2⁴. We chose δ = F²/4 + 2R²G² + 2H², for which we computed the constants F, G, H > 0 such that Ft ≤ F, G_{t,i} ≤ G, H_{t,i} ≤ H. We set p = 0.001 and c = 0.5, with which we ran the list of algorithms with the same initial parameters and sequence of states. We first ran MDPP for 25,000 iterations, which created 101,034 samples. The results on the optimality gap are summarized in Figure 2, and the results on the infeasibility are presented in Figure 3 and Table 3. The results on the regret and the cumulative constraint violations are shown in Figures 4 and 5, respectively.

Figure 2: Optimality Gap (Left), Enlarged Figure around 5,000 - 30,000 Samples (Right)

As shown in the figures, our algorithms (EDPP-T, EDPP-t, MDPP) outperform the other algorithms in terms of the optimality gap. DPP also shows a good optimality gap, but it ends with a positive constraint 1 infeasibility. In contrast, EDPP-T, EDPP-t, and MDPP all end with a negative constraint infeasibility. Note that after 20,000 samples, EDPP-T achieves the smallest optimality gap, followed by EDPP-t and MDPP.

Figure 3: Constraint 1 Infeasibility (Left), Constraint 2 Infeasibility (Right)

| Algorithm | Final value of constraint 1 infeasibility |
| PD        | −0.0533 |
| PD2       | −0.4997 |
| DPP-T     | 0.0622  |
| DPP-t     | 0.0800  |
| EDPP-T    | 0.2042  |
| EDPP-t    | −0.0904 |
| MDPP      | −0.1536 |

Table 3: Final Values of Constraint 1 Infeasibility
However, Figure 3 shows that EDPP-T incurs a significantly higher infeasibility for constraint 1, given by

(1/|D|) ∑_{i∈D} (zi − z̄)(w^⊤xi + b) ≤ c,

than the other algorithms. In contrast, EDPP-t outperforms DPP-T and DPP-t in terms of both the optimality gap and the infeasibility measure. It is also interesting to see that DPP-T and DPP-t behave similarly, while DPP-t performs better than DPP-T.

Figure 4 shows the regret values under various algorithms for the online convex optimization setting where each pair (ft, gt) of functions corresponds to one data sample. Accordingly, we excluded MDPP as it requires multiple samples in one round. In addition, we excluded PD2, which exhibits much higher regret values than the other algorithms, to focus on and better present the performance of the other algorithms. Figure 5 shows that EDPP-T incurs a positive cumulative constraint 1 violation, while the other algorithms result in a negative cumulative constraint violation.

Figure 4: Regret

Figure 5: Constraint 1 Cumulative Violation (Left), Constraint 2 Cumulative Violation (Right)

We may check from Figures 4 and 5 that EDPP-t performs the best for online convex optimization with ergodic constraints.

6 Analysis of Ergodic Drift-Plus-Penalty for the Known Mixing Time Case

This section presents the proofs of Theorems 1 to 3 given in Section 3. Section 6.1 contains the proof of Theorem 1, which provides regret and constraint violation bounds for the formulation of online convex optimization with ergodic constraints. Then Section 6.2 presents the proof of Theorem 2 that gives bounds on the optimality gap and the feasibility gap for SCSO. In Section 6.3, we prove Theorem 3 for the case where Slater's condition is satisfied.

As before, throughout this section, we denote by Ft the σ-field generated by the information accumulated up to time step t. That is,

Ft = σ({ξ1, ..., ξt}).

Moreover, Et[·] in this section refers to the conditional expectation with respect to Ft, i.e., Et[·] = E[· | Ft] (in Sections 4 and 7, Ft denotes the σ-field generated up to time step t under the MLMC estimation scheme). Furthermore, P^s_{[t]} for s > t denotes the probability measure of ξs conditional on Ft.

6.1 Ergodic Drift-Plus-Penalty for Online Convex Optimization with Ergodic Constraints

The three important components of our analysis are Lemma 11, which bounds the term E[∑_{t=1}^T Qt gt(x)]; Lemma 12, which provides an upper bound on the term E[∑_{t=1}^T (Qt/Vt) gt(x)]; and Lemma 13, which derives an upper bound on the expected queue size E[Qt]. Then, plugging the deduced bounds into the lemmas in Appendix A analyzing the general template of adaptive drift-plus-penalty, we prove Theorem 1.

By the update rule and the convexity of gt, we deduce the following straightforward bound on |Q_{t+1} − Qt|.

Lemma 10  For t ≥ 1, we have −H − GR ≤ Q_{t+1} − Qt ≤ H.

Proof  Note that we have

Q_{t+1} = [Qt + gt(xt) + ∇gt(xt)^⊤(xt+1 − xt)]₊ ≤ [Qt + gt(xt+1)]₊ ≤ Qt + H.

For the lower bound,

Q_{t+1} ≥ Qt + gt(xt) + ∇gt(xt)^⊤(xt+1 − xt) ≥ Qt − |gt(xt)| − ∥∇gt(xt)∥∗ ∥xt+1 − xt∥ ≥ Qt − H − GR,

as required.

As Q1 = 0, it follows that Qt ≤ (t − 1)H. Recall that we denote τ = τmix(T^{−1}). The next lemma provides a bound on the term ∑_{t=1}^T E[Qt gt(x)]. As mentioned in Section 3.1, we need to take into account that g1, ..., gT are not i.i.d. while they are generated by an ergodic Markov chain.
Lemma 11  For any x ∈ X such that ḡ(x) ≤ 0,

E[ ∑_{t=1}^T Qt gt(x) ] ≤ H(H + GR)(τ − 1)T + (2H/T) ∑_{t=1}^T E[Qt].

Proof  If T < τ, then

∑_{t=1}^T Qt gt(x) ≤ ∑_{t=1}^T (t − 1)H² = (T(T − 1)/2) H² < H²(τ − 1)T,

and the statement follows. If T ≥ τ, then

E[ ∑_{t=1}^T Qt gt(x) ]
= ∑_{t=1}^{τ−1} E[Qt gt(x)] + ∑_{t=1}^{T−τ+1} E[Q_{t+τ−1} g_{t+τ−1}(x)]
≤ ∑_{t=1}^{τ−1} (t − 1)H² + ∑_{t=1}^{T−τ+1} E[(Q_{t+τ−1} − Qt) g_{t+τ−1}(x)] + ∑_{t=1}^{T−τ+1} E[Qt g_{t+τ−1}(x)]
≤ ((τ − 1)(τ − 2)/2) H² + (T − τ + 1)(τ − 1)H(H + GR) + ∑_{t=1}^{T−τ+1} E[Qt E_{t−1}[g_{t+τ−1}(x) − ḡ(x)]]
≤ ((τ − 1)(2T − τ)/2) H(H + GR) + ∑_{t=1}^{T−τ+1} E[Qt E_{t−1}[g_{t+τ−1}(x) − ḡ(x)]]

where the second inequality holds because (Q_{t+τ−1} − Qt) g_{t+τ−1}(x) ≤ (τ − 1)(H + GR)H due to Lemma 10, ḡ(x) ≤ 0, and Qt is F_{t−1}-measurable. Here, to bound E_{t−1}[g_{t+τ−1}(x) − ḡ(x)], we consider

E_{t−1}[g_{t+τ−1}(x) − ḡ(x)] = E_{t−1}[g(x, ξ_{t+τ−1}) − g(x, ξ)]
≤ ∫_A H (dP^{t+τ−1}_{[t−1]} − dµ) − ∫_{A^c} H (dP^{t+τ−1}_{[t−1]} − dµ)
≤ 2H ∥P^{t+τ−1}_{[t−1]} − µ∥_TV

where P^{t+τ−1}_{[t−1]} is the probability measure of ξ_{t+τ−1} conditional on F_{t−1}, ξ ∼ µ, and A is an appropriate measurable set. By the definition of τ = τmix(T^{−1}), it follows that ∥P^{t+τ−1}_{[t−1]} − µ∥_TV ≤ 1/T, which implies that E_{t−1}[g_{t+τ−1}(x) − ḡ(x)] ≤ 2H/T. From this, we deduce the desired statement.

Next we provide an upper bound on the term ∑_{t=1}^T E[(Qt/Vt) gt(x)], as we mentioned in Section 3.1.

Lemma 12  For any x ∈ X such that ḡ(x) ≤ 0,

E[ ∑_{t=1}^T (Qt/Vt) gt(x) ] ≤ (2H(H + GR)(τ + 1) / (τmix^β (1 − β))) (T + 1)^{1−β}.

Proof  Consider the case T < τ first. Note that

∑_{t=1}^T (Qt/Vt) gt(x) ≤ (H²/τmix^β) ∑_{t=1}^T t^{1−β} ≤ (H²/(τmix^β (2 − β))) (T + 1)^{2−β} ≤ (2H²(τ + 1)/(τmix^β (1 − β))) (T + 1)^{1−β}

where the second inequality is due to Corollary 29 and the third inequality is from T < τ. Next we consider the case T ≥ τ. Note that

E[ ∑_{t=1}^T (Qt/Vt) gt(x) ] = ∑_{t=1}^{τ−1} E[(Qt/Vt) gt(x)] + ∑_{t=1}^{T−τ+1} E[(Q_{t+τ−1}/V_{t+τ−1}) g_{t+τ−1}(x)].

Here the first term on the right-hand side can be bounded as

∑_{t=1}^{τ−1} E[(Qt/Vt) gt(x)] ≤ (H²/τmix^β) ∑_{t=1}^{τ−1} t^{1−β} ≤ (H²/(τmix^β (2 − β))) τ^{2−β} ≤ (H²τ/(τmix^β (1 − β))) (T + 1)^{1−β}

as above. Moreover, the second term can be bounded as follows:

∑_{t=1}^{T−τ+1} E[(Q_{t+τ−1}/V_{t+τ−1}) g_{t+τ−1}(x)]
≤ ∑_{t=1}^{T−τ+1} E[((Q_{t+τ−1} − Qt)/V_{t+τ−1}) g_{t+τ−1}(x)] + ∑_{t=1}^{T−τ+1} E[(Qt/V_{t+τ−1}) g_{t+τ−1}(x)]
≤ (τ − 1)H(H + GR) ∑_{t=1}^{T−τ+1} 1/(τmix(t + τ − 1))^β + ∑_{t=1}^{T−τ+1} (1/V_{t+τ−1}) E[Qt E_{t−1}[g_{t+τ−1}(x) − ḡ(x)]]
≤ ((τ − 1)H(H + GR)/(τmix^β (1 − β))) (T^{1−β} − (τ − 1)^{1−β}) + (2H/T) ∑_{t=1}^{T−τ+1} E[Qt/V_{t+τ−1}]
≤ (H(H + GR)τ/(τmix^β (1 − β))) (T + 1)^{1−β} + (2H/T) ∑_{t=1}^T E[Qt/Vt]
≤ (H(H + GR)τ/(τmix^β (1 − β))) (T + 1)^{1−β} + (2H²/(τmix^β T)) ∑_{t=1}^T t^{1−β}
≤ (H(H + GR)τ/(τmix^β (1 − β))) (T + 1)^{1−β} + (2H²/(τmix^β T (2 − β))) (T + 1)^{2−β}
≤ (H(H + GR)(τ + 1)/(τmix^β (1 − β))) (T + 1)^{1−β},

where the second inequality is from Lemma 10 and ḡ(x) ≤ 0, the third inequality holds due to Corollary 29 and the choice of τ, the sixth inequality comes from Corollary 29, and the last inequality holds because T(2 − β) > (T + 1)(1 − β).

Based on Lemmas 11 and 12, we can now prove the first part of Theorem 1, which upper bounds the regret of Algorithm 1 for online convex optimization with ergodic constraints.

Proof [The first part of Theorem 1]  Lemma 21 implies that

E[ ∑_{t=1}^T ft(xt) − ∑_{t=1}^T ft(x*) ] ≤ (α_T/V_T) R² + (F²/4) ∑_{t=1}^T Vt/αt + ((H + GR)²/2) ∑_{t=1}^T 1/Vt + E[ ∑_{t=1}^T (Qt/Vt) gt(x*) ].

Next, using Lemma 12 with x* satisfying ḡ(x*) ≤ 0 and plugging in our choice of αt and Vt, we deduce the following:

E[ ∑_{t=1}^T ft(xt) − ∑_{t=1}^T ft(x*) ]
≤ R²(τmix T)^{1−β} + (F² τmix^{β−1}/(4β)) T^β + ((H + GR)² τmix^{−β}/(2(1 − β))) T^{1−β} + (2H(H + GR)(τ + 1)/(τmix^β (1 − β))) (T + 1)^{1−β}
= O( τmix^{−β} τ T^{1−β} ),

as required.

Next, we prove the second part of Theorem 1, which gives an upper bound on constraint violation under Algorithm 1. The following lemma provides a time-varying bound on the expected virtual queue size.

Lemma 13  For t ∈ [T + 1], E[Qt] is bounded above by

(3/2)√(2I(t − 1)) + ((β + 3)/(β + 1)) √(2FR τmix^β t^{β+1}/(1 + β)) + (3R/2)√(2τmix(t − 1)) + (3H/2)√(2(τ − 1)(t − 1)) + 4H

where I = H² + G²R² + F²/4.

Proof  To prove the lemma, we argue by induction. Note that the statement of the lemma trivially holds when t = 1 because Q1 = 0. Suppose the statement holds for t ≤ s for some s ≥ 1. What remains is to provide an upper bound on E[Q_{s+1}]. Note that

E[Qt(gt(xt) + ∇gt(xt)^⊤(x − xt))] = E[Qt E_{t−1}[gt(xt) + ∇gt(xt)^⊤(x − xt)]] ≤ E[Qt E_{t−1}[gt(x)]] = E[Qt gt(x)].

Then Lemma 24 together with Jensen's inequality implies the following:

E[Q_{s+1}] ≤ √( 2s(H² + R²G²) + 2RF ∑_{t=1}^s Vt + 2R² αs + 2 ∑_{t=1}^s E[Qt gt(x)] ).

Moreover, ∑_{t=1}^s E[Qt gt(x)] can be upper bounded based on Lemma 11. Then it follows that

E[Q_{s+1}]
≤ √( 2Is + 2FR τmix^β (s + 1)^{β+1}/(1 + β) + 2R² τmix s + 2H(H + GR)(τ − 1)s + (4H/s) ∑_{t=1}^s E[Qt] )
≤ √(2Is) + √(2FR τmix^β (s + 1)^{β+1}/(1 + β)) + R√(2τmix s) + H√(2(τ − 1)s) + 2√( (H/s) ∑_{t=1}^s E[Qt] )
≤ √(2Is) + √(2FR τmix^β (s + 1)^{β+1}/(1 + β)) + R√(2τmix s) + H√(2(τ − 1)s) + 2H + (1/(2s)) ∑_{t=1}^s E[Qt]

for s ≤ T, where the last step uses the AM-GM inequality 2√((H/s) X) ≤ 2H + X/(2s). By the induction hypothesis, E[Qt] for any 1 ≤ t ≤ s is upper bounded by

(3/2)√(2I(t − 1)) + ((β + 3)/(β + 1)) √(2FR τmix^β t^{β+1}/(1 + β)) + (3R/2)√(2τmix(t − 1)) + (3H/2)√(2(τ − 1)(t − 1)) + 4H
≤ (3/2)√(2Is) + ((β + 3)/(β + 1)) √(2FR τmix^β (s + 1)^{β+1}/(1 + β)) + (3R/2)√(2τmix s) + (3H/2)√(2(τ − 1)s) + 4H.

This leads to the desired upper bound on E[Q_{s+1}].

We are now ready to prove the second part of Theorem 1, which proves the constraint violation bound of Algorithm 1.

Proof [The second part of Theorem 1]  Combining Lemmas 22 and 13, we deduce that

E[ ∑_{t=1}^T gt(xt) ] ≤ E[Q_{T+1}] + (FG/2) ∑_{t=1}^T Vt/αt + (G²/2) ∑_{t=1}^T E[Qt]/αt = O( τmix^{β/2} T^{β/2+1/2} + √((τ − 1)T) ),

as required.

6.2 Ergodic Drift-Plus-Penalty for Stochastic-Constrained Stochastic Optimization

In this section, we provide our formal proof of Theorem 2. To better organize and present the result, we divide the analysis into two parts, one of which is for bounding the optimality gap while the other is for bounding the feasibility gap. The main idea is to use the performance bounds given in Theorem 1 and relate them to the optimality gap and the feasibility gap. The relation between them is not as clear as in the i.i.d. setting, but we use the mixing property of ergodic Markov chains.

Proof [The first part of Theorem 2]  For T ≥ τ = τmix(T^{−1}), we have the following:

∑_{t=1}^T (f̄(xt) − f̄(x))
= ∑_{t=1}^{T−τ+1} (f̄(xt) − f̄(x) − f_{t+τ−1}(xt) + f_{t+τ−1}(x))
+ ∑_{t=1}^{T−τ+1} (f_{t+τ−1}(xt) − f_{t+τ−1}(x_{t+τ−1}))
+ ∑_{t=τ}^T (ft(xt) − ft(x))
+ ∑_{t=T−τ+2}^T (f̄(xt) − f̄(x)).

We consider the four parts of the right-hand side separately.
Here, the first part satisfies   Et−1 f¯(xt ) − f¯(x) − ft+τ −1 (xt ) + ft+τ −1 (x) Z   −1 = (f (xt , ξ) − f (x, ξ)) dµ(ξ) − dPt+τ (ξ) [t−1] Z t+τ −1 ≤ F R |dµ(ξ) − dP[t−1] (ξ)| −1 ≤ 2F R∥µ − Pt+τ [t−1] ∥T V ≤ 2F R . T P −τ +1 ¯ Hence, it follows that Tt=1 E[f (xt ) − f¯(x) − ft+τ −1 (xt ) + ft+τ −1 (x)] ≤ 2F R. To bound the second part, we consider the following. E[ft+τ −1 (xt ) − ft+τ −1 (xt+τ −1 )] = ≤ ≤ t+τ −2 X s=t t+τ −2 X s=t t+τ −2 X s=t = E[ft+τ −1 (xs ) − ft+τ −1 (xs+1 )] E[F ∥xs − xs+1 ∥] F E[Vs F + Qs G] 2αs −2 t+τ −2 β−1 t+τ X F 2 τmix F G X E[Qs ] sβ−1 + 2 2τmix s=t s s=t t+τ −2 β−1   F 2 τmix F G X E[Qs ] β β ≤ (t + τ − 2) − (t − 1) + 2β 2τmix s=t s 29 Yeongjong Kim and Dabeen Lee where the last inequality holds due to Corollary 29. Here, we consider the following to provide an upper bound on the second term on the right-most side. t+τ −2 X s=t  p p E[Qs ] 2I(t + τ − 2) − 2I(t − 1) ≤3 s s  q β q 2(β + 3) 2F Rτmix β+1 β+1 + (t + τ − 2) − (t − 1) (β + 1)2 β+1 p  p + 3R 2τmix (t + τ − 2) − 2τmix (t − 1)  p p 2(τ − 1)(t + τ − 2) − 2(τ − 1)(t − 1) + 3H   t+τ −2 + 4H log . σ(t − 1) Here, we handle the above terms in the following manner. Note that T −τ X+1 q (t + τ − 2)β+1 −  q β+1 (t − 1) t=1 β+1 β+1 β+1 = (T − 1) 2 + · · · + (T − τ + 1) 2 − (τ − 2) 2 − · · · − 1 ≤ (τ − 1)T β/2+1/2 holds and that T −τ X+1 p (t + τ − 2) −  p 1 1 1 (t − 1) = (T − 1) 2 + · · · + (T − τ + 1) 2 − (τ − 2) 2 − · · · − 1 t=1 ≤ (τ − 1)T 1/2 . The resulting terms form dominant terms, which means that T −τ X+1 E[ft+τ −1 (xt ) − ft+τ −1 (xt+τ −1 )] t=1   β/2−1 −1 = O τmix (τ − 1)T β/2+1/2 + τmix (τ − 1)3/2 T 1/2 . Moreover, by Theorem 1 and the triangular inequality, the third part is bounded as E " T X #   −β (ft (xt ) − ft (x)) = O τmix τ T 1−β + (τ − 1) . t=τ The fourth part can be bounded as follows. " E T X # f¯(xt ) − f¯(x) t=T −τ +2 30  ≤ RF (τ − 1). Stochastic-Constrained Stochastic Optimization with Markovian Data Combining the bounds on the four parts, we can conclude that for any x ∈ X such that ḡ(x) ≤ 0, " T # X  E f¯(xt ) − f¯(x) t=1   β/2−1 −β −1 = O τmix τ T 1−β + τmix (τ − 1)T β/2+1/2 + τmix (τ − 1)3/2 T 1/2 + τ , which implies " T # X   1 E f¯(x̄T ) − f¯(x) ≤ E f¯(xt ) − f¯(x) T  t=1 =O τ β τmix β/2 (τ − 1)3/2 τ τ − 1 τmix + + + (1−β)/2 1/2 τmix T T τmix T Tβ ! , as required. Next we prove the feasibility gap bound of Algorithm 1, which is given as the second part of Theorem 2. Proof [Proof of the second part of Theorem 2] For T ≥ τ = τmix (T −1 ), we have the following, T X ḡ(xt ) = t=1 T −τ X+1 t=1 T X + (ḡ(xt ) − gt+τ −1 (xt )) + T −τ X+1 (gt+τ −1 (xt ) − gt+τ −1 (xt+τ −1 )) t=1 gt (xt ) + t=τ T X ḡ(xt ) t=T −τ +2 where the right-hand side consists of four terms. We separately upper bound the four parts of the right-hand side. As before, the first part satisfies Z   t+τ −1 Et−1 [ḡ(xt ) − gt+τ −1 (xt )] = g(xt , ξ) dµ(ξ) − dP[t−1] (ξ) Z −1 ≤ H |dµ(ξ) − dPt+τ [t−1] (ξ)| −1 ≤ 2H∥µ − Pt+τ [t−1] ∥T V ≤ This implies that PT −τ +1 t=1 2H . T E[ḡ(xt ) − gt+τ −1 (xt )] ≤ 2H. Next, the second part satisfies E[gt+τ −1 (xt ) − gt+τ −1 (xt+τ −1 )] = ≤ t+τ −2 X s=t t+τ −2 X s=t 31 E[gt+τ −1 (xs ) − gt+τ −1 (xs+1 )] E[G∥xs − xs+1 ∥] Yeongjong Kim and Dabeen Lee As in the proof of the first part of Theorem 2, we deduce that T −τ X+1   β/2−1 −1 E[gt+τ −1 (xt ) − gt+τ −1 (xt )] ≤ O τmix (τ − 1)T β/2+1/2 + τmix (τ − 1)3/2 T 1/2 . t=1 Moreover, it follows from Theorem 1 and the triangular inequality that the third part can be bounded as " T #   X p β/2 E gt (xt ) = O τmix T β/2+1/2 + (τ − 1)T + τ − 1 . 
t=τ +2 Lastly, the following provides an upper bound on the fourth part. " T # X E ḡ(xt ) ≤ H(τ − 1). t=T −τ +2 Combining the bounds on the four parts, we have that " T #   X β/2 β/2−1 −1 (τ − 1)3/2 T 1/2 + τ , E ḡ(xt ) ≤ O τmix T β/2+1/2 + τmix (τ − 1)T β/2+1/2 + τmix t=1 implying in turn that " T # X 1 ḡ(xt ) = O E [ḡ(x̄T )] ≤ E T t=1 β/2−1 τmix τ (τ − 1)3/2 τ + + T T (1−β)/2 τmix T 1/2 ! , as required. 6.3 Ergodic Drift-Plus-Penalty under Slater’s Condition In this section, we prove Theorem 3, which analyzes the performance of Theorem 3 under Slater’s constraint qualification stated in Assumption 3. We will see that Slater’s condition does lead to improvement. Basically, we deduce a reduction on E[Qt ]. To argue this, we follow the proof path of Yu et al. (2017). We first present a lemma, which is analogous to (Yu et al., 2017, Lemma 5). The difference is that we use a time-varying parameter θ(t) which allows time-varying algorithm parameters Vt and αt . In fact, its proof is similar to that of (Yu et al., 2017, Lemma 5), so we defer it to the appendix (Appendix D). Lemma 14 Let {Zt , t ≥ 0} be a discrete-time stochastic process adapted to a filtration {Wt , t ≥ 0} with Z0 = 0 and W0 = {∅, S}. Suppose there exists a positive integer t0 , real numbers 0 < ζ ≤ δmax , and a non-decreasing function θ(t) > 0 such that |Zt+1 − Zt | ≤ δmax , ( t0 δmax , E[Zt+t0 − Zt |Wt ] ≤ −t0 ζ, 32 if Zt < θ(t) if Zt ≥ θ(t) Stochastic-Constrained Stochastic Optimization with Markovian Data for all positive integer t. Then, 4δ 2 E[Zt ] ≤ θ(t) + t0 δmax + t0 max log ζ  2 8δmax ζ2  for all positive integer t. Here, the important part is that although θ(t) is time-varying, the parameter t0 is some fixed value that does not depend on t. That is why the condition in Lemma 14 can be recursively applied. Lemma 14 implies that if the stochastic process given by {Qt , t ≥ 0} where Q0 = 0 satisfies the drift condition as in Lemma 14 with appropriate parameters δmax , t0 , and ζ, then the expected queue size E [Qt ] can be bounded. Moreover, by properly setting θ(t) as well as the parameters, we may control the size of E [Qt ]. The next lemma shows that the stochastic process given by {Qt , t ≥ 0} indeed satisfies the desired drift condition. While proving the lemma, we need to consider the term Et−1 [Qi gi (x̂)] for i ≥ t where x̂ is the solution satisfying Slater’s constraint qualification, i.e. ḡ(x̂) ≤ −ϵ. Here, we need to relate ḡ(x̂) ≤ −ϵ and the term Et−1 [Qi gi (x̂)], from which we deduce the drift condition. Although gi (x̂) is not necessarily negative, we again use the property of ergodic Markov chains that the distribution of gi (x̂) conditional on Ft−1 gets close to the stationary distribution for a sufficiently large i. Lemma 15 Let T ≥ 4H/ϵ, t0 > 2τ = 2τmix (T −1 ), and θt0 (t) =  2 ϵ(t0 − τ )2 (H + GR) + 4τ (t0 + t − 1)(H + GR)2 ϵ(t0 − 2τ )   2 β + 2t0 (H 2 + G2 R2 ) + 2RF τmix t0 (t + t0 − 1)β . ϵ(t0 − 2τ ) Then Algorithm 1 under Assumption 3 satisfies |Qt+1 − Qt | ≤ H + GR, ( (H + GR)t0 , Et−1 [Qt+t0 − Qt ] ≤ −ϵt0 /4, if Qt < θt0 (t) if Qt ≥ θt0 (t). Proof |Qt+1 − Qt | ≤ H + GR follows directly from Lemma 10. For the second part, note that by Jensen’s inequality, we have Et−1 [Qt+t0 ]2 ≤ Et−1 [Q2t+t0 ]. Moreover, observe that θt0 (t) ≥ ϵt0 2ϵ(t0 − τ )2 (H + GR) ≥ (t0 − τ )(H + GR) ≥ , ϵ(t0 − 2τ ) 2 33 Yeongjong Kim and Dabeen Lee where the last inequality holds because t0 > 2τ and H > ϵ. This means that Qt > ϵt0 /4 if Qt > θ(t0 ). 
Therefore, it is sufficient to show that if Qt ≥ θ(t0 ), Et−1 [Q2t+t0 ] ≤   ϵt0 2 Qt − . 4 Recall that By Lemma 23 and the convexity of gt , we have for any i ≥ 1, ∆i ≤ H 2 + G2 R2 + Qi gi (x̂) + Vi RF + αi (D(x̂, xi ) − D(x̂, xi+1 )) . Thus, we have t+t 0 −1 X Et−1 [∆i ] i=t 2 2 2 ≤ t0 (H + G R ) + + αt D(x̂, xt ) + t+t 0 −1 X β Et−1 [Qi gi (x̂)] + RF τmix t+t 0 −1 X i=t t+t −2 0 X iβ i=t (αi+1 − αi )D(x̂, xi+1 ) − αt+t0 −1 D(x̂, xt+t0 ) i=t ≤ t0 (H 2 β t0 (t + t0 − 1)β + αt+t0 −1 R2 + + G R ) + RF τmix 2 2 t+t 0 −1 X Et−1 [Qi gi (x̂)]. i=t We factor the last term as follows. t+t 0 −1 X Et−1 [Qi gi (x̂)] i=t t+τ −2 X = Et−1 [Qi gi (x̂)] + i=t t+t 0 −τ X Et−1 [(Qi+τ −1 − Qi )gi+τ −1 (x̂)] i=t Et−1 [Qi (gi+τ −1 (x̂) − ḡ(x̂))] + + ≤H t+t 0 −τ X 2 i=t t+τ −3 X t+t 0 −τ X Et−1 [Qi ḡ(x̂)] i=t  i + (t0 − τ + 1)(τ − 1)H(H + GR) +  t+t 0 −τ X 2H −ϵ Et−1 [Qi ] T i=t i=t−1 where the inequality is due to Z Et−1 [ḡ(xt ) − gt+τ −1 (xt )] ≤ H t+τ −1 −1 |dµ(ξ) − dP[t−1] (ξ)| ≤ 2H∥µ − Pt+τ [t−1] ∥T V ≤ 34 2H . T Stochastic-Constrained Stochastic Optimization with Markovian Data Then, since T ≥ 4H/ϵ and thus 2H/T ≤ ϵ/2, it follows that t+t 0 −1 X Et−1 [Qi gi (x̂)] i=t t+t0 −τ ϵ X ≤ (t0 + t − 2)(τ − 1)H(H + GR) − Et−1 [Qi ] 2 i=t tX 0 −τ ϵ (Qt − i(H + GR)) 2 i=0  ϵ (t0 − τ )Qt − (t0 − τ )2 (H + GR) . ≤ (t0 + t − 2)(τ − 1)H(H + GR) − 2 ≤ (t0 + t − 2)(τ − 1)H(H + GR) − As a result, we obtain Et−1 [Q2t+t0 ] = Q2t + 2 t+t 0 −1 X Et−1 [∆i ] i=t ≤ Q2t − ϵ(t0 − τ )Qt + ϵ(t0 − τ )2 (H + GR) + 2(t0 + t − 2)(τ − 1)H(H + GR) β t0 (t + t0 − 1)β + 2τmix (t + t0 − 1)R2 + 2t0 (H 2 + G2 R2 ) + 2RF τmix ≤ Q2t − ϵ(t0 − τ )Qt + ϵ(t0 − τ )2 (H + GR) + 4τ (t0 + t − 1)(H + GR)2 β + 2t0 (H 2 + G2 R2 ) + 2RF τmix t0 (t + t0 − 1)β If Qt satisfies ϵ (t0 − 2τ )Qt ≥ ϵ(t0 − τ )2 (H + GR) + 4τ (t0 + t − 1)(H + GR)2 2 β t0 (t + t0 − 1)β , + 2t0 (H 2 + G2 R2 ) + 2RF τmix which is equivalent to Qt ≥ θt0 (t), then ϵ Et−1 [Q2t+t0 ] ≤ Q2t − t0 Qt ≤ 2  ϵt0 Qt − 4 2 as required. Having prepared Lemmas 14 and 15, we can derive refined bounds on the expected virtual queue size. Based on this, we are ready to prove Theorem 3. Proof [Proof of Theorem 3] As the result is trivial when T ≤ 4H/ϵ where H and ϵ are constants, we may assume without loss of generality that T ≥ 4H/ϵ. Moreover, we will use Lemma 14 for many values of t0 greater than 2τ . Setting δmax = H + GR, ζ = ϵ/4, and θ = θt0 where θt0 is defined as in Lemma 15, Lemma 14 gives us that 16(H + GR)2 E[Qt+1 ] ≤ θt0 (t) + t0 (H + GR) + t0 log ϵ 35  128(H + GR)2 ϵ2  . Yeongjong Kim and Dabeen Lee √ In particular, as this inequality holds for any t0 > 2τ , the inequality with t0 = 2τ + ⌈ τmix t⌉ holds. Recall that in Lemma 15, we set θt0 (t) =  2 ϵ(t0 − τ )2 (H + GR) + 4τ (t0 + t − 1)(H + GR)2 ϵ(t0 − 2τ )   2 β + 2t0 (H 2 + G2 R2 ) + 2RF τmix t0 (t + t0 − 1)β . ϵ(t0 − 2τ ) Moreover, √ √ ϵ⌈ τmix t⌉ ϵ τmix t ϵ(t0 − 2τ ) θt0 (t) = θt0 (t) ≥ θt0 (t). 2 2 2 This implies the following. 
√ ϵ τmix t θt0 (t) 2 √ √ ≤ ϵ(τ + ⌈ τmix t⌉)2 (H + GR) + 4τ (2τ + ⌈ τmix t⌉ + t − 1)(H + GR)2 √ √ √ β + 2(2τ + ⌈ τmix t⌉)(H 2 + G2 R2 ) + 2RF τmix (2τ + ⌈ τmix t⌉)(t + 2τ + ⌈ τmix t⌉ − 1)β √ ≤ 2ϵ(H + GR)(τ 2 + 4τmix t) + 4τ (2τ + 2 τmix t + t)(H + GR)2 √ √ √ β + 4(τ + τmix t)(H 2 + G2 R2 ) + 4RF τmix (τ + τmix t)(t + 2τ + 2 τmix t)β ≤ 2ϵ(H + GR)(τ 2 + 4τmix t) + 4τ (2τ + τmix + 2t)(H + GR)2 | {z } | {z } + 4(τ + | √ (a) (b) τmix t)(H {z (c) 2 √ β + G R ) + 4RF τmix (τ + τmix t)(2t + 2τ + τmix )β } 2 2 | {z } (d) where the second inequality follows from (a + b)2 ≤ 2(a2 + b2 ), ⌈x⌉ ≤ 2x which holds for x ≥ 1 and the third inequality follows from AM-GM inequality. We see that term (c) grows more slowly than term (d) which is of order     √ √ β β O τmix (τ + t)β (τ + τmix t) = O τmix (τ β + tβ )(τ + τmix t) . Here, we used the known fact that (a + b)β ≤ aβ + bβ for a, b ≥ 0 and β < 1. Likewise, term (a) grows more slowly than term (b) which is of order O (τ (τ + t)) . Then it follows that     τ τ (τ + t) β β β √ √ θt0 (t) = O τmix (τ + t ) 1 + + τmix t τmix t    τ 1+β τ β τ τ β tβ τ2 τt   β β  β +√ = O τmix τ + τmix tβ + √ mix + √ mix + √ . | {z } | {z } τmix t τmix t τmix t τmix t  | {z } | {z } | {z } | {z } (a′ ) (b′ ) (c′ ) 36 (d′ ) (e′ ) (f ′ ) Stochastic-Constrained Stochastic Optimization with Markovian Data Here, term (c′ ) grows more slowly than term (e′ ), and term (f ′ ) dominates term (b′ ). Also, terms (a′ ) and (d′ ) are dominated by   r  τ = O (e′ ) + (f ′ ) , τ =O τ τmix which is due to the AM-GM inequality. Therefore,   τ (τ + t) √ θt0 (t) = O . τmix t Furthermore, since √ t0 = 2τ + ⌈ τmix t⌉ = O  τ (τ + t) √ τmix t  √ which holds because τ (τ + t) ≥ 2τ τ t, we have   τ (τ + t) E[Qt ] ≤ O √ , τmix t (2) which essentially gives rise to the desired refined bound on the expected virtual queue size. Having provided the bound on the expected queue size, we now proceed to provide bounds on the performance guarantees of Algorithm 1. First of all, note that the regret bound given by Theorem 1 still holds with β = 1/2, as we use the same algorithm parameters. Then we deduce that √ ! τ T . E [Regret(T )] = O √ τmix For constraint violation, we deduce from Lemma 22 and (2) that T T F G X Vt G2 X E[Qt ] E [Violation(T )] ≤ E[QT +1 ] + + 2 αt 2 αt t=1 t=1 ! T T X τ (τ + T + 1) 1 X τ (τ + t) β−1 β−1 √ p + τmix t + τmix t τmix t τmix (T + 1) t=1 t=1 =O  =O τ (τ + T ) √ τmix T  . Next, we consider the optimality gap. In the proof of Theorem 2, we argued that E[ft+τ −1 (xt ) − ft+τ −1 (xt+τ −1 )] ≤ t+τ −2 β−1   F 2 τmix F G X E[Qs ] (t + τ − 2)β − (t − 1)β + . 2β 2τmix s=t s Here, using (2) again, the second term on the right-hand side can be bounded as follows. 1 τmix t+τ −2 X s=t  E[Qs ] s    √ τ √     = O  3/2 λ((t − 1)−1/2 ) − λ((t + τ − 2)−1/2 ) + 3/2 ( t + τ − 2 − t − 1) {z } τmix | τmix ′′ τ2 (a ) 37 Yeongjong Kim and Dabeen Lee where ( x, λ(x) = 3/2, if x > 0 . if x = 0 Therefore, it follows that T −τ X+1 β−1   τmix (T − 1)β + · · · + (T − τ + 1)β 2β   3 τ2 −1/2 −1/2 + 3/2 +1 + · · · + (τ − 2) τmix 2 !  √ τ √ T − 1 + ··· + T − τ + 1 + 3/2 τmix E[ft+τ −1 (xt ) − ft+τ −1 (xt+τ −1 )] = O t=1 (3) if τ ≥ 2. Thus, we deduce T −τ X+1 E[ft+τ −1 (xt ) − ft+τ −1 (xt+τ −1 )] t=1  √  √ −3/2 −3/2 β−1 (τ − 1)T β + τ 2 τ − 1τmix + τ (τ − 1)τmix T = O τmix √ ! τ 5/2 τ 2 T =O + 3/2 . 3/2 τmix τmix We combine this result with the other parts of the proof of the first part of Theorem 2 to get " T # ! √ √ ! 
5/2 5/2 2 T 2 T X  τ τ τ τ −β 1−β +τ =O E f¯(xt ) − f¯(x) = O + 3/2 + τmix τ T + 3/2 3/2 3/2 τ τ τ τmix t=1 mix mix mix for any x ∈ X such that ḡ(x) ≤ 0 since β = 1/2. Similarly, we get T −τ X+1 E[gt+τ −1 (xt ) − gt+τ −1 (xt+τ −1 )] = O t=1 τ 5/2 3/2 + τmix √ ! τ2 T 3/2 . τmix As before, we combine this result with the other parts of the proof of the second part of Theorem 2. Then we get " T # ! ! √ √ X τ 5/2 τ 2 T τ (τ + T ) τ 5/2 τ 2 T τ2 E ḡ(xt ) = O + 3/2 + √ +τ =O + 3/2 + √ . 3/2 3/2 τ T τmix T mix τ τ τ τ t=1 mix mix mix mix Finally, we obtain that E[Gap(x̄T )] = O E[Infeasibility(x̄T )] = τ 5/2 τ2 + 3/2 3/2 √ τmix T τmix T τ 5/2 ! , τ2 τ2 + + √ 3/2 3/2 √ τmix T 3/2 τmix T τmix T as required. 38 ! , Stochastic-Constrained Stochastic Optimization with Markovian Data 7 Analysis of MLMC Adaptive Drift-Plus-Penalty for the Unknown Mixing Time Case This section presents a complete proof of Theorem 8. Section 7.1 provides bounds on several terms that involve the virtual queue size. In Section 7.2, we state the proof of Lemma 6, and lastly, we prove Theorem 8. 7.1 Controlling the Expected Virtual Queue Size In this section we provide upper bounds on terms E [Qt /Vt ] and E [Qt ] under the MLMC Adaptive Drift-Plus-Penalty algorithm (Algorithm 2). Lemma 16 gives a refined bound on the term E [gt (x)]. Using this, Lemma 17 derives an adaptive upper bound on the term E [Qt /Vt ], and Lemma 18 deduces an adaptive upper bound on the term E [Qt ]. Note that if x̄ ∈ X satisfies ḡ(x) ≤ 0, then gt (x) ≤ gt (x) − ḡ(x). Then we deduce h j i h j i max max Et−1 [gt (x)] ≤ Et−1 [gt (x) − ḡ(x)] = Et−1 gt2 (x) − ḡ(x) ≤ Et−1 |gt2 (x) − ḡ(x)| . where the equality is due to Lemma 4. Applying Lemma 5 together with Jensen’s inequality on the right-hand side, we obtain   1/2 1/2 Et−1 [gt (x)] ≤ C(T )τmix T −1 = Õ τmix T −1 . (4) In fact, we may derive a tighter bound on the term Et−1 [gt (x)] as follows. Lemma 16 For any x ∈ X such that ḡ(x) ≤ 0, 1/4 Et−1 [gt (x)] ≤ D(T )τmix T −1/2 , where D(T ) = p C(T )H(4 log2 T + 3) Proof Note that gt (x) ≤ |gt (x)| ≤ (2Nt + 1)H by the definition of gt . This implies that Et−1 [gt (x)] ≤ Et−1 [(2Nt + 1)H] = (4 log2 T + 3)H, where the equality follows from Lemma 4 and the fact that Nt is independent of Ft−1 . Moreover, as we argued above, we have h j i q  j  max 1/2 Et−1 [gt (x)] ≤ Et−1 |gt2 (x) − ḡ(x)| ≤ Et−1 |gt2 max (x) − ḡ(x)|2 ≤ C(T )τmix T −1 , where the last inequality follows from Jensen’s inequality and the equality follows from Lemma 5. By taking the geometric mean of these two upper bounds, we get that Et−1 [gt (x)] ≤ p 1/4 C(T )H(4 log2 T + 3)τmix T −1/2 , as required. 39 Yeongjong Kim and Dabeen Lee Next, Lemma 24 implies that Qs+1 is at most v v v u s u s u s p u X u X u X t2 RVt Ft + t2 (R2 G2t + Ht2 ) + 2R2 αs +t2 Qt (gt (xt ) + ∇gt (xt )⊤ (x − xt )). | {z } t=1 t=1 t=1 (c) {z } | {z } | (a) (b) Here, term (c) is equal to For term (a), we have (a)2 = 2 s X √ √ √ 2Ss−1 which is at most 2Ss , and term (b) is at most 2Ss . β St−1 Ft ≤ 2Ssβ t=1 s X v u s u X √ βt Ft ≤ 2Ss s Ft2 ≤ 2 sSsβ+1/2 , t=1 t=1 where the second inequality holds by the power mean inequality. Thus, s X √ Qt (gt (xt ) + ∇gt (xt )⊤ (x − xt )), Q2s+1 ≤ 2 sSsβ+1/2 + 4Ss + 2 (5) t=1 which in turn implies that √ Qs+1 ≤ 2 2Ss1/2 + √ v u s u X 1/4 β/2+1/4 2s Ss + t2 Qt (gt (xt ) + ∇gt (xt )⊤ (x − xt )) (6) t=1 Moreover, (6) implies that v u s u 2 X √ √ Qs+1 Qt 1/2−β 1/4 1/4−β/2 ≤ 2 2RSs + 2Rs Ss +t (gt (xt ) + ∇gt (xt )⊤ (x − xt )) Vs+1 Vs+1 Vt t=1 as Vt is non-decreasing. 
Taking the expectation, applying Jensen’s inequality, and using the fact that Vs+1 ≥ sδ, it follows that   √ √ Qs+1 E ≤ 2 2RE[Ss ]1/2−β + 2Rs1/4 E[Ss ]1/4−β/2 Vs+1 v u   s u 2R X Qt ⊤ t + E Et−1 [gt (xt ) + ∇gt (xt ) (x − xt )] . Vt (sδ)β t=1 Here, the last term on the right-hand side can be further upper bounded as follows. v v u   u   s s u 2R X u 2R X Qt Qt ⊤ (x − x )] ≤ t t E E [g (x ) + ∇g (x ) E E [g (x)] t−1 t t t t t t−1 t Vt Vt (sδ)β (sδ)β t=1 t=1 where we used the fact that h i Et−1 [gt (xt ) + ∇gt (xt )⊤ (x − xt )] = Et−1 g jmax (xt ) + ∇g jmax (xt )⊤ (x − xt )   ≤ Et−1 g jmax (x) = Et−1 [gt (x)] 40 (7) Stochastic-Constrained Stochastic Optimization with Markovian Data holds due to Lemma 4. Furthermore, v v u  u    s s X X u 2R u 2R Qt Qt 1/4 −1/2 t t Et−1 [gt (x)] ≤ E D(T )τmix T E β β Vt Vt (sδ) (sδ) t=1 t=1   s 2Rs 1 X Qt 1/4 −1/2 ≤ D(T )τmix T + E β 2s Vt (sδ) t=1   s 2RD(T ) 1/4 1/2−β 1 X Qt ≤ τmix T + E . 2s Vt δβ t=1 where the second inequality follows from the AM-GM inequality and the last equality follows from s ≤ T . Thus, we have   Qs+1 E Vs+1   s √ √ 2RD(T ) 1/4 1/2−β 1 X Qt 1/2−β 1/4 1/4−β/2 ≤ 2 2RE[Ss ] + 2Rs E[Ss ] + τmix T + E (8) β 2s Vt δ t=1   s √ √ 2RD(T ) 1/4 1/2−β 1 X Qt 1/2−β 1/4 1/4−β/2 + 2Rs E[ST ] + ≤ 2 2RE[ST ] τmix T + E . β 2s Vt δ t=1 Now we can bound E [Qt /Vt ] as follow. Lemma 17 For t ∈ [T + 1],   √ √ 4RD(T ) 1/4 1/2−β Qt E τmix T . ≤ 4 2RE[ST ]1/2−β + 2 2RT 1/4 E[ST ]1/4−β/2 + Vt δβ Proof For t = 1, Q1 = 0 and the inequality of the lemma holds. Assume that it holds for t = 1, . . . , s with s ≤ T . Substituting the inequalities for t = 1, . . . , s into (8), we derive the inequality for t = s + 1, as required. Next, we deduce an upper bound on E [Qt ]. First, we observe that (6) and Jensen’s inequality imply the following. For s ∈ [T ], v u √ s u √ √ τmix X 1/2 1/4 β/2+1/4 t E [Qs+1 ] ≤ 2 2E[Ss ] + 2s E[Ss ] + 2C(T ) E[Qt ] T t=1 √ s √ √ τmix 1 X ≤ 2 2E[Ss ]1/2 + 2s1/4 E[Ss ]β/2+1/4 + sC(T ) + E[Qt ] T 2s (9) t=1 √ ≤ 2 2E[Ss ]1/2 + √ 1/2 2s1/4 E[Ss ]β/2+1/4 + C(T )τmix + s 1 X 2s E[Qt ], t=1 where the first inequality is implied by combining (6) and (7) and applying Lemma 16. Next we provide an adaptive upper bound on E [Qt ] 41 Yeongjong Kim and Dabeen Lee Lemma 18 For any t ∈ [T + 1], √ 1/2 E[Qt ] ≤ 4 2E[St−1 ] √ 5 2 1/2 + (t − 1)1/4 E[St−1 ]β/2+1/4 + 2C(T )τmix . 3 Proof We argue by induction. The inequality trivially holds when t = 1 as Q1 = 0. Suppose that the inequality holds for t = 1, 2, . . . , s. Then √ √ 1/2 E[Qs+1 ] ≤ 2 2E[Ss ]1/2 + 2s1/4 E[Ss ]β/2+1/4 + C(T )τmix  s  1 X √ 10 1/2 1/2 1/4 β/2+1/4 + 4 2E[St−1 ] + (t − 1) E[St−1 ] + 2C(T )τmix 2s 3 t=1 √ √ 1/2 ≤ 2 2E[Ss ]1/2 + 2s1/4 E[Ss ]β/2+1/4 + C(T )τmix ! √ √ 1 4 2 1/2 + 4 2sE[Ss ]1/2 + s5/4 E[Ss ]β/2+1/4 + 2sC(T )τmix 2s 3 √ √ 5 2 1/4 1/2 1/2 = 4 2E[Ss ] + s E[Ss ]β/2+1/4 + 2C(T )τmix , 3 where the first inequality is from (9) and the second inequality is by Corollary 29. 7.2 MLMC Adaptive Drift-Plus-Penalty for Stochastic-Constrained Stochastic Optimization Based on the bounds on Et−1 [gt (x)], E[Qt /Vt ], and E[Qt ], in this section, we prove Lemma 6 and Theorem 8. To better present the proof of Lemma 6, we divide the analysis into two parts, one for the regret and the other for the constraint violation. Then, using Lemma 6, we prove Theorem 8. Proof [Proof of the first part of lemma 6] We first bound " T # T X X E ft (xt ) − ft (x) . t=1 t=1 By Lemma 21, this is less than or equal to " T # " T # " T #   X Vt F 2 X (Ht + Gt R)2 X Qt αT 2 1 t E R +E +E +E gt (x) . 
VT 4αt 2 Vt Vt t=1 t=1 t=1 1−β Here, the first term is equal to RST1−β , and the second term −1 which is less or equal to RST satisfies " T # " T # " T #  h i  h i X Vt F 2 X β−1 X β−1 R t E = E St−1 Ft2 ≤ R St−1 at = O E STβ = O E STβ 4αt 4 t=1 t=1 t=1 where the inequality holds by Corollary 31. The third term satisfies " T # " T # " T #  h i X (Ht + Gt R)2 X R 2 G2 + H 2 X at 1−β t t E ≤E R ≤ E R = O E S . T β β 2Vt St−1 St−1 t=1 t=1 t=1 42 Stochastic-Constrained Stochastic Optimization with Markovian Data h i h i By Jensen’s inequality, we have E ST1−β ≤ E [ST ]1−β and E STβ ≤ E [ST ]β . The fourth term can be bounded as follows. Note that # " T   T X Qt X Qt gt (x) = Et−1 [gt (x)] . E E Vt Vt t=1 t=1 We have two upper bounds on Et−1 [gt (x)], one given in (4) and the other one due to Lemma 16. Let m(T ) be the minimum of the two upper bounds, i.e., n o 1/2 1/4 m(T ) = min C(T )τmix T −1 , D(T )τmix T −1/2 . Then it follows that T X   Qt E Et−1 [gt (x)] Vt t=1   T X √ √ 4RD(T ) 1/4 1/2−β 1/2−β 1/4 1/4−β/2 + 2 2RT E[ST ] + τmix T ≤ m(T ) 4 2RE[ST ] δβ t=1  T   √  4RD(T )2 X √ 1/2 −1 1/2 −β 1/2−β 1/4 1/4−β/2 4 2RE[ST ] C(T )τmix T + τmix T ≤ + 2 2RT E[ST ] δβ t=1   1/2 1/2 1/2 = Õ τmix E[ST ]1/2−β + τmix T 1/4 E[ST ]1/4−β/2 + τmix T 1−β where the first inequality follows from (4) and Lemmas 16 hand 17. Combining the bounds i PT PT on the four terms, we have proved the desired bound on E t=1 ft (xt ) − t=1 ft (x) . Proof [Proof of the second part of Lemma 6] By Lemma 22, we have T X gt (xt ) ≤ QT +1 + t=1 T X Gt t=1 2αt (Vt Ft + Qt Gt ). Moreover, by Lemma 18, we have   1/2 E[QT +1 ] ≤ Õ E[ST ]1/2 + T 1/4 E[ST ]β/2+1/4 + τmix . Next it follows from Corollary 31 that " T # " T β−1 #  h i   X Vt Ft Gt X St−1 at E ≤E = O E STβ = O E [ST ]β . 2αt 2 t=1 t=1 Furthermore, # " # " T    T 2 G2 X G2 X R I +δ 1 ST − I − δ t t E Qt ≤ E max Qt · ≤E + log max Qt 2αt 2St−1 2δ 2 δ t∈[T ] t∈[T ] t=1 t=1 43 Yeongjong Kim and Dabeen Lee where the second inequality is implied by Lemma 30. Here, the last term can be upper bounded using the AM-GM inequality as follows.   I +δ 1 ST − I − δ max Qt + log 2δ 2 δ t∈[T ]   maxt∈[T ] Q2t 1 I +δ 1 ST 2 ≤ + + log (Iτmix )β/2+1/4 T β/2+1/2 . β/2+1/4 β/2+1/2 2 2δ 2 δ 2(Iτmix ) T Let s = argmaxt∈[T ] Qt . Then by (5) and (7), s−1 h h ii X  2 √ β+1/2 +2 E Qs ≤ 4E[Ss−1 ] + 2 sE[Ss−1 ] E Qt Et−1 gt (xt ) + ∇gt (xt )⊤ (x − xt ) √ ≤ 4E[ST ] + 2 T E[ST ]β+1/2 + 2 t=1 s−1 X E[Qt Et−1 [gt (x)]]. t=1 Moreover, s−1 X E[Qt Et−1 [gt (x)]] t=1 1/2 ≤ C(T )τmix T −1 s−1 X E[Qt ] t=1 T X ! √ √ 5 2 1/2 1/4 β/2+1/4 1/2 (t − 1) E[ST ] + 2C(T )τmix 4 2E[ST ] + 3 t=1 ! √ √ 4 2 5/4 1/2 −1 1/2 1/2 β/2+1/4 ≤ C(T )τmix T 4 2T E[ST ] + T E[ST ] + 2C(T )τmix T 3   1/2 1/2 = Õ τmix E[ST ]1/2 + τmix T 1/4 E[ST ]β/2+1/4 + τmix , 1/2 ≤ C(T )τmix T −1 where the first inequality follows from (4), the second inequality follows from Lemma 18, and the third inequality follows from Corollary 29. Thus, we get " T # X E[ST ] + T 1/2 E[ST ]β+1/2 1/2 E gt (xt ) = Õ E[ST ]1/2 + T 1/4 E[ST ]β/2+1/4 + τmix + β/2+1/4 β/2+1/2 τmix T t=1 ! 1/2 τmix + E[ST ]1/2 + E[ST ]β/2+1/4 2 β/2−1/4 β/2+1/2 + + (log ST ) τmix T , β/2+1/4 β/2+1/2 τmix T as required. Proof [Proof of Theorem 8] As f¯ is convex, 1 f¯(x̄T ) − f¯(x# ) ≤ T T  X t=1 44  f¯(xt ) − f¯(x# ) . Stochastic-Constrained Stochastic Optimization with Markovian Data We can decompose the right-hand side as follows f¯(xt ) − f¯(x# ) f¯(xt ) − ft (xt ) ft (xt ) − ft (x# ) ft (x# ) − f¯(x# ) = + + . 
T T T T Here, Jensen’s inequality and Lemma 5 imply that q    1 1  ¯ 1/2 E |f (xt ) − ft (xt )| ≤ E |f¯(xt ) − ft (xt )|2 = Õ(τmix T −2 ), T Tq i   1 h ¯ # 1 1/2 E |f¯(x# ) − ft (x# )|2 = Õ(τmix T −2 ). E |f (x ) − ft (x# )| ≤ T T Moreover, we havePderived Proposition 7 which provides an upper bound on the expectation P of Tt=1 ft (xt ) − Tt=1 ft (x# ). Consequently, h i   E f¯(x̄T ) − f¯(x# ) ≤ Õ (τmix )1−β T −β . For the second part, Jensen’s inequality and Lemma 5 imply that 1 1p 1/2 E [|ḡ(xt ) − gt (xt )|] ≤ E [|ḡ(xt ) − gt (xt )|2 ] = Õ(τmix T −2 ) T T As a result, T E[ḡ(x̄T )] ≤ 1X E[ḡ(xt )] T 1 = T t=1 T X T 1X E[ḡ(xt ) − gt (xt )] + E[gt (xt )] T t=1 t=1  1/2 −2 β/2+1/4 β/2−1/2 3/4−β/2 −β/2−1/2 = Õ τmix T + τmix T + τmix T    β/2+1/4 β/2−1/2 3/4−β/2 −β/2−1/2 , = Õ τmix T + τmix T where the first inequality holds because ḡ is convex and the second equality follows from Proposition 7. Acknowledgments and Disclosure of Funding This research is supported, in part, by KAIST Starting Fund (KAIST-G04220016), FOUR Brain Korea 21 Program (NRF-5199990113928), and National Research Foundation of Korea (NRF-2022M3J6A1063021). 45 Yeongjong Kim and Dabeen Lee Algorithm 3 Adaptive Drift-Plus-Penalty Initialize: Initial iterates x1 ∈ X , Q1 = 0. for t = 1 to T do Observe ft and gt . Set penalty parameter Vt and step size parameter αt such that 0 ≤ Vt−1 ≤ Vt , 0 ≤ αt−1 ≤ αt , and 0 ≤ αt−1 /Vt−1 ≤ αt /Vt . Primal update: Set xt+1 as n o xt+1 = argmin (Vt ∇ft (xt ) + Qt ∇gt (xt ))⊤ x + αt D(x, xt ) x∈X Dual update: Set Qt+1 as h i Qt+1 = Qt + gt (xt ) + ∇gt (xt )⊤ (xt+1 − xt ) + end for Appendix A. Analysis of Adaptive Drift-Plus-Penalty Algorithm 3 is a general template for adaptive variants of the drift-plus-penalty algorithm whose parameters Vt and αt satisfy that {Vt }Tt=1 , {αt }Tt=1 , and {αt /Vt }Tt=1 are non-decreasing sequences of non-negative numbers. In this section, we analyze the general template of DPP given by Algorithm 3, based on which we deduce performance guarantees on Algorithms 1 and 2. Recall that ∆t = (Q2t+1 − Q2t )/2 is the Lyapunov drift term. The following lemma provides a bound on the drift term. Lemma 19 For t ≥ 1,   1 ∆t ≤ Qt gt (xt ) + ∇gt (xt )⊤ (xt+1 − xt ) + (Ht + Gt R)2 . 2  Proof As Qt+1 = max Qt + gt (xt ) + ∇gt (xt )⊤ (xt+1 − xt ), 0 , we have that Q2t+1 ≤ 2 Qt + gt (xt ) + ∇gt (xt )⊤ (xt+1 − xt ) . Hence, it follows that Q2t+1 Q2t − 2 2   1 2 ≤ Qt gt (xt ) + ∇gt (xt )⊤ (xt+1 − xt ) + gt (xt ) + ∇gt (xt )⊤ (xt+1 − xt ) 2   1 ⊤ ≤ Qt gt (xt ) + ∇gt (xt ) (xt+1 − xt ) + (Ht + Gt R)2 2 ∆t = where the last inequality is from Assumption 2. Since our drift-plus-penalty algorithm is a mirror descent version, we need the following lemma, which is obtained by substituting y = xt , x∗ = xt+1 , and f (x) = (Vt ∇ft (xt ) + Qt ∇gt (xt ))⊤ x into (Wei et al., 2020, Lemma 2.1). 46 Stochastic-Constrained Stochastic Optimization with Markovian Data Lemma 20 (Wei et al., 2020, Equation (22)) For any x ∈ X and t ≥ 1, (Vt ∇ft (xt ) + Qt ∇gt (xt ))⊤ (xt+1 − xt ) + αt D(xt+1 , xt ) ≤ (Vt ∇ft (xt ) + Qt ∇gt (xt ))⊤ (x − xt ) + αt D(x, xt ) − αt D(x, xt+1 ). Recall that ft and gt for the known mixing time case in Section 3 correspond to one sample and are assumed to be convex. In contrast, ft and gt for the unknown mixing time setting in Section 4 come from the MLMC estimation scheme with multiple data samples and thus are not necessarily convex. Nevertheless, we use the fact that Et−1 [ft ] and Et−1 [gt ] are convex. Based on this, we deduce the following lemma. 
Lemma 21 Suppose that Et−1 [ft ] = Et−1 [fˆt ] and Et−1 [gt ] = Et−1 [ĝt ] where fˆt and ĝt are convex functions and that Et−1 [∇ft ] = Et−1 [∇fˆt ] and Et−1 [∇gt ] = Et−1 [∇ĝt ]. Then for any x ∈ X, " T # " T #   T X X X Qt αT 2 E ft (xt ) − ft (x) ≤ E gt (x) + E R V VT t=1 t=1 t=1 t # " T # " T X (Ht + Gt R)2 X Vt F 2 1 t + E . +E 4αt 2 Vt t=1 t=1 Proof Dividing both sides of the inequality given in Theorem 20 by Vt and addding ft (xt ) + Qt gt (xt )/Vt to both sides, we get  ⊤ Qt Qt αt ft (xt ) + gt (xt ) + ∇ft (xt ) + ∇gt (xt ) (xt+1 − xt ) + D(xt+1 , xt ) Vt Vt Vt  ⊤ Qt Qt αt αt ≤ ft (xt ) + gt (xt ) + ∇ft (xt ) + ∇gt (xt ) (x − xt ) + D(x, xt ) − D(x, xt+1 ). Vt Vt Vt Vt Here, the left-hand side is bounded below by ft (xt ) + ∇ft (xt )⊤ (xt+1 − xt ) + αt ∆t 1 D(xt+1 , xt ) + − (Ht + Gt R)2 Vt Vt 2Vt by Lemma 19. Moreover, we have αt αt ∇ft (xt )⊤ (xt+1 − xt ) + D(xt+1 , xt ) ≥ −Ft ∥xt+1 − xt ∥ + ∥xt+1 − xt ∥2 Vt Vt   2 Vt Ft αt Vt Ft 2 =− + ∥xt+1 − xt ∥ − 4αt Vt 2αt 2 Vt Ft ≥− 4αt where the first inequality holds by 2-strong convexity of Φ with respect to ∥ · ∥. Hence, we deduce that Vt Ft2 ∆t 1 + − (Ht + Gt R)2 4αt Vt 2Vt  ⊤ Qt αt αt Qt (x − xt ) + D(x, xt ) − D(x, xt+1 ). ≤ ft (xt ) + gt (xt ) + ∇ft (xt ) + ∇gt (xt ) Vt Vt Vt Vt ft (xt ) − 47 Yeongjong Kim and Dabeen Lee For the right-hand side of this inequality, consider T X αt   T X α1 αT αt αt−1 (D(x, xt ) − D(x, xt+1 )) = D(x, xt ) + − − D(x, xt+1 ) D(x, xt ) V V1 Vt Vt−1 VT t=1 t t=2   T α1 2 X 2 αt αt−1 ≤ R + R − V1 Vt Vt−1 t=2 αT 2 = R VT where the inequality holds since {αt /Vt }Tt=1 is a non-decreasing sequence. Furthermore, T X ∆t Vt t=1 T T Q2 1X 1 2 Q2 1X 2 (Qt+1 − Q2t ) = − 1 + T +1 + Qt 2 Vt 2V1 2VT 2 = t=1  t=2 1 Vt−1 − 1 Vt  ≥− Q21 = 0, 2V1 where the inequality holds since the sequence {Vt }Tt=1 is non-negative and non-decreasing. Combining these inequalities, we deduce that T X ft (xt ) − t=1 ≤ T  X ft (xt ) + ∇ft (xt )⊤ (x − xt )  t=1 T X Q  t=1 Vt t (10) T T t=1 t=1 αT 2 X Vt Ft2 1 X (Ht + Gt R)2 gt (xt ) + ∇gt (xt )⊤ (x − xt ) + R + + . VT 4αt 2 Vt  Recall that Et−1 [ft ] = Et−1 [fˆt ] and Et−1 [gt ] = Et−1 [ĝt ] where fˆt and ĝt are convex and that Et−1 [∇ft ] = Et−1 [∇fˆt ] and Et−1 [∇gt ] = Et−1 [∇ĝt ]. Then it follows that Et−1 [ft (xt )] + Et−1 [∇ft (xt )]⊤ (x − xt ) = Et−1 [fˆt (xt )] + Et−1 [∇fˆt (xt )]⊤ (x − xt ) ≤ Et−1 [fˆt (x)] = Et−1 [ft (x)] where the second inequality holds because fˆt is convex and xt is Ft−1 -measurable. Likewise, we deduce that Et−1 [gt (xt )] + Et−1 [∇gt (xt )]⊤ (x − xt ) ≤ Et−1 [gt (x)]. Taking the expectations of both sides of (10), we obtain the inequality of this lemma, as required. Next, we state a lemma that will be useful to provide an upper bound on the constraint violation. Lemma 22 Algorithm 3 achieves ∥xt+1 − xt ∥ ≤ 1 (Vt Ft + Qt Gt ), 2αt T X gt (xt ) ≤ QT +1 + t=1 48 T X Gt t=1 2αt (Vt Ft + Qt Gt ). Stochastic-Constrained Stochastic Optimization with Markovian Data Proof As Qt+1 ≥ Qt + gt (xt ) + ∇gt (xt )⊤ (xt+1 − xt ), it follows that gt (xt ) ≤ Qt+1 − Qt − ∇gt (xt )⊤ (xt+1 − xt ) ≤ Qt+1 − Qt + Gt ∥xt+1 − xt ∥. On the other hand, if we set x = xt for the inequaity of Lemma 20, we get ⊤ αt D(xt+1 , xt ) + αt D(xt , xt+1 ) ≤ Vt ∇ft (xt ) + Qt ∇gt (xt ) (xt − xt+1 ). Here, the left-hand side is greater or equal to 2αt ∥xt+1 − xt ∥2 while the right-hand side is less or equal to (Vt Ft + Qt Gt )∥xt+1 − xt ∥. Therefore, it follows that ∥xt+1 − xt ∥ ≤ 1 (Vt Ft + Qt Gt ), 2αt (11) which implies T X t=1 gt (xt ) ≤ QT +1 + T X Gt t=1 2αt (Vt Ft + Qt Gt ), as required. 
To bound the constraint violation, we still need to bound the virtual queue size QT +1 . We also need the following lemma. Lemma 23 For any x ∈ X ,   ∆t ≤ Ht2 + R2 G2t + Qt gt (xt ) + ∇gt (xt )⊤ (x − xt ) + Vt RFt + αt (D(x, xt ) − D(x, xt+1 ). Proof By Lemma 19,   1 ∆t ≤ Qt gt (xt ) + ∇gt (xt )⊤ (xt+1 − xt ) + (Ht + RGt )2   2 ⊤ ≤ Qt gt (xt ) + ∇gt (xt ) (xt+1 − xt ) + Ht2 + R2 G2t where the last inequality comes from the fact that (A + B)2 ≤ 2(A2 + B 2 ). Here, using Lemma 20, the right-hand side can be further bounded above as follows. Qt (gt (xt ) + ∇gt (xt )⊤ (xt+1 − xt )) + Ht2 + R2 G2t  ≤ Qt gt (xt ) + ∇gt (xt )⊤ (x − xt ) + Vt ∇ft (xt )⊤ (x − xt ) − Vt ∇ft (xt )⊤ (xt+1 − xt )  − αt D(xt+1 , xt ) + αt D(x, xt ) − D(x, xt+1 ) + Ht2 + R2 G2t . Moreover, it follows from the Cauchy-Schwarz inequality that Vt ∇ft (xt )⊤ (x − xt ) − Vt ∇ft (xt )⊤ (xt+1 − xt ) = Vt ∇ft (xt )⊤ (x − xt+1 ) ≤ Vt RFt by Cauchy-Schwarz inequality. Then we have proved the lemma, as desired. Based on Lemma 23, we may provide the following bound on the virtual queue size. 49 Yeongjong Kim and Dabeen Lee Lemma 24 For any x ∈ X , s ∈ [T ], v u s s X u X  Qs+1 ≤ t2 Ht2 + R2 G2t + Vt RFt + 2 Qt (gt (xt ) + ∇gt (xt )⊤ (x − xt )) + 2R2 αs . t=1 t=1 Proof From Lemma 23, we get s Q2s+1 X = ∆t 2 t=1 ≤ s X    Ht2 + R2 G2t + Qt gt (xt ) + ∇gt (xt )⊤ (x − xt ) + Vt RFt t=1 + α1 D(x, x1 ) + s X D(x, xt )(αt − αt−1 ) − αs D(x, xt+1 ). t=2 Since αt is non-decreasing and D(x, xt ) ≤ R2 , it follows that α1 D(x, x1 ) + s X D(x, xt )(αt − αt−1 ) − αs D(x, xt+1 ) ≤ R2 αs . t=2 This implies the desired bound on Qs+1 . Appendix B. Sum of Sequences In this section, we consider some series of numbers and provide bounds on their partial sums ∞ to make Psour paper self-contained. Given a sequence {xt }t=1 of numbers, we use notation Xs := t=1 xt to denote its partial sums. Lemma 25 If f : R+ → R+ is continuous and non-increasing, then T X Z XT f (Xt )xt ≤ x1 f (X1 ) + Z XT f (x)dx ≤ X1 t=1 f (x)dx 0 for any nonnegative x1 , . . . , xT . Proof By considering the area between f (x) and the x-axis over the interval [Xt−1 , Xt ] of R Xt length xt , we have f (Xt )xt ≤ Xt−1 f (x)dx since f is non-increasing. Then it follows that T X f (Xt )xt = x1 f (X1 ) + t=1 T X Z XT f (Xt )xt ≤ x1 f (X1 ) + f (x)dx ≤ X1 t=2 Z XT f (x)dx, 0 as required. As a consequence of Lemma 25, we deduce the following list of bounds on series. 50 Stochastic-Constrained Stochastic Optimization with Markovian Data Corollary 26 (Auer et al., 2002) T X p x √ t ≤ 2 XT . Xt t=1 Corollary 27 T X xt t=1 Xt ≤ 1 + logX1 (XT ). Next we consider the following. Lemma 28 If f : R+ → R+ is continuous and non-decreasing, then T X Z XT +1 f (Xt )xt+1 ≤ f (x)dx X1 t=1 for any nonnegative x1 , . . . , xT . Proof By considering the area between f (x) and the x-axis over the interval [Xt , Xt+1 ] of RX length xt+1 , we have f (Xt )xt+1 ≤ Xtt+1 f (x)dx since f is non-decreasing. Then it follows that T X f (Xt )xt+1 = x2 f (X1 ) + t=1 T X Z XT +1 f (Xt )xt+1 ≤ x2 f (X1 ) + Z XT +1 f (x)dx ≤ X2 t=2 f (x)dx, X1 as required. Lemma 28 implies the following. Corollary 29 For q > 0, T X t=1 tq ≤  1 (T + 1)q+1 − 1 . q+1 Moreover, when each xt is bounded by some fixed constant, we can deduce the following result. For ease of P notation, we start a sequence with x0 and, with abuse of notation, define partial sum Xs = st=0 xt with x0 = X0 = δ. Lemma 30 If 0 ≤ xt ≤ C for t = 0, . . . 
, T for some fixed constant C and f : R++ → R+ is continuous and non-increasing, then T X Z max{δ,XT −C} f (Xt−1 )xt ≤ Cf (δ) + f (x)dx. δ t=1 Proof If we consider the area between f (x) and the x-axis over the interval [Xt−1 , Xt ], we R Xt have f (Xt−1 )xt ≥ Xt−1 f (x)dx in which the inequality direction is the opposite of what we 51 Yeongjong Kim and Dabeen Lee want. However, since xt ≤ C, we can use the idea of translation by C in the x-axis direction in the following way. Let ( f (δ), x ∈ (−∞, δ], f˜(x) = f (x), x ∈ (δ, ∞). Then the graph of the translation f˜(x − C) is above the squares of height f (Xt−1 ) on the interval [Xt−1 , Xt ]. Thus, ( Z XT T X (XT − δ)f (δ) ≤ Cf (δ), if XT ≤ δ + C, ˜ R XT −C f (Xt−1 )xt ≤ f (x − C)dx ≤ Cf (δ) + δ f (x)dx, if XT > δ + C X0 t=1 which implies the desired statement of this lemma. As a corollary of Lemma 30 with f (x) = x−γ , we deduce the following inequality. Corollary 31 If xt ≤ C, 0 < γ ̸= 1, then T X −γ Xt−1 xt ≤ Cδ −γ + t=1  1 max{δ, XT − C}1−γ − δ 1−γ . 1−γ Appendix C. Online Convex Optimization with Adversarial Losses and Constraints In this section, we show the performance of Algorithm 2, which is an AdaGrad-style variant of the drift-plus-penalty algorithm for online convex optimization with adversarial loss and constraint functions. To deal with adversarial constraint functions, we set the parameters Vt and αt differently as follows. F2 at := t + R2 G2t + Ht2 , 4 St := t X as s=1 and then set parameters as Stβ St , αt = 2 , (12) R R for some 0 < β ≤ 1/2. One distinction from (4) is the presence of additional parameter δ, and another difference is that Vt and αt are defined with St , not St−1 . We now assume that the convex constraint functions g1 , . . . , gT as well as the convex loss functions f1 , . . . , fT are chosen adversarially. Following Neely and Yu (2017), we set the benchmark x◦ as an optimal solution to Vt = min T X ft (x) subject to gt (x) ≤ 0 for t = 1, . . . , T . t=1 Then the goal is to obtain upper bounds on Regret(T ) = T X t=1 ft (xt ) − T X ◦ ft (x ), t=1 Violation(T ) = T X t=1 52 gt (xt ) Stochastic-Constrained Stochastic Optimization with Markovian Data in sublinear orders of T by properly choosing our inputs xt . For this, we need the following theorem. Theorem 32 Algorithm 2 with Vt and α set as in (12) guarantees that     1/2 β/2+1/4 Regret(T ) = O ST1−β , Violation(T ) = O ST + T 1/4 ST . Proof Applying Lemma 21 with fˆt = ft , ĝt = gt , and x = x◦ , we obtain T T t=1 t=1 αT 2 X Vt Ft2 1 X (Ht + Gt R)2 Regret(T ) ≤ R + + . VT 4αt 2 Vt The first term of the right hand side is RST1−β = O(ST1−β ), and the second term satisfies T X Vt F 2 t t=1 4αt = T T t=1 t=1   X β−1 R R X β−1 2 St at ≤ STβ = O STβ , S t Ft ≤ R 4 β where the last inequality follows from Lemma 25. The third term satisfies T X (Ht + Gt R)2 t=1 2Vt ≤R T X R 2 G2 + H 2 t t=1 t Stβ ≤R T X at ≤ β t=1 St   R ST1−β = O ST1−β . 1−β Combining these two inequalities, we get   Regret(T ) = O ST1−β . Next, we prove the second part of the theorem. By Lemma 22, we have T X gt (xt ) ≤ QT +1 + t=1 T X Gt t=1 2αt (Vt Ft + Qt Gt ). If we apply Lemma 24 with x = x◦ , we obtain v u T u X  QT +1 ≤ t2 Ht2 + R2 G2t + Vt RFt + 2R2 αT t=1 v v u T u T u X u X p ≤ t2 RVt Ft + t2 (R2 G2t + Ht2 ) + 2R2 αT . | {z } t=1 t=1 (c) | {z } | {z } (a) (b) ◦ where the√first inequality follows from the convexity of √ gt and gt (x ) ≤ 0. Here, term (c) is equal to 2ST , and term (b) is less than or equal to 2ST . 
For term (a), we have v u T T T u X X X √ β+1/2 β β β (a)2 = 2 St Ft ≤ 2ST Ft ≤ 2ST tT Ft2 ≤ 4 T ST , t=1 t=1 t=1 53 Yeongjong Kim and Dabeen Lee where the second inequality holds by the power mean inequality. Thus,   √ 1/2 β/2+1/4 1/2 β/2+1/4 QT +1 ≤ (a) + (b) + (c) ≤ 2 2ST + 2T 1/4 ST = O ST + T 1/4 ST . (13) We also have that T X Vt Ft Gt t=1 2αt = T X Stβ−1 RFt Gt /2 t=1 ≤ ≤ T X t=1 T X Stβ−1 (Ft2 /4 + R2 G2t )/2 Stβ−1 at /2 ≤ t=1 STβ 2β   = O STβ , where the last inequality follows from Lemma 25. Lastly, T X G2 T T X √ G2t 1/2 X G2t 1/4 β/2+1/4 Qt ≤ 2 St + T St 2αt αt αt t t=1 t=1 ≤ √ 2 t=1 T X at + T 1/4 1/2 T X at 3/4−β/2 t=1 St t=1 St p T 1/4 β/2+1/4 ≤ 2 2ST + S β/2 + 1/4 T   1/2 β/2+1/4 = O ST + T 1/4 ST , where the first inequality follows from (13) and the last inequality follows from Lemma 25. Combining the results, we get   1/2 β/2+1/4 Violation(T ) = O ST + T 1/4 ST , as required. Appendix D. Proof of the Time-Varying Drift Lemma In this section, we prove Lemma 14 for the case of time-varying parameter θt0 (t). We closely follow the proof of (Yu et al., 2017, Lemma 5). 2 ) and ρ = 1 − ζ 2 /(8δ 2 ) = 1 − rt ζ/2. Then Lemma 33 Let r = ζ/(4t0 δmax 0 max h i ert0 δmax E erZ(t) ≤ erθ(t) 1−ρ for all t ≥ 0. 54 Stochastic-Constrained Stochastic Optimization with Markovian Data Proof Since 0 < ζ < δmax , we have 0 < ρ < 1 < erδmax . Define η(t) = Z(t + t0 ) − Z(t). Note that |η(t)| ≤ t0 δmax for all t ≥ 0 which implies that |rη(t)| ≤ ζ/(4δmax ) ≤ 1. Then,     1 rZ(t+t0 ) rZ(t) rη(t) rZ(t) 2 2 2 rZ(t) 1 + rη(t) + rt0 ζ e =e e ≤e 1 + rη(t) + 2r t0 δmax = e (14) 2 where the inequality follows from the fact that ex ≤ 1 + x + 2x2 for |x| ≤ 1, |rη(t)| ≤ 1, and 2 ). |η(t)| ≤ t0 δmax while the equality follows by substituting r = ζ/(4t0 δmax Next, we consider the cases Z(t) ≥ θ(t) and Z(t) < θ(t), separately. First, consider the case Z(t) ≥ θ(t). Taking the conditional expectation of each side of (14) gives us the following.   h i 1 rZ(t+t0 ) rZ(t) E e | Z(t) ≤ E e (1 + rη(t) + rt0 ζ) | Z(t) 2   1 ≤ erZ(t) 1 − rt0 ζ + rt0 ζ 2   rt0 ζ = erZ(t) 1 − 2 = ρerZ(t) where the inequality follows from the fact that E[Z(t + t0 ) − Z(t)|F(t)] ≤ −t0 ζ when Z(t) ≥ θ(t) while the second equality follows from the fact that ρ = 1 − rt0 ζ/2. Likewise, for the case Z(t) < θ(t), we deduce that h i h i h i E erZ(t+t0 ) | Z(t) = E erZ(t) erη(t) | Z(t) = erZ(t) E erη(t) | Z(t) ≤ ert0 δmax erZ(t) , where the inequality follows from the fact that η(t) ≤ t0 δmax . Putting the two cases together, we deduce that h i E erZ(t+t0 ) h i h i = P(Z(t) ≥ θ(t))E erZ(t+t0 ) | Z(t) ≥ θ(t) + P(Z(t) < θ(t))E erZ(t+t0 ) | Z(t) < θ(t) h i h i ≤ ρE erZ(t) | Z(t) ≥ θ(t) P(Z(t) ≥ θ(t)) + ert0 δmax E erZ(t) | Z(t) < θ(t) P(Z(t) < θ(t)) h i   h i = ρE erZ(t) + ert0 δmax − ρ E erZ(t) | Z(t) < θ(t) P(Z(t) < θ(t)) h i   ≤ ρE erZ(t) + ert0 δmax − ρ erθ(t) h i ≤ ρE erZ(t) + ert0 δmax erθ(t) where the first inequality follows from the analysis of the two separate cases and the second inequality follows from the fact that ert0 δmax > ρ. Then we argue by induction to prove the statement of this lemma. We first consider the base case t ∈ {0, 1, . . . , t0 }. Since Z(t) ≤ tδmax for all t ≥ 0, it follows that E[erZ(t) ] ≤ ertδmax ≤ ert0 δmax ≤ 55 ert0 δmax rθ(t) e 1−ρ Yeongjong Kim and Dabeen Lee for all t ∈ {1, . . . , t0 }, where the last inequality follows because erθ(t) /(1 − ρ) ≥ 1. Next we assume that the inequality holds for all t ∈ {0, 1, . . . 
, s} with some s ≥ t0 and consider iteration t = s + 1.Note that h i h i E erZ(s+1) ≤ ρE erZ(s+1−t0 ) + ert0 δmax erθ(s+1−t0 ) ≤ρ ert0 δmax rθ(s+1−t0 ) e + ert0 δmax erθ(s+1−t0 ) 1−ρ ert0 δmax rθ(s+1−t0 ) e 1−ρ ert0 δmax rθ(s+1) e ≤ 1−ρ ≤ where the second inequality comes from the induction hypothesis by noting that 0 ≤ τ + 1 − t0 ≤ τ while the last inequality follows from the fact that θ(t) is non-decreasing. Based on this lemma, we prove Lemma 14. Proof [Proof of Lemma 14] Note that erx is convex in x when r > 0. By Jensen’s inequality, erE[Z(t)] ≤ E[erZ(t) ] ≤ er(θ(t)+t0 δmax ) 1−ρ where the inequality is implied by Lemma 33. Taking logarithm on both sides and dividing by r yields that E[Z(t)] ≤ θ(t) + t0 δmax + 2  1   8δmax  1 4δ 2 log = θ(t) + t0 δmax + t0 max log , 2 r 1−ρ ζ ζ where the equality holds because recalling that r = 4t0 δζ2 2 max and ρ = 1 − 8δζ2 max . Appendix E. Properties of the MLMC Estimator In this section, we prove Lemmas 4 and 5. Lemma 34 (Dorfman and Levy, 2022, Lemma A.6) Let h : X × S → Rk for some k ≥ 1. Suppose that there exists some constant L > 0 such that ∥h(x, ξ)∥ ≤ L for every (x, ξ) ∈ X × S, where the norm ∥ · ∥ satisfies ∥ · ∥ ≤ η∥ · ∥2 for some η > 0. We denote by N h̄(x) := Eξ∼µ [h(x, ξ)], hN t (x) := 1 X (i) h(x, ξt ). N i=1 56 Stochastic-Constrained Stochastic Optimization with Markovian Data Suppose that x is Ft−1 measurable and N ≤ A for some A ∈ N. If 2τmix ⌈2 log A⌉ ≤ N , then r  p τmix ⌈2 log A⌉  1 + log(τmix ⌈2 log A⌉N ) N 6Lητmix ⌈2 log A⌉ 4Lη + + , N N   576L2 η 2 τmix ⌈2 log A⌉ 2 Et−1 ∥hN ≤ (1 + log(τmix ⌈2 log A⌉N )) t (x) − h̄(x)∥ N 2 ⌈2 log A⌉2 72L2 η 2 τmix 8L2 η 2 + + . N2 N Et−1 ∥hN t (x) − h̄(x)∥ ≤ 12Lη   We point out that (Dorfman and Levy, 2022, Lemma A.6) was originally stated for the ℓ2 norm, but the statement holds for any norm over a finite-dimensional vector space assuming that ∥ · ∥ ≤ η∥ · ∥2 for some fixed constant η > 0 and O hides the dependence on η. In the following lemma, we simplify the upper bounds of Lemma 34. Lemma 35 Let h : X × S → Rk for some k ≥ 1. Suppose that there exists some constant L > 0 such that ∥h(x, ξ)∥ ≤ L for every (x, ξ) ∈ X × S, where the norm ∥ · ∥ satisfies ∥ · ∥ ≤ η∥ · ∥2 for some η > 0. We denote by N h̄(x) := Eξ∼µ [h(x, ξ)], hN t (x) := 1 X (i) h(x, ξt ). N i=1 Suppose that x is Ft−1 measurable and N ≤ A for some A ∈ N. Then r  p τmix ⌈2 log A⌉  2 + log(τmix ⌈2 log A⌉N ) , N 2 max(1, η)2 τ  N  576L ⌈2 log A⌉ mix Et−1 ∥ht (x) − h̄(x)∥2 ≤ (2 + log(τmix ⌈2 log A⌉N )). N   Et−1 ∥hN t (x) − h̄(x)∥ ≤ 12L max(1, η) Proof First, we consider the case where 2τmix ⌈2 log A⌉ > N . Then by the triangle inequality, r  N  2τmix ⌈2 log A⌉ Et−1 ∥ht (x) − h̄(x)∥ ≤ 2L < 2L rN  p τmix ⌈2 log A⌉  ≤ 12L max(1, η) 2 + log(τmix ⌈2 log A⌉N ) , N  N  2τmix ⌈2 log A⌉ Et−1 ∥ht (x) − h̄(x)∥2 ≤ 4L2 < 4L2 N 576L2 max(1, η)2 τmix ⌈2 log A⌉ ≤ (2 + log(τmix ⌈2 log A⌉N )). N 57 Yeongjong Kim and Dabeen Lee Next, we consider the case where 2τmix ⌈2 log A⌉ ≤ N . By Lemma 34, r  p  N  τmix ⌈2 log A⌉  Et−1 ∥ht (x) − h̄(x)∥ ≤ 12Lη 1 + log(τmix ⌈2 log A⌉N ) N 6Lητmix ⌈2 log A⌉ 4Lη + + , N N   576L2 η 2 τmix ⌈2 log A⌉ 2 (1 + log(τmix ⌈2 log A⌉N )) Et−1 ∥hN ≤ t (x) − h̄(x)∥ N 2 ⌈2 log A⌉2 72L2 η 2 τmix 8L2 η 2 + + . N2 N Since τmix ⌈2 log A⌉/N ≤ 21 , we deduce that 6Lητmix ⌈2 log A⌉ 4Lη 12Lητmix ⌈2 log A⌉ + ≤ ≤ 12Lη N N N r τmix ⌈2 log A⌉ N and that 2 ⌈2 log A⌉2 72L2 η 2 τmix 8L2 η 2 72L2 η 2 τmix ⌈2 log A⌉ 8L2 η 2 τmix ⌈2 log A⌉ + ≤ + N2 N N N 576L2 η 2 τmix ⌈2 log A⌉ ≤ . 
N As a result, we obtain r  p τmix ⌈2 log A⌉  2 + log(τmix ⌈2 log A⌉N ) Nr  p τmix ⌈2 log A⌉  2 + log(τmix ⌈2 log A⌉N ) , ≤ 12L max(1, η) N 2 2  N  576L η τmix ⌈2 log A⌉ (2 + log(τmix ⌈2 log A⌉N )) Et−1 ∥ht (x) − h̄(x)∥2 ≤ N 576L2 max(1, η)2 τmix ⌈2 log A⌉ ≤ (2 + log(τmix ⌈2 log A⌉N )), N   Et−1 ∥hN t (x) − h̄(x)∥ ≤ 12Lη as required. Proof [Proof of Lemma 4] We first argue that h j i max Et−1 [ft (x)] = Et−1 ft2 (x) for any x and t. Note that max h j i h j i  1  jX j−1 max Et−1 [ft ] = Et−1 ft + P (Jt = j) 2j Et−1 ft2 − ft2 = Et−1 ft2 j=1 as P (Jt = j) = 1/2j Similarly, we can show that h j i h i jmax max , Et−1 [gt ] = Et−1 gt2 , Et−1 [∇ft ] = Et−1 ∇ft2 58 h i jmax Et−1 [∇gt ] = Et−1 ∇gt2 Stochastic-Constrained Stochastic Optimization with Markovian Data holds for any x and t. For the second part, we have i h i h 2 + 2H 2 E |gt |2 ≤ 2E gt − gt1 since ∥x + y∥2 ≤ (∥x∥ + ∥y∥)2 ≤ 2∥x∥2 + 2∥y∥2 and gt1 ≤ H. Note that   jX   max max i jX h j j−1 2 j j−1 2 2 = P(Jt = j)22j E gt2 − gt2 = 2j E gt2 − gt2 E gt − gt1 j=1 j=1 because P (Jt = j) = 1/2j . Here we can bound the right-most side based on the following.       τ  2 2 2 mix 2j 2j−1 2j 2j−1 E gt − gt ≤ 2E gt − ḡ + 2E gt − ḡ = Õ 2j where the last inequality follows from Lemma 35. Then it follows that i h 2 ≤ Õ(jmax τmix ) = Õ(τmix ) E gt − gt1 where the last equality holds because jmax = O(log T ). For the last part, E[Nt ] = 1 + jX max P(Jt = j)(2j − 1) ≤ 1 + jmax ≤ 1 + 2 log2 T, j=1 as required. Proof [Proof of Lemma 5] By Assumption 2, we can apply Lemma 35 to gt , ∇gt , ∇ft . Assumptions 1, 2 implies that |ft (xt )| ≤ J for some J > 0. Thus, we can apply Lemma 35 to ft as well. Let L = max(F, G, H, J), A = T 2 , and ∥ · ∥∗ ≤ η∥ · ∥2 for some η > 0. By Lemma 35, the statement follows for q C(T ) = 576L2 max(1, η)2 ⌈4 log T ⌉(2 + log(τmix ⌈4 log T ⌉2jmax )). Since 2jmax = O(T 2 ), the order of C(T ) follows. References Z. Akhtar, A. S. Bedi, and K. Rajawat. Conservative stochastic optimization with expectation constraints. IEEE Transactions on Signal Processing, 69:3190–3205, 2021. doi: 10.1109/ TSP.2021.3082467. C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan. An introduction to mcmc for machine learning. Machine Learning, 50(1):5–43, 2003. doi: 10.1023/A:1020281327116. URL https://doi.org/10.1023/A:1020281327116. 59 Yeongjong Kim and Dabeen Lee P. Auer, N. Cesa-Bianchi, and C. Gentile. Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences, 64(1):48–75, 2002. ISSN 0022-0000. doi: https://doi.org/10.1006/jcss.2001.1795. URL https://www.sciencedirect.com/ science/article/pii/S0022000001917957. G. Ayache, V. Dassari, and S. E. Rouayheb. Walk for learning: A random walk approach for federated learning from heterogeneous data. IEEE J.Sel. A. Commun., 41(4):929–940, apr 2023. ISSN 0733-8716. doi: 10.1109/JSAC.2023.3244250. URL https://doi.org/10. 1109/JSAC.2023.3244250. A. Benveniste, P. Priouret, and M. Métivier. Adaptive Algorithms and Stochastic Approximations. Springer-Verlag, Berlin, Heidelberg, 1990. ISBN 0387528946. D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1st edition, 1996. ISBN 1886529108. J. H. Blanchet and P. W. Glynn. Unbiased monte carlo for optimization and functions of expectations via multi-level randomization. In 2015 Winter Simulation Conference (WSC), pages 3656–3667, 2015. doi: 10.1109/WSC.2015.7408524. V. S. Borkar. Stochastic Approximation A Dynamical Systems Viewpoint. Texts and Readings in Mathematics ; 48. 
Hindustan Book Agency, Gurgaon, 1st ed. 2008. edition, 2008. ISBN 93-86279-38-X. L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018. doi: 10.1137/16M1080173. URL https: //doi.org/10.1137/16M1080173. J. Cao, R. Jiang, N. Abolfazli, E. Y. Hamedani, and A. Mokhtari. Projection-free methods for stochastic simple bilevel optimization with convex lower-level problem, 2023. L. E. Celis, D. Straszak, and N. K. Vishnoi. Ranking with Fairness Constraints. In I. Chatzigiannakis, C. Kaklamanis, D. Marx, and D. Sannella, editors, 45th International Colloquium on Automata, Languages, and Programming (ICALP 2018), volume 107 of Leibniz International Proceedings in Informatics (LIPIcs), pages 28:1–28:15, Dagstuhl, Germany, 2018. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik. ISBN 978-3-95977076-7. doi: 10.4230/LIPIcs.ICALP.2018.28. URL http://drops.dagstuhl.de/opus/ volltexte/2018/9032. L. E. Celis, L. Huang, V. Keswani, and N. K. Vishnoi. Classification with fairness constraints: A meta-algorithm with provable guarantees. In Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* ’19, page 319–328, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450361255. doi: 10.1145/3287560. 3287586. URL https://doi.org/10.1145/3287560.3287586. O. Chapelle, B. Schlkopf, and A. Zien. Semi-Supervised Learning. The MIT Press, 1st edition, 2010. ISBN 0262514125. 60 Stochastic-Constrained Stochastic Optimization with Markovian Data D. Dentcheva and A. Ruszczynski. Optimization with stochastic dominance constraints. SIAM Journal on Optimization, 14(2):548–566, 2003. doi: 10.1137/S1052623402420528. URL https://doi.org/10.1137/S1052623402420528. T. T. Doan. Finite-time analysis of markov gradient descent. IEEE Transactions on Automatic Control, 68(4):2140–2153, 2023. doi: 10.1109/TAC.2022.3172593. T. T. Doan, L. M. Nguyen, N. H. Pham, and J. Romberg. Convergence rates of accelerated markov gradient descent with applications in reinforcement learning, 2020. M. Donini, L. Oneto, S. Ben-David, J. Shawe-Taylor, and M. Pontil. Empirical risk minimization under fairness constraints. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, page 2796–2806, Red Hook, NY, USA, 2018. Curran Associates Inc. R. Dorfman and K. Y. Levy. Adapting to mixing time in stochastic optimization with Markovian data. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 5429–5446. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/dorfman22a.html. J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(61):2121–2159, 2011. URL http://jmlr.org/papers/v12/duchi11a.html. J. C. Duchi, A. Agarwal, M. Johansson, and M. I. Jordan. Ergodic mirror descent. SIAM Journal on Optimization, 22(4):1549–1578, 2012. doi: 10.1137/110836043. URL https: //doi.org/10.1137/110836043. Y. Ermoliev. stochastic quasigradient methods and their application to system optimization †. Stochastics, 9(1-2):1–36, 1983. doi: 10.1080/17442508308833246. M. Even. Stochastic gradient descent under Markovian sampling schemes. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. 
Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 9412–9439. PMLR, 23–29 Jul 2023. URL https: //proceedings.mlr.press/v202/even23a.html. J. Garcı́a, Fern, and o Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(42):1437–1480, 2015. URL http://jmlr.org/ papers/v16/garcia15a.html. M. B. Giles. Multilevel monte carlo methods. Acta Numerica, 24:259–328, 2015. doi: 10.1017/S096249291500001X. H. Guo, X. Liu, H. Wei, and L. Ying. Online convex optimization with hard constraints: Towards the best of two worlds and beyond. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=rwdpFgfVpvN. 61 Yeongjong Kim and Dabeen Lee H. Hendrikx. A principled framework for the design and analysis of token algorithms. In F. Ruiz, J. Dy, and J.-W. van de Meent, editors, Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pages 470–489. PMLR, 25–27 Apr 2023. URL https://proceedings. mlr.press/v206/hendrikx23a.html. A. Jalilzadeh, F. Yousefian, and M. Ebrahimi. Stochastic approximation for estimating the price of stability in stochastic nash games, 2023. R. Jenatton, J. Huang, and C. Archambeau. Adaptive algorithms for online convex optimization with long-term constraints. In M. F. Balcan and K. Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 402–411, New York, New York, USA, 20–22 Jun 2016. PMLR. URL https://proceedings.mlr.press/v48/jenatton16.html. B. Johansson, M. Rabi, and M. Johansson. A simple peer-to-peer algorithm for distributed optimization in sensor networks. In 2007 46th IEEE Conference on Decision and Control, pages 4705–4710, 2007. doi: 10.1109/CDC.2007.4434888. B. Johansson, A. Speranzon, M. Johansson, and K. H. Johansson. On decentralized negotiation of optimal consensus. Automatica, 44(4):1175–1179, 2008. ISSN 0005-1098. doi: https://doi.org/10.1016/j.automatica.2007.09.003. URL https://www.sciencedirect. com/science/article/pii/S0005109807003962. B. Johansson, M. Rabi, and M. Johansson. A randomized incremental subgradient method for distributed optimization in networked systems. SIAM Journal on Optimization, 20(3): 1157–1170, 2010. doi: 10.1137/08073038X. URL https://doi.org/10.1137/08073038X. A. Juditsky, A. Nemirovski, and C. Tauvel. Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems, 1(1):17–58, 2011. doi: 10.1287/10-SSY011. URL https://doi.org/10.1287/10-SSY011. S. Kowshik, D. Nagaraj, P. Jain, and P. Netrapalli. Streaming linear system identification with reverse experience replay. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 30140–30152. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/ paper_files/paper/2021/file/fd2c5e4680d9a01dba3aada5ece22270-Paper.pdf. G. Lan and Z. Zhou. Algorithms for stochastic optimization with function or expectation constraints. Computational Optimization and Applications, 76(2):461–498, 2020. doi: 10.1007/s10589-020-00179-x. URL https://doi.org/10.1007/s10589-020-00179-x. D. Lee, N. Ho-Nguyen, and D. Lee. 
D. Levin and Y. Peres. Markov Chains and Mixing Times. American Mathematical Society, 2017. ISBN 9781470429621. URL https://books.google.com/books?id=f208DwAAQBAJ.
K. Levy. Online to offline conversions, universality and adaptive minibatch sizes. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/ce5140df15d046a66883807d18d0264b-Paper.pdf.
Q. Lin, S. Nadarajah, N. Soheili, and T. Yang. A data efficient and feasible level set method for stochastic convex optimization with expectation constraints. Journal of Machine Learning Research, 21(143):1–45, 2020. URL http://jmlr.org/papers/v21/19-1022.html.
C. G. Lopes and A. H. Sayed. Incremental adaptive strategies over distributed networks. IEEE Transactions on Signal Processing, 55(8):4064–4077, 2007. doi: 10.1109/TSP.2007.896034.
M. Mahdavi, R. Jin, and T. Yang. Trading regret for efficiency: Online convex optimization with long term constraints. Journal of Machine Learning Research, 13(81):2503–2528, 2012. URL http://jmlr.org/papers/v13/mahdavi12a.html.
X. Mao, K. Yuan, Y. Hu, Y. Gu, A. H. Sayed, and W. Yin. Walkman: A communication-efficient random-walk algorithm for decentralized optimization. IEEE Transactions on Signal Processing, 68:2513–2528, 2020. doi: 10.1109/TSP.2020.2983167.
D. Nagaraj, X. Wu, G. Bresler, P. Jain, and P. Netrapalli. Least squares regression with Markovian data: Fundamental limits and algorithms. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 16666–16676. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/c22abfa379f38b5b0411bc11fa9bf92f-Paper.pdf.
M. J. Neely and H. Yu. Online convex optimization with time-varying constraints, 2017. URL https://arxiv.org/abs/1702.04783.
A. Nemirovski and A. Shapiro. Convex approximations of chance constrained programs. SIAM Journal on Optimization, 17(4):969–996, 2007. doi: 10.1137/050622328. URL https://doi.org/10.1137/050622328.
A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009. doi: 10.1137/070704277. URL https://doi.org/10.1137/070704277.
G. Pflug. Optimization of Stochastic Models: The Interface Between Simulation and Optimization. Kluwer Academic Publishers, 1996.
B. Poljak and J. Tsypkin. Robust identification. Automatica, 16(1):53–63, 1980. ISSN 0005-1098. doi: 10.1016/0005-1098(80)90086-2. URL https://www.sciencedirect.com/science/article/pii/0005109880900862.
B. Polyak. A general method of solving extremum problems. Doklady Akademii Nauk SSSR, 174(1):33–36, 1967.
M. Rabbat and R. Nowak. Distributed optimization in sensor networks. In Third International Symposium on Information Processing in Sensor Networks, 2004. IPSN 2004, pages 20–27, 2004. doi: 10.1145/984622.984626.
S. S. Ram, A. Nedić, and V. V. Veeravalli. Incremental stochastic subgradient algorithms for convex optimization. SIAM Journal on Optimization, 20(2):691–717, 2009a. doi: 10.1137/080726380. URL https://doi.org/10.1137/080726380.
P. Rigollet and X. Tong. Neyman-Pearson classification, convexity and stochastic constraints. Journal of Machine Learning Research, 12(86):2831–2855, 2011. URL http://jmlr.org/papers/v12/rigollet11a.html.
H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951. doi: 10.1214/aoms/1177729586. URL https://doi.org/10.1214/aoms/1177729586.
R. Rockafellar and S. Uryasev. Optimization of conditional value-at-risk. Journal of Risk, 2:21–41, 2000.
A. Roy, K. Balasubramanian, and S. Ghadimi. Constrained stochastic nonconvex optimization with state-dependent Markov data. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 23256–23270. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/93b11b5128ced940120f41ce9b216f39-Paper-Conference.pdf.
A. Ruszczyński and W. Syski. A method of aggregate stochastic subgradients with on-line stepsize rules for convex stochastic programming problems, pages 113–131. Springer Berlin Heidelberg, Berlin, Heidelberg, 1986. ISBN 978-3-642-00927-3. doi: 10.1007/BFb0121128. URL https://doi.org/10.1007/BFb0121128.
T. Sarkar and A. Rakhlin. Near optimal finite time identification of arbitrary linear dynamical systems. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5610–5618. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.press/v97/sarkar19a.html.
C. Scott and R. Nowak. A Neyman-Pearson approach to statistical learning. IEEE Transactions on Information Theory, 51(11):3806–3819, 2005. doi: 10.1109/TIT.2005.856955.
T. Sun, Y. Sun, and W. Yin. On Markov chain gradient descent. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/file/1371bccec2447b5aa6d96d2a540fb401-Paper.pdf.
T. Sun, Y. Sun, Y. Xu, and W. Yin. Markov chain block coordinate descent. Computational Optimization and Applications, 75(1):35–61, 2020.
T. Sun, D. Li, and B. Wang. Adaptive random walk gradient descent for decentralized optimization. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 20790–20809. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/sun22b.html.
T. Sun, D. Li, and B. Wang. On the decentralized stochastic gradient descent with Markov chain sampling. IEEE Transactions on Signal Processing, 71:2895–2909, 2023. doi: 10.1109/TSP.2023.3297053.
P. Wang, Y. Lei, Y. Ying, and D.-X. Zhou. Stability and generalization for Markov chain stochastic gradient methods. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 37735–37748. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/f61538f83b0f19f9306d9d801c15f41c-Paper-Conference.pdf.
R. Ward, X. Wu, and L. Bottou. AdaGrad stepsizes: Sharp convergence over nonconvex landscapes. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 6677–6686. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.press/v97/ward19a.html.
X. Wei, H. Yu, and M. J. Neely. Online primal-dual mirror descent under stochastic constraints. Proc. ACM Meas. Anal. Comput. Syst., 4(2), June 2020. doi: 10.1145/3392157. URL https://doi.org/10.1145/3392157.
X. Xiao. Penalized stochastic gradient methods for stochastic convex optimization with expectation constraints. Technical report, September 2019. URL https://optimization-online.org/?p=15973.
Y. Yan and Y. Xu. Adaptive primal-dual stochastic gradient method for expectation-constrained convex stochastic programs. Mathematical Programming Computation, 14(2):319–363, 2022. doi: 10.1007/s12532-021-00214-w. URL https://doi.org/10.1007/s12532-021-00214-w.
Z. Yang, Y. Lei, P. Wang, T. Yang, and Y. Ying. Simple stochastic and online gradient descent algorithms for pairwise learning. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 20160–20171. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/a87d27f712df362cd22c7a8ef823e987-Paper.pdf.
S. Yao and B. Huang. Beyond parity: Fairness objectives for collaborative filtering. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/e6384711491713d29bc63fc5eeb5ba4f-Paper.pdf.
E. Yazdandoost Hamedani, A. Jalilzadeh, and N. S. Aybat. Randomized primal-dual methods with adaptive step sizes. In F. Ruiz, J. Dy, and J.-W. van de Meent, editors, Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pages 11185–11212. PMLR, 25–27 Apr 2023. URL https://proceedings.mlr.press/v206/yazdandoost-hamedani23a.html.
X. Yi, X. Li, T. Yang, L. Xie, T. Chai, and K. Johansson. Regret and cumulative constraint violation analysis for online convex optimization with long term constraints. In M. Meila and T. Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 11998–12008. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/yi21b.html.
H. Yu, M. Neely, and X. Wei. Online convex optimization with stochastic constraints. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/da0d1111d2dc5d489242e60ebcbaf988-Paper.pdf.
J. Yuan and A. Lamperski. Online convex optimization for cumulative constraints. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/9cb9ed4f35cf7c2f295cc2bc6f732a84-Paper.pdf.
M. B. Zafar, I. Valera, M. Gomez-Rodriguez, and K. P. Gummadi. Fairness constraints: A flexible approach for fair classification. Journal of Machine Learning Research, 20(75):1–42, 2019. URL http://jmlr.org/papers/v20/18-262.html.
L. Zhang, Y. Zhang, J. Wu, and X. Xiao. Solving stochastic optimization with expectation constraints efficiently by a stochastic augmented Lagrangian-type algorithm. INFORMS Journal on Computing, 34(6):2989–3006, 2022. doi: 10.1287/ijoc.2022.1228. URL https://doi.org/10.1287/ijoc.2022.1228.
L. Zhang, Y. Zhang, X. Xiao, and J. Wu. Stochastic approximation proximal method of multipliers for convex stochastic programming. Mathematics of Operations Research, 48(1):177–193, 2023. doi: 10.1287/moor.2022.1257. URL https://doi.org/10.1287/moor.2022.1257.
R. Zhao. Accelerated stochastic algorithms for convex-concave saddle-point problems. Mathematics of Operations Research, 47(2):1443–1473, 2022. doi: 10.1287/moor.2021.1175. URL https://doi.org/10.1287/moor.2021.1175.
Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra. Federated learning with non-IID data, 2018. doi: 10.48550/ARXIV.1806.00582. URL https://arxiv.org/abs/1806.00582.