Abstract
We introduce and analyze a new family of first-order optimization algorithms which generalizes and unifies both mirror descent and dual averaging. Within this framework, we define new algorithms for constrained optimization that combine the advantages of mirror descent and dual averaging. Our preliminary simulation study shows that these new algorithms significantly outperform available methods in some situations.
Notes
We also refer to [27, Appendix C] for a discussion comparing MD and DA.
In its general form [35], the DA algorithm allows for a time-varying regularizer. For the sake of clarity, we consider here the simple case of time-invariant regularizers, which already captures some essential differences between MD and DA.
With some terminological abuse, we say that g is strongly convex when it is strongly convex with modulus 1.
In the case of compact \({{\mathcal {X}}}\) one can take \(\Omega _{{\mathcal {X}}}=\left[ \max _{x\in {{\mathcal {X}}}}2D_h(x,x_1;\ {\vartheta }_1)\right] ^{1/2}\). Note that in this case, due to the strong convexity of \(D_h(\cdot ,x_1;\ {\vartheta }_1)\), one has \(\Omega _{{\mathcal {X}}}\geqslant \max _{x\in {{\mathcal {X}}}}\Vert x-x_1\Vert \).
The APDD and IPDD algorithms should be seen as mere examples; nothing special sets them apart from other possible UMD implementations.
The second parameter in the definition of the k-\(\ell \)-APDD corresponds to the \(\ell \)-step-ahead computation of the objective when determining the choice of update every k steps of the algorithm.
The domain of a convex function is convex, and therefore \({\mathcal {D}}_F={\text {int}}{\text {dom}}F\) is convex as the interior of a convex set.
References
Audibert, J.Y., Bubeck, S.: Minimax policies for adversarial and stochastic bandits. In: Proceedings of the 22nd Annual Conference on Learning Theory (COLT), pp. 217–226 (2009)
Audibert, J.Y., Bubeck, S.: Regret bounds and minimax policies under partial monitoring. J. Mach. Learn. Res. 11, 2785–2836 (2010)
Audibert, J.Y., Bubeck, S., Lugosi, G.: Regret in online combinatorial optimization. Math. Oper. Res. 39(1), 31–45 (2013)
Beck, A., Teboulle, M.: Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett. 31(3), 167–175 (2003)
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imag. Sci. 2(1), 183–202 (2009)
Bregman, L.M.: The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 7(3), 200–217 (1967)
Bubeck, S.: Introduction To Online Optimization: Lecture Notes. Princeton University, Princeton, NJ (2011)
Bubeck, S.: Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning 8(3–4), 231–357 (2015)
Bubeck, S., Cesa-Bianchi, N.: Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning 5(1), 1–122 (2012)
Bubeck, S., Cesa-Bianchi, N., Kakade, S.M.: Towards minimax policies for online linear optimization with bandit feedback. In: JMLR: Workshop and Conference Proceedings (COLT), vol. 23, pp. 41.1–41.14 (2012)
Cesa-Bianchi, N., Lugosi, G.: Prediction, Learning, and Games. Cambridge University Press, Cambridge (2006)
Chen, G., Teboulle, M.: Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM J. Optim. 3(3), 538–543 (1993)
Cohen, A., Hazan, T., Koren, T.: Tight bounds for bandit combinatorial optimization. In: Proceedings of Machine Learning Research (COLT 2017) vol. 65, pp. 1–14. (2017)
Cox, B., Juditsky, A., Nemirovski, A.: Dual subgradient algorithms for large-scale nonsmooth learning problems. Math. Program. 148(1–2), 143–180 (2014)
Dasgupta, S., Telgarsky, M.J.: Agglomerative Bregman clustering. In: Proceedings of the 29th International Conference on Machine Learning (ICML 12), pp. 1527–1534 (2012)
Dekel, O., Gilad-Bachrach, R., Shamir, O., Xiao, L.: Optimal distributed online prediction using mini-batches. J. Mach. Learn. Res. 13(Jan), 165–202 (2012)
Duchi, J.C., Agarwal, A., Wainwright, M.J.: Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Trans. Autom. Control. 57(3), 592–606 (2012)
Duchi, J.C., Ruan, F.: Asymptotic optimality in stochastic optimization. The Annals of Statistics (to appear)
Flammarion, N., Bach, F.: Stochastic composite least-squares regression with convergence rate O(1/n). In: Proceedings of Machine Learning Research (COLT 2017), vol. 65, pp. 1–44 (2017)
Hazan, E.: The convex optimization approach to regret minimization. In: Sra, S., Nowozin, S., Wright, S.J. (eds.) Optimization for Machine Learning, pp. 287–303. MIT Press (2012)
Juditsky, A., Nemirovski, A.: First order methods for nonsmooth convex large-scale optimization, II: utilizing problems structure. Optimization for Machine Learning 30(9), 149–183 (2011)
Juditsky, A., Rigollet, P., Tsybakov, A.B.: Learning by mirror averaging. Ann. Stat. 36(5), 2183–2206 (2008)
Juditsky, A.B., Nazin, A.V., Tsybakov, A.B., Vayatis, N.: Recursive aggregation of estimators by the mirror descent algorithm with averaging. Probl. Inf. Transm. 41(4), 368–384 (2005)
Lan, G.: An optimal method for stochastic composite optimization. Math. Program. 133(1), 365–397 (2012)
Lee, S., Wright, S.J.: Manifold identification in dual averaging for regularized stochastic online learning. J. Mach. Learn. Res. 13(Jun), 1705–1744 (2012)
McMahan, H.B.: Follow-the-regularized-leader and mirror descent: Equivalence theorems and L1 regularization. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 525–533 (2011)
McMahan, H.B.: A survey of algorithms and analysis for adaptive online learning. J. Mach. Learn. Res. 18(1), 3117–3166 (2017)
Nazin, A.V.: Algorithms of inertial mirror descent in convex problems of stochastic optimization. Autom. Remote. Control. 79(1), 78–88 (2018)
Nemirovski, A.: Efficient methods for large-scale convex optimization problems. Ekonomika i Matematicheskie Metody 15 (1979)
Nemirovski, A.: Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM J. Optim. 15(1), 229–251 (2004)
Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009)
Nemirovski, A., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. Wiley Interscience, UK (1983)
Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005)
Nesterov, Y.: Dual extrapolation and its applications to solving variational inequalities and related problems. Math. Program. 109(2–3), 319–344 (2007)
Nesterov, Y.: Primal-dual subgradient methods for convex problems. Math. Program. 120(1), 221–259 (2009)
Nesterov, Y., Shikhman, V.: Quasi-monotone subgradient methods for nonsmooth convex minimization. J. Optim. Theory Appl. 165(3), 917–940 (2015)
Rakhlin, A., Tewari, A.: Lecture notes on online learning (2009)
Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton, NJ (1970)
Shalev-Shwartz, S.: Online learning: Theory, algorithms, and applications. Ph.D. thesis, The Hebrew University of Jerusalem, Israel (2007)
Shalev-Shwartz, S.: Online learning and online convex optimization. Foundations and Trends in Machine Learning 4(2), 107–194 (2011)
Zinkevich, M.: Online convex programming and generalized infinitesimal gradient ascent. In: Proceedings of the Twentieth International Conference on Machine Learning (ICML) (2003)
Acknowledgements
The authors are grateful to Roberto Cominetti, Cristóbal Guzmán, Nicolas Flammarion and Sylvain Sorin for inspiring discussions and suggestions. A. Juditsky was supported by MIAI @ Grenoble Alpes (ANR-19-P3IA-0003). J. Kwon was supported by a public grant as part of the “Investissement d’avenir” project (ANR-11-LABX-0056-LMH), LabEx LMH.
Appendices
Convex analysis tools
Definition 9
(Lower-semicontinuity) A function \(g:{\mathbb {R}}^n\rightarrow {\mathbb {R}}\cup \{+\infty \}\) is lower-semicontinuous if for all \(c\in {\mathbb {R}}\), the sublevel set \(\left\{ x\in {\mathbb {R}}^n\,:\ g(x)\leqslant c \right\} \) is closed.
One can easily check that the sum of two lower-semicontinuous functions is lower-semicontinuous. Continuous functions and characteristic functions \(I_{{{\mathcal {X}}}}\) of closed sets \({{\mathcal {X}}}\subset {\mathbb {R}}^n\) are examples of lower-semicontinuous functions.
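To see why the last example holds, note that for a closed set \({{\mathcal {X}}}\subset {\mathbb {R}}^n\) the sublevel sets of \(I_{{{\mathcal {X}}}}\) are
$$\begin{aligned} \left\{ x\in {\mathbb {R}}^n\,:\ I_{{{\mathcal {X}}}}(x)\leqslant c \right\} ={\left\{ \begin{array}{ll} \emptyset &{}\text {if } c<0,\\ {{\mathcal {X}}}&{}\text {if } c\geqslant 0, \end{array}\right. } \end{aligned}$$
both of which are closed sets.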
Definition 10
(Strong convexity) Let \(g:{\mathbb {R}}^n\rightarrow {\mathbb {R}}\cup \{+\infty \}\), let \(\left\| \,\cdot \,\right\| _{}\) be a norm on \({\mathbb {R}}^n\) and let \(\kappa > 0\). The function g is said to be strongly convex with modulus \(\kappa \) with respect to norm \(\left\| \,\cdot \,\right\| _{}\) if for all \(x,x'\in {\mathbb {R}}^n\) and \(\lambda \in \left[ 0,1 \right] \),
$$\begin{aligned} g(\lambda x+(1-\lambda )x')\leqslant \lambda g(x)+(1-\lambda )g(x')-\frac{\kappa }{2}\lambda (1-\lambda )\left\| x-x' \right\| ^2. \end{aligned}$$
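For instance, the squared Euclidean norm \(g(x)=\frac{1}{2}\left\| x \right\| _2^2\) is strongly convex with modulus 1 with respect to \(\left\| \,\cdot \,\right\| _2\), since a direct computation gives
$$\begin{aligned} \lambda g(x)+(1-\lambda )g(x')-g(\lambda x+(1-\lambda )x')=\frac{\lambda (1-\lambda )}{2}\left\| x-x' \right\| _2^2, \end{aligned}$$
and the negative entropy \(x\mapsto \sum _{i=1}^nx_i\log x_i\) restricted to the unit simplex is strongly convex with modulus 1 with respect to \(\left\| \,\cdot \,\right\| _1\) (a consequence of Pinsker's inequality). These two standard examples underlie the Euclidean and entropic regularizers, respectively.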
Proposition 12
(Theorem 23.5 in [38]) Let \(g:{\mathbb {R}}^n\rightarrow {\mathbb {R}}\cup \{+\infty \}\) be a lower-semicontinuous convex function with nonempty domain. Then for all \(x,y\in {\mathbb {R}}^n\), the following statements are equivalent.
(i) \(x\in \partial g^*(y)\);
(ii) \(y\in \partial g(x)\);
(iii) \(\langle y | x \rangle =g(x)+g^*(y)\);
(iv) \(x\in {{\,\mathrm{Arg\,max}\,}}_{x'\in {\mathbb {R}}^n}\left\{ \langle y | x' \rangle - g(x')\right\} \);
(v) \(y\in {{\,\mathrm{Arg\,max}\,}}_{y'\in {\mathbb {R}}^n}\left\{ \langle y' | x \rangle - g^*(y')\right\} \).
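A standard example illustrating these equivalences is the entropy/log-sum-exp pair: take \(g(x)=\sum _{i=1}^nx_i\log x_i+I_{\Delta _n}(x)\), where \(\Delta _n\) denotes the unit simplex (with the convention \(0\log 0=0\)). Then \(g^*(y)=\log \sum _{i=1}^ne^{y_i}\), and for every \(y\in {\mathbb {R}}^n\) the unique maximizer in (iv) is
$$\begin{aligned} x=\nabla g^*(y)=\left( \frac{e^{y_i}}{\sum _{j=1}^ne^{y_j}}\right) _{1\leqslant i\leqslant n}, \end{aligned}$$
for which \(\langle y | x \rangle =g(x)+g^*(y)\), in accordance with (i)–(iii).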
Postponed proofs
1.1 Proofs for Section 2
1.1.1 Proof of Proposition 1
Let \({\vartheta }\in {\mathbb {R}}^n\). By property (iii) from Definition 1, there exists \(x_1\in {\mathcal {D}}_F\) such that \(\nabla F(x_1)={\vartheta }\). Therefore, function \(\varphi _{{\vartheta }}:x\mapsto \langle {\vartheta }| x \rangle -F(x)\) is differentiable at \(x_1\) and \(\nabla \varphi _{{\vartheta }}(x_1)=0\). Moreover, \(\varphi _{{\vartheta }}\) is strictly concave as a consequence of property (i) from Definition 1. Therefore, \(x_1\) is the unique maximizer of \(\varphi _{{\vartheta }}\) and:
which proves property (i).
Besides, we have
where the first equivalence comes from Proposition 12. Point \(x_1\) being the unique maximizer of \(\varphi _{{\vartheta }}\), we have that \(\partial F^*({\vartheta })\) is a singleton. In other words, \(F^*\) is differentiable in \({\vartheta }\) and
First, the above (19) proves property (ii). Second, this equality combined with the equality from (18) gives the second identity from property (iv). Third, this proves that \(\nabla F^*({\mathbb {R}}^n)\subset {\mathcal {D}}_F\).
It remains to prove the reverse inclusion to get property (iii). Let \(x\in {\mathcal {D}}_F\). By property (ii) from Definition 1, F is differentiable in x. Consider
and all the above holds with this special point \({\vartheta }\). In particular, \(x_1=x\) by uniqueness of \(x_1\). Therefore (19) gives
and this proves \(\nabla F^*({\mathbb {R}}^n)\supset {\mathcal {D}}_F\) and thus property (iii). Combining (20) and (21) gives the first identity from property (iv).
1.1.2 Proof of Theorem 1
Let \(x_0\in {\mathcal {D}}_F\). By definition of the mirror map, F is differentiable at \(x_0\). Therefore, \(D_F(x,x_0)\) is well-defined for all \(x\in {\mathbb {R}}^n\).
For every real value \(\alpha \in {\mathbb {R}}\), consider the sublevel set \(S_{{{\mathcal {X}}}}(\alpha )\) of the function \(x\mapsto D_F(x,x_0)\) associated with value \(\alpha \) and restricted to \({{\mathcal {X}}}\):
$$\begin{aligned} S_{{{\mathcal {X}}}}(\alpha ):=\left\{ x\in {{\mathcal {X}}}\,:\ D_F(x,x_0)\leqslant \alpha \right\} . \end{aligned}$$
Inheriting properties from F, function \(D_F(\,\cdot \,,x_0)\) is lower-semicontinuous and strictly convex: consequently, the sublevel sets \(S_{{{\mathcal {X}}}}(\alpha )\) are closed and convex.
Let us also prove that the sublevel sets \(S_{{{\mathcal {X}}}}(\alpha )\) are bounded. For each value \(\alpha \in {\mathbb {R}}\), we write
$$\begin{aligned} S_{{{\mathcal {X}}}}(\alpha )\subset S_{{\mathbb {R}}^n}(\alpha ):=\left\{ x\in {\mathbb {R}}^n\,:\ D_F(x,x_0)\leqslant \alpha \right\} \end{aligned}$$
and aim at proving that the latter set is bounded. By contradiction, let us suppose that there exists an unbounded sequence in \(S_{{\mathbb {R}}^n}(\alpha )\): let \((x_k)_{k\geqslant 1}\) be such that \(0<\left\| x_k-x_0 \right\| _{}\xrightarrow [k \rightarrow +\infty ]{}+\infty \) and \(D_F(x_k,x_0)\leqslant \alpha \) for all \(k\geqslant 1\). Using the Bolzano–Weierstrass theorem, there exists \(v\ne 0\) and a subsequence \((x_{\phi (k)})_{k\geqslant 1}\) such that
$$\begin{aligned} \frac{x_{\phi (k)}-x_0}{\left\| x_{\phi (k)}-x_0 \right\| }\xrightarrow [k \rightarrow +\infty ]{}v. \end{aligned}$$
The point \(x_0+\frac{x_{\phi (k)}-x_0}{\left\| x_{\phi (k)}-x_0 \right\| }\) being a convex combination of \(x_0\) and \(x_{\phi (k)}\), we can write the corresponding convexity inequality for function \(D_F(\,\cdot \,,x_0)\):
where we used shorthand \(\lambda _k:=\left\| x_{\phi (k)}-x_0 \right\| ^{-1}\). For the first above inequality, we used \(D_F(x_0,x_0)=0\) and that \(D_F(x_{\phi (k)},x_0)\leqslant \alpha \) by definition of \((x_k)_{k\geqslant 1}\). Then, using the lower-semicontinuity of \(D_F(\,\cdot \,,x_0)\) and the fact that \(x_0+\lambda _k(x_{\phi (k)}-x_0) \xrightarrow [k \rightarrow +\infty ]{}x_0+v\), we have
The Bregman divergence of a convex function being nonnegative, the above implies \(D_F(x_0+v,x_0)=0\). Thus, the function \(D_F(\,\cdot \,,x_0)\) attains its minimum (0) at two different points (at \(x_0\) and at \(x_0+v\)): this contradicts its strict convexity. Therefore, the sublevel sets \(S_{{{\mathcal {X}}}}(\alpha )\) are bounded and thus compact.
We now consider the value \(\alpha _{\text {inf}}\) defined as
$$\begin{aligned} \alpha _{\text {inf}}:=\inf _{x\in {{\mathcal {X}}}}D_F(x,x_0). \end{aligned}$$
In other words, \(\alpha _{\text {inf}}\) is the infimum value of \(D_F(\,\cdot \,,x_0)\) on \({{\mathcal {X}}}\), and thus the only possible value for the minimum (if it exists). We know that \(\alpha _{\text {inf}} \geqslant 0\) because the Bregman divergence is always nonnegative. From the definition of the sets \(S_{{{\mathcal {X}}}}(\alpha )\), it easily follows that:
$$\begin{aligned} S_{{{\mathcal {X}}}}(\alpha _{\text {inf}})=\bigcap _{k\geqslant 1}S_{{{\mathcal {X}}}}\!\left( \alpha _{\text {inf}}+\tfrac{1}{k}\right) ,\qquad \text {with } S_{{{\mathcal {X}}}}\!\left( \alpha _{\text {inf}}+\tfrac{1}{k}\right) \ne \emptyset \ \text { for all } k\geqslant 1. \end{aligned}$$
Naturally, the sets \(S_{{{\mathcal {X}}}}(\alpha )\) are increasing in \(\alpha \) with respect to the inclusion order. Therefore, \(S_{{{\mathcal {X}}}}(\alpha _{\text {inf}})\) is the intersection of a nested sequence of nonempty compact sets. It is thus nonempty as well by Cantor’s intersection theorem. Consequently, \(D_F(\,\cdot \,,x_0)\) does admit a minimum on \({{\mathcal {X}}}\), and the minimizer is unique because of the strict convexity.
Let us now prove that the minimizer \(x_*:=\mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{x\in {{\mathcal {X}}}}D_F(x,\ x_0)\) also belongs to \({\mathcal {D}}_F\). Let us assume by contradiction that \(x_*\in {{\mathcal {X}}}{\setminus } {\mathcal {D}}_F\). By definition of the mirror map, \({{\mathcal {X}}}\cap {\mathcal {D}}_F\) is nonempty; let \(x_1\in {{\mathcal {X}}}\cap {\mathcal {D}}_F\). The set \({\mathcal {D}}_F\) being open by definition, there exists \(\varepsilon > 0\) such that the closed Euclidean ball \({\overline{B}}(x_1,\varepsilon )\) centered at \(x_1\) and of radius \(\varepsilon \) is a subset of \({\mathcal {D}}_F\). We consider the convex hull
$$\begin{aligned} {\mathcal {C}}:={\text {conv}}\left( \left\{ x_* \right\} \cup {\overline{B}}(x_1,\varepsilon )\right) , \end{aligned}$$
which is clearly a compact set.
Consider the function G defined by:
$$\begin{aligned} G(x):=D_F(x,x_0)=F(x)-F(x_0)-\langle \nabla F(x_0) | x-x_0 \rangle ,\qquad x\in {\mathbb {R}}^n, \end{aligned}$$
so that \(x_*\) is the minimizer of G on \({{\mathcal {X}}}\). In particular, G is finite at \(x_*\). G inherits strict convexity, lower-semicontinuity, and differentiability on \({\mathcal {D}}_F\) from function F. G is continuous on the compact set \({\overline{B}}(x_1,\varepsilon )\) because G is convex on the open set \({\mathcal {D}}_F\supset {\overline{B}}(x_1,\varepsilon )\). Therefore, G is bounded on \({\overline{B}}(x_1,\varepsilon )\). Let us prove that G is also bounded on \({\mathcal {C}}\). Let \(x\in {\mathcal {C}}\). By definition of \({\mathcal {C}}\), there exists \(\lambda \in [0,1]\) and \(x'\in {\overline{B}}(x_1,\varepsilon )\) such that \(x=\lambda x_*+(1-\lambda )x'\). By convexity of G, we have:
$$\begin{aligned} G(x)=G(\lambda x_*+(1-\lambda )x')\leqslant \lambda G(x_*)+(1-\lambda )G(x'). \end{aligned}$$
We know that \(G(x_*)\) is finite and that \(G(x')\) is bounded for \(x'\in {\overline{B}}(x_1,\varepsilon )\). Therefore G is bounded on \({\mathcal {C}}\): let us denote \(G_{\text {max}}\) and \(G_{\text {min}}\) some upper and lower bounds for the value of G on \({\mathcal {C}}\).
Because \({{\mathcal {X}}}\) is a convex set, the segment \([x_*,x_1]\) (in other words the convex hull of \(\left\{ x_*,x_1 \right\} \)) is a subset of \({{\mathcal {X}}}\). Besides, let us prove that the set
$$\begin{aligned} (x_*,x_1]:=\left\{ (1-\lambda )x_*+\lambda x_1\,:\ \lambda \in (0,1] \right\} \end{aligned}$$
is a subset of \({\mathcal {D}}_F\). Let \(x_{\lambda }:=(1-\lambda )x_*+\lambda x_1\) (with \(\lambda \in (0,1]\)) be a point in the above set, and let us prove that it belongs to \({\mathcal {D}}_F\). By definition of the mirror map, we have \({{\mathcal {X}}}\subset {\text {cl}}{\mathcal {D}}_F\), and besides \(x_*\in {{\mathcal {X}}}\) by definition. Therefore, there exists a sequence \((x_k)_{k\geqslant 1}\) in \({\mathcal {D}}_F\) such that \(x_k\rightarrow x_*\) as \(k\rightarrow +\infty \). Then, we can write
$$\begin{aligned} x_{\lambda }=(1-\lambda )x_k+\lambda \left( x_1+(1-\lambda )\lambda ^{-1}(x_*-x_k)\right) . \end{aligned}$$
Since \(x_k\rightarrow x_*\), for high enough k, the point \(x_1+(1-\lambda )\lambda ^{-1}(x_*-x_k)\) belongs to \({\overline{B}}(x_1,\varepsilon )\) and therefore to \({\mathcal {D}}_F\). Then, the point \(x_{\lambda }\) belongs to the convex set \({\mathcal {D}}_F\) as the convex combination of two points in \({\mathcal {D}}_F\). Therefore, \((x_*,x_1]\) is indeed a subset of \({\mathcal {D}}_F\).

G being differentiable on \({\mathcal {D}}_F\) by definition of the mirror map, the gradient of G exists at each point of \((x_*,x_1]\). Let us prove that \(\nabla G\) is bounded on \((x_*,x_1]\). Let \(x_{\lambda }\in (x_*,x_1]\), where \(\lambda \in (0,1]\) is such that
$$\begin{aligned} x_{\lambda }=(1-\lambda )x_*+\lambda x_1, \end{aligned}$$
and let \(u\in {\mathbb {R}}^n\) such that \(\left\| u \right\| _2=1\). The point \(x_1+\varepsilon u\) belongs to \({\mathcal {C}}\) because it belongs to \({\overline{B}}(x_1,\varepsilon )\). The following point also belongs to the convex set \({\mathcal {C}}\) as the convex combination of \(x_*\) and \(x_1+\varepsilon u\) which both belong to \({\mathcal {C}}\):
$$\begin{aligned} x_{\lambda }+\lambda \varepsilon u=(1-\lambda )x_*+\lambda (x_1+\varepsilon u). \end{aligned}$$
Let \(h\in (0,\varepsilon ]\). The following point also belongs to \({\mathcal {C}}\) as a convex combination of \(x_{\lambda }\) and the above point \(x_{\lambda }+\lambda \varepsilon u\):
$$\begin{aligned} x_{\lambda }+\lambda h u=\left( 1-\frac{h}{\varepsilon }\right) x_{\lambda }+\frac{h}{\varepsilon }\left( x_{\lambda }+\lambda \varepsilon u\right) . \end{aligned}$$
Now using for G the convexity inequality associated with the convex combination from (23), we write:
where for the last line we used \(G(x_*)\leqslant G(x_{\lambda })\) which is true because \(x_{\lambda }\) belongs to \({{\mathcal {X}}}\) and \(x_*\) is by definition the minimizer of G on \({{\mathcal {X}}}\). Using the convexity inequality associated with the convex combination from (22), we also write
Combining (24) and (25) and dividing by \(h\lambda \), we get
Taking the limit as \(h\rightarrow 0^+\), we get that \(\langle \nabla G(x_{\lambda }) | u \rangle \leqslant (G_{\text {max}}-G_{\text {min}})/\varepsilon \). This being true for all vectors u such that \(\left\| u \right\| _2=1\), we have
$$\begin{aligned} \left\| \nabla G(x_{\lambda }) \right\| _2\leqslant \frac{G_{\text {max}}-G_{\text {min}}}{\varepsilon }. \end{aligned}$$
As a result, \(\nabla G\) is bounded on \((x_*,x_1]\).
Let us deduce that \(\partial G(x_*)\) is nonempty. The sequence \((\nabla G(x_{1/k}))_{k\geqslant 1}\) is bounded. Using the Bolzano–Weierstrass theorem, there exists a subsequence \((\nabla G(x_{1/\phi (k)}))_{k\geqslant 1}\) which converges to some vector \({\vartheta }_*\in {\mathbb {R}}^n\). For each \(k\geqslant 1\), the following is satisfied by convexity of G:
$$\begin{aligned} G(x)\geqslant G(x_{1/\phi (k)})+\langle \nabla G(x_{1/\phi (k)}) | x-x_{1/\phi (k)} \rangle ,\qquad x\in {\mathbb {R}}^n. \end{aligned}$$
Taking the limsup on both sides for each \(x\in {\mathbb {R}}^n\) as \(k\rightarrow +\infty \), we get (because obviously \(x_{1/\phi (k)}\rightarrow x_*\)):
where the second inequality follows from the lower-semicontinuity of G. Consequently, \({\vartheta }_*\) belongs to \(\partial G(x_*)\).
But by definition of the mirror map \(\nabla F\) takes all possible values and so does \(\nabla G\), because it follows from the definition of G that \(\nabla G=\nabla F-\nabla F(x_0)\). Therefore, there exists a point \({\tilde{x}}\in {\mathcal {D}}_F\) (thus \({\tilde{x}} \ne x_*\)) such that \(\nabla G({\tilde{x}})={\vartheta }_*\). Considering the point \(x_{\text {mid}}=\frac{1}{2}(x_*+{\tilde{x}})\), we can write the following convexity inequalities:
$$\begin{aligned} G(x_{\text {mid}})&\geqslant G(x_*)+\langle {\vartheta }_* | x_{\text {mid}}-x_* \rangle ,\\ G(x_{\text {mid}})&\geqslant G({\tilde{x}})+\langle {\vartheta }_* | x_{\text {mid}}-{\tilde{x}} \rangle . \end{aligned}$$
We now add both inequalities and use the fact that \(x_{\text {mid}}-{\tilde{x}}=x_*-x_{\text {mid}}\) by definition of \(x_{\text {mid}}\) to get \(0\leqslant 2G(x_{\text {mid}})-G(x_*)-G({\tilde{x}})\), which can also be written
$$\begin{aligned} G\!\left( \frac{x_*+{\tilde{x}}}{2}\right) \geqslant \frac{G(x_*)+G({\tilde{x}})}{2}, \end{aligned}$$
which contradicts the strict convexity of G. We conclude that \(x_*\in {\mathcal {D}}_F\).
1.1.3 Proof of Proposition 2
Let \({\vartheta }\in {\mathbb {R}}^n\). For each of the three assumptions, let us prove that \(h^*({\vartheta })\) is finite. This will prove that \({\text {dom}}h^*={\mathbb {R}}^n\).
(i) Because \({\text {cl}}{\text {dom}}h={{\mathcal {X}}}\) by definition of a pre-regularizer, we have:
$$\begin{aligned} h^*({\vartheta })=\max _{x\in {\mathbb {R}}^n}\left\{ \left\langle {\vartheta } \vert x \right\rangle -h(x) \right\} =\max _{x\in {{\mathcal {X}}}}\left\{ \left\langle {\vartheta } \vert x \right\rangle -h(x) \right\} . \end{aligned}$$
Besides, the function \(x\mapsto \left\langle {\vartheta } \vert x \right\rangle -h(x)\) is upper-semicontinuous and therefore attains a maximum on \({{\mathcal {X}}}\) because \({{\mathcal {X}}}\) is assumed to be compact. Therefore \(h^*({\vartheta })<+\infty \).
(ii) Because \(\nabla h({\mathcal {D}}_h)={\mathbb {R}}^n\) by assumption, there exists \(x\in {\mathcal {D}}_h\) such that \(\nabla h(x)={\vartheta }\). Then, by Proposition 12, \(h^*({\vartheta })=\left\langle {\vartheta } \vert x \right\rangle -h(x)<+\infty \).
(iii) The function \(x\mapsto \left\langle {\vartheta } \vert x \right\rangle -h(x)\) is strongly concave on \({\mathbb {R}}^n\) and therefore admits a maximum. Therefore, \(h^*({\vartheta })<+\infty \).
1.1.4 Proof of Proposition 3
Let \({\vartheta }\in {\mathbb {R}}^n\). Because \({\text {dom}}h^*={\mathbb {R}}^n\), the subdifferential \(\partial h^*({\vartheta })\) is nonempty—see e.g. [38, Theorem 23.4]. By Proposition 12, \(\partial h^*({\vartheta })\) is the set of maximizers of function \(x\mapsto \left\langle {\vartheta } \vert x \right\rangle -h(x)\), which is strictly concave. Therefore, the maximizer is unique and \(h^*\) is differentiable at \({\vartheta }\).
Let \(x\in {\mathcal {D}}_F\) and let us prove that \(\nabla F(x)\in \partial h(x)\). By convexity of F, the following is true for all \(x'\in {\mathbb {R}}^n\):
$$\begin{aligned} F(x')\geqslant F(x)+\langle \nabla F(x) | x'-x \rangle . \end{aligned}$$
By definition of h, we obviously have \(h(x')\geqslant F(x')\) for all \(x'\in {\mathbb {R}}^n\), and \(h(x)=F(x)+I_{{{\mathcal {X}}}}(x)=F(x)\) because \(x\in {{\mathcal {X}}}\). Therefore, the following is also true for all \(x'\in {\mathbb {R}}^n\):
$$\begin{aligned} h(x')\geqslant h(x)+\langle \nabla F(x) | x'-x \rangle . \end{aligned}$$
In other words, \(\nabla F(x)\in \partial h(x)\).
1.1.5 Proof of Proposition 4
h is strictly convex as the sum of two convex functions, one of which (F) is strictly convex. h is lower-semicontinuous as the sum of two lower-semicontinuous functions.
Let us now prove that \({\text {cl}}{\text {dom}}h={{\mathcal {X}}}\). First, we write
$$\begin{aligned} {\text {dom}}h={\text {dom}}(F+I_{{{\mathcal {X}}}})={\text {dom}}F\cap {{\mathcal {X}}}. \end{aligned}$$
Let \(x\in {\text {cl}}{\text {dom}}h={\text {cl}}({\text {dom}}F\cap {{\mathcal {X}}})\). There exists a sequence \((x_k)_{k\geqslant 1}\) in \({\text {dom}}F\cap {{\mathcal {X}}}\) such that \(x_k\rightarrow x\). In particular, each \(x_k\) belongs to closed set \({{\mathcal {X}}}\), and so does the limit: \(x\in {{\mathcal {X}}}\).
Conversely, let \(x\in {{\mathcal {X}}}\) and let us prove that \(x\in {\text {cl}}({\text {dom}}F\cap {{\mathcal {X}}})\) by constructing a sequence \((x_k)_{k\geqslant 1}\) in \({\text {dom}}F\cap {{\mathcal {X}}}\) which converges to x. By definition of the mirror map, we have \({{\mathcal {X}}}\subset {\text {cl}}{\mathcal {D}}_F\), where \({\mathcal {D}}_F:={\text {int}}{\text {dom}}F\). Therefore, there exists a sequence \((x_l')_{l\geqslant 1}\) in \({\mathcal {D}}_F\) such that \(x_l'\rightarrow x\) as \(l\rightarrow +\infty \). From the definition of the mirror map, we also have that \({{\mathcal {X}}}\cap {\mathcal {D}}_F\ne \emptyset \). Let \(x_0\in {{\mathcal {X}}}\cap {\mathcal {D}}_F\). In particular, \(x_0\) belongs to \({\mathcal {D}}_F\) which is an open set by definition. Therefore, there exists a neighborhood \(U\subset {\mathcal {D}}_F\) of point \(x_0\). We now construct the sequence \((x_k)_{k\geqslant 1}\) as follows:
$$\begin{aligned} x_k:=\left( 1-\frac{1}{k}\right) x+\frac{1}{k}x_0,\qquad k\geqslant 1. \end{aligned}$$
\(x_k\) belongs to \({{\mathcal {X}}}\) as the convex combination of two points in the convex set \({{\mathcal {X}}}\), and obviously converges to x. Besides, \(x_k\) can also be written, for any \(k,l\geqslant 1\),
$$\begin{aligned} x_k=\left( 1-\frac{1}{k}\right) x_l'+\frac{1}{k}x_0^{(kl)}, \end{aligned}$$
where we set \(x_0^{(kl)}:=x_0+(k-1)(x-x_l')\). For a given \(k\geqslant 1\), we see that \(x_0^{(kl)}\rightarrow x_0\) as \(l\rightarrow +\infty \) because \(x_l'\rightarrow x\) by definition of \((x_l')_{l\geqslant 1}\). Therefore, for large enough l, \(x_0^{(kl)}\) belongs to the neighborhood U and therefore to \({\mathcal {D}}_F\). \(x_k\) then appears as the convex combination of \(x_l'\) and \(x_0^{(kl)}\) which both belong to the convex set \({\mathcal {D}}_F\subset {\text {dom}}F\). \((x_k)\) is thus a sequence in \({\text {dom}}F\cap {{\mathcal {X}}}\) which converges to x. Therefore, \(x\in {\text {cl}}({\text {dom}}F\cap {{\mathcal {X}}})\) and h is an \({{\mathcal {X}}}\)-pre-regularizer.
Finally, we have \(F\leqslant h\) by definition of h. One can easily check that this implies \(h^*\leqslant F^*\) and we know from Proposition 1 that \({\text {dom}}F^*={\mathbb {R}}^n\), in other words that \(F^*\) only takes finite values. Therefore, so does \(h^*\) and h is an \({{\mathcal {X}}}\)-regularizer.
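As a concrete instance of Proposition 4 (a standard special case, recalled here for illustration): the Euclidean map \(F=\frac{1}{2}\left\| \,\cdot \,\right\| _2^2\) is a mirror map compatible with any nonempty closed convex set \({{\mathcal {X}}}\), and the corresponding \({{\mathcal {X}}}\)-regularizer is \(h=\frac{1}{2}\left\| \,\cdot \,\right\| _2^2+I_{{{\mathcal {X}}}}\), whose conjugate
$$\begin{aligned} h^*({\vartheta })=\max _{x\in {{\mathcal {X}}}}\left\{ \left\langle {\vartheta } \vert x \right\rangle -\frac{1}{2}\left\| x \right\| _2^2 \right\} =\left\langle {\vartheta } \vert \Pi _{{{\mathcal {X}}}}({\vartheta }) \right\rangle -\frac{1}{2}\left\| \Pi _{{{\mathcal {X}}}}({\vartheta }) \right\| _2^2 \end{aligned}$$
is indeed finite for every \({\vartheta }\in {\mathbb {R}}^n\), where \(\Pi _{{{\mathcal {X}}}}\) denotes the Euclidean projection onto \({{\mathcal {X}}}\).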
1.2 Proofs for Section 4
1.2.1 Proof of Proposition 11
Let \(t\geqslant 2\). It follows from the definition of the iterates that \(x_t-y_t=(\nu _{t-1}^{-1}-1)(y_t-y_{t-1})\). Therefore, utilizing the convexity of f, we get
Besides this, for \(t=1\), we have \(\gamma _1\langle f'(y_1) | x_1-x_* \rangle \geqslant \gamma _1(f(y_1)-f_*)\) because \(x_1=y_1\) by definition. Then, summing over \(t=1,\dots ,T\), we obtain after simplifications:
Using the definition of coefficients \(\nu _t\), the above left-hand side simplifies to result in the inequality
Finally, because \((x_t,{\vartheta }_t)_{t\geqslant 1}\) is a sequence of UMD\((h,\xi )\) iterates with dual increments \(\xi :=(-\gamma _tf'(y_t))_{t\geqslant 1}\), the result then follows by applying inequality (8) from Corollary 2 and dividing by \(\sum _{t=1}^T\gamma _t\). \(\square \)
1.2.2 Proof of Theorem 2
First, observe that whenever \(\gamma _t\leqslant 1/L\), due to (11),
Thus,
by the strong convexity of \(D_h\). On the other hand, by (6) of Lemma 1, for any \(x\in {{\mathcal {X}}}\cap {\text {dom}}h\),
Consequently, \(\forall x\in {{\mathcal {X}}}\cap {\text {dom}}h\),
Applying the above inequality to \(x=x_t\), we conclude that
Finally, when setting \(x=x_*\), we obtain
which implies (13). \(\square \)
1.2.3 Proof of Theorem 3
We start with the following technical result.
Lemma 2
Assume that positive step-sizes \(\nu _t\in (0,1]\) and \(\gamma _t>0\) are such that the relationship
holds for all t, which is certainly the case if \(\nu _t\gamma _t\leqslant L^{-1}\). Denote \(s_t=f(z_t)-f_*\); then
Proof of the lemma
Observe first that by construction,
By strong convexity of h, for \(\nu _t\gamma _t\leqslant L^{-1}\) we have
which is (27).
Next, observe that by (14a),
whence, by convexity of f,
Substituting the latter bound into (27), we get
or
Now, because \((x_t,{\vartheta }_t)_{t\geqslant 1}\) is a sequence of UMD iterates, by (6) of Lemma 1,
and we arrive at
which is (28). \(\square \)
Proof of the Theorem
Assume that \(\nu _t\) and \(\gamma _t\) satisfy
Summing up (28) from \(t=1\) to T, we get
It is clear that the choice \(\gamma _1=L^{-1}\), \(\nu _1=1\) and \(\nu _t=(\gamma _tL)^{-1}\) satisfies the relationship \(\gamma _t\nu _t\leqslant L^{-1}\). In this case, choosing the step-sizes \((\gamma _t)_{t\geqslant 1}\) to saturate recursively the last relation in (29), specifically,
we come to the celebrated Nesterov step-sizes (15), which satisfy \(\gamma _t\nu _t^{-1}\geqslant {(t+1)^2\over 4L}\), and we arrive at (16). \(\square \)