Abstract
The task of recovering a low-rank matrix from its noisy linear measurements plays a central role in computational science. Smooth formulations of the problem often exhibit an undesirable phenomenon: the condition number, classically defined, scales poorly with the dimension of the ambient space. In contrast, we here show that in a variety of concrete circumstances, nonsmooth penalty formulations do not suffer from the same type of ill-conditioning. Consequently, standard algorithms for nonsmooth optimization, such as subgradient and prox-linear methods, converge at a rapid dimension-independent rate when initialized within constant relative error of the solution. Moreover, nonsmooth formulations are naturally robust against outliers. Our framework subsumes such important computational tasks as phase retrieval, blind deconvolution, quadratic sensing, matrix completion, and robust PCA. Numerical experiments on these problems illustrate the benefits of the proposed approach.












Notes
Here, the subdifferential is formally obtained through the chain rule \(\partial f(x)=\nabla F(x)^*\partial h(F(x))\), where \(\partial h(\cdot )\) is the subdifferential in the sense of convex analysis.
Both the parameters \(\alpha _t\) and \(\beta \) must be properly chosen for these guarantees to take hold.
The authors of [57] provide a bound on L that scales with the Frobenius norm \(\sqrt{\Vert M_{\sharp }\Vert _F}\). We instead derive a sharper bound that scales as \(\sqrt{\Vert M_{\sharp }\Vert _\mathrm{op}}\). As a by-product, the linear rate of convergence for the subgradient method scales only with the condition number \(\sigma _1(M_{\sharp })/\sigma _r(M_{\sharp })\) instead of \(\Vert M_{\sharp }\Vert _F/\sigma _r(M_{\sharp })\).
The guarantees we develop in the symmetric setting are similar to those in the recent preprint [57], albeit we obtain a sharper bound on L; the two sets of results were obtained independently. The guarantees for the asymmetric setting are different and are complementary to each other: we analyze the conditioning of the basic problem formulation (1.2), while [57] introduces a regularization term \( \Vert X^\top X - YY^\top \Vert _F\) that improves the basin of attraction for the subgradient method by a factor of the condition number of \(M_{\sharp }\).
In the latter case, RIP is also called restricted uniform boundedness (RUB) [10].
with
By this we mean that the vectorized matrix \(\mathbf {vec}(P)\) is an \(\eta \)-sub-Gaussian random vector.
Recall that \(\Vert X\Vert _{2,\infty } = \max _{i \in [d]} \Vert X_{i \cdot }\Vert _2\) is the maximum row norm.
References
Ahmed, A., Recht, B., Romberg, J.: Blind deconvolution using convex programming. IEEE Transactions on Information Theory 60(3), 1711–1732 (2014)
Albano, P., Cannarsa, P.: Singularities of semiconcave functions in Banach spaces. In: Stochastic analysis, control, optimization and applications, Systems Control Found. Appl., pp. 171–190. Birkhäuser Boston, Boston, MA (1999)
Balcan, M.F., Liang, Y., Song, Z., Woodruff, D.P., Zhang, H.: Non-convex matrix completion and related problems via strong duality. Journal of Machine Learning Research 20(102), 1–56 (2019)
Bauch, J., Nadler, B.: Rank \(2r\) iterative least squares: efficient recovery of ill-conditioned low rank matrices from few entries. arXiv preprint arXiv:2002.01849 (2020)
Bhojanapalli, S., Neyshabur, B., Srebro, N.: Global optimality of local search for low rank matrix recovery. In: Advances in Neural Information Processing Systems, pp. 3873–3881 (2016)
Borwein, J., Lewis, A.: Convex analysis and nonlinear optimization. CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC, 3. Springer-Verlag, New York (2000). Theory and examples
Boucheron, S., Lugosi, G., Massart, P.: Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press (2013)
Burke, J.: Descent methods for composite nondifferentiable optimization problems. Math. Programming 33(3), 260–279 (1985). 10.1007/BF01584377.
Burke, J., Ferris, M.: A Gauss-Newton method for convex composite optimization. Math. Programming 71(2, Ser. A), 179–194 (1995). https://doi.org/10.1007/BF01585997.
Cai, T., Zhang, A.: ROP: matrix recovery via rank-one projections. Ann. Stat. 43(1), 102–138 (2015). 10.1214/14-AOS1267.
Candès, E., Eldar, Y., Strohmer, T., Voroninski, V.: Phase retrieval via matrix completion. SIAM J. Imaging Sci. 6(1), 199–225 (2013). 10.1137/110848074
Candès, E., Li, X., Soltanolkotabi, M.: Phase retrieval via Wirtinger flow: theory and algorithms. IEEE Trans. Inform. Theory 61(4), 1985–2007 (2015). 10.1109/TIT.2015.2399924
Candes, E., Plan, Y.: Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Transactions on Information Theory 57(4), 2342–2359 (2011)
Candes, E., Strohmer, T., Voroninski, V.: Phaselift: Exact and stable signal recovery from magnitude measurements via convex programming. Communications on Pure and Applied Mathematics 66(8), 1241–1274 (2013)
Candès, E.J., Li, X., Ma, Y., Wright, J.: Robust principal component analysis? Journal of the ACM (JACM) 58(3), 1–37 (2011)
Candès, E.J., Recht, B.: Exact matrix completion via convex optimization. Foundations of Computational Mathematics 9(6), 717 (2009). 10.1007/s10208-009-9045-5
Candès, E.J., Tao, T.: The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory 56(5), 2053–2080 (2010)
Chandrasekaran, V., Sanghavi, S., Parrilo, P.A., Willsky, A.S.: Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization 21(2), 572–596 (2011). 10.1137/090761793
Charisopoulos, V., Davis, D., Díaz, M., Drusvyatskiy, D.: Composite optimization for robust blind deconvolution. arXiv:1901.01624 (2019)
Chen, Y.: Incoherence-optimal matrix completion. IEEE Transactions on Information Theory 61(5), 2909–2923 (2015)
Chen, Y., Candès, E.: Solving random quadratic systems of equations is nearly as easy as solving linear systems. Comm. Pure Appl. Math. 70(5), 822–883 (2017)
Chen, Y., Chi, Y.: Harnessing structures in big data via guaranteed low-rank matrix estimation: Recent theory and fast algorithms via convex and nonconvex optimization. IEEE Signal Processing Magazine 35(4), 14–31 (2018)
Chen, Y., Chi, Y., Fan, J., Ma, C.: Gradient descent with random initialization: fast global convergence for nonconvex phase retrieval. Mathematical Programming (2019). https://doi.org/10.1007/s10107-019-01363-6
Chen, Y., Chi, Y., Goldsmith, A.: Exact and stable covariance estimation from quadratic sampling via convex programming. IEEE Trans. Inform. Theory 61(7), 4034–4059 (2015). 10.1109/TIT.2015.2429594
Chen, Y., Fan, J., Ma, C., Yan, Y.: Bridging Convex and Nonconvex Optimization in Robust PCA: Noise, Outliers, and Missing Data. arXiv e-prints arXiv:2001.05484 (2020)
Chen, Y., Jalali, A., Sanghavi, S., Caramanis, C.: Low-rank matrix recovery from errors and erasures. IEEE Transactions on Information Theory 59(7), 4324–4337 (2013)
Chen, Y., Wainwright, M.J.: Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. arXiv:1509.03025 (2015)
Chi, Y., Lu, Y., Chen, Y.: Nonconvex optimization meets low-rank matrix factorization: An overview. arXiv:1809.09573 (2018)
Davenport, M., Romberg, J.: An overview of low-rank matrix recovery from incomplete observations. IEEE J. Selected Top. Signal Process. 10(4), 608–622 (2016). 10.1109/JSTSP.2016.2539100
Davis, D., Drusvyatskiy, D.: Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization 29(1), 207–239 (2019)
Davis, D., Drusvyatskiy, D., MacPhee, K., Paquette, C.: Subgradient methods for sharp weakly convex functions. J. Optim. Theory Appl. 179(3), 962–982 (2018). 10.1007/s10957-018-1372-8
Davis, D., Drusvyatskiy, D., Paquette, C.: The nonsmooth landscape of phase retrieval. To appear in IMA J. Numer. Anal., arXiv:1711.03247 (2017)
Díaz, M.: The nonsmooth landscape of blind deconvolution. arXiv preprint arXiv:1911.08526 (2019)
Ding, L., Chen, Y.: Leave-one-out approach for matrix completion: Primal and dual analysis. IEEE Trans. Inf. Theory (2020). https://doi.org/10.1109/TIT.2020.2992769
Drusvyatskiy, D., Lewis, A.: Error bounds, quadratic growth, and linear convergence of proximal methods. Math. Oper. Res. 43(3), 919–948 (2018). 10.1287/moor.2017.0889
Drusvyatskiy, D., Paquette, C.: Efficiency of minimizing compositions of convex functions and smooth maps. Math. Prog. pp. 1–56 (2018)
Duchi, J., Ruan, F.: Solving (most) of a set of quadratic equalities: composite optimization for robust phase retrieval. IMA J. Inf. Inference (2018). https://doi.org/10.1093/imaiai/iay015
Duchi, J., Ruan, F.: Stochastic methods for composite and weakly convex optimization problems. SIAM J. Optim. 28(4), 3229–3259 (2018)
Eldar, Y., Mendelson, S.: Phase retrieval: stability and recovery guarantees. Appl. Comput. Harmon. Anal. 36(3), 473–494 (2014). 10.1016/j.acha.2013.08.003
Fazel, M.: Matrix rank minimization with applications. Ph.D. thesis, Stanford University (2002)
Fletcher, R.: A model algorithm for composite nondifferentiable optimization problems. Math. Programming Stud. (17), 67–76 (1982). https://doi.org/10.1007/bfb0120959. Nondifferential and variational techniques in optimization (Lexington, Ky., 1980)
Ge, R., Jin, C., Zheng, Y.: No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In: D. Precup, Y.W. Teh (eds.) Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 70, pp. 1233–1242. PMLR, International Convention Centre, Sydney, Australia (2017)
Ge, R., Lee, J.D., Ma, T.: Matrix completion has no spurious local minimum. In: D.D. Lee, M. Sugiyama, U.V. Luxburg, I. Guyon, R. Garnett (eds.) Advances in Neural Information Processing Systems 29, pp. 2973–2981. Curran Associates, Inc. (2016)
Goffin, J.: On convergence rates of subgradient optimization methods. Math. Programming 13(3), 329–347 (1977). 10.1007/BF01584346
Goldstein, T., Studer, C.: Phasemax: Convex phase retrieval via basis pursuit. IEEE Transactions on Information Theory 64(4), 2675–2689 (2018). 10.1109/TIT.2018.2800768
Gross, D.: Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory 57(3), 1548–1566 (2011)
Hardt, M.: Understanding alternating minimization for matrix completion. In: Proceedings of the 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, FOCS ’14, p. 651–660. IEEE Computer Society, USA (2014). https://doi.org/10.1109/FOCS.2014.75
Hardt, M., Wootters, M.: Fast matrix completion without the condition number. In: M.F. Balcan, V. Feldman, C. Szepesvári (eds.) Proceedings of The 27th Conference on Learning Theory, Proceedings of Machine Learning Research, vol. 35, pp. 638–678. PMLR, Barcelona, Spain (2014)
Hsu, D., Kakade, S.M., Zhang, T.: Robust matrix decomposition with sparse corruptions. IEEE Transactions on Information Theory 57(11), 7221–7234 (2011)
Jain, P., Netrapalli, P., Sanghavi, S.: Low-rank matrix completion using alternating minimization. In: Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, STOC ’13, p. 665–674. Association for Computing Machinery, New York, NY, USA (2013). https://doi.org/10.1145/2488608.2488693
Keshavan, R., Montanari, A., Oh, S.: Matrix completion from noisy entries. In: Y. Bengio, D. Schuurmans, J.D. Lafferty, C.K.I. Williams, A. Culotta (eds.) Advances in Neural Information Processing Systems 22, pp. 952–960. Curran Associates, Inc. (2009)
Keshavan, R.H., Montanari, A., Oh, S.: Matrix completion from a few entries. IEEE Transactions on Information Theory 56(6), 2980–2998 (2010)
Klein, T., Rio, E.: Concentration around the mean for maxima of empirical processes. The Annals of Probability 33(3), 1060–1077 (2005)
Lewis, A., Wright, S.: A proximal method for composite minimization. Math. Program. 158(1-2, Ser. A), 501–546 (2016). https://doi.org/10.1007/s10107-015-0943-9
Li, X.: Compressed sensing and matrix completion with constant proportion of corruptions. Constr. Approximation 37(1), 73–99 (2013)
Li, X., Ling, S., Strohmer, T., Wei, K.: Rapid, robust, and reliable blind deconvolution via nonconvex optimization. arXiv:1606.04933 (2016)
Li, X., Zhu, Z., So, A.C., Vidal, R.: Nonconvex robust low-rank matrix recovery. arXiv:1809.09237 (2018)
Li, Y., Ma, C., Chen, Y., Chi, Y.: Nonconvex matrix factorization from rank-one measurements. arXiv:1802.06286 (2018)
Li, Y., Sun, Y., Chi, Y.: Low-rank positive semidefinite matrix recovery from corrupted rank-one measurements. IEEE Transactions on Signal Processing 65(2), 397–408 (2016)
Ling, S., Strohmer, T.: Self-calibration and biconvex compressive sensing. Inverse Probl. 31(11), 115002, 31 (2015). https://doi.org/10.1088/0266-5611/31/11/115002
Ma, C., Wang, K., Chi, Y., Chen, Y.: Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval and matrix completion. In: J. Dy, A. Krause (eds.) Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 80, pp. 3345–3354. PMLR, Stockholmsmässan, Stockholm Sweden (2018)
Mendelson, S.: A remark on the diameter of random sections of convex bodies. In: Geometric aspects of functional analysis, Lecture Notes in Math., vol. 2116, pp. 395–404. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-09477-9_25
Mendelson, S.: Learning without concentration. J. ACM 62(3), Art. 21, 25 (2015). https://doi.org/10.1145/2699439
Mordukhovich, B.S.: Variational Analysis and Generalized Differentiation I: Basic Theory. Grundlehren der mathematischen Wissenschaften, Vol 330, Springer, Berlin (2006)
Negahban, S., Ravikumar, P., Wainwright, M., Yu, B.: A unified framework for high-dimensional analysis of \(M\)-estimators with decomposable regularizers. Statist. Sci. 27(4), 538–557 (2012). 10.1214/12-STS400
Netrapalli, P., Niranjan, U., Sanghavi, S., Anandkumar, A., Jain, P.: Non-convex robust PCA. In: Advances in Neural Information Processing Systems, pp. 1107–1115 (2014)
Nurminskii, E.: The quasigradient method for the solving of the nonlinear programming problems. Cybernetics 9(1), 145–150 (1973). 10.1007/BF01068677
Parikh, N., Boyd, S.: Block splitting for distributed optimization. Mathematical Programming Computation 6(1), 77–102 (2014)
Poliquin, R., Rockafellar, R.: Prox-regular functions in variational analysis. Trans. Amer. Math. Soc. 348, 1805–1838 (1996)
Recht, B.: A simpler approach to matrix completion. Journal of Machine Learning Research 12(104), 3413–3430 (2011)
Recht, B., Fazel, M., Parrilo, P.: Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev. 52(3), 471–501 (2010). 10.1137/070697835
Rockafellar, R.: Favorable classes of Lipschitz-continuous functions in subgradient optimization. In: Progress in nondifferentiable optimization, IIASA Collaborative Proc. Ser. CP-82, vol. 8, pp. 125–143. Int. Inst. Appl. Sys. Anal., Laxenburg (1982)
Rockafellar, R., Wets, R.B.: Variational Analysis. Grundlehren der mathematischen Wissenschaften, Vol 317, Springer, Berlin (1998)
Rolewicz, S.: On paraconvex multifunctions. In: Third Symposium on Operations Research (Univ. Mannheim, Mannheim, 1978), Section I, Operations Res. Verfahren, vol. 31, pp. 539–546. Hain, Königstein/Ts. (1979)
Rudelson, M., Vershynin, R.: Small ball probabilities for linear images of high-dimensional distributions. International Mathematics Research Notices 2015(19), 9594–9617 (2014)
Shechtman, Y., Eldar, Y., Cohen, O., Chapman, H., Miao, J., Segev, M.: Phase retrieval with application to optical imaging: A contemporary overview. IEEE Signal Processing Magazine 32(3), 87–109 (2015). 10.1109/MSP.2014.2352673
Sun, R., Luo, Z.Q.: Guaranteed matrix completion via non-convex factorization. IEEE Transactions on Information Theory 62(11), 6535–6579 (2016)
Tu, S., Boczar, R., Simchowitz, M., Soltanolkotabi, M., Recht, B.: Low-rank solutions of linear matrix equations via Procrustes flow. In: Proceedings of the 33rd International Conference on International Conference on Machine Learning, Volume 48, ICML’16, pp. 964–973. JMLR.org (2016)
Vershynin, R.: High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press (2018)
Yi, X., Park, D., Chen, Y., Caramanis, C.: Fast algorithms for robust PCA via gradient descent. In: Advances in Neural Information Processing Systems, pp. 4152–4160 (2016)
Zheng, Q., Lafferty, J.: Convergence Analysis for Rectangular Matrix Completion Using Burer-Monteiro Factorization and Gradient Descent. arXiv e-prints arXiv:1605.07051 (2016)
Additional information
Communicated by Thomas Strohmer.
Research of Y. Chen was supported by the NSF 1657420 and 1704828 Grants. Research of D. Davis was supported by an Alfred P. Sloan Research Fellowship. Research of D. Drusvyatskiy was supported by the NSF DMS 1651851 and CCF 1740551 awards.
Appendices
A Proofs in Sect. 5
In this section, we prove rapid local convergence guarantees for the subgradient and prox-linear algorithms under regularity conditions that hold only locally around a particular solution. We will use the Euclidean norm throughout this section; therefore, to simplify the notation, we will drop the subscript two. Thus, \(\Vert \cdot \Vert \) denotes the \(\ell _2\) norm on a Euclidean space \(\mathbf {E}\) throughout.
We will need the following quantitative version of Lemma 5.1.
Lemma A.1
Suppose Assumption C holds and let \(\gamma \in (0,2)\) be arbitrary. Then, for any point \(x\in B_{\epsilon /2}(\bar{x})\cap \mathcal {T}_{\gamma }\backslash \mathcal {X}^*\), the estimate holds:
Proof
Consider any point \(x\in B_{\epsilon /2}(\bar{x})\) satisfying \(\mathrm{dist}(x,\mathcal {X}^*)\le \gamma \frac{\mu }{\rho }\). Let \(x^*\in \mathrm {proj}_{\mathcal {X}^*}(x)\) be arbitrary and note \(x^*\in B_{\epsilon }(\bar{x})\). Thus, for any \(\zeta \in \partial f(x)\) we deduce
Therefore, we deduce the lower bound on the subgradients \(\Vert \zeta \Vert \ge \mu -\frac{\rho }{2}\cdot \mathrm{dist}(x,\mathcal {X}^*)\ge \left( 1-\tfrac{\gamma }{2}\right) \mu ,\) as claimed. \(\square \)
1.1 A.1 Proof of Theorem 5.6
Let k be the first index (possibly infinite) such that \(x_k\notin B_{\epsilon /2}(\bar{x})\). We claim that (5.4) holds for all \(i<k\). We show this by induction. To this end, suppose (5.4) holds for all indices up to \(i-1\). In particular, we deduce \(\mathrm{dist}(x_{i},\mathcal {X}^*)\le \mathrm{dist}(x_{0},\mathcal {X}^*)\le \frac{\mu }{2\rho }\). Let \(x^*\in \mathrm {proj}_{\mathcal {X}^*}(x_i)\) and note \(x^*\in B_{\epsilon }(\bar{x})\), since
Thus, we deduce
Here, the estimate (A.1) follows from the fact that the projection \(\mathrm {proj}_Q(\cdot )\) is nonexpansive, (A.2) uses local weak convexity, (A.4) follows from the estimate \(\mathrm{dist}(x_i,\mathcal {X}^*)\le \frac{\mu }{2\rho }\), while (A.3) and (A.5) use local sharpness. We therefore deduce
Thus, (5.4) holds for all indices up to \(k-1\). We next show that k is infinite. To this end, observe
where (A.7) follows by Lemma A.1 with \(\gamma = 1/2\), the bound in (A.8) follows by (A.6) and the assumption on \(\mathrm{dist}(x_0, \mathcal {X}^*)\), and finally (A.9) holds thanks to (A.6). Thus, applying the triangle inequality we get the contradiction \(\Vert x_k-\bar{x}\Vert \le \epsilon /2\). Consequently, all the iterates \(x_k\) for \(k=0,1,\ldots , \infty \) lie in \(B_{\epsilon /2}(\bar{x})\) and satisfy (5.4).
Finally, let \(x_{\infty }\) be any limit point of the sequence \(\{x_i\}\). We then successively compute
This completes the proof.
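To make the iteration analyzed above concrete, the following is a minimal numerical sketch of a Polyak-type subgradient method applied to the robust phase-retrieval loss \(f(x)=\frac{1}{m}\sum _{i}|\langle a_i,x\rangle ^2-b_i|\). It assumes exact measurements (so \(\min f=0\)) and an initialization within constant relative error of the signal; the problem sizes and variable names are illustrative only and are not taken from the paper's experiments.

```python
import numpy as np

def polyak_subgradient(f, subgrad, x0, f_min=0.0, iters=200):
    """Polyak subgradient method: x+ = x - (f(x) - f_min)/||g||^2 * g."""
    x = x0.copy()
    for _ in range(iters):
        g = subgrad(x)
        gap = f(x) - f_min
        if gap <= 0 or np.linalg.norm(g) == 0:
            break
        x = x - (gap / np.linalg.norm(g) ** 2) * g
    return x

# Illustrative instance: robust phase retrieval with exact measurements,
# f(x) = (1/m) * sum_i | <a_i, x>^2 - b_i |, minimized (value 0) at +/- x_sharp.
rng = np.random.default_rng(0)
d, m = 50, 400
A = rng.standard_normal((m, d))
x_sharp = rng.standard_normal(d)
b = (A @ x_sharp) ** 2

f = lambda x: np.mean(np.abs((A @ x) ** 2 - b))
def subgrad(x):
    # A subgradient of f: (2/m) * sum_i sign((a_i^T x)^2 - b_i) * (a_i^T x) * a_i.
    r = (A @ x) ** 2 - b
    return (2.0 / m) * A.T @ (np.sign(r) * (A @ x))

# Initialization within roughly 10% relative error of x_sharp.
x0 = x_sharp + 0.1 * np.linalg.norm(x_sharp) * rng.standard_normal(d) / np.sqrt(d)
x_hat = polyak_subgradient(f, subgrad, x0)
print(f(x_hat))  # expected to be near zero when initialized close enough
```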
1.2 A.2 Proof of Theorem 5.7
Fix an arbitrary index k and observe
Hence, we conclude the uniform bound on the iterates:
and the linear rate of convergence
where \(x_{\infty }\) is any limit point of the iterate sequence.
Let us now show that the iterates do not escape \(B_{\epsilon /2}(\bar{x})\). To this end, observe
We must therefore verify the estimate \(\tfrac{\lambda }{1-q}\le \tfrac{\epsilon }{4}\), or equivalently \(\gamma \le \frac{\epsilon \rho L(1-\gamma )\tau ^2}{4\mu ^2(1+\sqrt{1-(1-\gamma ) \tau ^2})}.\) Clearly, it suffices to verify \(\gamma \le \frac{\epsilon \rho (1-\gamma )}{4L},\) which holds by the definition of \(\gamma \). Thus, all the iterates \(x_k\) lie in \(B_{\epsilon /2}(\bar{x})\). Moreover, since \(\tau \le \sqrt{\frac{1}{2}} \le \sqrt{\frac{1}{2-\gamma }}\), the rest of the proof is identical to that in [31, Theorem 5.1].
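The constants \(\lambda \) and q appearing above correspond to a geometrically decaying step size. Purely as an illustration, such an iteration might be implemented as in the sketch below; the tuning of lam and q required by Theorem 5.7 is not reproduced here, so they are left as user-supplied parameters.

```python
import numpy as np

def geometric_subgradient(subgrad, x0, lam, q, iters=300):
    """Subgradient method with geometrically decaying steps:
        x_{k+1} = x_k - lam * q**k * g_k / ||g_k||.
    A sketch only: lam and q must be chosen as in Theorem 5.7."""
    x = x0.copy()
    for k in range(iters):
        g = subgrad(x)
        norm_g = np.linalg.norm(g)
        if norm_g == 0:  # a zero subgradient certifies stationarity
            break
        x = x - lam * (q ** k) * g / norm_g
    return x
```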
1.3 A.3 Proof of Theorem 5.8
Fix any index i such that \(x_i\in B_{\epsilon }(\bar{x})\) and let \(x\in \mathcal {X}\) be arbitrary. Since the function \(z\mapsto f_{x_i}(z)+\frac{\beta }{2}\Vert z-x_i\Vert ^2\) is \(\beta \)-strongly convex and \(x_{i+1}\) is its minimizer, we deduce
Setting \(x=x_i\) and appealing to approximation accuracy, we obtain the descent guarantee
In particular, the function values are decreasing along the iterate sequence. Next choosing any \(x^*\in \mathrm {proj}_{\mathcal {X}^*}(x_i)\) and setting \(x=x^*\) in (A.10) yields
Appealing to approximation accuracy and lower-bounding \(\frac{\beta }{2}\Vert x_{i+1}-x^*\Vert ^2\) by zero, we conclude
Using sharpness, we deduce the contraction guarantee
where the last inequality uses the assumption \(f(x_0)-\min _{\mathcal {X}} f\le \frac{\mu ^2}{2\beta }\). Let \(k>0\) be the first index satisfying \(x_{k}\notin B_{\epsilon }(\bar{x})\). We then deduce
where (A.14) follows from (A.11) and (A.15) follows from (A.13). Thus, we conclude \(\Vert x_k-\bar{x}\Vert \le \epsilon \), which is a contradiction. Therefore, all the iterates \(x_k\), for \(k=0,1,\ldots , \infty \), lie in \(B_{\epsilon }(\bar{x})\). Combining this with (A.12) and sharpness yields the claimed quadratic convergence guarantee
Finally, let \(x_{\infty }\) be any limit point of the sequence \(\{x_i\}\). We then deduce
where (A.16) follows from (A.13). The theorem is proved.
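The prox-linear update \(x_{i+1}=\mathop {\mathrm {argmin}}_{x}\, f_{x_i}(x)+\frac{\beta }{2}\Vert x-x_i\Vert ^2\) is a convex subproblem and can be handed to an off-the-shelf solver. The sketch below assumes the robust phase-retrieval instance \(h=\frac{1}{m}\Vert \cdot \Vert _1\) and \(F(x)=(Ax)^{\odot 2}-b\), and it uses cvxpy purely for illustration; it is not the implementation used in the paper's experiments.

```python
import numpy as np
import cvxpy as cp

def prox_linear_step(A, b, xk, beta):
    """One prox-linear step for f(x) = (1/m) * || (A x)^2 - b ||_1.

    Minimizes the convex model
        (1/m) * || F(xk) + J(xk) (x - xk) ||_1 + (beta/2) * ||x - xk||^2,
    where F(x) = (A x)^2 - b and J(xk) has rows 2*(a_i^T xk) * a_i^T.
    """
    m = A.shape[0]
    Fk = (A @ xk) ** 2 - b
    Jk = 2.0 * (A * (A @ xk)[:, None])   # Jacobian of F at xk
    x = cp.Variable(A.shape[1])
    objective = cp.Minimize(cp.sum(cp.abs(Fk + Jk @ (x - xk))) / m
                            + (beta / 2) * cp.sum_squares(x - xk))
    cp.Problem(objective).solve()
    return x.value
```

Iterating this step from a point within the basin described by Theorem 5.8 is what the quadratic convergence guarantee above refers to.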
B Proofs in Sect. 6
1.1 B.1 Proof of Lemma 6.3
In order to verify the assumption in each case, we will prove a stronger “small-ball condition” [62, 63], which immediately implies the claimed lower bounds on the expectation by Markov’s inequality. More precisely, we will show that there exist numerical constants \(\mu _0,p_0>0\) such that
1. (Matrix Sensing)
$$\begin{aligned} \inf _{\begin{array}{c} M: \; \text {Rank }\,M \le 2r \\ \Vert M\Vert _F = 1 \end{array}} \mathbb {P}(|\langle P,M\rangle | \ge \mu _0) \ge p_0, \end{aligned}$$
2. (Quadratic Sensing I)
$$\begin{aligned} \inf _{\begin{array}{c} M\in \mathcal {S}^d: \; \text {Rank }\,M \le 2r \\ \Vert M\Vert _F = 1 \end{array}} \mathbb {P}(|p^\top M p| \ge \mu _0) \ge p_0, \end{aligned}$$
3. (Quadratic Sensing II)
$$\begin{aligned} \inf _{\begin{array}{c} M\in \mathcal {S}^d: \; \text {Rank }\,M \le 2r \\ \Vert M\Vert _F = 1 \end{array}} \mathbb {P}\big ( |p^\top M p- \tilde{p}^\top M \tilde{p}| \ge \mu _0\big ) \ge p_0, \end{aligned}$$
4. (Bilinear Sensing)
$$\begin{aligned} \inf _{\begin{array}{c} M: \; \text {Rank }\,M \le 2r \\ \Vert M\Vert _F = 1 \end{array}} \mathbb {P}(|p^\top M q| \ge \mu _0) \ge p_0. \end{aligned}$$
These conditions immediately imply Assumptions C-F. Indeed, by Markov’s inequality, in the case of matrix sensing we deduce
The same reasoning applies to all the other problems.
Matrix sensing Consider any matrix M with \(\Vert M\Vert _F =1.\) Then, since \(g := \langle P, M\rangle \) follows a standard normal distribution, we may set \(\mu _0\) to be the median of |g| and \(p_0= 1/2\) to obtain
Quadratic Sensing I Fix a matrix M with \(\text {Rank }\,M \le 2r\) and \(\Vert M\Vert _F=1\). Let \(M = UDU^\top \) be an eigenvalue decomposition of M. Using the rotational invariance of the Gaussian distribution, we deduce
where \({\mathop {=}\limits ^{ d }}\) denotes equality in distribution. Next, let z be a standard normal variable. We will now invoke Proposition F.2. Let \(C>0\) be the numerical constant appearing in the proposition. Notice that the function \(\phi :\mathbf{R}_+ \rightarrow \mathbf{R}\) given by
is continuous and strictly increasing, and it satisfies \(\phi (0)= 0\) and \(\lim _{t \rightarrow \infty } \phi (t) = 1.\) Hence, we may set \(\mu _0= \phi ^{-1}(\min \{1/2C,1/2\})\). Proposition F.2 then yields
By taking the supremum of both sides of the inequality we conclude that Assumption D holds with \(\mu _0\) and \(p_0= 1/2.\)
Quadratic sensing II Let \(M = UDU^\top \) be an eigenvalue decomposition of M. Using the rotational invariance of the Gaussian distribution, we deduce
where the last relation follows since \(\left( p_k - \tilde{p}_k\right) ,\left( p_k + \tilde{p}_k\right) \) are independent normal random variables with mean zero and variance two. We will now invoke Proposition F.2. Let \(C>0\) be the numerical constant appearing in the proposition. Let z and \( \tilde{z} \) be independent standard normal variables. Notice that the function \(\phi :\mathbf{R}_+ \rightarrow \mathbf{R}\) given by
is continuous, strictly increasing, satisfies \(\phi (0)= 0\) and approaches one at infinity. Defining \(\mu _0= \phi ^{-1}(\min \{1/2C,1/2\})\) and applying Proposition F.2, we get
By taking the supremum of both sides of the inequality we conclude that Condition E holds with \(\mu _0\) and \(p_0= 1/2.\)
We omit the details for the bilinear case, which follow by similar arguments.
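As a quick numerical illustration of the Quadratic Sensing I computation above, the Monte Carlo sketch below estimates \(\mathbb {P}(|p^\top Mp|\ge \mu _0)\) for a random rank-\(2r\) matrix with unit Frobenius norm. The threshold mu0 is an arbitrary illustrative value, not the constant \(\phi ^{-1}(\min \{1/2C,1/2\})\) supplied by Proposition F.2, and the dimensions are ours.

```python
import numpy as np

# Monte Carlo estimate of the small-ball probability
#   P( |p^T M p| >= mu0 )
# for a random symmetric rank-2r matrix M with ||M||_F = 1 and standard Gaussian p.
rng = np.random.default_rng(1)
d, r, mu0, trials = 100, 2, 0.25, 20000

U = np.linalg.qr(rng.standard_normal((d, 2 * r)))[0]   # d x 2r orthonormal columns
sig = rng.standard_normal(2 * r)
sig /= np.linalg.norm(sig)                             # ensures ||M||_F = 1
M = U @ np.diag(sig) @ U.T                             # symmetric, rank <= 2r

P = rng.standard_normal((trials, d))
vals = np.einsum("ij,jk,ik->i", P, M, P)               # p^T M p for each sample
print((np.abs(vals) >= mu0).mean())                    # empirical small-ball probability
```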
1.2 B.2 Proof of Theorem 6.4
The proofs in this section rely on the following proposition, which shows that pointwise concentration implies uniform concentration. We defer the proof to Appendix B.3.
Proposition B.1
Let \(\mathcal {A}: \mathbf{R}^{d_1 \times d_2} \rightarrow \mathbf{R}^m\) be a random linear mapping with the property that for any fixed matrix \(M \in \mathbf{R}^{d_1 \times d_2}\) of rank at most 2r with norm \(\Vert M\Vert _F =1\) and any fixed subset of indices \(\mathcal {I}\subseteq \{1, \dots , m\}\) satisfying \(|\mathcal {I}| < m/2\), the following hold:
(1) The measurements \(\mathcal {A}(M)_1, \dots , \mathcal {A}(M)_m\) are i.i.d.
(2) RIP holds in expected value:
$$\begin{aligned} \alpha \le \mathbb {E}| \mathcal {A}(M)_i | \le \beta (r) \qquad \text {for all } i \in \{1, \dots , m\} \end{aligned}$$ (B.1)
where \(\alpha > 0\) is a universal constant and \(\beta \) is a positive-valued function that could potentially depend on the rank of M.
(3) There exist a universal constant \(K>0\) and a positive-valued function c(m, r) such that for any \(t \in [0, K]\) the deviation bound
$$\begin{aligned} \frac{1}{m}\left| \Vert \mathcal {A}_{\mathcal {I}^c} (M)\Vert _1 - \Vert \mathcal {A}_{\mathcal {I}} (M)\Vert _1 - \mathbb {E}\big [\Vert \mathcal {A}_{\mathcal {I}^c} (M)\Vert _1 - \Vert \mathcal {A}_{\mathcal {I}} (M)\Vert _1\big ] \right| \le t \end{aligned}$$ (B.2)
holds with probability at least \(1-2\exp (-t^2c(m,r)).\)
Then, there exist universal constants \(c_1, \dots , c_6 > 0\) depending only on \(\alpha \) and K such that if \(\mathcal {I}\subseteq \{1, \dots , m\}\) is a fixed subset of indices satisfying \(|\mathcal {I}| < m/2\) and
then with probability at least \(1-4\exp \left( -c_3(1-2|\mathcal {I}|/m)^2 c(m,r)\right) \) every matrix \(M \in \mathbf{R}^{d_1 \times d_2}\) of rank at most 2r satisfies
and
Due to scale invariance of the above result, we need only verify it in the case that \(\Vert M\Vert _F = 1\). We implicitly use this observation below.
1.2.1 B.2.1 Part 1 of Theorem 6.4 (Matrix sensing)
Lemma B.2
The random variable \(|\langle P, M\rangle |\) is sub-Gaussian with parameter \(C\eta .\) Consequently,
Moreover, there exists a universal constant \(c> 0\) such that for any \(t \in [0, \infty )\) the deviation bound
holds with probability at least \(1-2\exp \left( - \frac{ct^2}{\eta ^2}m\right) .\)
Proof
Condition C immediately implies the lower bound in (B.5). To prove the upper bound, first note that by assumption we have
This bound has two consequences: first, \(\langle P, M\rangle \) is a sub-Gaussian random variable with parameter \(\eta \); second, \(\mathbb {E}|\langle P,M\rangle | \lesssim \eta \) [79, Proposition 2.5.2]. Thus, we have proved (B.5).
To prove the deviation bound (B.6), we introduce the random variables
Since \(|\langle P_i, M\rangle |\) is sub-Gaussian, we have \(\Vert Y_i\Vert _{\psi _2} \lesssim \eta \) for all i, see [79, Lemma 2.6.8]. Hence, Hoeffding’s inequality for sub-Gaussian random variables [79, Theorem 2.6.2] gives the desired upper bound on \(\mathbb {P}\left( \frac{1}{m} \left| \sum _{i=1}^m Y_i \right| \ge t \right) .\) \(\square \)
Applying Proposition B.1 with \(\beta (r) \asymp \eta \) and \(c(m,r) \asymp m/\eta ^2\) now yields the result. \(\square \)
1.2.2 B.2.2 Part 2 of Theorem 6.4 (Quadratic sensing I)
Lemma B.3
The random variable \(|p^\top M p|\) is sub-exponential with parameter \(\sqrt{2r} \eta ^2.\) Consequently,
Moreover, there exists a universal constant \(c> 0\) such that for any \(t \in [0, \sqrt{2r} \eta ]\) the deviation bound
holds with probability at least \(1-2\exp \left( - \frac{ct^2}{\eta ^4}m/r\right) .\)
Proof
Condition D gives the lower bound in (B.7). To prove the upper bound, first note that \(M = \sum _{k=1}^{2r} \sigma _k u_k u_k^\top \) where \(\sigma _k\) and \(u_k\) are the kth singular values and vectors of M, respectively. Hence,
where the first inequality follows since \(\Vert \cdot \Vert _{\psi _1}\) is a norm, the second one follows since \(\Vert XY\Vert _{\psi _1} \le \Vert X\Vert _{\psi _2}\Vert Y\Vert _{\psi _2}\) [79, Lemma 2.7.7], and the third inequality holds since \(\Vert \sigma \Vert _1 \le \sqrt{2r}\Vert \sigma \Vert _2\). This bound has two consequences: first, \(p^\top M p\) is a sub-exponential random variable with parameter \(\sqrt{r} \eta ^2\); second, \(\mathbb {E}p^\top M p\le \sqrt{2r} \eta ^2\) [79, Exercise 2.7.2]. Thus, we have proved (B.7).
To prove the deviation bound (B.8), we introduce the random variables
Since \(p^\top M p\) is sub-exponential, we have \(\Vert Y_i\Vert _{\psi _1} \lesssim \sqrt{r} \eta ^2\) for all i, see [79, Exercise 2.7.10]. Hence, Bernstein’s inequality for sub-exponential random variables [79, Theorem 2.8.2] gives the desired upper bound on \(\mathbb {P}\left( \frac{1}{m} \left| \sum _{i=1}^m Y_i \right| \ge t \right) .\) \(\square \)
Applying Proposition B.1 with \(\beta (r) \asymp \sqrt{r}\eta ^2\) and \(c(m,r) \asymp m/{\eta ^4}r\) now yields the result. \(\square \)
1.2.3 B.2.3 Part 3 of Theorem 6.4 (Quadratic sensing II)
Lemma B.4
The random variable \(|p^\top M p- \tilde{p}^\top M \tilde{p}|\) is sub-exponential with parameter \(C\eta ^2.\) Consequently,
Moreover, there exists a universal constant \(c> 0\) such that for any \(t \in [0, \eta ^2]\) the deviation bound
holds with probability at least \(1-2\exp \left( - \frac{ct^2}{\eta ^4}m\right) .\)
Proof
Condition E implies the lower bound in (B.9). To prove the upper bound, we will show that \(\Vert |p^\top M p- \tilde{p}^\top M \tilde{p}|\Vert _{\psi _1} \le \eta ^2\). By definition of the Orlicz norm, \(\Vert |X|\Vert _{\psi _1} = \Vert X\Vert _{\psi _1}\) for any random variable X; hence, without loss of generality, we may remove the absolute value. Recall that \(M = \sum _{k=1}^{2r} \sigma _k u_k u_k^\top \) where \(\sigma _k\) and \(u_k\) are the kth singular values and vectors of M, respectively. Hence, the random variable of interest can be rewritten as
By assumption, the random variables \(\langle u_k, p\rangle \) are \(\eta \)-sub-Gaussian; this implies that \(\langle u_k,p\rangle ^2\) are \(\eta ^2\)-sub-exponential, since \(\Vert \langle u_k, p\rangle ^2\Vert _{\psi _1} \le \Vert \langle u_k, p\rangle \Vert _{\psi _2}^2\).
Recall the following characterization of the Orlicz norm for mean-zero random variables
where \(Q \asymp \tilde{Q}\); see [79, Proposition 2.7.1]. To prove that the random variable (B.11) is sub-exponential, we will exploit this characterization. Since each inner product squared \(\langle u_k,p\rangle ^2\) is sub-exponential, the equivalence implies the existence of a constant \(c>0\) for which the uniform bound
holds. Let \(\lambda \) be an arbitrary scalar with \(|\lambda |\le 1/c\eta ^4\), then by expanding the moment generating function of (B.11) we get
where the inequality follows by (B.13) and the last relation follows since \(\sigma \) is unit norm. Combining this with (B.12) gives
This bound has two consequences: first, \(|p^\top M p- \tilde{p}^\top M \tilde{p}|\) is a sub-exponential random variable with parameter \(C\eta ^2\); second, \(\mathbb {E}|p^\top M p- \tilde{p}^\top M \tilde{p}| \le C \eta ^2\) [79, Exercise 2.7.2]. Thus, we have proved (B.9).
To prove the deviation bound (B.10) we introduce the random variables
The sub-exponentiality of \(\mathcal {A}(M)_i\) implies \(\Vert Y_i\Vert _{\psi _1} \lesssim \eta ^2\) for all i, see [79, Exercise 2.7.10]. Hence, Bernstein’s inequality for sub-exponential random variables [79, Theorem 2.8.2] gives the desired upper bound on \(\mathbb {P}\left( \frac{1}{m} \left| \sum _{i=1}^m Y_i \right| \ge t \right) .\) \(\square \)
Applying Proposition B.1 with \(\beta (r) \asymp \eta ^2\) and \(c(m,r) \asymp m/{\eta ^4}\) now yields the result. \(\square \)
1.2.4 B.2.4 Part 4 of Theorem 6.4 (Bilinear sensing)
Lemma B.5
The random variable \(|p^\top M q|\) is sub-exponential with parameter \(C\eta ^2.\) Consequently,
Moreover, there exists a universal constant \(c> 0\) such that for any \(t \in [0, \eta ^2]\) the deviation bound
holds with probability at least \(1-2\exp \left( - \frac{ct^2}{\eta ^4}m\right) .\)
Proof
As before, the lower bound in (B.14) is implied by Condition F. To prove the upper bound, we will show that \(\Vert |p^\top M q|\Vert _{\psi _1} \le \eta ^2\). By definition of the Orlicz norm, \(\Vert |X|\Vert _{\psi _1} = \Vert X\Vert _{\psi _1}\) for any random variable X; hence we may remove the absolute value. Recall that \(M = \sum _{k=1}^{2r} \sigma _k u_k v_k^\top \) where \(\sigma _k\) and \((u_k, v_k)\) are the kth singular values and vectors of M, respectively. Hence, the random variable of interest can be rewritten as
By assumption, the random variables \(\langle p, u_k\rangle \) and \(\langle v_k,q\rangle \) are \(\eta \)-sub-Gaussian; this implies that \(\langle p,u_k\rangle \langle v_k,q\rangle \) are \(\eta ^2\)-sub-exponential.
To prove that the random variable (B.16) is sub-exponential, we will again use (B.12). Since each random variable \(\langle p,u_k\rangle \langle v_k,q\rangle \) is sub-exponential, the equivalence implies the existence of a constant \(c>0\) for which the uniform bound
holds. Let \(\lambda \) be an arbitrary scalar with \(|\lambda |\le 1/c\eta ^4\), then by expanding the moment generating function of (B.16) we get
where the inequality follows by (B.17) and the last relation follows since \(\sigma \) has unit norm. Combining this with (B.12) gives
Thus, we have proved (B.14).
Once again, to show the deviation bound (B.15) we introduce the random variables
and apply Bernstein’s inequality for sub-exponential random variables [79, Theorem 2.8.2] to get the stated upper bound on \(\mathbb {P}\left( \frac{1}{m} \left| \sum _{i=1}^m Y_i \right| \ge t \right) .\) \(\square \)
Applying Proposition B.1 with \(\beta (r) \asymp \eta ^2\) and \(c(m,r) \asymp m/{\eta ^4}\) now yields the result. \(\square \)
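For readers who wish to experiment with the four measurement ensembles treated in this section, the following hypothetical data-generation sketch implements the Gaussian models \(\langle P_i,M\rangle \), \(p_i^\top Mp_i\), \(p_i^\top Mp_i-\tilde{p}_i^\top M\tilde{p}_i\), and \(p_i^\top Mq_i\); the function name and interface are ours, not the paper's.

```python
import numpy as np

def measurements(M, m, model, rng):
    """Gaussian measurement models of Sect. 6 (illustrative sketch only).
    M is d1 x d2; for the quadratic models it should be symmetric with d1 = d2."""
    d1, d2 = M.shape
    if model == "matrix":        # <P_i, M>
        P = rng.standard_normal((m, d1, d2))
        return np.einsum("ijk,jk->i", P, M)
    if model == "quadratic1":    # p_i^T M p_i
        p = rng.standard_normal((m, d1))
        return np.einsum("ij,jk,ik->i", p, M, p)
    if model == "quadratic2":    # p_i^T M p_i - ptilde_i^T M ptilde_i
        p = rng.standard_normal((m, d1))
        pt = rng.standard_normal((m, d1))
        return (np.einsum("ij,jk,ik->i", p, M, p)
                - np.einsum("ij,jk,ik->i", pt, M, pt))
    if model == "bilinear":      # p_i^T M q_i
        p = rng.standard_normal((m, d1))
        q = rng.standard_normal((m, d2))
        return np.einsum("ij,jk,ik->i", p, M, q)
    raise ValueError(model)
```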
1.3 B.3 Proof of Proposition B.1
Choose \(\epsilon \in (0,\sqrt{2})\) and let \(\mathcal {N}\) be the (\(\epsilon /\sqrt{2}\))-net guaranteed by Lemma F.1. Pick some \(t \in (0,K]\) so that (B.2) can hold; we will fix the value of this parameter later in the proof. Let \(\mathcal {E}\) denote the event that the following two estimates hold for all matrices \(M\in \mathcal {N}\):
Throughout the proof, we will assume that the event \(\mathcal {E}\) holds. We will estimate the probability of \(\mathcal {E}\) at the end of the proof. Meanwhile, seeking to establish RIP, define the quantity
We aim first to provide a high probability bound on \(c_2\).
Let \(M \in S_{2r}\) be arbitrary and let \(M_\star \) be the closest point to M in \(\mathcal {N}\). Then, we have
where (B.20) follows from (B.19) and (B.21) follows from the triangle inequality. To simplify the third term in (B.21), using SVD, we deduce that there exist two mutually orthogonal matrices \(M_1, M_2\) of rank at most 2r satisfying \(M - M_\star = M_1+M_2.\) With this decomposition in hand, we compute
where the second inequality follows from the definition of \(c_2\) and the estimate \(\Vert M_1\Vert _F + \Vert M_2\Vert _F \le \sqrt{2} \Vert (M_1, M_2)\Vert _F = \sqrt{2} \Vert M_1 + M_2\Vert _F.\) Thus, we arrive at the bound
As M was arbitrary, we may take the supremum of both sides of the inequality, yielding \(c_2\le \frac{1}{m}\sup _{M \in S_{2r}}\mathbb {E}\Vert \mathcal {A}(M)\Vert _1 + t+ 2c_2 \epsilon \). Rearranging yields the bound
Assuming that \(\epsilon \le 1/4\), we further deduce that
establishing that the random variable \(c_2\) is bounded by \(\bar{\sigma }\) in the event \(\mathcal {E}\).
Now let \(\hat{\mathcal {I}}\) denote either \(\hat{\mathcal {I}}=\emptyset \) or \(\hat{\mathcal {I}}=\mathcal {I}\). We now provide a uniform lower bound on \(\frac{1}{m}\Vert \mathcal {A}_{\hat{\mathcal {I}}^c }(M)\Vert _1 - \frac{1}{m}\Vert \mathcal {A}_{\hat{\mathcal {I}} }(M)\Vert _1\). Indeed,
where (B.25) uses the forward and reverse triangle inequalities, (B.26) follows from (B.18), the estimate (B.27) follows from the forward and reverse triangle inequalities, and (B.28) follows from (B.22) and (B.24). Switching the roles of \(\mathcal {I}\) and \(\mathcal {I}^c\) in the above sequence of inequalities, and choosing \(\epsilon = t/4\bar{\sigma }\), we deduce
In particular, setting \(\hat{\mathcal {I}}=\emptyset \), we deduce
and therefore using (B.1), we conclude the RIP property
Next, let \(\hat{\mathcal {I}} = \mathcal {I}\) and note that
where the equality follows by assumption (1). Therefore, every \(M\in S_{2r}\) satisfies
Setting \(t=\frac{2}{3}\min \{\alpha , \alpha (1-2|\mathcal {I}|/m)/2\} = \frac{1}{3}\alpha (1-2|\mathcal {I}|/m)\) in (B.29) and (B.30), we deduce the claimed estimates (B.3) and (B.4). Finally, let us estimate the probability of \(\mathcal {E}\). Using the union bound and Lemma F.1 yields
where c(m, r) is the function guaranteed by assumption (3).
By (B.1), we get \(1/\epsilon = 4\bar{\sigma }/t \lesssim 2 + \beta (r)/(1 - 2|\mathcal {I}|/m)\). Then, we deduce
Hence, as long as \(c(m,r)\ge \frac{9c_1(d_1+d_2+1)r^2\ln \left( c_2+\frac{c_2\beta (r)}{1-2|\mathcal {I}|/m}\right) }{\alpha ^2 \left( 1-\frac{2|\mathcal {I}|}{m}\right) ^2}\), we can be sure
This proves the desired result. \(\square \)
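As an informal sanity check of the \(\ell _1/\ell _2\) RIP conclusion for the matrix sensing model, one can sample a few random rank-\(2r\) matrices and inspect the ratio \(\frac{1}{m}\Vert \mathcal {A}(M)\Vert _1/\Vert M\Vert _F\); for Gaussian \(P_i\) it should concentrate near \(\sqrt{2/\pi }\). The sketch below samples test matrices rather than covering all of \(S_{2r}\), so it only illustrates the statement of Proposition B.1 and does not verify it; all sizes are arbitrary.

```python
import numpy as np

# Empirical check of the ratio (1/m) * ||A(M)||_1 / ||M||_F for matrix sensing
# over a handful of random rank-2r matrices.
rng = np.random.default_rng(2)
d1, d2, r, m = 30, 40, 2, 2000
P = rng.standard_normal((m, d1, d2))        # i.i.d. Gaussian measurement matrices

ratios = []
for _ in range(20):
    L = rng.standard_normal((d1, 2 * r))
    R = rng.standard_normal((d2, 2 * r))
    M = L @ R.T                              # random matrix of rank <= 2r
    AM = np.einsum("ijk,jk->i", P, M)        # measurements <P_i, M>
    ratios.append(np.abs(AM).mean() / np.linalg.norm(M, "fro"))
print(min(ratios), max(ratios))              # should concentrate around sqrt(2/pi)
```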
C Proofs in Sect. 7
1.1 C.1 Proof of Lemma 7.4
Define \(P(x,y)=a\Vert y-x\Vert ^2_2+b\Vert y-x\Vert _2\). Fix an iteration k and choose \(x^*\in \mathrm {proj}_{\mathcal {X}^*}(x_k)\). Then, the estimate holds:
Rearranging and using the sharpness and approximation accuracy assumptions, we deduce
The result follows.
1.2 C.2 Proof of Theorem 7.6
First notice that for any y, we have \(\partial f(y) = \partial f_y(y)\). Therefore, since \(f_y\) is a convex function, we have that for all \(x, y \in \mathcal {X}\) and \(v \in \partial f(y)\), the bound
Consequently, given that \(\mathrm{dist}(x_i,\mathcal {X}^*)\le \gamma \cdot \frac{\mu - 2b}{2a}\), we have
Here, the estimate (C.2) follows from the fact that the projection \(\mathrm {proj}_\mathcal {X}(\cdot )\) is nonexpansive, (C.3) uses the bound in (C.1), (C.5) follows from the estimate \(\mathrm{dist}(x_i,\mathcal {X}^*)\le \gamma \cdot \frac{\mu - 2b}{2a}\), while (C.4) and (C.6) use local sharpness. The result then follows by the upper bound \(\Vert \zeta _i\Vert \le L\).
D Proofs in Sect. 8
1.1 D.1 Proof of Lemma 8.1
The inequality can be established using an argument similar to that for bounding the \( T_3 \) term in [27, Section 6.6]. We provide the proof below for completeness. Define the shorthand \( \varDelta _S := S-S_{\sharp }\) and \( \varDelta _X = X- X_{\sharp } \), and let \( e_j \in \mathbb {R}^d\) denote the j-th standard basis vector of \( \mathbb {R}^d \). Simple algebra gives
We claim that \( \Vert \varDelta _S e_j \Vert _1 \le 2\sqrt{k} \Vert \varDelta _S e_j \Vert _2\) for each \( j\in [d] \). To see this, fix any \( j\in [d] \) and let \( v := Se_j \), \( v^* := S_\sharp e_j \), and \( T := \text {support}(v^*). \) We have
Rearranging terms gives \( \Vert (v - v^*)_{T^c} \Vert _1 \le \Vert (v - v^*)_T \Vert _1 \), whence
where the second inequality holds because \( |T| \le k \) by assumption. The claim follows from noting that \( v-v^* = \varDelta _S e_j \).
Using the claim, we get that
Using a similar argument and the fact that \( \Vert \varDelta _X \Vert _{2,\infty } \le \Vert X\Vert _{2,\infty } + \Vert X_{\sharp }\Vert _{2,\infty } \le 3\sqrt{\frac{\nu r}{d}} \), we obtain
Putting everything together, we have
The claim follows.
1.2 D.2 Proof of Theorem 8.6
Without loss of generality, suppose that x is closer to \(\bar{x}\) than to \(-\bar{x}\). Consider the following expression:
We now produce a few different lower bounds by testing against different V. In what follows, we set \(a = \sqrt{2} - 1\), i.e., the positive solution of the equation \(1-a^2 = 2a\).
Case 1: Suppose that
Then, set \(\bar{V} = \mathrm {sign}((x - \bar{x} )^\top \mathrm {sign}(\bar{x})) \cdot \mathrm {sign}(\bar{x})\mathrm {sign}(\bar{x})^\top \), to get
Case 2: Suppose that
Then, set \(\bar{V} = \mathrm {sign}(\mathrm {sign}(x - \bar{x} )^\top \bar{x}) \cdot \mathrm {sign}( x - \bar{x})\mathrm {sign}( x - \bar{x})^\top \), to get
Case 3: Suppose that
Define \(\bar{V} = \frac{1}{2}(\mathrm {sign}(\bar{x}(x - \bar{x})^\top ) + \mathrm {sign}((x - \bar{x}) \bar{x}^\top ))\). Observe that
and
Putting these two bounds together, we find that
Altogether, we find that
as desired.
1.3 D.3 Proof of Lemma 8.8
We start by stating a claim we will use to prove the lemma. Let us introduce some notation. Consider the set
Define the random variable
Claim
There exist constants \( c_2, c_3 > 0\) such that with probability at least \(1-\exp (-c_2 \log d)\)
Before proving this claim, let us show how it implies the lemma. Let
Set \(\varDelta _- = X - X_\sharp R\) and \(\varDelta _+ = X + X_\sharp R\). Notice that
Therefore, because \((\varDelta _+, \varDelta _-) \in S\) and
we have that
where the last line follows by Conjecture 8.7. This proves the desired result.
Proof of the Claim
Our goal is to show that the random variable Z is highly concentrated around its mean. We may apply the standard symmetrization inequality [7, Lemma 11.4] to bound the expectation \(\mathbb {E}Z\) as follows:
Observing that \(T_1\) and \(T_2\) can both be bounded by
where the final inequality follows from Bernstein’s inequality and a union bound, we find that
To prove that Z is well concentrated around \( \mathbb {E}Z\), we apply Theorem F.3. To that end, we set \(\mathcal {S}= S\) and define the collection \((Z_{ij,s})_{ij, s\in \mathcal {S}}\), where \(s = (\varDelta _+, \varDelta _-)\), by
We also bound
and
Therefore, by Theorem F.3, there exist constants \(c_1, c_2, c_3 > 0\) so that, with \(t = c_2 \log d\), with probability at least \(1-e^{-c_2\log d}\) the random variable Z is upper bounded by
where the last line follows since by assumption \(\log d / d \lesssim \tau .\) \(\square \)
E Proofs in Sect. 9
1.1 E.1 Proof of Lemma 9.1
The proof follows the same strategy as [32, Theorem 6.1]. Fix \(x \in \widetilde{\mathcal {T}}_1\) and let \(\zeta \in \partial \tilde{f}(x)\). Then, for all y, we have, from Lemma 9.3, that
Therefore, the function
satisfies
Now, for some \(\gamma > 0\) to be determined momentarily, define
First-order optimality conditions and the sum rule immediately imply that
Thus,
Now we estimate \(\Vert x - \hat{x}\Vert _2\). Indeed, from the definition of \(\hat{x}\) we have
Consequently, we have \(\Vert x - \hat{x}\Vert \le 2\gamma \). Thus, setting \(\gamma = \sqrt{2\varepsilon /\rho }\) and recalling that \(\varepsilon \le \mu ^2/56\rho \) we find that
Likewise, we have
Therefore, setting \(L = \sup \left\{ \Vert \zeta \Vert _2:\zeta \in \partial f(x), \mathrm{dist}(x, \mathcal {X}^*) \le \frac{\mu }{\rho }, \mathrm{dist}(x, \mathcal {X}) \le 2\sqrt{\frac{\varepsilon }{\rho }}\right\} \), we find that
as desired.
1.2 E.2 Proof of Theorem 9.4
Let \(i \ge 0\), suppose \(x_i \in \widetilde{\mathcal {T}}_1\), and let \(x^*\in \mathrm {proj}_{\mathcal {X}^*}(x_i)\). Notice that Lemma 9.2 implies \(\tilde{f}(x_i)-\min _{\mathcal {X}}f>0\). We successively compute
Here, the estimate (E.1) follows from the fact that the projection \(\mathrm {proj}_Q(\cdot )\) is nonexpansive, (E.2) uses Lemma 9.3, the estimate (E.4) follows from the assumption \(\epsilon <\frac{\mu }{14}\Vert x_i-x^*\Vert \), the estimate (E.5) follows from the estimate \(\Vert x_i-x^*\Vert \le \frac{\mu }{4\rho }\), while (E.3) and (E.6) use Lemma 9.2. We therefore deduce
Consequently, either we have \(\mathrm{dist}(x_{i+1}, \mathcal {X}^*) < \frac{14\varepsilon }{\mu }\) or \(x_{i+1} \in \widetilde{\mathcal {T}}_1\). Therefore, by induction, the proof is complete.
1.3 E.3 Proof of Theorem 9.6
Let \(i \ge 0\), suppose \(x_i \in \mathcal {T}_\gamma \), and let \(x^*\in \mathrm {proj}_{\mathcal {X}^*}(x_i)\). Then,
Rearranging yields the result.
F Auxiliary Lemmas
Lemma F.1
(Lemma 3.1 in [13]) Let \(S_r := \left\{ X \in \mathbf{R}^{d_1 \times d_2} \mid \text {Rank }\,(X) \le r, \left\| X \right\| _F = 1\right\} \). There exists an \(\epsilon \)-net \(\mathcal {N}\) (with respect to \(\Vert \cdot \Vert _F\)) of \(S_r\) obeying
Proposition F.2
(Corollary 1.4 in [75]) Consider real-valued random variables \(X_1, \dots , X_d\) and let \(\sigma \in \mathbb {S}^{d-1}\) be a unit vector. Let \(t, p > 0\) be such that
Then, the following holds
where \(C > 0\) is a universal constant.
Theorem F.3
(Talagrand’s Functional Bernstein for non-identically distributed variables [53, Theorem 1.1(c)]) Let \(\mathcal {S}\) be a countable index set. Let \(Z_{1},\ldots ,Z_{n}\) be independent vector-valued random variables of the form \(Z_{i}=(Z_{i,s})_{s\in \mathcal {S}}\). Let \(Z:=\sup _{s\in \mathcal {S}}\sum _{i=1}^{n}Z_{i,s}\). Assume that for all \(i\in [n]\) and \(s\in \mathcal {S}\), \(\mathbb {E}Z_{i,s}=0\) and \(\left| Z_{i,s}\right| \le b\). Let
Then, for each \(t>0\), we have the tail bound
Cite this article
Charisopoulos, V., Chen, Y., Davis, D. et al. Low-Rank Matrix Recovery with Composite Optimization: Good Conditioning and Rapid Convergence. Found Comput Math 21, 1505–1593 (2021). https://doi.org/10.1007/s10208-020-09490-9
Keywords
- Restricted isometry property
- Matrix sensing
- Matrix completion
- Low-rank matrix recovery
- Subgradient
- Prox-linear algorithms