Abstract
The task of recovering a low-rank matrix from its noisy linear measurements plays a central role in computational science. Smooth formulations of the problem often exhibit an undesirable phenomenon: the condition number, classically defined, scales poorly with the dimension of the ambient space. In contrast, we here show that in a variety of concrete circumstances, nonsmooth penalty formulations do not suffer from the same type of ill-conditioning. Consequently, standard algorithms for nonsmooth optimization, such as subgradient and prox-linear methods, converge at a rapid dimension-independent rate when initialized within constant relative error of the solution. Moreover, nonsmooth formulations are naturally robust against outliers. Our framework subsumes such important computational tasks as phase retrieval, blind deconvolution, quadratic sensing, matrix completion, and robust PCA. Numerical experiments on these problems illustrate the benefits of the proposed approach.












Notes
Here, the subdifferential is formally obtained through the chain rule \(\partial f(x)=\nabla F(x)^*\partial h(F(x))\), where \(\partial h(\cdot )\) is the subdifferential in the sense of convex analysis.
Both the parameters \(\alpha _t\) and \(\beta \) must be properly chosen for these guarantees to take hold.
The authors of [57] provide a bound on L that scales with the Frobenius norm \(\sqrt{\Vert M_{\sharp }\Vert _F}\). We instead derive a sharper bound that scales as \(\sqrt{\Vert M_{\sharp }\Vert _\mathrm{op}}\). As a by-product, the linear rate of convergence for the subgradient method scales only with the condition number \(\sigma _1(M_{\sharp })/\sigma _r(M_{\sharp })\) instead of \(\Vert M_{\sharp }\Vert _F/\sigma _r(M_{\sharp })\).
The guarantees we develop in the symmetric setting are similar to those in the recent preprint [57], albeit we obtain a sharper bound on L; the two sets of results were obtained independently. The guarantees for the asymmetric setting are different and are complementary to each other: we analyze the conditioning of the basic problem formulation (1.2), while [57] introduces a regularization term \( \Vert X^\top X - YY^\top \Vert _F\) that improves the basin of attraction for the subgradient method by a factor of the condition number of \(M_{\sharp }\).
In the latter case, RIP is also called restricted uniform boundedness (RUB) [10].
with
By this we mean that the vectorized matrix \(\mathbf {vec}(P)\) is an \(\eta \)-sub-Gaussian random vector.
Recall that \(\Vert X\Vert _{2,\infty } = \max _{i \in [d]} \Vert X_{i \cdot }\Vert _2\) is the maximum row norm.
References
Ahmed, A., Recht, B., Romberg, J.: Blind deconvolution using convex programming. IEEE Transactions on Information Theory 60(3), 1711–1732 (2014)
Albano, P., Cannarsa, P.: Singularities of semiconcave functions in Banach spaces. In: Stochastic analysis, control, optimization and applications, Systems Control Found. Appl., pp. 171–190. Birkhäuser Boston, Boston, MA (1999)
Balcan, M.F., Liang, Y., Song, Z., Woodruff, D.P., Zhang, H.: Non-convex matrix completion and related problems via strong duality. Journal of Machine Learning Research 20(102), 1–56 (2019)
Bauch, J., Nadler, B.: Rank \(2r\) iterative least squares: efficient recovery of ill-conditioned low rank matrices from few entries. arXiv preprint arXiv:2002.01849 (2020)
Bhojanapalli, S., Neyshabur, B., Srebro, N.: Global optimality of local search for low rank matrix recovery. In: Advances in Neural Information Processing Systems, pp. 3873–3881 (2016)
Borwein, J., Lewis, A.: Convex analysis and nonlinear optimization. CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC, 3. Springer-Verlag, New York (2000). Theory and examples
Boucheron, S., Lugosi, G., Massart, P.: Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press (2013)
Burke, J.: Descent methods for composite nondifferentiable optimization problems. Math. Programming 33(3), 260–279 (1985). 10.1007/BF01584377.
Burke, J., Ferris, M.: A Gauss-Newton method for convex composite optimization. Math. Programming 71(2, Ser. A), 179–194 (1995). https://doi.org/10.1007/BF01585997.
Cai, T., Zhang, A.: ROP: matrix recovery via rank-one projections. Ann. Stat. 43(1), 102–138 (2015). 10.1214/14-AOS1267.
Candès, E., Eldar, Y., Strohmer, T., Voroninski, V.: Phase retrieval via matrix completion. SIAM J. Imaging Sci. 6(1), 199–225 (2013). 10.1137/110848074
Candès, E., Li, X., Soltanolkotabi, M.: Phase retrieval via Wirtinger flow: theory and algorithms. IEEE Trans. Inform. Theory 61(4), 1985–2007 (2015). 10.1109/TIT.2015.2399924
Candes, E., Plan, Y.: Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Transactions on Information Theory 57(4), 2342–2359 (2011)
Candes, E., Strohmer, T., Voroninski, V.: Phaselift: Exact and stable signal recovery from magnitude measurements via convex programming. Communications on Pure and Applied Mathematics 66(8), 1241–1274 (2013)
Candès, E.J., Li, X., Ma, Y., Wright, J.: Robust principal component analysis? Journal of the ACM (JACM) 58(3), 1–37 (2011)
Candès, E.J., Recht, B.: Exact matrix completion via convex optimization. Foundations of Computational Mathematics 9(6), 717 (2009). 10.1007/s10208-009-9045-5
Candès, E.J., Tao, T.: The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory 56(5), 2053–2080 (2010)
Chandrasekaran, V., Sanghavi, S., Parrilo, P.A., Willsky, A.S.: Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization 21(2), 572–596 (2011). 10.1137/090761793
Charisopoulos, V., Davis, D., Díaz, M., Drusvyatskiy, D.: Composite optimization for robust blind deconvolution. arXiv:1901.01624 (2019)
Chen, Y.: Incoherence-optimal matrix completion. IEEE Transactions on Information Theory 61(5), 2909–2923 (2015)
Chen, Y., Candès, E.: Solving random quadratic systems of equations is nearly as easy as solving linear systems. Comm. Pure Appl. Math. 70(5), 822–883 (2017)
Chen, Y., Chi, Y.: Harnessing structures in big data via guaranteed low-rank matrix estimation: Recent theory and fast algorithms via convex and nonconvex optimization. IEEE Signal Processing Magazine 35(4), 14–31 (2018)
Chen, Y., Chi, Y., Fan, J., Ma, C.: Gradient descent with random initialization: fast global convergence for nonconvex phase retrieval. Mathematical Programming (2019). https://doi.org/10.1007/s10107-019-01363-6
Chen, Y., Chi, Y., Goldsmith, A.: Exact and stable covariance estimation from quadratic sampling via convex programming. IEEE Trans. Inform. Theory 61(7), 4034–4059 (2015). 10.1109/TIT.2015.2429594
Chen, Y., Fan, J., Ma, C., Yan, Y.: Bridging Convex and Nonconvex Optimization in Robust PCA: Noise, Outliers, and Missing Data. arXiv e-prints arXiv:2001.05484 (2020)
Chen, Y., Jalali, A., Sanghavi, S., Caramanis, C.: Low-rank matrix recovery from errors and erasures. IEEE Transactions on Information Theory 59(7), 4324–4337 (2013)
Chen, Y., Wainwright, M.J.: Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. arXiv:1509.03025 (2015)
Chi, Y., Lu, Y., Chen, Y.: Nonconvex optimization meets low-rank matrix factorization: An overview. arXiv:1809.09573 (2018)
Davenport, M., Romberg, J.: An overview of low-rank matrix recovery from incomplete observations. IEEE J. Selected Top. Signal Process. 10(4), 608–622 (2016). 10.1109/JSTSP.2016.2539100
Davis, D., Drusvyatskiy, D.: Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization 29(1), 207–239 (2019)
Davis, D., Drusvyatskiy, D., MacPhee, K., Paquette, C.: Subgradient methods for sharp weakly convex functions. J. Optim. Theory Appl. 179(3), 962–982 (2018). 10.1007/s10957-018-1372-8
Davis, D., Drusvyatskiy, D., Paquette, C.: The nonsmooth landscape of phase retrieval. To appear in IMA J. Numer. Anal., arXiv:1711.03247 (2017)
Díaz, M.: The nonsmooth landscape of blind deconvolution. arXiv preprint arXiv:1911.08526 (2019)
Ding, L., Chen, Y.: Leave-one-out approach for matrix completion: Primal and dual analysis. IEEE Trans. Inf. Theory (2020). https://doi.org/10.1109/TIT.2020.2992769
Drusvyatskiy, D., Lewis, A.: Error bounds, quadratic growth, and linear convergence of proximal methods. Math. Oper. Res. 43(3), 919–948 (2018). 10.1287/moor.2017.0889
Drusvyatskiy, D., Paquette, C.: Efficiency of minimizing compositions of convex functions and smooth maps. Math. Prog. pp. 1–56 (2018)
Duchi, J., Ruan, F.: Solving (most) of a set of quadratic equalities: composite optimization for robust phase retrieval. IMA J. Inf. Inference (2018). https://doi.org/10.1093/imaiai/iay015
Duchi, J., Ruan, F.: Stochastic methods for composite and weakly convex optimization problems. SIAM J. Optim. 28(4), 3229–3259 (2018)
Eldar, Y., Mendelson, S.: Phase retrieval: stability and recovery guarantees. Appl. Comput. Harmon. Anal. 36(3), 473–494 (2014). 10.1016/j.acha.2013.08.003
Fazel, M.: Matrix rank minimization with applications. Ph.D. thesis, Stanford University (2002)
Fletcher, R.: A model algorithm for composite nondifferentiable optimization problems. Math. Programming Stud. (17), 67–76 (1982). https://doi.org/10.1007/bfb0120959. Nondifferential and variational techniques in optimization (Lexington, Ky., 1980)
Ge, R., Jin, C., Zheng, Y.: No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In: D. Precup, Y.W. Teh (eds.) Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 70, pp. 1233–1242. PMLR, International Convention Centre, Sydney, Australia (2017)
Ge, R., Lee, J.D., Ma, T.: Matrix completion has no spurious local minimum. In: D.D. Lee, M. Sugiyama, U.V. Luxburg, I. Guyon, R. Garnett (eds.) Advances in Neural Information Processing Systems 29, pp. 2973–2981. Curran Associates, Inc. (2016)
Goffin, J.: On convergence rates of subgradient optimization methods. Math. Programming 13(3), 329–347 (1977). 10.1007/BF01584346
Goldstein, T., Studer, C.: Phasemax: Convex phase retrieval via basis pursuit. IEEE Transactions on Information Theory 64(4), 2675–2689 (2018). 10.1109/TIT.2018.2800768
Gross, D.: Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory 57(3), 1548–1566 (2011)
Hardt, M.: Understanding alternating minimization for matrix completion. In: Proceedings of the 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, FOCS ’14, p. 651–660. IEEE Computer Society, USA (2014). https://doi.org/10.1109/FOCS.2014.75
Hardt, M., Wootters, M.: Fast matrix completion without the condition number. In: M.F. Balcan, V. Feldman, C. Szepesvári (eds.) Proceedings of The 27th Conference on Learning Theory, Proceedings of Machine Learning Research, vol. 35, pp. 638–678. PMLR, Barcelona, Spain (2014)
Hsu, D., Kakade, S.M., Zhang, T.: Robust matrix decomposition with sparse corruptions. IEEE Transactions on Information Theory 57(11), 7221–7234 (2011)
Jain, P., Netrapalli, P., Sanghavi, S.: Low-rank matrix completion using alternating minimization. In: Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, STOC ’13, p. 665–674. Association for Computing Machinery, New York, NY, USA (2013). https://doi.org/10.1145/2488608.2488693
Keshavan, R., Montanari, A., Oh, S.: Matrix completion from noisy entries. In: Y. Bengio, D. Schuurmans, J.D. Lafferty, C.K.I. Williams, A. Culotta (eds.) Advances in Neural Information Processing Systems 22, pp. 952–960. Curran Associates, Inc. (2009)
Keshavan, R.H., Montanari, A., Oh, S.: Matrix completion from a few entries. IEEE Transactions on Information Theory 56(6), 2980–2998 (2010)
Klein, T., Rio, E.: Concentration around the mean for maxima of empirical processes. The Annals of Probability 33(3), 1060–1077 (2005)
Lewis, A., Wright, S.: A proximal method for composite minimization. Math. Program. 158(1-2, Ser. A), 501–546 (2016). https://doi.org/10.1007/s10107-015-0943-9
Li, X.: Compressed sensing and matrix completion with constant proportion of corruptions. Constr. Approximation 37(1), 73–99 (2013)
Li, X., Ling, S., Strohmer, T., Wei, K.: Rapid, robust, and reliable blind deconvolution via nonconvex optimization. arXiv:1606.04933 (2016)
Li, X., Zhu, Z., So, A.C., Vidal, R.: Nonconvex robust low-rank matrix recovery. arXiv:1809.09237 (2018)
Li, Y., Ma, C., Chen, Y., Chi, Y.: Nonconvex matrix factorization from rank-one measurements. arXiv:1802.06286 (2018)
Li, Y., Sun, Y., Chi, Y.: Low-rank positive semidefinite matrix recovery from corrupted rank-one measurements. IEEE Transactions on Signal Processing 65(2), 397–408 (2016)
Ling, S., Strohmer, T.: Self-calibration and biconvex compressive sensing. Inverse Probl. 31(11), 115002, 31 (2015). https://doi.org/10.1088/0266-5611/31/11/115002
Ma, C., Wang, K., Chi, Y., Chen, Y.: Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval and matrix completion. In: J. Dy, A. Krause (eds.) Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 80, pp. 3345–3354. PMLR, Stockholmsmässan, Stockholm Sweden (2018)
Mendelson, S.: A remark on the diameter of random sections of convex bodies. In: Geometric aspects of functional analysis, Lecture Notes in Math., vol. 2116, pp. 395–404. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-09477-9_25
Mendelson, S.: Learning without concentration. J. ACM 62(3), Art. 21, 25 (2015). https://doi.org/10.1145/2699439
Mordukhovich, B.S.: Variational Analysis and Generalized Differentiation I: Basic Theory. Grundlehren der mathematischen Wissenschaften, Vol 330, Springer, Berlin (2006)
Negahban, S., Ravikumar, P., Wainwright, M., Yu, B.: A unified framework for high-dimensional analysis of \(M\)-estimators with decomposable regularizers. Statist. Sci. 27(4), 538–557 (2012). 10.1214/12-STS400
Netrapalli, P., Niranjan, U., Sanghavi, S., Anandkumar, A., Jain, P.: Non-convex robust PCA. In: Advances in Neural Information Processing Systems, pp. 1107–1115 (2014)
Nurminskii, E.: The quasigradient method for the solving of the nonlinear programming problems. Cybernetics 9(1), 145–150 (1973). 10.1007/BF01068677
Parikh, N., Boyd, S.: Block splitting for distributed optimization. Mathematical Programming Computation 6(1), 77–102 (2014)
Poliquin, R., Rockafellar, R.: Prox-regular functions in variational analysis. Trans. Amer. Math. Soc. 348, 1805–1838 (1996)
Recht, B.: A simpler approach to matrix completion. Journal of Machine Learning Research 12(104), 3413–3430 (2011)
Recht, B., Fazel, M., Parrilo, P.: Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev. 52(3), 471–501 (2010). 10.1137/070697835
Rockafellar, R.: Favorable classes of Lipschitz-continuous functions in subgradient optimization. In: Progress in nondifferentiable optimization, IIASA Collaborative Proc. Ser. CP-82, vol. 8, pp. 125–143. Int. Inst. Appl. Sys. Anal., Laxenburg (1982)
Rockafellar, R., Wets, R.B.: Variational Analysis. Grundlehren der mathematischen Wissenschaften, Vol 317, Springer, Berlin (1998)
Rolewicz, S.: On paraconvex multifunctions. In: Third Symposium on Operations Research (Univ. Mannheim, Mannheim, 1978), Section I, Operations Res. Verfahren, vol. 31, pp. 539–546. Hain, Königstein/Ts. (1979)
Rudelson, M., Vershynin, R.: Small ball probabilities for linear images of high-dimensional distributions. International Mathematics Research Notices 2015(19), 9594–9617 (2014)
Shechtman, Y., Eldar, Y., Cohen, O., Chapman, H., Miao, J., Segev, M.: Phase retrieval with application to optical imaging: A contemporary overview. IEEE Signal Processing Magazine 32(3), 87–109 (2015). 10.1109/MSP.2014.2352673
Sun, R., Luo, Z.Q.: Guaranteed matrix completion via non-convex factorization. IEEE Transactions on Information Theory 62(11), 6535–6579 (2016)
Tu, S., Boczar, R., Simchowitz, M., Soltanolkotabi, M., Recht, B.: Low-rank solutions of linear matrix equations via Procrustes flow. In: Proceedings of the 33rd International Conference on International Conference on Machine Learning, Volume 48, ICML’16, pp. 964–973. JMLR.org (2016)
Vershynin, R.: High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press (2018)
Yi, X., Park, D., Chen, Y., Caramanis, C.: Fast algorithms for robust PCA via gradient descent. In: Advances in Neural Information Processing Systems, pp. 4152–4160 (2016)
Zheng, Q., Lafferty, J.: Convergence Analysis for Rectangular Matrix Completion Using Burer-Monteiro Factorization and Gradient Descent. arXiv e-prints arXiv:1605.07051 (2016)
Additional information
Communicated by Thomas Strohmer.
Research of Y. Chen was supported by the NSF 1657420 and 1704828 Grants. Research of D. Davis was supported by an Alfred P. Sloan Research Fellowship. Research of D. Drusvyatskiy was supported by the NSF DMS 1651851 and CCF 1740551 awards.
Appendices
A Proofs in Sect. 5
In this section, we prove rapid local convergence guarantees for the subgradient and prox-linear algorithms under regularity conditions that hold only locally around a particular solution. We will use the Euclidean norm throughout this section; therefore, to simplify the notation, we will drop the subscript two. Thus, \(\Vert \cdot \Vert \) denotes the \(\ell _2\) norm on a Euclidean space \(\mathbf {E}\) throughout.
We will need the following quantitative version of Lemma 5.1.
Lemma A.1
Suppose Assumption C holds and let \(\gamma \in (0,2)\) be arbitrary. Then, for any point \(x\in B_{\epsilon /2}(\bar{x})\cap \mathcal {T}_{\gamma }\backslash \mathcal {X}^*\), the estimate holds:
Proof
Consider any point \(x\in B_{\epsilon /2}(\bar{x})\) satisfying \(\mathrm{dist}(x,\mathcal {X}^*)\le \gamma \frac{\mu }{\rho }\). Let \(x^*\in \mathrm {proj}_{\mathcal {X}^*}(x)\) be arbitrary and note \(x^*\in B_{\epsilon }(\bar{x})\). Thus, for any \(\zeta \in \partial f(x)\) we deduce
Therefore, we deduce the lower bound on the subgradients \(\Vert \zeta \Vert \ge \mu -\frac{\rho }{2}\cdot \mathrm{dist}(x,\mathcal {X}^*)\ge \left( 1-\tfrac{\gamma }{2}\right) \mu ,\) as claimed. \(\square \)
1.1 A.1 Proof of Theorem 5.6
Let k be the first index (possibly infinite) such that \(x_k\notin B_{\epsilon /2}(\bar{x})\). We claim that (5.4) holds for all \(i<k\). We show this by induction. To this end, suppose (5.4) holds for all indices up to \(i-1\). In particular, we deduce \(\mathrm{dist}(x_{i},\mathcal {X}^*)\le \mathrm{dist}(x_{0},\mathcal {X}^*)\le \frac{\mu }{2\rho }\). Let \(x^*\in \mathrm {proj}_{\mathcal {X}^*}(x_i)\) and note \(x^*\in B_{\epsilon }(\bar{x})\), since
Thus, we deduce
Here, the estimate (A.1) follows from the fact that the projection \(\mathrm {proj}_Q(\cdot )\) is nonexpansive, (A.2) uses local weak convexity, (A.4) follows from the estimate \(\mathrm{dist}(x_i,\mathcal {X}^*)\le \frac{\mu }{2\rho }\), while (A.3) and (A.5) use local sharpness. We therefore deduce
Thus, (5.4) holds for all indices up to \(k-1\). We next show that k is infinite. To this end, observe
where (A.7) follows by Lemma A.1 with \(\gamma = 1/2\), the bound in (A.8) follows by (A.6) and the assumption on \(\mathrm{dist}(x_0, \mathcal {X}^*)\), and finally (A.9) holds thanks to (A.6). Thus, applying the triangle inequality we get the contradiction \(\Vert x_k-\bar{x}\Vert \le \epsilon /2\). Consequently, all the iterates \(x_k\) for \(k=0,1,\ldots , \infty \) lie in \(B_{\epsilon /2}(\bar{x})\) and satisfy (5.4).
Finally, let \(x_{\infty }\) be any limit point of the sequence \(\{x_i\}\). We then successively compute
This completes the proof.
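To make the iteration analyzed above concrete, the following is a minimal numerical sketch of a Polyak-type subgradient method applied to the robust phase-retrieval loss \(f(x)=\frac{1}{m}\sum _{i}|\langle a_i,x\rangle ^2-b_i|\). It assumes exact measurements (so \(\min f=0\)) and an initialization within constant relative error of the signal; the problem sizes and variable names are illustrative only and are not taken from the paper's experiments.

```python
import numpy as np

def polyak_subgradient(f, subgrad, x0, f_min=0.0, iters=200):
    """Polyak subgradient method: x+ = x - (f(x) - f_min)/||g||^2 * g."""
    x = x0.copy()
    for _ in range(iters):
        g = subgrad(x)
        gap = f(x) - f_min
        if gap <= 0 or np.linalg.norm(g) == 0:
            break
        x = x - (gap / np.linalg.norm(g) ** 2) * g
    return x

# Illustrative instance: robust phase retrieval with exact measurements,
# f(x) = (1/m) * sum_i | <a_i, x>^2 - b_i |, minimized (value 0) at +/- x_sharp.
rng = np.random.default_rng(0)
d, m = 50, 400
A = rng.standard_normal((m, d))
x_sharp = rng.standard_normal(d)
b = (A @ x_sharp) ** 2

f = lambda x: np.mean(np.abs((A @ x) ** 2 - b))
def subgrad(x):
    # A subgradient of f: (2/m) * sum_i sign((a_i^T x)^2 - b_i) * (a_i^T x) * a_i.
    r = (A @ x) ** 2 - b
    return (2.0 / m) * A.T @ (np.sign(r) * (A @ x))

# Initialization within roughly 10% relative error of x_sharp.
x0 = x_sharp + 0.1 * np.linalg.norm(x_sharp) * rng.standard_normal(d) / np.sqrt(d)
x_hat = polyak_subgradient(f, subgrad, x0)
print(f(x_hat))  # expected to be near zero when initialized close enough
```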
1.2 A.2 Proof of Theorem 5.7
Fix an arbitrary index k and observe
Hence, we conclude the uniform bound on the iterates:
and the linear rate of convergence
where \(x_{\infty }\) is any limit point of the iterate sequence.
Let us now show that the iterates do not escape \(B_{\epsilon /2}(\bar{x})\). To this end, observe
We must therefore verify the estimate \(\tfrac{\lambda }{1-q}\le \tfrac{\epsilon }{4}\), or equivalently \(\gamma \le \frac{\epsilon \rho L(1-\gamma )\tau ^2}{4\mu ^2(1+\sqrt{1-(1-\gamma ) \tau ^2})}.\) Clearly, it suffices to verify \(\gamma \le \frac{\epsilon \rho (1-\gamma )}{4L},\) which holds by the definition of \(\gamma \). Thus, all the iterates \(x_k\) lie in \(B_{\epsilon /2}(\bar{x})\). Moreover, since \(\tau \le \sqrt{\frac{1}{2}} \le \sqrt{\frac{1}{2-\gamma }}\), the rest of the proof is identical to that in [31, Theorem 5.1].
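The constants \(\lambda \) and q appearing above correspond to a geometrically decaying step size. Purely as an illustration, such an iteration might be implemented as in the sketch below; the tuning of lam and q required by Theorem 5.7 is not reproduced here, so they are left as user-supplied parameters.

```python
import numpy as np

def geometric_subgradient(subgrad, x0, lam, q, iters=300):
    """Subgradient method with geometrically decaying steps:
        x_{k+1} = x_k - lam * q**k * g_k / ||g_k||.
    A sketch only: lam and q must be chosen as in Theorem 5.7."""
    x = x0.copy()
    for k in range(iters):
        g = subgrad(x)
        norm_g = np.linalg.norm(g)
        if norm_g == 0:  # a zero subgradient certifies stationarity
            break
        x = x - lam * (q ** k) * g / norm_g
    return x
```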
1.3 A.3 Proof of Theorem 5.8
Fix any index i such that \(x_i\in B_{\epsilon }(\bar{x})\) and let \(x\in \mathcal {X}\) be arbitrary. Since the function \(z\mapsto f_{x_i}(z)+\frac{\beta }{2}\Vert z-x_i\Vert ^2\) is \(\beta \)-strongly convex and \(x_{i+1}\) is its minimizer, we deduce
Setting \(x=x_i\) and appealing to approximation accuracy, we obtain the descent guarantee
In particular, the function values are decreasing along the iterate sequence. Next choosing any \(x^*\in \mathrm {proj}_{\mathcal {X}^*}(x_i)\) and setting \(x=x^*\) in (A.10) yields
Appealing to approximation accuracy and lower-bounding \(\frac{\beta }{2}\Vert x_{i+1}-x^*\Vert ^2\) by zero, we conclude
Using sharpness, we deduce the contraction guarantee
where the last inequality uses the assumption \(f(x_0)-\min _{\mathcal {X}} f\le \frac{\mu ^2}{2\beta }\). Let \(k>0\) be the first index satisfying \(x_{k}\notin B_{\epsilon }(\bar{x})\). We then deduce
where (A.14) follows from (A.11) and (A.15) follows from (A.13). Thus, we conclude \(\Vert x_k-\bar{x}\Vert \le \epsilon \), which is a contradiction. Therefore, all the iterates \(x_k\), for \(k=0,1,\ldots , \infty \), lie in \(B_{\epsilon }(\bar{x})\). Combining this with (A.12) and sharpness yields the claimed quadratic convergence guarantee
Finally, let \(x_{\infty }\) be any limit point of the sequence \(\{x_i\}\). We then deduce
where (A.16) follows from (A.13). The theorem is proved.
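The prox-linear update \(x_{i+1}=\mathop {\mathrm {argmin}}_{x}\, f_{x_i}(x)+\frac{\beta }{2}\Vert x-x_i\Vert ^2\) is a convex subproblem and can be handed to an off-the-shelf solver. The sketch below assumes the robust phase-retrieval instance \(h=\frac{1}{m}\Vert \cdot \Vert _1\) and \(F(x)=(Ax)^{\odot 2}-b\), and it uses cvxpy purely for illustration; it is not the implementation used in the paper's experiments.

```python
import numpy as np
import cvxpy as cp

def prox_linear_step(A, b, xk, beta):
    """One prox-linear step for f(x) = (1/m) * || (A x)^2 - b ||_1.

    Minimizes the convex model
        (1/m) * || F(xk) + J(xk) (x - xk) ||_1 + (beta/2) * ||x - xk||^2,
    where F(x) = (A x)^2 - b and J(xk) has rows 2*(a_i^T xk) * a_i^T.
    """
    m = A.shape[0]
    Fk = (A @ xk) ** 2 - b
    Jk = 2.0 * (A * (A @ xk)[:, None])   # Jacobian of F at xk
    x = cp.Variable(A.shape[1])
    objective = cp.Minimize(cp.sum(cp.abs(Fk + Jk @ (x - xk))) / m
                            + (beta / 2) * cp.sum_squares(x - xk))
    cp.Problem(objective).solve()
    return x.value
```

Iterating this step from a point within the basin described by Theorem 5.8 is what the quadratic convergence guarantee above refers to.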
B Proofs in Sect. 6
1.1 B.1 Proof of Lemma 6.3
In order to verify the assumption in each case, we will prove a stronger “small-ball condition” [62, 63], which immediately implies the claimed lower bounds on the expectation by Markov’s inequality. More precisely, we will show that there exist numerical constants \(\mu _0,p_0>0\) such that
1. (Matrix Sensing)
$$\begin{aligned} \inf _{\begin{array}{c} M: \; \text {Rank }\,M \le 2r \\ \Vert M\Vert _F = 1 \end{array}} \mathbb {P}(|\langle P,M\rangle | \ge \mu _0) \ge p_0, \end{aligned}$$
2. (Quadratic Sensing I)
$$\begin{aligned} \inf _{\begin{array}{c} M\in \mathcal {S}^d: \; \text {Rank }\,M \le 2r \\ \Vert M\Vert _F = 1 \end{array}} \mathbb {P}(|p^\top M p| \ge \mu _0) \ge p_0, \end{aligned}$$
3. (Quadratic Sensing II)
$$\begin{aligned} \inf _{\begin{array}{c} M\in \mathcal {S}^d: \; \text {Rank }\,M \le 2r \\ \Vert M\Vert _F = 1 \end{array}} \mathbb {P}\big ( |p^\top M p- \tilde{p}^\top M \tilde{p}| \ge \mu _0\big ) \ge p_0, \end{aligned}$$
4. (Bilinear Sensing)
$$\begin{aligned} \inf _{\begin{array}{c} M: \; \text {Rank }\,M \le 2r \\ \Vert M\Vert _F = 1 \end{array}} \mathbb {P}(|p^\top M q| \ge \mu _0) \ge p_0. \end{aligned}$$
These conditions immediately imply Assumptions C-F. Indeed, by Markov’s inequality, in the case of matrix sensing we deduce
The same reasoning applies to all the other problems.
Matrix sensing Consider any matrix M with \(\Vert M\Vert _F =1.\) Then, since \(g := \langle P, M\rangle \) follows a standard normal distribution, we may set \(\mu _0\) to be the median of |g| and \(p_0= 1/2\) to obtain
Quadratic Sensing I Fix a matrix M with \(\text {Rank }\,M \le 2r\) and \(\Vert M\Vert _F=1\). Let \(M = UDU^\top \) be an eigenvalue decomposition of M. Using the rotational invariance of the Gaussian distribution, we deduce
where \({\mathop {=}\limits ^{ d }}\) denotes equality in distribution. Next, let z be a standard normal variable. We will now invoke Proposition F.2. Let \(C>0\) be the numerical constant appearing in the proposition. Notice that the function \(\phi :\mathbf{R}_+ \rightarrow \mathbf{R}\) given by
is continuous and strictly increasing, and it satisfies \(\phi (0)= 0\) and \(\lim _{t \rightarrow \infty } \phi (t) = 1.\) Hence, we may set \(\mu _0= \phi ^{-1}(\min \{1/2C,1/2\})\). Proposition F.2 then yields
By taking the supremum of both sides of the inequality we conclude that Assumption D holds with \(\mu _0\) and \(p_0= 1/2.\)
Quadratic sensing II Let \(M = UDU^\top \) be an eigenvalue decomposition of M. Using the rotational invariance of the Gaussian distribution, we deduce
where the last relation follows since \(\left( p_k - \tilde{p}_k\right) ,\left( p_k + \tilde{p}_k\right) \) are independent normal random variables with mean zero and variance two. We will now invoke Proposition F.2. Let \(C>0\) be the numerical constant appearing in the proposition. Let z and \( \tilde{z} \) be independent standard normal variables. Notice that the function \(\phi :\mathbf{R}_+ \rightarrow \mathbf{R}\) given by
is continuous, strictly increasing, satisfies \(\phi (0)= 0\) and approaches one at infinity. Defining \(\mu _0= \phi ^{-1}(\min \{1/2C,1/2\})\) and applying Proposition F.2, we get
By taking the supremum of both sides of the inequality we conclude that Condition E holds with \(\mu _0\) and \(p_0= 1/2.\)
We omit the details for the bilinear case, which follow by similar arguments.
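As a quick numerical illustration of the Quadratic Sensing I computation above, the Monte Carlo sketch below estimates \(\mathbb {P}(|p^\top Mp|\ge \mu _0)\) for a random rank-\(2r\) matrix with unit Frobenius norm. The threshold mu0 is an arbitrary illustrative value, not the constant \(\phi ^{-1}(\min \{1/2C,1/2\})\) supplied by Proposition F.2, and the dimensions are ours.

```python
import numpy as np

# Monte Carlo estimate of the small-ball probability
#   P( |p^T M p| >= mu0 )
# for a random symmetric rank-2r matrix M with ||M||_F = 1 and standard Gaussian p.
rng = np.random.default_rng(1)
d, r, mu0, trials = 100, 2, 0.25, 20000

U = np.linalg.qr(rng.standard_normal((d, 2 * r)))[0]   # d x 2r orthonormal columns
sig = rng.standard_normal(2 * r)
sig /= np.linalg.norm(sig)                             # ensures ||M||_F = 1
M = U @ np.diag(sig) @ U.T                             # symmetric, rank <= 2r

P = rng.standard_normal((trials, d))
vals = np.einsum("ij,jk,ik->i", P, M, P)               # p^T M p for each sample
print((np.abs(vals) >= mu0).mean())                    # empirical small-ball probability
```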
1.2 B.2 Proof of Theorem 6.4
The proofs in this section rely on the following proposition, which shows that pointwise concentration implies uniform concentration. We defer the proof to Appendix B.3.
Proposition B.1
Let \(\mathcal {A}: \mathbf{R}^{d_1 \times d_2} \rightarrow \mathbf{R}^m\) be a random linear mapping with the property that for any fixed matrix \(M \in \mathbf{R}^{d_1 \times d_2}\) of rank at most 2r with norm \(\Vert M\Vert _F =1\) and any fixed subset of indices \(\mathcal {I}\subseteq \{1, \dots , m\}\) satisfying \(|\mathcal {I}| < m/2\), the following hold:
(1) The measurements \(\mathcal {A}(M)_1, \dots , \mathcal {A}(M)_m\) are i.i.d.
(2) RIP holds in expected value:
$$\begin{aligned} \alpha \le \mathbb {E}| \mathcal {A}(M)_i | \le \beta (r) \qquad \text {for all } i \in \{1, \dots , m\} \end{aligned}$$ (B.1)
where \(\alpha > 0\) is a universal constant and \(\beta \) is a positive-valued function that could potentially depend on the rank of M.
(3) There exist a universal constant \(K>0\) and a positive-valued function c(m, r) such that for any \(t \in [0, K]\) the deviation bound
$$\begin{aligned} \frac{1}{m}\left| \Vert \mathcal {A}_{\mathcal {I}^c} (M)\Vert _1 - \Vert \mathcal {A}_{\mathcal {I}} (M)\Vert _1 - \mathbb {E}\big [\Vert \mathcal {A}_{\mathcal {I}^c} (M)\Vert _1 - \Vert \mathcal {A}_{\mathcal {I}} (M)\Vert _1\big ] \right| \le t \end{aligned}$$ (B.2)
holds with probability at least \(1-2\exp (-t^2c(m,r)).\)
Then, there exist universal constants \(c_1, \dots , c_6 > 0\) depending only on \(\alpha \) and K such that if \(\mathcal {I}\subseteq \{1, \dots , m\}\) is a fixed subset of indices satisfying \(|\mathcal {I}| < m/2\) and
then with probability at least \(1-4\exp \left( -c_3(1-2|\mathcal {I}|/m)^2 c(m,r)\right) \) every matrix \(M \in \mathbf{R}^{d_1 \times d_2}\) of rank at most 2r satisfies
and
Due to scale invariance of the above result, we need only verify it in the case that \(\Vert M\Vert _F = 1\). We implicitly use this observation below.
1.2.1 B.2.1 Part 1 of Theorem 6.4 (Matrix sensing)
Lemma B.2
The random variable \(|\langle P, M\rangle |\) is sub-Gaussian with parameter \(C\eta .\) Consequently,
Moreover, there exists a universal constant \(c> 0\) such that for any \(t \in [0, \infty )\) the deviation bound
holds with probability at least \(1-2\exp \left( - \frac{ct^2}{\eta ^2}m\right) .\)
Proof
Condition C immediately implies the lower bound in (B.5). To prove the upper bound, first note that by assumption we have
This bound has two consequences: first, \(\langle P, M\rangle \) is a sub-Gaussian random variable with parameter \(\eta \); second, \(\mathbb {E}|\langle P,M\rangle | \lesssim \eta \) [79, Proposition 2.5.2]. Thus, we have proved (B.5).
To prove the deviation bound (B.6), we introduce the random variables
Since \(|\langle P_i, M\rangle |\) is sub-Gaussian, we have \(\Vert Y_i\Vert _{\psi _2} \lesssim \eta \) for all i, see [79, Lemma 2.6.8]. Hence, Hoeffding’s inequality for sub-Gaussian random variables [79, Theorem 2.6.2] gives the desired upper bound on \(\mathbb {P}\left( \frac{1}{m} \left| \sum _{i=1}^m Y_i \right| \ge t \right) .\) \(\square \)
Applying Proposition B.1 with \(\beta (r) \asymp \eta \) and \(c(m,r) \asymp m/\eta ^2\) now yields the result. \(\square \)
1.2.2 B.2.2 Part 2 of Theorem 6.4 (Quadratic sensing I)
Lemma B.3
The random variable \(|p^\top M p|\) is sub-exponential with parameter \(\sqrt{2r} \eta ^2.\) Consequently,
Moreover, there exists a universal constant \(c> 0\) such that for any \(t \in [0, \sqrt{2r} \eta ]\) the deviation bound
holds with probability at least \(1-2\exp \left( - \frac{ct^2}{\eta ^4}m/r\right) .\)
Proof
Condition D gives the lower bound in (B.7). To prove the upper bound, first note that \(M = \sum _{k=1}^{2r} \sigma _k u_k u_k^\top \) where \(\sigma _k\) and \(u_k\) are the kth singular values and vectors of M, respectively. Hence,
where the first inequality follows since \(\Vert \cdot \Vert _{\psi _1}\) is a norm, the second one follows since \(\Vert XY\Vert _{\psi _1} \le \Vert X\Vert _{\psi _2}\Vert Y\Vert _{\psi _2}\) [79, Lemma 2.7.7], and the third inequality holds since \(\Vert \sigma \Vert _1 \le \sqrt{2r}\Vert \sigma \Vert _2\). This bound has two consequences: first, \(p^\top M p\) is a sub-exponential random variable with parameter \(\sqrt{r} \eta ^2\); second, \(\mathbb {E}p^\top M p\le \sqrt{2r} \eta ^2\) [79, Exercise 2.7.2]. Thus, we have proved (B.7).
To prove the deviation bound (B.8), we introduce the random variables
Since \(p^\top M p\) is sub-exponential, we have \(\Vert Y_i\Vert _{\psi _1} \lesssim \sqrt{r} \eta ^2\) for all i, see [79, Exercise 2.7.10]. Hence, Bernstein’s inequality for sub-exponential random variables [79, Theorem 2.8.2] gives the desired upper bound on \(\mathbb {P}\left( \frac{1}{m} \left| \sum _{i=1}^m Y_i \right| \ge t \right) .\) \(\square \)
Applying Proposition B.1 with \(\beta (r) \asymp \sqrt{r}\eta ^2\) and \(c(m,r) \asymp m/{\eta ^4}r\) now yields the result. \(\square \)
1.2.3 B.2.3 Part 3 of Theorem 6.4 (Quadratic sensing II)
Lemma B.4
The random variable \(|p^\top M p- \tilde{p}^\top M \tilde{p}|\) is sub-exponential with parameter \(C\eta ^2.\) Consequently,
Moreover, there exists a universal constant \(c> 0\) such that for any \(t \in [0, \eta ^2]\) the deviation bound
holds with probability at least \(1-2\exp \left( - \frac{ct^2}{\eta ^4}m\right) .\)
Proof
Condition E implies the lower bound in (B.9). To prove the upper bound, we will show that \(\Vert |p^\top M p- \tilde{p}^\top M \tilde{p}|\Vert _{\psi _1} \le \eta ^2\). By definition of the Orlicz norm, \(\Vert |X|\Vert _{\psi _1} = \Vert X\Vert _{\psi _1}\) for any random variable X; hence, without loss of generality, we may remove the absolute value. Recall that \(M = \sum _{k=1}^{2r} \sigma _k u_k u_k^\top \) where \(\sigma _k\) and \(u_k\) are the kth singular values and vectors of M, respectively. Hence, the random variable of interest can be rewritten as
By assumption, the random variables \(\langle u_k, p\rangle \) are \(\eta \)-sub-Gaussian; this implies that \(\langle u_k,p\rangle ^2\) are \(\eta ^2\)-sub-exponential, since \(\Vert \langle u_k, p\rangle ^2\Vert _{\psi _1} \le \Vert \langle u_k, p\rangle \Vert _{\psi _2}^2\).
Recall the following characterization of the Orlicz norm for mean-zero random variables
where \(Q \asymp \tilde{Q}\); see [79, Proposition 2.7.1]. To prove that the random variable (B.11) is sub-exponential, we will exploit this characterization. Since each inner product squared \(\langle u_k,p\rangle ^2\) is sub-exponential, the equivalence implies the existence of a constant \(c>0\) for which the uniform bound
holds. Let \(\lambda \) be an arbitrary scalar with \(|\lambda |\le 1/c\eta ^4\), then by expanding the moment generating function of (B.11) we get
where the inequality follows by (B.13) and the last relation follows since \(\sigma \) is unit norm. Combining this with (B.12) gives
This bound has two consequences: first, \(|p^\top M p- \tilde{p}^\top M \tilde{p}|\) is a sub-exponential random variable with parameter \(C\eta ^2\); second, \(\mathbb {E}|p^\top M p- \tilde{p}^\top M \tilde{p}| \le C \eta ^2\) [79, Exercise 2.7.2]. Thus, we have proved (B.9).
To prove the deviation bound (B.10) we introduce the random variables
The sub-exponentiality of \(\mathcal {A}(M)_i\) implies \(\Vert Y_i\Vert _{\psi _1} \lesssim \eta ^2\) for all i, see [79, Exercise 2.7.10]. Hence, Bernstein’s inequality for sub-exponential random variables [79, Theorem 2.8.2] gives the desired upper bound on \(\mathbb {P}\left( \frac{1}{m} \left| \sum _{i=1}^m Y_i \right| \ge t \right) .\) \(\square \)
Applying Proposition B.1 with \(\beta (r) \asymp \eta ^2\) and \(c(m,r) \asymp m/{\eta ^4}\) now yields the result. \(\square \)
1.2.4 B.2.4 Part 4 of Theorem 6.4 (Bilinear sensing)
Lemma B.5
The random variable \(|p^\top M q|\) is sub-exponential with parameter \(C\eta ^2.\) Consequently,
Moreover, there exists a universal constant \(c> 0\) such that for any \(t \in [0, \eta ^2]\) the deviation bound
holds with probability at least \(1-2\exp \left( - \frac{ct^2}{\eta ^4}m\right) .\)
Proof
As before, the lower bound in (B.14) is implied by Condition F. To prove the upper bound, we will show that \(\Vert |p^\top M q|\Vert _{\psi _1} \le \eta ^2\). By definition of the Orlicz norm, \(\Vert |X|\Vert _{\psi _1} = \Vert X\Vert _{\psi _1}\) for any random variable X; hence we may remove the absolute value. Recall that \(M = \sum _{k=1}^{2r} \sigma _k u_k v_k^\top \) where \(\sigma _k\) and \((u_k, v_k)\) are the kth singular values and vectors of M, respectively. Hence, the random variable of interest can be rewritten as
By assumption, the random variables \(\langle p, u_k\rangle \) and \(\langle v_k,q\rangle \) are \(\eta \)-sub-Gaussian; this implies that \(\langle p,u_k\rangle \langle v_k,q\rangle \) are \(\eta ^2\)-sub-exponential.
To prove that the random variable (B.16) is sub-exponential, we will again use (B.12). Since each random variable \(\langle p,u_k\rangle \langle v_k,q\rangle \) is sub-exponential, the equivalence implies the existence of a constant \(c>0\) for which the uniform bound
holds. Let \(\lambda \) be an arbitrary scalar with \(|\lambda |\le 1/c\eta ^4\), then by expanding the moment generating function of (B.16) we get
where the inequality follows by (B.17) and the last relation follows since \(\sigma \) has unit norm. Combining this with (B.12) gives
Thus, we have proved (B.14).
Once again, to show the deviation bound (B.15) we introduce the random variables
and apply Bernstein’s inequality for sub-exponential random variables [79, Theorem 2.8.2] to get the stated upper bound on \(\mathbb {P}\left( \frac{1}{m} \left| \sum _{i=1}^m Y_i \right| \ge t \right) .\) \(\square \)
Applying Proposition B.1 with \(\beta (r) \asymp \eta ^2\) and \(c(m,r) \asymp m/{\eta ^4}\) now yields the result. \(\square \)
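For readers who wish to experiment with the four measurement ensembles treated in this section, the following hypothetical data-generation sketch implements the Gaussian models \(\langle P_i,M\rangle \), \(p_i^\top Mp_i\), \(p_i^\top Mp_i-\tilde{p}_i^\top M\tilde{p}_i\), and \(p_i^\top Mq_i\); the function name and interface are ours, not the paper's.

```python
import numpy as np

def measurements(M, m, model, rng):
    """Gaussian measurement models of Sect. 6 (illustrative sketch only).
    M is d1 x d2; for the quadratic models it should be symmetric with d1 = d2."""
    d1, d2 = M.shape
    if model == "matrix":        # <P_i, M>
        P = rng.standard_normal((m, d1, d2))
        return np.einsum("ijk,jk->i", P, M)
    if model == "quadratic1":    # p_i^T M p_i
        p = rng.standard_normal((m, d1))
        return np.einsum("ij,jk,ik->i", p, M, p)
    if model == "quadratic2":    # p_i^T M p_i - ptilde_i^T M ptilde_i
        p = rng.standard_normal((m, d1))
        pt = rng.standard_normal((m, d1))
        return (np.einsum("ij,jk,ik->i", p, M, p)
                - np.einsum("ij,jk,ik->i", pt, M, pt))
    if model == "bilinear":      # p_i^T M q_i
        p = rng.standard_normal((m, d1))
        q = rng.standard_normal((m, d2))
        return np.einsum("ij,jk,ik->i", p, M, q)
    raise ValueError(model)
```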
1.3 B.3 Proof of Proposition B.1
Choose \(\epsilon \in (0,\sqrt{2})\) and let \(\mathcal {N}\) be the (\(\epsilon /\sqrt{2}\))-net guaranteed by Lemma F.1. Pick some \(t \in (0,K]\) so that (B.2) can hold; we will fix the value of this parameter later in the proof. Let \(\mathcal {E}\) denote the event that the following two estimates hold for all matrices \(M\in \mathcal {N}\):
Throughout the proof, we will assume that the event \(\mathcal {E}\) holds. We will estimate the probability of \(\mathcal {E}\) at the end of the proof. Meanwhile, seeking to establish RIP, define the quantity
We aim first to provide a high probability bound on \(c_2\).
Let \(M \in S_{2r}\) be arbitrary and let \(M_\star \) be the closest point to M in \(\mathcal {N}\). Then, we have
where (B.20) follows from (B.19) and (B.21) follows from the triangle inequality. To simplify the third term in (B.21), using SVD, we deduce that there exist two mutually orthogonal matrices \(M_1, M_2\) of rank at most 2r satisfying \(M - M_\star = M_1+M_2.\) With this decomposition in hand, we compute
where the second inequality follows from the definition of \(c_2\) and the estimate \(\Vert M_1\Vert _F + \Vert M_2\Vert _F \le \sqrt{2} \Vert (M_1, M_2)\Vert _F = \sqrt{2} \Vert M_1 + M_2\Vert _F.\) Thus, we arrive at the bound
As M was arbitrary, we may take the supremum of both sides of the inequality, yielding \(c_2\le \frac{1}{m}\sup _{M \in S_{2r}}\mathbb {E}\Vert \mathcal {A}(M)\Vert _1 + t+ 2c_2 \epsilon \). Rearranging yields the bound
Assuming that \(\epsilon \le 1/4\), we further deduce that
establishing that the random variable \(c_2\) is bounded by \(\bar{\sigma }\) in the event \(\mathcal {E}\).
Now let \(\hat{\mathcal {I}}\) denote either \(\hat{\mathcal {I}}=\emptyset \) or \(\hat{\mathcal {I}}=\mathcal {I}\). We now provide a uniform lower bound on \(\frac{1}{m}\Vert \mathcal {A}_{\hat{\mathcal {I}}^c }(M)\Vert _1 - \frac{1}{m}\Vert \mathcal {A}_{\hat{\mathcal {I}} }(M)\Vert _1\). Indeed,
where (B.25) uses the forward and reverse triangle inequalities, (B.26) follows from (B.18), the estimate (B.27) follows from the forward and reverse triangle inequalities, and (B.28) follows from (B.22) and (B.24). Switching the roles of \(\mathcal {I}\) and \(\mathcal {I}^c\) in the above sequence of inequalities, and choosing \(\epsilon = t/4\bar{\sigma }\), we deduce
In particular, setting \(\hat{\mathcal {I}}=\emptyset \), we deduce
and therefore using (B.1), we conclude the RIP property
Next, let \(\hat{\mathcal {I}} = \mathcal {I}\) and note that
where the equality follows by assumption (1). Therefore, every \(M\in S_{2r}\) satisfies
Setting \(t=\frac{2}{3}\min \{\alpha , \alpha (1-2|\mathcal {I}|/m)/2\} = \frac{1}{3}\alpha (1-2|\mathcal {I}|/m)\) in (B.29) and (B.30), we deduce the claimed estimates (B.3) and (B.4). Finally, let us estimate the probability of \(\mathcal {E}\). Using the union bound and Lemma F.1 yields
where c(m, r) is the function guaranteed by assumption (3).
By (B.1), we get \(1/\epsilon = 4\bar{\sigma }/t \lesssim 2 + \beta (r)/(1 - 2|\mathcal {I}|/m)\). Then, we deduce
Hence, as long as \(c(m,r)\ge \frac{9c_1(d_1+d_2+1)r^2\ln \left( c_2+\frac{c_2\beta (r)}{1-2|\mathcal {I}|/m}\right) }{\alpha ^2 \left( 1-\frac{2|\mathcal {I}|}{m}\right) ^2}\), we can be sure
This proves the desired result. \(\square \)
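As an informal sanity check of the \(\ell _1/\ell _2\) RIP conclusion for the matrix sensing model, one can sample a few random rank-\(2r\) matrices and inspect the ratio \(\frac{1}{m}\Vert \mathcal {A}(M)\Vert _1/\Vert M\Vert _F\); for Gaussian \(P_i\) it should concentrate near \(\sqrt{2/\pi }\). The sketch below samples test matrices rather than covering all of \(S_{2r}\), so it only illustrates the statement of Proposition B.1 and does not verify it; all sizes are arbitrary.

```python
import numpy as np

# Empirical check of the ratio (1/m) * ||A(M)||_1 / ||M||_F for matrix sensing
# over a handful of random rank-2r matrices.
rng = np.random.default_rng(2)
d1, d2, r, m = 30, 40, 2, 2000
P = rng.standard_normal((m, d1, d2))        # i.i.d. Gaussian measurement matrices

ratios = []
for _ in range(20):
    L = rng.standard_normal((d1, 2 * r))
    R = rng.standard_normal((d2, 2 * r))
    M = L @ R.T                              # random matrix of rank <= 2r
    AM = np.einsum("ijk,jk->i", P, M)        # measurements <P_i, M>
    ratios.append(np.abs(AM).mean() / np.linalg.norm(M, "fro"))
print(min(ratios), max(ratios))              # should concentrate around sqrt(2/pi)
```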
C Proofs in Sect. 7
1.1 C.1 Proof of Lemma 7.4
Define \(P(x,y)=a\Vert y-x\Vert ^2_2+b\Vert y-x\Vert _2\). Fix an iteration k and choose \(x^*\in \mathrm {proj}_{\mathcal {X}^*}(x_k)\). Then, the estimate holds:
Rearranging and using the sharpness and approximation accuracy assumptions, we deduce
The result follows.
1.2 C.2 Proof of Theorem 7.6
First notice that for any y, we have \(\partial f(y) = \partial f_y(y)\). Therefore, since \(f_y\) is a convex function, we have that for all \(x, y \in \mathcal {X}\) and \(v \in \partial f(y)\), the bound
Consequently, given that \(\mathrm{dist}(x_i,\mathcal {X}^*)\le \gamma \cdot \frac{\mu - 2b}{2a}\), we have
Here, the estimate (C.2) follows from the fact that the projection \(\mathrm {proj}_\mathcal {X}(\cdot )\) is nonexpansive, (C.3) uses the bound in (C.1), (C.5) follows from the estimate \(\mathrm{dist}(x_i,\mathcal {X}^*)\le \gamma \cdot \frac{\mu - 2b}{2a}\), while (C.4) and (C.6) use local sharpness. The result then follows by the upper bound \(\Vert \zeta _i\Vert \le L\).
D Proofs in Sect. 8
1.1 D.1 Proof of Lemma 8.1
The inequality can be established using an argument similar to that for bounding the \( T_3 \) term in [27, Section 6.6]. We provide the proof below for completeness. Define the shorthand \( \varDelta _S := S-S_{\sharp }\) and \( \varDelta _X = X- X_{\sharp } \), and let \( e_j \in \mathbb {R}^d\) denote the j-th standard basis vector of \( \mathbb {R}^d \). Simple algebra gives
We claim that \( \Vert \varDelta _S e_j \Vert _1 \le 2\sqrt{k} \Vert \varDelta _S e_j \Vert _2\) for each \( j\in [d] \). To see this, fix any \( j\in [d] \) and let \( v := Se_j \), \( v^* := S_\sharp e_j \), and \( T := \text {support}(v^*). \) We have
Rearranging terms gives \( \Vert (v - v^*)_{T^c} \Vert _1 \le \Vert (v - v^*)_T \Vert _1 \), whence
where the second inequality holds because \( |T| \le k \) by assumption. The claim follows from noting that \( v-v^* = \varDelta _S e_j \).
Using the claim, we get that
Using a similar argument and the fact that \( \Vert \varDelta _X \Vert _{2,\infty } \le \Vert X\Vert _{2,\infty } + \Vert X_{\sharp }\Vert _{2,\infty } \le 3\sqrt{\frac{\nu r}{d}} \), we obtain
Putting everything together, we have
The claim follows.
1.2 D.2 Proof of Theorem 8.6
Without loss of generality, suppose that x is closer to \(\bar{x}\) than to \(-\bar{x}\). Consider the following expression:
We now produce a few different lower bounds by testing against different V. In what follows, we set \(a = \sqrt{2} - 1\), i.e., the positive solution of the equation \(1-a^2 = 2a\).
Case 1: Suppose that
Then, set \(\bar{V} = \mathrm {sign}((x - \bar{x} )^\top \mathrm {sign}(\bar{x})) \cdot \mathrm {sign}(\bar{x})\mathrm {sign}(\bar{x})^\top \), to get
Case 2: Suppose that
Then, set \(\bar{V} = \mathrm {sign}(\mathrm {sign}(x - \bar{x} )^\top \bar{x}) \cdot \mathrm {sign}( x - \bar{x})\mathrm {sign}( x - \bar{x})^\top \), to get
Case 3: Suppose that
Define \(\bar{V} = \frac{1}{2}(\mathrm {sign}(\bar{x}(x - \bar{x})^\top ) + \mathrm {sign}((x - \bar{x}) \bar{x}^\top ))\). Observe that
and
Putting these two bounds together, we find that
Altogether, we find that
as desired.
1.3 D.3 Proof of Lemma 8.8
We start by stating a claim we will use to prove the lemma. Let us introduce some notation. Consider the set
Define the random variable
Claim
There exist constants \( c_2, c_3 > 0\) such that with probability at least \(1-\exp (-c_2 \log d)\)
Before proving this claim, let us show how it implies the lemma. Let
Set \(\varDelta _- = X - X_\sharp R\) and \(\varDelta _+ = X + X_\sharp R\). Notice that
Therefore, because \((\varDelta _+, \varDelta _-) \in S\) and
we have that
where the last line follows by Conjecture 8.7. This proves the desired result.
Proof of the Claim
Our goal is to show that the random variable Z is highly concentrated around its mean. We may apply the standard symmetrization inequality [7, Lemma 11.4] to bound the expectation \(\mathbb {E}Z\) as follows:
Observing that \(T_1\) and \(T_2\) can both be bounded by
where the final inequality follows from Bernstein’s inequality and a union bound, we find that
To prove that Z is well concentrated around \( \mathbb {E}Z\), we apply Theorem F.3. To that end, we set \(\mathcal {S}= S\) and define the collection \((Z_{ij,s})_{ij, s\in \mathcal {S}}\), where \(s = (\varDelta _+, \varDelta _-)\), by
We also bound
and
Therefore, by Theorem F.3, there exist constants \(c_1, c_2, c_3 > 0\) so that, with \(t = c_2 \log d\), with probability at least \(1-e^{-c_2\log d}\) the random variable Z is upper bounded by
where the last line follows since by assumption \(\log d / d \lesssim \tau .\) \(\square \)
E Proofs in Sect. 9
1.1 E.1 Proof of Lemma 9.1
The proof follows the same strategy as [32, Theorem 6.1]. Fix \(x \in \widetilde{\mathcal {T}}_1\) and let \(\zeta \in \partial \tilde{f}(x)\). Then, for all y, we have, from Lemma 9.3, that
Therefore, the function
satisfies
Now, for some \(\gamma > 0\) to be determined momentarily, define
First-order optimality conditions and the sum rule immediately imply that
Thus,
Now we estimate \(\Vert x - \hat{x}\Vert _2\). Indeed, from the definition of \(\hat{x}\) we have
Consequently, we have \(\Vert x - \hat{x}\Vert \le 2\gamma \). Thus, setting \(\gamma = \sqrt{2\varepsilon /\rho }\) and recalling that \(\varepsilon \le \mu ^2/56\rho \) we find that
Likewise, we have
Therefore, setting \(L = \sup \left\{ \Vert \zeta \Vert _2:\zeta \in \partial f(x), \mathrm{dist}(x, \mathcal {X}^*) \le \frac{\mu }{\rho }, \mathrm{dist}(x, \mathcal {X}) \le 2\sqrt{\frac{\varepsilon }{\rho }}\right\} \), we find that
as desired.
1.2 E.2 Proof of Theorem 9.4
Let \(i \ge 0\), suppose \(x_i \in \widetilde{\mathcal {T}}_1\), and let \(x^*\in \mathrm {proj}_{\mathcal {X}^*}(x_i)\). Notice that Lemma 9.2 implies \(\tilde{f}(x_i)-\min _{\mathcal {X}}f>0\). We successively compute
Here, the estimate (E.1) follows from the fact that the projection \(\mathrm {proj}_Q(\cdot )\) is nonexpansive, (E.2) uses Lemma 9.3, the estimate (E.4) follows from the assumption \(\epsilon <\frac{\mu }{14}\Vert x_i-x^*\Vert \), the estimate (E.5) follows from the estimate \(\Vert x_i-x^*\Vert \le \frac{\mu }{4\rho }\), while (E.3) and (E.6) use Lemma 9.2. We therefore deduce
Consequently, either we have \(\mathrm{dist}(x_{i+1}, \mathcal {X}^*) < \frac{14\varepsilon }{\mu }\) or \(x_{i+1} \in \widetilde{\mathcal {T}}_1\). Therefore, by induction, the proof is complete.
1.3 E.3 Proof of Theorem 9.6
Let \(i \ge 0\), suppose \(x_i \in \mathcal {T}_\gamma \), and let \(x^*\in \mathrm {proj}_{\mathcal {X}^*}(x_i)\). Then,
Rearranging yields the result.
F Auxiliary Lemmas
Lemma F.1
(Lemma 3.1 in [13]) Let \(S_r := \left\{ X \in \mathbf{R}^{d_1 \times d_2} \mid \text {Rank }\,(X) \le r, \left\| X \right\| _F = 1\right\} \). There exists an \(\epsilon \)-net \(\mathcal {N}\) (with respect to \(\Vert \cdot \Vert _F\)) of \(S_r\) obeying
Proposition F.2
(Corollary 1.4 in [75]) Consider real-valued random variables \(X_1, \dots , X_d\) and let \(\sigma \in \mathbb {S}^{d-1}\) be a unit vector. Let \(t, p > 0\) be such that
Then, the following holds
where \(C > 0\) is a universal constant.
Theorem F.3
(Talagrand’s Functional Bernstein for non-identically distributed variables [53, Theorem 1.1(c)]) Let \(\mathcal {S}\) be a countable index set. Let \(Z_{1},\ldots ,Z_{n}\) be independent vector-valued random variables of the form \(Z_{i}=(Z_{i,s})_{s\in \mathcal {S}}\). Let \(Z:=\sup _{s\in \mathcal {S}}\sum _{i=1}^{n}Z_{i,s}\). Assume that for all \(i\in [n]\) and \(s\in \mathcal {S}\), \(\mathbb {E}Z_{i,s}=0\) and \(\left| Z_{i,s}\right| \le b\). Let
Then, for each \(t>0\), we have the tail bound
Cite this article
Charisopoulos, V., Chen, Y., Davis, D. et al. Low-Rank Matrix Recovery with Composite Optimization: Good Conditioning and Rapid Convergence. Found Comput Math 21, 1505–1593 (2021). https://doi.org/10.1007/s10208-020-09490-9
Keywords
- Restricted isometry property
- Matrix sensing
- Matrix completion
- Low-rank matrix recovery
- Subgradient
- Prox-linear algorithms