Abstract
This work studies an optimization scheme for computing sparse approximate solutions of over-determined linear systems. Sparse Conjugate Directions Pursuit (SCDP) aims to construct a solution using only a small number of nonzero coefficients. The motivation for this work comes from machine learning, where sparse models typically generalize better, are fast to evaluate, and can be exploited to define scalable algorithms. The main idea is to iteratively build up a conjugate set of vectors of increasing cardinality, solving a small linear subsystem in each iteration. By exploiting the structure of this conjugate basis, an algorithm is obtained that (i) converges in at most D iterations for D-dimensional systems, (ii) has a computational complexity close to that of the classical conjugate gradient algorithm, and (iii) is especially efficient when a few iterations suffice to produce a good approximation. As an example, the application of SCDP to Fixed-Size Least Squares Support Vector Machines (FS-LSSVM) is discussed, resulting in a scheme that efficiently finds a good model size for the FS-LSSVM setting and scales to large machine learning tasks. The algorithm is empirically verified in a classification context. Further discussion covers algorithmic issues such as component selection criteria, computational analysis, the influence of additional hyper-parameters, and the determination of a suitable stopping criterion.
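To make the iterative scheme concrete, the following is a minimal Python/NumPy sketch of a greedy pursuit of the kind described above: at each iteration one component is selected, the active set grows by one, and the small least-squares subsystem restricted to the active columns is re-solved. The function name greedy_pursuit, the correlation-based selection rule, and the re-solve via np.linalg.lstsq are illustrative assumptions, not the published method; SCDP itself maintains and updates a conjugate basis so that each iteration is much cheaper than a full re-solve.

import numpy as np

def greedy_pursuit(A, b, max_nonzeros, tol=1e-8):
    """Greedy sparse approximation of A x ~= b (illustrative sketch only)."""
    d = A.shape[1]
    x = np.zeros(d)
    active = []                      # indices of the selected (nonzero) coefficients
    residual = b.copy()
    for _ in range(max_nonzeros):
        # Component selection: pick the column most correlated with the residual.
        scores = np.abs(A.T @ residual)
        scores[active] = -np.inf     # never reselect an already active column
        active.append(int(np.argmax(scores)))
        # Solve the small linear subsystem restricted to the active columns.
        coef, *_ = np.linalg.lstsq(A[:, active], b, rcond=None)
        x[:] = 0.0
        x[active] = coef
        residual = b - A @ x
        # Stopping criterion: residual small enough.
        if np.linalg.norm(residual) < tol:
            break
    return x, active

With k selected components this naive sketch performs on the order of k separate least-squares solves; avoiding that overhead through the conjugate-basis bookkeeping is what keeps the cost of SCDP close to that of the classical conjugate gradient algorithm.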
Additional information
Editors: Süreyya Özöğür-Akyüz, Devrim Unay, and Alex Smola.
Cite this article
Karsmakers, P., Pelckmans, K., De Brabanter, K. et al. Sparse conjugate directions pursuit with application to fixed-size kernel models. Mach Learn 85, 109–148 (2011). https://doi.org/10.1007/s10994-011-5253-8