Abstract
Data copy is an important compiler optimization which dynamically rearranges the layout of arrays by copying their elements into local buffers. Traditionally, array copy is considered expensive and has been applied only to the working sets of fully blocked computations. This paper presents an algorithm which automatically applies data copy to optimize the performance of general computations independent of blocking. The algorithm automatically decides where to insert copy operations and which regions of arrays to copy. In addition, when specialized, it is equivalent to a general scalar replacement algorithm on arbitrary array computations. The algorithm is fully implemented and has been applied to optimize several scientific kernels. The results show that the algorithm is highly effective and that data copy can significantly improve the performance of scientific computations, both when combined with blocking and when applied alone without blocking.
This work was carried out while the author was employed by Lawrence Livermore National Laboratory, Livermore, CA 94550.
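As a rough illustration of the transformation the abstract describes, the following C sketch applies data copy by hand to a small kernel. It is not the paper's algorithm, which decides automatically where to copy and which array regions to copy. The function names, the `MAX_N` bound, and the kernel itself are assumptions made for this example. The baseline re-reads one column of a row-major matrix with stride `n` on every outer iteration, while the second version first copies that column into a contiguous local buffer. The scalar `sum` shows the specialized case the abstract mentions, where the copied region is a single element and the copy amounts to scalar replacement.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_N = 64 };  /* hypothetical upper bound on the column length */

/* Baseline: column `col` of the row-major n-by-n matrix `a` is re-read
   with stride n for each of the m rows of `b`. */
void col_times_rows(const double *a, const double *b, double *out,
                    size_t n, size_t m, size_t col)
{
    for (size_t j = 0; j < m; j++) {
        out[j] = 0.0;
        for (size_t i = 0; i < n; i++)
            out[j] += a[i * n + col] * b[j * n + i];
    }
}

/* Same computation after a hand-applied data copy: the strided column is
   copied once into a contiguous local buffer and reused from there. */
void col_times_rows_copy(const double *a, const double *b, double *out,
                         size_t n, size_t m, size_t col)
{
    double buf[MAX_N];
    assert(n <= MAX_N);  /* the fixed-size buffer is an assumption of this sketch */

    /* Copy-in: one strided pass over the column. */
    for (size_t i = 0; i < n; i++)
        buf[i] = a[i * n + col];

    for (size_t j = 0; j < m; j++) {
        double sum = 0.0;  /* scalar replacement of out[j] */
        for (size_t i = 0; i < n; i++)
            sum += buf[i] * b[j * n + i];  /* unit-stride reads only */
        out[j] = sum;
    }
}
```

Because the column is only read, no copy-back is needed here. A region that is also written would have to be copied back to the original array after the loop.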
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Yi, Q. (2006). Applying Data Copy to Improve Memory Performance of General Array Computations. In: Ayguadé, E., Baumgartner, G., Ramanujam, J., Sadayappan, P. (eds) Languages and Compilers for Parallel Computing. LCPC 2005. Lecture Notes in Computer Science, vol 4339. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-69330-7_7
DOI: https://doi.org/10.1007/978-3-540-69330-7_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-69329-1
Online ISBN: 978-3-540-69330-7
eBook Packages: Computer Science, Computer Science (R0)