Abstract
Obtaining high performance on the STI CELL processor requires substantial programming effort because its architectural features must be explicitly managed, with separate codes required for two different types of cores (PPE and SPE). Research at IBM has developed a single source-image compiler for CELL that performs vectorization but uses OpenMP to specify cross-core parallelism. In this paper, we present and evaluate an alternative dependence-based compiler approach that automatically generates parallel and vector code for CELL from a single source program with no parallelism directives. In contrast to OpenMP, our approach can also handle loop nests that carry dependences. To preserve correct program semantics, we employ on-chip communication mechanisms to implement barrier and unidirectional synchronization primitives. We also implement strategies to boost performance by managing DMA data movement, improving data alignment, and exploiting memory reuse in the innermost loop.
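The role of barrier synchronization in a loop nest that carries a dependence can be illustrated with a minimal sketch. This is not the generated CELL code or the authors' implementation: Python threads and `threading.Barrier` stand in for SPE cores and the on-chip communication mechanisms, and the names `sequential`, `parallel`, `make_grid`, and `num_workers` are hypothetical. The outer `i` loop carries a dependence (row `i` reads row `i - 1`), so only the inner `j` loop is divided among workers, and a barrier separates consecutive outer iterations.

```python
import threading


def make_grid(n, m):
    """Build an n x m grid whose first row is 0..m-1 and whose other rows are zero."""
    return [[j if i == 0 else 0 for j in range(m)] for i in range(n)]


def sequential(a):
    """Reference version: the outer i loop carries a dependence on row i - 1."""
    n, m = len(a), len(a[0])
    for i in range(1, n):
        for j in range(m):
            a[i][j] = a[i - 1][j] + a[i - 1][(j + 1) % m]
    return a


def parallel(a, num_workers=4):
    """Run the inner j loop across workers, with a barrier after each i iteration."""
    n, m = len(a), len(a[0])
    barrier = threading.Barrier(num_workers)

    def worker(w):
        for i in range(1, n):
            # Each worker owns a cyclic slice of the j iterations.
            for j in range(w, m, num_workers):
                a[i][j] = a[i - 1][j] + a[i - 1][(j + 1) % m]
            # Row i must be complete before any worker starts row i + 1.
            barrier.wait()

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return a


if __name__ == "__main__":
    # Both versions should produce the same grid.
    print(parallel(make_grid(6, 10)) == sequential(make_grid(6, 10)))
```

Without the barrier, a worker could read an element of row `i - 1` that a neighbouring worker has not yet written, because `a[i - 1][(j + 1) % m]` crosses slice boundaries. A unidirectional (point-to-point) primitive would relax this by making each worker wait only on the neighbour whose data it reads.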
Copyright information
© 2007 Springer Berlin Heidelberg
About this paper
Cite this paper
Zhao, Y., Kennedy, K. (2007). Dependence-Based Code Generation for a CELL Processor. In: Almási, G., Caşcaval, C., Wu, P. (eds) Languages and Compilers for Parallel Computing. LCPC 2006. Lecture Notes in Computer Science, vol 4382. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72521-3_6
DOI: https://doi.org/10.1007/978-3-540-72521-3_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-72520-6
Online ISBN: 978-3-540-72521-3
eBook Packages: Computer Science (R0)