TR-08-07 Performance Consistency on Multi-socket AMD Opteron Systems

Allan Porterfield, Robert J. Fowler, Anirban Mandal and Min Yeol Lim. Performance Consistency on Multi-socket AMD Opteron Systems. Technical Report TR-08-07, RENCI, North Carolina, December 2008.

Compute nodes with multiple sockets, each containing multiple cores, are starting to dominate scientific computing clusters. Performance inconsistencies from one execution to the next make any performance debugging or tuning difficult. These inconsistencies are largest for memory-bound applications but remain noticeable for all but the most compute-intensive applications. Memory and thread placement across sockets has a significant impact on the performance of these systems. We test overall performance and performance consistency for a number of OpenMP and pthread benchmarks, including Stream, pChase, the NAS Parallel Benchmarks, and SPEC OMP. The tests are run on a variety of multi-socket quad-core AMD Opteron systems. We examine the benefits of explicitly pinning each thread to a different core before any data initialization, which improves performance and reduces its variability by ensuring data-to-thread co-location. Execution time variability falls to less than 2%, and for one memory-bound application peak performance increases by over 40%. For applications running on hundreds or thousands of nodes, reducing variability will improve load balance and total application performance. Careful memory and thread placement is critical for successful performance tuning of nodes on a modern scientific compute cluster.
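
The core idea described in the abstract, binding each thread to its own core before any data initialization so that first-touch page allocation keeps each thread's data on its local socket, can be illustrated with a small pthreads sketch. This is not the code used in the report; NTHREADS, CHUNK, and the identity thread-to-core mapping are illustrative assumptions, and the sketch relies on the Linux/glibc pthread_setaffinity_np extension.

    /*
     * Minimal sketch (not the authors' code): pin each pthread to a distinct
     * core *before* it touches its share of the data, so Linux's first-touch
     * policy places pages on the NUMA node local to that core.
     * Assumes Linux with GNU extensions and that cores 0..NTHREADS-1 exist.
     */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NTHREADS 8
    #define CHUNK    (1 << 20)        /* doubles per thread, illustrative */

    static double *data;              /* shared array of NTHREADS * CHUNK */

    static void *worker(void *arg)
    {
        long id = (long)arg;

        /* Pin this thread to core `id` first ... */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET((int)id, &set);
        if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0)
            fprintf(stderr, "affinity failed for thread %ld\n", id);

        /* ... then initialize (first-touch) this thread's own chunk, so its
         * pages land on the memory attached to this thread's socket. */
        for (long i = id * CHUNK; i < (id + 1) * CHUNK; i++)
            data[i] = 0.0;

        /* Subsequent computation on data[id*CHUNK ...] is now socket-local. */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];

        data = malloc(sizeof(double) * NTHREADS * CHUNK);
        if (!data) return 1;

        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, worker, (void *)t);
        for (long t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);

        free(data);
        return 0;
    }

If the pinning were done after initialization (or not at all), the operating system could migrate threads away from the socket holding their data, producing the run-to-run variability the report measures.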

@TECHREPORT{LFRT2008:tr0807,
AUTHOR = {Allan Porterfield and Robert J. Fowler and Anirban Mandal and Min Yeol Lim},
TITLE = {Performance Consistency on Multi-socket AMD Opteron Systems},
INSTITUTION = {RENCI},
YEAR = {2008},
NUMBER = {TR-08-07},
ADDRESS = {North Carolina},
MONTH = {December},
ABSTRACT = {Compute nodes with multiple sockets, each containing multiple cores, are starting to dominate scientific computing clusters. Performance inconsistencies from one execution to the next make any performance debugging or tuning difficult. These inconsistencies are largest for memory-bound applications but remain noticeable for all but the most compute-intensive applications. Memory and thread placement across sockets has a significant impact on the performance of these systems. We test overall performance and performance consistency for a number of OpenMP and pthread benchmarks, including Stream, pChase, the NAS Parallel Benchmarks, and SPEC OMP. The tests are run on a variety of multi-socket quad-core AMD Opteron systems. We examine the benefits of explicitly pinning each thread to a different core before any data initialization, which improves performance and reduces its variability by ensuring data-to-thread co-location. Execution time variability falls to less than 2%, and for one memory-bound application peak performance increases by over 40%. For applications running on hundreds or thousands of nodes, reducing variability will improve load balance and total application performance. Careful memory and thread placement is critical for successful performance tuning of nodes on a modern scientific compute cluster.},
URL = {http://www.renci.org/publications/techreports/TR-08-07.pdf}
}