Below is a list of technical reports produced by RENCI researchers as part of their ongoing work on projects with collaborators across the nation. Reports are listed by year and number.
-
TR-10-01
Allan Porterfield, Rob Fowler, Min Yeol Lim. RCRTool: Design Document; Version 0.1, Technical Report TR-10-01, RENCI, North Carolina, February 2010.
Abstract: RCRTool, the Resource Centric Reflection Tool, will allow application programmers to better understand resource contention between multiple threads of a single application or between simultaneously active applications sharing varying levels of hardware. The improved knowledge of how the entire system is performing will be available to applications and runtimes for dynamic performance tuning. This document provides some of the motivation and the initial design of the entire system, including access to hardware and OS performance counters, system modeling with that data, APIs that allow access to the data by runtimes and applications, and a data logging facility for post-run analysis.
The design attempts to allow the same tool to be used with a future single shared address node (with tens of cores) and with a distributed memory system with tens of thousands of nodes and hundreds of thousands of cores. The difference between these systems should be confined to differences in which parts of the system are watched for potential bottlenecks and in the granularity of available dynamic feedback.
At the center of RCRTool will be the RCRdaemon. It will have several jobs, including watching the hardware and OS for performance bottlenecks using performance models. RCRTool will supply some models, but mechanisms for the user to add their own will exist. RCRdaemon will also be responsible for transmitting the current state of the system to applications and the OS for dynamic tuning. A third function of the daemon will be logging the information for post-execution analysis.
-
TR-09-04
Daniel Bedard, Allan Porterfield, Rob Fowler, Min Yeol Lim. PowerMon 2: Fine-grained, Integrated Power Measurement, Technical Report TR-09-04, RENCI, North Carolina, October 2009.
Abstract: We describe version 2 of RENCI PowerMon, a device that can be inserted between a computer power supply and the computer’s main board to measure power usage at each of the DC power rails supplying the board. PowerMon 2 provides a capability to collect accurate, frequent, and time-correlated measurements. Since the measurements occur after the AC power supply, this approach eliminates power supply efficiency and time-domain filtering perturbations of the power measurements. PowerMon 2 provides detail about the power consumption of the hardware subsystems connected to each of its eight measurement channels. The device fits in an internal 3.5” hard disk drive bay, thus allowing it to be used in a 1U server chassis. It cost less than $150 per unit to fabricate our small quantity of prototypes.
-
TR-09-03
Chris Bizon, Andreas Prlic. Calculating All Pairwise Similarities from the RCSB Protein Data Bank: Client/Server Work Distribution on the Open Science Grid, Technical Report TR-09-03, RENCI, North Carolina, December 2009.
Abstract: Proteins can have various degrees of similarity. If two proteins show high similarity in their amino acid sequence, it is generally assumed that they are closely evolutionarily related. With increasing evolutionary distance the degree of similarity usually drops, but proteins can still show similar activity in the cell and have an overall similar 3D structure, even if the sequence similarity is low. The detection of such remote similarities is important in order to infer functional and evolutionary relationships between protein families and is a core technique used in protein structure bioinformatics. The goal is to establish regions of equivalence between two or more molecules.
The RCSB Protein Data Bank (PDB) is a leading primary database that provides access to experimentally determined protein structures, nucleic acids, and complex assemblies. PDB is a vital part of the infrastructure supporting biomedical science worldwide and is used by around 200,000 unique scientists per month.
While protein sequence comparisons can be computed quickly, the calculation of protein structure alignments is much more time consuming. The RCSB PDB has recently started to add new tools to the site that allow users to quickly identify protein sequence neighbors and run pairwise protein structure comparisons. In order to allow users to also quickly identify more distant 3D relationships, the goal of this project is to provide a pre-calculated set of all vs. all 3D protein structure alignments.
-
TR-09-02
Jeffrey L. Tilson, Gloria Rendon, and Eric Jakobsson. Algorithms and Performance measurements for MotifNetwork analysis programs, Technical Report TR-09-02, RENCI, North Carolina, July 2009.
Abstract: The MotifNetwork system is a high performance system for the fast scanning and interpretation of large numbers of proteins into their constituent domains. Once transformed into a domain dataset, several levels of analysis such as domain-domain and protein-protein co-location graphs are constructed. These basic data products form the beginning of a comprehensive environment for work in evolutionary processes with particular support for comparative analysis. MotifNetwork is based on a distributed architecture that has evolved into a reasonably secure system and is currently supporting researchers in drug target identification, ion-channels biophysics, functional orthologs, and socio-genomic processes.
To better support a broader research community, a detailed analysis of the performance of several aspects of MotifNetwork is presented. Further, illustrative examples of using the data products in various matrix analyses are provided. Lastly, in conjunction with a recent submission to the BIOCOMP’09 conference, several remaining details on access, usage, and data archiving are summarized, rendering fairly complete the technical details, architecture, and software components of the system as well as expected runtimes and results.
-
TR-09-01
Allan Porterfield, Rob Fowler, Anirban Mandal, Min Yeol Lim. Empirical Evaluation of Multi-Core Memory Concurrency, Technical Report TR-09-01, RENCI, North Carolina, January 2009.
Abstract: Multi-socket, multi-core computers are becoming ubiquitous, especially as nodes in compute clusters of all sizes. Common memory benchmarks and memory performance models treat memory as characterized by well-defined maximum bandwidth and average latency parameters. In contrast, current and future systems are based on deep hierarchies and NUMA memory systems, which are not easily described this simply. Memory performance characterization of multi-socket, multi-core systems requires measurements and models more sophisticated than simple peak bandwidth/minimum latency models. To investigate this issue, we performed a detailed experimental study of the memory performance of a variety of AMD multi-socket quad-core systems. We used the pChase benchmark to generate memory system loads with a variable number of concurrent memory operations in the system across a variable number of threads pinned to specific chips in the system. While processor differences had minor but measurable impact on bandwidth, the make-up and structure of the memory has major impact on achievable bandwidth. Our experiments exposed 3 different bottlenecks at different levels of the hardware architecture: limits on the number of references outstanding per thread; limits to the memory requests serviced by a single memory channel; and limits on the total global memory references outstanding. We discuss the impact of these limits on constraints in tuning code for these systems, the impact on compilers and operating systems, and on future system implementation decisions.
BibTeX:
@TECHREPORT{LFRT2009:tr0901,
AUTHOR = {Allan Porterfield and Rob Fowler and Anirban Mandal and Min Yeol Lim},
TITLE = {Empirical Evaluation of Multi-Core Memory Concurrency},
INSTITUTION = {RENCI},
YEAR = {2009},
NUMBER = {TR-09-01},
ADDRESS = {North Carolina},
MONTH = {January},
ABSTRACT = {Multi-socket, multi-core computers are becoming ubiquitous, especially as nodes in compute clusters of all sizes. Common memory benchmarks and memory performance models treat memory as characterized by well-defined maximum bandwidth and average latency parameters. In contrast, current and future systems are based on deep hierarchies and NUMA memory systems, which are not easily described this simply. Memory performance characterization of multi-socket, multi-core systems requires measurements and models more sophisticated than simple peak bandwidth/minimum latency models. To investigate this issue, we performed a detailed experimental study of the memory performance of a variety of AMD multi-socket quad-core systems. We used the pChase benchmark to generate memory system loads with a variable number of concurrent memory operations in the system across a variable number of threads pinned to specific chips in the system. While processor differences had minor but measurable impact on bandwidth, the make-up and structure of the memory has major impact on achievable bandwidth. Our experiments exposed 3 different bottlenecks at different levels of the hardware architecture: limits on the number of references outstanding per thread; limits to the memory requests serviced by a single memory channel; and limits on the total global memory references outstanding. We discuss the impact of these limits on constraints in tuning code for these systems, the impact on compilers and operating systems, and on future system implementation decisions.},
URL = {http://www.renci.org/publications/techreports/TR-09-01.pdf}
}
-
TR-08-07
Allan Porterfield, Robert J. Fowler, Anirban Mandal and Min Yeol Lim. Performance Consistency on Multi-socket AMD Opteron Systems. Technical Report TR-08-07, RENCI, North Carolina, December 2008.
Abstract: Compute nodes with multiple sockets, each of which has multiple cores, are starting to dominate in the area of scientific computing clusters. Performance inconsistencies from one execution to the next make any performance debugging or tuning difficult. The resulting performance inconsistencies are bigger for memory-bound applications but still noticeable for all but the most compute-intensive applications. Memory and thread placement across sockets has significant impact on performance of these systems. We test overall performance and performance consistency for a number of OpenMP and pthread benchmarks including Stream, pChase, the NAS Parallel Benchmarks and SPEC OMP. The tests are run on a variety of multi-socket quad-core AMD Opteron systems. We examine the benefits of explicitly pinning each thread to a different core before any data initialization, thus improving performance and reducing its variability due to data-to-thread co-location. Execution time variability falls to less than 2% and for one memory-bound application peak performance increases over 40%. For applications running on hundreds or thousands of nodes, reducing variability will improve load balance and total application performance. Careful memory and thread placement is critical for the successful performance tuning of nodes on a modern scientific compute cluster.
BibTeX:
@TECHREPORT{LFRT2008:tr0807,
AUTHOR = {Allan Porterfield and Robert J. Fowler and Anirban Mandal and Min Yeol Lim},
TITLE = {Performance Consistency on Multi-socket AMD Opteron Systems},
INSTITUTION = {RENCI},
YEAR = {2008},
NUMBER = {TR-08-07},
ADDRESS = {North Carolina},
MONTH = {December},
ABSTRACT = {Compute nodes with multiple sockets, each of which has multiple cores, are starting to dominate in the area of scientific computing clusters. Performance inconsistencies from one execution to the next make any performance debugging or tuning difficult. The resulting performance inconsistencies are bigger for memory-bound applications but still noticeable for all but the most compute-intensive applications. Memory and thread placement across sockets has significant impact on performance of these systems. We test overall performance and performance consistency for a number of OpenMP and pthread benchmarks including Stream, pChase, the NAS Parallel Benchmarks and SPEC OMP. The tests are run on a variety of multi-socket quad-core AMD Opteron systems. We examine the benefits of explicitly pinning each thread to a different core before any data initialization, thus improving performance and reducing its variability due to data-to-thread co-location. Execution time variability falls to less than 2% and for one memory-bound application peak performance increases over 40%. For applications running on hundreds or thousands of nodes, reducing variability will improve load balance and total application performance. Careful memory and thread placement is critical for the successful performance tuning of nodes on a modern scientific compute cluster.},
URL = {http://www.renci.org/publications/techreports/TR-08-07.pdf}
}
-
TR-08-06
Peter J. Vickery and Brian O. Blanton. North Carolina Coastal Flood Analysis System Hurricane Parameter Development. Technical Report TR-08-06, RENCI, North Carolina, September 2008.
Abstract: The simulation system for the North Carolina floodplain-mapping project uses a suite of state-of-the-art numerical wind, wave, and surge models to compute stillwater and wave setup elevations along the North Carolina coast. This technical report describes the development of the tropical storm statistical representation. This constitutes Section 5 of Submittal Number One, which the State of North Carolina, Division of Emergency Management has tendered for review to the Federal Emergency Management Agency.
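BibTeX:
@TECHREPORT{LFRT2008:tr0806,
AUTHOR = {Peter J. Vickery and Brian O. Blanton},
TITLE = {North Carolina Coastal Flood Analysis System Hurricane Parameter Development},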
INSTITUTION = {RENCI},
YEAR = {2008},
NUMBER = {TR-08-06},
ADDRESS = {North Carolina},
MONTH = {September},
ABSTRACT = {The simulation system for the North Carolina floodplain-mapping project uses a suite of state-of-the-art numerical wind, wave, and surge models to compute stillwater and wave setup elevations along the North Carolina coast. This technical report describes the development of the tropical storm statistical representation. This constitutes Section 5 of Submittal Number One, which the State of North Carolina, Division of Emergency Management has tendered for review to the Federal Emergency Management Agency.},
URL = {http://www.renci.org/publications/techreports/TR-08-06.pdf}
}
-
TR-08-05
Brian O. Blanton and Richard A. Luettich. North Carolina Coastal Flood Analysis System Model Grid Generation. Technical Report TR-08-05, RENCI, North Carolina, September 2008.
Abstract: The simulation system for the North Carolina floodplain-mapping project uses a suite of state-of-the-art numerical wind, wave, and surge models to compute stillwater and wave setup elevations along the North Carolina coast. Bathymetric and topographic representations in the ADCIRC and SWAN models are derived from the Digital Elevation Model developed as part of the project. This report describes the computational model grids, and in particular details the methods used to generate the unstructured finite-element ADCIRC grid. This report is Section 4 of Submittal Number One, which the State of North Carolina, Division of Emergency Management has tendered for review to the Federal Emergency Management Agency.
BibTeX:
@TECHREPORT{LFRT2008:tr0805,
AUTHOR = {Brian O. Blanton and Richard A. Luettich},
TITLE = {North Carolina Coastal Flood Analysis System Model Grid Generation},
INSTITUTION = {RENCI},
YEAR = {2008},
NUMBER = {TR-08-05},
ADDRESS = {North Carolina},
MONTH = {September},
ABSTRACT = {The simulation system for the North Carolina floodplain-mapping project uses a suite of state-of-the-art numerical wind, wave, and surge models to compute stillwater and wave setup elevations along the North Carolina coast. Bathymetric and topographic representations in the ADCIRC and SWAN models are derived from the Digital Elevation Model developed as part of the project. This report describes the computational model grids, and in particular details the methods used to generate the unstructured finite-element ADCIRC grid. This report is Section 4 of Submittal Number One, which the State of North Carolina, Division of Emergency Management has tendered for review to the Federal Emergency Management Agency.},
URL = {http://www.renci.org/publications/techreports/TR-08-05.pdf}
}
-
TR-08-04
Brian O. Blanton. North Carolina Coastal Flood Analysis System Computational System. Technical Report TR-08-04, RENCI, North Carolina, September 2008.
Abstract: The simulation system for the North Carolina floodplain-mapping project uses a suite of state-of-the-art numerical wind, wave, and surge models to compute stillwater and wave setup elevations along the North Carolina coast. The model suite consists of the Hurricane Boundary Layer (HBL) wind model for tropical storms (hurricanes) and OceanWeather Inc’s Planetary Boundary Layer (PBL) model for extra-tropical storms; the wave-field models WaveWatch3 (WW3) and Simulating Waves Nearshore (SWAN); and the storm surge and tidal model ADvanced CIRCulation Model for Oceanic, Coastal and Estuarine Waters (ADCIRC). This modeling approach is very similar to recent FEMA-sponsored projects in Louisiana and Mississippi. Each model in the system is linked through scripts that manage the simulation process on a high-performance computer at RENCI.
BibTeX:
@TECHREPORT{LFRT2008:tr0804,
AUTHOR = {Brian O. Blanton},
TITLE = {North Carolina Coastal Flood Analysis System Computational System},
INSTITUTION = {RENCI},
YEAR = {2008},
NUMBER = {TR-08-04},
ADDRESS = {North Carolina},
MONTH = {September},
ABSTRACT = {The simulation system for the North Carolina floodplain-mapping project uses a suite of state-of-the-art numerical wind, wave, and surge models to compute stillwater and wave setup elevations along the North Carolina coast. The model suite consists of the Hurricane Boundary Layer (HBL) wind model for tropical storms (hurricanes) and OceanWeather Inc’s Planetary Boundary Layer (PBL) model for extra-tropical storms; the wave-field models WaveWatch3 (WW3) and Simulating Waves Nearshore (SWAN); and the storm surge and tidal model ADvanced CIRCulation Model for Oceanic, Coastal and Estuarine Waters (ADCIRC). This modeling approach is very similar to recent FEMA-sponsored projects in Louisiana and Mississippi. Each model in the system is linked through scripts that manage the simulation process on a high-performance computer at RENCI.},
URL = {http://www.renci.org/publications/techreports/TR-08-04.pdf}
}
-
TR-08-03
Jeffrey L. Tilson, Mark S.C. Reed, and Robert J. Fowler. Workflows for Performance Evaluation and Tuning. Technical Report TR-08-03, RENCI, North Carolina, May 2008.
Abstract: We report our experiences with using high-throughput techniques to run large sets of performance experiments on collections of grid-accessible parallel computer systems for the purpose of deploying optimally compiled and configured scientific applications. In these environments, the set of variable parameters (compiler, link, and runtime flags; application and library options; partition size) can be very large, so running the performance ensembles is labor intensive, tedious, and prone to errors. Automating this process improves productivity, reduces barriers to deploying and maintaining multi-platform codes, and facilitates the tracking of application and system performance over time. We describe the design and implementation of our system for running performance ensembles and we use two case studies as the basis for evaluating the long-term potential for this approach. The architecture of a prototype benchmarking system is presented along with results on the efficacy of the workflow approach.
BibTeX:
@TECHREPORT{LFRT2008:tr0803,
AUTHOR = {Jeffrey L. Tilson and Mark S.C. Reed and Robert J. Fowler},
TITLE = {Workflows for Performance Evaluation and Tuning},
INSTITUTION = {RENCI},
YEAR = {2008},
NUMBER = {TR-08-03},
ADDRESS = {North Carolina},
MONTH = {May},
ABSTRACT = {We report our experiences with using high-throughput techniques to run large sets of performance experiments on collections of grid-accessible parallel computer systems for the purpose of deploying optimally compiled and configured scientific applications. In these environments, the set of variable parameters (compiler, link, and runtime flags; application and library options; partition size) can be very large, so running the performance ensembles is labor intensive, tedious, and prone to errors. Automating this process improves productivity, reduces barriers to deploying and maintaining multi-platform codes, and facilitates the tracking of application and system performance over time. We describe the design and implementation of our system for running performance ensembles and we use two case studies as the basis for evaluating the long-term potential for this approach. The architecture of a prototype benchmarking system is presented along with results on the efficacy of the workflow approach.},
URL = {http://www.renci.org/publications/techreports/TR0803.pdf}
}