Below is a list of technical reports produced by RENCI researchers as part of their ongoing work on projects with collaborators across the nation. Reports are listed by year and number.
-
TR-13-02 Using Semantic Web Description Techniques for Managing Resources in a Multi-Domain Infrastructure-as-a-Service Environment
Yufeng Xin (RENCI), Ilia Baldine (RENCI), Jeff Chase (Duke University) and Kemafor Anyanwu (NCSU), Technical Report TR-13-02, Using Semantic Web Description Techniques for Managing Resources in a Multi-Domain Infrastructure-as-a-Service Environment, Renaissance Computing Institute, 2013.
Abstract: This paper reports on experience with using Semantic Web technologies for managing a multi-domain networking infrastructure-as-a-service (IaaS) testbed. An OWL ontology based on newly created vocabularies was used to model multi-layer network providers with common base classes for fundamental cyber-resources, and adaptation functions from resources at one layer onto resources of the same base class at the layer below. Extended SPARQL path queries supported by GLEEN were used to support topology embedding and resource provisioning for creating connected arrangements of compute, storage and network resources gathered from multiple resource providers.
The context for the work is the use of the semantic models in ORCA, the control software for ExoGENI, a new testbed funded through NSF's GENI project. ExoGENI is a multi-domain cloud testbed with a high degree of control over networking functions, including links within each domain and dynamic inter-domain links over national circuit fabrics. The paper describes how the semantic network models enable ExoGENI to instantiate on-demand virtual topologies of virtual machines linked by on-demand circuits and segments for a variety of applications ranging from networking experiments to high-performance computing.
-
TR-13-01 Enabling Persistent Queries for Cross-aggregate Performance Monitoring
Anirban Mandal, Ilia Baldine, Yufeng Xin, Paul Ruth, Chris Heerman, Technical Report TR-13-01, Enabling Persistent Queries for Cross-aggregate Performance Monitoring, Renaissance Computing Institute, 2013.
Abstract: It is essential for distributed data-intensive applications to monitor the performance of the underlying network, storage and computational resources. Increasingly, distributed applications need performance information from multiple aggregates, and tools need to make real-time steering decisions based on the performance feedback. With increasing scale and complexity, the volume and velocity of monitoring data are increasing, posing scalability challenges. In this work, we have developed a Persistent Query Agent (PQA) that provides real-time application and network performance feedback to clients/applications, thereby enabling dynamic adaptations. PQA enables federated performance monitoring by interacting with multiple aggregates and performance monitoring sources. Using a publish-subscribe framework, it sends triggers asynchronously to applications/clients when relevant performance events occur. The applications/clients register their events of interest using declarative queries and get notified by the PQA. PQA leverages a complex event processing (CEP) framework for managing and executing the queries expressed in a standard SQL-like query language. Instead of saving all monitoring data for future analysis, PQA observes performance event streams in real time, and runs continuous queries over streams of monitoring events. In this work, we present the design and architecture of the persistent query agent, and describe some relevant use cases.
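The register-then-notify pattern described in this abstract can be illustrated with a minimal sketch. It is not the PQA implementation or its query language: the class names (`Event`, `PersistentQuery`, `QueryAgent`), the sliding-window average, and the threshold rule are all assumptions chosen for illustration, and a plain Python callback stands in for the SQL-like continuous query that a CEP framework would run.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List


@dataclass
class Event:
    """One monitoring sample (hypothetical shape, not PQA's schema)."""
    source: str
    metric: str
    value: float


@dataclass
class PersistentQuery:
    """A standing query: fire when the sliding-window mean exceeds a threshold."""
    metric: str
    window: int
    threshold: float
    callback: Callable[[str, float], None]
    values: Deque[float] = field(default_factory=deque)

    def offer(self, event: Event) -> None:
        if event.metric != self.metric:
            return  # not an event this client registered for
        self.values.append(event.value)
        if len(self.values) > self.window:
            self.values.popleft()  # keep only the most recent samples
        if len(self.values) == self.window:
            mean = sum(self.values) / self.window
            if mean > self.threshold:
                self.callback(self.metric, mean)  # asynchronous in a real agent


class QueryAgent:
    """Evaluates every registered query against each event as it arrives."""

    def __init__(self) -> None:
        self.queries: List[PersistentQuery] = []

    def register(self, query: PersistentQuery) -> None:
        self.queries.append(query)

    def publish(self, event: Event) -> None:
        # Events are evaluated on arrival and then discarded, not stored.
        for query in self.queries:
            query.offer(event)
```

A client would call `register` once with its callback and then receive notifications as a monitoring source calls `publish`; no event history is kept beyond each query's own window.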
-
TR-12-02 Adaptive Scheduling Using Performance Introspection
Allan Porterfield, Rob Fowler, Anirban Mandal, David O’Brien, Stephen L. Olivier, Michael Spiegel, Technical Report TR-12-02, Adaptive Scheduling Using Performance Introspection, Renaissance Computing Institute, 2012.
Abstract: As energy becomes a driving force in High Performance Computing, determining when and how energy can be saved without impacting performance is a key goal for both HPC hardware and software. Scalability studies have shown that some memory-bound applications do not scale as the thread count increases, and in some cases performance degrades. Adaptive Scheduling recognizes when an application is in a memory-bound region and throttles the number of active hardware threads. Our RCRdaemon tool acquires hardware performance counter measurements in near real time. A simple hardware model added to the Qthreads runtime system reads the collected data to determine when memory contention exists. Using that information, our extension to the Qthreads scheduler reduces contention by throttling hardware threads. Adaptive Scheduling has very low performance impact both for memory-bound benchmarks (below 4.2%) and for compute-bound benchmarks (2.4% to 3.7%).
For these techniques to reduce energy costs, additional hardware energy features will be required. Applications using Adaptive Scheduling can transition from memory-bound to compute-bound regions hundreds of times a second. Hardware mechanisms or instructions to allow energy savings during the short memory-bound regions could be used effectively by multithreaded software to reduce the overall power requirements for memory-bound applications.
-
TR-12-01 Designing Smart Camera Networks using Smartphone Platforms: A Case Study
Yusuf Simonson, Robert Fowler, Edgar Lobaton, Ron Alterovitz. Technical Report TR-12-01, Designing Smart Camera Networks using Smartphone Platforms: A Case Study, Renaissance Computing Institute, The Department of Computer Science of The University of North Carolina at Chapel Hill, January 2012.
Abstract: We believe that new smartphone architectures like Android and iOS will play increasingly important roles in sensor network design. Because of this, we wish to investigate the state of development for sensor network-based applications on the Android framework. We introduce an application for Android that allows for the semi-autonomous remote control of Rovio robots. It coordinates with a sensor network of cameras to provide a live stream of camera images and predicted robot locations. Furthermore, it provides functionality for moving the robot to user-selected destinations without the need for manual control. The application acts as a straw man for working with traditional sensor network architectures, and provides important insight on some of the challenges related to sensor network development, especially on the Android platform. We outline these challenges and provide some suggestions on semantics that could alleviate development effort.
-
TR-11-04 Communicating Coastal Risk Analysis in an Age of Climate Change
Brian Blanton, John McGee, Oleg Kapeljushnik, Technical Report TR-11-04, Communicating Coastal Risk Analysis in an Age of Climate Change, Renaissance Computing Institute, 2011.
Abstract: Complex science and large volumes of disparate data required for risk analysis of coastal hazards can be very difficult to communicate effectively to government and business decision makers. Including potential future scenarios as a result of climate change complicates matters further. An immersive visualization environment that integrates data from high resolution imagery, sensed and measured data, model output, and more, and that can scale from the desktop to large dome theater venues, demonstrates promise for greatly enhancing the impact and reach of these scientific endeavors.
-
TR-11-03 Geoanalytics
Jeff R. Heard. Technical Report TR-11-03, Geoanalytics, July 2011.
Abstract: The pressures of producing science with global relevance and global impact have made understanding and using geographic data essential to a large portion of scientific research. Geographic information systems live at the heart of projects in public health, environmental science, policy and government, situational awareness, and others. The need for geographic data has also become a Big Data need. Datasets essential to these projects are often terabytes in size, or they are rapidly evolving streams of complex data.
The tools available to professionals looking to do things with geographic data have not grown to meet the Big Data problem. Traditional GIS allows researchers to build custom databases with analytics. Google Maps allows them to publish data to the web. Various open source tools exist for specialized and general GIS purposes. As of yet, the world is without a compelling infrastructure for integrating these. As a result, geographic solutions to scientific and social problems are often cobbled together, and the results are isolated silos that cannot be easily integrated or adapted to new and different data.
Traditional GIS solutions like ArcGIS and GRASS allow a user to import a number of maps and work with them as a project, doing complex analysis, but the results of this are offline, or at the very least relatively static. There exist “onlining” modules for these, but they are built on a pre-web paradigm whose thinking pervades the online experience. Modern users expect integration, mashups, and web-based application platforms that include data “at the bleeding edge of now.”
What we will call “second generation” solutions, like Google Maps and Google App Engine, allow a user to quickly create a map without prior training, requiring only a text editor, a web browser, and some patience. These solutions, in their simplicity, abstract away functionality that is needed for serious scientific analysis. They must be supplemented with other tools, they require data to be imported into formats tailored towards visual presentation, and they do not preserve the data for analysis. This adds complexity to the scientific process and discourages the sharing of source data.
-
TR-11-02 Scheduling OpenMP for Qthreads with MAESTRO
Allan Porterfield, Rob Fowler, Paul Horst, David O’Brien, Stephen Olivier, Kyle Wheeler, Brad Viviano, Technical Report TR-11-02, Scheduling OpenMP for Qthreads with MAESTRO, Renaissance Computing Institute, September, 2011.
Abstract: Obtaining good performance from modern multi- and many-core processors requires understanding the dynamic performance of the resources shared by multiple cores. Performance of single-core systems only requires understanding the way that threads interact with the core on which they are executing. Multi- and many-core systems have complicated this by adding various resources (e.g. L3 cache, I/O, network access) that are shared by multiple cores. The usage of these resources may be influenced by other threads within an application or by other concurrent programs. Efficient scheduling will require a dynamic scheduler. Fortunately, the increase of cores (and increasing frequency of non-ALU bottlenecks) provides the scheduler an opportunity to acquire the resources necessary to do dynamic monitoring and modeling.
MAESTRO includes a scheduler for the Qthreads runtime to explore any potential benefits from dynamic performance monitoring and modeling on application performance. The idea is to use computational resources that would otherwise be idle (because of memory bottlenecks) to measure and model system performance. MAESTRO implements an experimental scheduler on top of the Qthreads runtime. The scheduler communicates with a dynamic performance model to understand the dynamic state of the system. Scheduling decisions use that knowledge to better determine which threads should be executing and where. Qthreads already has a concept of locality (shepherds) to reason about shared resources. Building on the shepherd concept, MAESTRO supports hierarchical work stealing both intra- and inter-shepherd, improving dynamic cache hit rates while reducing the number of expensive remote steal operations.
MAESTRO also extended Qthreads with the XOMP interface. XOMP is generated by the ROSE source-to-source translator to handle OpenMP (version 3.0) input files. The ROSE/Qthreads extension allows most C and C++ OpenMP applications to use the Qthreads runtime.
-
TR-11-01 Secure Medical Research Workspace
Phillips Owen, Michael Shoffner, Xiaoshu Wang, Charles Schmitt, Brent Lamm, Javed Mostafa. Report on SMRW, Technical Report TR-11-01, Secure Medical Research Workspace, February 2011.
Abstract: SMRW, the Secure Medical Research Workspace, is a comprehensive solution to protect Electronic Health Records (EHR). The SMRW utilizes virtualization technologies to facilitate the setup, data provisioning, management, and tear down of protected virtual workspaces. The virtual workspace will incorporate Data Loss Prevention (DLP) technologies and techniques to prevent unauthorized use and transmission of data in order to maintain compliance with institutional policies and HIPAA data regulations.
-
TR-10-06 North Carolina Floodplain Mapping Program Coastal Flood Insurance Study
Brian Blanton, Rick Luettich, Technical Report TR-10-06, North Carolina Floodplain Mapping Program Coastal Flood Insurance Study, Renaissance Computing Institute, 2012.
Abstract: In this section, water levels and inundation produced by the ADCIRC model are compared to available data from NOAA gauge stations and high water marks for four tropical cyclones and two extratropical storms. Results are presented for the ADCIRC model response to tidal forcing, atmospheric pressure forcing and wind stress forcing (SWEL) and for the ADCIRC response to these forcings plus the additional forcing caused by wave radiation stress gradients (SWEL+SETUP). This set of six storms provides a reasonably comprehensive evaluation of the ADCIRC model along the North Carolina coast.
-
TR-10-05 Report on Engage Survey
James Howison and Jim Herbsleb. Report on Engage Survey, Technical Report TR-10-05, Carnegie Mellon Computer Science, February 2010.
Abstract: The VOSS SciSoft research project at CMU conducted a survey of the Engage VO's contacts. The survey was prepared by James Howison, with input from Jim Herbsleb and John McGee, from Engage. The individually identifiable results are confidential to CMU (this was done to ensure participants were comfortable speaking honestly on the survey) and this report thus avoids identifying individual responses. The report presents an overall summary of the respondents' answers to the questions, using quotes where they are not personally identifiable.
-
TR-10-04 NARA Transcontinental Persistent Archive Prototype
Reagan W. Moore, Arcot Rajasekar, Antoine de Torcy, Mike Conway, Jewel Ward, Jon Crabtree, Mason Chua, Wayne Schroeder, Michael Wan, Sheau-Yen Chen. NARA Transcontinental Persistent Archive Prototype, Technical Report TR-10-04, RENCI, North Carolina, October 2010.
Abstract: The concepts required for preservation are explored within the NARA TPAP testbed and tested on selected NARA digital holdings. The TPAP testbed implements a policy-based preservation environment based on iRODS, the integrated Rule-Oriented Data System. All aspects of the preservation environment are explored, including development of preservation policies, migration to new technology, automation of administrative functions, and validation of assessment criteria.
-
TR-10-03 Effects of Multi-core Memory Concurrency Limits on Multi-threaded Applications
Anirban Mandal, Min Yeol Lim, Allan Porterfield, Rob Fowler. Effects of Multi-core Memory Concurrency Limits on Multi-threaded Applications, Technical Report TR-10-03, RENCI, North Carolina, September 2010.
Abstract: Memory access is becoming an increasingly significant impediment to extracting performance out of multi-core systems. More than ever, the effectiveness of memory system use by an application is becoming a critical determinant of performance. In previous work, we demonstrated how explicit consideration of memory concurrency provides a better model for memory performance on multi-socket, multi-core systems than just using best case latency and bandwidth. This paper investigates some of the implications of this on application structure and compiler optimization. We developed a methodology to use hardware performance counters in a performance reflection tool, RCRTool, to measure achieved memory concurrency. We applied this to several important memory-bound scientific applications and kernels compiled with varying levels of optimization. We convolve the observed application concurrency with available system memory concurrency to derive insights for compilers and application tuners. The models provide compilers and runtimes with information about how load on the memory sub-system changes the effectiveness of various optimizations. As the number of hardware cores/threads increases, and as off-chip memory bandwidth per core remains constant or decreases, these measurements and analysis can provide insights to compiler and application writers. For example, on highly-threaded systems the system can be saturated if each software thread offers only 2 or 3 concurrent memory references. The implication is that optimizations to improve cache usage are more important than ever, while program transformations designed to increase memory concurrency may lose their utility.
-
TR-10-02 Using high performance computing and domain-based functional annotation of proteins to enhance discovery of novel proteins, identify functional homology, and characterize phylogenetic relatedness
Jeffrey L Tilson, Gloria Rendon, Eric Jakobsson. Using high performance computing and domain-based functional annotation of proteins to enhance discovery of novel proteins, identify functional homology, and characterize phylogenetic relatedness, Technical Report TR-10-02, RENCI, North Carolina, June 2010.
Abstract:
Background
Next generation sequencing technology is putting significant pressure on computational researchers to implement software tools for analysis (identification, annotation, homology/orthology assignment, phylogeny, etc.) of the genes and gene products “on-the-fly” in parallel with the sequencing machines. This requires both leveraging supercomputing systems and alternative kinds of analyses. We seek to contribute to the solution of these problems through the deployment of high speed explicitly functional domain-based solutions through the system called MotifNetwork. We present select case studies of domain-based approaches to gene analysis that range from homology assessment to phylogeny reconstruction to pangenomic analysis as a demonstration of potential benefits of such approaches. For analyses, we used grid-computing to enable the computations necessary to apply these techniques to genome-size systems.
Results
We used MotifNetwork to apply functional domain-based methods to three biological test cases that represent broad biological areas of research.
First, we assess functional homology of over 3000 eukaryotic proteins with respect to the ligand-gated ion channel family by calculating domain-based similarity of genes with four different metrics: distinct-partners, inverse document coefficients, cumulative association coefficients, and the Jaccard function.
Second, we illustrate a methodology for predicting phylogenetic relatedness based on evolutionary domain analysis. It is applied to over 40 prokaryotic proteins that were identified as likely functional homologs with respect to the same family of ion channels.
Lastly, comparative genomics studies are conducted between H. sapiens and 23 different strains of E. coli. The domain-based pangenome of E. coli is analyzed and compared against that of H. sapiens in a context of drug target identification and potential side effects.
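Of the four similarity metrics named in the first test case, the Jaccard function has a standard definition: the size of the intersection of two sets divided by the size of their union. The sketch below applies it to the sets of domains found in two proteins. It is an illustration only, not MotifNetwork code; the function name `jaccard`, the treatment of two empty sets as similarity 0.0, and the placeholder domain labels are assumptions.

```python
from typing import Iterable


def jaccard(domains_a: Iterable[str], domains_b: Iterable[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two domain sets."""
    a, b = set(domains_a), set(domains_b)
    if not a and not b:
        return 0.0  # assumption: two proteins with no domains are not similar
    return len(a & b) / len(a | b)


# Placeholder domain labels, not real annotations.
protein_1 = {"domA", "domB", "domC"}
protein_2 = {"domB", "domC", "domD"}
similarity = jaccard(protein_1, protein_2)  # 2 shared / 4 total = 0.5
```

Because the score depends only on which domains are present, it ignores domain order and copy number; a metric that weights rare domains more heavily would need additional information.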
Benchmarks of MotifNetwork indicate that execution times achieve reasonable performance scaling when using up to 256 processors available to this work and that our use of a data-grid for storage of the results, as implemented with iRODS, is well-suited for large-scale biological pipelines.
Conclusions
The combination of domain-based analyses and fast processing enabled by MotifNetwork should permit researchers to more accurately and efficiently perform research on a wide range of biological problems and thus alleviate the bottlenecks that now exist between sequencing of genes and their subsequent characterization. Our approach is especially suitable for biological problems that can be formulated as the identification of functional correspondences among a large set of proteins, such as the three illustrative examples discussed in the paper, which range from E. coli pangenomics to functional homology and phylogenetic relatedness of the LIC family of ion channels.
-
TR-10-01 RCRTool: Design Document; Version 0.1
Allan Porterfield, Rob Fowler, Min Yeol Lim. RCRTool: Design Document; Version 0.1, Technical Report TR-10-01, RENCI, North Carolina, February 2010.
Abstract: RCRTool, Resource Centric Reflection Tool, will allow application programmers to better understand resource contention between multiple threads of a single application or between simultaneously active applications sharing varying levels of hardware. The improved knowledge of how the entire system is performing will be available to applications and runtimes for dynamic performance tuning. This document provides some of the motivation and the initial design of the entire system, including access to hardware and OS performance counters, system modeling with that data, APIs that allow access to the data by runtimes and applications, and a data logging facility for post-run analysis.
The design attempts to allow the same tool to be used with a future single shared address node (with tens of cores) and with a distributed memory system with tens of thousands of nodes and hundreds of thousands of cores. The difference between these systems should be confined to differences in which parts of the system are watched for potential bottlenecks and in the granularity of available dynamic feedback.
At the center of RCRTool will be the RCRdaemon. It will have several jobs, including watching the hardware and OS for performance bottlenecks using performance models. RCRTool will supply some models, but mechanisms for the user to add their own will exist. RCRdaemon will also be responsible for transmitting the current state of the system to applications and the OS for dynamic tuning. A third function of the daemon will be logging the information for post-execution analysis.
-
TR-09-04 PowerMon 2: Fine-grained, Integrated Power Measurement
Daniel Bedard, Allan Porterfield, Rob Fowler, Min Yeol Lim. PowerMon 2: Fine-grained, Integrated Power Measurement, Technical Report TR-09-04, RENCI, North Carolina, October 2009.
Abstract: We describe version 2 of RENCI PowerMon, a device that can be inserted between a computer power supply and the computer’s main board to measure power usage at each of the DC power rails supplying the board. PowerMon 2 provides a capability to collect accurate, frequent, and time-correlated measurements. Since the measurements occur after the AC power supply, this approach eliminates power supply efficiency and time-domain filtering perturbations of the power measurements. PowerMon 2 provides detail about the power consumption of the hardware subsystems connected to each of its eight measurement channels. The device fits in an internal 3.5” hard disk drive bay, thus allowing it to be used in a 1U server chassis. It cost less than $150 per unit to fabricate our small quantity of prototypes.
-
TR-09-03 Calculating All Pairwise Similarities from the RCSB Protein Data Bank: Client/Server Work Distribution on the Open Science Grid
Chris Bizon, Andreas Prlic. Calculating All Pairwise Similarities from the RCSB Protein Data Bank: Client/Server Work Distribution on the Open Science Grid, Technical Report TR-09-03, RENCI, North Carolina, December 2009.
Abstract: Proteins can have various degrees of similarity. If two proteins show high similarity in their amino acid sequence, it is generally assumed that they are closely evolutionarily related. With increasing evolutionary distance the degree of similarity usually drops, but proteins can still show similar activity in the cell and have an overall similar 3D structure, even if the sequence similarity is low. The detection of such remote similarities is important in order to infer functional and evolutionary relationships between protein families and is a core technique used in protein structure bioinformatics. The goal is to establish regions of equivalence between two or more molecules.
The RCSB Protein Data Bank (PDB) is a leading primary database that provides access to experimentally determined protein structures, nucleic acids, and complex assemblies. PDB is a vital part of the infrastructure supporting biomedical science worldwide and is used by around 200,000 unique scientists per month.
While protein sequence comparisons can be computed quickly, the calculation of protein structure alignments is much more time consuming. The RCSB PDB has recently started to add new tools to the site that allow users to quickly identify protein sequence neighbors and run pairwise protein structure comparisons. In order to allow users to also quickly identify more distant 3D relationships, the goal of this project is to provide a pre-calculated set of all vs. all 3D protein structure alignments.
-
TR-09-02 Algorithms and Performance measurements for MotifNetwork analysis programs
Jeffrey L. Tilson, Gloria Rendon, and Eric Jakobsson. Algorithms and Performance measurements for MotifNetwork analysis programs, Technical Report TR-09-02, RENCI, North Carolina, July 2009.
Abstract: The MotifNetwork system is a high performance system for the fast scanning and interpretation of large numbers of proteins into their constituent domains. Once the proteins are transformed into a domain dataset, several levels of analysis such as domain-domain and protein-protein co-location graphs are constructed. These basic data products form the beginning of a comprehensive environment for work in evolutionary processes with particular support for comparative analysis. MotifNetwork is based on a distributed architecture that has evolved into a reasonably secure system and is currently supporting researchers in drug target identification, ion-channel biophysics, functional orthologs, and socio-genomic processes.
To better support a broader research community, detailed analysis of the performance of several aspects of MotifNetwork is presented. Further, illustrative examples of using the data products in various matrix analyses are provided. Lastly, in conjunction with a recent submission to the BIOCOMP’09 conference, several remaining details on access, usage, and data archiving are summarized, rendering fairly complete the technical details, architecture, and software components of the system as well as expected runtimes and results.
-
TR-09-01 Empirical Evaluation of Multi-Core Memory Concurrency
Allan Porterfield, Rob Fowler, Anirban Mandal, Min Yeol Lim. Empirical Evaluation of Multi-Core Memory Concurrency, Technical Report TR-09-01, RENCI, North Carolina, January 2009.
Abstract: Multi-socket, multi-core computers are becoming ubiquitous, especially as nodes in compute clusters of all sizes. Common memory benchmarks and memory performance models treat memory as characterized by well-defined maximum bandwidth and average latency parameters. In contrast, current and future systems are based on deep hierarchies and NUMA memory systems, which are not easily described this simply. Memory performance characterization of multi-socket, multi-core systems requires measurements and models more sophisticated than simple peak bandwidth/minimum latency models. To investigate this issue, we performed a detailed experimental study of the memory performance of a variety of AMD multi-socket quad-core systems. We used the pChase benchmark to generate memory system loads with a variable number of concurrent memory operations in the system across a variable number of threads pinned to specific chips in the system. While processor differences had minor but measurable impact on bandwidth, the make-up and structure of the memory have a major impact on achievable bandwidth. Our experiments exposed 3 different bottlenecks at different levels of the hardware architecture: limits on the number of references outstanding per thread; limits to the memory requests serviced by a single memory channel; and limits on the total global memory references outstanding. We discuss the impact of these limits on constraints in tuning code for these systems, the impact on compilers and operating systems, and on future system implementation decisions.
BibTeX:
@TECHREPORT{LFRT2009:tr0901,
AUTHOR = {Allan Porterfield and Rob Fowler and Anirban Mandal and Min Yeol Lim},
TITLE = {Empirical Evaluation of Multi-Core Memory Concurrency},
INSTITUTION = {RENCI},
YEAR = {2009},
NUMBER = {TR-09-01},
ADDRESS = {North Carolina},
MONTH = {January},
ABSTRACT = {Multi-socket, multi-core computers are becoming ubiquitous, especially as nodes in compute clusters of all sizes. Common memory benchmarks and memory performance models treat memory as characterized by well-defined maximum bandwidth and average latency parameters. In contrast, current and future systems are based on deep hierarchies and NUMA memory systems, which are not easily described this simply. Memory performance characterization of multi-socket, multi-core systems requires measurements and models more sophisticated than simple peak bandwidth/minimum latency models. To investigate this issue, we performed a detailed experimental study of the memory performance of a variety of AMD multi-socket quad-core systems. We used the pChase benchmark to generate memory system loads with a variable number of concurrent memory operations in the system across a variable number of threads pinned to specific chips in the system. While processor differences had minor but measurable impact on bandwidth, the make-up and structure of the memory have a major impact on achievable bandwidth. Our experiments exposed 3 different bottlenecks at different levels of the hardware architecture: limits on the number of references outstanding per thread; limits to the memory requests serviced by a single memory channel; and limits on the total global memory references outstanding. We discuss the impact of these limits on constraints in tuning code for these systems, the impact on compilers and operating systems, and on future system implementation decisions.},
URL = {http://www.renci.org/publications/techreports/TR-09-01.pdf}
}
-
TR-08-07 Performance Consistency on Multi-socket AMD Opteron Systems
Allan Porterfield, Robert J. Fowler, Anirban Mandal and Min Yeol Lim. Performance Consistency on Multi-socket AMD Opteron Systems. Technical Report TR-08-07, RENCI, North Carolina, December 2008.
AbstractCompute nodes with multiple sockets each of which has multiple cores are starting to dominate in the area of scientific computing clusters. Performance inconsistencies from one execution to the next makes any performance debugging or tuning difficult. The resulting performance inconsistencies are bigger for memory-bound applications but still noticeable for all but the most compute-intensive applications. Memory and thread placement across sockets has significant impact on performance of these systems. We test overall performance and performance consistency for a number of OpenMP and pthread benchmarks including Stream, pChase , the NAS Parallel Benchmarks and SPEC OMP. The tests are run on a variety of multi-socket quad-core AMD Opteron systems. We examine the benefits of explicitly pinning each thread to a different core before any data initialization, thus improving and reducing the variability of performance due to data-to-thread co-location. Execution time variability falls to less than 2% and for one memory-bound application peak performance increases over 40%. For applications running on hundreds or thousands of nodes, reducing variability will improve load balance and total application performance. Careful memory and thread placement is critical for the successful performance tuning of nodes on a modern scientific compute cluster.
BibTeX:
@TECHREPORT{LFRT2008:tr0807,
AUTHOR = {Allan Porterfield and Robert J. Fowler and Anirban Mandal and Min Yeol Lim},
TITLE = {Performance Consistency on Multi-socket AMD Opteron Systems},
INSTITUTION = {RENCI},
YEAR = {2008},
NUMBER = {TR-08-07},
ADDRESS = {North Carolina},
MONTH = {December},
ABSTRACT = {Compute nodes with multiple sockets, each of which has multiple cores, are starting to dominate in the area of scientific computing clusters. Performance inconsistencies from one execution to the next make any performance debugging or tuning difficult. The resulting performance inconsistencies are bigger for memory-bound applications but still noticeable for all but the most compute-intensive applications. Memory and thread placement across sockets have a significant impact on the performance of these systems. We test overall performance and performance consistency for a number of OpenMP and pthread benchmarks including Stream, pChase, the NAS Parallel Benchmarks and SPEC OMP. The tests are run on a variety of multi-socket quad-core AMD Opteron systems. We examine the benefits of explicitly pinning each thread to a different core before any data initialization, thus improving and reducing the variability of performance due to data-to-thread co-location. Execution time variability falls to less than 2%, and for one memory-bound application peak performance increases by over 40%. For applications running on hundreds or thousands of nodes, reducing variability will improve load balance and total application performance. Careful memory and thread placement is critical for the successful performance tuning of nodes on a modern scientific compute cluster.},
URL = {http://www.renci.org/publications/techreports/TR-08-07.pdf}
} -
TR-08-06 North Carolina Coastal Flood Analysis System Hurricane Parameter Development
Peter J. Vickery and Brian O. Blanton. North Carolina Coastal Flood Analysis System Hurricane Parameter Development. Technical Report TR-08-06, RENCI, North Carolina, September 2008.
Abstract: The simulation system for the North Carolina floodplain-mapping project uses a suite of state-of-the-art numerical wind, wave, and surge models to compute stillwater and wave setup elevations along the North Carolina coast. This technical report describes the development of the tropical storm statistical representation. This constitutes Section 5 of Submittal Number One, which the State of North Carolina, Division of Emergency Management has tendered for review to the Federal Emergency Management Agency.
BibTeX:
@TECHREPORT{LFRT2008:tr0806,
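AUTHOR = {Peter J. Vickery and Brian O. Blanton},
TITLE = {North Carolina Coastal Flood Analysis System Hurricane Parameter Development},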
INSTITUTION = {RENCI},
YEAR = {2008},
NUMBER = {TR-08-06},
ADDRESS = {North Carolina},
MONTH = {September},
ABSTRACT = {The simulation system for the North Carolina floodplain-mapping project uses a suite of state-of-the-art numerical wind, wave, and surge models to compute stillwater and wave setup elevations along the North Carolina coast. This technical report describes the development of the tropical storm statistical representation. This constitutes Section 5 of Submittal Number One, which the State of North Carolina, Division of Emergency Management has tendered for review to the Federal Emergency Management Agency.},
URL = {http://www.renci.org/publications/techreports/TR-08-06.pdf}
} -
TR-08-05 North Carolina Coastal Flood Analysis System Model Grid Generation
Brian O. Blanton and Richard A. Luettich. North Carolina Coastal Flood Analysis System Model Grid Generation. Technical Report TR-08-05, RENCI, North Carolina, September 2008.
Abstract: The simulation system for the North Carolina floodplain-mapping project uses a suite of state-of-the-art numerical wind, wave, and surge models to compute stillwater and wave setup elevations along the North Carolina coast. Bathymetric and topographic representations in the ADCIRC and SWAN models are derived from the Digital Elevation Model developed as part of the project. This report describes the computational model grids, and in particular details the methods used to generate the unstructured finite-element ADCIRC grid. This report is Section 4 of Submittal Number One, which the State of North Carolina, Division of Emergency Management has tendered for review to the Federal Emergency Management Agency.
BibTeX:
@TECHREPORT{LFRT2008:tr0805,
AUTHOR = {Brian O. Blanton and Richard A. Luettich},
TITLE = {North Carolina Coastal Flood Analysis System Model Grid Generation},
INSTITUTION = {RENCI},
YEAR = {2008},
NUMBER = {TR-08-05},
ADDRESS = {North Carolina},
MONTH = {September},
ABSTRACT = {The simulation system for the North Carolina floodplain-mapping project uses a suite of state-of-the-art numerical wind, wave, and surge models to compute stillwater and wave setup elevations along the North Carolina coast. Bathymetric and topographic representations in the ADCIRC and SWAN models are derived from the Digital Elevation Model developed as part of the project. This report describes the computational model grids, and in particular details the methods used to generate the unstructured finite-element ADCIRC grid. This report is Section 4 of Submittal Number One, which the State of North Carolina, Division of Emergency Management has tendered for review to the Federal Emergency Management Agency.},
URL = {http://www.renci.org/publications/techreports/TR-08-05.pdf}
} -
TR-08-04 North Carolina Coastal Flood Analysis System Computational System
Brian O. Blanton. North Carolina Coastal Flood Analysis System Computational System. Technical Report TR-08-04, RENCI, North Carolina, September 2008.
Abstract: The simulation system for the North Carolina floodplain-mapping project uses a suite of state-of-the-art numerical wind, wave, and surge models to compute stillwater and wave setup elevations along the North Carolina coast. The model suite consists of the Hurricane Boundary Layer (HBL) wind model for tropical storms (hurricanes) and OceanWeather Inc’s Planetary Boundary Layer (PBL) model for extra-tropical storms; the wave-field models WaveWatch3 (WW3) and Simulating Waves Nearshore (SWAN); and the storm surge and tidal model ADvanced CIRCulation Model for Oceanic, Coastal and Estuarine Waters (ADCIRC). This modeling approach is very similar to recent FEMA-sponsored projects in Louisiana and Mississippi. Each model in the system is linked through scripts that manage the simulation process on a high-performance computer at RENCI.
BibTeX:
@TECHREPORT{LFRT2008:tr0804,
AUTHOR = {Brian O. Blanton},
TITLE = {North Carolina Coastal Flood Analysis System Computational System},
INSTITUTION = {RENCI},
YEAR = {2008},
NUMBER = {TR-08-04},
ADDRESS = {North Carolina},
MONTH = {September},
ABSTRACT = {The simulation system for the North Carolina floodplain-mapping project uses a suite of state-of-the-art numerical wind, wave, and surge models to compute stillwater and wave setup elevations along the North Carolina coast. The model suite consists of the Hurricane Boundary Layer (HBL) wind model for tropical storms (hurricanes) and OceanWeather Inc’s Planetary Boundary Layer (PBL) model for extra-tropical storms; the wave-field models WaveWatch3 (WW3) and Simulating Waves Nearshore (SWAN); and the storm surge and tidal model ADvanced CIRCulation Model for Oceanic, Coastal and Estuarine Waters (ADCIRC). This modeling approach is very similar to recent FEMA-sponsored projects in Louisiana and Mississippi. Each model in the system is linked through scripts that manage the simulation process on a high-performance computer at RENCI.},
URL = {http://www.renci.org/publications/techreports/TR-08-04.pdf}
} -
TR-08-03 Workflows for Performance Evaluation and Tuning
Jeffrey L. Tilson, Mark S.C. Reed, and Robert J. Fowler. Workflows for Performance Evaluation and Tuning. Technical Report TR-08-03, RENCI, North Carolina, May 2008.
Abstract: We report our experiences with using high-throughput techniques to run large sets of performance experiments on collections of grid-accessible parallel computer systems for the purpose of deploying optimally compiled and configured scientific applications. In these environments, the set of variable parameters (compiler, link, and runtime flags; application and library options; partition size) can be very large, so running the performance ensembles is labor-intensive, tedious, and prone to errors. Automating this process improves productivity, reduces barriers to deploying and maintaining multi-platform codes, and facilitates the tracking of application and system performance over time. We describe the design and implementation of our system for running performance ensembles, and we use two case studies as the basis for evaluating the long-term potential of this approach. The architecture of a prototype benchmarking system is presented along with results on the efficacy of the workflow approach.
BibTeX:
@TECHREPORT{LFRT2008:tr0803,
AUTHOR = {Jeffrey L. Tilson and Mark S.C. Reed and Robert J. Fowler},
TITLE = {Workflows for Performance Evaluation and Tuning},
INSTITUTION = {RENCI},
YEAR = {2008},
NUMBER = {TR-08-03},
ADDRESS = {North Carolina},
MONTH = {May},
ABSTRACT = {We report our experiences with using high-throughput techniques to run large sets of performance experiments on collections of grid-accessible parallel computer systems for the purpose of deploying optimally compiled and configured scientific applications. In these environments, the set of variable parameters (compiler, link, and runtime flags; application and library options; partition size) can be very large, so running the performance ensembles is labor-intensive, tedious, and prone to errors. Automating this process improves productivity, reduces barriers to deploying and maintaining multi-platform codes, and facilitates the tracking of application and system performance over time. We describe the design and implementation of our system for running performance ensembles, and we use two case studies as the basis for evaluating the long-term potential of this approach. The architecture of a prototype benchmarking system is presented along with results on the efficacy of the workflow approach.},
URL = {http://www.renci.org/publications/techreports/TR0803.pdf}
} -
TR-08-02 Stateful grid resource selection for related asynchronous tasks
Howard M. Lander, Robert J. Fowler, Lavanya Ramakrishnan, and Steven R. Thorpe. Stateful grid resource selection for related asynchronous tasks. Technical Report TR-08-02, RENCI, North Carolina, April 2008.
Abstract: In today’s grid deployments, resource selection is based on prior knowledge of the performance characteristics of the application on a particular resource and on the real-time monitoring status of the resource, such as load on the system, network bandwidth, etc. Any lag between a resource selection decision and the time the job appears in the system’s monitoring facility will cause subsequent decisions to be based on incorrect information. If two or more jobs arrive within this hysteresis window, the incorrect assessment of system state can have negative consequences on job response time and system throughput. In this paper we describe a stateful resource selection protocol we designed to mitigate this problem for a real-time storm surge modeling project. We present results from real experiments on a regional grid. We use emulation to compare and study the effect of our protocol under varying load conditions. Based on our evaluation, we argue that the enhanced protocol should be made available as a globally-aware grid resource selection service.
BibTeX:
@TECHREPORT{LFRT2008:tr0802,
AUTHOR = {Howard M. Lander and Robert J. Fowler and Lavanya Ramakrishnan and Steven R. Thorpe},
TITLE = {Stateful Grid Resource Selection for Related Asynchronous Tasks},
INSTITUTION = {RENCI},
YEAR = {2008},
NUMBER = {TR-08-02},
ADDRESS = {North Carolina},
MONTH = {April},
OPTNOTE = {also submitted for publication},
ABSTRACT = {In today’s grid deployments, resource selection is based on prior knowledge of the performance characteristics of the application on a particular resource and on the real-time monitoring status of the resource, such as load on the system, network bandwidth, etc. Any lag between a resource selection decision and the time the job appears in the system’s monitoring facility will cause subsequent decisions to be based on incorrect information. If two or more jobs arrive within this hysteresis window, the incorrect assessment of system state can have negative consequences on job response time and system throughput. In this paper we describe a stateful resource selection protocol we designed to mitigate this problem for a real-time storm surge modeling project. We present results from real experiments on a regional grid. We use emulation to compare and study the effect of our protocol under varying load conditions. Based on our evaluation, we argue that the enhanced protocol should be made available as a globally-aware grid resource selection service.},
URL = {http://www.renci.org/publications/techreports/TR0802.pdf}
} -
TR-08-01 MAESTRO: Program Thread and Synchronization Interface (version 0.1)
Allan Porterfield. MAESTRO: Program Thread and Synchronization Interface (version 0.1). Technical Report TR-08-01, RENCI, North Carolina, March 2008.
Abstract: The MAESTRO API is intended to support a simple thread programming model for compilers and other automated tools. It is not expected to be visible to application programmers. MAESTRO must support other programming models, such as MPI and OpenMP, efficiently. The goal is to provide lightweight threads that can run quickly and efficiently on a dynamic number of hardware resources running within a single address space. MAESTRO allows more threads to be created than resources exist and is responsible for mapping the work to existing resources. The goal is to allow easier parallel programming and porting between different machines by removing knowledge of system size from the application. MAESTRO supplies explicit thread and parallel loop creation and uses an explicit join point for thread synchronization. In addition, the MAESTRO API has a variety of synchronization mechanisms, ranging from optimistic point-to-point through global barriers. This draft is expected to be heavily modified as initial implementation shows the numerous shortfalls.
BibTeX:
@TECHREPORT{LFRT2008:tr0801,
AUTHOR = {Allan Porterfield},
TITLE = {MAESTRO: Program Thread and Synchronization Interface, version 0.1},
INSTITUTION = {RENCI},
YEAR = {2008},
NUMBER = {TR-08-01},
ADDRESS = {North Carolina},
MONTH = {March},
ABSTRACT = {The MAESTRO API is intended to support a simple thread programming model for compilers and other automated tools. It is not expected to be visible to application programmers. MAESTRO must support other programming models, such as MPI and OpenMP, efficiently. The goal is to provide lightweight threads that can run quickly and efficiently on a dynamic number of hardware resources running within a single address space. MAESTRO allows more threads to be created than resources exist and is responsible for mapping the work to existing resources. The goal is to allow easier parallel programming and porting between different machines by removing knowledge of system size from the application. MAESTRO supplies explicit thread and parallel loop creation and uses an explicit join point for thread synchronization. In addition, the MAESTRO API has a variety of synchronization mechanisms, ranging from optimistic point-to-point through global barriers. This draft is expected to be heavily modified as initial implementation shows the numerous shortfalls.},
URL = {http://www.renci.org/publications/techreports/TR0801.pdf}
}