Characterizing I/O: A Primer in the Research and Challenges Surrounding High-Performance I/O
Tuning for Performance: Analysis and Optimization
Optimizing I/O-intensive applications for high performance is fundamentally a process of minimizing the amount of time that processors spend waiting for work. It can involve staging data from vast datasets for quick access. It can also involve prefetching and distributed caching, where, for instance, I/O caches at multiple sites service a group of tasks, each reading a discrete portion of a file. In such a case, prefetching the data associated with each task and caching it near the processor completing that task enables better performance. The key is to engage mechanisms that exploit application access patterns to mask hardware latencies, consolidate I/O requests, and distribute the workload optimally.
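The prefetching idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `PortionPrefetcher` class, the `read_portion` helper, and the in-memory futures standing in for a cache "near the processor" are all hypothetical names, not part of any tool discussed in this primer. Each task's discrete portion of the file is requested up front by a background thread, so a task that asks for its data later waits only if the read has not yet finished.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_portion(path, offset, length):
    """Read one task's discrete byte range of the file."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


class PortionPrefetcher:
    """Hypothetical helper: starts reading every task's portion up front."""

    def __init__(self, path, num_tasks):
        size = os.path.getsize(path)
        chunk = -(-size // num_tasks)  # ceiling division: bytes per task
        self._pool = ThreadPoolExecutor(max_workers=num_tasks)
        # Issue all reads immediately so they can overlap with computation.
        self._pending = {
            task: self._pool.submit(read_portion, path, task * chunk, chunk)
            for task in range(num_tasks)
        }

    def get(self, task):
        """Return the task's data, blocking only if the read has not finished."""
        return self._pending[task].result()

    def close(self):
        self._pool.shutdown()


def demo():
    """Write a small scratch file, prefetch it for 4 tasks, return portion sizes."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(bytes(range(256)) * 16)  # 4096 bytes of sample data
        path = tmp.name
    prefetcher = PortionPrefetcher(path, num_tasks=4)
    try:
        return [len(prefetcher.get(task)) for task in range(4)]
    finally:
        prefetcher.close()
        os.remove(path)
```

Calling `demo()` splits the scratch file evenly across four tasks. In a real parallel application the cache would live on the node running each task, and the split would follow the application's actual access pattern rather than an even division.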
To tune the performance of demanding applications in high-performance computing, developers and analysts need a detailed characterization of the I/O behavior of both the parallel application code and the file system supporting it. Understanding the application's I/O requirements better enables them to specify acceptable system configurations, minimum I/O subsystem performance levels, optimal data storage and transmission volumes, and tactical approaches to accessing data on secondary and tertiary storage devices. Understanding the behavior and performance limitations of current file system software provides analysts with a reference point for weighing the suitability of proposed software enhancements and architectural variations.
But applications interact with file system software in increasingly complex ways. They use the file system to move their data between the CPU and disk. The file system's cache modules facilitate this movement: they are buffers that provide temporary storage for an application's data as it is written to and read from disk. These modules can also hold data retrieved by the file system in anticipation of upcoming application I/O requests. Using the cache modules for this purpose, the file system mediates what disk I/O is undertaken, where, and when, and it influences the access and arrival patterns of these requests. As a result, I/O patterns at the disk level can look very different from those evident within an application.
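To make the gap between application-level and disk-level patterns concrete, here is a toy simulation, offered as a sketch rather than a model of any real file system. The `ReadAheadCache` class, its block size, and its read-ahead depth are all assumptions chosen for illustration. The cache fetches extra blocks whenever a miss follows the previous block sequentially, so many small application reads collapse into a few larger disk requests.

```python
class ReadAheadCache:
    """Toy block cache with sequential read-ahead (hypothetical, for illustration)."""

    def __init__(self, block_size=4096, read_ahead_blocks=4):
        self.block_size = block_size
        self.read_ahead_blocks = read_ahead_blocks
        self.cached = set()       # block numbers currently held in the cache
        self.disk_requests = []   # (first_block, block_count) sent to disk
        self.last_block = None    # most recent block touched by the application

    def read(self, offset, length):
        """Serve an application read, recording any disk requests it triggers."""
        first = offset // self.block_size
        last = (offset + length - 1) // self.block_size
        for block in range(first, last + 1):
            if block not in self.cached:
                count = 1
                # A miss on the block right after the previous one looks
                # sequential, so fetch additional blocks in the same request.
                if self.last_block is not None and block == self.last_block + 1:
                    count += self.read_ahead_blocks
                self.disk_requests.append((block, count))
                self.cached.update(range(block, block + count))
            self.last_block = block


def simulate_sequential_reads(num_reads=64, read_size=1024):
    """Issue small back-to-back reads and return the resulting disk requests."""
    cache = ReadAheadCache()
    for i in range(num_reads):
        cache.read(i * read_size, read_size)
    return cache.disk_requests
```

With the default settings, the 64 application reads of 1 KB each reach the disk as only four requests, which is the kind of reshaping an analyst has to account for when comparing traces captured at different levels of the I/O stack.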
For optimal performance, the data storage and access policies most conducive to the application's knowledge discovery objectives must be reflected in the file system's I/O configuration. Based on our I/O characterization and analysis experiences, we believe I/O systems must learn data access patterns automatically during system execution. Moreover, because current I/O patterns necessarily reflect only those behaviors feasible on current systems, analysts must be wary about extrapolating from analyses of existing application codes when planning the design of next-generation systems. They must be able to separate out the sources of the various I/O behaviors exhibited. Is a given behavior the result of an inefficient resource management algorithm, or does it reflect a system limitation, or perhaps a work-around to compensate for a system limitation? Or could the hardware configuration be better balanced? Understanding and exploiting I/O behavior to enhance performance is only possible if analysts can pinpoint the catalysts of interactions among system components that support I/O and monitor the ways these components respond to application I/O request patterns.
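As a very rough illustration of what learning an access pattern during execution might look like, the sketch below labels a stream of requests as sequential, strided, or random. The function name, the three labels, and the rule based on gaps between consecutive requests are assumptions made for this example. They are not the method used by the authors or by any particular I/O system.

```python
def classify_access_pattern(requests):
    """Label a list of (offset, size) requests, given in issue order.

    Returns "sequential" if each request starts where the previous one ended,
    "strided" if the gap between requests is constant but nonzero,
    "random" otherwise, and "unknown" if there are too few requests to tell.
    """
    if len(requests) < 2:
        return "unknown"
    # Gap between the end of one request and the start of the next.
    gaps = [
        next_offset - (offset + size)
        for (offset, size), (next_offset, _) in zip(requests, requests[1:])
    ]
    if all(gap == 0 for gap in gaps):
        return "sequential"
    if len(set(gaps)) == 1:
        return "strided"
    return "random"
```

For example, `classify_access_pattern([(0, 4), (16, 4), (32, 4)])` sees a constant 12-byte gap and reports a strided pattern. A system that adapts its policies would apply some test of this kind over a sliding window of recent requests and then choose a prefetching or caching policy to match.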
The I/O Analysis component of the Pablo Performance Analysis Environment contains programs that produce reports summarizing an application's I/O activity. After program execution, the event data can be analyzed with a toolkit of data transformation modules and a graphical programming model that allows users to interactively connect and configure a data analysis graph. Using descriptors embedded in SDDF (Self-Defining Data Format) files, analysis tools interpret the event data records. Because both the format and the tools are general and extensible, all Pablo data analysis tools, including the virtual reality environment Virtue, can accept a variety of record types represented in SDDF.
The Pablo Research Group's toolkit for I/O analysis includes software to generate:
Statistical Report

| Operation | Count | % Count | I/O time | % I/O time | % Exec time | Bytes | % Bytes |
|---|---|---|---|---|---|---|---|
| Open | 516 | 1.08 | 129.00 | 0.45 | 0.02 | 0.00 | |
| Read | 512 | 1.07 | 3.57 | 0.01 | 0.00 | 115712 | 0.00 |
| Seek | 1284 | 2.68 | 1.88 | 0.01 | 0.00 | 0.00 | |
| Write | 44989 | 93.98 | 28696.38 | 99.50 | 5.05 | 17930708366 | 100.00 |
| Flush | 56 | 0.12 | 8.92 | 0.03 | 0.00 | 0.00 | |
| Close | 512 | 1.07 | 0.17 | 0.00 | 0.00 | 0.00 | |
| All I/O | 47869 | 100.00 | 28839.92 | 100.00 | 5.07 | 17930824078 | 100.00 |
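The columns of a statistical report like the one above can be derived from a list of I/O events with simple aggregation. The sketch below is a hedged illustration, not the Pablo implementation. The event tuple layout `(operation, duration, nbytes)`, the `summarize` function, and the dictionary keys are all assumptions, and the total execution time is taken as an input because it cannot be recovered from the I/O events alone.

```python
from collections import defaultdict


def pct(part, whole):
    """Percentage of whole, or 0.0 when whole is zero."""
    return 100.0 * part / whole if whole else 0.0


def summarize(events, execution_time):
    """Aggregate (operation, duration, nbytes) events into per-operation rows."""
    totals = defaultdict(lambda: {"count": 0, "io_time": 0.0, "bytes": 0})
    for op, duration, nbytes in events:
        row = totals[op]
        row["count"] += 1
        row["io_time"] += duration
        row["bytes"] += nbytes

    all_count = sum(r["count"] for r in totals.values())
    all_time = sum(r["io_time"] for r in totals.values())
    all_bytes = sum(r["bytes"] for r in totals.values())
    # Add the summary row after the totals are computed so it is not double counted.
    totals["All I/O"] = {"count": all_count, "io_time": all_time, "bytes": all_bytes}

    return {
        op: {
            "count": r["count"],
            "pct_count": pct(r["count"], all_count),
            "io_time": r["io_time"],
            "pct_io_time": pct(r["io_time"], all_time),
            "pct_exec_time": pct(r["io_time"], execution_time),
            "bytes": r["bytes"],
            "pct_bytes": pct(r["bytes"], all_bytes),
        }
        for op, r in totals.items()
    }
```

A report in this shape makes the dominant operation easy to spot. In the table above, writes account for nearly all of the I/O time and bytes, so they are the natural first target for tuning.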
Distribution of Read and Write Requests by Size (K = 1024 bytes)

| Operation | Sz < 4K | 4K <= Sz < 64K | 64K <= Sz < 256K | 256K <= Sz |
|---|---|---|---|---|
| Read | 512 | 0 | 0 | 0 |
| Write | 1213 | 0 | 0 | 43776 |
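A size distribution like the one above amounts to a histogram over request sizes. The following sketch shows one way to compute it. The `size_distribution` function and the `(operation, nbytes)` input layout are hypothetical, and the bins are assumed to be half-open (lower bound inclusive, upper bound exclusive) so that every request falls into exactly one bin.

```python
K = 1024

# (label, inclusive lower bound, exclusive upper bound); None means no upper bound.
BINS = [
    ("Sz < 4K", 0, 4 * K),
    ("4K <= Sz < 64K", 4 * K, 64 * K),
    ("64K <= Sz < 256K", 64 * K, 256 * K),
    ("256K <= Sz", 256 * K, None),
]


def size_distribution(requests):
    """Count requests per size bin for each operation.

    requests: iterable of (operation, nbytes) pairs.
    Returns a dict mapping operation -> list of counts, one per entry in BINS.
    """
    dist = {}
    for op, nbytes in requests:
        counts = dist.setdefault(op, [0] * len(BINS))
        for i, (_, low, high) in enumerate(BINS):
            if nbytes >= low and (high is None or nbytes < high):
                counts[i] += 1
                break
    return dist
```

A bimodal result such as the write row above, with many very large requests and a smaller number of very small ones, often suggests that bulk data and small metadata or header writes follow different code paths and may benefit from different tuning.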
| Static Graph | Dynamic Graph |
|---|---|
| This scatterplot graphs the duration of each color-coded I/O activity as a function of execution time. | This screenshot of a visualization of the Continuum code illustrates round-robin I/O as the code runs on 16 processors of the SGI Origin2000 at the National Center for Supercomputing Applications. |