This glossary contains terms useful to researchers and students interested in learning more about characterizing and optimizing the I/O operations of applications demanding high performance.
To submit either terms for definition or missing terms and accompanying definitions, please send mail to info@pablo-cadre.cs.uiuc.edu.
| AFS | The Andrew File System (AFS) was originally developed as part of the Andrew project at Carnegie Mellon University. AFS is a distributed file system that provides shared access to directories and files across local and wide area networks. It is intended for file sharing across systems, rather than for parallel access. In contrast to NFS, AFS stages files to the local system for faster access. |
| Asynchronous I/O | I/O that takes place concurrently with the computation using it. |
| Bandwidth | Measured in bytes/second, bandwidth is the rate at which data can be transferred from one location or device to another. In an I/O system, bandwidth can be measured at many levels, ranging from the rate data can be retrieved from the physical media, through transfer rates across interfaces like SCSI, to software-mediated transfers. Contrast with latency, the fixed overhead before the first byte of a transfer arrives. |
| Checkpoint | Many large, parallel computations require days or weeks to complete. Because it is not practical to execute without interruption for that long (e.g., due to scheduled maintenance, resource sharing, or allocation limits), large computations checkpoint their computation state. These checkpoints include enough data to restart the computation at a later date. |
| Collective I/O | In general, each task in a parallel program issues I/O requests independently. Consequently, these requests can contend for access to the file system and storage devices. Collective I/O relies on a group request from a set of tasks to capture the global I/O pattern from the task group, then reorders the request components and issues a single aggregate request to the I/O system. This reordering reduces device contention and generally leads to higher performance. |
| Controller | A collection of electronics, or subsystem, that can operate a port, a bus, or a device to govern the functions of attached devices but which generally does not change the meaning of the data that may pass through it. The attached devices are usually peripherals or communication channels. In I/O systems SCSI or IDE controllers are common. |
| Cylinder | A cylinder is a group of tracks at the same radial position on a disk but each on a different platter. Each disk can be viewed as a set of stacked platters; a cylinder is the set of tracks, one per platter, that line up when the stack is viewed from above. |
| Data distribution | A data distribution defines how striping units are distributed across storage devices in a parallel file system. Ideally, a file system chooses a data distribution so that application or I/O library accesses concurrently retrieve stripes from all or a substantial fraction of the storage devices. If so, the file system will deliver high performance by concurrently reading or writing data from/to all the devices. Conversely, if the data distribution does not match the I/O pattern, the potential performance advantage of parallel storage devices will be lost. |
| Declustering | Declustered files are those that are striped across multiple storage devices. A declustering is defined by a striping unit, a logically contiguous portion of the file stored on a device, and a distribution, the pattern for distributing successive striping units. The most common distribution is round robin, where successive striping units (stripes) are placed on successively numbered devices. For example, one might choose a striping unit of 64 KB and distribute successive stripes across eight disks. |
| Device driver | A program, or part of a program, used to control the detailed operation of an input or output device connected to a computer system. In many cases the device drivers are embedded as part of the operating system and different device drivers are written to conform to standards governing the way in which the user's application program communicates with the device driver. This allows programs to be written in such a way as to be able to use any device for which a suitable device driver has been produced. |
| Disk | An item of storage medium in the form of a circular plate. These devices, historically, have been principally magnetic disks, in which the information is stored via magnetic encoding. |
| Disk directed I/O | Disk directed I/O coordinates data transfers between storage devices and processor memories, attempting to organize transfers within the memory of the processors before initiating collective requests. Data organization is driven by storage constraints, with a goal of organizing the request in a way that maximizes I/O device performance. See collective I/O. |
| Distributed caching | Distributed caching exploits I/O caches at multiple sites to improve I/O performance. Consider a simple example: a group of tasks each reading a disjoint portion of a file. If the data associated with each task is prefetched and cached near that task, each read can be satisfied from a nearby cache rather than waiting on the storage device. |
| Fibre Channel | Fibre Channel is a network interconnect that operates on both optical fibre and copper (coax) connections. Originally specified for operation at 100 Mb/s, Fibre Channel can scale up to 1.6 Gb/s. In the context of I/O, Fibre Channel is a mechanism to interconnect systems and storage devices, primarily in storage area networks (SANs). |
| File cache | A file cache holds data prefetched in anticipation of future read requests or stored temporarily in preparation for write behind. The goal is to hide the high latency of physical I/O devices by providing "smoothing" of data flow to and from devices. |
| File system | A file system is typically (though not always) an integral part of the operating system that manages accesses to physical storage devices. It coordinates placement of data and metadata on the devices, issuing operations via a device driver to the disks and storing data in a file cache. |
| Global file pointer | When multiple tasks in a parallel application open the same file, a file system may provide a local or a global file pointer. In the local file pointer case, each task can separately read, write, or seek to different portions of the file without affecting the file pointer values of any other tasks. (Of course, the tasks must collectively ensure that they do not write to the same locations, else data may be corrupted). A global file pointer is shared among all the tasks that open the file -- an I/O operation by any task changes the file pointer value in all other tasks. |
| GPFS | The IBM General Parallel File System (GPFS) is a shared, parallel file system for the IBM SP parallel system. Unlike NFS and AFS, which provide only a shared name space for files across systems, GPFS supports parallel access and storage of files across SP nodes. GPFS is intended for use on a single system -- it is not a wide area distributed file system. |
| HDF | The Hierarchical Data Format (HDF) is a library and platform-independent data format for scientific data exchange and storage. It was developed and supported by the National Center for Supercomputing Applications (NCSA). The earlier HDF4 is both a physical file format for storing scientific data and a collection of utilities and applications for manipulating, viewing, and analyzing scientific data. The more recent HDF5 is a new design and implementation, intended to eliminate the limitations of HDF4 and provide new, high-performance features. |
| HPSS | The High Performance Storage System (HPSS) is a software infrastructure for hierarchical storage management and services for environments with very large storage requirements, typically multiple terabytes. Based on the IEEE Mass Storage Reference Model, HPSS supports staging and data movement to and from data archives and network-connected computer systems and storage devices. |
| IDE | An Integrated Device Electronics (IDE) interface is one of several standard controllers for commodity disks. IDE is designed for "inside the box" PC disk connections. Unlike SCSI, it is limited in its extensibility and the number of devices one can connect. However, because it is common on PCs, IDE disks are very inexpensive. |
| I/O library | An I/O library typically sits atop a file system, providing a higher level interface for I/O operations. I/O libraries, including those for UNIX and MPI-IO, mediate application I/O requests, buffering data and managing the movement of data to and from the file system. |
| I/O Patterns | Scientific and engineering applications often exhibit regular (repeating) I/O patterns that are a consequence of nested loops and periodic I/O checkpoints. For example, strided accesses are due to loop-based file reads or writes that access array sections with fixed offsets (e.g., reading columns of an array that has been linearized on disk by rows). |
| Instrumentation | To identify the performance bottlenecks that are prime candidates for optimization, analysts and developers use instrumentation software that tracks I/O behavior. When attached to the software targeted for optimization, instrumentation acts as an assortment of probes, capturing performance metrics during program execution. These probes feed the performance metrics they capture into a format that can portray I/O behavior to analysts and developers. |
| Latency | Latency is the complement of bandwidth. Intuitively, latency is the overhead or delay until an I/O operation begins transferring data. For example, a disk might have a 10 MB/s transfer rate (the bandwidth) but a 10 millisecond latency (the time until the first byte of data is returned). |
| Meta-format | A meta-format is often used to encode the structure and perhaps the semantics of data in a file. HDF can be viewed as a meta-format because the API allows one to store multiple logical entities in a single file, retrieving them based on a metadata specification. |
| Metadata | File metadata relates a collection of disk blocks to logical file blocks. On UNIX systems, i-nodes are the metadata that define the set of physical disk blocks and their locations that constitute a file. Intuitively, the i-nodes point to disk blocks or other i-nodes, collectively defining a disk block pointer graph that is stored on disk along with the data. File systems navigate the i-node graph to retrieve disk blocks in response to application or library requests. |
| MPI-IO | MPI-IO is an application programming interface (API) for parallel I/O that is part of the MPI2 standard. MPI-IO defines a set of synchronous, asynchronous, and collective I/O routines for parallel systems, building on experiences from a variety of I/O characterization and I/O library studies. |
| NASD | Network Attached Storage Devices or Network Attached Secure Disks (NASD) are network-enabled storage systems that shift some of the burden of I/O processing from the processor to the I/O device. By co-locating some of the software intelligence with the storage device, one can offload some of the processing and also enable storage devices to communicate and operate as peers with remote systems. |
| NFS | The Network File System (NFS) was originally developed by Sun Microsystems to support distributed access to files. Like AFS, NFS enables users and system administrators to mount remote file systems for local access. Unlike AFS, which stages entire files, NFS issues remote I/O requests to files at their permanent storage location. |
| Nonblocking I/O | A form of asynchronous I/O in which an application requests the execution of an I/O operation and then, rather than standing by until the data transfer is complete, proceeds with other work while the I/O operation takes place concurrently. |
| Out-of-core | Many large-scale applications generate intermediate data too large to retain in primary memory. To access this data, they read and write the requisite data to secondary storage. The name persists from the time when primary memories were constructed from magnetic cores, hence "out of core" for outside of main memory. These out of core accesses often lead to strided I/O requests. |
| Parallel file system | A parallel file system, sometimes called a global parallel file system, provides a shared namespace for files that are stored on multiple disks. SGI's XFS and IBM's GPFS, two of the best-known commercial examples, support striping of files across disks and define APIs for manipulating parallel files. |
| Prefetch | Because the latency for physical access to secondary or tertiary storage is orders of magnitude higher than that for a memory access, file systems and I/O libraries often stage data from storage to memory (a file cache) in anticipation of future file read requests. If the staged (prefetched) data is then accessed from memory buffers, the physical I/O latency can be hidden. Because sequential requests are the most common, the default prefetching policy often is sequential prefetch -- prefetching file blocks logically ahead of the current point in the file. Prefetch is sometimes called read ahead, the complement to write behind for output operations. |
| RAID | A RAID (Redundant Array of Independent Disks) groups a set of disks to create the illusion of a single, faster disk. The redundancy aspect was introduced to increase reliability by storing the same data on multiple disks. Just as parity and error-correcting codes are used in semiconductor memory to detect and correct bit errors, RAID systems rely on parity and redundant storage to recover from disk failures. The classic RAID levels range from RAID-0 (no redundancy) through RAID-5 (distributed parity). |
| Random I/O patterns | Random I/O patterns are those generated by applications that lack regularity. In contrast to sequential or strided patterns, where knowledge of the previous access allows one to predict the next access, random patterns cannot be anticipated. Consequently, file systems are rarely able to provide the same performance for these patterns as for those with more regularity. |
| Real-time tracing | Real-time tracing is a method of data capture in which the metrics are kept in a table in memory during program execution and written to an SDDF file at program termination. This approach reduces the overhead of instrumentation by eliminating the performance cost of writing individual records. However, it tends to give less detailed profiling information because the time at which specific events occurred cannot be derived from such traces. |
| ROMIO | ROMIO is a publicly available implementation of MPI-IO, developed by Argonne National Laboratory. ROMIO implements the MPI-IO subset of MPI2, and can operate atop a variety of public and vendor MPI implementations. |
| Rotational latency | The amount of time taken for a particular region of a track on a storage disk or drum to rotate to the read/write head of the device. This is measured from the time the head reaches the correct track and depends strongly on the rotational speed of the disk or drum surface. |
| Runtime tracing | Runtime tracing is a method of data capture in which the metrics are recorded for each instance of a trace event. |
| SDDF | The Pablo Self-Defining Data Format (SDDF) is a data description language that specifies both data record structures and data record instances. Because the format can describe general data records, as opposed to a predefined set of records, it is best viewed as a data meta-format. Intuitively, the format supports the definition of records containing scalars and arrays of the base types found in most programming languages (i.e., byte/character, integer, and single and double precision floating point). |
| Sequential I/O patterns | Sequential I/O patterns are a special case of strided I/O, where the stride is one (i.e., each access immediately follows the previous one). |
| Strided I/O patterns | A strided I/O pattern is one where the offset between successive accesses is a constant. The stride of the pattern is the distance between the start of one access and the start of the following access. An example of a strided I/O pattern is the pattern formed by accessing every third block in a file starting with the first block. In this case, the amount of data read is one block and the stride is three blocks. Strided I/O patterns often arise when accessing out-of-core arrays. |
| SCSI | The Small Computer System Interface (SCSI), pronounced "scuzzy", is a controller interface standard for connecting peripheral devices, such as disks and tapes, to small and medium-sized computers. The SCSI standard can be divided into SCSI (SCSI1), SCSI2 (SCSI wide and SCSI wide and fast), and now SCSI-3, which is made up of at least 14 separate standards documents. SCSI2 is the most popular version of the SCSI command specification; SCSI-3 resolves many long-standing "gray areas" and adds new functionality and performance improvements. It also adds new types of SCSI buses, such as Fibre Channel. |
| Seek time | Seek time is the time taken for a particular track on a disk to be located by moving the disk head to the desired radial location on a disk platter. Typical average seek times for disks are in the range of 8-20 milliseconds. Seek time is one component of access latency. |
| Striping | The process of spreading logically contiguous data across separate devices to improve bandwidth. Storing the data on separate devices can allow the use of the bandwidth of each device in parallel. In ideal conditions, striping data across N identical devices yields N times the bandwidth of a single device. In practice, contention for the devices, inefficient access patterns, and network constraints all act to reduce this improvement. |
| Small write problem | Because they maintain parity for data recovery and fault tolerance, RAID systems must support a read-modify-write protocol for small disk writes. They must first read the striped data, recompute the parity, and then write both the modified portion of the data and the updated parity. This protocol means that RAID systems rarely deliver the same performance for small disk writes as non-redundant systems. |
| Storage Area Network (SAN) | A storage area network, or SAN, connects one or more I/O devices via a local network to a computer system. Fibre Channel is among the most common SAN interconnects. A SAN allows multiple processors to share the storage devices and access them directly, in contrast to local storage (e.g., via SCSI) where the processor to which the devices are attached must mediate remote I/O requests. |
| Trace Event | In profiling the performance of an application's I/O behavior, Pablo tools track, or trace, the I/O events that take place during program execution. An event, in this context, occurs each time a specific statement or instruction (open, read, write, seek, flush, close, etc.) is executed. |
| Track | A disk contains a group of platters, each of which in turn contains a set of concentric data tracks. A disk head seeks to a particular track, then reads data from the track into a track buffer (or writes from the track buffer to the track) as the track rotates beneath the disk head. A group of tracks in the same position on each platter forms a cylinder. |
| Track buffer | A track buffer is a semiconductor memory in a disk that stores one or more tracks of data. Current disks cache tracks of data for rapid response to sequential reads via hardware (disk) prefetching, or for impedance matching when accepting data for subsequent writes to the disk medium. |
| Transfer time | Transfer time is the time needed to transfer data from the disk medium to either a track buffer or via the disk interface to the remote requestor. If the requested data for a read is already in the track buffer, it is simply the time to read the data from the buffer and transfer it via the interface. Otherwise, the transfer time is a function of the recording density and the size of the request -- the appropriate section of the track must rotate under the disk head to read (write) the data. Transfer time is one component of disk access latency. |
| Write behind | Write behind is a common technique for reducing apparent latency for file writes by buffering the written data for asynchronous write to disk. By buffering the data in memory, file writes proceed at memory speeds. The data is then written to disk during a subsequent computation interval. |
| XFS | XFS is SGI's shared, parallel file system for the SGI Origin system. Unlike NFS and AFS, which provide only a shared name space for files across systems, XFS supports parallel access and storage of files across Origin nodes. XFS is intended for use on a single system -- it is not a wide area distributed file system. |
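
As an illustration of the Declustering, Data distribution, and Strided I/O patterns entries, the following minimal Python sketch maps file offsets to devices under a round-robin distribution. It uses the 64 KB striping unit and eight disks from the Declustering example. The function names `locate` and `devices_touched` are hypothetical and are not part of any file system API; the sketch only shows how an access pattern can either spread across all devices or collapse onto one.

```python
# Illustrative sketch only: names and structure are assumptions, not a real API.
STRIPING_UNIT = 64 * 1024   # 64 KB striping unit, as in the Declustering entry
NUM_DEVICES = 8             # eight disks, as in the Declustering entry


def locate(offset, unit=STRIPING_UNIT, devices=NUM_DEVICES):
    """Map a file byte offset to (device, byte offset on that device)
    under a round-robin distribution of striping units."""
    stripe = offset // unit            # index of the striping unit in the file
    device = stripe % devices          # round robin over the devices
    local_stripe = stripe // devices   # how many units precede it on that device
    return device, local_stripe * unit + offset % unit


def devices_touched(start, request_size, stride, count,
                    unit=STRIPING_UNIT, devices=NUM_DEVICES):
    """Return the set of devices accessed by `count` requests of
    `request_size` bytes whose start offsets are `stride` bytes apart."""
    touched = set()
    for i in range(count):
        first = start + i * stride
        last = first + request_size - 1
        for stripe in range(first // unit, last // unit + 1):
            touched.add(stripe % devices)
    return touched


if __name__ == "__main__":
    # Sequential pattern: the stride equals the request size.
    print(sorted(devices_touched(0, STRIPING_UNIT, STRIPING_UNIT, 8)))
    # Strided pattern whose stride equals the full stripe width.
    print(sorted(devices_touched(0, STRIPING_UNIT, 8 * STRIPING_UNIT, 8)))
```

Under these assumptions, eight sequential 64 KB requests touch all eight devices, while eight requests with a stride of 8 x 64 KB all land on the same device. This is the mismatch between data distribution and I/O pattern that the Data distribution entry warns about.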