OpenVMS I/O User's Reference Manual
9.19 References
The following publications provide more information on local area
networks.
- IEEE Standards 802.1 (A, B, C, and D), 802.2, and 802.3.
- The Ethernet--Data Link Layer and Physical Layer
Specification
- ANSI X3T9.5 and X3.139
- Digital FDDI Network Architecture
Chapter 10 Optional Features for Improving I/O Performance
This chapter includes updated information for OpenVMS Version
7.2.
As of Version 7.0, OpenVMS Alpha includes two features that provide
dramatically improved I/O performance: Fast I/O and Fast Path. These
features are designed to promote OpenVMS as a leading platform for
database systems. Performance improvement results from reducing the CPU
cost per I/O request and improving symmetric multiprocessing (SMP)
scaling of I/O operations. The CPU cost per I/O is reduced by
optimizing code for high-volume I/O and by making better use of SMP CPU
memory caches. SMP scaling of I/O is increased by reducing the number of
spinlocks taken per I/O and by substituting finer-granularity spinlocks
for global spinlocks.
The improvements follow a natural division that already exists between
the device-independent and device-dependent layers in the OpenVMS I/O
subsystem. The device-independent overhead is addressed by Fast I/O,
which is a set of lean system services that can substitute for certain
$QIO operations. Using these services requires some coding changes in
existing applications, but the changes are usually modest and well
contained. The device-dependent overhead is addressed by Fast Path,
which is an optional performance feature that creates a "fast
path" to the device. It requires no application changes.
Fast I/O and Fast Path can be used independently. However, together
they can provide a 45% reduction in CPU cost per I/O on uniprocessor
systems and a 52% reduction on multiprocessor systems.
10.1 Fast I/O
Fast I/O is a set of three system services that were developed as a
$QIO alternative built for speed. These services are not a $QIO
replacement; $QIO is unchanged, and $QIO interoperation with these
services is fully supported. Rather, the services substitute for a
subset of $QIO operations, namely, only the high-volume read/write I/O
requests.
The Fast I/O services support 64-bit addresses for data transfers to
and from disk and tape devices.
While Fast I/O services are available on OpenVMS VAX, the performance
advantage applies only to OpenVMS Alpha. OpenVMS VAX has a run-time
library (RTL) compatibility package that translates the Fast I/O
service requests to $QIO system service requests, so one set of source
code can be used on both VAX and Alpha systems.
10.1.1 Fast I/O Benefits
The performance benefits of Fast I/O result from streamlining
high-volume I/O requests. The Fast I/O system service interfaces are
optimized to avoid the overhead of general-purpose services. For
example, I/O request packets (IRPs) are now permanently allocated and
used repeatedly for I/O rather than allocated and deallocated anew for
each I/O.
The greatest benefits stem from having user data buffers and user I/O
status structures permanently locked down and mapped using system
space. This allows Fast I/O to do the following:
- For direct I/O, avoid per-I/O buffer lockdown or unlocking.
- For buffered I/O, avoid allocation and deallocation of a separate
system buffer, since the user buffer is always addressable.
- Complete Fast I/O operations at IPL 8, thereby avoiding the
interrupt chaining usually required by the more general-purpose $QIO
system service. For each I/O, this eliminates the IPL 4 IOPOST
interrupt and a kernel AST.
In total, Fast I/O services eliminate four spinlock acquisitions per
I/O (two for the MMG spinlock and two for the SCHED spinlock). The
reduction in CPU cost per I/O is 20% for uniprocessor systems and 10%
for multiprocessor systems.
10.1.2 Using Buffer Objects
The lockdown of user-process data structures is accomplished by buffer
objects. A "buffer object" is process memory whose physical
pages have been locked in memory and double-mapped into system space.
After creating a buffer object, the process remains fully pageable and
swappable and the process retains normal virtual memory access to its
pages in the buffer object.
If the buffer object contains process data structures to be passed to
an OpenVMS system service, the OpenVMS system can use the buffer object
to avoid any probing, lockdown, and unlocking overhead associated with
these process data structures. Additionally, double-mapping into system
space allows the OpenVMS system direct access to the process memory
from system context.
To date, only the $QIO system service and the Fast I/O services have
been changed to accept buffer objects. For example, a buffer object
allows a programmer to eliminate I/O memory management overhead. On
each I/O, each page of a user data buffer is probed and then locked
down on I/O initiation and unlocked on I/O completion. Instead of
incurring this overhead for each I/O, it can be done once at buffer
object creation time. Subsequent I/O operations involving the buffer
object can completely avoid this memory management overhead.
Two system services can be used to create and delete buffer objects,
respectively, and can be called from any access mode. To create a
buffer object, the $CREATE_BUFOBJ system service is called. This
service expects as inputs an existing process memory range and returns
a buffer handle for the buffer object. The buffer handle is an opaque
identifier used to identify the buffer object on future I/O requests.
The $DELETE_BUFOBJ system service is used to delete the buffer object
and accepts as input the buffer handle. Although image rundown deletes
all existing buffer objects, it is good form for the application to
clean up properly.
A 64-bit equivalent version of the $CREATE_BUFOBJ system service
($CREATE_BUFOBJ_64) can be used to create buffer objects from the new
64-bit P2 or S2 regions. The $DELETE_BUFOBJ system service can be used
to delete 32-bit or 64-bit buffer objects.
Buffer objects tie up physical memory, so extensive use of buffer
objects requires system management planning. All the bytes of memory in
the buffer object are
deducted from a systemwide SYSGEN parameter called MAXBOBMEM (maximum
buffer object memory). System managers must set this parameter
correctly for the application loads that run on their systems.
The MAXBOBMEM parameter defaults to 100 Alpha pages, but for
applications with large buffer pools it will likely be set much larger.
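As a rough sizing aid, the following C sketch estimates how many pages a pool of buffer objects would deduct from a MAXBOBMEM-style limit. It is not an OpenVMS routine: the function names are hypothetical, the page size is supplied by the caller, and the sketch assumes that each buffer is charged in whole pages.

```c
#include <assert.h>
#include <stddef.h>

/* Pages occupied by one buffer of `bytes` bytes, rounded up to whole pages.
 * Charging in whole pages is an assumption of this sketch. */
size_t bufobj_pages(size_t bytes, size_t page_size)
{
    assert(page_size > 0);
    return (bytes + page_size - 1) / page_size;
}

/* Total pages for `count` equally sized buffer objects. */
size_t bufobj_pool_pages(size_t count, size_t bytes_each, size_t page_size)
{
    return count * bufobj_pages(bytes_each, page_size);
}

/* Nonzero if the pool fits within a limit expressed in pages. */
int bufobj_pool_fits(size_t count, size_t bytes_each, size_t page_size,
                     size_t limit_pages)
{
    return bufobj_pool_pages(count, bytes_each, page_size) <= limit_pages;
}
```

Under these assumptions, and taking an 8192-byte page size purely as an example, sixteen 65536-byte data buffers come to 128 pages, which is more than a limit of 100 pages would allow.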
To prevent user-mode code from tying up excessive physical memory,
user-mode callers of $CREATE_BUFOBJ must have a new system identifier,
VMS$BUFFER_OBJECT_USER, assigned. This new identifier is automatically
created in an OpenVMS Version 7.0 upgrade if the file
SYS$SYSTEM:RIGHTSLIST.DAT is present. The system manager can assign
this identifier with the DCL command SET ACL to a protected
subsystem or application that creates buffer objects from user mode. It
may also be appropriate to grant the identifier to a particular user
with the Authorize utility command GRANT/IDENTIFIER (for example, to a
programmer who is working on a development system).
There is currently a restriction on the type of process memory that can
be used for buffer objects. Global section memory cannot be made into a
buffer object.
10.1.3 Differences Between Fast I/O Services and $QIO
The precise definition of high-volume I/O operations optimized by Fast
I/O services is important. I/O that does not comply with this
definition either is not possible with the Fast I/O services or is not
optimized. The characteristics of the high-volume I/O optimized by Fast
I/O services can be seen by contrasting the operation of Fast I/O
system services to the $QIO system service as follows:
- The $QIO system service I/O status block (IOSB) is replaced by an
I/O status area (IOSA) that is larger and quadword aligned. The
transfer byte count returned in IOSA is 64 bits, and the field is
aligned on a quadword boundary. Unlike the IOSB, which is optional, the
IOSA is required.
- User data buffers must be aligned to a 512-byte boundary.
- All user process structures passed to the Fast I/O system services
must reside in buffer objects. This includes the user data buffer and
the IOSA.
- Only transfers that are multiples of 512 bytes are supported.
- Only the following function codes are supported: IO$_READVBLK,
IO$_READLBLK, IO$_WRITEVBLK, and IO$_WRITELBLK.
- Only I/O to disk and tape devices is optimized for performance.
- No event flags are used with Fast I/O services. If application code
must use an event flag in relation to a specific I/O, then the Event No
Flag EFN (EFN$C_ENF) can be used. This event flag is a no-overhead EFN
that can be used in situations when an EFN is required by a system
service interface but has no meaning to an application.
For example, Fast I/O services do not use EFNs, so the application cannot
specify a valid EFN associated with the I/O to the $SYNCH system
service with which to synchronize I/O completion. To resolve this
issue, the application can call the $SYNCH system service passing as
arguments: EFN$C_ENF and the address of the appropriate IOSA.
Specifying EFN$C_ENF signifies to $SYNCH that no EFN is involved in the
synchronization of the I/O. Once the IOSA has been written with a
status and byte count, return from the $SYNCH call occurs. The IOSA is
now the central point of synchronization for a given Fast I/O (and is
the only way to determine whether the asynchronous I/O is complete).
- To minimize argument passing overhead to these services, the $QIO
parameters P3 through P6 are replaced by a single argument that is
passed directly by the Fast I/O system services to device drivers. For
disk-like devices, this argument is the media address (VBN or LBN) of
the transfer. For drivers with complex parameters, this argument is the
address of a descriptor or of a buffer specific to the device and
function.
- Segmented transfers are supported by Fast I/O but are not fully
optimized. There are two major causes of segmented transfers. The first
is disk fragmentation. While this can be an issue, it is assumed that
sites seeking maximum performance have eliminated the overhead of
segmenting I/O due to fragmentation.
A second cause of segmenting
is issuing an I/O that exceeds the port's maximum limit for a single
transfer. Transfers beyond the port maximum limit are segmented into
several smaller transfers. Some ports limit transfers to 64K bytes. If
the application limits its transfers to less than 64K bytes, this type
of segmentation should not be a concern.
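The alignment, length, and segmentation rules in the preceding list can be checked before a request is issued. The following C sketch shows one way to do so. The function names and the port_max parameter are hypothetical and are not part of the Fast I/O interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FASTIO_BLOCK_SIZE 512u   /* alignment and length granularity */

/* Nonzero if the buffer address is aligned on a 512-byte boundary. */
int fastio_buffer_aligned(const void *buffer)
{
    return ((uintptr_t)buffer % FASTIO_BLOCK_SIZE) == 0;
}

/* Nonzero if the transfer length is a multiple of 512 bytes. */
int fastio_length_ok(size_t length)
{
    return (length % FASTIO_BLOCK_SIZE) == 0;
}

/* Number of transfers a request is split into when the port caps a single
 * transfer at `port_max` bytes. A result of 1 means no segmentation. */
size_t fastio_segments(size_t length, size_t port_max)
{
    assert(port_max > 0);
    return (length + port_max - 1) / port_max;
}
```

With a hypothetical port limit of 65536 bytes, a 65536-byte request stays in one segment, while a request that is one block larger is split in two.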
10.1.4 Using Fast I/O Services
The three Fast I/O system services are:
- $IO_SETUP---Sets up an I/O.
- $IO_PERFORM[W]---Performs an I/O request.
- $IO_CLEANUP---Cleans up an I/O request.
10.1.4.1 Using Fandles
A key concept behind the operation of the Fast I/O services is the file
handle or fandle. A fandle is an opaque token that
represents a "setup" I/O. A fandle is needed for each I/O
outstanding from a process.
All possible setup, probing, and validation of arguments is performed
off the mainline code path during application startup with calls to the
$IO_SETUP system service. The I/O function, the AST address, the buffer
object for the data buffer, and the IOSA buffer object are specified on
input to the $IO_SETUP service, and a fandle representing this setup is
returned to the application.
To perform an I/O, the $IO_PERFORM system service is called, specifying
the fandle, the channel, the data buffer address, the IOSA address, the
length of the transfer, and the media address (VBN or LBN) of the
transfer.
If the asynchronous version of this system service, $IO_PERFORM, is
used to issue the I/O, then the application can wait for I/O completion
using a $SYNCH specifying EFN$C_ENF and the appropriate IOSA. The
synchronous form of the system service, $IO_PERFORMW, is used to issue
an I/O and wait for it to complete. Optimum performance comes when the
application uses AST completion; that is, the application does not
issue an explicit wait for I/O completion.
To clean up a fandle, the fandle can be passed to the $IO_CLEANUP
system service.
10.1.4.2 Modifying Existing Applications
Modifying an application to use the Fast I/O services requires a few
source-code changes. For example:
- A programmer adds code to create buffer objects for the IOSAs and
data buffers.
- The programmer changes the application to use the Fast I/O
services. Not all $QIOs need to be converted. Only high-volume
read/write I/O requests should be changed.
A simple example is a
"database writer" program, which writes modified pages back
to the database. Suppose the writer can handle up to 16 simultaneous
writes. At application startup, the programmer would add code to create
16 fandles by 16 $IO_SETUP system service calls.
- In the main processing loop within the database writer program, the
programmer replaces the $QIO calls with $IO_PERFORM calls. Each
$IO_PERFORM call uses one of the 16 available fandles. While the I/O is
in progress, the selected fandle is unavailable for use with other I/O
requests. The database writer is probably using AST completion and
recycling the fandle, data buffer, and IOSA once the completion AST
arrives.
If the database writer routine cannot return until all
dirty buffers are written (that is, it must wait for all I/O
completions), then $IO_PERFORMW can be used. Alternatively, $IO_PERFORM
calls can be followed by $SYNCH system service calls passing the
EFN$C_ENF argument to await I/O completions. The database writer
will run faster and scale better because I/O requests now use less CPU
time.
- When the application exits, an $IO_CLEANUP system service call is
done for each fandle returned by a prior $IO_SETUP system service call.
Then the buffer objects are deleted. Image rundown performs fandle and
buffer object cleanup on behalf of the application, but it is good form
for the application to clean up properly.
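The fandle bookkeeping described in these steps can be sketched in C as a small fixed pool. This is only an illustration: the structure and function names are hypothetical, the fandle values are placeholder numbers rather than tokens returned by $IO_SETUP, and the sketch ignores the AST synchronization that a real application would need.

```c
#include <assert.h>
#include <stddef.h>

#define FANDLE_POOL_SIZE 16   /* the database writer allows 16 simultaneous writes */

/* One pre-built I/O setup. `fandle` stands in for the opaque token that
 * $IO_SETUP would return; here it is just a placeholder number. */
struct fandle_slot {
    unsigned long fandle;
    int busy;                 /* nonzero while an I/O using this fandle is outstanding */
};

struct fandle_pool {
    struct fandle_slot slots[FANDLE_POOL_SIZE];
};

/* Startup: a real application would fill each slot with one $IO_SETUP call. */
void fandle_pool_init(struct fandle_pool *pool)
{
    for (size_t i = 0; i < FANDLE_POOL_SIZE; i++) {
        pool->slots[i].fandle = 1000UL + i;   /* placeholder token */
        pool->slots[i].busy = 0;
    }
}

/* Main loop: claim a free fandle before issuing an I/O; NULL if all are in use. */
struct fandle_slot *fandle_acquire(struct fandle_pool *pool)
{
    for (size_t i = 0; i < FANDLE_POOL_SIZE; i++) {
        if (!pool->slots[i].busy) {
            pool->slots[i].busy = 1;
            return &pool->slots[i];
        }
    }
    return NULL;
}

/* Completion AST: the I/O is done, so the fandle can be recycled. */
void fandle_release(struct fandle_slot *slot)
{
    assert(slot != NULL && slot->busy);
    slot->busy = 0;
}
```

Because a fandle is unavailable while its I/O is in progress, a seventeenth simultaneous request finds no free slot until a completion releases one.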
10.1.4.3 I/O Status Area (IOSA)
The central point of synchronization for a given Fast I/O is its IOSA.
The IOSA replaces the $QIO system service's IOSB argument. The IOSA is
larger than the IOSB, and its byte count field is 64 bits and quadword
aligned. Unlike the $QIO system service, Fast I/O services
require the caller to supply an IOSA and require the IOSA to be part of
a buffer object.
The IOSA context field can be used in place of the $QIO system service
ASTPRM argument. The $QIO ASTPRM argument is typically used to pass a
pointer back to the application on the completion AST to locate the
user context needed for resuming a stalled user-thread. However, for
the $IO_PERFORM system service, the ASTPRM on the completion AST is
always the IOSA. Since there is no user-settable ASTPRM, an application
can store a pointer to the user thread context for this I/O in the IOSA
context field and retrieve the pointer from the IOSA in the completion
AST.
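The following C sketch illustrates this use of the context field. The demo_iosa structure is a hypothetical stand-in and does not reproduce the real IOSA layout; the other names are likewise invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an IOSA; the real layout is defined by OpenVMS. */
struct demo_iosa {
    uint32_t status;       /* completion status */
    uint64_t byte_count;   /* 64-bit transfer count */
    void    *context;      /* caller-owned pointer, not modified by the I/O */
};

/* Per-request application state that the completion AST must find again. */
struct request_state {
    int page_number;
    int completed;
};

/* Before issuing the I/O: record which request this IOSA belongs to. */
void demo_attach_context(struct demo_iosa *iosa, struct request_state *req)
{
    iosa->context = req;
}

/* Completion AST: the only parameter is the IOSA, so the request is
 * recovered from the context field. */
void demo_completion_ast(struct demo_iosa *iosa)
{
    struct request_state *req = iosa->context;
    assert(req != NULL);
    req->completed = 1;
}
```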
10.1.4.4 $IO_SETUP
The $IO_SETUP system service performs the setup of an I/O and returns a
unique identifier for this setup I/O, called a fandle, to be used on
future I/Os. The $IO_SETUP arguments used to create a given fandle
remain fixed throughout the life of the fandle. This has implications
for the number of fandles needed in an application. For example, a
single fandle can be used only for reads or only for writes. If an
application module has up to 16 simultaneous reads or writes pending,
then potentially 32 fandles are needed to avoid any $IO_SETUP calls
during mainline processing.
The $IO_SETUP system service supports an expedite flag, which is
available to boost the priority of an I/O among the other I/O requests
that have been handed off to the controller. Unrestrained use of this
argument is useless, because if all I/O is expedited, nothing is
expedited. Note that this flag requires the ALTPRI and PHY_IO
privileges.
10.1.4.5 $IO_PERFORM[W]
The $IO_PERFORM[W] system service accepts a fandle and five other
variable I/O parameters for the high-performance I/O operation. The
fandle remains in use until $IO_PERFORMW returns or, if $IO_PERFORM is
used, until the completion AST arrives.
The CHAN argument to $IO_PERFORM contains the data channel returned to
the application by a previous file operation. This argument allows the
application the flexibility of using the same fandle for different open
files on successive I/Os. However, if the fandle is used repeatedly for
the same file or channel, then an internal optimization with
$IO_PERFORM is taken.
Note that $IO_PERFORM was designed to have no more than six arguments
to take advantage of the OpenVMS Calling Standard, which
specifies that calls with up to six arguments can be passed entirely in
registers.
10.1.4.6 $IO_CLEANUP
A fandle can be cleaned up by passing the fandle to the $IO_CLEANUP
system service.
10.1.4.7 Fast I/O FDT Routine (ACP_STD$FASTIO_BLOCK)
Because $IO_PERFORM supports only four function codes, this system
service does not use the generalized function decision table (FDT)
dispatching that is contained in the $QIO system service. Instead,
$IO_PERFORM uses a single vector in the driver dispatch table called
DDT$PS_FAST_FDT for all four supported functions. The
DDT$PS_FAST_FDT field is an FDT routine vector that indicates whether
the device driver called by $IO_PERFORM is set up to handle Fast I/O
operations. A nonzero value for this field indicates that the device
driver supports Fast I/O operations and that the I/O can be fully
optimized.
If the DDT$PS_FAST_FDT field is zero, then the driver is not set up to
handle Fast I/O operations. The $IO_PERFORM system service tolerates
such device drivers, but the I/O is only slightly optimized in this
circumstance.
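The zero or nonzero test on the vector can be modeled in C with a function pointer. In the following sketch the structure, routine names, and return codes are hypothetical. The text does not say what $IO_PERFORM does when the field is zero, so the fallback to a general routine is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver dispatch table and its routines. */
typedef int (*fdt_routine)(int function_code);

struct demo_ddt {
    fdt_routine fast_fdt;      /* models DDT$PS_FAST_FDT; NULL means zero */
    fdt_routine general_fdt;   /* assumed fallback path, not described in the text */
};

enum { PATH_FAST = 1, PATH_GENERAL = 2 };

int demo_fast_routine(int function_code)    { (void)function_code; return PATH_FAST; }
int demo_general_routine(int function_code) { (void)function_code; return PATH_GENERAL; }

/* Use the single fast vector when the driver provides one; otherwise fall back. */
int demo_dispatch(const struct demo_ddt *ddt, int function_code)
{
    assert(ddt != NULL);
    if (ddt->fast_fdt != NULL)
        return ddt->fast_fdt(function_code);
    return ddt->general_fdt(function_code);
}
```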
The OpenVMS disk and tape drivers that ship as part of OpenVMS Version
7.0 have added the following line to their driver dispatch table
(DDTAB) macro:
FAST_FDT=ACP_STD$FASTIO_BLOCK,- ; Fast-IO FDT routine
This line initializes the DDT$PS_FAST_FDT field to the address of the
standard Fast I/O FDT routine, ACP_STD$FASTIO_BLOCK.
If you have a disk or tape device driver that can handle Fast I/O
operations, you can add this DDTAB macro line to your driver. If you
cannot use the standard Fast I/O FDT routine, ACP_STD$FASTIO_BLOCK, you
can develop your own based on the model presented in this routine.
10.1.5 Additional Information
For complete information about the following Fast I/O system services,
refer to the OpenVMS System Services Reference Manual: A--GETMSG and OpenVMS System Services Reference Manual: GETQUI--Z.
- $CREATE_BUFOBJ
- $DELETE_BUFOBJ
- $CREATE_BUFOBJ_64
- $IO_SETUP
- $IO_PERFORM
- $IO_CLEANUP
To see a sample program that demonstrates the use of buffer objects and
the Fast I/O system services, refer to the IO_PERFORM.C program in the
SYS$EXAMPLES directory.
10.2 Fast Path
Fast Path is an optional, high-performance feature designed to improve
I/O performance. Fast Path creates a streamlined path to the device.
Fast Path is of interest to any application where enhanced I/O
performance is desirable. Two examples are database systems and
real-time applications, where the speed of transferring data to disk is
often a vital concern.
Using Fast Path features does not require source-code changes. Minor
interface changes are available for expert programmers who want to
maximize Fast Path benefits.
Beginning with OpenVMS Alpha Version 7.1, Fast Path supports disk I/O
for the CIXCD and the CIPCA ports. These ports provide access to CI
storage for XMI- and PCI-based systems. In Version 7.0, Fast Path
supported disk I/O for the CIXCD port only.
Fast Path is not available on the OpenVMS VAX operating system.
10.2.1 Fast Path Features and Benefits
Fast Path achieves dramatic performance gains by reducing CPU time for
I/O requests on both uniprocessor and SMP systems. These savings are on
the order of 25% less CPU cost per I/O request on a uniprocessor and
35% less on a multiprocessor system. The performance benefits are
produced by:
- Reducing code paths through streamlining for the case of
high-volume I/O
- Substituting port-specific spinlocks for global I/O subsystem
spinlocks
- Executing I/O requests for a given port on a specific CPU
The performance improvement can best be seen by contrasting the current
OpenVMS I/O scheme to the new Fast Path scheme. While transparent to an
OpenVMS user, each disk and tape device is tied to a specific port. All
I/O for a device is sent out over its assigned port. Under the current
OpenVMS I/O scheme, an I/O can be initiated on any CPU, but I/O
completion must occur on the primary CPU. Under Fast Path, all I/O for
a given port is assigned to a specific CPU, eliminating the requirement
for completing the I/O on the primary CPU. This means that the entire
I/O can be initiated and completed on a single CPU. Because I/O
operations are no longer split among different CPUs, performance
increases as memory cache thrashing between CPUs decreases.
Fast Path also removes the primary CPU as a possible SMP bottleneck.
Without Fast Path, the primary CPU must be involved in all I/O. Once
this CPU becomes saturated, no further increase in I/O throughput is
possible. Spreading the I/O load evenly among CPUs in a multiprocessor
system provides greater maximum I/O throughput. This is achieved by
assigning each Fast Path port to a specific CPU referred to as the
port's preferred CPU.
With most of the I/O code path executing under port-specific spinlocks
and on each port's preferred CPU, a highly scalable SMP model of
parallel I/O operation exists. Given multiple ports and CPUs, I/Os can
be issued and processed in parallel to a large degree.
Preferred CPU Selection
All Fast Path ports are assignable to CPUs. You can set a SYSGEN
parameter specifying the set of CPUs that are allowed to serve as
preferred CPUs. This set is called the set of allowable CPUs. At any
point in time, the set of CPUs that currently can have ports assigned to
them, called the set of usable CPUs, is the intersection of the set of
allowable CPUs and the current set of running CPUs.
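The intersection can be pictured with bitmasks. The following C sketch assumes that bit n of a mask represents CPU n; this representation and the function names are chosen for illustration and are not taken from the SYSGEN parameter format.

```c
#include <assert.h>
#include <stdint.h>

/* CPU sets as bitmasks: bit n set means CPU n is a member (an assumption). */
typedef uint32_t cpu_mask;

/* Usable CPUs are the allowable CPUs that are also currently running. */
cpu_mask usable_cpus(cpu_mask allowable, cpu_mask running)
{
    return allowable & running;
}

/* Nonzero if CPU `id` is in the set. */
int cpu_in_set(cpu_mask set, unsigned id)
{
    assert(id < 32);
    return (int)((set >> id) & 1u);
}
```

For example, if CPUs 2 and 3 are allowable and CPUs 1 and 2 are running, only CPU 2 is usable.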
Each Fast Path port is initially assigned to a CPU by the
FASTPATH_SERVER process that runs at port initialization time. This
process executes an automatic assignment
algorithm that spreads Fast Path ports evenly among the usable CPUs.
The FASTPATH_SERVER process also runs whenever a secondary CPU is
started, and whenever the set of SYSGEN parameters specifying the
allowable CPUs is changed.
If the primary CPU is in the set of allowable CPUs, the initial
distribution will be biased against the primary CPU in that a port will
only be assigned to the primary after ports have been assigned to each
of the other usable CPUs.
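One simple policy with this property is a round-robin that visits the primary CPU last in each round. The following C sketch shows that policy only. It is not the FASTPATH_SERVER algorithm, and the function name, bitmask representation, and 32-CPU limit are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CPUS 32

/* Assign `nports` ports round-robin over the usable CPUs, visiting the
 * primary CPU last in each round. Writes the chosen CPU for each port into
 * `assigned` and returns the number of usable CPUs (0 means nothing was
 * assigned). */
size_t distribute_ports(uint32_t usable, unsigned primary, size_t nports,
                        unsigned *assigned)
{
    unsigned order[MAX_CPUS];
    size_t n = 0;

    assert(assigned != NULL);

    /* Non-primary usable CPUs come first in the visiting order. */
    for (unsigned cpu = 0; cpu < MAX_CPUS; cpu++)
        if (cpu != primary && ((usable >> cpu) & 1u))
            order[n++] = cpu;

    /* The primary CPU, if usable, goes at the end of the round. */
    if (primary < MAX_CPUS && ((usable >> primary) & 1u))
        order[n++] = primary;

    if (n == 0)
        return 0;
    for (size_t p = 0; p < nports; p++)
        assigned[p] = order[p % n];
    return n;
}
```

With usable CPUs 0, 1, and 2 and CPU 0 as the primary, two ports go to CPUs 1 and 2, and the primary receives a port only when a third port is assigned.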
To identify a device or port's current preferred CPU, you can use
either $GETDVI or the SHOW DEVICE/FULL command. To identify the Fast
Path ports currently assigned to a CPU, use the SHOW CPU/FULL
command.
You can directly assign a Fast Path port to a CPU, or request the
system to automatically select the port's preferred CPU from a specific
set of CPUs. To do this, you either issue a $QIO or use the SET
DEVICE/PREFERRED_CPU command. This will also set the port's User
Preferred CPU to be the selected CPU.
You can clear the port's User Preferred CPU either by issuing a $QIO
or by using the SET DEVICE/NOPREFERRED_CPU DCL command.
You can redistribute the system assignable Fast Path ports across a
subset of the set of usable CPUs by calling the $IO_FASTPATH system
service.
Optimizing Application Performance
Processes running on a port's preferred CPU have an inherent advantage
when issuing I/O to a port in that the overhead to assign the I/O to
the preferred CPU can be avoided. An application process can use the
$PROCESS_AFFINITY system service to assign itself to the preferred CPU
of the device to which the majority of its I/O is sent.
With proper attention to assignment, a process's execution need never
leave the preferred CPU. This presents a scalable process and I/O
scheme for maximizing multiprocessor system operation. Like most RISC
systems, Alpha system performance is highly dependent on the
performance of CPU memory caches. Process assignment and preferred CPU
assignment are two keys to minimizing the memory stalls in the
application and in the operating system, thereby maximizing
multiprocessor system throughput.