High Performance Network and Channel-Based Storage

Randy H. Katz

SEQUOIA 2000 Technical Report 91/2 (Report # 91/2)
557 Evans Hall, Berkeley, California 94720
(415) 642-4662

Admittedly, centralized storage also has its weaknesses. A server or network failure renders the client workstations unusable, and the network represents the critical performance bottleneck. A highly tuned remote file system on a 10 megabit (Mbit) per second Ethernet can provide perhaps 500 KBytes per second to remote client applications. Sixty 8 KByte I/Os per second would fully utilize this bandwidth. Obtaining the right balance of workstations to servers depends on their relative processing power, the amount of memory dedicated to file caches on workstations and servers, the available network bandwidth, and the I/O bandwidth of the server. It is interesting to note that today's servers are not I/O limited: the Ethernet bandwidth can be fully utilized by the I/O bandwidth of only two magnetic disks!

Meanwhile, other technology developments in processors, networks, and storage systems are affecting the relationship between clients and servers. It is well known that processor performance, as measured in MIPS ratings, is increasing at an astonishing rate, doubling on the order of once every eighteen months to two years. The newest generation of RISC processors has performance in the 50 to 60 MIPS range. For example, a recently announced Hewlett-Packard workstation, the HP 9000/730, has been rated at 72 SPECmarks (1 SPECmark is roughly the processing power of a single Digital Equipment Corporation VAX 11/780 on a particular benchmark set). Powerful shared-memory multiprocessor systems, now available from companies such as Silicon Graphics and Solbourne, provide well over 100 MIPS of performance. One of Amdahl's famous laws equated one MIPS of processing power with one megabit of I/O per second. Obviously such processing rates far exceed anything that can be delivered by existing server, network, or storage architectures.

Unlike processor power, network technology evolves at a slower rate, but when it advances, it does so in order-of-magnitude steps. In the last decade we have advanced from 3 Mbit/second Ethernet to 10 Mbit/second Ethernet. We are now on the verge of a new generation of network technology, based on fiber optic interconnect, called FDDI. This technology promises 100 Mbits per second, and at least initially, it will move the bottleneck from the network to the server CPU or its storage system. With more powerful processors on the horizon, the performance challenge is very likely to be in the storage system, where a typical magnetic disk can service thirty 8 KByte I/Os per second and can sustain a data rate in the range of 1 to 3 MBytes per second. And even faster networks and interconnects, in the gigabit range, are now commercially available and will become more widespread as their costs begin to drop [UltraNet 90].

To keep up with the advances in processors and networks, storage systems are also experiencing rapid improvements. Magnetic disks have been doubling in storage capacity once every three years. As disk form factors shrink from 14" to 3.5" and below, the disks can be made to spin faster, thus increasing the sequential transfer rate. Unfortunately, the random I/O rate is improving only very slowly, due to mechanically limited positioning delays.
Since I/O and data rates are primarily limited by the disk actuator, a new storage system approach called disk arrays addresses this problem by replacing a small number of large-format disks with a very large number of small-format disks. Disk arrays maintain the high capacity of the storage system while enormously increasing the number of disk actuators, and thus the aggregate I/O and data rate.

The confluence of developments in processors, networks, and storage offers the possibility of extending the client-server model, so effectively used in workstation environments, to higher performance environments that integrate supercomputers, near supercomputers, workstations, and storage services on a very high performance network. The technology is rapidly reaching the point where it is possible to think in terms of diskless supercomputers in much the same way as we think about diskless workstations. Thus, the network is emerging as the future "backplane" of high performance systems. The challenge is to develop the new hardware and software architectures that will be suitable for this world of network-based storage.

High Performance Network and Channel-Based Storage

Randy H. Katz
Computer Science Division
Department of Electrical Engineering and Computer Sciences
University of California
Berkeley, California 94720

Abstract: In the traditional mainframe-centered view of a computer system, storage devices are coupled to the system through complex hardware subsystems called I/O channels. With the dramatic shift towards workstation-based computing, and its associated client/server model of computation, storage facilities are now found attached to file servers and distributed throughout the network. In this paper, we discuss the underlying technology trends that are leading to high performance network-based storage, namely advances in networks, storage devices, and I/O controller and server architectures. We review several commercial systems and research prototypes that are leading to a new approach to high performance computing based on network-attached storage.

Key Words and Phrases: High Performance Computing, Computer Networks, File and Storage Servers, Secondary and Tertiary Storage Devices

1. Introduction

The traditional mainframe-centered model of computing can be characterized by small numbers of large-scale mainframe computers, with shared storage devices attached via I/O channel hardware. Today, we are experiencing a major paradigm shift away from centralized mainframes to a distributed model of computation based on workstations and file servers connected via high performance networks.

What makes this new paradigm possible is the rapid development and acceptance of the client/server model of computation. The client/server model is a message-based protocol in which clients make requests of service providers, which are called servers. Perhaps the most successful application of this concept is the widespread use of file servers in networks of computer workstations and personal computers. Even a high-end workstation has rather limited capabilities for data storage. A distinguished machine on the network, customized either by hardware, software, or both, provides a file service. It accepts network messages from client machines containing open/close/read/write file requests and processes these, transmitting the requested data back and forth across the network.
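A minimal sketch, in C, of the kind of request and reply messages such a file service might exchange. The field layout and operation codes are assumptions invented for illustration; they do not describe any particular product's protocol.

    /* Hypothetical message formats for a simple network file service.
     * Field layout and opcodes are illustrative assumptions only.       */
    #include <stdint.h>

    enum file_op { OP_OPEN, OP_CLOSE, OP_READ, OP_WRITE };

    struct file_request {
        uint32_t client_id;     /* which client sent the request          */
        uint32_t op;            /* one of enum file_op                    */
        uint32_t file_handle;   /* handle returned by a previous OP_OPEN  */
        uint64_t offset;        /* byte offset within the file            */
        uint32_t length;        /* bytes to read or write                 */
        uint8_t  data[];        /* write payload, if any                  */
    };

    struct file_reply {
        uint32_t status;        /* 0 on success, error code otherwise     */
        uint32_t length;        /* bytes returned on a read               */
        uint8_t  data[];        /* read payload, if any                   */
    };

A server dispatch loop would switch on the op field and call into its local file system, returning the reply message to the requesting client.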
This is in contrast to the pure distributed storage model, in which the files are dispersed among the storage on workstations rather than centralized in a server. The advantages of a distributed organization are that resources are placed near where they are needed, leading to better performance, and that the environment can be more autonomous, because individual machines continue to perform useful work even in the face of network failures. While this has been the more popular approach over the last few years, there has emerged a growing awareness of the advantages of the centralized view. That is, every user sees the same file system, independent of the machine they are currently using. The view of storage is pervasive and transparent. Further, it is much easier to administer a centralized system, to provide software updates and archival backups. The resulting organization combines distributed processing power with a centralized view of storage.

                    Network          Channel            Backplane
    Distance        > 1000 m         10 - 100 m         1 m
    Bandwidth       10 - 100 Mb/s    40 - 1000 Mb/s     320 - 1000+ Mb/s
    Latency         high (ms)        medium             low (us)
    Reliability     low              medium             high
                    Extensive CRC    Byte parity        Byte parity

Table 2.1: Comparison of Network, Channel, and Backplane Attributes. The comparison is based upon the interconnection distance, transmission bandwidth, transmission latency, inherent reliability, and typical techniques for improving data integrity.

In the remainder of this section, we will look at each of the three kinds of interconnect (network, channel, and backplane) in more detail.

2.2. Communications Networks and Network Controllers

An excellent overview of networking technology can be found in [Cerf 91]. For a futuristic view, see [Tesla 91] and [Negraponte 91]. The decade of the 1980s saw a slow maturation of network technology, but the 1990s promise much more rapid developments. 10 Mbit/second Ethernets are pervasive today, with many environments advancing to the next generation of 100 Mbit/second networks based on the FDDI (Fiber Distributed Data Interface) standard [Joshi 86]. FDDI provides higher bandwidth, longer distances, and reduced error rates, due largely to the introduction of fiber optics for data transmission. Unfortunately cost, especially the cost of replacing the existing copper wire plant with fiber, coupled with disappointing transmission latencies, has slowed the acceptance of these higher speed networks. The latency problems have more to do with FDDI's protocols, which are based on a token passing arbitration scheme, than with anything intrinsic to fiber optic technology.

A network system is decomposed into multiple protocol layers, from the application interface down to the method of physical communication of bits on the network. Figure 2.1 summarizes the popular seven-layer ISO protocol model. The physical and link levels are closely tied to the underlying transport medium, and deal with the physical attachment to the network and the method of acquiring access to it. The network, transport, and session levels focus on the detailed formats of communications packets and the methods for transmitting them from one program to another. The presentation and application layers define the formats of the data embedded within the packets and the application-specific semantics of that data.
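To make the layering concrete, here is a minimal C sketch of how a message picks up a header from each layer as it descends the protocol stack. The header layouts are invented for illustration; real protocol stacks differ in detail.

    /* Illustrative only: invented header formats showing how each layer
     * prepends its own header to the payload handed down from above.    */
    #include <stdint.h>
    #include <string.h>

    struct transport_hdr { uint16_t src_port, dst_port; uint32_t seq; };
    struct network_hdr   { uint32_t src_addr, dst_addr; uint16_t length; };
    struct link_hdr      { uint8_t  dst_station[6], src_station[6]; uint16_t type; };

    /* Build a frame by nesting the payload inside transport, network, and
     * link headers.  Returns the total number of bytes placed in 'frame'. */
    size_t encapsulate(uint8_t *frame, const uint8_t *payload, size_t len,
                       const struct transport_hdr *t,
                       const struct network_hdr *n,
                       const struct link_hdr *l)
    {
        size_t off = 0;
        memcpy(frame + off, l, sizeof *l); off += sizeof *l;  /* link      */
        memcpy(frame + off, n, sizeof *n); off += sizeof *n;  /* network   */
        memcpy(frame + off, t, sizeof *t); off += sizeof *t;  /* transport */
        memcpy(frame + off, payload, len); off += len;        /* user data */
        return off;
    }

The receiving node strips the headers in the reverse order as the packet climbs back up its protocol stack.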
A number of performance measurements of network transmission services all point out that the significant overhead is not protocol interpretation (approximately 10% of instructions are spent interpreting the network headers). The culprits are memory system overheads due to data movement and operating system overheads related to context switches and data copying [Clark 89, Heatly 89, Kanakia 90, Watson 87]. We will see this again and again in the sections to follow.

The network controller is the collection of hardware and firmware that implements the interface between the network and the host processor. It is typically implemented on a small printed circuit board, and contains its own processor, memory-mapped control registers, an interface to the network, and a small memory to hold messages being transmitted and received. The on-board processor, usually in conjunction with VLSI components within the network interface, implements the physical and link-level protocols of the network.

The emphasis of this paper is on the integration of storage and network services, and the challenges of managing the complex storage hierarchy of the future: file caches, on-line disk storage, near-line data libraries, and off-line archives. We specifically ignore existing mainframe I/O architectures, as these are well described elsewhere (for example, in [Hennessy 90]).

The rest of this paper is organized as follows. In the next three sections, we review the recent advances in interconnect, storage devices, and distributed software, to better understand the underlying changes in network, storage, and software technologies. Section 5 contains detailed case studies of commercially available high performance networks, storage servers, and file servers, as well as a prototype high performance network-attached I/O controller being developed at the University of California, Berkeley. Our summary, conclusions, and suggestions for future research are found in Section 6.

2. Interconnect Trends

2.1. Networks, Channels, and Backplanes

Interconnect is a generic term for the "glue" that interfaces the components of a computer system. Interconnect consists of high speed hardware interfaces and the associated logical protocols. The former consist of physical wires or control registers. The latter may be interpreted by either hardware or software. From the viewpoint of the storage system, interconnect can be classified as high speed networks, processor-to-storage channels, or system backplanes that provide ports to a memory system through direct memory access techniques.

Networks, channels, and backplanes differ in terms of the interconnection distances they can support, the bandwidth and latencies they can achieve, and the fundamental assumptions about the inherent unreliability of data transmission. While no statement we can make is universally true, in general, backplanes can be characterized by parallel wide data paths and centralized arbitration, and they are oriented towards read/write "memory-mapped" operations. That is, access to control registers is treated identically to memory word access. Networks, on the other hand, provide serial data, distributed arbitration, and support for more message-oriented protocols. The latter require a more complex handshake, usually involving the exchange of high-level request and acknowledgment messages.
Channels fall between the two extremes, consisting of wide datapaths of medium distance, and often incorporating simplified versions of network-like protocols. These considerations are summarized in Table 2.1. Networks typically span more than 1 km, sustain 10 Mbit/second (Ethernet) to 100 Mbit/second (FDDI) and beyond, experience latencies measured in several milliseconds, and the network medium itself is considered to be inherently unreliable. Networks include extensive data integrity features within their protocols, including CRC checksums at the packet and message levels, and the explicit acknowledgment of received packets. Channels span small tens of meters, transmit at anywhere from 4.5 MBytes/second (IBM channel interfaces) to 100 MBytes/second (HiPPI channels), incur latencies of under 100 microseconds per transfer, and have medium reliability. Byte parity at the individual transfer word is usually supported, although packet-level checksumming might also be supported. Backplanes are about 1 m in length, transfer from 40 (VME) to over 100 (FutureBus) MBytes/second, incur sub-microsecond latencies, and the interconnect is considered to be highly reliable. Backplanes typically support byte parity, although some backplanes (unfortunately) dispense with parity altogether.

While this presents a particularly clean interface between the network controller and the operating system, it points out some of the intrinsic memory system latencies that reduce network performance. Consider a message that will be transmitted to the network. First the contents of the message are created within a user application. A call to the operating system results in a process switch and a data copy from the user's address space to the operating system's area. A protocol-specific network header is then appended to the data to form a packaged network message. This must be copied one more time, to place the message into a request block that can be accessed by the network controller. The final copy is the DMA operation that moves the message within the request block to memory within the network controller.

Data integrity is the aspect of system reliability concerned with the transmission of correct data and the explicit flagging of incorrect data. An overriding consideration of network protocols is their concern with reliable transmission. Because of the distances involved and the complexity of the transmission path, network transmission is inherently lossy. The solution is to append checksum protection bits to all network packets and to include explicit acknowledgments as part of the network protocols. For example, if the checksum computed at the receiving end does not match the transmitted checksum, the receiver sends a negative acknowledgment to the sender.

2.3. Channel Architectures

Channels provide the logical and physical pathways between I/O controllers and storage devices. They are medium-distance interconnects that carry signals in parallel, usually with some parity technique to provide data integrity. In this section, we will describe two alternative channel organizations that characterize the low end and the high end respectively: SCSI (Small Computer System Interface) and HiPPI (High Performance Parallel Interface).
2.3.1. Small Computer System Interface

SCSI is the channel interface most frequently encountered in small formfactor (5.25" diameter and smaller) disk drives, as well as in a wide variety of peripherals such as tape drives, optical disk readers, and image scanners. SCSI treats peripheral devices in a largely device-independent fashion. For example, a disk drive is viewed as a linear byte stream; its detailed structure in terms of sectors, tracks, and cylinders is not visible through the SCSI interface.

A SCSI channel can support up to 8 devices sharing a common bus with an 8-bit wide datapath. In SCSI terminology, the I/O controller counts as one of these devices, and is called the host bus adapter (HBA). Burst transfers at 4 to 5 MBytes/second are widely available today. In SCSI terminology, a device that requests service from another device is called the master or the initiator. The device that is providing the service is called the slave or the target.

SCSI provides a high-level message-based protocol for communications among initiators and targets. While this makes it possible to mix widely different kinds of devices on the same channel, it does lead to relatively high overheads. The protocol has been designed to allow initiators to manage multiple simultaneous operations. Targets are intelligent in the sense that they explicitly notify the initiator when they are ready to transmit data or when they need to throttle a transfer.

It is worthwhile to examine the SCSI protocol in some detail, to clearly distinguish what it does from the kinds of messages exchanged on a computer network. The SCSI protocol proceeds in a series of phases, which we summarize below:

Bus Free: No device currently has the bus allocated.

Figure 2.1: Seven Layer ISO Protocol Model

    Application:   detailed information about the data being exchanged
    Presentation:  data representation
    Session:       management of connections between programs
    Transport:     delivery of packet sequences
    Network:       format of individual packets
    Link:          access to and control of the transmission medium
    Physical:      medium of transmission

The figure shows the seven layers of the ISO protocol model. The physical layer describes the actual transmission medium, be it coax cable, fiber optics, or a parallel backplane. The link layer describes how stations gain access to the medium; this layer deals with the protocols for arbitrating for and obtaining grant permission to the media. The network layer defines the format of data packets to be transmitted over the media, including destination and sender information as well as any checksums. The transport layer is responsible for the reliable delivery of packets. The session layer establishes communications between the sending program and the receiving program. The presentation layer determines the detailed formats of the data embedded within packets. The application layer has the responsibility of understanding how this data should be interpreted within an application's context.

The interaction between the network controller and the host's memory is depicted in Figure 2.2. Lists of blocks containing packets to be sent and packets that have been received are maintained in the host processor's memory. The locations of buffers for these blocks are made known to the network controller, and it will copy packets to and from the request/receive block areas using direct memory access (DMA) techniques.
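A minimal C sketch of the request and receive block lists just described. The descriptor fields and flag names are assumptions for illustration, not the layout used by any particular controller.

    /* Hypothetical DMA descriptor rings shared between host memory and a
     * network controller.  Field names and sizes are illustrative only.  */
    #include <stdint.h>

    #define RING_ENTRIES 64

    struct dma_descriptor {
        uint64_t buffer_addr;   /* physical address of the packet buffer   */
        uint16_t length;        /* bytes to transmit, or buffer capacity   */
        uint16_t flags;         /* e.g. owned-by-controller, end-of-packet */
    };

    struct controller_rings {
        struct dma_descriptor request[RING_ENTRIES];  /* packets to send    */
        struct dma_descriptor receive[RING_ENTRIES];  /* free recv buffers  */
    };

    /* The host fills a request descriptor, marks it as owned by the
     * controller, and writes a memory-mapped "doorbell" register; the
     * controller DMAs the buffer onto the network, clears the ownership
     * flag, and raises an interrupt to signal completion.                 */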
The copy of data across the peripheral bus is thus under the control of the network controller, and does not require the intervention of the host processor. The controller will interrupt the host whenever a message has been received or sent.

Figure 2.2: Network Controller/Processor Memory Interaction

(The figure shows the network media attached to the network controller, which contains a network interface, control registers, request and receive block memory, and DMA engines; the controller is connected over the peripheral backplane bus to the processor memory, which holds the list of request blocks, the data to be transmitted, the list of free blocks, and the list of receive blocks.)

The figure describes the interaction between the network controller and the memory of the network node. The controller contains an on-board microprocessor, various memory-mapped control registers through which service requests can be made and status checked, a physical interface to the network media, and a buffer memory to hold request and receive blocks. These contain network messages to be transmitted or which have been received, respectively. Lists of pending requests and messages already received reside in the host processor's memory. Direct memory access (DMA) operations, under the control of the network controller, copy these blocks to and from this memory.

Figure 2.3: SCSI Phase Transitions on a Read

(The figure shows the phase sequence: command setup (arbitration, selection, message out (identify), command); a disconnect to seek and fill the buffer (message in (disconnect), bus free, arbitration, reselection, message in (identify)); the data transfer (data in), possibly interrupted by further disconnects bracketed by save data pointer and restore data pointer messages; and command completion (status, message in (command complete)).)

The basic phase sequencing for a read (from disk) operation is shown. First the initiator sets up the read command and sends it to the I/O device. The target device disconnects from the SCSI bus to perform a seek and to begin to fill its internal buffer. It then transfers the data to the initiator. This may be interspersed with additional disconnects, as the transfer gets ahead of the internal buffering. A command complete message terminates the operation. This figure is adapted from [Chervenak 90].

The command completion phase is entered once the data transfer is finished. The target device sends a status message to the initiator, describing any errors that may have been encountered during the operation. The final command completion message completes the I/O operation.

The SCSI protocol specification is currently undergoing a major revision for higher performance. In the so-called "SCSI-1," the basic clock rate on the channel is 10 MHz. In the new SCSI-2, "fast SCSI" increases the clock rate to 20 MHz, doubling the channel's bandwidth from 5 MBytes/second to 10 MBytes/second. Recently announced high performance disk drives, such as those from Fujitsu, support fast SCSI. The revised specification also supports an alternative method of doubling the channel bandwidth, called "wide SCSI." This provides a 16-bit datapath on the channel rather than SCSI-1's 8-bit width. By combining wide and fast SCSI-2, the channel bandwidth quadruples to 20 MBytes/second. Some manufacturers of high performance disk controllers have begun to use SCSI-2 to interface their controllers to a computer host.

Arbitration: Initiators arbitrate for access to the bus.
A device's physical address determines its priority.

Selection: The initiator informs the target that it will participate in an I/O operation.

Reselection: The target informs the initiator that an outstanding operation is to be resumed. For example, an operation could have been previously suspended because the I/O device had to obtain more data.

Command: Command bytes are written to the target by the initiator. The target begins executing the operation.

Data Transfer: The protocol supports two forms of the data transfer phase, Data In and Data Out. The former refers to the movement of data from the target to the initiator. In the latter, data moves from the initiator to the target.

Message: The message phase also comes in two forms, Message In and Message Out. Message In consists of several alternatives. Identify identifies the reselected target. Save Data Pointer saves the place in the current data transfer if the target is about to disconnect. Restore Data Pointer restores this pointer. Disconnect notifies the initiator that the target is about to give up the data bus. Command Complete occurs when the target tells the initiator that the operation has completed. Message Out has just one form: Identify. This is used to identify the requesting initiator and its intended target.

Status: Just before command completion, the target sends a status message to the initiator.

To better understand the sequencing among the phases, see Figure 2.3, which illustrates the phase transitions for a typical SCSI read operation. The sequencing of an I/O operation actually begins when the host's operating system establishes data and status blocks within its memory. Next, it issues an I/O command to the HBA, passing it pointers to the command, status, and data blocks, as well as the SCSI address of the target device. These are staged from host memory to device-specific queues within the HBA's memory using direct memory access techniques.

Now the I/O operation can begin in earnest. The HBA arbitrates for and wins control of the SCSI bus. It then indicates the target device it wishes to communicate with during the selection phase. The target responds by identifying itself during a following message out phase. Now the actual command, such as "read a sequence of bytes," is transmitted to the device.

We assume that the target device is a disk. If the disk must first seek before it can obtain the requested data, it will disconnect from the bus. It sends a disconnect message to the initiator, which in turn gives up the bus. Note that the HBA can communicate with other devices on the SCSI channel, initiating additional I/O operations. Now the device will seek to the appropriate track and will begin to fill its internal buffer with data. At this point, it needs to reestablish communications with the HBA. The device now arbitrates for and wins control of the bus. It next enters the reselection phase, and identifies itself to the initiator to reestablish communications.

The data transfer phase can now begin. Data is transferred one byte at a time using a simple request/acknowledgment protocol between the target and the initiator. This continues until the need for a disconnect arises again, such as when the target's buffer is emptied, or perhaps the command has completed. If it is the first case, the data pointer must first be saved within the HBA, so we can restart the transfer at a later time. Once the data transfer pointer has been saved, the target sequences through a disconnect, as described above.
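The phase sequencing described above, and continued below, can be sketched as a simple state machine in C. This is a simplified illustration of the transitions for a read, not a complete or normative encoding of the SCSI specification.

    /* Simplified SCSI bus phases for a read, as seen by the initiator.
     * Illustrative only; real SCSI defines more phases and messages.     */
    enum scsi_phase {
        BUS_FREE, ARBITRATION, SELECTION, RESELECTION,
        MESSAGE_OUT, COMMAND, DATA_IN, STATUS, MESSAGE_IN
    };

    /* One plausible happy-path transition function for a read in which the
     * target disconnects while it seeks and fills its internal buffer.   */
    enum scsi_phase next_phase(enum scsi_phase p, int target_disconnects,
                               int transfer_done)
    {
        switch (p) {
        case BUS_FREE:     return ARBITRATION;
        case ARBITRATION:  return SELECTION;     /* or RESELECTION          */
        case RESELECTION:  return DATA_IN;       /* after Identify/Restore  */
        case SELECTION:    return MESSAGE_OUT;   /* Identify                */
        case MESSAGE_OUT:  return COMMAND;       /* e.g. read               */
        case COMMAND:      return target_disconnects ? MESSAGE_IN : DATA_IN;
        case DATA_IN:      return transfer_done ? STATUS : MESSAGE_IN;
        case STATUS:       return MESSAGE_IN;    /* Command Complete        */
        case MESSAGE_IN:   return BUS_FREE;      /* Disconnect              */
        }
        return BUS_FREE;
    }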
When the disk is once again ready to transfer, it rearbitrates for the bus and identifies the initiator with which to reconnect. This is followed by a restore data pointer message to reestablish the current position within the data transfer. The data transfer phase can now continue where it left off.

                                  VME              FutureBus        MultiBus II      SCSI-I
    Bus width (signals)           128              96               96               25
    Address/data multiplexed?     no               yes              yes              n/a
    Data width                    16 - 32          32               32               8
    Transfer size                 single/multiple  single/multiple  single/multiple  single/multiple
    Number of bus masters         multiple         multiple         multiple         multiple
    Split transactions            no               optional         optional         optional
    Clocking                      async            async            sync             either
    Bandwidth, single word,
      0 ns memory                 25               37               20               5, 1.5
    Bandwidth, single word,
      150 ns memory               12.9             15.5             10               5, 1.5
    Bandwidth, multiple word,
      0 ns memory                 27.9             95.2             40               5, 1.5
    Bandwidth, multiple word,
      150 ns memory               13.6             20.8             13.3             5, 1.5
    Maximum number of devices     21               20               21               7
    Maximum bus length            0.5 m            0.5 m            0.5 m            25 m
    Standard                      IEEE 1014        IEEE 896         ANSI/IEEE 1296   ANSI X3.131

Table 2.2: Comparison Among Popular Backplane and Channel Interconnects. Bandwidths are in MBytes/second; the SCSI-I column lists the synchronous and asynchronous rates.

Table 2.2 gives some of the metrics of three popular backplane busses, VME, FutureBus, and MultiBus II (this table is adapted from [Hennessy 90]). For purposes of comparison, we include the same metrics for the SCSI-I channel specification. The table includes the width of the interconnect (including control and data signals), whether the address and data lines are multiplexed, the data width, whether the transfer size is a single or multiple word, the number of bus masters supported, whether split transactions are supported (these are network-like request and acknowledgment messages), the clocking scheme, the interconnect's bandwidth under a variety of assumptions (single vs. multiple word transfers, 0 ns access time memories vs. 150 ns access time), the maximum number of controllers or devices per bus, the maximum bus length, and the relevant ANSI or IEEE standard that defines the interconnect.

The most dramatic differences are in the interconnect width and the maximum bus length. In general, channel interconnects are narrow and long distance, while backplanes are wide but short distance. However, some of the distinctions are beginning to blur. The SCSI channel has many of the attributes of a bus, FutureBus has certain aspects that make it behave more like a channel than a bus, and nobody could describe a 64-bit HiPPI channel as being narrow! For example, let's consider FutureBus in a little more detail. The bus supports distributed arbitration, asynchronous signaling (that is, no global clocks), single source/multiple destination "broadcast" messages, and request/acknowledge split bus transactions [Borrill 84]. The latter are very much like SCSI disconnect/reconnect phases. A host issues a read request message to a memory or I/O controller, and then detaches from the bus. Later on, the memory sends a response message to the host, containing the requested data.

3. Storage Trends

3.1. The Storage Hierarchy and Storage Technology

3.1.1. Concept of the Storage Hierarchy

The storage hierarchy is traditionally modeled as a pyramid, with a small amount of expensive, fast storage at the pinnacle and larger capacity, lower cost, and lower performance storage as we move towards the base.
2.3.2. High Performance Parallel Interface

The High Performance Parallel Interface, HiPPI, was originally developed at the Los Alamos National Laboratory in the mid-1980s as a high speed unidirectional (simplex) point-to-point interface between supercomputers [Ohrenstein 90]. Thus, two-way communications require two HiPPI channels, one for commands and write data (the write channel) and one for status and read data (the read channel). Data is transmitted at a nominal rate of 800 Mbits/second (32-bit wide datapath) or 1600 Mbits/second (64-bit wide datapath) in each direction.

The physical interface of the HiPPI channel was standardized in the late 1980s. Its data transfer protocol was designed to be extremely simple and fast. The source of the transfer must first assert a request signal to gain access to the channel. A connection signal grants the channel to the source. However, the source cannot send until the destination asserts ready. This provides a simple flow control mechanism.

The minimum unit of data transfer is the burst. A burst consists of 1 to 256 words (the word width is determined by the physical width of the channel; for a 32-bit channel, a maximum-length burst is 1024 bytes), sent as a continuous stream of words, one per clock period. A burst is in progress as long as the channel's burst signal is asserted. When the burst signal goes unasserted, a CRC (cyclic redundancy check) word computed over the transmitted data words is sent down the channel. Because of the way the protocol is defined, when the destination asserts ready, it means that it must be able to accept a complete burst.

Unfortunately, the Upper Level Protocol (ULP) for performing operations over the channel is still under discussion within the standardization committees. To illustrate the concepts involved in using HiPPI as an interface to storage devices, we restrict our description to the proposal to layer the IPI-3 Device Generic Command Set on top of HiPPI, put forward by Maximum Strategy and IBM Corporation [Maximum Strategies 90].

A logical unit of data, sent from a source to a destination, is called a packet. A packet is a sequence of bursts. A special channel signal delineates the start of a new packet. Packets consist of a header, a ULP (Upper Layer Protocol) data set, and fill. The ULP data consists of a command/response field and a read/write data field.

Packets fall into three types: command, response, or data-only. A command packet can contain a header burst with an IPI-3 device command, such as read or write, followed by multiple data bursts if the command is a write. A response packet is similar. It contains an IPI-3 response within a header burst, followed by data bursts if the response is a read transfer notification. Data-only packets contain header bursts without command or response fields.

Consider a read operation over a HiPPI channel using the IPI-3 protocol. On the write channel, the slave peripheral device receives a header burst containing a valid read command from the master host processor. This causes the slave to initiate its read operation. When data is available, the slave must gain access to the read channel. When the master is ready to receive, the slave will transmit its response packet. If the response packet contains a transfer notification status, this indicates that the slave is ready to transmit a stream of data. The master will pulse a ready signal to receive subsequent data bursts.
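A rough C sketch of the packet framing and the read exchange just described. The structures, constants, and step numbering are invented for illustration; they do not reproduce the actual HiPPI or IPI-3 encodings, which were still being standardized when this was written.

    /* Illustrative HiPPI/IPI-3 style framing; field layouts are invented. */
    #include <stdint.h>

    #define WORDS_PER_BURST 256            /* 1024 bytes on a 32-bit channel */

    enum packet_type { PKT_COMMAND, PKT_RESPONSE, PKT_DATA_ONLY };
    enum ipi3_op     { IPI3_READ, IPI3_WRITE, IPI3_XFER_NOTIFICATION };

    struct header_burst {
        uint32_t packet_type;              /* enum packet_type               */
        uint32_t op;                       /* command or response code       */
        uint64_t block_address;            /* location on the target device  */
        uint32_t data_bursts_to_follow;    /* 0 for a read command           */
    };

    /*
     * Read sequence sketched in the text:
     *   1. master sends a PKT_COMMAND header burst (op = IPI3_READ) on the
     *      write channel;
     *   2. slave performs its read and arbitrates for the read channel;
     *   3. slave sends a PKT_RESPONSE header burst carrying
     *      IPI3_XFER_NOTIFICATION;
     *   4. master pulses ready once for each burst it can accept, and the
     *      slave streams data bursts, each followed by a CRC word.
     */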
2.4. Backplane Architecture

Backplanes are designed to interconnect processors, memory, and peripheral controllers (such as network and disk controllers). They are relatively wide, but short distance. The short distances make it possible to use fast, centralized arbitration techniques and to perform data transfers at a higher clock rate. Backplane protocols make use of addresses and read/write operations, rather than the more message-oriented protocols found on networks and channels.

Figure 3.2: Typical Storage Hierarchy, Circa 1990

(The figure shows a client workstation, with its file cache and local magnetic disk, connected over a local area network to a file server with its own server cache, "remote" magnetic disks, and magnetic tape.)

The file cache has become substantially larger, and may be partially duplicated at the client in addition to the server. Secondary storage is split between local and remote disks. Tape continues to provide the third level of storage.

A good rule of thumb for a unit of tertiary storage media, such as a tape spool, is that it should have as much capacity as the secondary storage devices it is meant to back up. As disk devices continue to improve in capacity, tertiary storage media are driven to keep pace.

In 1980, a typical machine of this class would have one to two megabytes of semiconductor memory, of which only a few thousand bytes might be allocated for input/output buffers or file system caches. The secondary storage level might include a few hundred megabytes of magnetic disk. The tape storage level is limited only by the amount of shelf space in the machine room.

3.1.2. Evolution of the Storage Hierarchy

Figure 3.2 shows the storage hierarchy distributed across a workstation/server environment of today. Most of the semiconductor memory in the server can be dedicated to the cache function, because a server does not host conventional user applications. The file system "metadata," that is, the data structures describing how logical files are mapped onto physical disk blocks, can be held in fast semiconductor memory. This represents much of the active portion of the file system. Thus, disk latency can be avoided while servicing user requests.

The critical challenge for workstation/server environments is the added latency of network communications. These latencies are comparable to those of magnetic disk, and are measured in small tens of milliseconds. The figure shows one possible solution, which places small high performance disks in the workstation, with larger, potentially slower disks at the server. If most accesses can be serviced by the local disks, the network latencies can be avoided altogether, improving client performance and responsiveness. However, there are several choices for how to partition the file system between the clients and the servers. Each of these partitionings represents a different tradeoff between system cost, the number of clients per server, and the ease of managing the clients' files.

A swapful client allocates the virtual memory swap space and temporary files to its local disk. The operating system's files and user files remain on the server. This reduces some of the network traffic to the server, leaving the issues of system management relatively uncomplicated. For example, in this configuration, the local disk does not need to be backed up. However, executing an operating system command still requires an access to the remote server.
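To see why local disks help, consider a back-of-the-envelope model of expected access latency. The hit fraction and latency figures in this C sketch are assumptions chosen only to be consistent with the "small tens of milliseconds" quoted above.

    /* Expected file access latency when a fraction 'local_hit' of requests
     * is serviced by the client's local disk.  All numbers are assumed.   */
    #include <stdio.h>

    int main(void)
    {
        double local_disk_ms = 20.0;          /* local disk access           */
        double remote_ms     = 20.0 + 20.0;   /* network round trip + server */
        double local_hit     = 0.8;           /* assumed local hit fraction  */

        double expected = local_hit * local_disk_ms
                        + (1.0 - local_hit) * remote_ms;
        printf("expected latency: %.1f ms\n", expected);   /* prints 24.0   */
        return 0;
    }

Under these assumptions the average request costs 24 ms rather than 40 ms, and the benefit grows with the local hit fraction.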
Figure 3.1: Typical Storage Hierarchy, Circa 1980

(The figure shows a pyramid with the file cache at the top, magnetic disk in the middle, and magnetic tape at the base; cost per megabyte declines, and capacity and access time increase, toward the base.)

Microsecond access is provided by the file cache, a small number of bytes stored in semiconductor memory. Medium capacity, denominated in several hundred megabytes with tens of milliseconds access, is provided by disk. Tape provides unlimited capacity, but access is restricted to tens of seconds to minutes.

In general, there are order-of-magnitude differences in capacity, access time, and cost among the layers of the hierarchy. For example, main memory is measured in megabytes, costing approximately $50/MByte, and can be accessed in small numbers of microseconds. Secondary storage, usually implemented by magnetic disk, is measured in gigabytes, costs below $5/MByte, and is accessed in tens of milliseconds. The operating system can create the illusion of a large fast memory by judiciously staging data among the levels. However, the organization of the storage hierarchy must adapt as magnetic and optical recording methods continue to improve and as new storage devices become available.

Figure 3.1 depicts the storage hierarchy of a typical minicomputer of 1980. (It should be noted that large mainframe and supercomputer storage hierarchies were more complex than what is depicted here.) A small file cache (or buffer), allocated by the operating system from the machine's semiconductor memory, provides the fastest but most expensive access. The job of the cache is to hold data likely to be accessed in the near future, because it is near data recently accessed (spatial locality) or because it has recently been accessed itself (temporal locality). Prefetching is a strategy that accesses larger chunks of file data than requested by an application, in the hope that the application will soon access spatially local data.

Either a buffer or a cache can be used to decouple application accesses in small units from the larger units needed to efficiently utilize secondary storage devices. It is not efficient to amortize the millisecond latency of accessing secondary storage over a small number of bytes; accesses in the range of 512 to 8192 bytes are more appropriate. The primary distinction between application memory and a cache is the latter's ability to keep certain data resident. For example, frequently accessed file directories can be held in a cache, thus avoiding slow accesses to the lower levels of the hierarchy.

Secondary storage is provided by magnetic disk. Data is recorded on concentric tracks on stacked platters, which have been coated with magnetic materials. The same track position across the platters is called a cylinder. A mechanical actuator positions the read/write heads to the desired recording track, while a motor rotates the platters containing the data under the heads.

Tertiary storage, provided primarily for archive and backup, is implemented by magnetic tape. A spool of magnetic tape is drawn across the read/write mechanism in a sequential fashion.

Because the performance of both large and small disk drives is limited by mechanical delays, it is no surprise that performance can be dramatically improved if the data to be accessed is spread across many disk actuators. Disk arrays provide a method of organizing many disk drives to appear logically as a very reliable single drive of high capacity and high performance [Katz 89].
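The reliability claim rests on parity redundancy, which the following paragraphs describe in detail; as a preview, here is a minimal C sketch of the bitwise XOR reconstruction that underlies it. It is an illustration of the idea only; real arrays operate on sectors and handle many additional details.

    /* Reconstruct the contents of one failed disk in an N+1 disk stripe
     * by XORing the surviving disks (including the parity disk).          */
    #include <stddef.h>
    #include <stdint.h>

    void reconstruct(uint8_t *lost, uint8_t *const survivors[],
                     size_t ndisks_surviving, size_t nbytes)
    {
        for (size_t i = 0; i < nbytes; i++) {
            uint8_t x = 0;
            for (size_t d = 0; d < ndisks_surviving; d++)
                x ^= survivors[d][i];   /* parity of the surviving disks    */
            lost[i] = x;                /* equals the missing disk's byte   */
        }
    }

The same identity yields the small-write rule used below for RAID Level 5: the new parity is the old parity XORed with the old data and the new data.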
Disk array organizations form a multilevel taxonomy. Here, we concentrate on the two most prevalent RAID organizations: RAID Level 3 and RAID Level 5. Each of these spreads data across N data disks and an N+1st redundancy disk. The group of N+1 disks is called a stripe set.

In a RAID Level 3 organization, data is interleaved in large blocks (for example, a track or cylinder) across all of the disks within a stripe set. The redundancy disk contains a parity bit computed bit-wise across the rows of bits on the associated data disks. If a disk should fail, its contents can be reconstructed simply by examining the surviving N disks and restoring the sense of the parity computed across the bit rows. Suppose that the redundancy disk contained odd parity before the failure. If, after a disk failure, the examination of a bit row yields even parity, then the failed disk must have had a 1 in that bit row. Similarly, if the row has odd parity, then the missing bit must have been a 0. RAID Level 3 organizations are read and written in full stripe units, simultaneously accessing all disks in the stripe. The organization is most suitable for high bandwidth applications such as image processing and scientific computing.

If RAID Level 3 is organized for high bandwidth, then RAID Level 5 is organized for high I/O rate. The basic organization is the same: a stripe set of N data disks and one redundancy disk. However, data is accessed in smaller units, thus making it possible to support multiple simultaneous accesses. Consider a data write operation to a single disk sector. This requires the parity redundancy to be updated as well. We accomplish this by first determining the bit changes to the data sector and then inverting exactly those bits in the associated parity sector. Thus a logical single-sector write may involve four physical disk accesses: read old data, read old parity, write new data, and write new parity. Since placing all parity sectors on a single disk drive would limit the array to a single write operation at a time, the parity sectors are actually interleaved across all disks of the stripe. A RAID Level 5 can perform N+1 simultaneous reads and ⌊(N+1)/2⌋ simultaneous writes (in the best case).

Near-Line Storage

A comparable revolution has taken place in tertiary storage: the arrival of near-line storage systems. These provide relatively rapid access to enormous amounts of data, frequently stored on removable, easy-to-handle optical disk or magnetic tape media. This is accomplished by storing the high capacity media on shelves that can be accessed by robotic media "pickers." When a file needs to be accessed, special file management software identifies where it can be found within the tape or optical disk library. The picker exchanges the currently loaded media with the one containing the file to be accessed. This is accomplished within a small number of seconds, without any intervention by human operators. By carefully exploiting caching techniques, in particular by using the secondary storage devices as a cache for the near-line store, the very large storage capacity of a tertiary storage system can appear to have access times comparable to magnetic disks at a fraction of the cost. We describe the underlying storage technologies next.

Optical Disk Technology for Near-line Storage

Optically recorded disks have long been thought to be ideal for filling the near-line level of the storage hierarchy [Ranade 90].
They combine improved storage capacity (2 GBytes per platter surface originally, to over 6 GBytes per side today) with access times that are approximately a factor of ten slower than conventional magnetic disks (several hundred milliseconds).

A dataless client adds the operating system's files to the client's local disk. This further reduces the client's demand on the server, thus making it possible for a single server and network to support more clients. While it is still not necessary to back up the local disk, the system is more difficult to administer. For example, system updates must now be distributed to all of the workstations.

A diskfull client places all but some frequently shared files on the client. This yields the lowest demands on the server, but represents the biggest problems for system management. Now the personal files on the local disk need to be backed up, leading to significant network traffic during backup operations.

An alternative approach leverages lower cost semiconductor memory to make feasible large file caches (approximately 25% of the available memory within the workstation) in the client workstation. These "client" caches provide an effective way to circumvent network latencies, if the network protocols allow file writes to be decoupled from communications with the server (see the discussion of NFS protocols in the next section). The approach, called diskless clients, has been used with great success in the Sprite Network Operating System [Nelson 88], whose developers report an ability to support 5 to 10 times as many clients per server as more conventional client/server organizations.

Figure 3.3 depicts one possible scenario for the storage hierarchy of 1995. Three major technical innovations shape the organization: disk arrays, near-line storage subsystems based on optical disk or automated tape libraries, and network distribution. We concentrate on disk arrays and near-line storage system technology in the remainder of this subsection. Network distribution is covered in Section 4.

Disk Arrays

Because of the rapidly decreasing formfactor of magnetic disks, it is becoming attractive to replace a small number of large disk drives with very many small drives. The resulting secondary storage system can have much higher capacity, since small format drives traditionally obtain the highest areal densities. And since the performance of both large and small disk drives is limited by mechanical positioning delays, performance improves dramatically when data is spread across many disk actuators.

Figure 3.3: Typical Storage Hierarchy, Circa 1995

(The figure shows on-line storage (client caches, a server cache, and a disk array on a local area network), near-line storage (an optical disk jukebox and a magnetic or optical tape library reached over a wide area network), and off-line storage (shelved magnetic or optical tape).)

Conventional disks have been replaced by disk arrays, a method of obtaining much higher I/O bandwidth by striping data across multiple disks. A new level of storage, "near-line," emerges between disk and tape. It provides very high capacity, but at access times measured in seconds.

A second-generation technology, recently introduced, doubles the tape capacity and transfer rate. However, there has been an enormous increase in tape capacity, driven primarily by helical scan recording methods. In a conventional tape recording system, the tape is pulled across stationary read/write recording heads. Recorded data tracks run in parallel along the length of the tape.
On the other hand, helical scan methods slowly move the tape past a rapidly rotating head assembly to achieve a very high tape-to-head speed. The tape is wrapped at an angle around a rotor assembly, yielding densely packed recording tracks running diagonally across the tape. The technology is based on the same tape transport mechanisms developed for video cassette recorders in the VHS and 8mm tape formats and for the newer digital audio tape (DAT) systems.

Each of these systems provides a very high storage capacity in a small, easy-to-handle cartridge. The small formfactor makes these tapes particularly attractive as the basis for automated data libraries. Tape systems from Exabyte, based on the 8mm video tape format, can store 2.3 GBytes and transfer at approximately 250 KBytes per second. A second generation system now available doubles both the capacity and the transfer rate. A tape library system based on a 19" rack can hold up to four tape readers and over one hundred 8mm cartridges, thus providing a storage capacity of 250 to 500 GBytes [Exabyte 90].

DAT tape provides smaller capacity and bandwidth than 8mm, but enjoys certain other advantages [Tan 89]. Low cost tape readers in the 3.5" formfactor, the size of a personal computer floppy disk drive, are readily available. This makes possible the construction of tape libraries with a higher ratio of tape readers to tape media, increasing the aggregate bandwidth to the near-line storage system. In addition, the DAT tape formats support subindex fields which can be searched at a speed two hundred times greater than the normal read/write speed. A given file can be found on a DAT tape in an average search time of only 20 seconds, compared to over ten minutes for the 8mm format.

VHS-based tape systems can transfer up to 4 MBytes/second and can hold up to 15 GBytes per cartridge. Tape robotics in use in the broadcast industry have been adapted to provide a near-line storage function.

Helical scan techniques are not limited to consumer applications, but have also been applied to certain instrument recording applications, such as satellite telemetry, which require high capacity and high bandwidth. These tape systems are called DD1 and DD2. A single tape cartridge can hold up to 150 GBytes, and can transfer at a rate of up to 40 MBytes/second. However, such systems are very expensive, and a good rule of thumb is that the tape recorder will cost $100,000 for each 10 MBytes/second of recording bandwidth it can support.

Optical Tape Technology for Near-Line Storage

A recording technology that appears to be very promising is optical tape [Feder 91]. The recording medium is called digital paper, a material constructed from an optically sensitive layer that has been coated onto a substrate similar to magnetic tape. The basic recording technique is similar to write-once optical disk storage: a laser beam writes pits in the digital paper to indicate the presence (or absence) of a bit. Since the pits have lower reflectivity than the unwritten tape, a reflected laser beam can be used to detect their presence. One 12-inch by 2400-foot reel can hold 1 TByte of data, can be read or written at the rate of 3 MBytes per second, and can be accessed in a remarkable average time of 28 seconds.

Two companies are developing tape readers for digital paper: CREO Corporation and LaserTape Corporation. CREO makes use of 12-inch tape reels and a unique laser scanner array to read and write multiple tracks 32 bits at a time [Spencer 88].
The system is rather expensive, selling for over $200,000. LaserTape places digital paper in a conventional 3480 tape cartridge (50 GBytes of capacity and a 3 MBytes per second transfer rate).

The first generation of optical disks could be written only once, but read many times, leading to the term "WORM" to describe the technology. The disk is written by a laser beam. When the beam is turned on, it records data in the form of pits or bubbles in a writing layer within the disk. The data is read back by detecting the variations in reflectivity of the disk surface. The write-once nature of optical storage actually makes it better suited for an archival medium than for near-line storage, since it is impossible to accidentally overwrite data once it has been written. A problem has been its relatively slow transfer rate, 100 to 200 KBytes per second. Newer generations of optical drives now exceed one megabyte per second transfers.

Magneto-optical technologies, based on a combination of optical and magnetic recording techniques, have recently led to the availability of erasable optical drives. The disk is made of a material that becomes more sensitive to magnetic fields at high temperatures. A laser beam is used to selectively heat up the disk surface; once heated, a small magnetic field is used to record on the surface. Optical techniques are used for reading the disk, by detecting how the laser beam is deflected by different magnetizations of the disk surface. Read transfer rates are comparable to those of conventional magnetic disks. Access times are still slower than those of a magnetic disk, due to the more massive read/write mechanism holding the laser optics, which takes longer to position than the equivalent low mass magnetic read/write head assembly. The write transfer rate is worse in optical disk systems because (1) the disk surface must first be erased before new data can be recorded, and (2) the written data must be reread to verify that it was written correctly to the disk surface. Thus, a write operation could require three disk revolutions before it completes. ([Kryder 89] details the trends and technology challenges for future optical disk technologies.)

Nevertheless, as the formfactor and price of optical drives continue to decrease, optical disk libraries are becoming more pervasive. Sony's recent announcement of a consumer-oriented recordable music compact disk could lead to dramatic reductions in the cost of optical disk technology.

As an example of an inexpensive optical disk system, let's examine the Hewlett-Packard Series 6300 Model 20GB/A Optical Disk Library System [Hewlett-Packard 89]. Based on 5.25" rewritable optical disk technology, the system provides two optical drives, 32 read/write optical disk cartridges (approximately 600 MBytes per cartridge), and a robotic disk changer that can move cartridges to and from the drives, all in a desk-side unit the size of a three-drawer filing cabinet. The optical cartridges can be exchanged in 7 seconds, and require a 4 second load time and a 2.4 second spin-up time. The unload and spin-down times are 2.8 and 0.8 seconds respectively. An average seek requires 95 ms. The drives can sustain 680 KBytes/second transfers on reads and 340 KBytes/second transfers on writes.

The Kodak Optical Disk System 6800 Automated Disk Library is characteristic of the high end [Kodak 90].
The system can be configured with 50 to 150 optical disk platters, and 1 to 3 optical disk drives. It is capable of storing from 340 GBytes to 1020 GBytes (3.4 GBytes for each side of a 14" platter). The average disk change time is 6.5 seconds. The optical disk surface is organized into five bands of varying capacity, with a certain number of tracking windows per band. The drives can sustain 1 MByte/second transfers, with access times ranging from 100 ms for data anywhere within the current band to 700 ms for data anywhere on the surface.

Magnetic Tape Technology for Near-line Storage

The sequential nature of access to magnetic tape has traditionally dictated that it be used as the medium for archive. However, the success of automated tape libraries from Storage Technology Corporation has demonstrated that tape can also be used to implement a near-line storage system. The most pervasive magnetic tape technology available today is based on the IBM 3480 half-inch tape cartridge, storing 200 MBytes and providing transfer rates of 3 MBytes per second.

Figure 3.4: I/O Data Flow

(The figure traces a read from the I/O device's head/disk assembly, over a serial interface into the embedded controller's track buffers (32 KBytes to 256 KBytes), over the disk channel into the I/O controller's HBA buffers (1 to 4 MBytes), by DMA over the peripheral bus into operating system buffers (approximately 10 MBytes) on the host processor, and finally by a memory-to-memory copy into the application address space.)

In response to a read operation, data moves from the device, to the embedded controller, to the I/O controller, to operating system buffers, and finally to the application, across a variety of different interfaces.

3.2. Storage Controller Architecture

3.2.1. I/O Data Flow

Figure 3.4 shows the various interfaces across which a typical I/O request must flow. The actual flow of data starts at the I/O device. In the following discussion, we will assume that the device is an intelligent magnetic disk with something like a SCSI interface, and that we are considering a read operation. The mechanical portion of the disk drive is called the head/disk assembly, or HDA. The control and interface to the outside world are provided by an embedded controller. Data moves across a bit-serial interface from the disk signal processing electronics to track buffers associated with the embedded controller. The amount of memory associated with the track buffers varies from 32 KBytes to 256 KBytes. Since the typical track on today's small formfactor disks is in the range of 32 to 64 KBytes, a typical embedded controller can buffer more than one track.

The interface between the embedded controller and the host is provided by an I/O controller. We called such a controller a host bus adapter, or HBA, in Section 2.3.1. It couples the host peripheral bus to the disk channel interface. Data is staged into buffers within the HBA, from which it is copied out via direct memory access techniques to the host's memory. The typical size of I/O controller buffers is in the range of 1 to 4 MBytes.

The host's memory is coupled to the processor via a high speed cache memory. The connection to the I/O controllers is through a slower speed peripheral bus. Direct memory access operations copy data from the controller's buffers to operating system buffers in main memory. Before the data can be used by the application, it may need to be copied once again, to stage it into a portion of the memory address space that is accessible to the application.
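A schematic C sketch of the copy chain just described for a read. The buffer sizes and function boundaries are illustrative assumptions chosen to mirror Figure 3.4, not a description of any particular driver.

    /* Stages a disk read passes through on its way to the application.
     * Sizes and names are assumptions; memcpy stands in for channel
     * transfers and DMA as well as for true memory-to-memory copies.     */
    #include <stdint.h>
    #include <string.h>

    #define TRACK_BUF  (64 * 1024)          /* embedded controller          */
    #define HBA_BUF    (2 * 1024 * 1024)    /* host bus adapter             */
    #define OS_BUF     (8 * 1024)           /* one file system block        */

    static uint8_t track_buffer[TRACK_BUF]; /* filled from the media        */
    static uint8_t hba_buffer[HBA_BUF];     /* filled over the disk channel */
    static uint8_t os_buffer[OS_BUF];       /* filled by DMA over the bus   */

    /* Each arrow in Figure 3.4 becomes one copy below (nbytes <= OS_BUF). */
    void read_block(uint8_t *user_buffer, size_t nbytes)
    {
        memcpy(hba_buffer,  track_buffer, nbytes); /* xfer over disk channel  */
        memcpy(os_buffer,   hba_buffer,   nbytes); /* DMA over peripheral bus */
        memcpy(user_buffer, os_buffer,    nbytes); /* copy into user space    */
    }

Each stage adds latency and consumes memory bandwidth, which is why the copy overheads noted below dominate over protocol processing.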
Note that the same memory and operating system overheads that limit network performance also affect I/O performance. This is critically important in file and storage servers, where both the I/O and network traffic must be routed through the memory system bottleneck. If the host is actually a file server, the size of the operating system's buffers may be quite large, perhaps as large as 128 MBytes. In addition, the data flow must be extended to include transfers across the network interconnect into the application's address space on the client. A detailed examination of the operating system management of the I/O path will be left until Section 4.

Technology                Capacity (MB)   BPI       TPI     BPI*TPI    Data Transfer   Access Time
                                                            (million)  (KBytes/s)
Conventional Tape
  Reel-to-Reel (1/2")     140             6250      18      0.11       549             minutes
  Cartridge (1/4")        150             12000     104     1.25       92              minutes
  IBM 3480 (1/2")         200             22860     38.1    0.87       3000            seconds
Helical Scan Tape
  VHS (1/2")              15000           ?         ?       ?          4000            minutes
  Video (8mm)             4600            43200     1638    70.56      492             minutes
  DAT (4mm)               1300            61000     1870    114.07     183             20 seconds
Optical Tape
  CREO (35mm)             1 TB            9336000   24      224        3000            28 seconds
Magnetic Disk
  Seagate Elite (5.25")   1200            33528     1880    63.01      3000            18 ms
  IBM 3390 (10.5")        3800            27940     2235    62.44      4250            20 ms
  Floppy Disk (3.5")      2               17434     135     2.35       92              1 second
Optical Disk
  CD ROM (3.5")           540             27600     15875   438.15     183             1 second
  Sony MO (5.25")         640             24130     18796   453.54     87.5            100 ms
  Kodak (14")             3200            21000     14111   296.33     1000            100's of ms

TABLE 3.1: Relevant Metrics for Alternative Storage Technologies

LaserTape places digital paper in a conventional 3480 tape cartridge (50 GBytes capacity and 3 MBytes per second transfer rate), and replaces a 3480 tape unit's magnetic read/write heads with an inertialess laser-beam scanner. The scanner operates by using a high frequency radio signal of known frequency to vibrate a crystal, which is then transferred to a laser beam to steer it to the desired read/write location. A 3480 tape reader can be "retrofitted" for approximately $20,000. Existing tape library robotics for the 3480 cartridge formfactor can be adapted to LaserTape without changes.

Summary

Table 3.1 summarizes the relevant metrics of the alternative storage technologies, with a special emphasis on helical scan tapes. The metrics displayed are the capacity, bits per inch (BPI), tracks per inch (TPI), areal density (BPI*TPI in millions of bits per square inch), data transfer rate (KBytes per second sustained), and average positioning time. The latter is especially important for evaluating near-line storage media. An access time measured in a small number of seconds begins to make tape technology attractive for near-line storage applications, since the robotic access times tend to dominate the time it takes to pick, load, and access data on near-line storage media.

4. Software Trends

4.1. Network File Systems

One of the most important software developments over the past decade has been the rapid development of the concept of remote file services.
In a location-transparent manner, these systems provide a client with the ability to access remote files without the need to resort to special naming conventions or special methods for access.

It is important to distinguish between the related concepts of block server and file server. A block server (sometimes called a "network disk") provides the client with a physical device interface over a network. The block server supports read and write requests to disk blocks, albeit to a disk attached to a remote machine. A file server supports a higher level interface, providing the complete file abstraction to the client. The interface supports file creation, logical reads and writes, file deletion, etc. In a file server, file system related functions are centralized and performed by the server. In a block server, these functions must be handled by the clients, and if the disks are to be shared across machines, this requires distributed coordination among them.

The most ubiquitous file system model is based on that of UNIX, and so we begin our discussion with its structure. A file is uniquely named within a hierarchical name space based on directories. As far as the user is concerned, a file is nothing more than an uninterpreted stream of bytes. The file system provides operations for positioning within the file for the purposes of reading and writing bytes. Internally, the file system keeps track of the mapping between the file's logical byte stream and its physical placement within disk blocks through a data structure called an inode. The inode is "metadata," that is, data about data, and contains information such as the device containing the file, a list of the physical disk blocks containing the file's data, and pointers to additional disk blocks (called indirect blocks) should the file be large enough to exceed the mapping space of a single inode.

From the operating system perspective, tracing an I/O request from the application to disk proceeds as follows. The application program must make a system call, such as read or write, to request service from the operating system. This is handled by the UNIX System Call Layer, which in turn calls the file system to handle the request in detail. Within the file system are block I/O routines which handle read or write requests. These call a particular disk driver to schedule the actual disk transfers. The software layers are shown in Figure 4.1.

The figure shows the software architecture for a file system on the same machine as the client application. The major innovation of Sun's Network File System, or NFS, is its ability to map remote file systems into the directory structure of the client's machine. That is, it is transparent to the user whether the referenced file is available locally or is being accessed over the network. This is accomplished through the new abstraction of a virtual file system, or VFS [Sandberg 85]. The VFS interface allows file system requests to be dispatched to the local file system or sent to a remote server across the network. The generic software layers are shown in Figure 4.2, and the path through the software taken between the client and the server is shown in Figure 4.3.

The access to the remote machine is implemented via a synchronous remote procedure call (RPC) mechanism. This is a communications abstraction that behaves much like a conventional procedure call, except that the procedure being invoked may be on a remote "server" machine.
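A minimal sketch of what such a call looks like from the client's side appears below. The request layout, procedure number, and function names are invented stand-ins for whatever RPC package carries the NFS procedures, not Sun's actual interfaces, and the transport is stubbed so the example runs.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical marshalled request for a remote read. The file is named by
 * an opaque handle obtained from an earlier lookup, so the request carries
 * everything the server needs (no open-file state on the server).          */
struct remote_read_req {
    uint8_t  file_handle[8];
    uint32_t offset;           /* byte offset within the file */
    uint32_t count;            /* number of bytes requested   */
};

/* Stand-in for the RPC transport: a real implementation marshals the
 * arguments, sends them to the server, and blocks until the reply comes
 * back over the network. Here it just fabricates data so the sketch runs. */
static int rpc_call_read(const struct remote_read_req *req, char *reply)
{
    memset(reply, 'x', req->count);   /* pretend the server returned data */
    return (int)req->count;
}

/* Client-side stub: to the caller this looks like an ordinary local read(),
 * even though the file lives on a remote server machine.                   */
static int remote_read(const uint8_t handle[8], uint32_t off,
                       char *buf, uint32_t n)
{
    struct remote_read_req req;
    memcpy(req.file_handle, handle, sizeof req.file_handle);
    req.offset = off;
    req.count  = n;
    return rpc_call_read(&req, buf);  /* the caller waits here for the reply */
}

int main(void)
{
    uint8_t handle[8] = {0};
    char buf[64];
    int got = remote_read(handle, 0, buf, sizeof buf);
    printf("remote_read returned %d bytes\n", got);
    return 0;
}
```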
Since the RPC is synchronous, the client must wait, or block, until the server has completed the call and returned the requested data or status. The NFS protocol is a collection of procedure calls and parameters built on top of such an RPC mechanism. One of the key design decisions of NFS is to make this protocol stateless. This

Figure 3.5: Internal Organization of an I/O Controller. The I/O controller couples a peripheral bus (VME, FutureBus, etc.) to the I/O channel via a buffer memory. Hardware in the peripheral bus interface implements direct memory access between this buffer memory and the host's memory. The I/O channel interface implements the handshaking protocols with the I/O devices. The on-board microprocessor is the "traffic cop," coordinating the actions of the two interfaces, and a ROM holds the controller firmware.

3.2.2. Internal Organization of an I/O Controller

Figure 3.5 shows the internal organization of a typical high performance host bus adapter I/O controller. Interestingly enough, it is not very different in its internal architecture from the network controller of Figure 2.2. Usually implemented on a single printed circuit board, the controller contains a microprocessor, a modest amount of memory dedicated to buffers and run-time data structures, a ROM to hold the controller firmware, a DMA/peripheral bus interface, and an I/O channel interface.

The system interface is also similar to the network controller described previously. Request blocks containing I/O commands and data are organized into a linked list in the host memory. The host writes to a memory-mapped command register within the I/O controller to initiate an operation. Using DMA techniques, the controller fetches the request blocks into its own memory. The on-board microprocessor unpackages the I/O commands and write data, and sends these over the I/O channel interface. Status and read data are repackaged into response blocks that are copied back to reserved buffers in the host memory. The host can choose whether the I/O controller will interrupt the host whenever an operation has been completed.

The controller of Figure 3.5 is notable because of its support for direct memory access. Some lower performance controllers require that commands and data be written a word (or half word) at a time to memory-mapped controller registers over the peripheral bus. Since a typical command block can be 16 to 32 bytes in length, simply downloading a command may take tens of microseconds, requiring a good deal of host processor intervention.

In implementing a high performance file service on a network, a critical relationship exists between the network and I/O controller architectures. The network interface and the I/O controller must be coupled by a high performance interconnect and memory system. This key observation provides the motivation for several of the systems reviewed in Section 5, especially the prototype being developed at U.C. Berkeley described in Section 5.6.
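The request-block protocol just described can be sketched as follows. The structure layout, field names, and register address are invented for illustration and do not correspond to any particular controller; the point is only that a single "doorbell" write hands the controller a linked list it can then fetch by DMA.

```c
#include <stdint.h>

/* Hypothetical request block, built by the host driver in main memory and
 * fetched by the controller with DMA. Real controllers define their own
 * layouts; these fields simply mirror the description in the text.        */
struct request_block {
    uint32_t opcode;        /* e.g. READ or WRITE                          */
    uint32_t device;        /* target device on the I/O channel            */
    uint32_t block_number;  /* starting disk block                         */
    uint32_t byte_count;    /* transfer length                             */
    uint64_t host_buffer;   /* physical address for the DMA data transfer  */
    uint64_t next;          /* physical address of the next request; 0 = end */
};

/* Assumed memory-mapped command register ("doorbell"); the address is a
 * placeholder, not a real device.                                          */
#define CMD_REGISTER ((volatile uint64_t *)0x40000000)

static void start_io(uint64_t first_request_block_phys)
{
    *CMD_REGISTER = first_request_block_phys;   /* one doorbell write       */
    /* From here on the controller DMAs the request blocks into its own
     * buffer memory, performs the channel operations, posts response blocks
     * back to reserved host buffers, and optionally raises an interrupt.    */
}

int main(void)
{
    /* start_io() is not called here, since the register address above is
     * only a placeholder rather than real hardware.                         */
    (void)start_io;
    return 0;
}
```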
Figure 4.3: Path of an NFS Client-to-Server Request. On both the client and the server, the layers are the UNIX System Call Layer, the Virtual File System Interface, and the RPC/transmission protocols; the client side contains the NFS file system, and the server side contains the server routines. A request to access a remote file is handled by the client's VFS, which maps the request through the NFS layer into RPC calls to the server over the network. At the server end, the requests are presented to the VFS, this time to be mapped into calls on the server's local UNIX file system.

that the real problem is the stateless nature of the NFS protocol, and its associated forced disk writes. By making file system operations synchronous with disk, the performance of the file system is (overly) coupled to the performance of the disk system [Rosenblum 91].

4.2. File Server Architecture

In this subsection, we examine the flow of a network-based I/O request as it arrives at the network interface, through the file server's hardware and software, to the storage devices and back again to the network. Our goal is to bring together the discussions of network interface, I/O controller, and network file system processing of an I/O request initiated by a client on the network.

Figure 4.4 shows the hardware/software architecture of a conventional workstation-based file server. A data read request arrives at the Ethernet controller. The network messages are copied from the network controller to the server's primary memory. Control passes through the software levels of the network driver and protocol interpretation to process the request. At the file system level, to avoid unnecessary disk accesses, the server's primary memory is interrogated to determine if the requested data has already been cached from disk.

Figure 4.4: Conventional File Server Architecture. A single-processor file server couples an Ethernet controller, primary memory, and a disk controller over a backplane bus; the kernel software comprises the NFS protocol and file processing, the TCP/IP protocols and Ethernet driver, and the UNIX file system with its disk manager and driver. An NFS I/O request arrives at the Ethernet interface of the server. The request is passed through to the network driver, the protocol processing software, and the file system. The request may be satisfied by data cached in the primary memory; if not, the data must be accessed from disk. At this point, the process is reversed to send the requested data back over the network. This figure is adapted from [Nelson 90].

Figure 4.1: Software Layers in the UNIX File System. The layers are the Application Program, the UNIX System Call Layer, the UNIX File System, the Block I/O functions, and the Block Device Driver. The UNIX System Call Layer dispatches a read or write request to the File System, which in turn calls a Block I/O routine. This calls a specific device driver to handle the scheduling of the I/O request.

means that each procedure call is completely self-describing; the server keeps track of no past requests. This choice was made to drastically reduce the complexity of recovery. In the event of a server crash, the client simply retries its request until it is successfully serviced. As far as the client is concerned, there is no difference between a crashed server and one that is merely slow. The server need not perform any recovery processing.
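This recovery-free behaviour can be pictured with the hypothetical client-side loop below. The function names and retry policy are invented; a production client would back off between attempts and eventually report an error rather than retry forever. The essential point, taken from the text, is that resending a self-describing request is always safe.

```c
#include <stdio.h>

/* Stand-in for one synchronous NFS RPC: returns 0 on success, -1 if the
 * reply timed out (server crashed, rebooting, or merely slow).            */
static int nfs_rpc_once(int attempt)
{
    return (attempt < 3) ? -1 : 0;   /* pretend the server answers on try 3 */
}

/* Client-side behaviour described in the text: keep retrying until the
 * server services the request. No recovery protocol is needed on either
 * side, since the server holds no per-client state that must be rebuilt.  */
static void nfs_rpc_retry(void)
{
    int attempt = 0;
    while (nfs_rpc_once(attempt) != 0) {
        attempt++;
        printf("no reply, retrying (attempt %d)...\n", attempt);
        /* a real client would sleep with backoff here */
    }
}

int main(void)
{
    nfs_rpc_retry();
    printf("request serviced\n");
    return 0;
}
```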
Contrast this with a "stateful" protocol, in which both servers and clients must be able to detect and recover from crashes. However, the stateless protocol has significant implications for file system performance. In order to be stateless, the server must commit any modified user data and file system metadata to stable storage before returning results. This implies that file writes cause the affected data blocks, inodes, and indirect blocks to be written from in-memory caches to disk. In addition, housekeeping operations such as file creation, file removal, and modifications to file attributes must all be performed synchronously to disk.

Some controversy surrounds the real source of bottlenecks in NFS performance. Network protocol overheads and server processing are possible culprits. However, it has now become clear

Figure 4.2: Software Layers in the UNIX File System Extended for NFS. The layers are the Application Program, the UNIX System Call Layer, and the Virtual File System Interface, beneath which sit either the NFS client with its network protocol stack or the local UNIX file system with its block device driver. The VFS interface allows requests to be mapped transparently among local file systems and remote file systems.

Figure 4.5: Elements of the Mass Storage System Reference Model. The figure shows the interactions among the elements of the MSS Reference Model: the application (bitfile client) converses with the Name Server and the Bitfile Server; the Bitfile Server issues requests to the Storage Server, which in turn issues physical volume requests to the Physical Volume Repository; Bitfile Movers carry out the bitfile transfers between the mass storage system and the clients. Command flows are shown in light lines, while data flows are shown in heavy lines. The reference model clearly distinguishes among the software functions of name service, mapping of logical files onto physical devices, management of the physical media, and the transfer of files between the storage system and clients. The figure has been adapted from [Miller 88].

cal storage management system, based on software originally developed at the Lawrence Livermore National Laboratory.

5. Case Studies

In this section, we look at a variety of commercial architectures and research prototypes for high performance networks, file servers, and storage servers. Within these systems, we will see a common concern for providing high bandwidth between network interfaces and I/O device controllers.

5.1. UltraNet

5.1.1. General Organization

The UltraNetwork is a hub-based multihop network capable of achieving up to 1 Gbit/second transmission rates. Its most frequent application is as a local area network for interconnecting workstations, storage servers, and supercomputers.

Figure 5.1 depicts a typical UltraNet configuration. The hubs provide the basis of the high speed interconnect, providing special hardware and software for routing incoming network packets to output connections. Hubs are physically connected by serial links, which consist of two unidirectional connections, one for each direction. If optical fiber is chosen for the links, data can be transmitted at rates of up to 250 Mbit/second and over distances of up to 4 km. The gigabit transmission rate is achieved by interleaving transmissions across four serial links. The point-to-point links are terminated by link adapters within the hubs, special hardware that routes the transmissions among input and output serial links. These are described in more detail below.
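As a minimal illustration of carrying a gigabit stream over four 250 Mbit/second links, the sketch below round-robins fixed-size cells of a message across four link queues. The cell size and data structures are invented; the real UltraNet adapters perform this interleaving in hardware.

```c
#include <stdio.h>
#include <string.h>

#define NLINKS    4      /* four 250 Mbit/s serial links in parallel       */
#define CELL_SIZE 64     /* arbitrary interleave unit for the illustration */

/* One outbound queue per serial link; sizes are illustrative only. */
static char link_queue[NLINKS][8 * CELL_SIZE];
static int  link_fill[NLINKS];

/* Round-robin the cells of a message across the four links, so the
 * aggregate rate approaches 4 x 250 Mbit/s = 1 Gbit/s.               */
static void send_interleaved(const char *msg, int len)
{
    int cell = 0;
    for (int off = 0; off < len; off += CELL_SIZE, cell++) {
        int link  = cell % NLINKS;
        int chunk = (len - off < CELL_SIZE) ? (len - off) : CELL_SIZE;
        memcpy(&link_queue[link][link_fill[link]], msg + off, chunk);
        link_fill[link] += chunk;
    }
}

int main(void)
{
    char msg[4 * CELL_SIZE];
    memset(msg, 'u', sizeof msg);
    send_interleaved(msg, sizeof msg);
    for (int i = 0; i < NLINKS; i++)
        printf("link %d carries %d bytes\n", i, link_fill[i]);
    return 0;
}
```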
Computers are connected to the network in two different ways: through host adapters and hub-resident adapters. A host-based adapter is similar to the network controller described in Fig-

If the request cannot be satisfied from the file cache, the file system will issue a request to the disk controller. The retrieved data is then staged by the disk controller from the I/O device to the primary memory along the backplane bus. Usually it must be copied (at least) one more time, into templates for the response network messages. The software path returns through the file system, protocol processing, and network drivers. The network response messages are transmitted from the memory out through the network interface.

There are two key problems with this architecture. First, there is the long instruction path associated with processing a network-based I/O request. Second, as we have already seen, the memory system and the backplane bus form a serious performance bottleneck. Data must flow from disk to memory to network, passing through the memory and along the backplane several times. In general, the architecture has not been specialized for fast processing between the network and disk interfaces. We will examine some approaches that address this limitation in Section 5.

4.3. Mass Storage System Reference Model

Supercomputer users have long had to deal with the problem that high performance machines do not come with scalable I/O systems. As a result, each of the major supercomputer centers has been forced to develop its own mass storage system, a network-based storage organization in which files are staged from the back-end storage server, usually from a near-line subsystem, to the front-end supercomputer.

The Mass Storage System (MSS) Reference Model was developed by the managers of these supercomputer centers, to promote more interoperability among mass storage systems and to influence vendors to build such systems to a "standard" [Miller 88]. The purpose of the reference model is to provide a framework within which standard interfaces can be defined. It begins with the underlying premise that the storage system will be distributed over heterogeneous machines potentially running different operating systems. The model firmly endorses the client/server model of computation.

The MSS Reference Model defines six elements of the mass storage system: Name Server, Bitfile Client, Bitfile Server, Storage Server, Physical Volume Repository, and Bitfile Mover (see Figure 4.5). Bitfiles are the model's terminology for uninterpreted bit data streams. There are different ways to assign these elements to underlying hardware. For example, the Name Server and Bitfile Server may run on a single mass storage control processor, or they may run on independent communicating machines.

An application's request for I/O service begins with a conversation with the Name Server. The name service maps a user-readable file name into an internally recognized and unique bitfile ID. The client's requests for data are then sent to the Bitfile Server, identifying the desired files through their IDs. The Bitfile Server maps these into requests to the Storage Server, handling the logical aspects of file storage and retrieval, such as directories and descriptor tables. The Storage Server handles the physical aspects of file storage, and manages the physical data volumes. It may request the Physical Volume Repository to mount volumes if they are currently off-line.
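The conversation just described can be sketched as a few cooperating calls. Everything below (types, the ID scheme, function names, the sample path) is invented for illustration and is not part of the reference model's actual interface definitions; the stubs only print what each element would do.

```c
#include <stdio.h>
#include <string.h>

typedef unsigned long bitfile_id;  /* unique ID handed out by the Name Server */

/* Name Server: map a user-readable name to a bitfile ID (stubbed here).   */
static bitfile_id name_server_lookup(const char *path)
{
    return (bitfile_id)strlen(path);           /* placeholder mapping       */
}

/* Physical Volume Repository: mounts volumes that are currently off-line. */
static void physical_volume_repository_mount(int volume)
{
    printf("PVR: mounting off-line volume %d\n", volume);
}

/* Storage Server: physical placement and media management.                */
static void storage_server_read(bitfile_id id)
{
    int volume = (int)(id % 4);                /* pretend placement decision */
    physical_volume_repository_mount(volume);  /* only if currently off-line */
    printf("Storage Server: reading bitfile %lu from volume %d\n", id, volume);
}

/* Bitfile Server: logical aspects (directories, descriptors) -> storage.  */
static void bitfile_server_read(bitfile_id id)
{
    storage_server_read(id);
}

/* Bitfile Client: the application-side view of the whole conversation.    */
int main(void)
{
    bitfile_id id = name_server_lookup("/climate/run42/output");
    bitfile_server_read(id);
    /* In the full model a Bitfile Mover would now stream the data back to
     * the client over the network.                                         */
    return 0;
}
```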
Storage servers may be specialized for the kinds of volumes they need to manage. For example, one storage server may be specialized for tape handling while another manages disk. The Bitfile Mover is responsible for moving data between the Storage Server and the client, usually over a network. It provides the components and protocols for high-speed data transfer.

The MSS Reference Model has been incorporated into at least one commercial product: the Unitree File Management System sold by General Atomics, Inc. This is a UNIX-based hierarchi-

Figure 5.2: Internal Organization of the UltraNet Hub. The serial links, one for each direction, connect to the link multiplexers; each link mux can handle up to four serial link pairs. The link adapter, built from personality modules surrounding a protocol processor, interfaces the link multiplexers to the wide, fast UltraBus. The link adapter has enough intelligence to do its own routing of network traffic among the link multiplexers it manages; with a well-configured hub, little traffic should need access to the UltraBus.

TE rapidly moves data through the protocol processor.

Figure 5.3 depicts the protocol architecture supported by the UltraNet. The combination of the UltraNet firmware and software implements the industry-standard TCP/IP protocols on top of the UltraNet, as well as UltraNet-specific protocols. The lower levels of the network protocol, namely the transport, network, data link, and physical link layers, are implemented with the assistance of the UltraNet protocol processor and host- or hub-resident adapter hardware.

Figure 5.3: UltraNet Protocol Architecture. At user level, applications such as FTP reach the network through the standard UNIX socket interface and a compatibility library; at kernel level, the host-resident protocols and drivers (NFS, sockets, UDP, TCP, and the data link drivers) sit above either an Ethernet controller or the UltraNet driver. The UltraNet's transport, network, and data link layers are implemented with the assistance of the hardware-assisted protocol engine over the physical link (UltraBus or serial link). It is possible to use standard TCP/IP protocols on top of the UltraNet or UltraNet-specific protocols.

Figure 5.1: UltraNet Configuration. The network interconnection topology is formed by hubs connected by optical serial links, linking workstations and supercomputers. The maximum link speed is 250 Mbit/s; higher transmission bandwidth is obtained by interleaving across multiple links. Host-based adapters plug into computer backplanes, while adapters for channels such as HiPPI reside within the hub.

ure 2.2, and resides within the host computer's backplane. This kind of interface is appropriate for machines with industry-standard backplanes, such as workstations and mini-supercomputers. In these kinds of clients, processors and I/O controllers, including the network interface, are treated as equals with respect to memory access. The adapter contains an on-board microprocessor and can perform its own direct memory accesses, just like any other peripheral controller.
A different approach is needed for mainframes and supercomputers, since these classes of machines connect to peripherals through special channel interfaces rather than standard backplanes. I/O devices are not peers, but are treated as slaves by the processor. The hub-resident adapters place the network interface to the UltraNet within the hub itself. These provide a standard channel interface to the computer, such as HiPPI or the IBM Block Multiplexer interface.

5.1.2. UltraNet Hub Organization

The heart of the UltraNet hub is a 64-bit wide (plus 8 parity bits), high bandwidth backplane called the UltraBus. Its maximum bandwidth is 125 MBytes/second. The serial links from other hubs and host-based adapters are interfaced to the UltraBus through link multiplexers, which in turn are controlled by the link adapters. The link adapters route the serial data to the parallel interface of the UltraBus. Physically, it is a bus, but logically, the interconnect is treated more like a local area network. Packets are written to the bus by the source link adapter and are intercepted by the destination link adapter. If the output link is controlled by the same link adapter as the input link, the transfer can be accomplished without access to the UltraBus. Figure 5.2 illustrates the internal organization of the hub.

The link adapter contains a protocol processor and two modules that interface to the link multiplexers on the one hand and the UltraBus on the other. The protocol processor is responsible for handling the network traffic. The datapath that couples the personality modules on either side of the protocol processor consists of two unidirectional 64-bit wide busses with speed-matching FIFOs at the interface boundaries. The busses operate independently and achieve peak transfers of 100 MBytes/second.

The protocol processor consists of three components: the Data Acknowledgment and Command Block Processor (DACP), the Control Processor (CP), and the Transfer Engine (TE). The DACP performs fast processing of protocol headers and request blocks. The CP is responsible for managing the network, such as setting up and deleting connections between network nodes. The

transmissions to storage controllers are handled via messages. The hardware in the CI ports provides special support for block transfers: an ability to copy large sequential blocks of data from the virtual address space of a process on one processor to the virtual address space of another process on another CI node. Block transfers are exploited to move data back and forth between client nodes and the storage controllers.

An interesting aspect of the VAXCluster architecture is its support for a Mass Storage Control Protocol (MSCP), through which clients request storage services from storage controllers attached to the CI. A message-based approach has several advantages in the distributed environment embodied by the cluster concept. First, data sharing is simplified, since storage controllers can extract requests from message queues and service them in any order they choose. Second, the protocols enforce a high degree of device independence, thus making it easier to incorporate new devices into the storage system without a substantial rewrite of existing software.
Finally, the decoupling of a request from its servicing allows the storage controllers to apply sophisticated methods for optimizing I/O performance, including rearranging requests, breaking large requests into fragments that can be processed independently, and so on.

5.2.2. HSC-70 Internal Organization

The internal organization of an HSC is shown in Figure 5.5. The HSC was originally designed in the late 1970's, and has been in service for a decade. Its internal architecture was determined by the technology limits of that time. Nevertheless, there are a number of notable aspects about its organization. An HSC is actually a heterogeneous multiprocessor, with individual processors dedicated to specific functions. The three major subsystems are: (1) the host interface, (2) the I/O control processor, and (3) the I/O device controllers. They communicate via shared control and data memories, accessed via a control bus and a data bus respectively.

The Host Interface, called a K.CI, is responsible for managing the transfer of messages over the CI Bus. The hardware is based on an AMD bit-slice processor. The device controllers, called K.SDIs for disk interfaces and K.STIs for tape interfaces, use the same bit-slice processor. They implement device-specific read and write operations, as well as format, status, and seek operations

Figure 5.5: HSC Internal Architecture. The Host Interface to the CI Bus is managed by a dedicated bit-slice processor called the K.CI. Devices are attached to K.SDI (disk) and K.STI (tape) device controllers, up to eight per HSC and up to four devices per controller. High level control is performed by the P.IO, a PDP-11 microprocessor programmed to coordinate the activities of the device controllers and the host interface. The subsystems share a control memory on a 6.6 MB/second control bus and a data memory on a 13.3 MB/second memory bus.

5.2. Digital Equipment Corporation's VAXCluster and HSC-70

5.2.1. VAXCluster Concept

Digital Equipment Corporation's VAXCluster concept represents one approach for providing networked storage service to client computers [Kronenberg 86, Kronenberg 87]. The VAXCluster is a collection of hardware and software services that closely couple together VAX computers and Hierarchical Storage Controllers (HSCs). A VAXCluster lies somewhere between a "long distance" peripheral bus and a communications network: a high speed physical link couples together the processors, but message-oriented protocols are used to request and receive services. The VAXCluster concept is characterized by (1) a complete communications architecture, (2) a message-oriented computer interconnect, (3) hardware support for the connection to the interconnect, and (4) message-oriented storage controllers.

The hardware organization of a VAXCluster is shown in Figure 5.4. Its elements include VAX processors, HSC storage controllers, and the Computer Interconnect (CI). The latter is a high speed interconnect (dual path connections, 70 Mbits/second each), similar in operation to an Ethernet, although the detailed methods for media access are somewhat different. Physically, the CI is organized as a star network, but it appears to processors as though it were a simple broadcast bus like the Ethernet. Up to sixteen nodes can be interconnected by a single star coupler, with each link being no more than 45 meters in length.
A processor is connected to the CI via a CI port, a collection of hardware and software that provides the physical connection to the CI on one side and a high level queue-based interface to client software on the other side.

The communications protocols layered onto the CI and CI ports support three methods of transmission: datagrams, messages, and blocks. Datagrams are short transmissions meant to be used for status and information requests, and are not guaranteed to be delivered. Messages are similar to datagrams except that delivery is guaranteed. Read/write requests and other device control

Figure 5.4: VAXCluster Block Diagram. A VAXCluster consists of client processors (VAX), server storage controllers (HSC) with their shared disks, a high speed interconnect (CI), adapters (CI ports), and coupling hardware (Star Coupler). A message-oriented protocol is layered onto the interconnect hardware to implement client/server access to storage services.

into physical disk operations, such as the disk seek command and a sequence of sector transfer requests. The original MSCP command message is modified to become a response message. The last phase is to place a pointer to the disk request data structure on a K.SDI work queue, to be found in the control memory.

The next thing that happens is the disk portion of the transfer. The K.SDI firmware reads the request on its work queue, extracts the seek command, and issues it to the appropriate drive. When the drive is ready to transfer, it indicates its status to the K.SDI. At this point the disk controller allocates buffers in the data memory, and stages the data as it comes in from disk into these buffers. When the list of sector transfers is complete, a completion message is placed on the work queue for the K.CI.

We are now ready to transfer the data from the controller's data memory to the host. The K.CI software wakes up when new work appears in its queue. It then generates the necessary CI message packets to transfer data from the data memory out over the CI to the originally requesting host processor. As a data buffer is emptied, it is returned to a list of free buffers maintained in the control memory. When the last buffer of the transmission has been sent, the K.CI transmits the MSCP completion message that was built by the I/O policy processor from the original request message.

The steps outlined above have assumed that processing continues without error. There are a number of error recovery routines that may be invoked at various points in the process described above. For example, if the transfer request within a K.SDI fails, the software is structured to route the request to error handling software to make the decision whether to retry or abort the request.

5.3. Seagate ARRAYMASTER (9058 Controller)

5.3.1. General Organization

The Seagate ARRAYMASTER is an example of an I/O controller design targeted for high bandwidth environments. To this end, it supports multiple (4) IPI-2 interfaces to disks, which can burst at 10 MBytes/second each, and multiple IPI-3 interfaces to the host, each of which can burst transfer at 25 MBytes/second. (The IPI interface is similar in concept to the SCSI protocols described in Section 2.3, but provides higher performance, though it is more expensive to implement.)
A single controller can handle up to 32 disk drives, organized into eight stripe units (called drive clusters by Seagate) of four disks each.

The controller supports three alternative disk organizations: a high transfer rate/high availability mode with duplexed controllers, a simplex version of this organization, and a high transaction rate/high capacity organization. These are summarized in Figure 5.6. The first organization, for high transfer rate and very high system availability, is distinguished by duplexed controllers, dual-ported and spindle-synchronized disk drives, and a 3 + 1 RAID Level 3 parity scheme. Enhanced system availability is achieved by the duplexed controllers and dual-ported drives: if a single controller fails, a path still exists from the host to a device through a functional controller. Seagate claims 99.999% availability, with a Mean Time To Data Loss (MTTDL) that exceeds 1 million hours with this configuration. This is probably a conservative estimate. A lost drive can be reconstructed within four minutes, assuming 1 GByte Seagate Sabre disk drives. The organization can sustain 36 MBytes/second, assuming a sustained transfer bandwidth of 6 MBytes/second per drive.

The second organization is characterized by single-ported drives, organized into the RAID Level 3 scheme, and a single controller. System availability is not as good as in the previous organization: a failure within the controller renders the disk subsystem unavailable. However, media

(for disk) over Digital's proprietary device interfaces. Up to four devices can be controlled by a single K.X device controller, and up to eight K.X controllers can be attached to an HSC, for a total of twenty-four devices. All policy decisions are handled by the I/O control processor, which is based on a microprocessor implementation of the PDP-11 [Lary 89].

The shared memory subsystem of the HSC plays a critical role in its ability to sustain I/O traffic. Private memories deliver instructions and data to the various processors, keeping them off the shared memory busses. Data structures used for interprocessor communications are located in the control memory. The control memory and bus support interlocked operation, making it possible to implement an atomic two-cycle read-modify-write. Data moving between I/O devices and the Computer Interconnect must be staged through the data memory. The sizes of both memories are rather modest by today's standards: 256 KBytes each.

The performance bottlenecks within the HSC come from two primary sources: bus contention and processor contention [Bates 89]. We examine bus contention first. Internal bus contention affects the maximum data rate that the controller can support. The controller's transfer bandwidth (MBytes/second) is limited by its memory architecture and the implementation of the CI interface, both on the controller and on the processor with which it communicates. Because data must traverse the memory bus twice, the effective internal bandwidth to I/O devices is limited to 6.6 MBytes per second. For example, on a device read, data must be staged from the device controller to the data memory over the memory bus, and then transferred once again over the bus to the CI interface. The HSC's software includes mechanisms for accounting for the amount of internal bandwidth that has been allocated to outstanding I/O requests. It will throttle I/O activity by delaying some requests if it detects saturation.
While this may appear to be a limitation, a more serious restriction is imposed by the HSC's CI interface itself. In general, it is designed to sustain on the order of 2 MBytes/second. For some low-end members of the VAX family, even this may exceed the bandwidth of the host's CI interface. To avoid overrunning a host, and thus to limit the CI bandwidth wasted on retransmissions, the CI interface will transmit only a single buffer at a time to a given client, even when more buffers are waiting to be sent to it.

Next we turn to processor contention. This is due to some extent to the design of the SDI disk interfaces. Each disk has a dedicated control bus, but a single data bus is shared among the devices attached to a controller. Thus, high data bandwidth can be sustained by spreading disks among as many controllers as possible. For example, two disks on a single K.SDI will transfer fewer bytes per second than a configuration with one disk on each of two disk controllers.

5.2.3. Typical I/O Operation Sequencing

To understand the flow of data and processing through the HSC, we shall examine the processing steps of a typical disk read operation. The steps that we outline next are described at a high level. Considerably more detail, including the detailed data structures used, can be found in [Lary 89].

The first step is the arrival of the MSCP command over the CI bus. The message is placed in a K.CI reception buffer, where it is checked for well-formedness and validity. If it passes these checks, it is copied to a special data structure in the control memory, and pointers to this data structure are placed on a queue of work for the I/O policy processor.

The next step involves the execution of MSCP server software on the policy processor. The software is structured as a process that wakes up whenever there are pending requests in the work queue. The software examines the queue of commands, choosing the next one to execute based on the currently executing commands. It constructs a data structure that maps the MSCP command

Figure 5.7: Internal Organization of the ArrayMaster 9058 Array Controller. The ArrayMaster's internal structure consists of the IPI-2 disk interfaces to the drive clusters, the IPI-3 host interfaces, parity calculation logic, and a "traffic cop" microprocessor to determine the I/O strategy. I/O processors associated with each of the interfaces handle the low level details of the interface protocols. Data movement is controlled by direct memory access engines associated with the disk interfaces.

5.3.2. Controller Internal Organization

Figure 5.7 shows the internal organization of the ArrayMaster controller. An I/O request can be traced as follows. The host issues the appropriate command to one of the IPI-3 interfaces. This is staged to a command buffer within the controller. The central control microprocessor examines the command and determines how to implement it in detail. Suppose that the command is a data write and that the array is organized into a RAID Level 3 scheme. The control processor maps this logical write request into a stream of physical writes to the disks within a drive cluster.
As the data streams across the host interface, it passes through the parity calculation datapath, where the horizontal parity is computed. DMA controllers move data and parity from this datapath to buffers associated with individual disk interfaces. I/O processors local to the IPI-2 disk interfaces manage the details of staging data from the buffers to particular disk drives. Read operations are performed in much the same manner, but in reverse.

Note that reconstruction operations can be performed without host intervention. Assume that the failed disk has been replaced by a new one. Under the control of the central microprocessor, data is read from the surviving members of the drive cluster. The data is streamed through the parity calculation datapath, with the result being directed to the disk interface associated with the failed disk. The reconstituted data is then written to its replacement.

5.4. Maximum Strategies HiPPI-2 Array Controller

Maximum Strategies offers a family of storage products oriented towards scientific visualization and data storage applications for high performance computing environments. The products offer a tradeoff between performance and capacity, spanning from high MBytes/second but low MBytes (based on parallel transfer disks) to high performance/high capacity (based on arrays of disk arrays). In the following discussion, we concentrate on their HiPPI-based storage server.

Figure 5.8 shows the basic configuration of the Strategy-2 Array Controller. It supports one

Figure 5.6: Alternative Disk Organizations for the ArrayMaster 9058. The ArrayMaster can be configured in three alternative organizations: a duplexed-controller parity array with dual-ported drives for extremely high system availability and bandwidth, a simplex-controller parity array for high media availability, and a non-redundant simplex organization for maximum I/O rate and capacity.

availability is just as good because of the parity encoding scheme. This organization can sustain 18 MBytes/second and also claims a 1 million hour MTTDL.

The last organization represents a tradeoff between performance, availability, and capacity. It gains capacity by dispensing with the parity drives, supporting a maximum of 32 GBytes versus 24 GBytes in the other two organizations, assuming 1 GByte drives. However, there is also no protection against data loss in the case of a disk crash. Data is no longer interleaved, thus sacrificing data bandwidth for a higher I/O rate. In the previous organizations, up to 8 I/Os can be in progress at the same time, one for each drive cluster. In this organization, 32 I/Os can simultaneously be in progress. The controller supports 500 random I/Os per second, approximately 16 I/Os per second per disk drive (this represents a disk utilization of 50%).

The controller's designers have placed considerable emphasis on providing support for very high data integrity within the controller and disk system.
All internal data paths are protected by parity; data is written to disk with an enhanced ECC coding scheme (a 96-bit Reed-Solomon code that can correct up to 17-bit errors and even some 32-bit errors); and a large number of retries are attempted in the event of an I/O failure (three attempts at normal offset, all with ECC; three attempts at late and early data strobes with nominal carriage offset, all with ECC; and three attempts at plus/minus carriage offset with nominal data strobes, all with ECC).

Figure 5.9: High Capacity Strategy Array. High capacity is achieved by using large numbers of commodity disk drives. These are coupled to the HiPPI frontends through a high bandwidth (250 MByte/s) data bus and a VME-based control bus.

array. Up to 10 subarrays can be controlled by a single HiPPI controller, yielding a system configured from a total of 370 disk drives, a 345 GByte data capacity, and a 144 MByte/second transfer rate.

5.5. AUSPEX NS5000 File Server

5.5.1. General Overview

AUSPEX has developed a special hardware and software architecture specifically for providing very high performance NFS file service. The system provides a file system function integrated with an ability to bridge multiple local area networks. They claim to have achieved a performance level of 1000 NFS 8 KByte read I/O operations per second, compared with approximately 100 to 400 I/O operations per second for more conventional server architectures [Nelson 90].

They call their approach functional multiprocessing. Rather than building a server around a single processor that must simultaneously run the UNIX operating system and manage the network and disk interfaces, their architecture incorporates dedicated processors to separately manage these functions. By running specialized software within the network, file, and storage processors, much of the normal overhead associated with the operating system can be eliminated.

A functional block diagram of the NS5000 appears in Figure 5.10. The system backbone is an enhanced VME bus that has been tuned to achieve a high aggregate bandwidth (55 MBytes/second). A conventional UNIX host processor (a SUN-3 or SUN SPARCstation board), the various special purpose processors, and up to 96 MBytes of semiconductor memory (the primary memory) can be installed into the backplane. We examine each of the special processors in the next subsection.

5.5.2. Dedicated Processors

A dedicated network processor board contains the hardware and software needed to manage two independent Ethernet interfaces. Up to four of these can be incorporated into the server to inte-

Figure 5.8: Strategy HiPPI Controller Block Diagram. The Strategy controller couples multiple host HiPPI interfaces to an 8 + 1 + 1 RAID Level 3 disk organization of data disks, a parity disk, and a hot spare.

or two 100 MByte/second host HiPPI interfaces, or a single 200 MByte/second interface. The controller supports a RAID Level 3 organization calculated over eight data disks and one parity disk. Optionally, hot spares can be configured into the array. This allows reconstruction to take place immediately, without needing to wait for a replacement disk. It also helps the system achieve an even higher level of availability. Since reconstruction is fast, the system becomes unavailable only when two disks have crashed within a short period.
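A minimal sketch of the horizontal parity arithmetic behind such an 8 + 1 organization follows. This is generic RAID Level 3 XOR logic, not Seagate's or Maximum Strategies' actual firmware: the parity word is the XOR of the corresponding words on the data disks, and a failed disk's contents can be regenerated from the survivors and the parity in exactly the same way.

```c
#include <stdio.h>
#include <stdint.h>

#define NDATA 8          /* eight data disks                          */
#define WORDS 4          /* tiny "stripe" so the example stays short  */

/* Compute the parity strip as the XOR of the eight data strips (what the
 * parity calculation datapath does on every write).                       */
static void compute_parity(uint32_t data[NDATA][WORDS], uint32_t parity[WORDS])
{
    for (int w = 0; w < WORDS; w++) {
        parity[w] = 0;
        for (int d = 0; d < NDATA; d++)
            parity[w] ^= data[d][w];
    }
}

/* Regenerate one failed data disk by XORing the survivors with the parity. */
static void reconstruct(uint32_t data[NDATA][WORDS],
                        const uint32_t parity[WORDS], int failed)
{
    for (int w = 0; w < WORDS; w++) {
        uint32_t x = parity[w];
        for (int d = 0; d < NDATA; d++)
            if (d != failed)
                x ^= data[d][w];
        data[failed][w] = x;
    }
}

int main(void)
{
    uint32_t data[NDATA][WORDS], parity[WORDS];
    for (int d = 0; d < NDATA; d++)
        for (int w = 0; w < WORDS; w++)
            data[d][w] = (uint32_t)(d * 1000 + w);

    compute_parity(data, parity);
    uint32_t lost = data[3][2];
    data[3][2] = 0;                 /* "crash" part of disk 3             */
    reconstruct(data, parity, 3);
    printf("reconstruction %s\n", data[3][2] == lost ? "correct" : "WRONG");
    return 0;
}
```

Because the parity is a simple XOR, the same datapath serves both the write path and reconstruction, which is why these controllers can rebuild a replacement drive without host intervention.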
The controller can be configured in a number of different ways, representing alternative tradeoffs between performance and capacity. The low capacity/high performance configuration stripes its data across four parallel transfer disks. This yields 3.2 GBytes of capacity and can reach a 60 MByte/second data transfer rate. It provides no special support for high availability, such as RAID parity. A second organization stripes across 8 + 1 parallel transfer disks implementing a RAID Level 3 organization. This organization provides 6.4 GBytes and achieves a 120 MByte/second transfer rate. Both configurations are called the Strategy HiPPI-SM Storage Server.

These organizations provide relatively little capacity for the level of performance provided. In addition, parallel transfer disks are quite expensive per MByte and have a poor reputation for reliability. An alternative configuration uses multiple ranks of 8 + 1 + 1 commodity disk drives. Maximum Strategies' HiPPI-S2 Storage Server is shown in Figure 5.9. Backend controllers (called S2s in Maximum Strategies' terminology) manage strings of eight disks each. Maximum Strategies makes use of older technology ESDI drives (5.25" formfactor, 1.2 GByte capacity each), which can share a common control path but require dedicated datapaths. A maximum configuration can support ten of these: eight data strings, one parity string, and an optional hot spare string. The backends are connected to the frontend HiPPI interfaces through a 250 MByte/second data backplane and a conventional VME backplane used for control. Note the separation of control and datapath: the high bandwidth data transfer path is over HiPPI, while the control path uses a lower latency (and lower bandwidth) VME interconnect. Parity calculations are handled in the frontend. This organization can provide a 300 GByte capacity and a 144 MByte/second transfer rate.

Maximum Strategies also provides a storage server based on a VME host interface. The S2R Storage Server supports up to 40 5.25" ESDI drives, organized into an 8 + 1 + 1 RAID Level 3 scheme that is four stripe units deep. This organization yields 38.4 GBytes of capacity and an 18 MByte/second transfer rate.

The highest capacity/highest performance system combines the S2R-based arrays with the HiPPI-attached controller of Figure 5.8. The result is an "array of disk arrays." The architecture calls for replacing the S2 controllers with S2R disk controllers. Each S2R array contains 37 disks, organized into four stripe units of 8 data disks and 1 parity disk, plus one spare for the entire sub-

Figure 5.11: Auspex NS5000 Software Architecture. The main data flow is represented by the heavy black line, with data being transmitted from the disks (Storage Processor) to the primary memory and on to the network interface (Ethernet Processor). The primary control flow is shown by a heavy gray line: file system requests are passed between LFS (local file system) client software on the Ethernet Processor and server software on the File Processor, and these are mapped onto detailed requests to the Storage Processor by the File System Server. Limited control interactions involve the Virtual File System interfaces on the Host Processor, and are denoted by dashed lines.

client.
Note the minimal intervention from the host processor and software.

5.6. Berkeley RAID-II Disk Array File Server

Our research group at the University of California, Berkeley is implementing a high performance I/O controller architecture that connects a disk array to an UltraNet network via a HiPPI channel. We call it RAID-II to distinguish it from our first prototype, RAID-I, which was constructed from off-the-shelf controllers [Chervenak 90]. Given the observations about the critical performance bottlenecks in file server architectures made throughout this paper, our controller has been specifically designed to provide considerable bandwidth between the network, disk, and memory interfaces.

A block diagram for the controller is shown in Figure 5.12. The controller makes use of a two-board set from Thinking Machines Corporation (TMC) to provide the HiPPI channel interface to the UltraNet interfaces. The disk interfaces are provided by a VME-based multiple SCSI string board from Array Technologies Corp. (ATC). The major new element of the controller, designed by our group, is the X-Bus board, a crossbar that connects the HiPPI boards, multiple VME busses, and an interleaved, multiported semiconductor memory. The X-Bus board provides the high bandwidth datapath between the network and the disks. The datapath is controlled by an external file server through a memory-mapped control register interface.

The X-Bus board is organized as follows. The board implements an 8 by 8 32-bit wide crossbar bus. All crossbar transfers involve the on-board memory as either the source or the destination of the transfer. The ports are designed to burst transfer at 50 MBytes/second and to sustain transfers of 40 MBytes/second. The crossbar is designed to provide an aggregate bandwidth of 320 MBytes/second. The controller memory is allocated eight of the crossbar ports. Data is interleaved across the eight banks in 32-word interleave units. Although the crossbar is designed to move large blocks from memory to or from the network and disk interfaces, it is still possible to access a single word when necessary. For example, the external file server can access the on-board memory through

Figure 5.10: NS5000 Block Diagram. The server incorporates four different kinds of processors, dedicated to network, file, storage, and general purpose processing, coupled to the host memory over an enhanced VME backplane. The server can integrate up to eight independent Ethernets through the incorporation of multiple network processors. The storage processor supports ten parallel SCSI channels, making it possible to attach up to twenty disks to the server.

grate a reasonably large number of independent networks. The board executes all of the necessary protocol processing to implement the NFS standard. Because the network boards implement their own packet routing functions, it is possible to pass packets from one network to another without intervention by the host. Some cached network packet headers are buffered in the primary memory.

The file processor board runs dedicated file system software factored out of the standard UNIX operating system. The board incorporates a large cache memory, partitioned between user data and file system meta-data, such as directories and inodes.
This makes it possible for the file system code to access critical file system information without going to disk.

The storage processor manages ten SCSI channels. Disks are organized into four racks of five 5.25" disks each (20 disks per server). It is also possible to organize these into a RAID-style disk array, although the currently released software does not support the RAID organization at this time. Most of the primary memory is used as a very large disk cache. Because of the way the system is organized, most of the memory system and backplane bandwidth is dedicated to supporting data transfers between the network and disk interfaces.

The host processor is either a standard SUN-3 68020-based processor board or a SPARCstation host processor board. These run the standard Sun Microsystems UNIX, as well as the utilities and diagnostics associated with the rest of the system.

5.5.3. Software Organization

A significant portion of Auspex's improved performance comes from the way in which the network and file processing software are layered onto the multiprocessor organization described above. The basic software architecture, its mapping onto the processors, and their interactions are shown in Figure 5.11.

Consider an NFS read operation. Initially, it arrives at an Ethernet processor, where the network details are handled. The actual data read request is forwarded to a file processor, where it is transformed into physical read requests, assuming that the request cannot be satisfied by cached data. The read request is passed to the storage processor, which turns it into the detailed operations to be executed by the disk drives. Retrieved data is transferred from the storage processor to primary memory, from which the Ethernet processor can construct data packets to be sent to the

file server CPU must do most of the conventional file system processing. Since it is executing file server code, the file server needs access only to the file system meta-data, not the user data. This makes it possible to locate the file server cache within the X-Bus board, close to the network and disk interfaces.

Since a single X-Bus board is limited to 40 MBytes/second, we are examining system organizations that interleave data transfers across multiple X-Bus boards (as well as multiple file servers, each with its own HiPPI interface). Multiple X-Bus boards can share a common HiPPI interface through the IOP Bus. Two X-Bus boards should be able to sustain 80 MBytes/second, more fully utilizing the available bandwidth of the HiPPI interface.

The controller architecture described in this subsection should perform well for large data transfers that require high bandwidth, but it will not do as well for small transfers, where latency dominates performance more than transfer bandwidth does. Thus we are investigating organizations in which the file server remains attached to a more conventional network, such as FDDI. Requests for small files will be serviced over the lowest latency network available to the server; only very large files will be transferred through the X-Bus board and the UltraNet.

6. Summary and Research Directions

In this paper, we have made the case for generalizing the workstation-server storage architecture to the mainframe and high performance computing environment. The concept of network-based storage is very compelling. It has been said that the difference between a workstation and a mainframe is the I/O system.
6. Summary and Research Directions

In this paper, we have made the case for generalizing the workstation-server storage architecture to the mainframe and high performance computing environment. The concept of network-based storage is very compelling. It has been said that the difference between a workstation and a mainframe is the I/O system. The distinction will become blurred in the new system architectures made possible by high bandwidth, low latency networks coupled with the correct use of caching and buffering throughout the path from service requestor to service provider.

Nevertheless, many research challenges remain before this vision of ubiquitous network-based storage can be achieved. First, new methods are needed to effectively manage the complete and complex storage hierarchy described in this paper. How should data be staged from tertiary to secondary storage? What are effective prefetching strategies? How is data to be extracted from such large storage systems?

Second, it is time to apply a system-level perspective to storage system design. Throughout the I/O path, from host to embedded disk controller, we find buffer memories and processing capabilities. The current partitioning of functions may not be correct for future high performance systems. For example, some searching and filtering capabilities could be migrated from applications into the devices. The memory in the I/O path could be better organized as caches rather than as speed-matching buffers, given enough local intelligence about I/O patterns. A better approach to error handling is also possible given a system perspective. For example, in response to a device read error, a disk array controller could choose between retrying the read or exploiting horizontal parity techniques to reconstitute the data on the fly.

Third, new architectures are needed to break the bottlenecks, both hardware and software, between the network, memory, and I/O interfaces. The RAID-II controller tackles this at the hardware level, by providing a high bandwidth interconnection among these components. At the software level, new methods need to be developed to reduce the amount of copying and memory remapping currently required for controlling these interfaces.

Fourth, today's high bandwidth networks, such as FDDI and UltraNet, exhibit latencies that are somewhat worse than that of a conventional Ethernet. Unfortunately, latency becomes a dominating factor as data transfer times scale down in higher bandwidth networks. New methods need to be developed to reduce this latency.

Figure 5.12: RAID-II Organization. A high bandwidth crossbar interconnection ties the network interface (HiPPI) to the disk controllers (Array Tech) via a multiported memory system. Hardware to perform the parity calculation is associated with the memory system.

Two of the remaining eight crossbar ports are dedicated as interfaces to the Thinking Machines I/O processor bus; the TMC HiPPI board set also interfaces to this bus. Since these X-Bus ports are dedicated by their direction, the controller is limited to a sustained transfer rate to the network of 40 MByte/second. Four more ports are used to couple to single-board multi-string disk controllers via the industry standard VME bus, one disk controller per VME bus. Because of the physical packaging of the array, 15 disks can be attached to each of these, in three stripe units of five disks each. Thus, 60 disk drives can be connected to each X-Bus board, and a two X-Bus board configuration consists of 120 disk drives.
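To illustrate one way such a configuration could be addressed, the sketch below maps a logical block onto the four VME disk controllers, each with three stripe units of five disks, attached to one X-Bus board. The round-robin placement and all of the names are illustrative assumptions rather than a description of RAID-II's actual data layout.

/* Illustrative sketch only: one plausible mapping of a logical block onto
 * 4 controllers x 3 stripe units x 5 disks behind a single X-Bus board. */
#define CONTROLLERS_PER_XBUS   4   /* one disk controller per VME port  */
#define STRIPE_UNITS_PER_CTRL  3   /* three groups of five disks each   */
#define DISKS_PER_STRIPE_UNIT  5

struct disk_addr {
    int controller;    /* which VME disk controller (0..3)         */
    int stripe_unit;   /* which stripe unit on that controller     */
    int disk;          /* which disk within the stripe unit (0..4) */
    long block;        /* block offset within that disk            */
};

struct disk_addr map_block(long logical_block)
{
    struct disk_addr a;
    long stripe = logical_block / DISKS_PER_STRIPE_UNIT;  /* full stripes so far */

    a.disk        = (int)(logical_block % DISKS_PER_STRIPE_UNIT);
    a.controller  = (int)(stripe % CONTROLLERS_PER_XBUS);
    a.stripe_unit = (int)((stripe / CONTROLLERS_PER_XBUS) % STRIPE_UNITS_PER_CTRL);
    a.block       = stripe / (CONTROLLERS_PER_XBUS * STRIPE_UNITS_PER_CTRL);
    return a;
}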
Of the remaining two ports, one is dedicated to special hardware that computes the horizontal parity for the disk array. The last port links the X-Bus board to the external file server. It provides access to the on-board memory as well as to the board's control registers (through the board's control bus). This makes it possible for file server software, running off the controller board, to access network headers and file meta-data in the controller cache.

It may seem strange that there is no processor within the X-Bus board. Actually, the configuration of Figure 5.12 contains no fewer than seven microprocessors: one in each of the HiPPI interface boards, one in each of the ATC boards, and one in the file server (we are also investigating multiprocessor file server organizations). The processors within the HiPPI boards are being used to handle some of the network processing normally performed within the server. The processors within the disk interfaces handle the low-level details of managing the SCSI interfaces.
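The horizontal parity computed by the dedicated hardware described above is what allows a block lost to a read error to be reconstituted on the fly by exclusive-ORing the surviving blocks of its parity group, as suggested in the error-handling discussion of Section 6. A minimal sketch follows, assuming an 8 KByte block and a five-wide parity group (both assumptions made for illustration).

/* Illustrative sketch only: rebuilding a lost block with horizontal (XOR)
 * parity, the operation the dedicated parity hardware performs in bulk. */
#include <stddef.h>

#define BLOCK_BYTES 8192
#define GROUP_WIDTH 5              /* e.g., four data blocks plus one parity */

/* Rebuild the block at failed_index by XORing the surviving blocks of the
 * parity group.  blocks[] holds GROUP_WIDTH pointers; the failed entry is
 * skipped and the reconstructed data is written to out. */
void reconstruct_block(const unsigned char *blocks[GROUP_WIDTH],
                       int failed_index,
                       unsigned char out[BLOCK_BYTES])
{
    size_t i;
    int b;

    for (i = 0; i < BLOCK_BYTES; i++)
        out[i] = 0;

    for (b = 0; b < GROUP_WIDTH; b++) {
        if (b == failed_index)
            continue;
        for (i = 0; i < BLOCK_BYTES; i++)
            out[i] ^= blocks[b][i];
    }
}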
One strategy for reducing network latency is to increase packet sizes, to better amortize the per-transfer start-up latencies. A second strategy, demonstrated by the Autonet project at Digital Equipment Corporation's Systems Research Center, is to construct a high bandwidth network using point-to-point connections and an active switching network [Schroeder 90].

Finally, the whole issue of distributed and multiprocessor file/storage servers and their role in high performance storage systems must be addressed. The technical issues include how to partition the file server software functions among the processors of a multiprocessor or among a distributed collection of processors. The AUSPEX controller architecture is one approach to the former; the IEEE Mass Storage System Reference Model offers one model for the latter.

Acknowledgments

We appreciate the careful reading of this manuscript and the detailed comments by Peter Chen, Ann Chervenak, Ed Lee, Ethan Miller, Srini Seshan, and Steve Strange. The research that led to this paper was supported by the Defense Advanced Research Projects Agency and the National Aeronautics and Space Administration under contract NAG2-591, "Diskless Supercomputers: High Performance I/O for the TeraOp Technology Base." Additional support was provided by the State of California MICRO Program in conjunction with industrial matching support provided by DEC, Emulex, Exabyte, IBM, NCR, and Storage Technology corporations.

7. References

Anon, Network Operations Manual, UltraNetwork Technologies, Part Number 06 0001-001, Revision A, (1990). Chapter 2: UltraNet Architecture; Chapter 3: UltraNet Hardware.

Anon, Strategy HPPI Disk Array Subsystem Operation Manual, Maximum Strategies, Part Number #HPP100, (1990).

Bates, K. H., "Performance Aspects of the HSC Controller," Digital Technical Journal, V. 8, (February 1989), pp. 25 - 37.

Borrill, P., J. Theus, "An Advanced Communications Protocol for the Proposed IEEE 896 FutureBus," IEEE Micro, (August 1984), pp. 42 - 56.

Cerf, V., "Networks," Scientific American, V. 265, N. 3, (September 1991), pp. 72 - 85.

Chervenak, A., "Performance Measurements of the First RAID Prototype," U. C. Berkeley Computer Science Division Report No. UCB/CSD 90/574, (January 1990).

Chervenak, A., R. H. Katz, "Performance Measurements of a Disk Array Prototype," ACM SIGMETRICS Conference, San Diego, CA, (May 1990).

Clark, D., V. Jacobson, J. Romkey, H. Salwen, "An Analysis of TCP Processing Overhead," IEEE Communications Magazine, (June 1989), pp. 23 - 29.

Exabyte Corporation, "EXB-120 Cartridge Handling Subsystem Product Specification," Part No. 510300-002, 1990.

Feder, B. F., "The Best of Tapes and Disks," N. Y. Times, Sunday Business Section, (September 1, 1991).

Heatly, S., D. Stokesberry, "Analysis of Transport Measurements Over a Local Area Network," IEEE Communications Magazine, (June 1989), pp. 16 - 22.

Hennessy, J., D. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, San Mateo, CA, 1990.

Hewlett-Packard Corporation, "HP Series 6300 Model 20GB/A Rewritable Optical Disk Library System Product Brief," 1989.

Joshi, S. P., "High Performance Networks: Focus on the Fiber Distributed Data Interface Standard," IEEE Micro, (June 1986), pp. 8 - 14.

Kanakia, H., D. Cheriton, "The VMP Network Adaptor Board (NAB): High-Performance Network Communication for Multiprocessors," Proc. ACM SigComm '88 Symposium, (August 1988), pp. 175 -

Kanakia, H., "High Performance Host Interfacing for Packet-Switched Networks," Ph.D. Dissertation, Department of E.E.C.S., Stanford University, 1990.

Katz, R., G. Gibson, D. Patterson, "Disk System Architectures for High Performance Computing," Proceedings of the IEEE, Special Issue on Supercomputing, (December 1989).

Kodak Corporation, "Optical Disk System 6800 Product Description," 1990.

Kronenberg, N. P., H. Levy, W. D. Strecker, "VAXclusters: A Closely Coupled Distributed System," ACM Transactions on Computer Systems, V. 4, N. 2, (May 1986), pp. 130 - 146.

Kronenberg, N. P., H. M. Levy, W. D. Strecker, R. J. Merewood, "The VAXcluster Concept: An Overview of a Distributed System," Digital Technical Journal, V. 5, (September 1987), pp. 7 - 21.

Kryder, M. H., "Data Storage in 2000--Trends in Data Storage Technologies," IEEE Trans. on Magnetics, V. 25, N. 6, (November 1989), pp. 4358 - 4363.

Lary, R. L., R. G. Bean, "The Hierarchical Storage Controller: A Tightly Coupled Multiprocessor as Storage Server," Digital Technical Journal, V. 8, (February 1989), pp. 8 - 24.

Maximum Strategies, "Strategy HPPI: Disk Array Subsystem," Report No. HPP100, (1990).

Massiglia, P., Digital Large System Mass Storage Handbook, Digital Equipment Corporation, 1986.

Miller, S. W., "A Reference Model for Mass Storage Systems," Advances in Computers, V. 27, 1988, pp. 157 - 210.

Moran, J., R. Sandberg, D. Coleman, J. Kepecs, B. Lyon, "Breaking Thru the NFS Performance Bottleneck," EUUG Spring 90, Munich, (April 1990).

Nelson, B., "An Overview of Functional Multiprocessing for NFS Network Servers," AUSPEX Technical Report 1, (July 1990).

Nelson, M., J. K. Ousterhout, B. Welch, "Caching in the Sprite Network File System," ACM Transactions on Computer Systems, V. 6, N. 1, (February 1988), pp. 134 - 154.

Negroponte, N., "Products and Services for Computer Networks," Scientific American, V. 265, N. 3, (September 1991), pp. 106 - 115.

Pollard, A., "New Storage Function for Digital Audio Tape," New York Times, (May 25, 1988), p. C10.

Ranade, S., J. Ng, Systems Integration for Write-Once Optical Storage, Meckler, Westport, CT, 1990.
Rosenblum, M., J. Ousterhout, "The Design and Implementation of a Log-Structured File System," ACM Transactions on Computer Systems, (February 1992), to appear.

Orenstein, E., "HPPI-based Storage System," Computer Technology Review, (April 1990).

Ousterhout, J. K., "Why Aren't Operating Systems Getting Faster as Fast as Hardware?," Proc. USENIX Summer Conference, Anaheim, CA, (June 1990), pp. 247 - 256.

Sandberg, R., D. Goldberg, S. Kleiman, D. Walsh, B. Lyon, "Design and Implementation of the SUN Network Filesystem," Proc. USENIX Summer Conference, (June 1985), pp. 119 - 130.

Schroeder, M., A. D. Birrell, M. Burrows, H. Murray, R. M. Needham, T. L. Rodeheffer, E. H. Satterthwaite, C. P. Thacker, "Autonet: A High-Speed Self-configuring Local Area Network Using Point-to-Point Links," DEC SRC Tech. Rep. #59, (April 1990).

Spencer, K., "The 6-Second Terabyte," Canadian Research Magazine, (June 1988).

Tan, E., B. Vermeulen, "Digital Audio Tape for Data Storage," IEEE Spectrum, V. 26, N. 10, (October 1989), pp. 34 - 38.

Tesler, L., "Networked Computing in the 1990s," Scientific American, V. 265, N. 3, (September 1991), pp. 86 - 93.

Verity, J. W., "Rethinking the Computer," Business Week, (November 26, 1990), pp. 116 - 124.

Watson, R., S. Mamrak, "Gaining Efficiency in Transport Services by Appropriate Design and Implementation Choices," ACM Transactions on Computer Systems, V. 5, N. 2, (May 1987), pp. 97 - 120.

Wood, R., "Magnetic Megabytes," IEEE Spectrum, V. 27, N. 5, (May 1990), pp. 32 - 38.