Enstore Technical Design Document

Joint Projects Document JP0026

$Revision$
$Date$GMT


Jon Bakken
Eileen Berman
Chih-Hao Huang
Alexander Moibenko
Don Petravick
Ron Rechenmacher
Kurt Ruthmansdorfer





1 Enstore Architecture

Enstore provides a generic interface that lets experimenters use mass storage systems as easily and as efficiently as if they were native file systems.


1.0.1 Scope and Overview

Enstore provides distributed access to, and management of, petabytes of data stored on tape. The data can comprise up to billions of files of varying sizes, typically between 1 and 2 gigabytes each. At any time, many of the tapes are accessible through automated tape libraries. The system supports migration of tapes to and from shelves - export and import - where operator intervention is required to move the tapes between the shelves and robotically accessible tape drives. The system treats the robot's shelves as a scarce commodity. Enstore is a system to provide mass storage for large Run II data sets. As such it is not a general purpose mass storage system, but is optimized to allow access to large datasets made up of many files. The system supports random access of files, but also streaming, the sequential access of successive files on tape.

The Enstore system provides for access to the data by user/client applications which are distributed across an IP network. It supports tape drives attached locally to the users' computers, as well as those remotely accessible over the network.

The Enstore system provides resource management of the available tape drives such that, for example, logging of data from the data acquisition systems can be given guaranteed access to tape bandwidth regardless of what other user accesses are being requested. Enstore is designed to be used by Fermilab experiments' data acquisition, data processing and analysis systems. Well defined interfaces will be provided to these data handling systems to allow them to easily use the services provided. The writing and reading of tapes must therefore be reliable and efficient, and the system must be robust enough to support this critical application without compromising data taking. Enstore's goal is to provide a system that can be extended as needed for the experiments' actual data-taking needs, as well as be easily maintainable for the duration of several data-taking runs.

Enstore is based on a client-server model that allows hot swapping of hardware components and dynamic software configuration, is platform independent, runs in heterogeneous environments and is easily extendable. Most of the operations are transparent to the user. System performance is monitored and can be fine-tuned. A great deal of care has been taken to ensure that the system is able to prevent, or to recover from, worst case scenarios. The system has layers around it to customize and address problems as they occur. When possible, these layers are expected to use already existing components (e.g. FTT, pnfs).

The Enstore system is designed to provide for the needed Run II data access throughput requirements within the budget assigned. The system software is layered and accessible to the Run II developers such that needed modifications can be made in a timely manner to meet the needs of commissioning and running of the Run II detectors.

Enstore is designed to support "lights out" operation of the Run II automated tape library systems. To this end, the design is targeted towards requiring operator intervention at no more than 8 hour intervals - for example, import/export requests are queued and need only be handled within the daytime operator shifts. Careful attention is paid to error reporting, handling and recovery in order to require the minimal possible load on the operations and support staff.

To summarize, Enstore provides the following features:

  • Support for several types of serial media accessed through Automated Tape Libraries or locally mounted on the client or host computers.
  • Support for distributed access to data on these tapes.
  • Reliable, efficient and prioritized write access from the experiment data acquisition systems for the logging of raw data.
  • Optimized access to large (petabyte) datasets made up of many (100s of millions of) files of 1-2 GB in size.
  • Efficient and flexible support for "write streaming" of data to tape, where data is physically clustered on tapes according to a simple classification scheme - typically the trigger number associated with the event data written.
  • Management of hardware and software resources, e.g., a limited number of available tape drives to allow prioritized access to the data.
  • Swapping of hardware components - tapes, tape drives, and computers - without bringing down the complete system.
  • Complete error reporting and configurable response to error conditions.
  • The use of already tested components as far as possible.
  • Easy mechanisms for testing of the system.
  • Import and export of tapes between the Automated Tape Library(ies) and shelf storage without bringing down the complete system.
  • Support for distributed clients to access data through standard network protocols, and to a great extent transparently.
  • Sequential access of data in files, for event reconstruction and other large data processing requirements.
  • Random access to files on tapes, to support general event analysis.

    D0, and more specifically SAM, has been very helpful in setting the direction for what is needed from Enstore. We believe a close working collaboration has been developed from which both SAM and Enstore have profited. We appreciate the early, and sometimes tedious and painful, testing the SAM group has done on Enstore.

    We have been working with D0 to try to provide a storage system that fulfills their needs. We have chosen to first present what Enstore provides and then what D0 requires, and to describe how Enstore fulfills it. This ordering could have been reversed - there has been great synergy between the two efforts.

    1.0.2 Assumptions and Constraints

    A constraint is a factor that limits our implementations. Assumptions are factors that, for planning purposes, will be considered to be true, real, or certain.
  • A usable system must be available in time for vertical slice tests in June, 1999.
  • The system must be apropos for Run II use, and have a place in the overall storage strategy for the laboratory.
  • Binaries should be distributed to the users. We should not expect users to have to deploy a large infrastructure on their computers to use Enstore.
  • The DESY name space is usable for Enstore, available to FNAL, and usable for Run II sized applications.
  • Fermilab FTT, a platform-independent SCSI tape package, is suitable for the selected hardware, and will remain useful over the duration of the Run II projects.
  • OCS will be available and usable for operator mounts.
  • OCS will be available and usable as the repository of the tape and drive statistics. Enstore, itself will not store long term statistics.
  • Tape drives may be attached to users' computers via IP, and an adequate IP based network is constructed for Run II.
  • A non-transparent FMSS-like interface is appropriate for the users.
  • Python is an appropriate choice for implementing the project. Loadable C modules will be used where appropriate, for example, to increase performance.

    1.0.3 Components

    Enstore uses a client-server architecture to provide a generic interface for users to efficiently use mass storage systems. Enstore supports multiple distributed media robots, each of which may handle multiple media types or individual directly attached drives, and multiple distributed mover nodes. The system architecture does not dictate an exact hardware architecture. Rather, it specifies a set of general and generic networked hardware and software components. These components are loosely coupled in the sense that each one can be replaced easily without affecting the rest of the system and each class of components can be easily expanded to accommodate the increased demand in performance or capacity.

    The system is written in Python, a scripting language that has advanced object-oriented features. Python provides a sound environment for quick turn-around and a seamless integration/migration path to fully compiled languages, such as C and C++, if there is a demand for even better performance.

    Enstore has four major kinds of software components:

    • namespace, implemented by the pnfs package from DESY
    • encp, a program used to copy files to and from media libraries
    • servers
      • Configuration Server
      • Volume Clerk
      • File Clerk
      • Info Server
      • Multiple, distributed Library Managers
      • Multiple, distributed Movers
      • Media Changer (1 per Library Manager)
      • Log Server
      • Inquisitor
      • Alarm Server
      • Accounting Server
      • Drivestat Server
    • administration tools

    The planned implementation is to have one Media Changer serving a physical library. The Media Changer would accept commands, fork, and forward each request to the physical library. A single Media Changer per physical library provides a central place to:
    • balance the requests between each robot arm
    • limit the number of requests (EMASS has a queue limit)
    • pause arm activity to allow import/eject functions to occur

    These software components, as well as hardware components, are shown schematically in the following system context diagram. Hardware components are connected via IP. Great care has been taken to ensure that the system will function well under extreme load conditions. By design, there is no preset limit on the number of concurrent user computers nor on the number of physical media libraries or drives. The system is only limited by the availability of physical resources. We control all of the source code for the system except for that of pnfs (which is a well supported product from DESY).

    [Figure: Enstore system context diagram (also available in Postscript)]

    Like TCP, the system is architected with distributed and peer-to-peer reliability. Each request originating from the encp program is branded with a unique ID. Encp retries under well-defined circumstances, issuing an equivalent request with a new unique ID. The system can instruct encp to retry if it needs to back out of an operation.
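
    As an illustration of this request discipline, consider the following Python sketch. It is not Enstore code; the message format, timeout, and ID scheme are assumptions made for the example.

        import os
        import socket
        import time

        def send_request(body, server_addr, retries=3, timeout=5.0):
            # Send a UDP request; each attempt is branded with a fresh
            # unique ID, and only a reply carrying that ID is accepted.
            sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
            sock.settimeout(timeout)
            for attempt in range(retries):
                unique_id = "%s.%d.%.6f" % (socket.gethostname(),
                                            os.getpid(), time.time())
                sock.sendto(("%s %s" % (unique_id, body)).encode(), server_addr)
                try:
                    reply, _ = sock.recvfrom(65536)
                except socket.timeout:
                    continue                  # no answer: retry with a new ID
                reply_id, _, payload = reply.decode().partition(" ")
                if reply_id == unique_id:     # stale replies are simply dropped
                    return payload
            raise RuntimeError("no response after %d attempts" % retries)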


    1.1 The DESY pnfs Namespace

    Pnfs is a DESY written and supported package. Detailed information about pnfs can be found on the DESY http://mufasa.desy.de/pnfs/Welcome.html and http://watphrakeo.desy.de/pnfs/ web pages. The pnfs servers cannot be installed without the permission of DESY. However, DESY has given Fermilab permission to use pnfs for Enstore work, and support of the product has been superb.

    The DESY pnfs package implements an nfs-v2 daemon and a mount daemon. These daemons do not actually serve a file system but, instead, make a collection of database entries look like a file system, and provide control information for the system. Each file that is created in pnfs has 8 layers that Enstore uses to store metadata information about the file transfers. Normal UNIX permissions and administered export points are used to prevent unauthorized access to the name space.

    To inspect files, users mount their portion of the pnfs file system on their own computers, and interact with it using the native operating system utilities. For example, users can ls, stat, mv, rm or touch existing "files", but are given errors on attempts to read or write the content of the files. Users can also mkdir and rmdir, and ln files. Hard links should be used to ensure all the metadata information is linked; symbolic links will not give the user what he naively expects.

    There are also some special pnfs files which act as normal UNIX files. Administrators can write data to these files and the users can read from them. These files are the exception rather than the rule. Enstore plans on using them to distribute service information that everyone, who has pnfs mounted, can read.

    Enstore uses pnfs for three different kinds of access and information:

    1. Administration Interactions
      An administrator can create special files, called wormholes, in the pnfs name space. For example, one special file signifies that the system needs to be drained (maybe due to an impending shutdown). Existence of this file causes encp to stall, preventing users from submitting additional jobs and thus draining the Enstore system of transfers. The name of the Enstore-draining wormhole file is "local-pnfs-mountpoint/.(config)(flags)/disabled". Additional wormholes can be created as needed.

    2. Configuration Information
      Some creation details need to be provided before the user can write files to media. Enstore uses pnfs tag files (usually just called tags) for these purposes. Tags are associated with a directory and not any specific file. Examples of configuration information that is specified with tags include the file family name, file family width, and Library Manager. There is also an optional file family wrapper tag, specifying the type of file wrapper for a given file family. If this tag is not set, the default wrapper type 'cpio_custom', the wrapper initially used in the Enstore project, is used.

    3. User File information
      The rest of the system identifies a file by a 64-bit numeric identifier, dubbed a "bit file ID". After a file is written, the File Clerk generates a bfid and encp stores this information in one of the pnfs file layers. Encp then reads this bit file ID and gives it to the Enstore servers when fetching data. Other encp file transfer details, such as the time of last access, the location the file was copied to, or transfer rates, are stored in different metadata layers of the same pnfs file. (A short sketch of how these layers and the wormhole files are accessed follows this list.)
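
    The layer and wormhole conventions above can be exercised directly from Python, since they are just special path names. The sketch below is illustrative only; in particular, which layer holds the bfid is an assumption here, not something specified above.

        import os

        def layer_path(path, layer):
            # pnfs exposes a file's metadata layers through the
            # ".(use)(N)(name)" naming convention.
            d, f = os.path.split(path)
            return os.path.join(d, ".(use)(%d)(%s)" % (layer, f))

        def read_bfid(path, layer=1):
            # Assumed layer number; Enstore stores the bfid in one of
            # the 8 layers, as described above.
            with open(layer_path(path, layer)) as fh:
                return fh.read().strip()

        def enstore_disabled(mountpoint):
            # The draining wormhole: its existence causes encp to stall.
            return os.path.exists(
                os.path.join(mountpoint, ".(config)(flags)", "disabled"))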

    1.1.1 Pnfs Tests

    Pnfs was tested in the prototype with very good results. The code was run on a 200 MHz Pentium Pro Linux machine with SCSI disks. The pnfs code had not been run extensively on Linux machines before and there were a few minor glitches found during initial running. Support from DESY was outstanding and all problems were solved quickly. The Fermilab installation was the first on Linux platforms. DESY supports pnfs on all major platforms. As a test of pnfs' capabilities, the February 1998 set of 250K HPSS filenames were put into pnfs. Approximately 20 pnfs databases were used for this test, with each database corresponding to existing HPSS "experimental" separation. [That is, D0 had its own database, SDSS had its own, and so forth.] Name lookup was done on each database simultaneously from 3 machines (IRIX, Linux, and AIX) for several days. Pnfs performed flawlessly and was able to provide names at a rate of 3-15 names/sec. The database can be further optimized, but this performance is already adequate for Enstore.

    *** These tests will have to be repeated *** under the final hardware configuration, but there is no indication of any problems.

    1.1.2 Pnfs Server Installation

    Pnfs Server installation should be done by an experienced administrator. It is not something that a user should ever have to do. Further, permission to install pnfs must be granted by DESY. DESY has granted Enstore permission to fully use, but not change, its pnfs package at Fermilab. DESY has also granted Enstore the right to distribute a binary + necessary scripts version of pnfs (i.e., no source code) via the normal Fermilab UPD product distribution mechanism. For example, the current (Jan 99) version in UPD is
    $ upd list pnfs
    
    DATABASE=/ftp/upsdb
            Product=pnfs    Version=v3_1_3a-f4      Flavor=Linux+2
                    Qualifiers=""   Chain=current
    
    
    The UPD version can be decoded as follows: "v3_1_3a" is the DESY version of pnfs, and the "-f4" signifies the 4th Fermi "release". None of the DESY code is modified - Fermilab only adds its UPS packaging framework and some local installation instructions. All fixes, changes, or updates to pnfs will always come from DESY. DESY has allowed Fermilab full access to the pnfs source code, and as such, we could, in principle, solve problems if DESY were unable to continue their support.

    It is expected that there will be only a few pnfs servers at Fermilab. To date, pnfs servers have been installed on 3 Linux nodes without any difficulty. Each time, the set of installation instructions has been improved; however, the pnfs server installation is still not completely automatic. The Fermilab installation instructions are distributed along with the UPD product.

    On the node that is serving pnfs, pnfs takes over the normal function of exporting nfs. Otherwise the machine is general purpose. To be explicit, the only 2 processes the pnfs server machine cannot run are rpc.mountd and rpc.nfsd. It runs the pnfs versions of these instead. These processes are only concerned with exporting pnfs. For example, Rip6 is the current (Jan 99) Enstore pnfs server. Here is its /etc/fstab:
    rip6$ cat /etc/fstab
    /dev/sda6               /                       ext2    defaults        1 1
    /dev/sda5               swap                    swap    defaults        0 0
    /dev/sdc1               /rip6a                  ext2    defaults,grpid  2 1
    /dev/fd0                /mnt/floppy             ext2    noauto          0 0
    none                    /proc                   proc    defaults        0 0
    rip8:/fnal              /fnal                   nfs     soft,rsize=8192,wsize=8192      0 0
    rip8:/home              /home                   nfs     soft,rsize=8192,wsize=8192      0 0
    rip8:/usr/local         /usr/local              nfs     soft,rsize=8192,wsize=8192      0 0
    localhost:/fs           /pnfs/fs                nfs     noauto,intr,bg,hard,rw,noac       0 0
    rip6:/grau-ait          /pnfs/grau/ait          nfs     noauto,user,intr,bg,hard,rw,noac 0 0
    rip6:/grau-dlt          /pnfs/grau/dlt          nfs     noauto,user,intr,bg,hard,rw,noac 0 0
    rip6:/grau-mammoth      /pnfs/grau/mammoth      nfs     noauto,user,intr,bg,hard,rw,noac 0 0
    rip6:/stk-red20         /pnfs/stk/red20         nfs     noauto,user,intr,bg,hard,rw,noac 0 0
    rip6:/stk-red50         /pnfs/stk/red50         nfs     noauto,user,intr,bg,hard,rw,noac 0 0
    rip6:/rip6disk1         /pnfs/rip6              nfs     noauto,user,intr,bg,hard,rw,noac 0 0
    

    As you can see, rip6 is nfs mounting 3 disks from rip8 and mounting the pnfs disks it is exporting as well as the local disks. There are also numerous Enstore processes running on rip6, for example:

    USER       PID %CPU %MEM  SIZE   RSS TTY STAT START   TIME COMMAND
    bakken    3280  0.0  7.3 20240  9448  ?  S    00:17   0:14 python /home/bakken/enstore/src/configuration_server.py
    bakken    3334  0.0  4.9 20112  6356  ?  S    00:17   0:01 python /home/bakken/enstore/src/log_server.py
    bakken    3366  0.0  5.7 37980  7352  ?  S    00:17   0:02 python /home/bakken/enstore/src/volume_clerk.py
    bakken    3398  0.0  5.3 37760  6832  ?  S    00:17   0:01 python /home/bakken/enstore/src/file_clerk.py
    bakken    3433  0.0  5.1 36472  6560  ?  S    00:17   0:00 python /home/bakken/enstore/src/media_changer.py
    bakken    3465  0.0  8.8 33456 11300  ?  S    00:17   0:01 python /home/bakken/enstore/src/mover.new.py
    bakken    3515  0.0  0.3  1140   492  ?  S    00:18   0:00 db_checkpoint -h
    bakken    3520  0.0  0.3  1612   420  ?  S    00:18   0:00 db_deadlock -h
    bakken    3523  0.0  5.3 30508  6808  ?  S    00:18   0:01 python /home/bakken/enstore/src/alarm_server.py
    bakken    3673  0.0  8.1 36760 10432  ?  S    00:20   0:34 python /home/bakken/enstore/src/inquisitor.py
    bakken   12178  0.0  0.8  1552  1048  p2 S    19:38   0:00 /bin/login -h willow fnal.gov -p bakken
    

    The main point, often confused, is that the pnfs server node remains a general purpose and usable machine.

    Permission to mount the pnfs namespace is granted using a mechanism similar to the normal Unix nfs export permission scheme. There are DESY commands (the pmount command) that make this entire process very simple.

    Pnfs can be started automatically on boot-up. This allows other nodes to easily mount the namespaces after reboots.

    Finally, it should be noted that a Run II pnfs server will need a SCSI RAID level 5 disk system for its databases. RAID level 5 is needed for redundancy and reliability. This is the system that DESY uses for their pnfs system.

    *** Live Backups of database and recovery procedures *** - to be discussed during March trip to DESY. This has not been a priority yet.

    1.1.3 Pnfs Client Installation

    No pnfs client software is needed -- the pnfs file system (really a namespace) just has to be mounted! Enstore has been written such that it recognizes all of the pnfs namespace if it has a local mount point beginning with /pnfs/... . Enstore uses /pnfs as a convenient key. Typical steps for mounting a new pnfs namespace (in this example the pnfs namespace is called "grau-ait" and it is served from the rip6 node) are:
    1. As root, mkdir -p /pnfs/grau/ait
    2. As root, append to /etc/fstab:
      rip6:/grau-ait /pnfs/grau/ait nfs user,intr,bg,hard,rw,noac 0 0
      The "intr,bg,hard,rw,noac" mount options should not be changed as they are needed for proper operation.
    3. mount /pnfs/grau/ait. This can be done as a normal user if the "user" mount option is specified in the /etc/fstab file.
    Of course, the actual steps depend on the pnfs installation.

    Pnfs filesystems can be mounted in any way that other NFS filesystems can be mounted:

    • Explicit mount commands via root
    • User mount commands if pnfs entries in fstab have the option "user"
    • Automounting of needed filesystems

    Pnfs supports automounting as one would expect. There is a general problem with automounting that the pnfs mountpoints exacerbate: automounting works fine if the mountpoint is only 1 level deep, but if one tries to mount deeper in a mounted tree, the automounter will not work properly. To circumvent this difficulty, one needs to employ a link gambit, provided by Ramon Pasetes of the OSS Department. The solution uses an intermediate link area where the filesystems are mounted at the 1st level, a series of links that make the file system appear to be mounted as deep as it needs to be, and an export map to get the mountpoints to the client machines.

    Here is an example of this solution. The example uses the automounting maps that the OSS department is currently using for the Run II farm nodes, but applied to a test node called airedale.

    Here is the auto.master entry for pnfs:

    /pnfs   /etc/auto.pnfs    -hard,intr,noac
    

    And here is the auto.pnfs map:

    d0sam           pcfarm9:/d0sam
    #
    enstore         pcfarm9:/enstore
    #
    grau            airedale:/Pnfs/grau        ro  [OSS uses node fnpca]
    grau-ait        rip6:/grau-ait
    grau-dlt        rip6:/grau-dlt
    grau-mammoth    rip6:/grau-mammoth
    #
    rip6            rip6:/rip6disk1
    #
    sam             airedale:/Pnfs/sam         ro  [OSS uses node fnpca]
    sam-ait         samson:/sam-ait
    sam-dlt         samson:/sam-dlt
    sam-mammoth     samson:/sam-mammoth
    sam-red20       samson:/sam-red20
    sam-red50       samson:/sam-red50
    samson          samson:/samson
    #
    stk             airedale:/Pnfs/stk         ro  [OSS uses node fnpca]
    stk-red20       rip6:/stk-red20
    stk-red50       rip6:/stk-red50
    

    Here is the /etc/export file: [OSS exports to the required nodes]

    /Pnfs/grau	airedale.fnal.gov
    /Pnfs/sam	airedale.fnal.gov
    /Pnfs/stk	airedale.fnal.gov
    

    And finally, here are the intermediate links:

    airedale# ls -alsFgR /Pnfs 
    /Pnfs:
    total 5
       1 drwxrwxr-x   5 root     root         1024 Feb  8 11:05 ./
       1 drwxr-xr-x  30 root     root         1024 Feb  8 10:38 ../
       1 drwxrwxr-x   2 root     root         1024 Feb  8 10:39 grau/
       1 drwxrwxr-x   2 root     root         1024 Feb  8 11:05 sam/
       1 drwxrwxr-x   2 root     root         1024 Feb  8 11:27 stk/
    
    /Pnfs/grau:
    total 2
       1 drwxrwxr-x   2 root     root         1024 Feb  8 10:39 ./
       1 drwxrwxr-x   5 root     root         1024 Feb  8 11:05 ../
       0 lrwxrwxrwx   1 root     root           14 Feb  8 10:39 ait -> /pnfs/grau-ait/
       0 lrwxrwxrwx   1 root     root           14 Feb  8 10:39 dlt -> /pnfs/grau-dlt/
       0 lrwxrwxrwx   1 root     root           18 Feb  8 10:39 mammoth -> /pnfs/grau-mammoth/
    
    /Pnfs/sam:
    total 2
       1 drwxrwxr-x   2 root     root         1024 Feb  8 11:05 ./
       1 drwxrwxr-x   5 root     root         1024 Feb  8 11:05 ../
       0 lrwxrwxrwx   1 root     root           13 Feb  8 11:04 ait -> /pnfs/sam-ait/
       0 lrwxrwxrwx   1 root     root           13 Feb  8 11:04 dlt -> /pnfs/sam-dlt/
       0 lrwxrwxrwx   1 root     root           17 Feb  8 11:04 mammoth -> /pnfs/sam-mammoth/
       0 lrwxrwxrwx   1 root     root           15 Feb  8 11:05 red20 -> /pnfs/sam-red20/
       0 lrwxrwxrwx   1 root     root           15 Feb  8 11:05 red50 -> /pnfs/sam-red50/
    
    /Pnfs/stk:
    total 2
       1 drwxrwxr-x   2 root     root         1024 Feb  8 11:27 ./
       1 drwxrwxr-x   5 root     root         1024 Feb  8 11:05 ../
       0 lrwxrwxrwx   1 root     root           15 Feb  8 11:26 red20 -> /pnfs/stk-red20/
       0 lrwxrwxrwx   1 root     root           15 Feb  8 11:27 red50 -> /pnfs/stk-red50/
    

    Finally, it should be noted that mounting the pnfs namespace does not restrict the node in any other way - it can import and mount any other file systems and run any tasks as it normally would.

    1.1.4 pcmd: Enstore related pnfs commands

    All non-I/O Unix commands that operate on normal file systems can also (typically) be used on the pnfs namespace. DESY has provided several examples on their web pages and tools in their pnfs package that allow users to view and control the special features of pnfs. Enstore has tailored these tools into a single script, called pcmd, that allows users to control, manipulate and query the pnfs files that Enstore creates. The pcmd tool is distributed along with the encp client. It is almost entirely written in shell; therefore, it is stand-alone and doesn't require any other products, including Python. Pcmd is based on DESY scripts and tailored for Enstore; as such, it is not a general pnfs tool.

    Commonly used commands are:

    • pcmd help -- lists the online help.

    • pcmd info file -- lists "important" info about the file. *** Needs work to be fully functional ***

      $ pcmd info M1
      bfid="91184924000000L";
      volume="flop309";
      location_cookie="68608";
      size="1252";
      file_family="jon4";
      filename="/pnfs/enstore/airedale/jon4/M1";
      orig_name="/pnfs/enstore/airedale/jon4/M1";
      map_file="/pnfs/enstore/volmap/jon4/flop309/000000068608";
      pnfsid_file="00020000000000000050AE88";
      pnfsid_map="00020000000000000050AEA0"

    • pcmd tags directory -- lists the tags in the directory.

      $ pcmd tags .
      .(tag)(library)  =  rip6
      .(tag)(file_family)  =  jon-rip6
      .(tag)(file_family_width)  =  1
      .(tag)(file_family_wrapper)  =  cpio_custom
      -rw-rw-r--   1 bakken   g023            4 Nov 16 21:24 /pnfs/rip6/.(tag)(library)
      -rw-rw-r--   1 bakken   g023            8 Nov 16 21:24 /pnfs/rip6/.(tag)(file_family)
      -rw-rw-r--   1 bakken   g023            1 Nov 16 21:24 /pnfs/rip6/.(tag)(file_family_width)
      -rw-rw-r--   1 bakken   g023           11 Jan 27 18:45 /pnfs/rip6/.(tag)(file_family_wrapper)

    • pcmd library [value] -- sets/lists the library tag (must have the correct cwd).

      $ pcmd library
      ait
      $ pcmd library xxx
      $ pcmd library
      xxx

    • pcmd file_family [value] -- sets/lists the file family tag (must have the correct cwd).

      $ pcmd file_family
      jon-ait-3
      $ pcmd file_family xxx
      $ pcmd file_family
      xxx

    • pcmd file_family_width [value] -- sets/lists the file family width tag (must have the correct cwd).

      $ pcmd file_family_width
      2
      $ pcmd file_family_width 10
      $ pcmd file_family_width
      10

    • pcmd file_family_wrapper [value] -- sets/lists the file family wrapper tag (must have the correct cwd).

      $ pcmd file_family_wrapper
      cpio_custom
      $ pcmd file_family_wrapper cpio_odc
      $ pcmd file_family_wrapper
      cpio_odc

    • pcmd files volmap-tape -- lists all the files on the specified tape in volmap. *** Needs work to be fully functional ***

    • pcmd volume volumename -- lists the volmap-tape for the specified volume name. *** Needs work to be fully functional ***

    • pcmd bfid file -- lists the bit file id of the file.

      $ pcmd bfid testfile
      91551931700000L

    • pcmd parked file -- lists the last parked location of the file. *** The parked feature is not implemented ***

    • pcmd debug file -- lists the debug info about the file transfer.

    • pcmd xref file -- lists the cross-reference info about the file.

      $ pcmd xref testfile
      CA2902 (tape label)
      '0000_000000000_0000132' (positioning info)
      104857600 (file size)
      jon-ait-1 (file family)
      /pnfs/grau/ait/jon1/100MB.trand (original name)
      /pnfs/grau/ait/volmap/jon-ait-1/CA2902/...
         ... 0000_000000000_0000132 (volume map name)
      0001000000000000000928D0 (pnfs id of file)
      0001000000000000000928E0 (pnfs id of volume map file)

    • pcmd ls file [layer] -- does an ls on the named layer in the file.

      $ pcmd ls testfile 3
      4 -rw-rw-r-- bakken g023 3692 Jan 5 00:55 ./.(use)(3)(testfile)

    • pcmd {cat|more|less} file layer -- lists the layer of the file; it is easier to use the pcmd bfid|parked|debug|xref commands.

    • pcmd {tagcat|tagmore|tagless} tag directory -- lists the tag in the directory; it is easier to use the pcmd library|file_family|file_family_width commands.

    • pcmd enstore_state -- lists whether Enstore is still accepting transfers.

      $ pcmd enstore_state
      Enstore enabled

    • pcmd pnfs_state mount-point -- lists whether the pnfs mount point is up. *** Not fully functional yet ***

      $ pcmd pnfs_state /pnfs/grau/ait
      Pnfs up
    

    Don't use these unless you know what you are doing:

    • pcmd echo text file layer -- echoes text to the named layer of the file.
    • pcmd rm file layer -- deletes (clears) the named layer of the file.
    • pcmd cp unixfile file layer -- copies a Unix file to the named layer of the file.
    • pcmd size file size -- sets the size of the file.
    • pcmd tagecho text tagname -- echoes text to the named tag.
    • pcmd tagrm tag -- removes the tag (tricky, see the DESY documents).
    • pcmd io file -- sets io mode (can't clear it easily).

    Don't use these unless you can interpret the results:

    • pcmd id file -- shows the pnfs id.
    • pcmd showid id -- shows the showid information.
    • pcmd const file -- shows the const information.
    • pcmd nameof id -- shows the filename.
    • pcmd path id -- shows the complete file path.
    • pcmd parent id -- shows the parent.
    • pcmd counters file -- shows the counters.
    • pcmd counterN dbnum -- shows the counters for database dbnum (must have cwd in pnfs).
    • pcmd cursor file -- shows the cursor.
    • pcmd position file -- shows the directory position.
    • pcmd database file -- shows the database information.
    • pcmd databaseN dbnum -- shows the database information for database dbnum (must have cwd in pnfs).


    1.2 Encp

    Reading and writing files means interacting with media. Most users will just want to get their data file and do their work. They do not want all the "baggage" products (python, ftt libraries, libtp...) that are required to run a complete Enstore system. To this end, we distribute a separate product, called encp, along with Enstore. This product basically consists of 1 stand-alone binary executable, encp, and one UPS table file. This executable, and the mounting of pnfs, are the only things that clients need to access their data on the robot. Encp is very similar to the cp command in UNIX, and we have tried to duplicate its behavior wherever possible. The syntax is:
    % encp [options] src_file dst_file
    
    Currently there is no wild-carding allowed, but this is a straightforward extension to encp.

    1.2.1 Encp command line options

    • --help -- print a short help message about using encp. Default: none.
    • --crc -- perform a CRC check on the local user machine. Default: the CRC check is only performed on the Mover computers.
    • --pri=value -- set the base priority to value. Default: 1.
    • --delpri=value -- change the base priority by value after a period specified by the agetime switch. Default: 0.
    • --agetime=value -- specify the time period, in minutes, after which the base priority could change. Default: 0 (no aging of priority).
    • --delayed_dismount -- give the Library Manager a hint that more work is coming for the volume and that it should not dismount the volume "too quickly" when this transfer is completed. Default: none (immediate dismount on completion).
    • --data_access_layer -- turn on the special status printing requested by D0. Default: D0 printing is off.
    • --verbose=value -- change the amount of information printed about the transfer. Default: 0 (no printing).
    • --queue nodename -- list the active and pending transfers for the specified node. Default: none.
    • --ephemeral -- create a new file family (width exactly 1) and copy files to this file family. Default: none.
    • --family value -- copy files to the specified file family. Default: none.
    • --config_host=value -- specify the hostname where the Configuration Server is running. Default: the environmental variable ENSTORE_CONFIG_HOST, set by the UPS setup command.
    • --config_port=value -- specify the port number that the Configuration Server responds to. Default: the environmental variable ENSTORE_CONFIG_PORT, set by the UPS setup command.

    The data_access_layer option provides the output of encp in a format required by the D0 SAM system. The example output is below:

    $  encp --data_access_layer 1GB.trand  /pnfs/grau/ait/jon1/
    INFILE=/rip8a/enstore/random/1GB.trand
    OUTFILE=/pnfs/grau/ait/jon1/1GB.trand
    FILESIZE=1073741824
    LABEL=CA2901
    DRIVE=/dev/rmt/tps2d2n
    TRANSFER_TIME=384.365273
    SEEK_TIME=0.004476
    MOUNT_TIME=24.688773
    QWAIT_TIME=1.581097
    TIME2NOW=434.628970
    STATUS=ok
    
    1 GB   copied to CA2901 at user 2.4 MB/S (2.7 MB/S IO rate)
    
    
    This output is easily parsable and provides information about the input and output files, file size, volume label, drive used to read/write the data, transfer time, file seek time, volume mount time, wait time in the request queue, operation completion status, and the total time from the invocation of encp until the end of the operation.
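
    Since the output is a set of KEY=value lines, a consumer such as SAM can parse it mechanically. A minimal, illustrative sketch (the field names are taken from the example above):

        def parse_data_access_layer(text):
            # Collect the KEY=value lines emitted by --data_access_layer;
            # lines without an "=" (such as the trailing rate summary)
            # are skipped.
            info = {}
            for line in text.splitlines():
                key, sep, value = line.partition("=")
                if sep:
                    info[key.strip()] = value.strip()
            return info

        # e.g. parse_data_access_layer(output)["STATUS"] == "ok"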

    1.2.2 Throttling in encp

    It is important not to swamp any system. In Enstore, a first level of throttling is implemented in encp. Control communications in Enstore use a simple, reliable request-response protocol implemented over UDP, whereas data transfers are implemented using two TCP ports. A fixed number, currently 30, of pre-allocated TCP ports is arbitrated among all instances of encp on a given machine. Consequently, the system will survive the worst sort of abuse, for example, someone forking off 200 copy requests: since each transfer uses two ports, at most 15 transfers will be active in the system at any time.
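
    The port arbitration itself can be pictured as each encp instance probing a fixed pool of ports until it manages to bind one. The sketch below is illustrative only; the actual port numbers are an assumption.

        import socket

        # Assumed pre-allocated pool: 30 TCP ports shared by every encp
        # instance on this machine; each transfer needs two of them.
        PORT_POOL = range(7600, 7630)

        def grab_data_port():
            # Binding succeeds only if no other encp instance holds the
            # port, so the fixed pool itself enforces the transfer limit.
            for port in PORT_POOL:
                sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
                try:
                    sock.bind(("", port))
                    return sock        # caller uses this bound socket for data
                except OSError:
                    sock.close()       # port busy; probe the next one
            raise RuntimeError("all pre-allocated data ports are in use")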

    1.2.3 Specifying lists of files when reading from or writing to the HSM

    Encp duplicates unix cp's behavior when one tries to read or write multiple input files from the HSM:

    encp [options] source... directory

    That is, the final item has to be a directory when specifying an input list. There is one caveat when specifying multiple files using a single encp command: at most one Mover processes your request list at a time. The reason encp uses just one Mover is to keep the system simple; if you want more than one Mover active, for example when reading lists of files from more than one volume, use multiple encp commands.

    On reads from the HSM, encp scans all specified files and groups them according to which volume they are on, orders them according to location on a tape, and then submits all the file requests that are on one specific volume, reads all the files for that volume from the HSM, and then proceeds to the next volume. Encp processes the volumes in any order it chooses.
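
    In outline, this grouping and ordering step looks like the following sketch (the request layout is an assumption; as noted later in the File Clerk section, a lexical sort of the location cookies yields tape order):

        def plan_reads(requests):
            # requests: a list of (volume, location_cookie, filename) tuples.
            by_volume = {}
            for volume, cookie, name in requests:
                by_volume.setdefault(volume, []).append((cookie, name))
            # Within each volume, sorting the location cookies lexically
            # puts the files in tape order; the volumes themselves may be
            # processed in any order encp chooses.
            for volume, files in sorted(by_volume.items()):
                files.sort()
                yield volume, [name for _, name in files]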

    On writes to the HSM, encp processes each input file sequentially. Since the user must specify a single output directory, all input files must belong to the same file family, and hence could all go to the same tape (if possible). Encp sets a flag, "don't dismount the volume too quickly - there are more files coming for the same family", that the Mover uses to postpone the dismount and thereby avoid the extra time involved in the volume manipulations. Please note that there is no guarantee that all the files will go to one tape (there might not be room) or that they will be grouped together on the tape (there may be other writes to the same family that get intermixed).

    Consider the following example (P=/pnfs/enstore/airedale) reading from the HSM:

    The following files are on flop301: ran-1, ran-2, ran-3, ran-4
    The following files are on flop302: ran-5, ran-6, ran-7, ran-8
    The following files are on flop303: ran-9, ran-10, ran-11, ran-12

    Encp submits the requests for all files on flop301 and reads back those files and then does the same for flop302 and flop303.

    Here is the output of an actual test:

    $ encp $P/ran-1       $P/ran-2        $P/ran-3        $P/ran-4 \
           $P/test2/ran-5 $P/test2/ran-6  $P/test2/ran-7  $P/test2/ran-8 \
           $P/test3/ran-9 $P/test3/ran-10 $P/test3/ran-11 $P/test3/ran-12 .
    $P/test2/ran-5  -> ./ran-5  : 102400 bytes copied from flop302 at 0.19 MB/S  requestor:bakken cum= 3.5
    $P/test2/ran-6  -> ./ran-6  : 102400 bytes copied from flop302 at 0.42 MB/S  requestor:bakken cum= 3.7
    $P/test2/ran-7  -> ./ran-7  : 102400 bytes copied from flop302 at 0.46 MB/S  requestor:bakken cum= 3.9
    $P/test2/ran-8  -> ./ran-8  : 102400 bytes copied from flop302 at 0.46 MB/S  requestor:bakken cum= 4.2
    $P/test3/ran-9  -> ./ran-9  : 102400 bytes copied from flop303 at 0.12 MB/S  requestor:bakken cum= 5.4
    $P/test3/ran-10 -> ./ran-10 : 102400 bytes copied from flop303 at 0.49 MB/S  requestor:bakken cum= 5.6
    $P/test3/ran-11 -> ./ran-11 : 102400 bytes copied from flop303 at 0.45 MB/S  requestor:bakken cum= 5.8
    $P/test3/ran-12 -> ./ran-12 : 102400 bytes copied from flop303 at 0.45 MB/S  requestor:bakken cum= 6.1
    $P/ran-1        -> ./ran-1  : 102400 bytes copied from flop301 at 0.12 MB/S  requestor:bakken cum= 7.4
    $P/ran-2        -> ./ran-2  : 102400 bytes copied from flop301 at 0.45 MB/S  requestor:bakken cum= 7.6
    $P/ran-3        -> ./ran-3  : 102400 bytes copied from flop301 at 0.42 MB/S  requestor:bakken cum= 7.8
    $P/ran-4        -> ./ran-4  : 102400 bytes copied from flop301 at 0.45 MB/S  requestor:bakken cum= 8.0
    

    1.2.4 Encp Transfer Lists

    One of the requests of the Fermilab Farms group was the ability to query Enstore to determine what transfers were in progress and pending on a certain node. If they need to take a farm node out of service they want to know what transfers are active on the node before it is taken down. [Some users submit jobs even when they know the node is going to be down and then complain when their jobs fail. This feature will allow the farms group to send an automated message saying that the node was taken down and explain why their transfers failed.]

    This is a straightforward and easy command, except for 2 complications:

    • Python is (currently) required to get the transfer queue
    • The command must be available everywhere the encp clients run and python isn't typically available everywhere.

    Normally encp is involved only with copying files. It is also the only task distributed with the encp client software. We chose to extend the encp client code so it could determine the transfer queues in addition to its main responsibility of copying files. This got around both complications.

    Here's an example of how it is used:

    $ encp --queue rip8.fnal.gov
    rip8.fnal.gov bakken /raid/1MB.trand /pnfs/grau/ait/jon1/1MB.trand P
    rip8.fnal.gov bakken /raid/1GB.trand /pnfs/grau/ait/jon2/1GB.trand M
    rip8.fnal.gov bakken /raid/1GB.trand /pnfs/grau/ait/jon1/1GB.trand M
    
    $ encp --queue rip4.fnal.gov
    rip4.fnal.gov bakken /raid/1MB.trand /pnfs/grau/ait/jon2/1MB.trand M
    
    The 1st field in the output is the node name, the 2nd is the requester, the 3rd is the input filename, and the 4th is the output filename. The 5th and last field can have 2 values: "P" denotes a pending transfer still in the Library Manager queues and "M" signifies an active transfer at a Mover.
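
    Because the output is fixed-field, an operations script can split pending from active transfers in a few lines. A sketch, using the field layout just described:

        def split_queue(lines):
            # Fields: node, requester, input file, output file, state (P or M).
            pending, active = [], []
            for line in lines:
                fields = line.split()
                if len(fields) == 5:
                    (pending if fields[4] == "P" else active).append(fields)
            return pending, active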

    1.2.5 Encp Return Status

    Every individual encp request returns a status in order to identify (and in some cases resolve) any problem that occurs. Encp processes the status internally and either retries the request (providing the user with intermediate status) or terminates with the appropriate status. The only successful status returned by encp is "OK" and the only successful encp exit code is 0. Below is the list of statuses returned to the user:

    STATUS DESCRIPTION
    OK Operation Completed Successfully
    KEYERROR Not an existing reference key
    DOESNOTEXIST Object (file name, etc.) does not exist
    NOMOVERS No Movers to process request
    MOUNTFAILED Mount of required volume failed
    DISMOUNTFAILED Dismount of required volume failed
    MEDIA_IN_ANOTHER_DEVICE Requested Media is in another Device
    MEDIAERROR Bad Media
    USERERROR User Error
    DRIVEERROR Drive Error
    UNKNOWNMEDIATYPE Unknown media type
    NOVOLUME Volume does not exist
    NOACCESS Volume marked as no access
    CONFLICT Configuration conflict detected
    WRITE_NOTAPE Requested volume was not found in the library
    WRITE_TAPEBUSY Requested volume is in another drive.
    WRITE_DRIVEBUSY A volume is already in the drive.
    WRITE_BADMOUNT Mount failure or load operation failed.
    WRITE_BADSPACE EOD cookie does not produce EOD.
    WRITE_ERROR Error writing data block or file mark
    WRITE_EOT Hit EOT while writing data block or file mark
    WRITE_NOBLANKS No more blank volumes
    WRITE_MOVER_CRASH Mover crash during write operation
    READ_NOTAPE Requested volume was not found in the library
    READ_TAPEBUSY Requested volume is in another drive
    READ_DRIVEBUSY A volume is already in the drive
    READ_BADMOUNT Mount failure or load operation failed
    READ_BADLOCATE Failed space or initial CRC's don't match
    READ_ERROR Error reading data block
    READ_COMP_CRC CRC mismatch
    READ_EOT Hit EOT when reading
    READ_EOD Hit EOD when reading
    READ_UNLOAD Error unloading volume
    READ_UNMOUNT Error when unmounting volume
    READ_MOVER_CRASH Mover crash during read operation

    A more detailed description of these and other statuses, and of how they are processed inside the system, can be found in section 4 of this document.

    1.2.6 Encp Client Installation

    Encp is distributed via the Fermilab UPD mechanism. It is a static binary product that does not depend on any other products or executables. The binary encp product is made using the standard Python freeze tool. Basically, the Python freeze tool uses the regular Python parser to parse the Python code and all its modules to produce a binary that people who don't have Python can run.

    Typically 2 encp products are available in UPD, one for general use and one explicitly tailored for D0/SAM. This tailoring is simply for SAM's convenience. There will only be one version in the future.

    $ upd list -a encp
    
    DATABASE=/ftp/upsdb 
            Product=encp    Version=v0_11-sam       Flavor=Linux+2
                    Qualifiers=""   Chain=current
    
            Product=encp    Version=v0_11   Flavor=Linux+2
                    Qualifiers=""   Chain=""
    
    The installation procedure is straightforward:
    $ upd install -G"-c" encp
    informational: beginning install of encp.
    informational: transferred /ftp/products/encp/v0_11/Linux+2/encp_v0_11_Linux+2
            from fnkits.fnal.gov to
            /home/products/encp/v0_11
    informational: transferred /ftp/products/encp/v0_11/Linux+2/encp_v0_11_Linux+2.table
            from fnkits.fnal.gov:/ to
            /home/products/upsdb/encp/v0_11.table.new
    informational: ups declare succeeded
    informational: ups declare succeeded
    
    The entire product consists of the encp binary, pcmd (a pnfs script described in the pnfs section), and some UPS tables. The encp binary is large since it is statically linked.
    $ ls
         179 Nov 24 09:49 .manifest.encp
     2976514 Nov 24 09:24 encp*
        1690 Nov 24 09:24 encp.table
           9 Dec  2 15:20 enstore_variables.table -> rip.table
       11313 Nov 24 09:24 pcmd*
         398 Nov 24 09:24 rip.table
         399 Nov 24 09:24 sam.table
    
    Two environmental variables, ENSTORE_CONFIG_PORT, and ENSTORE_CONFIG_HOST, control to which Enstore system the encp requests go. In order to allow a user to override the default control environmental variables distributed with the product, the encp product uses the UPS concept of "virtual" products. The basic idea is that everything in the encp table file is general, and everything in the virtual product enstore_variables.table file is user/installation specific.
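
    The resulting resolution order, in which the command line switches override the UPS-provided defaults, amounts to the following sketch:

        import os

        def config_server_address(cli_host=None, cli_port=None):
            # The --config_host/--config_port switches take precedence over
            # the environmental variables set by the UPS setup command.
            host = cli_host or os.environ["ENSTORE_CONFIG_HOST"]
            port = int(cli_port or os.environ["ENSTORE_CONFIG_PORT"])
            return host, port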

    Finally, when encp is set up, it creates a directory in the /tmp area where it stores debugging information and other non-user information. The user can ignore the files in the /tmp area.

    As an aside, it should be noted that since Enstore is still in development, no versions are cut. The complete UPS product structure is finished. For new installations, typically we CVS checkout code, issue one make command, and the product is ready to be used. We expect to cut versions of Enstore when it is appropriate. These versions will not be frozen, i.e., they will need Python and the other dependent products.


    1.3.0 Enstore Servers

    Enstore servers are software modules that have specific functions. The high level concepts are as follows:
    Physical library
    A Physical Library represents a real, tangible collection of media along with software drivers/utilities to manipulate, read and write and organize them. A physical library can be thought of as consisting of
    • one or more virtual libraries
    • a media changer (robot arm)
    • one or more media export/import slots
    • one or more drives (tape, cdrom, disk, etc.)
    • volumes (tape cartridges, cdroms, etc.)

    Virtual Library -- A virtual library contains one and only one kind of media. For example, Enstore divides an STK Powderhorn library holding 50, 20 and 10 GB redwood media into at least three virtual libraries. In common usage, the term "library" in Enstore refers to a virtual library. Writes are directed to a specific (virtual) library, thus selecting the media.

    Drives -- Drives are bound to special processes called Mover clients. The drives can be dynamically assigned allowing the number of drives to be less than the number of virtual libraries.

    Volumes -- are uniquely identified by an external label, which is known to the Media Changer.

    Quota Family
    A quota family is a set of pairs of media names and maximum number of volumes. All files are created with respect to a quota family. Creation of a file is not allowed if the maximum number of volumes in that family would be exceeded. *** Quotas are not implemented in Enstore yet ***

    File family:
    A file family is specified by a name and an integer "width". A file family is associated with every file creation. Within a given library, Enstore keeps no more than "width" volumes open for writing, and loads volumes on no more than "width" drives at any given moment. There currently is no width for reading; this could be added if deemed important. This is not striping, but rather the number of different volumes, and hence different files, which can be active at one time. Once a volume is associated with a file family, only files in that family will be placed on the volume. By design, there is no pre-set limit on the number of file families. Clever use of file families will allow volumes to be faulted out to "shelf", and also decrease access times for subsequent reads. When a file family has filled all of its "width" media, new media are drawn out of a pool of blanks. (A sketch of the width rule follows below.)
    Media ejected to shelf are put into a shelf virtual library and are controlled by a shelf Library Manager. Users are informed that this data is currently unavailable, and if they really want the data, arrangements should be made to have the media placed in a library which is accessible, or get it manually later.
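
    A minimal sketch of the width rule, under the definitions above (the data structures are hypothetical, not Enstore's):

        class Volume:
            def __init__(self, label):
                self.label, self.busy = label, False

        def volume_for_write(family, width, open_volumes, blanks):
            # open_volumes maps a file family to its volumes currently
            # open for writing; "width" caps the length of that list.
            volumes = open_volumes.setdefault(family, [])
            for volume in volumes:
                if not volume.busy:
                    return volume          # stream to an open, idle volume
            if len(volumes) < width and blanks:
                volume = blanks.pop()      # under width: draw a blank
                volumes.append(volume)
                return volume
            return None                    # at full width: the request queues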

    Each of the servers listed below is discussed in further detail in its own section. Please refer to these sections for information on detailed functionality and specific command line interfaces.

    • Configuration Server - administers Enstore system configuration information.
    • Library Manager - queues and dispatches work for a virtual library.
    • Mover - handles the actual transfer of data from a volume to the user.
    • Media Changer - represents a physical device or operator who performs mounts/dismounts of volumes on drives.
    • Inquisitor - monitors Enstore system status and activity.
    • File Clerk - administers file information (e.g. - location cookies).
    • Volume Clerk - administers volume information (e.g. bytes left).
    • Info Server - provides read-only functions of the File and Volume Clerks.
    • Log Server - formats and records Enstore system log messages.
    • Admin Clerk - allows administrative access to the Enstore system.
    • Alarm Server - analyzes Enstore system status and activity information to determine system health.
    • Accounting Server - provides accounting of data transfers.
    • Drivestat Server - records tape drive usage statistics.

    1.3.0.1 General Command Line Control of Servers

    Enstore servers implement a common, uniform interface. Common parsing is used to determine command line options. All servers support a set of general options as well as options specific to their functionality. Control of the servers is done via the enstore command. This command provides a general way of accessing the individual servers as well as the Enstore system as a whole. Examples of using this command are given following the table of general options. enstore recognizes the following server names when sending commands to the individual servers:

    • Configuration Server - conf[iguration]
    • Library Manager - lib[rary]
    • Mover - mov[er]
    • Media Changer - med[ia]
    • Inquisitor - inq[uisitor]
    • File Clerk - fil[e]
    • Volume Clerk - vol[ume]
    • Info Server - inf[o]
    • Log Server - log
    • Alarm Server - ala[rm]
    For example : enstore inq --help

    When enstore is used to start/stop servers, the server must be specified by using the full ASCII name specified in the configuration file. For example:

    • configuration_server
    • rip6.library_manager
    • rip6.mover
    • rip6.media_changer
    • file_clerk
    • volume_clerk
    • log_server
    • alarm_server

    General Server Options
    • --alive -- check if the server process exists. Default: none.
    • --do-alarm levels -- turn on more alarms. Default: none.
    • --do-log levels -- turn on more verbose logging (DEBUGLOG). Default: none.
    • --do-print levels -- turn on more verbose output (stdout). Default: none.
    • --dont-alarm levels -- turn off more alarms. Default: none.
    • --dont-log levels -- turn off more verbose logging (DEBUGLOG file). Default: none.
    • --dont-print levels -- turn off more verbose output (stdout). Default: none.
    • --help -- print a short help message about using the server. Default: none.

    Enstore System Command Line Control
    • enstore start -- start the Enstore system on the current node.
    • enstore start --just file_clerk -- start only the file_clerk on the current node.
    • enstore Estart -- start the Enstore system on the whole Enstore cluster.
    • enstore Estart stkensrv0 "--just file_clerk" -- start only the file_clerk on stkensrv0.
    • enstore stop -- stop the Enstore system on the current node.
    • enstore stop --just log_server -- stop only the log_server on the current node.
    • enstore Estop -- stop the Enstore system on the whole Enstore cluster.
    • enstore Estop stkensrv0 "--just file_clerk" -- stop only the file_clerk on stkensrv0.
    • enstore restart -- stop and then start the Enstore system.
    • enstore restart --just inq -- stop and then restart only the Inquisitor on the current node.
    • enstore Erestart -- stop and then start the Enstore system on the whole Enstore cluster.
    • enstore Erestart stkensrv0 "--just file_clerk" -- stop and then start only the file_clerk on stkensrv0.
    • EPS -- display enstore related processes on the local host.
    • enstore EPS -- display enstore related processes for the whole Enstore system.

    1.3.1 Volume Clerk

    The Volume Clerk keeps and administers volume information that it stores in a single table database. There is one record for each volume known to the system. The record is looked up by a key, which is the volume's external label. The information tracked for each volume is described in the table below. The default values are shown in parentheses ().
    • external_label -- string [primary key]. Volume name, specified by the user on volume creation; used to display volume metadata.
    • file_family -- string ("none"). File family name, specified by the user on volume creation; only files that belong to this family will be stored on this volume.
    • media_type -- string. Specified at volume creation; implies the block size; used for writing.
    • library -- string. Specified by the user on volume declaration; defines which (virtual) library currently holds the volume.
    • first_access -- int (-1). Unix time when a user issues the first write command to copy data to the volume. Set by the Volume Clerk.
    • last_access -- int (-1). Unix time when a user last accessed the volume. Set by the Volume Clerk.
    • declared -- int. Unix time when the volume is declared available to the system. Set by the Volume Clerk.
    • capacity_bytes -- 64-bit int. Specified by the user on volume creation; estimate of the number of bytes that would fit on the volume.
    • blocksize -- int. Set by the Volume Clerk; derived from the media type.
    • remaining_bytes -- 64-bit int. Specified by the user on volume creation; estimate of the number of bytes that would fit on the volume; updated by the Volume Clerk every time data are written to the media.
    • eod_cookie -- string ("none"). Tells the driver how to space to the end of the volume; it is driver specific; updated by the Volume Clerk when data are written on the media.
    • wrapper -- string ("cpio"). Wrapper method; currently specifies the format of the files on the volume.
    • sum_rd_err -- int (0). Read error count; the Volume Clerk increments this field when the Mover receives an error while reading from the volume.
    • sum_rd_access -- int (0). Read access count; the Volume Clerk increments this field every time a file is read.
    • sum_wr_err -- int (0). Write error count; the Volume Clerk increments this field when the Mover receives an error while writing to the volume.
    • sum_wr_access -- int (0). Write access count; the Volume Clerk increments this field every time a file is written.
    • user_inhibit -- string (default "none"; or "readonly", "noaccess"). Specified by the user at volume creation; access level for this volume; updated by the Volume Clerk.
    • system_inhibit -- string (default "none"; or "writing", "readonly", "full", "noaccess"). Administrator-generated limitation on the kind of access permitted to this volume; updated by the Volume Clerk when data are written on the volume, when an error occurred while data were being written, or when the file size exceeded the remaining number of bytes on the volume.
    • at_mover -- tuple. The first element is a state string ("unmounted", "mounting", "mounted", "unmounting"); the second is a Mover name. Reflects the state of the volume; used to keep track of the volume mount state to avoid illegitimate mount requests. Transitions are as follows: "unmounted" -> "mounting" -> "mounted" -> "unmounting" -> "unmounted". All other transitions and their associated requests will be rejected.
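
    For concreteness, a volume record with these fields might look like the following Python dictionary, together with a check of the at_mover transition rule. The values are invented for illustration; this is a sketch, not the Volume Clerk's actual representation.

        VALID_NEXT = {"unmounted": "mounting", "mounting": "mounted",
                      "mounted": "unmounting", "unmounting": "unmounted"}

        volume = {
            "external_label":  "CA2902",
            "file_family":     "jon-ait-1",
            "media_type":      "redwood",        # invented value
            "library":         "stk-red50",
            "first_access":    -1,
            "last_access":     -1,
            "declared":        917827200,        # Unix time, invented
            "capacity_bytes":  50 * 10**9,
            "blocksize":       131072,           # derived from media type (value invented)
            "remaining_bytes": 50 * 10**9,
            "eod_cookie":      "none",
            "wrapper":         "cpio",
            "sum_rd_err":      0,
            "sum_rd_access":   0,
            "sum_wr_err":      0,
            "sum_wr_access":   0,
            "user_inhibit":    "none",
            "system_inhibit":  "none",
            "at_mover":        ("unmounted", "none"),
        }

        def set_mover_state(vol, new_state, mover):
            # Only the unmounted -> mounting -> mounted -> unmounting ->
            # unmounted cycle is legal; anything else is rejected.
            if VALID_NEXT[vol["at_mover"][0]] != new_state:
                raise ValueError("illegal at_mover transition")
            vol["at_mover"] = (new_state, mover)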

    The Volume Clerk does the following operations:

    • show the name of all the volumes
    • show volume information
    • add a volume
    • delete a volume
    • restore a volume
    • find an appropriate volume on which to write the file
    • change the number of remaining bytes on the volume
    • set the number of read/write errors
    • set the current status of the volume
    • set the volume as readonly
    • start/stop backup of volume journals

    1.3.1.1 Command Line Control of the Volume Clerk

    The user may interact with the Volume Clerk directly through the enstore vcc command.

    Function Command
    show the name of all the volumes enstore vcc --vols
    show volume information enstore vcc --vol volume_name
    add a volume enstore vcc --addvol library file_family media_type volume_name capacity remaining_capacity
    delete a volume enstore vcc --delvol volume_name
    restore a volume (do not restore files) enstore vcc --restorevol volume_name
    restore a volume (restore files) enstore vcc --all --restorevol volume_name
    find an appropriate volume on which to write the file enstore vcc --nextvol library_name minimal_remaining_bytes file_family
    put volume into a new library enstore vcc --newlib volume_name library_name
    clear system inhibits on the volume enstore vcc --clrvol volume_name
    mark no access to this volume enstore vcc --noavol volume_name
    set the volume as read only enstore vcc --rdovol volume_name
    start/stop backup of volume journals enstore vcc --backup
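
    For example, declaring a new 20 GB volume to the system might look like the following (the library, file family, media type, and volume names here are hypothetical):

    $ enstore vcc --addvol rip6 sphinx exabyte VOL001 20000000000 20000000000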


    1.3.2 File Clerk

    The File Clerk tracks files in the system. There is one record for each file in the system. The records are keyed by the string version of the bit file ID. The default values are shown in parentheses (). The fields tracked are as follows:

    Column Name Type Comments
    bfid string [primary_key] bit file ID; uniquely identifies every file in the system.
    external_label string Volume name on which the file has been written; same as the external_label in the volume table.
    bof_space_cookie string Driver-specific string telling how to space to the file on the media. A lexical sort of all bof_space_cookies for a given volume will yield an optimized traversal of the volume.
    complete_crc int crc of all the bits sent by the user.
    sanity_cookie string ("(0,0)") Number of bytes used for a sanity crc, and the sanity crc itself. The sanity crc is just the normal crc, but only for the first N bytes of the file. This allows the Mover to check early in the transfer that it probably has the right user file selected; at the least it will know if it has the wrong file.
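
    A minimal sketch of one file record as a Python dictionary, keyed by the bit file ID, follows. The values are hypothetical; the fields mirror the table above.

    # A sketch of one File Clerk record; values are hypothetical.
    file_record = {
        'bfid'             : '91548494800000L',   # primary key: bit file ID
        'external_label'   : 'VOL001',            # volume holding the file
        'bof_space_cookie' : '000000063488',      # driver-specific spacing cookie
        'complete_crc'     : 2048910256,          # crc of all bits sent by the user
        'sanity_cookie'    : (65535, 2048910256), # (N, crc of the first N bytes)
    }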

    The File Clerk supports the following requests:

    • show bfid of all the files
    • show file information
    • start/stop backup of file journals
    • assist in processing file read requests
    • delete/restore files

    1.3.2.1 Command Line Control of the File Clerk

    Users may interact with the File Clerk directly through the enstore fcc command.

    Function Command
    show bfid of all the files enstore fcc --bfids
    show file information enstore fcc --bfid=bit-file-ID
    start/stop backup of file journals enstore fcc --backup
    declare file deleted/undeleted enstore fcc --bfid=BFID --deleted={yes/no}
    restore file by name enstore fcc --restore="file_name"
    restore file by name and restore a path enstore fcc --r --restore="file_name"


    1.3.3 Library Manager

    The Library Manager is a server which queues up and dispatches work for a virtual library. There is one Library Manager for each virtual library. A virtual library is the collection of tape volumes of the same media type in a (robotic) storage system, together with the Movers that use these tapes. The Library Manager has two types of clients:
    1. Users -- requesting to have their files read or written.
    2. Movers -- seeking to actually read or write files.

    It can also be accessed from a command line interface.

    Enstore does not limit the number of Library Managers, and the relation between Library Managers and Movers is many to many. That is, one Library Manager may have many Movers associated with it, and one Mover may have many Library Managers associated with it. Information about Library Managers is contained in the Enstore configuration dictionary and is available to clients via the Configuration Server. Each Library Manager is specified in the configuration dictionary as follows:

    configdict['cdf.library_manager']       = { 'host':'cdfensrv4',
                                                'port':7515,
                                                'logname':'CDFLM',
                                                'norestart':'INQ',
                                                'max_encp_retries':3,
                                                'max_suspect_movers':3,
                                                'max_file_size':(60L*GB) - 1,
                                                'min_file_size':300*MB,
                                                'suspect_volume_expiration_time':3600*24,
                                                'legal_encp_version':legal_encp_version,
                                                'CleanTapeVolumeFamily': '9940ACLN.CleanTapeFileFamily.noWrapper',
                                                }

    configdict['CDF-9940B.library_manager'] = { 'host':'cdfensrv4',
                                                'port':7522,
                                                'logname':'9940BLM',
                                                'norestart':'INQ',
                                                'max_encp_retries':3,
                                                'max_file_size':(200L*GB) - 1,
                                                'min_file_size':300*MB,
                                                'suspect_volume_expiration_time':3600*24,
                                                'legal_encp_version':legal_encp_version,
                                                'CleanTapeVolumeFamily': '9940BCLN.CleanTapeFileFamily.noWrapper',
                                                }
    
    *.library_manager is the name of the Library Manager.
    

    Below is a description of all Library Manager (LM) configuration keys:

    KEY DESCRIPTION DEFAULT
    host host name where the server runs. None
    port command communication port None
    logname name identifying the server in the log file None
    lock if specified LM will start in this state. Allowed values:locked, unlocked, ignore, pause, nowrite, noread unlocked
    max_suspect_movers if the number of suspect Movers on which a given volume has failed is >= max_suspect_movers, the volume is set to the NOACCESS state. 3
    suspect_volume_expiration_time remove entry from suspect volume list after this period of time None
    min_file_size minimum file size. 0
    max_file_size maximum file size allowed by this library. 2GB-2kB
    blank_error_increment do not set the volume to NOACCESS on FTT_EBLANK errors until the number of errors exceeds max_suspect_movers+blank_error_increment. 5
    legal_encp_version minimal encp version number allowed to access Enstore None
    CleanTapeVolumeFamily volume family for cleaning tapes None
    storage_group_limits minimum number of drives guaranteed to a given storage group (fair share) when different storage groups compete for tape drives. None

    Each Mover has an entry in the configuration dictionary describing the Library Manager(s) associated with it. This entry can be a single name or a list of names:

    # for a single LM

    configdict['9940B15.mover'] = { 'host':'stkenmvr15a', 'data_ip':'stkenmvr15a', 'port':7577, 'logname':'DBT15MV',
                                    'statistics_path':'/tmp/enstore/enstore/DBT15MV.stat',
                                    'norestart':'INQ',
                                    'max_consecutive_failures': mvr_max_consecutive_failures,
                                    'max_failures': mvr_max_failures, 'compression':0,
                                    'check_written_file': b_mvr_check_f,
                                    'check_first_written_file': b_mvr_check_1st,
                                    'max_buffer':1000*MB,
                                    'max_rate': s9940b_rate,
                                    'mount_delay':15,
                                    'update_interval':5,
                                    'library':'CD-9940B.library_manager',
                                    'device':'/dev/rmt/tps0d0n', 'driver':'FTTDriver',
                                    'mc_device':'0,0,10,17', 'media_changer':'stk.media_changer', 'do_cleaning':'No',
                                    'syslog_entry':low_level_diag_pattern,
                                    'max_time_in_state':1200,
                                    'send_stats':1,
                                    }
    
    
    # for multiple LMs
    configdict['9940B16.mover'] = { 'host':'stkenmvr16a', 'data_ip':'stkenmvr16a', 'port':7578, 'logname':'DBT16MV',
                                    'statistics_path':'/tmp/enstore/enstore/DBT16MV.stat',
                                    'norestart':'INQ',
                                    'max_consecutive_failures': mvr_max_consecutive_failures,
                                    'max_failures': mvr_max_failures, 'compression':0,
                                    'check_written_file': b_mvr_check_f,
                                    'check_first_written_file': b_mvr_check_1st,
                                    'max_buffer':1000*MB,
                                    'max_rate': s9940b_rate,
                                    'mount_delay':15,
                                    'update_interval':5,
                                    'library':['CD-9940B.library_manager',
                                               'test.library_manager'],
                                    'device':'/dev/rmt/tps0d0n', 'driver':'FTTDriver',
                                    'mc_device':'0,0,10,18', 'media_changer':'stk.media_changer', 'do_cleaning':'No',
                                    'syslog_entry':low_level_diag_pattern,
                                    'max_time_in_state':1200,
                                    'send_stats':1,
                                    }
    
    

    All Movers periodically send messages to their Library Managers, notifying them of the Mover's state. If a Mover is in the IDLE or HAVE_BOUND state, the Library Manager can send work to this Mover from its list of pending requests. Work dispatching is discussed later.

    1.3.3.1 Users' Requests

    Users' requests to the Library Manager can be divided into two categories: data requests and inquiries. Data requests are tape (or other media) write or read requests. Inquiries are requests about the Library Manager's internal state and resources. encp is the major client application for transferring data between the user and Enstore; it can issue write or read requests.
    Writes into the system
    Based on the user's encp destination filename, a pnfs tag associated with the destination directory identifies the library for a write request, allowing the encp program to compose a write request and contact the appropriate Library Manager directly. The Library Manager queues the work and acknowledges the request.
    Reads from the system
    Because users may mv the pnfs files, on reads from the system pnfs can only provide the bit file ID associated with the file. In this case encp contacts the File Clerk, which returns additional information about the requested file as well as the Library Manager associated with this file. encp then sends the read request to that Library Manager.
    Inquiries
    Currently there is only one kind of inquiry request, allowing a user to observe requests queued in the Library Manager queues. This command asks the Library Manager to provide information about all current requests from a particular user node. The output format is:
    [user node] [user name] [input file] [output file] [request status]
    where request status can be either P (pending) or M (at Mover). An example of the output is given in 1.2.4.

    Work can be prioritized; a larger priority number means higher priority. Currently, writes and reads both have priority 1 for our test purposes. Any priority mechanism could be developed to replace the existing one. However, once a volume has been mounted, the system will exhaust all work for that volume regardless of priority.

    The Library Manager tries to sort read requests according to file location on the tape. If a read request has already been sent to the Mover, the next request to this Mover for the same tape will be for the file whose location number is higher than the current one. If the location number is less than the current one, the request is placed at the end of the request list. A sketch of this ordering is shown below.
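
    Since a lexical sort of the location cookies yields an optimized traversal of a volume (see the bof_space_cookie description in 1.3.2), the ordering can be sketched as follows. This is an illustration, not the Library Manager's actual code; the request dictionaries are assumed to carry the 'fc' sub-dictionary shown in the getwork output below.

    # Sketch: order pending reads for one volume so the tape is traversed
    # in one direction; requests behind the current position go to the
    # end of the list for the next pass.
    def order_reads(pending, current_cookie):
        ahead  = [r for r in pending if r['fc']['location_cookie'] >  current_cookie]
        behind = [r for r in pending if r['fc']['location_cookie'] <= current_cookie]
        ahead.sort(key=lambda r: r['fc']['location_cookie'])
        behind.sort(key=lambda r: r['fc']['location_cookie'])
        return ahead + behind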

    Once a user request comes in, the Library Manager tries to pick the next available (marked as "idle") Mover and send a "summon" message to it. The purpose of this message is to cause the Mover to send a Mover request to the Library Manager; Mover requests are described in the next section. The mechanism for selecting a particular Mover allows control of some error conditions and implementation of retry logic. For this purpose there is a dynamic list of volumes on which write or read requests have failed - the Suspected Volumes List. It is keyed by the volume external label and contains sublists of Movers on which requests for this volume failed. This tells the Library Manager not to use the same Mover when the user retries a request. When the Library Manager "summons" a Mover, it changes the Mover's state in the Mover List to "summoned" and puts it into the Summoned Movers List. Every time the Library Manager sends a message, a timeout handler is invoked if a response does not arrive before the timeout expires. The timeout handler retries the "summon" of the Mover whose timeout has expired and eventually removes from the Mover List any Mover for which the "summon" retries are exhausted.
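
    The Suspected Volumes List can be pictured as a list of entries keyed by external label (compare the get_suspect_vols output in 1.3.3.3). A sketch of the Mover-selection check, under that assumed structure:

    # Sketch: never summon a Mover on which this volume has already
    # failed. Entries look like {'external_label': ..., 'movers': [...]}.
    def eligible_movers(idle_movers, label, suspect_vols):
        failed = []
        for entry in suspect_vols:
            if entry['external_label'] == label:
                failed = entry['movers']
        return [m for m in idle_movers if m not in failed]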

    1.3.3.2 Movers' Requests

    The Enstore system keeps unassigned read and write requests in a queue of unallocated (pending) work in the Library Manager. Once a request for the next work comes from a Mover ("idle" or "have bound volume: idle"), the Library Manager tries to change the "at_mover" volume state to "mounting" and, if it succeeds, puts the request in the "work at mover" queue and responds to the Mover with the appropriate ticket. The reason for this is to track the volumes for scheduling: the Library Manager must not submit to a Mover a request for a volume which is already in use by another Mover. It is the Mover, and not the Library Manager, which completes the requests. The two Library Manager request queues are:

    • pending work
    • work at a Mover.

    It is important to keep these queues consistent. Volume and reading errors are handled in the Mover and partially in the Library Manager.

    Movers seek to transport data between media and users over a TCP socket. When "summoned", or having completed work, Movers contact the Library Managers seeking work. If the Library Manager has work, it sends a corresponding ticket to the Mover, which in turn mounts the volume if necessary and transfers the data between user and media. When the Mover completes some work, it sends the Library Manager a request for more work; if the reply is that there is no more work for it, it dismounts the volume. A Mover may also decide to dismount a volume unilaterally because it ran into trouble, but it actually does so only after receiving a no_work reply from the Library Manager. Library Manager - Mover communications are shown in the tables below.

    Library Manager sends Mover sends
    summon idle - ready to do work;
    or have bound volume: busy - doing work;
    or have bound volume: idle - volume is mounted but there is no work

    Mover sends Library Manager may respond
    idle_mover read/write - if work needs to be done;
    or no_work
    have_bound_volume read/write - if reads/writes are pending for the volume;
    or unbind_volume - if there is no work
    unilateral_unbind no_work

    Library Manager has just responded Mover sends Library Manager presumes
    read or write idle_mover the Mover crashed and was re-started
    read or write have_bound_volume look for work on that volume; if work, give it; if none, unbind_volume
    read or write unilateral_unbind update the Suspected Volumes List and respond with no_work
    no_work (acknowledged a unilateral unbind or an idle Mover) idle_mover the Mover is available for work; if more work is available, bind a volume
    no_work (acknowledged a unilateral unbind or an idle Mover) have_bound_volume the Mover has restarted and holds a volume from a previous instance of the Library Manager; tell it to unbind
    no_work (acknowledged a unilateral unbind or an idle Mover) unilateral_unbind no_work
    Note that if a Mover should crash holding a volume, the worst that can happen is that the Library Manager will be unable to schedule work for that volume. If the physical library has more than one drive, the system should be able to continue servicing requests.
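
    The tables above amount to a small decision procedure. A sketch follows; the request and reply names are as in the tables, while the queue handling is simplified and the pending queue is assumed to hold tickets carrying the 'fc' sub-dictionary shown in the getwork output below.

    # Sketch of the Library Manager's reply to a Mover request.
    def reply_to_mover(request, pending, volume=None):
        if request == 'idle_mover':
            if pending:
                return pending.pop(0)      # a read_from_hsm/write_to_hsm ticket
            return 'no_work'
        if request == 'have_bound_volume':
            for i in range(len(pending)):  # more work for the mounted volume?
                if pending[i]['fc']['external_label'] == volume:
                    return pending.pop(i)
            return 'unbind_volume'         # nothing left: dismount
        if request == 'unilateral_unbind':
            # the caller also updates the Suspected Volumes List here
            return 'no_work'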

    1.3.3.3 Library Manager Query Commands

    The Library Manager supports queries that provide information about its internal queues. Some of these commands are general to all servers and are described elsewhere. Library Manager specific commands are:
    • getwork - returns all work requests currently in the Library Manager queues in two lists. The first is a list of pending work requests and the second is a list of work at Movers.
    • getmoverlist - returns a list of the Movers currently known to this Library Manager and their status.
    • get_suspect_vols - returns a list of suspect volumes.
    • loadmovers - (re)loads the list of Movers assigned to the specified Library Manager from the configuration file.
    • del_work - remove work from the Library Manager queue of pending work
    • change_priority - change work priority in the queue of pending work
    • get_del_dismount - get the list of work in the delayed dismount list. This list contains volume requests which have failed but are retryable; its purpose has been discussed above.
    These lists are Python dictionaries. Other programs retrieve these lists and format them for presentation. Sample output from these commands:

    getwork
    $ enstore lmc --getwork sphinxdisk.library_manager
    [{'callback_addr': ('131.225.81.23', 7600),
      'encp': {'adminpri': -1,
               'agetime': 0,
               'basepri': 1,
               'curpri': 1,
               'delayed_dismount': 0,
               'delpri': 0},
      'fc': {'bfid': '91548494800000L',
             'complete_crc': 2048910256,
             'external_label': 'flop1',
             'location_cookie': '000000063488',
             'pnfsid': '000200000000000000514A98',
             'sanity_cookie': (9045, 2048910256),
             'size': 9045},
      'lm': {'address': ('131.225.81.23', 7503)},
      'retry_cnt': 0,
      'status': ('ok', None),
      'times': {'t0': 915731751.891, 'job_queued': 915731758.979},
      'unique_id': 'sphinx.fnal.gov-915731758.238293-3793',
      'vc': {'blocksize': 512,
             'capacity_bytes': 1400000L,
             'declared': 915469931.313,
             'eod_cookie': '000000108032',
             'external_label': 'flop1',
             'file_family': 'sphinx',
             'first_access': 915469958.425,
             'last_access': 915728933.0,
             'library': 'sphinxdisk',
             'media_type': 'diskfile',
             'remaining_bytes': 1291968L,
             'status': ('ok', None),
             'sum_rd_access': 0,
             'sum_rd_err': 0,
             'sum_wr_access': 0,
             'sum_wr_err': 0,
             'system_inhibit': 'none',
             'user_inhibit': 'none',
             'wrapper': 'cpio'},
      'work': 'read_from_hsm',
      'wrapper': {'fullname':'/usr/hppc_home/moibenko/enstore_test/enstore/src/tst/
    admin_clerk_client.pyc',
                  'gid': 5440,
                  'gname': 'hppc',
                  'inode': 0,
                  'machine': ('Linux',
                              'sphinx.fnal.gov',
                              '2.0.35',
                              '#1 Thu Jul 23 14:01:04 EDT 1998',
                              'i686'),
                  'major': 0,
                  'minor': 5,
                  'mode': 33268,
                  'pnfsFilename': '/pnfs/enstore/sphinx/t1/admin_clerk_client.pyc',
                  'pstat': (33204,
                            38881944,
                            5,
                            1,
                            6849,
                            5440,
                            9045,
                            915484948,
                            915484948,
                            915485267),
                  'rmajor': 0,
                  'rminor': 0,
                  'sanity_size': 65535,
                  'size_bytes': 9045,
                  'uid': 6849,
                  'uname': 'moibenko'}},
     {'callback_addr': ('131.225.81.23', 7600),
      'encp': {'adminpri': -1,
               'agetime': 0,
               'basepri': 1,
               'curpri': 1,
               'delayed_dismount': 0,
               'delpri': 0},
      'fc': {'bfid': '91548792000000L',
             'complete_crc': -1493930591,
             'external_label': 'flop1',
             'location_cookie': '000000073216',
             'pnfsid': '000200000000000000514B40',
             'sanity_cookie': (4538, -1493930591),
             'size': 4538},
      'lm': {'address': ('131.225.81.23', 7503)},
      'retry_cnt': 0,
      'status': ('ok', None),
      'times': {'t0': 915731751.891, 'job_queued': 915731759.085},
      'unique_id': 'sphinx.fnal.gov-915731758.243841-3793',
      'vc': {'blocksize': 512,
             'capacity_bytes': 1400000L,
             'declared': 915469931.313,
             'eod_cookie': '000000108032',
             'external_label': 'flop1',
             'file_family': 'sphinx',
             'first_access': 915469958.425,
             'last_access': 915728933.0,
             'library': 'sphinxdisk',
             'media_type': 'diskfile',
             'remaining_bytes': 1291968L,
             'status': ('ok', None),
             'sum_rd_access': 0,
             'sum_rd_err': 0,
             'sum_wr_access': 0,
             'sum_wr_err': 0,
             'system_inhibit': 'none',
             'user_inhibit': 'none',
             'wrapper': 'cpio'},
      'work': 'read_from_hsm',
      'wrapper': {'fullname':'/usr/hppc_home/moibenko/enstore_test/enstore/src/tst/
    backup.py',
                  'gid': 5440,
                  'gname': 'hppc',
                  'inode': 0,
                  'machine': ('Linux',
                              'sphinx.fnal.gov',
                              '2.0.35',
                              '#1 Thu Jul 23 14:01:04 EDT 1998',
                              'i686'),
                  'major': 0,
                  'minor': 5,
                  'mode': 33268,
                  'pnfsFilename': '/pnfs/enstore/sphinx/t1/backup.py',
                  'pstat': (33204,
                            38882112,
                            5,
                            1,
                            6849,
                            5440,
                            4538,
                            915487920,
                            915487920,
                            915488239),
                  'rmajor': 0,
                  'rminor': 0,
                  'sanity_size': 65535,
                  'size_bytes': 4538,
                  'uid': 6849,
                  'uname': 'moibenko'}}]
    
    [{'callback_addr': ('131.225.81.23', 7600),
      'encp': {'adminpri': -1,
               'agetime': 0,
               'basepri': 1,
               'curpri': 1,
               'delayed_dismount': 0,
               'delpri': 0},
      'fc': {'bfid': '91548494100000L',
             'complete_crc': 1614017314,
             'external_label': 'flop1',
             'location_cookie': '000000055296',
             'pnfsid': '0002000000000000005149F8',
             'sanity_cookie': (7581, 1614017314),
             'size': 7581},
      'lm': {'address': ('131.225.81.23', 7503)},
      'mover': 'sphinxdisk.mover',
      'retry_cnt': 0,
      'status': ('ok', None),
      'times': {'in_queue': 1.77947795391,
                'lm_dequeued': 915731760.655,
                't0': 915731751.891},
      'unique_id': 'sphinx.fnal.gov-915731758.236400-3793',
      'vc': {'blocksize': 512,
             'capacity_bytes': 1400000L,
             'declared': 915469931.313,
             'eod_cookie': '000000108032',
             'external_label': 'flop1',
             'file_family': 'sphinx',
             'first_access': 915469958.425,
             'last_access': 915728933.0,
             'library': 'sphinxdisk',
             'media_type': 'diskfile',
             'remaining_bytes': 1291968L,
             'status': ('ok', None),
             'sum_rd_access': 0,
             'sum_rd_err': 0,
             'sum_wr_access': 0,
             'sum_wr_err': 0,
             'system_inhibit': 'none',
             'user_inhibit': 'none',
             'wrapper': 'cpio'},
      'work': 'read_from_hsm',
      'wrapper': {'fullname':'/usr/hppc_home/moibenko/enstore_test/enstore/src/tst/
    admin_clerk_client.py',
                  'gid': 5440,
                  'gname': 'hppc',
                  'inode': 0,
                  'machine': ('Linux',
                              'sphinx.fnal.gov',
                              '2.0.35',
                              '#1 Thu Jul 23 14:01:04 EDT 1998',
                              'i686'),
                  'major': 0,
                  'minor': 5,
                  'mode': 33268,
                  'pnfsFilename': '/pnfs/enstore/sphinx/t1/admin_clerk_client.py',
                  'pstat': (33204,
                            38881784,
                            5,
                            1,
                            6849,
                            5440,
                            7581,
                            915484942,
                            915484942,
                            915485261),
                  'rmajor': 0,
                  'rminor': 0,
                  'sanity_size': 65535,
                  'size_bytes': 7581,
                  'uid': 6849,
                  'uname': 'moibenko'}}]
    
    
    getmoverlist
    $ enstore lmc --getmoverlist sphinxdisk.library_manager
    [{'address': ('131.225.81.23', 7508),
      'last_checked': 915731762.917,
      'mover': 'sphinxdisk.mover',
      'state': 'idle_mover',
      'summon_try_cnt': 0,
      'tr_error': 'ok'}]
    
    get_suspect_vols
    $ enstore lmc --get_suspect_vols sphinxdisk.library_manager
    [{'external_label': 'flop1', 'movers': ['sphinxdisk.mover']}]
    
    loadmovers
    $ enstore lmc --loadmovers happydisk.library_manager
    {'movers': [{'address': ('131.225.84.122', 7509),
                 'external_label': 'flop1',
                 'file_family': 'happy.cpio_custom',
                 'last_checked': 922226963.097,
                 'mover': 'happydisk1.mover',
                 'state': 'idle_mover',
                 'summon_try_cnt': 0,
                 'tr_error': 'ok'},
                {'address': ('131.225.84.122', 7511),
                 'external_label': 'flop1',
                 'file_family': 'happy.cpio_custom',
                 'last_checked': 922226968.162,
                 'mover': 'happydisk3.mover',
                 'state': 'idle_mover',
                 'summon_try_cnt': 0,
                 'tr_error': 'ok'},
                {'address': ('131.225.84.122', 7513),
                 'external_label': 'flop1',
                 'file_family': 'happy.cpio_custom',
                 'last_checked': 922226922.662,
                 'mover': 'happydisk5.mover',
                 'state': 'idle_mover',
                 'summon_try_cnt': 0,
                 'tr_error': 'ok'},
                {'address': ('131.225.84.122', 7510),
                 'external_label': 'flop1',
                 'file_family': 'happy.cpio_custom',
                 'last_checked': 922226929.446,
                 'mover': 'happydisk2.mover',
                 'state': 'idle_mover',
                 'summon_try_cnt': 0,
                 'tr_error': 'ok'},
                {'address': ('131.225.84.122', 7508),
                 'external_label': 'flop1',
                 'file_family': 'happy.cpio_custom',
                 'last_checked': 922226947.063,
                 'mover': 'happydisk.mover',
                 'state': 'idle_mover',
                 'summon_try_cnt': 0,
                 'tr_error': 'ok'},
                {'address': ('131.225.84.122', 7512),
                 'external_label': 'flop1',
                 'file_family': 'happy.cpio_custom',
                 'last_checked': 922226954.994,
                 'mover': 'happydisk4.mover',
                 'state': 'idle_mover',
                 'summon_try_cnt': 0,
                 'tr_error': 'ok'}],
     'status': ('ok', None)}
    
    
    del_work
    $ enstore lmc --del_work rip6.library_manager rip8.fnal.gov-922223453.580484-30688
    ID rip8.fnal.gov-922223453.580484-30688
    {'status': ('ok', 'Work deleted')}
    
    change_priority
    no example
    
    get_del_dismount
    no example
    

    1.3.4 Mover

    A Mover task is bound to a single drive, and seeks to use that drive to service read and write requests. It communicates with the Library Manager in a defined protocol, as just described.

    The Mover is responsible for efficient data movement and as such is an integral part of the system. The architecture allows performance-critical code to be written in C, giving efficient access to fundamental OS features such as forking with minimal language overhead.

    Although a Mover is bound to a drive, a drive may serve more than one virtual library; i.e., the Mover has a dynamic list of Library Managers that it is supposed to service. This has two benefits. First, since a Library Manager handles only one type of media, a drive which handles multiple types of media (i.e. different capacity media) can be shared without a static partitioning of the system. Second, if we are partitioning resources in a library, we can assign a Library Manager to each type of use. For example, suppose Group A and Group B want to share the capacity of a library, with half the tapes belonging to Group A and the other half to Group B. We want to guarantee that Group A has one third of the tape drives, Group B has one third, and the last third is shared. The Movers can be configured to do this easily. And with some slight changes, this is how we can guarantee resources to data acquisition.

    There has been a request to duplicate (write to two tapes) critical data. This feature has been discussed but not implemented, as a specific method of implementation has not been decided upon. The following are among the possible implementations:

    • Assign two tape drives to one Mover. If the Mover has a list of two volumes in the write_to_hsm response, the Mover binds both volumes and writes to both drives. This Mover could be given a single volume for "normal" data writing.
    • Have the Library Manager summon two Movers and tell one it is a master and the other it is a slave. The master receives data from the user and, in addition to writing to its tape drive, sends the data on to the slave Mover. The slave Mover receives data from the master Mover.
    The "local mover" feature allows the mover process to read/write the user data file directly, if the file/directory is accessible and on a filesystem local to the computer on which the mover process is running. By default, this feature is enabled. It can be controlled via the "enstore" command. For example:

    enstore mvc --local_mover=0 fndaprdisk.mover

    When the Mover starts up, the 'idle_mover' request/command is sent to each Library Manager configured and the responses from the Library Managers are acted upon. After the startup, the Mover waits until it is 'summoned' by a Library Manager.

    When a Mover is summoned by a Library Manager, it will send one of three request/commands to the Library Manager that summoned the Mover:

    1. idle_mover
    2. have_bound_volume, idle
    3. have_bound_volume, busy
    When the Mover is busy, the Library Manager should respond with 'no_work.' Otherwise, the Library Manager can respond with 'no_work,' 'read_from_hsm,' or 'write_to_hsm.' When the Mover receives 'read_from_hsm' or 'write_to_hsm' it forks a subprocess which handles the transfer, utilizing a shared memory buffer for both data transfer and communication with the parent process. The main Mover process can read shared memory locations to get the status of the transfer. This design will allow for easy implementation of DESY's "slow user network abort" feature: the parent process can watch the transfer processes and determine (at some point early in the transfer) if the tape drive is being starved or throttled such that the tape drive resource is being abused. The first thing the transfer process does is check to see if the waiting encp is responsive. This involves contacting encp on the designated TCP control port and sending along the TCP port designation for the data transfer. If encp is responsive, the Mover proceeds with making sure the proper volume is loaded in its tape drive.
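
    A rough sketch of the fork-plus-shared-memory structure follows. It is an illustration only: mmap stands in for whatever shared memory mechanism the Mover actually uses, and the buffer layout is hypothetical.

    import mmap, os

    # Parent forks a transfer child; both share an anonymous mmap'd
    # buffer (inherited across fork). The child streams data through
    # the buffer while the parent polls status words in it.
    BUF_SIZE = 4 * 1024 * 1024           # 4 MB, as in the read/write steps
    buf = mmap.mmap(-1, BUF_SIZE)        # anonymous shared memory

    pid = os.fork()
    if pid == 0:                         # child: do the transfer
        # ... read from the user socket or tape, crc, fill buf ...
        os._exit(0)
    else:                                # parent: watch the transfer
        # ... read status from buf to detect a starved/throttled drive ...
        os.waitpid(pid, 0)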

    Reads -- Once a volume is bound the Mover may read a volume and send data to a waiting encp program. The steps are:

    1. Using the file_location_cookie, space to beginning of data.
    2. Read any wrappering information that precedes the actual data.
    3. Fork a process that reads and crc's the data from the volume verifying the sanity crc and placing the data in a 4 MB shared memory buffer.
    4. Write data from the shared memory to the user.
    5. Read any wrappering information that comes after the data.
    6. Close the data port.
    7. Tell the user done and all is well.
    8. Close the control port.

    Writes -- Once a volume is bound the Mover may receive data and write it to the volume. The steps are:

    1. Mark the volume as "writing". That will cause the volume to not be selected for subsequent writes, should we crash.
    2. Using the eod_space_cookie, space to end of volume. Try to verify that we are actually at the end of volume.
    3. Write any wrappering information that precedes the data.
    4. Fork a process that reads and crc's data from the user calculating the sanity crc and placing the data in a 4 MB shared memory buffer.
    5. Write data from the shared memory to the tape device.
    6. Close the data port.
    7. Write any wrappering information after the data.
    8. Compute new eod_cookie and tell Volume Clerk that the volume is writable. Update remaining bytes as well.
    9. Compute the file location cookie, and tell the File Clerk about the new file. Get a bit file ID in return.
    10. Give the bit file ID to encp. We are done.
    If any errors occur while reading or writing a volume, an attempt is made to characterize them as either media or drive errors. Depending upon the error, the Mover will issue either have_bound_volume or unilateral_unbind to the Library Manager. This is discussed more completely in the section on error control. If the user drops the control TCP channel unilaterally, the Mover assumes the user has aborted the transfer. If all is well with the entire transfer, the Mover issues a have_bound_volume to the Library Manager and waits for further instructions.

    1.3.4.1 Command Line Control of the Mover

    The Mover is started and controlled through a command line interface using enstore. Option commands for "enstore mvc [--option_command]" include the general option commands supported by all other servers. Additionally, the "status" option command is supported and produces the following:

    $ enstore mvc --status fndaprdisk.mover
    {'wr_bytes': 6398896, 'rd_bytes': 6398896, 'no_xfers': 11, 'mode': 'w', 'bytes_to_xfer': 6398896, 'crc_func': '', 'state': 'idle', 'status': ('ok', None)}
    where

    wr_bytes
    the number of bytes written to the tape device when 'mode' is 'w' else the number of bytes written to the user device.
    rd_bytes
    the number of bytes read from the user device when 'mode' is 'w', else the number of bytes read from the tape device.
    no_xfers
    the number of completed transfers.
    mode
    'r' for reading from HSM, 'w' for writing to HSM.
    bytes_to_xfer
    the number of bytes to be transferred for the current transfer if state is 'busy', or for the last transfer if state is 'idle'.
    crc_func
    active crc function.
    state
    'idle' if no transfer active, else 'busy.'
    status
    should always be: ('ok', None)

    1.3.4.2 Mover Config File Values

    DICTIONARY ELEMENT DEFINITION DEFAULT EXAMPLE VALUE
    host node where Mover runs   hppc
    port UDP port for Mover communication   7516
    logname ascii value used for id in messages to Log Server   FMOV
    library list of libraries that the Mover will contact when it starts up.   ['fndaprdisk.library_manager']
    media_changer the name of the Media Changer server that will be communicated with in order to load and unload tape cartridges.   'fndaprdisk.media_changer'
    mc_device a device name or number to include with communications with the Media Changer.   1
    do_eject used when testing stand alone tape drive (no robot). 'yes' 'no'
    driver the HSM driver   'FTTDriver'
    device device name used for driver device access. **   '/dev/rmt/tps2d2n' (make sure this is a no-rewind device)
    norestart do not restart this server if it crashes do a restart

    ** The mover process must have read and write access to the tape pseudo devices.

    Linux tape devices are called /dev/nstX by default, where X is the Xth tape device found on the system. X can change if devices are added or removed on the bus. A script in the FTT product, etc/mkscsidev.Linux, creates the files /dev/rmt/tpsNdMn, where N is the bus number and M is the SCSI id of the device. N and M do not change as devices come and go (unless SCSI ids or controllers are changed), and so the Enstore configuration files do not need to change. $FTT_DIR/etc/mkscsidev.Linux should be run at boot time, normally via /etc/rc.d/rc.local. A sample rc.local:

    echo "Making scsi tape devices"
    . /usr/local/etc/setups.sh
    setup ftt
    $FTT_DIR/etc/mkscsidev.Linux
    
    chmod 0666 /dev/rmt/*
    chmod 0666 /dev/sc/*
    


    1.3.5 Configuration Server

    The Configuration Server maintains and distributes all information about system configuration, such as the location and parameters of each server. Upon startup, each server asks the Configuration Server for the information pertaining to itself (e.g. the location of any other server with which to communicate). New configurations can be loaded into the Configuration Server without disturbing the current running system. Configurations are stored in a file called the Enstore configuration file in Python dictionary format. An example of this file is given below:

    configdict['blocksizes'] = { 'diskfile'  : 512, \
                                 'redwood'   : 131072, \
                                 'floppy'    : 512, \
                                 'cassette'  : 512, \
                                 'cartridge' : 512, \
                                 'exabyte'   : 131072, \
                                 '8MM'       : 131072, \
                                 'DECDLT'    : 131072 }
    
    configdict['file_clerk']   = { 'host':'rip6', 'port':7501, 'logname':'FILSRV' }
    configdict['volume_clerk'] = { 'host':'rip6', 'port':7502, 'logname':'VOLSRV' }
    configdict['alarm_server'] = { 'host':'rip10', 'logname':'ALMSRV', \
                                   'port'  : 7503 }
    
    configdict['log_server']   = { 'host':'rip6', 'port':7504, \
                                   'log_file_path':'/rip6a/enstore/log' }
    configdict['database']     = { 'db_dir':'/rip6a/enstore/db' }
    configdict['backup']       = { 'host':'rip6', 'dir':'/rip6a/enstore/db_backup'}
    
    configdict['inquisitor']   = { 'host':'rip6', 'port':7505, 'logname':'INQSRV', \
                                   'timeout':10, 'alive_rcv_timeout': 5, \
                                   'alive_retries':1, \
                                   'ascii_file':'/rip6a/enstore/inquisitor/', \
                                   'html_file':'/fnal/ups/prd/www_pages/enstore/', \
                                   'default_server_timeout': 15, \
                                   'timeouts' : { 'ait.library_manager': 15} }
    
    configdict['rip6.library_manager']  = { 'host':'rip5', 'port':7506, \
                                            'logname':'RP6LBM' }
    configdict['dlt.library_manager']   = { 'host':'rip5', 'port':7509, \
                                            'logname':'DLTLBM' }
    configdict['rip6.media_changer']    = { 'host':'rip6',  'port':7512, \
                                            'logname':'R6MC  ', \
                                            'type':'RDD_MediaLoader'  }
    configdict['de13.media_changer']    = { 'host':'rip10', 'port':7517, \
                                            'logname':'DE13MC', \
                                            'type':'EMASS_MediaLoader' }
    configdict['rip6.mover']    = { 'host':'rip6', 'port':7525, 'logname':'R6MOV ', \
                                    'library':'rip6.library_manager', \
                                    'device':'/rip6a/rip6/rip6.fake', \
                                    'driver':'RawDiskDriver', \
                                    'mc_device':'-1', \
                                    'media_changer':'rip6.media_changer' }
    configdict['DE13DLT.mover'] = { 'host':'rip1', 'port':7526, 'logname':'DE13MV', \
                                    'library':'dlt.library_manager', \
                                    'device':'/dev/rmt/tps2d1n', \
                                    'driver':'FTTDriver', \
                                    'mc_device':'DE13', \
                                    'media_changer':'de13.media_changer' }
    

    The keys/values used in the above example are typical of a running system. The blocksizes dictionary element specifies the size of a block on the different devices known to the system. The database dictionary element specifies where the Enstore database files are located. The backup dictionary element specifies the node and directory of where the database backups will go.

    Please see the individual server sections for more in depth descriptions of all the server keywords.
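
    Because the configuration file is itself a Python fragment that fills in configdict, a lookup can be sketched as below. This is an illustration only: the file path is hypothetical, the sketch uses Python 2's execfile, and real servers obtain their entries from the Configuration Server rather than reading the file directly.

    # Sketch: load an Enstore configuration file and look up one entry.
    configdict = {}
    execfile('/path/to/enstore_config.py', {'configdict': configdict})

    mover = configdict['rip6.mover']
    print mover['device'], mover['driver']   # /rip6a/rip6/rip6.fake RawDiskDriver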

    1.3.5.1 Command Line Control of the Configuration Server

    Configuration Server functionality may be controlled through a command line interface using enstore. A summary of the supported commands is given below. In addition to the following commands, the Configuration Server command line interface supports the general commands supported by all other servers.

    FUNCTION COMMAND OUTPUT
    load the specified Enstore config file into the configuration server enstore cc --config_file=/path/to/config_file --load  
    output the currently loaded Enstore configuration file enstore cc --dict (same as the example in the previous section)
    output the keys in the currently loaded Enstore configuration file enstore cc --get_keys
    ['DE13DLT.mover',
     'alarm_server',
     'backup',
     'blocksizes',
     'database',
     'de13.media_changer',
     'dlt.library_manager',
     'file_clerk',
     'inquisitor',
     'log_server',
     'rip6.library_manager',
     'rip6.media_changer',
     'rip6.mover',
     'volume_clerk']
    


    1.3.6 Log Server

    The Log Server receives messages from other processes and logs them into formatted log files. Basically, these messages are transactional records. Log files are labeled by dates. At midnight each day, the currently opened log file gets closed and another one is opened. Below is an excerpt from the log file:
    10:03:42 sphinx.fnal.gov 006849 moibenko I FILC  File Clerk (re)starting
    10:03:46 sphinx.fnal.gov 006849 moibenko I HLIBM  Library Manager sphinxdisk.library_manager (re)starting
    10:03:50 sphinx.fnal.gov 006849 moibenko I HMC  Media Changer sphinxdisk.media_changer (re)starting
    10:03:55 sphinx.fnal.gov 006849 moibenko I HMOV  Mover starting - contacting libman
    10:03:59 sphinx.fnal.gov 006849 moibenko I ADMC  Admin Clerk (re)starting
    10:09:34 sphinx.fnal.gov 006849 moibenko I HLIBM  read Q'd /pnfs/enstore/sphinx/ut1/mover.py -> ........
    10:09:34 sphinx.fnal.gov 006849 moibenko I HLIBM  read_from_hsm work on vol=flop1 ..........
    10:09:34 sphinx.fnal.gov 006849 moibenko I HMOV  READ_FROM_HSM start{'times': ..........
    10:09:35 sphinx.fnal.gov 006849 moibenko I HMOV  Performing precautionary offline/eject.........
    10:09:35 sphinx.fnal.gov 006849 moibenko I HMOV  Completed  precautionary offline/eject.......
    10:09:35 sphinx.fnal.gov 006849 moibenko I HMOV  Requesting media changer load {' ............
    10:09:35 sphinx.fnal.gov 006849 moibenko I HMOV  Media changer load status('ok', None)
    10:09:35 sphinx.fnal.gov 006849 moibenko I HMOV  Requesting software mount flop1 ........
    10:09:35 sphinx.fnal.gov 006849 moibenko I HMOV  Software mount complete flop1 ........
    10:09:35 sphinx.fnal.gov 006849 moibenko I HMOV  WRAPPER.READ........
    10:09:35 sphinx.fnal.gov 006849 moibenko I HMOV  READ DONE{'unique_id':  .............
    
    Fields in a log file are:
    • time
    • node name
    • user id
    • user name
    • severity indicator (I - information, E - Error)
    • client abbreviation
    • message
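
    Given that fixed field order, a log line can be taken apart as in the following sketch (the message itself may contain spaces, so only the first six fields are split off):

    # Sketch: split one log line into the fields listed above.
    def parse_log_line(line):
        time, node, uid, user, severity, client, message = line.split(None, 6)
        return {'time': time, 'node': node, 'uid': uid, 'user': user,
                'severity': severity,      # I = information, E = error
                'client': client, 'message': message}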

    1.3.7 Media Changer

    The Media Changer mounts media into, and dismounts media from, a drive according to requests from the Mover. One Media Changer can serve multiple drives and libraries. When the drives are in the robot, the Media Changer is the interface to the robotic software.

    The Media Changer issues multiple simultaneous commands by forking processes that do the work. A Media Changer parameter, MAXWORK, limits the maximum number of simultaneous outstanding operations. If the Media Changer receives mount/dismount requests while there are MAXWORK unfinished operations then the new operations are ignored, the Mover request will time out, and the Mover will reissue the mount/dismount request.

    The reason for the MAXWORK parameter is that when an operation on the EMASS robot takes on the order of ten minutes, the robot reports a timeout failure even though it eventually finishes the operation. The MAXWORK parameter can be set to 0 when it is necessary to perform work on a robot.
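
    A sketch of the MAXWORK gate follows; the process bookkeeping is simplified and the limit value is hypothetical.

    import os

    MAXWORK = 10              # hypothetical limit; 0 blocks all robot work
    outstanding = []          # pids of forked mount/dismount operations

    # Refuse a new mount/dismount while MAXWORK operations are still
    # outstanding; the Mover's request then times out and is reissued.
    def handle_request(do_operation):
        outstanding[:] = [p for p in outstanding
                          if os.waitpid(p, os.WNOHANG)[0] == 0]
        if len(outstanding) >= MAXWORK:
            return None                   # ignored; the Mover will retry
        pid = os.fork()
        if pid == 0:
            do_operation()                # talk to the robot
            os._exit(0)
        outstanding.append(pid)
        return pid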

    The Media Changer returns three status values:

    • A canonical translation of the underlying status, with the values: ok, TAPE, DRIVE, BAD
    • The status returned by the underlying agent
    • A text description returned by the underlying agent

    The Media Changer and the Media Changer Client support the following requests:

    • maxwork=<max simultaneous operations>
    • getwork

    The Media Changer mounting agents:

    • EMASS/Grau robot
    • STK robot
    • null Media Changer used by the disk Movers and stand alone tape drives
    • OCS operator assisted mounts - to be implemented

    Tape Cleaning

    The Media Changer is not directly involved with tape cleaning. The EMASS AMU and the STK ACSLS tape library systems keep tape drive usage statistics and automatically mount cleaning tapes. The Media Changer will not issue mount requests during the cleaning process.

    Tape statistics

    The Media Changer does not keep tape drive or cartridge statistics: summary statistics are not very useful, and the Media Changer does not run on the machine connected to the tape drive. The overall tape and drive statistics repository is OCS, and an Enstore interface to it has not yet been designed.

    Enstore writes detailed error statistics to its log when a file is closed. A separate mount/dismount log can easily be extracted from the main log.


    1.3.8 Inquisitor

    The Inquisitor obtains information from the Enstore system and uses it to create reports (see the table below and the example reports in 1.3.8.3).
    The reports are updated periodically, based on timeout values in the Enstore config file directing the Inquisitor to gather each server's information at a specific frequency. Each Enstore server may have its own unique timeout value. For example, the Inquisitor may be instructed to gather information from the file_clerk every 60 seconds but from the log_server every 135 seconds. The plots, however, are not updated automatically; they may be updated by a user-initiated command or by a cron job, for example. The information for plotting is obtained from the log files.

    In addition to the above reports, the Inquisitor will make available on the web the contents of the configuration file, all current Enstore log files, and any additional log files useful to the user.

    The Inquisitor will listen for command line requests sent to it and will periodically check to see if it is time to update information for any of the servers that it is monitoring. If so, the server in question is contacted and the resulting information is formatted for output to the various reports. The information gathered from each of the servers, and the report it ends up in, are listed below. In addition to gathering information from each Enstore server, the Inquisitor will collect information from the log files on encp commands and report on the blocksizes set in the Enstore config file.

    SERVER INFORMATION GATHERED REPORTS AFFECTED
    (config file) blocksizes continuous ascii status file and html snapshot file
    (log files) encp command history continuous ascii status file and encp html snapshot file
    Alarm Server alive status continuous ascii status file and html snapshot file
    Configuration Server alive status continuous ascii status file and html snapshot file
    File Clerk alive status continuous ascii status file and html snapshot file
    Inquisitor alive status; refetch config file from config server continuous ascii status file and html snapshot file
    Library Manager(s) alive status; suspect volume list; mover list; work queues continuous ascii status file and html snapshot file
    Log Server alive status continuous ascii status file and html snapshot file
    Media Changer(s) alive status continuous ascii status file and html snapshot file
    Mover(s) alive status; Mover activity status continuous ascii status file and html snapshot file
    Volume Clerk alive status continuous ascii status file and html snapshot file

    Since the Inquisitor requests a new config file from the config_server periodically, it is possible to dynamically change the way information is displayed and the type of information that is displayed without restarting the Inquisitor.

    1.3.8.1 Command Line Control of the Inquisitor

    Inquisitor functionality may be controlled through a command line interface using enstore. A summary of the supported commands is given below. In addition to the following commands, the Inquisitor command line interface supports the general commands supported by all other servers. Any particular server name mentioned in the table below may be replaced by any legal server name.

    FUNCTION COMMAND OUTPUT
    get the maximum size of the ascii status file enstore ic --get_max_ascii_size maximum ascii size
    get the maximum number of encp status lines displayed enstore ic --get_max_encp_lines maximum number of encp lines
    get the html status file auto refresh rate enstore ic --get_refresh refresh time
    get the frequency for monitoring the Volume Clerk enstore ic --get_timeout volume_clerk volume_clerk timeout value
    get the frequency for looking for work enstore ic --get_timeout Inquisitor wakeup time
    reset the maximum size of the ascii status file enstore ic --max_ascii_size=40000  
    reset the maximum number of encp status lines displayed enstore ic --max_encp_lines=13  
    recreate the Inquisitor plots enstore ic --plot  
    recreate the Inquisitor plots, keep the data files and put them in /tmp. enstore ic --plot --keep --keep_dir=/tmp  
    recreate the Inquisitor plots and put the plot files in /tmp. enstore ic --plot --out_dir=/tmp  
    recreate the Inquisitor plots and use the log files located in the specified directory enstore ic --plot --logfile_dir=/tmp/logs  
    recreate the Inquisitor plots and only plot information after the specified start_time enstore ic --plot --start_time=1998-12-25  
    recreate the Inquisitor plots and only plot information before the specified stop_time enstore ic --plot --stop_time=1998-12-31  
    recreate the Inquisitor plots and only plot information between the specified times enstore ic --plot --start_time=1998-12-01 --stop_time=1998-12-31  
    reset the html status file auto refresh rate enstore ic --refresh=60  
    reset the frequency for monitoring the alarm_server to the value in the config file enstore ic --reset_timeout alarm_server  
    reset the frequency for looking for work to the value in the config file enstore ic --reset_timeout  
    reset the frequency for monitoring the file_clerk enstore ic --timeout=55 file_clerk  
    reset the frequency for looking for work enstore ic --timeout=10  
    close the current ascii status file and open a new one enstore ic --timestamp  
    monitor the Log Server now enstore ic --update log_server  
    monitor all the servers now enstore ic --update  

    1.3.8.2 Inquisitor Config File Values

    The Inquisitor looks for the following values in the Inquisitor section of the Enstore config file. The default value is used if the dictionary element is not found. Dictionary elements with no default must be specified in the Enstore config file. All frequencies are specified in seconds.

    DICTIONARY ELEMENT DEFINITION DEFAULT
    alive_rcv_timeout seconds to wait for response to alive request 5
    alive_retries times to retry alive request 2
    ascii_file directory for ascii status file(s) ./
    default_server_timeout frequency to monitor servers not listed in timeouts 60
    host node where Inquisitor runs  
    html_file directory for html status files ./
    logname ascii value used for id in messages to Log Server INQS
    max_ascii_size maximum allowed size (bytes) of ascii status file  
    max_encp_lines maximum number of encp lines to display 50
    port udp port for Inquisitor communication  
    refresh frequency for auto-refresh of html status page 120
    robot_adic_log_dir location of adic log files to point to in the Inquisitor log page (NOTE: replace 'adic' with other text to add a link to a different log directory)  
    timeout frequency that Inquisitor looks for work 5
    timeouts dictionary of frequencies for monitoring each server  

    In addition to the information listed above, the Inquisitor will look for the inq_timeout dictionary element in each of the individual server sections. If present, the value of this dictionary element will be used to specify the timeout frequency for monitoring this server. This is the same as if the timeouts dictionary element mentioned above contained a dictionary element for the particular server. For example, in order to monitor the file_clerk every 65 seconds, the Enstore config file must have one of the following in it:

    • in the Inquisitor section - a dictionary element, within the timeouts dictionary element, for the File Clerk set to 65
    • in the file_clerk section - the dictionary element inq_timeout set to 65
    The value in the individual server dictionary element will take precedence over the value in the Inquisitor dictionary element.

    In order to block monitoring of a particular server, set its timeout value to -1.
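
    These precedence rules amount to the following lookup (a sketch; dictionary names as in the examples in this section):

    # Sketch: effective monitoring timeout for one server. The server's
    # own inq_timeout wins, then the Inquisitor's timeouts dictionary,
    # then default_server_timeout; -1 blocks monitoring.
    def effective_timeout(server_name, configdict):
        server_conf = configdict.get(server_name, {})
        if server_conf.has_key('inq_timeout'):
            return server_conf['inq_timeout']
        inq = configdict['inquisitor']
        timeouts = inq.get('timeouts', {})
        if timeouts.has_key(server_name):
            return timeouts[server_name]
        return inq.get('default_server_timeout', 60)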

    An example Inquisitor dictionary element is given below:

    configdict['inquisitor'] = { 'alive_rcv_timeout'  : 5,
                                 'alive_retries'  : 1,
                                 'ascii_file'  : '/tmp',
                                 'default_server_timeout'  : 15,
                                 'host'  : 'rip7',
                                 'html_file'  : '/fnal/ups/prd/www_pages/enstore/',
                                 'http_log_file_path'  : '/enstore/log/',
                                 'logname'  : 'INQSRV',
                                 'max_ascii_size'  : 100000000,
                                 'port'  : 7505,
                                 'robot_adic_log_dir'  : '/enstore/adiclog/',
                                 'timeout'  : 10,
                                 'timeouts'  : {'ait.library_manager': 15},
                                 'www_host'  : 'http://rip8.fnal.gov:' }
    

    1.3.8.3 Example Inquisitor Reports

    These examples reflect a running system on the rip cluster.

    1.3.8.3.1 Example Ascii Status File

    This file records a continuous history of the status of the Enstore system as monitored by the Inquisitor. It contains the following information:
    • block size information as recorded in the Enstore config file
    • alive status for each server including node, port, and time
    • Library Manager specific information
      • suspect volumes
      • list of known Movers, their ports, state, last time they were summoned and number of attempts to summon them
      • work queue including:
        • assigned Mover
        • node, node type, and port where Mover is located
        • work that Mover is doing
        • device label
        • file family and file family width
        • priorities of the work
        • associated times
      • pending work queue including:
        • node, node type, and port where work originated
        • work to be done
        • file family and file family width
        • priorities of the work
        • associated times
    • Mover specific information
      • number of completed transfers
      • current state of the Mover
      • number of bytes read and written on the last transfer (if idle)
      • number of bytes read and written so far on the current transfer (if working)
    ENSTORE SYSTEM STATUS
    DC03MAM.mover : timed out on (rip1, 7552) at 1999-May-27 13:50:34
                    last alive at ----
    DC04MAM.mover : timed out on (rip1, 7553) at 1999-May-27 13:50:34
                    last alive at ----
    DC05MAM.mover : timed out on (rip1, 7554) at 1999-May-27 13:50:34
                    last alive at ----
    DC06MAM.mover : timed out on (rip1, 7555) at 1999-May-27 13:50:34
                    last alive at ----
    DM07AIT.mover : timed out on (ripsgi, 7556) at 1999-May-27 13:50:34
                    last alive at ----
    DM08AIT.mover : timed out on (ripsgi, 7557) at 1999-May-27 13:50:34
                    last alive at ----
    DM09AIT.mover : timed out on (ripsgi, 7558) at 1999-May-27 13:50:34
                    last alive at ----
    DM10AIT.mover : timed out on (ripsgi, 7559) at 1999-May-27 13:50:34
                    last alive at ----
    DM11AIT.mover : timed out on (ripsgi, 7560) at 1999-May-27 13:50:34
                    last alive at ----
    DM12AIT.mover : timed out on (ripsgi, 7561) at 1999-May-27 13:50:34
                    last alive at ----
    adicr1.media_changer : alive on (rip10, 7521) at 1999-May-27 13:50:34
    adicr1TOM.media_changer : alive on (rip10, 9521) at 1999-May-27 13:50:34
    ait.library_manager : alive on (rip5, 7512) at 1999-May-27 13:50:34
    
        SUSPECT VOLUMES : NONE
    
        KNOWN MOVER           PORT    STATE         LAST SUMMONED        TRY COUNT
        DM12AIT.mover         7561    idle_mover    1999-May-26 00:19:39    0  
        DM08AIT.mover         7557    idle_mover    1999-May-26 00:19:39    0  
        DM11AIT.mover         7560    idle_mover    1999-May-26 00:19:39    0  
        DM07AIT.mover         7556    idle_mover    1999-May-26 00:19:39    0  
        DM10AIT.mover         7559    idle_mover    1999-May-26 00:19:39    0  
        DM09AIT.mover         7558    idle_mover    1999-May-26 00:19:39    0  
    
        No work at movers
        No pending work
    
    alarm server    : alive on (rip10, 7503) at 1999-May-27 13:50:34
    blocksizes      : diskfile : 512,  exabyte : 102400,  DECDLT : 102400,
                      floppy : 512,  cartridge : 512,  redwood : 102400,
                      cassette : 512,  8MM : 102400
    config server   : alive on (131.225.164.14, 7500) at 1999-May-27 13:50:34
    disk.library_manager : alive on (rip7, 7510) at 1999-May-27 13:50:34
    
        SUSPECT VOLUMES : NONE
    
        KNOWN MOVER           PORT    STATE         LAST SUMMONED        TRY COUNT
        disk1.mover           7530    idle_mover    1999-May-26 16:23:33    0  
        disk2.mover           7531    idle_mover    1999-May-26 16:23:32    0  
    
        No work at movers
        No pending work
    
    disk.media_changer : alive on (rip7, 7520) at 1999-May-27 13:50:34
    disk1.mover : alive on (rip7, 7530) at 1999-May-27 13:50:34
    
        Completed Transfers : 0,  Current State : idle 
        Last Transfer :  Read 0 bytes,  Wrote 0 bytes
    
    disk2.mover : alive on (rip7, 7531) at 1999-May-27 13:50:34
    
        Completed Transfers : 0,  Current State : idle 
        Last Transfer :  Read 0 bytes,  Wrote 0 bytes
    
    dlt.library_manager : alive on (rip5, 7514) at 1999-May-27 13:50:34
    
        SUSPECT VOLUMES : NONE
    
        No moverlist
        No work at movers
        No pending work
    
    encp            : 15:44:11 on rip4.fnal.gov by bakken (Data Transfer Rate : 2.62 MB/S)
                         1073741824 bytes copied to CA2252 at a user rate of 2.08 MB/S
                      15:43:39 on rip4.fnal.gov by bakken (Data Transfer Rate : 2.62 MB/S)
                         1073741824 bytes copied to CA2258 at a user rate of 1.88 MB/S
                      15:40:36 on rip4.fnal.gov by bakken (Data Transfer Rate : 2.68 MB/S)
                         1073741824 bytes copied to CA2257 at a user rate of 1.79 MB/S
                      15:34:30 on rip4.fnal.gov by bakken (Data Transfer Rate : 2.63 MB/S)
                         1073741824 bytes copied to CA2252 at a user rate of 2.11 MB/S
                      15:34:00 on rip4.fnal.gov by bakken (Data Transfer Rate : 2.64 MB/S)
                         1073741824 bytes copied to CA2258 at a user rate of 2.03 MB/S
                      15:30:57 on rip4.fnal.gov by bakken (Data Transfer Rate : 2.67 MB/S)
                         1073741824 bytes copied to CA2257 at a user rate of 1.84 MB/S
    
    file clerk      : alive on (rip6, 7501) at 1999-May-27 13:50:34
    inquisitor      : alive on (rip7, 7505) at 1999-May-27 13:50:34
    log server      : alive on (rip10, 7504) at 1999-May-27 13:50:34
    mam.library_manager : alive on (rip5, 7513) at 1999-May-27 13:50:34
    
        SUSPECT VOLUMES : NONE
    
        KNOWN MOVER           PORT    STATE         LAST SUMMONED        TRY COUNT
        DC05MAM.mover         7554    idle_mover    1999-May-26 00:19:39    0  
        DC03MAM.mover         7552    idle_mover    1999-May-26 00:19:39    0  
        DC04MAM.mover         7553    idle_mover    1999-May-26 00:19:39    0  
        DC06MAM.mover         7555    idle_mover    1999-May-26 00:19:39    0  
    
        No work at movers
        No pending work
    
    null.library_manager : alive on (rip7, 7511) at 1999-May-27 13:50:34
    
        SUSPECT VOLUMES : NONE
    
        KNOWN MOVER           PORT    STATE         LAST SUMMONED        TRY COUNT
        null2.mover           7533    idle_mover    1999-May-26 16:23:33    0  
        null1.mover           7532    idle_mover    1999-May-26 16:23:33    0  
    
        No work at movers
        No pending work
    
    null1.mover : alive on (rip7, 7532) at 1999-May-27 13:50:34
    
        Completed Transfers : 0,  Current State : idle 
        Last Transfer :  Read 0 bytes,  Wrote 0 bytes
    
    null2.mover : alive on (rip7, 7533) at 1999-May-27 13:50:34
    
        Completed Transfers : 0,  Current State : idle 
        Last Transfer :  Read 0 bytes,  Wrote 0 bytes
    
    volume clerk    : alive on (rip6, 7502) at 1999-May-27 13:50:34
    
    
    

    1.3.8.3.2 Example Html Status Snapshot File

    The html snapshot file contains the last known status of the Enstore system. It repeats the last set of information in the ascii status file, formatted for browsing and without the encp information.

    1.3.8.3.3 Example encp History Snapshot File

    Each encp history line contains the following information:
    • end of transfer time
    • node of encp process
    • user running encp
    • number of bytes transferred
    • volume name
    • data transfer rate (MB/s)
    • user rate of transfer

    Enstore Status

    
    ENSTORE SYSTEM STATUS
    

    History of ENCP Commands
    TIME NODE USER BYTES VOLUME DATA TRANSFER RATE (MB/S) USER RATE (MB/S)
    15:19:22 rip8.fnal.gov moibenko 21036 rip6-01 2.47 0.04
    15:18:59 rip8.fnal.gov moibenko 21036 rip6-01 0.664 0.0322
    12:57:19 rip4.fnal.gov bakken 1048576 CA2904 0.698 0.00589
    12:57:04 rip8.fnal.gov bakken 1048576 CA2903 0.703 0.00589
    12:53:57 rip4.fnal.gov bakken 1073741824 CA2905 2.7 2.06
    12:53:41 rip8.fnal.gov bakken 1073741824 CA2903 2.7 1.88
    12:52:31 rip8.fnal.gov bakken 1048576 CA2904 0.711 0.0058
    12:49:51 rip8.fnal.gov bakken 104857600 CA2902 2.38 0.496

    1.3.8.3.4 Example Individual Transfer Activity Plot

    This plot shows the history of individual transfers (and their size) over a specified time interval. This includes both reads and writes.

    (also available in Postscript)

    1.3.8.3.5 Example Bytes Transferred/Day Plot

    This plot shows the number of bytes transferred per day over a specified time interval. This includes both reads and writes.

    (also available in Postscript)

    1.3.8.3.6 Example Mounts Per Hour Plot

    This plot shows the number of mounts per hour for a single day.

    (also available in Postscript)

    1.3.8.3.7 Example Mount Latency Plot

    This plot shows mount latencies.

    (also available in Postscript)


    1.3.9 Alarm Server

    The Alarm Server maintains a record of alarms raised by other servers. Since Enstore attempts error recovery whenever possible, it is expected that raised alarms will need human intervention to correct the problem. Currently, alarms are raised when the following conditions are detected -
    • A server has died and it is specified in the configuration file that it should not be restarted.
    • A server has died and the inquisitor was unsuccessful in restarting it.
    The Alarm Server compares a newly raised alarm with the previously raised ones in order not to raise the same alarm more than once. Raising an alarm means the following -
    • logging the alarm
    • adding the alarm to the ascii alarm file
    • adding the alarm to the Patrol alarm file

    The ascii alarm file is located in the same directory as the log files and is called enstore_alarms.txt. The Patrol alarm file is located in the same directory and is called enstore_patrol.txt.

    Resolving an alarm means the following -

    • logging the cancellation
    • removing the alarm from the ascii alarm file
    • removing the alarm from the Patrol alarm file
    Currently it is only possible to cancel an alarm via the command line.

    1.3.9.1 Command Line Control of the Alarm Server

    Alarm Server functionality may be controlled through a command line interface using enstore. A summary of the supported commands is given below. In addition to the following commands, the Alarm Server command line interface supports the general commands supported by all other servers.

    FUNCTION                                                                  COMMAND                                        OUTPUT
    raise an alarm with root error of UNKNOWN and severity of WARNING        enstore ac --alarm                             None
    raise an alarm with the specified root error and a severity of WARNING   enstore ac --alarm --root_error="root_error"   None
    raise an alarm with the specified severity and a root error of UNKNOWN   enstore ac --severity=severity_value           None
    resolve the specified alarm                                              enstore ac --resolve=unique_id                 None
    get the name of the patrol file                                          enstore ac --patrol_file                       patrol file name

    1.3.9.2 Alarm Server Config file Values

    The Alarm Server looks for the following values in the Alarm Server section of the Enstore config file. The default value is used if the dictionary element is not found. Dictionary elements with no default must be specified in the Enstore config file.

    DICTIONARY ELEMENT   DEFINITION                                           DEFAULT
    host                 node where Alarm Server runs
    logname              ascii value used for id in messages to Log Server    ALARM_SERVER
    norestart            do not restart this server if it crashes             do a restart
    port                 udp port for Alarm Server communication

    1.3.9.3 Ascii Alarm File

    The ascii alarm file that the Alarm Server creates stores all of the current raised alarms. When the Alarm Server is started this file is read. Below is an example file -
    [927226812.665, 'rip7.fnal.gov', 13917, 'enstore', 'E', 'INQ_CHILD', 'CANTRESTART', {'server': 'DM12AIT.mover'}]
    [927230044.586, 'rip7.fnal.gov', 18409, 'enstore', 'E', 'INQ_CHILD', 'CANTRESTART', {'server': 'DM07AIT.mover'}]
    [927255672.586, 'rip7.fnal.gov', 835, 'enstore', 'E', 'INQ_CHILD', 'SERVERDIED', {'server': 'DM07AIT.mover'}]
    [927255677.704, 'rip7.fnal.gov', 836, 'enstore', 'E', 'INQ_CHILD', 'SERVERDIED', {'server': 'DM08AIT.mover'}]
    

    1.3.9.4 Patrol Alarm File

    The Patrol alarm file that the Alarm Server creates stores all of the current raised alarms in a format that Patrol can parse. Below is an example file -
    rip7 Enstore 'E' INQ_CHILD on rip7.fnal.gov - CANTRESTART 
    rip7 Enstore 'E' INQ_CHILD on rip7.fnal.gov - CANTRESTART 
    rip7 Enstore 'E' INQ_CHILD on rip7.fnal.gov - SERVERDIED 
    rip7 Enstore 'E' INQ_CHILD on rip7.fnal.gov - SERVERDIED 
    
    Patrol was developed at SLAC and enhanced and modified at DESY. It is currently in use at these institutions and at Fermilab. We have begun investigating its use in association with the Enstore system.

    (also available in Postscript)


    1.4 Server Protocols

    Communication between clients and servers is implemented in the python modules udp_client.py and dispatching_worker.py, which contain the classes UDPClient and DispatchingWorker respectively. For example, a Mover is a client of the media_changer; i.e., it sends mount and dismount requests to the media_changer and waits for replies. The client and the server may run on the same or different machines, and messages, that is requests and replies, are passed using the UDP network protocol. UDP is not a guaranteed reliable protocol, but the Enstore protocols, described later, implement reliability.

    Generally, each server module has a corresponding client module that implements the client interface to the server. For the media_changer, the Mover imports media_changer_client.py, which implements load and unload methods. So, mover.py imports media_changer_client, which encapsulates the media_changer interface, and media_changer_client imports udp_client, which encapsulates the UDP communications. On the server side, the Media Changer imports dispatching_worker, which encapsulates the server UDP implementation. So far, we have mentioned modules that are imported with the python "import" command. Within the modules, statements instantiate the corresponding classes.

    All clients are themselves clients of the configuration server; so, each time they send a request to their server, they send a request to the configuration server to get the address of their server. In this way the configuration server is the only server that has a hard coded address. When each process starts it is given the IP address and port number of its configuration server.

    When a client is instantiated it determines a free UDP port on its machine on which it sends requests to its server. When the server reads a request it also gets the address (host, port) of the client which sent the request and uses it to reply.

    Client requests are called tickets and they are python dictionaries. The items in the dictionary are agreed upon between the client and the server. For example the Media Changer ticket must contain a volume id and a drive id.

    One item required in the ticket dictionary is "work". The "work" item in the dictionary is used in dispatching_worker as a method name and a corresponding method in the server is called to perform the work that the client requests. For example, the Media Changer ticket must contain a "work" item with a value "load" or "unload" and the Media Changer server has methods named load and unload.
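    The following minimal sketch shows how a "work"-keyed ticket can drive method dispatch; the class, ticket fields, and status values here are illustrative, not the actual dispatching_worker or Media Changer code:

    # Sketch of dispatch keyed on the ticket's "work" item; names are illustrative.
    class MediaChangerSketch:
        def load(self, ticket):
            return {'status': ('ok', None)}     # mount the volume (stubbed)
        def unload(self, ticket):
            return {'status': ('ok', None)}     # dismount the volume (stubbed)
        def handle_request(self, ticket):
            # The "work" item names the server method to invoke.
            method = getattr(self, ticket['work'], None)
            if method is None:
                return {'status': ('KEYERROR', 'no such work')}
            return method(ticket)

    ticket = {'work': 'load', 'volume': 'CA2252', 'drive': 'DM07AIT'}
    reply = MediaChangerSketch().handle_request(ticket)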

    When UDPClient sends a request it first prepends a client identification stamp and a request time stamp (which serve as a unique identification) to the ticket and stringifies the result. Then it calculates a CRC of the message and appends a stringified version of the CRC to the message. Finally it sends the message to the server and waits for a response.

    The response format is client timestamp, response message, and server time stamp. If the client receives any response that does not start with the original client time stamp or if the wait for the response times out then the request is resent. More about this later.
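    A minimal sketch of the client side of this exchange follows. The framing details, names, and CRC choice are illustrative and differ from the real udp_client.py:

    # Sketch of the stamped-ticket send/retry logic; illustrative only.
    import ast, socket, time, zlib

    def send_with_retry(sock, server_addr, ticket, timeout=10.0, retries=3):
        ident = (socket.gethostname(), sock.getsockname()[1])  # client identification stamp
        stamp = time.time()                # request time stamp; (ident, stamp) is unique
        body = repr(((ident, stamp), ticket))                  # stringify the stamped ticket
        message = body + repr(zlib.adler32(body.encode()))     # append a stringified CRC
        sock.settimeout(timeout)
        for _ in range(retries):
            sock.sendto(message.encode(), server_addr)
            try:
                raw, _ = sock.recvfrom(65536)
                # reply format: (client time stamp, response, server time stamp)
                client_stamp, response, server_stamp = ast.literal_eval(raw.decode())
            except (socket.timeout, ValueError, SyntaxError):
                continue                   # timed out or garbled reply: resend
            if client_stamp == stamp:      # reply carries our original time stamp
                return response
        raise IOError('no reply from server')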

    The server implementation of the protocol in DispatchingWorker does a select on a list of read file descriptors which includes the socket (host, port) as issued by the configuration server. The select is repeated if it times out.

    When input is detected on the socket, the server reads the request; checks the check sum; unpacks and saves the client id, the client time stamp, and the ticket; converts the ticket to a python dictionary; and calls the method specified by the "work" item in the ticket. If any of these things fail then the request is ignored presuming the client will resend the request.

    The "work" is a text string but python is interpreted and allows runtime evaluation of method names. In the Media Changer the load method is in the media_changer.py module which has imported and instantiated a DispatchingWorker class. When the work method is finished it calls the DispatchingWorker method reply_to_caller with a status result ticket.

    reply_to_caller builds a stringified reply with the client time stamp, the status ticket, and its own time stamp. It sends the reply to the client; saves the client address, the client message id, and the complete reply in case of errors; and waits for more requests.

    We save the complete reply in case of errors because we may get a resent request which the server has already executed but whose response was not reliably returned. Some requests, for example mounting a tape, are not redoable; so, we save the reply and simply resend it. When DispatchingWorker gets a request, it first checks its request dictionary for a request that has the same client id and time stamp; if it finds a match, it resends the saved reply rather than executing the request.

    The request dictionary contains all replies. If it grows beyond a certain size (currently 1000 entries) then entries older than 30 minutes are deleted.
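    A sketch of this saved-reply bookkeeping, with the sizes taken from the text above and everything else illustrative:

    # Sketch of duplicate-request handling; not the dispatching_worker code itself.
    import time

    MAX_SAVED = 1000              # prune once the cache grows beyond this many entries
    MAX_AGE = 30 * 60             # entries older than 30 minutes are deleted

    saved_replies = {}            # (client id, client time stamp) -> (reply, save time)

    def handle_request(request_key, execute, reply_to_caller):
        if request_key in saved_replies:
            # Already executed, but the reply may have been lost in transit:
            # resend the saved reply rather than redoing non-redoable work.
            reply_to_caller(saved_replies[request_key][0])
            return
        reply = execute()
        saved_replies[request_key] = (reply, time.time())
        reply_to_caller(reply)
        if len(saved_replies) > MAX_SAVED:
            cutoff = time.time() - MAX_AGE
            for key, (_, saved_at) in list(saved_replies.items()):
                if saved_at < cutoff:
                    del saved_replies[key]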

    The scheme described so far requires that servers handle one request at a time and that clients queue in the server's udp input buffer waiting their turn. This is satisfactory if requests are guaranteed to finish quickly; however, the Media Changer's operation may take a long time to complete while other operations might be done simultaneously. To accommodate this, dispatching_worker was extended to allow forking in servers.

    The select in dispatching_worker now watches for input from the client socket and a list of pipe fds on which the forked servers report their final status. The parent server process then reports this status back to the client.


    1.5 Trace

    Trace is a utility for tracing the execution of code through information saved in a circular buffer, which resides in shared memory and is available via special commands. It was adapted from previous work in real-time environments and is designed to have minimal impact on the performance of individual components as well as on overall system performance. Trace is widely used in all of the Enstore modules.

    2 Databases in Enstore

    Enstore uses databases to store persistent information. Aside from the databases associated with pnfs, there are two databases, "file" and "volume", used by the File Clerk and Volume Clerk respectively.

    The database used in Enstore must provide the following:

    • Support journaling of the database to record all changes and support full database recovery.
    • Support transaction control to ensure the integrity of the information in the database.
    • Support database check-pointing in order to enable full database recovery.
    • Support performing daily backups of the database, log, and journal files.
    • Support recovery of corrupted databases using the journal or log files.
    • Support the "python dictionary" interface.
    The directory that contains all database related files is called the "database directory" and is defined in configuration.
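    The "python dictionary" requirement means a clerk can treat its database like an ordinary mapping. Below is a sketch of the intended usage; the key and record fields are illustrative, not the actual file or volume record layouts:

    # Sketch of the dictionary-style database interface.
    db = {}                                  # stand-in for the shelve-like database wrapper
    db['CA2252'] = {'media_type': '8MM', 'state': 'mounted'}   # store a record by text key
    record = db['CA2252']                    # retrieve by key
    record['state'] = 'unmounted'
    db['CA2252'] = record                    # write the updated record back
    del db['CA2252']                         # remove a record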

    2.1 Current Underlying Database Implemented in Enstore

    The current Enstore implementation uses LIBTP (http://www.sleepycat.com)(BSD DB v2.3) as the underlying database product. LIBTP is free for non-profit organizations like Fermilab, and has the following features:
    • a one-key, dictionary-like database, designed to store/retrieve binary large objects (BLOBs) of arbitrary length by text key
    • ability to store data items of unlimited size
    • support for various data storage structures: hash table, binary tree, numbered records
    • allows duplicate keys (Enstore doesn't use them)
    • data scanning with cursors, multiple cursors may be opened at the same time
    • different levels of cursor stability
    • transactions
    • transaction logging
    • check-pointing
    • backup and recovery tools
    • custom locks
    • deadlock detection
    A LIBTP-Python shelve-like interface was developed. It provides access to:
    • All three data structures: hash table, binary tree, numbered records
    • Cursors
    • Transactions
    • Locks
    LIBTP was chosen based on the following considerations:
    • Nimbleness, allowing us to set up test stands while developing without being encumbered by database licensing issues.
    • It is similar to dbm-like databases used for the initial Enstore design. This made it easy to develop a Python interface for it and any necessary changes to the Enstore code were localized and relatively easy to make.
    • Database maintenance is relatively inexpensive. It requires only two processes to run: one for check-pointing and the other for deadlock detection.
    • It is simple and fast enough.
    • It provides tools for database transaction logging, database backup and recovery.
    • It is readily obtainable and free.

    We have examined LibTP and find that it meets the current, modest database requirements of the project. We have exploited the "freeware" aspect of it in putting up many test stands. We could replace LibTP with a Run II standard database; however, we have no definite plans to do this, given our experience with the tool and the lack of any driving requirement.

    In addition to the databases, the File Clerk and Volume Clerk also maintain separate journal files. These journal files can be used to recover the databases when they cannot be recovered under normal circumstances.

    2.2 Backup and Recovery Procedures

    2.2.1 Backup

    Backup is a stand-alone procedure which can be performed manually at any time or routinely using a cron job. Currently the files that are backed up are the database files, which contain the persistent data; the log files, which record the transactions; and the journal files, which are secondary transaction records implemented to further help the recovery of the database if there is a need. Enstore does live backups. It copies these files to a designated directory on a remote host. The remote host and directory are defined in the Configuration Server. The backup procedure will perform the following actions:

    Libtp database
    • identifies the log files that are involved in active transactions
    • creates the tar file of database files and all log files
    • deletes all log files that are not involved in active transactions
    Volume journal files
    • does journal file checkpointing (hold database access, move current file to volume.jou.time_stamp, open empty journal file, release database access)
    • creates tar file of volume database file and journal files
    • deletes old journal files
    File journal file
    • does journal file checkpointing
    • creates tar file of file database file and journal files
    • deletes old journal files
    Archives creation
    • creates new directory on remote host under designated "root archival" directory (name dbase.time_stamp)
    • moves all the tar files to this area
    Archives cleanup
    • deletes all the archival directories created more than N days ago (default is 10 days; see the sketch below)
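    The cleanup step reduces to deleting stale dbase.time_stamp directories. A sketch under stated assumptions: the directory layout follows the text above, the directory modification time stands in for its creation time, and everything else is illustrative:

    # Sketch of the archive-cleanup step; illustrative only.
    import os, shutil, time

    def cleanup_archives(root, days=10):
        cutoff = time.time() - days * 24 * 3600
        for name in os.listdir(root):
            path = os.path.join(root, name)
            if name.startswith('dbase.') and os.path.isdir(path):
                if os.path.getmtime(path) < cutoff:   # mtime as a stand-in for creation time
                    shutil.rmtree(path)               # archive older than N days: delete it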

    2.2.2 Recovery

    Recovery (restore.py) is a job initiated manually in case of database corruption.

    restore.py
    • save current (corrupted?) database files
    • find the last backup
    • retrieve database files and log files from the last backup
    • run db_recover to synchronize the database files
    • retrieve journal files from the last backup
    • check the database files using the journal files, making corrections if necessary

    2.3 Administrative Tools

    Administrative tools will exist in a layer on top of the current Enstore system and as such will not require any redesign or reimplementation of existing code. Administrative tools will provide the following operations:
    • Display all volumes for a specified media
    • List all the files and their location on a single or set of media
    • List file/files on the media by creation date
    • List all the media that belong to a specified file family
    • List files that belong to a specified file family
    • Display the date of the last mount for a specified volume
    • List all media belonging to a file family sorted by the most recent media mount date
    • List all media belonging to a file family where the last access date is before a specified date
    • Export metadata of ejected media into a flat file
    • Import metadata from a flat file when importing the media from outside the Enstore system
    • List all files/volumes that belong to user/group
    • Mark the volume as readonly if all of the files on the media are older than a specified date
    • Delete specified files in the pnfs trash bin.
    • Find and recycle volumes from which all files have been deleted.
    • Check for files known to the File Clerk but unknown to pnfs
    In addition, tools will be provided to implement the administrative functions mentioned in the D0 Functional Specification section of this document.


    3 Communication Protocols

    The base protocol for Enstore is UDP for "brief" messages and TCP for data transfers. UDP message sizes are all less than the maximum UDP packet size, so the protocol is very simple.

    The base server protocol is the same for all servers. Statefulness is minimized, not eliminated.

    Each transmission has a unique ID, a timeout, and a maximum number of retries associated with it. The timeout also allows for debugging. For each reception, the message is checked against messages already received to see if the reception is a repeat. If the reception is a *repeat request*, a saved copy of the response is sent; if the reception is a *repeat response*, it is simply ignored. This takes care of the case when a timeout/retry happens just before a response is received.

    Some transfers do not require replies, and using the reliable UDP exchange described here may even hurt system performance; one example is messages sent to the Log Server. For such transfers "pure" UDP messaging is used.

    3.1 Read Protocol

    The communications performed during a read operation are illustrated in the diagram below and described more fully in the following text; a client-side code sketch follows the list.

    NOTE: The communication between the Mover and the Configuration Server happens approximately every two minutes. It has been added to the following drawing to show that this communication is important, but it can occur anywhere in the communications flow before the Mover contacts the Library Manager.

    (also available in Postscript)

    • The user (through encp) contacts pnfs asking for a bit file ID (bfid) for the named file.
    • pnfs returns the bfid to encp.
    • encp asks the Configuration Server with which File Clerk it should be communicating.
    • The Configuration Server returns the location of the appropriate File Clerk.
    • encp asks the File Clerk for the information about the file with the given bit file ID.
    • The File Clerk asks the Volume Clerk for information for the given Volume Label.
    • The Volume Clerk returns the information to the File Clerk.
    • The File Clerk returns file information containing the Volume Label and the Library Manager name.
    • encp asks the Configuration Server for the location of the Library Manager dealing with this File Family.
    • The Configuration Server returns the location of the Library Manager.
    • encp sends the read request to the Library Manager.
    • The Library Manager asks the Volume Clerk if the volume for the requested file has read access.
    • The Volume Clerk confirms access permission.
    • The Library Manager puts the read request into an internal request queue and tells encp that the request has been accepted.
    • The Library Manager finds the next potentially available Mover in its list of Movers and sends a summon message to it, initiating a Mover dialog with this Library Manager.
    • The Mover "wakes up" on receiving the summon message from the Library Manager and asks the Library Manager if there is any work for it to do.
    • The Library Manager asks the Volume Clerk to mark the volume as "mounting".
    • The Volume Clerk confirms the change of state.
    • The Library Manager moves the file internally from the request queue to the work queue.
    • The Library Manager tells the Mover which file to read and which volume to mount.
    • The Mover tells encp from which host and port to read the data.
    • The Mover asks the Media Changer to mount a particular volume.
    • The Media Changer responds once the volume is mounted.
    • The Mover (as the Media Changer Client) asks the Volume Clerk to mark the volume as "mounted".
    • The Volume Clerk confirms the change of state.
    • The Mover sends the data to encp.
    • The Mover tells encp when all the data has been transferred and sends the crc information.
    • The read has completed.
    • The Mover tells the Library Manager that it still has the volume mounted.
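    From the client's side, the read path reduces to a few lookups and one queued request. The sketch below is a distillation of the list above; all object and method names are illustrative:

    # Client-side distillation of the read protocol; names are illustrative.
    def encp_read(pnfs, config_server, filename):
        bfid = pnfs.get_bit_file_id(filename)            # ask pnfs for the bfid
        file_clerk = config_server.lookup('file_clerk')  # locate the File Clerk
        info = file_clerk.bfid_info(bfid)                # volume label + Library Manager name
        lm = config_server.lookup(info['library_manager'])
        lm.read_request(bfid, info)                      # queued; the LM summons a Mover
        # The Mover then calls back with (host, port); encp reads the data
        # stream from that address and checks the crc sent at end of transfer.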

    3.2 Write Protocol

    The communications performed during a write operation are illustrated in the diagram below and described more fully in the following text.

    NOTE: The communication between the Mover and the Configuration Server happens approximately every two minutes. It has been added to the following drawing to show that this communication is important, but it can occur anywhere in the communications flow before the Mover contacts the Library Manager.

    (also available in Postscript)

    • The user (through encp) contacts pnfs with a request to create a file.
    • pnfs returns the file family and volume library information to encp.
    • encp asks the Configuration Server with which Library Manager it should be communicating.
    • The Configuration Server returns the location of the appropriate Library Manager.
    • encp sends the write request to the Library Manager, including file family and number of bytes.
    • The Library Manager puts the write request into an internal request queue and tells encp that the request has been accepted.
    • The Library Manager finds the next potentially available Mover in its list of Movers and sends a summon message to it, initiating a Mover dialog with this Library Manager.
    • The Mover "wakes up" on receiving a summon message from the Library Manager and asks the Library Manager if there is any work for it to do.
    • The Library Manager asks the Volume Clerk for a volume for the file with the specified size and file family.
    • The Volume Clerk returns the volume to the Library Manager.
    • The Library Manager asks the Volume Clerk to mark the volume as "mounting".
    • The Volume Clerk confirms the change of state.
    • The Library Manager moves the file internally from the request queue to the work queue.
    • The Library Manager tells the Mover which file to write and which volume to mount.
    • The Mover asks the Media Changer to mount a particular volume.
    • The Media Changer responds once the volume is mounted.
    • The Mover (as the Media Changer Client) asks the Volume Clerk to mark the volume as "mounted".
    • The Volume Clerk confirms the change of state.
    • The Mover tells encp to which host and port to write the data.
    • The Mover tells the Volume Clerk that it is appending to this volume.
    • The Volume Clerk acknowledges this.
    • encp sends the data to the Mover.
    • The Mover tells the Volume Clerk that the append operation is done and how much space is left on the volume.
    • The Volume Clerk acknowledges this.
    • The Mover tells the File Clerk which file has been created.
    • The File Clerk responds with the bit file id.
    • The Mover tells encp that the file has been written and sends the bit file ID and the crc.
    • encp tells pnfs that the file has been created, and the bfid should be stored.
    • The write has completed.
    • The Mover tells the Library Manager that it still has the volume mounted.
    • The Library Manager asks the Volume Clerk to mark the volume as "unmounting".
    • The Volume Clerk confirms the change of state.
    • The Library Manager tells the Mover that there is no work to be done.
    • The Mover tells the Media Changer to dismount the volume.
    • The Media Changer responds once the volume is unmounted.
    • The Mover (as the Media Changer Client) asks the Volume Clerk to mark the volume as "unmounted".
    • The Volume Clerk confirms the change of state.

    4 Error Control

    4.1 Assumptions about Errors

    Enstore conforms to the (oral) statements made about Run II operating conditions:
    • No single error when reading a tape is fatal to upper level software.
    • When writing, errors should be handled by retries on different media.
    • Mover nodes may crash, with minimal disruption of the system.
    • The system should generate alarms and receive immediate service when its throughput falls below predefined levels.
    • Routine error conditions should be cleared in normal business hours.
    • It shall be possible to redirect writes to another library in case of library failure.
    • The system shall be capable of being monitored by PATROL.

    4.2 Error Overview

    Enstore is a distributed system. For a transfer to succeed, many of the Enstore processes must be up and running. Therefore, the servers are robust, and run on reliable computers. Nevertheless, it is good to consider the intrinsic ability of the system to recover when a process, or the system running a process, crashes. This is summarized in the table below:

    PROCESS                WHERE IS STATE?                                   EFFECT OF CRASH
    Encp                   In the user's encp transfer command               Transfer is canceled/aborted
    Configuration Server   Static configuration file                         Wait for restart of server
    File Clerk             Persistent database table                         Wait for restart of server
    Volume Clerk           Persistent database table                         Wait for restart of server
    Library Manager        In-memory lists of what work is queued,           Recovery of state is not yet implemented;
                           and what work is at what Mover                    recovery of state is possible through
                                                                             encp retries
    Mover                  If busy, the current transfer + the               Encp retries writes, exits with errors
                           current volume                                    on read
    pnfs NFS Servers       DBM database "file metadata"                      NFS retry mechanisms
    Media Changer          In-memory lists of work given to library micro    State refreshed by Enstore UDP retry
                                                                             protocol mechanism
    Log Server             None                                              Logs are not written
    Inquisitor             None                                              Displays are not refreshed until restart

    4.3 Detailed Error Discussion

    Much of the system state is stored within the user's encp client. This allows the encp client to retry on a large number of different errors. This retry is given a very high priority when it is received by the Library Manager so the user doesn't have to wait again for their job. It is the Library Manager's responsibility to ensure the error is not just repeated; for example, on a volume read error, the volume should not go to the same drive on a retry. The Library Manager gets the retry, volume and drive information from the ticket.
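    A sketch of how the Library Manager might use the ticket's retry, volume, and drive information to avoid repeating the error; the ticket fields and policy here are illustrative:

    # Sketch: on a retry, avoid sending the volume back to the drive that failed.
    def pick_mover(movers, ticket):
        failed_drive = ticket.get('last_drive')          # drive used on the failed attempt
        candidates = [m for m in movers
                      if ticket.get('retry', 0) == 0 or m['drive'] != failed_drive]
        return candidates[0] if candidates else None     # None: leave the request queued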

    Many interesting errors are related to cases where the volume cannot be written or read, or when it is suspected that the volume is jammed, etc. More experience is needed with the actual hardware before the correct error control behavior is established. In the interim, Enstore will make the following working assumptions (sketched in code after this list):

    • If there is trouble during a load or unload operation, the volume is assumed to be physically jammed. No operations are performed on the drive or the volume until an administrator looks at the problem.
    • If several drives have fatal write errors on a volume, the volume will be marked read only.
    • If several drives have fatal read errors on a volume, the volume will be marked no access.
    • If a drive has several fatal errors on different volumes, the drive will be marked offline.
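    A sketch of the "several fatal errors" bookkeeping behind these assumptions; the threshold, names, and actions are illustrative:

    # Sketch of error-count thresholds; illustrative only.
    SEVERAL = 3                    # hypothetical threshold for "several"
    volume_errors = {}             # volume label -> set of drives with fatal errors
    drive_errors = {}              # drive id -> set of volumes with fatal errors

    def mark_volume(volume, state):
        print('volume', volume, '->', state)             # stand-in for the Volume Clerk call

    def mark_drive_offline(drive):
        print('drive', drive, '-> offline')              # stand-in for offlining the drive

    def record_fatal_error(volume, drive, writing):
        volume_errors.setdefault(volume, set()).add(drive)
        drive_errors.setdefault(drive, set()).add(volume)
        if len(volume_errors[volume]) >= SEVERAL:
            # several drives failed on this volume: blame the volume
            mark_volume(volume, 'read only' if writing else 'no access')
        if len(drive_errors[drive]) >= SEVERAL:
            mark_drive_offline(drive)                    # several volumes failed in this drive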


    The following table describes the error conditions that Enstore handles.



    Each entry below gives the error's description; the responsibilities of the Administrator, Mover, Library Manager, and encp (parties with no responsibility are omitted); whether encp retries; and the resulting volume and drive states.

    Volume Write Errors

    WRITE_NOTAPE
        Description:      Requested volume was not found in the library. The Volume Clerk's
                          database is inconsistent with the library micro's database.
        Administrator:    Check volume in morning
        Mover:            Mark volume no access
        encp:             Retry
        Retry:            Yes
        Volume State:     No access

    WRITE_TAPEBUSY
        Description:      Requested volume is in another drive. Enstore bug, or some other
                          system has mounted the volume, or the library micro put the volume
                          elsewhere.
        Administrator:    Check volume in morning
        Mover:            Mark volume no access
        encp:             Retry
        Retry:            Yes
        Volume State:     No access

    WRITE_DRIVEBUSY
        Description:      A volume is already in the drive. Enstore bug or misconfiguration.
                          Note: the Mover waits for an automatic cleaning tape to be ejected.
        Administrator:    Check drive and configuration in morning
        Mover:            Offline the drive
        encp:             Retry
        Retry:            Yes
        Drive State:      Offline

    WRITE_BADMOUNT
        Description:      Mount failure or load operation failed. Must assume jammed volume.
        Administrator:    Check drive and volume in morning
        Mover:            Mark volume no access
        Library Manager:  Offline the drive
        encp:             Retry
        Retry:            Yes
        Volume State:     No access
        Drive State:      Offline

    WRITE_BADSPACE
        Description:      EOD cookie does not produce EOD. Wrong volume, Enstore bug, or
                          drive space error.
        Administrator:    Check drive and volume in morning
        Mover:            Mark volume no access
        Library Manager:  Offline the drive
        encp:             Retry
        Retry:            Yes
        Volume State:     No access
        Drive State:      Offline

    WRITE_ERROR
        Description:      Error writing data block or file mark.
        Administrator:    Check drive and volume in morning
        Library Manager:  If several errors occur with different volumes, offline the drive;
                          if several errors occur with different drives, mark volume as
                          read only
        encp:             Retry
        Retry:            Yes
        Volume State:     Read only (after several errors with different drives)
        Drive State:      Offline (after several errors with different volumes)

    WRITE_EOT
        Description:      Hit EOT while writing data block or file mark.
        Mover:            Mark volume as full and read only
        encp:             Retry
        Retry:            Yes
        Volume State:     Read only

    WRITE_UNLOAD
        Description:      Error unloading volume from drive. Must assume jammed volume.
        Administrator:    Check drive and volume in morning
        Mover:            Mark volume no access
        Library Manager:  Offline the drive
        encp:             Not involved
        Volume State:     No access
        Drive State:      Offline

    WRITE_NOBLANKS
        Description:      No more blank volumes.
        Administrator:    Administrator should be paged; DAQ should switch to an alternate
                          library.
        Retry:            No

    WRITE_MOVER_CRASH
        Description:      If a Mover is connected to an encp, encp will notice its sockets
                          being torn down prematurely.
        Administrator:    Check drive and volume in morning
        encp:             Mark volume as no access, retry
        Retry:            Yes
        Volume State:     No access
        Drive State:      Offline

    Volume Read Errors

    READ_NOTAPE
        Description:      Requested volume was not found in the library. The Volume Clerk's
                          database is inconsistent with the library micro's database.
        Administrator:    Check volume in morning
        Mover:            Mark volume no access
        Retry:            No
        Volume State:     No access

    READ_TAPEBUSY
        Description:      Requested volume is in another drive. Enstore bug, or some other
                          system has mounted the volume, or the library micro put the volume
                          elsewhere.
        Administrator:    Check volume in morning
        Mover:            Mark volume no access
        Retry:            No
        Volume State:     No access

    READ_DRIVEBUSY
        Description:      A volume is already in the drive. Enstore bug or misconfiguration.
                          Note: the Mover waits for an automatic cleaning tape to be ejected.
        Administrator:    Check drive and configuration in morning
        Mover:            Offline the drive
        encp:             Retry
        Retry:            Yes
        Drive State:      Offline

    READ_BADMOUNT
        Description:      Mount failure or load operation failed. Must assume jammed volume.
        Administrator:    Check drive and volume in morning
        Mover:            Mark volume no access
        Library Manager:  Offline the drive
        Retry:            No
        Volume State:     No access
        Drive State:      Offline

    READ_BADLOCATE
        Description:      Failed space or initial CRCs don't match. Either the file location
                          cookie is corrupted, the wrong volume is in the drive, or the drive
                          cannot space properly.
        Administrator:    Check drive and volume in morning
        Mover:            Mark volume no access
        Library Manager:  Offline the drive
        Retry:            No
        Volume State:     No access
        Drive State:      Offline

    READ_ERROR
        Description:      Error reading data block. Run of the mill read error.
        Administrator:    Check drive and volume in morning
        Library Manager:  If several errors occur with different volumes, offline the drive;
                          if several errors occur with different drives, mark volume as
                          no access
        encp:             Retry
        Retry:            Yes / No
        Volume State:     No access (after several errors with different drives)
        Drive State:      Offline (after several errors with different volumes)

    READ_COMP_CRC
        Description:      CRC mismatch. Drive and the volume are suspicious. Corrupt file
                          location cookie, drive space error, wrong volume in the drive, etc.
        Administrator:    Check drive and volume in morning
        Mover:            Mark volume as no access
        Library Manager:  Offline drive
        Retry:            No
        Volume State:     No access
        Drive State:      Offline

    READ_EOT
        Description:      Hit EOT when reading. Corrupt file location cookie, drive space
                          error, or wrong volume in the drive. Should have hit an EOF.
        Administrator:    Check drive and volume in morning
        Mover:            Mark volume as no access
        Library Manager:  Offline drive
        Retry:            No
        Volume State:     No access
        Drive State:      Offline

    READ_EOD
        Description:      Hit EOD when reading. Corrupt file location cookie, drive space
                          error, or wrong volume in the drive. Should have hit an EOF.
        Administrator:    Check drive and volume in morning
        Mover:            Mark volume as no access
        Library Manager:  Offline drive
        Retry:            No
        Volume State:     No access
        Drive State:      Offline

    READ_UNLOAD
        Description:      Error unloading volume from drive. Must assume jammed volume.
        Administrator:    Check drive and volume in morning
        Mover:            Mark volume no access
        Library Manager:  Offline the drive
        encp:             Not involved
        Volume State:     No access
        Drive State:      Offline

    READ_MOVER_CRASH
        Description:      If a Mover is connected to an encp, encp will notice its sockets
                          being torn down prematurely. The volume is tied up at a Mover.
        Administrator:    Check volume and drive in the morning
        encp:             Mark volume no access
        Retry:            No
        Volume State:     No access
        Drive State:      Offline

    Other Errors

    ENCP_GONE
        Description:      User has gone away while request is queued.
        Library Manager:  Unilateral unbind
        Retry:            No

    TCP_HUNG
        Description:      It appears that the data TCP link is hung.
        Administrator:    Check with user in morning
        Mover:            Compute an anticipated transfer time for every socket operation
                          and abort the transfer if the actual transfer takes more than
                          three times the expected value.
        Retry:            No

    LM_CRASH
        Description:      Library Manager crashes, and loses its queue of pending work. The
                          encp's will never be called back, and will wait forever.
        encp:             Ping the Library Manager every N minutes (default 30) to see if
                          its request has gotten lost.

    MOVER_CRASH
        Description:      Mover is idle. The system degrades.
        Administrator:    Check drive in morning
        Library Manager:  Remove from list when Mover fails to respond to summon
        Drive State:      Offline

    ANY_UNMOUNT
        Description:      Error unmounting volume. Volume is hanging in the drive.
        Administrator:    Check drive and volume in morning
        Mover:            Mark volume no access
        Library Manager:  Offline the drive
        encp:             Not involved
        Volume State:     No access
        Drive State:      Offline



    • "Freeze the volume in the drive" means:
      • Not unloading the volume from the drive
      • Freezing the volume
      • Offlining the drive
    • "Freezing the volume" means:
      • mark the volume as "system noaccess"
      • log that this happened and let an administrator look at the problem in the morning.
      • Encp shall not retry.
    • "Offlining the drive" means:
      • Preserving as much state as possible.
      • Writing a complete description in an error log.
      • Leaving the problem until business hours unless the capacity of the system falls below a threshold.

    5 Volume Import and Export

    It is certainly possible to import and export information by copying disk resident data from Enstore using the encp command. However, given the need to move large amounts of data, and the widespread use of compatible tape drives and media, it is usually more efficient to interchange tape volumes: that is, to write tapes outside of Enstore and import them into the system, and to write tapes inside of Enstore and remove them from the system. In this way, for example, Enstore can be used as a kind of tape copy facility.

    What follows are draft design notions; importing and exporting are not yet fully implemented.

    5.1 Volume Export

    Exportable volumes are built in Enstore using the encp command, with the command line switch --ephemeral, which specifies a temporary, "ephemeral" file family. An ephemeral file family is a unique file family name created just for this encp command, with a file family width of exactly one. Under these conditions, files will be placed on the tape volume in the order specified by the user. Once the data is written to the tape, the file family name is changed to tape_label_name.ephemeral.

    An experimenter wishing to build an exportable volume would follow these steps:

    • Select a file structuring method which is supported in Enstore, which your users can read, and which complies with the tape interchange standards of your experiment. (For D0 this is the CPIO format.)
    • Decide what kind of media you want for the exported volume. (For D0 this will be the media selected by the SMWG.)
    • Identify and make locally disk resident all the files that you want to put on a volume. Consider the capacity of the tape, and whether you can tolerate files overflowing onto another volume.
    • Specify those files on a single encp command line, in the order you would like to have them placed on tape. Use the encp --ephemeral file family switch on the command line. [After the initial encp copy, the file family for the tape is known. If the user chooses, she may append additional files to the tape. However, no additional tapes will be added to the file family once it is full. It is recommended that all files be copied with one encp command.]
    • Optionally generate metadata for the volume, if your user will want to know what files are on the tape.
    • Move the volume from the robot to a shelf library, using an Enstore administration tool.
    • Remove the volume from the Enstore system, using an Enstore administration tool.
    An experiment can make tapes in the Enstore system at Fermilab and give the volumes to an experimenter, who can read the tapes anywhere. Experimenters can optionally generate a metadata file which provides a map of the exported tape. However, some tape formats, such as CPIO, are sufficiently self-describing so that the tape may be dumped to disk with standard utilities.

    As an example, an experimenter can stage an entire CPIO exported volume using gnu CPIO at her home institution. Since there are many files on an Enstore tape, special care should be taken to select a non-rewind tape device. On a UNIX system, CPIO tapes can be read with no special infrastructure other than gnu CPIO. For example, here is a simple script to read an exported tape at a home institution:

    #!/bin/sh
    
    # en_dump_tape  A shell script to dump an Enstore CPIO format tape.
    # This is pseudo code, not functional yet
    
    tape=$1
    test ! -f "$tape" || exit 1  ## make sure we have not been given a regular file
    while /bin/true ; do
       mt -f "$tape" fsf 1         || exit 1  ## space to the next file on the tape
       (dd if="$tape" | cpio -idm) || exit 1  ## copy-in: extract the file's contents
    done
    
    

    5.2 Volume Import

    Volume import is suitable for repeated and sizeable transfers of data. Volume import is not as easy as export and, therefore, it is not a good choice for small, occasional transfers. Planning is important. Tapes in a robot require special labels that can be automatically scanned and placed in the robot. The labels must be unique within a robot and within an Enstore system. You will need to procure tape volumes with labels meeting these requirements. Imported tapes are read-only in Enstore. (Details of these requirements are TBD, pending the serial media working group decision).

    Although it is not mandatory, an Enstore tool will very likely have to be produced to get good results for experimenters at home institutions wishing to make importable tapes.

    If the tapes are to be accessed many times, the experiment must take some time to think about the layout of data on the tapes and how the files ought to fit into the Enstore name space.

    • Files that are accessed together should likely be put on the same tape.
    • Most likely, the same file should not be placed on more than one tape. Duplicate data files (for very important data) are handled differently.
    • Files must definitely not span tapes.
    • If a likely order of future access is known, files should be put on tape in that order.
    • Think of how you would like the files to appear in the Enstore name space.
    • Identify a file structuring method that is compatible with Enstore and your experiment's data interchange standards.
    • Files should be written with a blocksize yielding good performance. (Precise recommendation is TBD, pending SMWG decision).
    • Files should be written with the recommended partitioning. (Precise recommendation is TBD, pending SMWG decision).

    Since the objective is to import a large amount of data, it is required to generate metadata for each tape; otherwise the tape will have to be scanned to determine the metadata and this defeats the purpose of importing volumes!

    Metadata for each tape is:

    • The external bar-coded label of the tape.
    • The kind of tape media.
    • The file structuring method used to generate the tape.
    • The blocksize used in writing the tape.

    Metadata for each file is (a hypothetical record sketch follows this list):

    • An Adler 32 CRC of the first 65536 bytes of the file. If not available, a value of "None" is acceptable and Enstore skips the check (not recommended).
    • An Adler 32 CRC of all the bytes in the file. If not available, a value of "None" is acceptable and Enstore skips the check (not recommended).
    • The number of blocks (or less desirably, the number of file marks) preceding the beginning of the wrapper.
    • The number of the partition holding the file (if this feature is used).
    • A name for the file.
    • A path for the file in the pnfs namespace.
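    For concreteness, per-file metadata of this kind might be collected in a record like the following. This is a hypothetical layout; the actual flat-file format is TBD pending the SMWG decisions noted above:

    # Hypothetical per-file metadata record; all field names and values are
    # illustrative, not a defined Enstore format.
    file_metadata = {
        'name'                  : 'run1234_evt0001',
        'pnfs_path'             : '/pnfs/sample/run1234_evt0001',
        'crc_first_65536_bytes' : None,       # "None" means Enstore skips the check
        'crc_whole_file'        : 123456789,  # Adler 32 of all the bytes in the file
        'blocks_before_wrapper' : 42,
        'partition'             : 0,          # if partitioning is used
    }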

    The procedure the user would follow to create an importable volume is:

    • Identify the files you want to put on a tape.
    • Put an optically bar-coded tape in the tape drive. Sheets of bar codes could be sent to the remote site or actual pre-labeled tapes could be used. The important point is that the robot will be able to recognize the tapes.
    • Using a tool we provide:
      • Place the files on the tape.
      • Generate the metadata for that file.
    • Remove the tape from the drive.
    The procedure for importing a volume is:
    • Deliver the volume to the data center.
    • Use an Enstore administration utility to make a record for the volume in the volume table using the volume metadata and to set the physical location as the "shelf" library.
    • Use an Enstore administration utility to place the metadata for each file into the file table and pnfs namespace.

    6 Test System

    The Enstore hardware test system was designed to be able to test data movement at rates comparable to the requirements for Run II data logging, to evaluate gigabit networking technologies, and to determine scaling for the larger amount of hardware required to support all of Run II data handling. The ability to sustain data rates comparable to Run II data logging requirements allows the test system to double as the RIP (Reconstruction Input Pipeline) test platform.

    The following test system is installed in the Feynman Computing Center and runs several Enstore systems. One is for HPPP systems development and testing and the other is used by the D0/SAM project. D0/SAM has gigabit ethernet access to the Enstore environment, uses the system for testing, and presented it at SuperComputing 98. The RIP/Enstore hardware test system consists of the following:

    • 10 x86 PC's with:
      • dual 400 MHz Pentium II processors
      • 128 MB SDRAM memory
      • integrated fast ethernet
      • integrated ultra wide SCSI (1 bus)
      • integrated ultra SCSI (1 bus)
      • integrated video
      • 2 4-GB Seagate Barracuda SCSI disks
      • gigabit ethernet interface (Packet Engines)
      • two nodes with differential wide SCSI interface
      • four nodes with wide LVD SCSI interface
      • two nodes with Fibre Channel (for SCSI) interface
      • 275 MBytes/sec memory bus bandwidth, as measured by the streams benchmark
    • Disk chassis with 4 ultra wide SCSI buses, each with 2 18-GB disks
    • Disk chassis with dual FC-AL loop, FC controller, 4 9-GB FC disks.
    • Foundry fast ethernet switch with gigabit ethernet uplinks (2)
    • Foundry gigabit ethernet switch
    • Packet Engines gigabit ethernet full duplex repeater
    Two nodes are connected to tape drives on the EMASS robot. One node, via a single wide differential SCSI bus, is connected to four Sony AIT tape drives. The other is connected via a single bus to four Quantum DLT 7000 drives, and two Exabyte Mammoth drives.

    One node is connected to two STK redwood tape drives in a STK Powderhorn robot.

    The RIP cluster is located physically adjacent to the SAM cluster, which has similar PC's. For testing throughput to/from clients, the SAM cluster can be connected to the RIP cluster via 10, 100, and 1000 Mbps network uplinks (though each SAM node has only 10/100 Mbps capability).

    (also available in Postscript)

    Enstore has been installed on the Test System and is fully operational. The administration of the system is flexible and can be changed by modifying a single configuration file.

    Currently, the configuration is as follows:

    NODE    SERVERS
    rip1    4-AIT Media Changers
            2-AIT Media Changers
            4-DLT Movers
    rip2    4-AIT Movers
    rip3    cluster console
    rip4    General Use
    rip5    Disk Library Manager
            AIT Library Manager
            DLT Library Manager
            Mammoth Library Manager
            Redwood-50 Library Manager
            Redwood-20 Library Manager
    rip6    Configuration Server
            File Clerk
            Volume Clerk
            Admin Clerk
            Log Server
            Inquisitor
            Alarm Server
            Disk Mover
            Disk Media Changer
    rip7    General Use
    rip8    General Use
            Serves Home areas
    rip9    2-STK Movers
    rip10   4-DLT Media Changers
            STK Media Changer

    In the past, we have also tested Exabyte drives on AIX machines. We have not continued with this effort, but rather have concentrated on drives and CPUs that will most likely be used in conjunction with the EMASS robot.

    (also available in Postscript)

    6.1 Test System Results

    Test System Configuration:

    • EMASS Robot:
      2 AITs (15 of 84 tapes allocated)
      1 Mammoth (15 of 42 tapes allocated)
      2 DLTs (15 of 84 tapes allocated)
    • STK Robot:
      2 Redwoods (5 of 200 tapes allocated)
    • Disk test Movers

    Preliminary Rates Measurements:

    Device   Writing (MB/S)   Reading (MB/S)   Network (Mbits/S)   "Mem->tape" (MB/S)
    AIT      2.7              2.7              94                  2.7
    Mam      2.8              2.7              94                  2.8
    DLT      4.9              4.8              94                  4.9
    STK      8.8*             7.5*             83                  9.8

    * STK tests not repeated (yet) after coding changes that improved performance
    Some STK details:
    • 1 GB file transferred
    • Mount time ~ 41 seconds
    • Seek time ~ 0 seconds (beginning of tape)
    • Enstore queue wait time ~ 1 second
    • Transfer time ~ 116 seconds (8.8 MB/S)
    • EOF time ~ 0 seconds
    • Get stats time ~ 34 seconds
    • Effective User rate: 5.7 MB/s (appending)

    Preliminary CPU Utilization for AIT transfers:

              Mover     User's encp
    no crc    5-10%     5-10%
    crc       10-20%    10-20%

    7 Interfaces and Integration

    Below is an (incomplete) summary of interfaces that are intrinsic to the Enstore software. They are specified and coded as part of the software project.

    FUNCTION                                                              PERSON                  SOFTWARE
    Initiate a specific transfer between tape and disk                    Experimenter            encp
    Organize names                                                        Experimenter            pnfs
    Choose library to write to                                            Experimenter            pnfs
    Create file families, administer width                                Experimenter            pnfs
    Current status on web                                                 Enstore Administrator   Inquisitor
    Summary status on web                                                 Enstore Administrator   Inquisitor
    Routine periodic monitoring                                           TBD                     Patrol (or TBD?) + Alarm module
    Move volumes between shelf and library                                Experimenter            Enstore Administration Utility
    Move volumes between shelf and out-of-system (includes new volumes)   Experimenter            Enstore Administration Utility
    Drain system                                                          Enstore Administrator   Enstore Administration Utility
    Shutdown system                                                       Enstore Administrator   Enstore Administration Utility
    (Re)Start system                                                      Enstore Administrator   Enstore Administration Utility

    Interfacing to an experiment means placing Enstore in a larger system context. From the point of view of Enstore with network attached tapes, there is an Enstore system which interfaces to the rest of D0 at the NIC-card cable connector. In addition to the interfaces to the user and administrators (which are intrinsic to Enstore software), there are other miscellaneous interface issues associated with a real instance of the software, since the whole system must conform to the Experiment's, Division's, and Laboratory's system constraints:

    Type                   Constraint                                                          Imposed by
    Network                Conforming physical media                                           System
    Network                Protocol extensions                                                 System
    Network                16K minimum UDP datagram size                                       Enstore
    Network                Traffic pattern to machines where one NIC card is not sufficient   System
    Site                   Location                                                            System
    Operations             Run II Operations Software Framework (Patrol or TBD)                System
    Operations             Failure planning (e.g. broken tape library, power)                  System
    Operations             Upgrade-ability                                                     System
    Operations + Security  Standard administration                                             System
    Security               Authentication etc. are TBD                                         System

    All production hardware is to come from the D0 budget, requisitioned by D0. This includes the Enstore system, tape drives, and other peripherals. D0 has a baseline system design. From the point of view of the storage management project, the main features of the D0 system are:

    • IP Based connectivity.
    • A few SMP boxes back-ending their DA at the D0 assembly building.
    • An analysis facility in FCC consisting of a large SMP box and smaller SMP boxes.
    • Many hundred farm nodes, most likely an ensemble of small PCs or economical RISC workstations provided by the lowest bidder. Whether there are "I/O nodes" associated with the farm is TBD.
    • An Enstore system with about 8-10 Mover nodes, and sufficient other nodes to run the rest of the system.

    Given the discussion above, the main interface issues with D0 are:

    • Specification and design of a data LAN.
    • The number and type of NIC cards on each computer.
    • Tuning the features in Enstore for the significant multiple NIC card machines.

    To keep transfers efficient, it is important to have a network design which avoids congestion.

    It is important to characterize the rate achievable on a NIC card, and the amount of CPU required to drive NIC cards. Typically, there is less CPU consumed per byte when writing to the network than when reading from it. One vendor reports, for CPUs available in late 1998:

        10-11 MB/S on a 100 Mbps ethernet NIC
        30-35 MB/S on a Gbit ethernet card (standard MTU)
        70-80 MB/S on a Gbit ethernet card (jumbo MTU)

        30-35 MB/S/CPU   standard ethernet frames
        78    MB/S/CPU   "jumbo" (~9000 byte) frames
                         (n.b. Gbit ethernet here)

    On an analysis server, allocation of streams to NIC cards is most easily accomplished statistically. For machines with very many potential streams, this is best effected by fewer, fatter pipes. Ideally, there is little packet loss and traffic is regulated by the TCP window. If jumbo frames were acceptable, not generally but only between the Enstore system and the D0 analysis machine, statistical load balancing over four Gbit NICs consuming two CPUs could easily sustain the (imagined) 150 MB/S peak tape rate for D0, with very little congestion.

    The problem becomes more difficult as the throughput of a NIC card decreases. The basic unit of transfer is a stream carrying the full tape rate (i.e., 5 MB/S for AIT-2, plus some allowance for expansion). It is, in fact, a little questionable whether two such tape streams should be multiplexed onto a single 100 Mbps NIC card -- congestion, slow start, and other rate-inhibiting mechanisms may be invoked.
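
    As a back-of-the-envelope check of the two preceding paragraphs, the sizing arithmetic can be written out as below. This is a sketch, not a measurement: the only inputs are the vendor figures quoted above and the 150 MB/S and 5 MB/S rates from the text.

        # Back-of-the-envelope NIC/CPU sizing using the vendor figures quoted
        # above; the 150 MB/S target and the 5 MB/S stream rate come from the
        # text, and nothing here is a new measurement.
        import math

        GBE_JUMBO_NIC_RATE = 70.0   # MB/s, low end of the 70-80 jumbo-MTU figure
        CPU_RATE_JUMBO = 78.0       # MB/s per CPU with ~9000-byte frames
        FAST_ETH_NIC_RATE = 10.0    # MB/s, low end of the 10-11 figure

        peak = 150.0                # imagined D0 peak tape rate
        stream = 5.0                # one AIT-2 stream

        print("Gbit jumbo NICs needed:", math.ceil(peak / GBE_JUMBO_NIC_RATE))
        print("CPUs consumed driving them:", math.ceil(peak / CPU_RATE_JUMBO))
        # Two full tape streams sit right at the practical limit of one
        # Fast Ethernet card:
        print("two streams on Fast Ethernet: %.0f of ~%.0f MB/s usable"
              % (2 * stream, FAST_ETH_NIC_RATE))

    The minimum is three jumbo-frame Gbit NICs and two CPUs; the four NICs mentioned above give headroom for the statistical load balancing.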

    Other system design and integration issues

    • A specific site for the Enstore installation has not been identified.
    • The Enstore project has begun to learn about PATROL, but is unaware of its formal selection for Run II.
    • The OSS department has work items relating to installing and administering the operating systems for the D0 Enstore system.
    • Run II authentication systems are TBD.
    • The Enstore mover machines have not yet been identified.

    8 D0 Requirements

    As stated in the introduction, D0, and most specifically SAM, has been very helpful in setting the direction for what is needed from Enstore. We believe a close working collaboration has developed, from which both SAM and Enstore have profited. In the following subsections, we present the requirements we have received from D0 and try to indicate how we fulfill them. We want to work with D0 to satisfy them all; this should be possible since we control the source code.

    8.1 Summary of D0 Functional Specifications

    The Functionality D0 expects of a Storage Management Layer

    There are 6 major functional areas. They are described in more detail and broken down further below.
    • Cataloging and Database Functions for Files and Tape Volumes
    • Specification and Control of Tape Volume Storage Locations
    • Control of various parameters which govern the functional behavior and performance of the system
    • Management of the robot resources (including error recovery and tracking)
    • Movement of files between users' machines/local disks and tape in the robot.
    • Operational procedures to run and manage the robot, the data stored in the system, the tape drives, and the "databases"

    1) Cataloging and Database Functions

    1.1) Maintenance of the primary "database" of file to tape volume information

    • reliable and backed-up "database" of each volume and file: the volume location of each file, and the position within each volume of each file
    • Enstore has 2 simple sets of internal databases: the volume and file databases, and the pnfs databases. These databases are based on LIBTP and contain sufficient information to read each file in Enstore or to write new files to available volumes.

      We believe LIBTP to be reliable - although we could replace it with a commercial database such as Oracle if required. We have developed backup scripts to recover from potential database corruption. Finally, we expect the databases to be saved in a SCSI RAID level 5 system for redundancy and reliability. We also support live backups without any user impact.

      D0's experiment catalog can also contain the basic information in our databases. The initial loading of this information is done via the information returned from encp. And, since we do not expect any movement or compacting of data, this information should not change. If it does, syncing methods will need to be developed.

      Enstore also stores data in pnfs's databases, such as the bit file id. These pnfs databases have been reliable in our experience and they are supported by DESY.

      Pnfs backup issues are not completely understood and will be addressed during the DESY visit this February.

    • assurance that all movement of files/deletion of files is correctly reflected in the "database"
    • If an experimenter moves or deletes files in the pnfs file system, it is immediately reflected in the pnfs databases. Therefore, in the case of moved files, Enstore still has the correct pointer information available to it for the transfers; for deleted files, the user won't be able to find the file in the namespace and won't be able to start the transfer.

      In addition to the user namespace, Enstore maintains another namespace that is ordered by file family, tape, and position on the tape. The user can't delete or move items (UNIX permissions) in this namespace because it represents the physical ordering of files on tapes. Accidental deletions from the user's namespace can potentially be recovered by using this volume-based namespace. [Not yet implemented.]

      No changes in the internal Enstore databases are required when a user moves or deletes files, since nothing has been moved or deleted on the physical media. Initial tools are available to delete entire volumes; nothing is planned for compacting data on volumes.

    • notation of volume/file status (e.g. if unreadable or errors)
    • Enstore has implemented a special output format, selected with the --data_access_layer option, that is printed at the end of each file transfer. The status of the current file transfer is available in this way. The only successful status is "OK", with an exit code of 0; all other values represent failures. (A usage sketch follows below.)

      Another possible return is NOACCESS, which indicates the volume can not be read. This return happens before submission to the library manager's queue, so it happens very quickly.

      In principle, SAM has the same volume information as Enstore and could update its tables to reflect the NOACCESS returns. If the volume is put back into service, a tool to reflect this change would have to be developed. Another possibility is for Enstore to flag the affected files from the NOACCESS volume in pnfs for the user.

      A consistent, a priori way of marking all files on a volume unreadable after the volume has been declared unreadable has not been fully worked through. (The general idea is that SAM never makes an encp request for files on unreadable tapes.)
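
      To make the status handling concrete, here is a minimal sketch of how a client might invoke encp and act on the exit code. The --data_access_layer option and the OK/NOACCESS statuses are described above; the paths and the way the output is scanned are illustrative assumptions, not part of the Enstore specification.

        # Minimal sketch of client-side status handling around encp.  The
        # --data_access_layer option and the OK / NOACCESS statuses come from
        # the text; the paths and output parsing are assumptions.
        import subprocess

        def fetch(pnfs_path, local_path):
            proc = subprocess.run(
                ["encp", "--data_access_layer", pnfs_path, local_path],
                capture_output=True, text=True)
            if proc.returncode == 0:
                return True                 # the only success: status OK, exit 0
            if "NOACCESS" in proc.stdout:
                # Volume marked unreadable; encp fails before queueing,
                # so this return comes back very quickly.
                print("volume unreadable:", pnfs_path)
            else:
                print("transfer failed, exit %d" % proc.returncode)
            return False

        fetch("/pnfs/sam/top/run1234.dat", "/data/run1234.dat")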

    • tracking of the tape volume format
    • In Enstore, only one format is allowed per tape; it is not possible to mix formats. Enstore tracks this information in its volume database and errors out if it doesn't match.

      Another, more explicit method we are developing is to add the tape format to the file family name. For example, the file family would be "top.cpio" and not just "top". In this way, the user can select the format he wants to use, and tapes automatically have just one format (since different formats would have different file family names).

    1.2) Provide access to File namespace and Volume Information

    • tools for users to easily and intuitively view all files in the system along with other commonly needed information about the file - such as owner, 'grouping', date written, etc.
    • tools for users to easily and intuitively view all volumes in the system
    • tools for viewing file/volume related information such as all files on a volume
    • Enstore uses the pnfs namespace from DESY and it furnishes all these items.

      The library manager has commands to list all the volumes in the system or file family as well as the statistics about specific volumes.

      Enstore also manages a duplicate namespace, ordered by file family, volume, and position on tape, that also supplies this information to the user.

    • tools for exporting all, or recently changed parts, of the file and volume database to the data access layer (for performance reasons) or to remote institutions
    • This has not been addressed.

    2) Specification and control of Tape storage locations

    • ability to handle several distinct robot storage locations
    • Robot storage locations are controlled by the pnfs library tag. This is user settable.

    • ability to treat a single physical robot as multiple logical storage locations
    • Enstore can divide a physical library into many virtual libraries. Mover computers, and therefore, tape drives, can be assigned to one or more of these virtual libraries.

    • ability to migrate tape volumes between storage locations
    • ability to handle various physical 'shelves' as possible storage locations

      Enstore treats 'shelves' as just another library. Tools are available to move volumes between different libraries. Insert/eject tools are being developed for the EMASS robot to transfer volumes between robotic storage and vault shelves.

    • ability to import volumes into storage locations (given sufficient meta-data in an acceptable format)
    • ability to export volumes (with their associated meta-data)
    • Enstore is developing import and export tools that meet these requirements, as described in section 5.

    • possibly implementation of quota system for particular user/group within a storage location
    • This has not been addressed.

    3) Control of parameters which govern the functional behavior of the system

    3.1) Control of parameters which govern allocation and use of tape drives

    • possibly specification of preference or affinity between certain access modes, users or groups, and certain subsets or classes of physical tape drives
    • This has not been addressed.

    3.2) Control of parameters which govern how files are written to tape

    • specification of "groupings" or File Families for files
    • specification of "width" for a grouping

      File families and widths are fundamental design notions of Enstore.

      Enstore allows the user to change the file family and width tags in pnfs. Regular UNIX permissions prevent unauthorized changes.
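
      Since the library, file family, and width all live as pnfs directory tags, the usual pnfs convention of reading and writing the special ".(tag)(name)" pseudo-files applies. The sketch below assumes the tag names library, file_family, and file_family_width, and a made-up directory; the exact names are illustrative, not confirmed by this document.

        # Sketch: read and set the pnfs directory tags that steer Enstore
        # writes.  The ".(tag)(name)" pseudo-file convention is standard pnfs;
        # the tag names and the directory are assumptions for illustration.
        import os

        pnfs_dir = "/pnfs/sam/top"          # hypothetical pnfs directory

        def read_tag(name):
            with open(os.path.join(pnfs_dir, ".(tag)(%s)" % name)) as f:
                return f.read().strip()

        def write_tag(name, value):
            # Ordinary UNIX permissions on the tag file gate who may change it.
            with open(os.path.join(pnfs_dir, ".(tag)(%s)" % name), "w") as f:
                f.write(value + "\n")

        print(read_tag("library"), read_tag("file_family"),
              read_tag("file_family_width"))
        write_tag("file_family", "top.cpio")   # wrapper format encoded in the name
        write_tag("file_family_width", "2")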

    • possibly specification of a list of files to be treated logically as one 'work unit'
    • Enstore allows a cp-like syntax where the input files can be a list.

    • specification of "append to tapes" policy
    • Enstore tries to append to tapes to fill them to their capacity. Tapes can also be marked "full" at any point.

    • specification of file wrappering format
    • Enstore has developed a flexible wrappering module. We want to be able to support any wrappering format an experiment chooses. We promote the use of the CPIO format, since it makes the tapes self-describing and readable on any UNIX machine.

    • association of tape volumes to a particular file family and tape library
    • All volumes are inherently assigned to a file family before they are written to by Enstore. The concept of File families has no meaning for reads.

      Volumes can currently only be in one tape library.

    3.3) Control of parameters which govern how files are read from tape

    • specification of error/retry behavior
    • This cannot be done dynamically; the retry behavior is static. We will make it comply with D0's needs.

    3.4) Control of parameters which govern access to files and volumes

    • access control based on user/group for each file and each file family
    • Pnfs provides this with normal UNIX file permissions.

    3.5) Control of parameters which govern network routing between storage system Movers and client machines

    • ability to choose optimal path to load balance in the case of multiple network interfaces on a single machine
    • Enstore provides this through the mover configuration file and the normal UNIX routing tables.

      A simple round-robin plan is envisioned for multiple interfaces. This can be solved in any specific case; we are not addressing the general case.

      This has not been fully addressed.
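
      A minimal sketch of the round-robin plan just mentioned, assuming the mover configuration simply lists the data interfaces; the interface addresses and the config shape are made up for illustration.

        # Sketch of round-robin selection over multiple data NICs, as
        # envisioned above.  The interface list would really come from the
        # mover configuration; these addresses are made up.
        import itertools

        data_interfaces = ["192.168.1.10", "192.168.2.10"]
        nic_cycle = itertools.cycle(data_interfaces)

        # Each new transfer binds its data socket to the next interface:
        for transfer_id in range(4):
            print("transfer %d -> bind to %s" % (transfer_id, next(nic_cycle)))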

    3.6) Ability to set defaults for many/most of the above parameters

    • storing of default values to be used for all transfers/work done for
      • a particular user/group
      • The default values are the logged-in user's default values

      • a particular file family
      • File families are inherited from the parent directory.

      • a particular storage location
      • Libraries are also inherited from the parent directory.

      • ? possibly others

    4) Management of the robot resources (including error recovery and tracking)

    • Maintenance of a queue of work to do in case of excess demand on the robot or on the tape drives
    • Enstore's library manager maintains a queue of active work and pending work.

    • Ability to specify policies governing the ordering and manipulation of that queue of work, and therefore the delay seen by the user, including (but not limited to)
      • specification of a priority for all work requested
      • encp option --priority

      • specification of a priority increment and delta time in order to implement a priority boost/aging algorithm (or equivalent mechanisms)
      • encp option --delpri and --agetime

      • specification of policy for dismounting of tapes after work completed
      • encp option --delayed_dismount

      • other possible parameters, to be decided based on tests/tuning of the system (a usage sketch combining these encp options follows at the end of this list)
    • Cleanup of work queue in case of errors and canceled requesting processes
    • Enstore has spent considerable time developing robust error handling plans. The queues are cleaned up on errors or canceled requests.

    • Allocation of tape drives to units of work in the queue
    • Mover computers can be assigned to a specific library or multiple libraries. They can be changed dynamically, by an administrator, to reflect changing load conditions.

    • Retry of failed file reads/writes up to specified maximum
    • Repeat attempts at failed work using alternate tape drive resources
    • Enstore's philosophy is to retry internally and only return to the user a success code or a fatal error. Much work has been done towards this goal.

    • Notation and tracking of all work done, all errors encountered, all retries performed
    • encp option --data_access_layer provides this functionality, including retries.
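
    Putting the queue-control options together, a request might be submitted as sketched below. The option names (--priority, --delpri, --agetime, --delayed_dismount) are those listed above; the values, their units, and the paths are arbitrary examples.

        # Sketch: submitting a transfer with explicit queue-control options.
        # Option names are from the list above; values, units, and paths are
        # arbitrary examples.
        import subprocess

        subprocess.call(["encp",
                         "--priority", "10",         # initial queue priority
                         "--delpri", "1",            # priority increment (aging)
                         "--agetime", "5",           # time between boosts (unit assumed)
                         "--delayed_dismount", "2",  # keep tape mounted after job
                         "/data/run1234.dat", "/pnfs/sam/top/run1234.dat"])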

    5) Movement of Files between users' machines/local disk and tape in the robot.

    • ability to transfer files from any network-connected machine to/from tape drive in robot
    • Assuming a sufficient network connection, the only requirements are the encp client and the pnfs namespace.

    • ability to transfer files reliably and with error detection, and correction by retry
    • This is inherent in Enstore's design.

    • ability to transfer files at > N% of raw tape bandwidth. N is probably about 50.
    • Assuming a sufficient network connection, data should stream to tape at the maximum tape speed. We have designed the mover module to get as much performance out of the hardware as we can. Operations such as mounting, spacing, rewinding, etc., slow the overall rate down and are somewhat beyond the control of Enstore.

    • nothing done to exclude the possibility of adding an intermediate disk cache layer to adjust rate of data movement from tape drive to end user data sink - should that become necessary
    • nothing done to exclude the possibility of cooperation with the data access layer as a distributed disk cache of recently requested data
    • Disk buffering has not been excluded, as far as we can tell.

    6) Operational procedures to run and manage the robot, tape drives and "databases"

    6.1) Robot and Tape Drive Hardware

    • Well defined and safe procedures for dealing with repair and maintenance of the robot itself
    • There is a maintenance contract on the robot.

      Three Enstore people have attended training in Denver on the EMASS robot.

      The ESH Department is involved and helping to specify safe operations. LOTO has already been instituted.

    • Procedures for monitoring the status of tape drives and for replacing faulty drives with drives which have been checked and tested through another well-defined process.
    • Enstore keeps track of drive errors and marks drives as "unavailable" when they exceed some limit. It is expected that an administrator will review the bad drives in the morning.

      Extensive 24 hour burn-in tests are planned for all repaired or replaced drives to ensure that the end-user sees high quality drives.

    6.2) Operator procedures for import/export of batches of tapes

    • Interface between Storage Management system and Operator work/console system
    • Definition of policies for executing batch imports/exports of tape volumes
    • Enstore plans on queuing work requests for volumes that are not in the robot into one of 3 categories:

      • Rejecting the request immediately
      • Requesting an operator mount of the volume
      • Storing the request into an Insert Volume Queue
      The policy for which one is chosen is TBD and will be up to the experiment.

      It is expected that an operator will visit the robot at most once/day and exchange at most 100 volumes per visit.

      This issue has not been addressed fully.

    6.3) Quality assurance procedures to assure integrity of the data and metadata

    • Routine backup of "database"
    • Routine live backups of the Enstore databases are almost finished.

      Pnfs backup is TBD.

      A SCSI RAID level 5 system is needed for extra protection.

    • Maintenance of a water-tight redo log of all transactions to be used in case of errors
    • All logs are kept online for at least 30 days and stored to tape after that. They are as complete as we can make them.

    • Recovery procedures in place, tested and executed in case of failures
    • We are working on recovery procedures and expect to have them completed shortly.

    • Ability to recover files and data on tape in case of complete and catastrophic loss of all "databases"
    • All volumes are self-describing. Metadata information can be recovered by scanning the tapes. [This has not been written!]

    • Routine checks on readability of sample of tapes - maintenance of statistics
    • Hopefully, normal transfers will provide a big enough sample to perform routine checks.

      This has not been fully addressed.

    8.2 SAM/Enstore Interface Notes

    Below are notes on the SAM/Enstore interface provided by D0. For the most part, Enstore complies in software features; exceptions are noted.

    SAM/Enstore Interface

    The SAM data access layer uses the command/executable provided by Enstore to issue file commands.

    The basic format of this command is one of the following:

    encp <input file> <destination directory in pnfs space>
    encp <file in pnfs file space> <output file>

    The exact syntax of the above may change somewhat, but that is immaterial here.

    The following enhancements have been requested and (we think) agreed to by Enstore.

    Request: Allow wild cards in input or output file spec. As each file arrives, some notification should be provided.
    Implementation: The notational issues in this item have not been addressed. Enstore provides cp-like list features, and the user can launch many encps. Input wildcarding is allowed and furnished by the user's shell glob capabilities. Output wildcarding is more problematic -- Enstore allows the user to specify an input list of files and an output directory; in this case the input names are used in creating the output files in the directory (i.e., similar to the UNIX cp command). Enstore will implement notification by writing a message to stdout.
    Rationale: Permits a number of files to be supplied or dispatched serially with one encp.

    Request: Allow a list of comma-delimited files in the input or output file spec, with notification of each file arrival (or dispatch) as for wild cards.
    Implementation: The notational issues in this item have not been addressed. Enstore provides cp-like list features; the delimiter in this case is a space. Notification after each transfer is provided with the --data_access_layer switch.
    Rationale: Permits a number of files to be supplied or dispatched serially with one encp.

    Request: At the end of each file transaction, provide information about the physical location of the file, its position on the tape, errors/retries, and which tape drive it was written on.
    Implementation: encp provides this with the --data_access_layer option.
    Rationale: This was originally discussed as being written to stdout along with informational messages about the state of the copy job. Latest thoughts appear to be to write all metadata related to the physical location of the file, and how it got there, into a separate but parallel pnfs file system, into a file of the same name (we think?). It is very convenient, when doing queries to gather information on files to optimize access patterns, and when making reports, to have all of the physical information on the files in the SAM Oracle file and event catalog. Multiple pnfs query calls would be awkward and unsymmetric with respect to files managed by SAM but not stored in the Enstore robot space.

    Request: Allow additional parameters on the Enstore 'copy' command to control the positioning of the job in the Enstore job queue. Initial priority, aging delta time, and priority increment would be sufficient.
    Implementation: encp provides this with the --priority, --delpri and --agetime options.
    Rationale: Exact implementation of the desired effect is left to Enstore. Whether, at a certain priority, a job becomes pre-emptive of a job already in progress is left for later stages of the project, after some experience with resource allocation. SAM needs some degree of control over the ordering and priority of jobs already submitted to the Enstore queue, in order to balance the flows of data and minimize job latency where necessary, but without rigid allocation of resources to particular access modes or projects.

    Request: At the end of each file transaction, provide information about the job which copied the file - dwell time in queue, final priority, robot arm wait time, file seek time, file transfer time and MBs, etc.
    Implementation: encp provides this with the --data_access_layer option.
    Rationale: This is now going to be available in the parallel pnfs metadata file system. This information is needed by the Global Resource Manager in order to feed into the algorithm which adjusts the rate of flow of jobs by access mode.

    Request: When an Enstore job fails because of a tape error or a failure of the receiving encp (or network, or whatever), the job queue of Enstore should be cleaned up appropriately.
    Implementation: Failed transfers are flushed from the Enstore queues.
    Rationale: SAM could live without this in the 1st implementation, but it would be nice to determine what the appropriate behavior is in each of the possible failure modes. Automatic retries are expected when a tape cannot be read or written in a particular drive, with the tape marked unreadable only if it has been tried in n drives. SAM does not wish to handle tape errors, tape statistics or retries - merely to note relevant information on the state of media and record the drive used in the File and Event Catalog.

    Request: If the STK robot and a couple of drives cannot be hooked up with an Enstore test system by October 1, then Enstore needs to emulate the delays of a robot for tape mount, file seek time, and file transfer time, in order to test the Global Resource Manager.
    Implementation: SAM has used Enstore to write to both the STK and EMASS robots. Part of this is already implemented as a 'simple' model. Whether it is adequate is not yet known - it is not installed yet, and SAM has not tried it.
    Rationale: It is essential to simulate queuing for scarce resources - the tape drive and the network bandwidth.
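
    To illustrate the cp-like list semantics referred to in the rows above, a multi-file submission looks like the following sketch. The paths are hypothetical; in real use the shell would expand any input wildcards before encp sees them (glob stands in for the shell here).

        # Sketch: cp-like multi-file encp, as described in the rows above.
        # Input lists are space delimited (the shell expands wildcards), the
        # last argument is an output directory, and the input names are
        # reused within it; paths are hypothetical.
        import glob, subprocess

        inputs = sorted(glob.glob("/data/run12*.dat"))
        subprocess.call(["encp", "--data_access_layer"] + inputs
                        + ["/pnfs/sam/top"])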

    8.3 Other D0, non-SAM Requests to Enstore

    Besides the preceding requests from SAM, other D0 experimenters have suggested features, based on their experience, that would make Enstore more usable. We consider these requests valid and will try to implement them; however, since some of the requests are outside of the main SAM framework, we assign them a lower priority, and in the cases where they conflict with the mainline SAM architecture, we will start them only after all SAM requests have been satisfied and extensively tested.
    • Enstore should assign drives randomly, not just the first few drives.

      Enstore uses all drives, not just the first ones, to allocate work.

    • Users, not just administrators, need easy access to the information about successful and failed mounts and transfers.

      Enstore logs will be available on the web for users to inspect. We welcome help!

    • A separate log with just mounts and a success/failure code would be useful.

      Enstore will provide this capability, also available on the web.

    • All information in a log needs to be on a single line, otherwise it is impossible for mortals to parse.

      Enstore logs are all single, sometimes very long, lines.

    • Enstore needs a procedure (certify job) that replaced drives must pass before they are put back into service. This procedure should last around 24 hours and should exercise all functions. It is foolish to put drives back in service and have them fail again right away.

      This is planned.

    • Broken drives should be replaced. This budget should rest outside of the experiment.

      Enstore agrees but doesn't set this policy.

    • Enstore should be able to deal in volumes, i.e., transfer a whole tape to disk and vice versa.

      All SAM traffic is with files, so this request is contrary to the SAM architecture. It is currently possible to write a list of files to a tape. It is also possible to query the system to list the files that are on a tape, and then use that list to copy all the files to the disk. We believe these methods to be adequate. Otherwise, this issue will have to be developed and represents new work.

    • D0 needs scripts to create Enstore-importable tapes.

      Enstore will provide these.

    • D0 is not interested in file sets. The impact on the databases is not so great that all the entries can't just be entered.

      Enstore will not implement any file set features.

    • Enstore needs the capability to override its priority queue and do things in the exact order requested.

      This request is outside the mainline SAM architecture, which has indicated it wants priorities and optimal traversals of tapes. Moreover, this request is not straightforward and requires defeating the Enstore system in many ways, so we'd prefer not to do it. More explicit needs can be addressed as they arise.

    • Enstore needs to guarantee a certain set of files are grouped on tape.

      This can be done with ephemeral file families that have a width of 1. Or, it can be done if only one user is writing to the file family at a time.
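
      A sketch of the grouping recipe just described, reusing the pnfs tag convention shown in section 8.1: give the destination directory a unique (ephemeral) file family with width 1, then submit the files in one encp. The tag names and paths are assumptions, as before.

        # Sketch: keep a set of files together on tape by writing them through
        # a unique, ephemeral file family of width 1 (tag names assumed, as in
        # the earlier tag sketch).
        import os, subprocess, time

        pnfs_dir = "/pnfs/sam/top/groupA"   # hypothetical destination

        def write_tag(name, value):
            with open(os.path.join(pnfs_dir, ".(tag)(%s)" % name), "w") as f:
                f.write(value + "\n")

        write_tag("file_family", "groupA_%d" % int(time.time()))  # unique family
        write_tag("file_family_width", "1")      # a single write stream

        subprocess.call(["encp", "/data/a.dat", "/data/b.dat", "/data/c.dat",
                         pnfs_dir])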

    • Enstore needs to duplicate important data.

      This is part of the overall architecture, but it is not yet developed.

    • Bad tapes need to be flagged for SAM to handle. Maybe mark pnfs filenames so transfer is never tried? What does DESY do?

      This is TBD. We will discuss it with DESY during our upcoming visit.

    • Direct RPC calls to the pnfs server might be nice so a user would not have to mount pnfs.

      This is outside the main SAM architecture and there are no current plans to implement this.


    9 WBS and Effort Estimates

    The original plan for Enstore was to start with a working prototype, representing about 6-8 months of effort, and evolve the code into the initial release of the Enstore product for Run II. This approach has been followed. There has been substantial overlap and simultaneous development in all phases of the WBS plan. One of Enstore's original goals was to have a working system during all phases of the development. This, too, has been achieved. This goal has slightly lengthened, and sometimes constrained, overall development. (For example, incompatibilities between the original file/volume database design and the current one led to extra effort to allow simultaneous operation of both designs.) Overall, however, the ability of D0 to test the Enstore software as it needed to provided valuable feedback, allowing us to allocate effort to problem areas.

    Generally, the project is on track with its original estimate. I believe Enstore needs its current work force of 6 people (Bakken, Berman, Huang, Moibenko, Rechenmacher, Ruthmansdorfer), or their equivalents, until May 99. After May, the Run II effort could drop to 4 people. I expect to be able to deliver a fully Run II functional version of Enstore by the end of July 99. At that point, I expect serious integration with D0 to be well underway. Depending on how well this commissioning goes, and on what further requirements and enhancements are deemed necessary, the Run II Enstore effort could then drop to 2-3 people.

    Beyond the Run II work, Enstore will require effort to fulfill its Computer Division strategic role in Mass Storage for the Laboratory. It is expected that this work will begin in the Summer 99 time frame. The scope and requirements of this effort are not yet fully determined.

    The basic design philosophy of Enstore is to use layered products when possible. There are 2 products which need to be updated or enhanced to make Enstore fully functional:

    • OCS for Run II
      • Enstore plans on using OCS for the few operator mounts that are required for Run II operations. OCS needs to be streamlined to allow this capability.
      • OCS has been deemed the long term repository of tape and drive statistics. A mechanism for sending these statistics will be needed.
    • FTT optimization for selected drive and media
      • FTT is the Fermilab source of all media and drive knowledge. This product will have to be enhanced to use all the capabilities (for example, partitioning) of the drive chosen by the Serial Media Working Group for Run II.

    Below is the current Storage Management WBS. Instructions were to quit working on it once it was correct to a factor of 2; I believe the actual estimate is good to 0.5. I have attempted to fill in the percentage-complete column of the WBS in steps of 25%. Items listed as 0% generally mean that no coding has been done for the item; it does not mean that no thought has gone into it. Some thought has gone into each item.

    ID Task Name Duration Start Finish Resource Names % Complete
    1 Storage Management 151 wks Mon 6/1/98 Fri 4/20/01   24%
    2 Management - ongoing 50% of JAB 120 wks Mon 6/1/98 Fri 9/15/00 JPP[50%] 0%
    3 D0 Liaison - ongoing 120 wks Mon 6/1/98 Fri 9/15/00 JPP[20%] 0%
    4 Hardware and Interface Problem Resolution - ongoing (DJH) 80 wks Thu 1/7/99 Wed 7/19/00 JPP[20%] 25%
    5 Working Operations 8.6 wks Sat 5/1/99 Thu 7/1/99   0%
    6 Working OCS for operator mounts 0 wks Sat 5/1/99 Sat 5/1/99   0%
    7 Working Interface to Tape/Drive Repository 0 wks Sat 5/1/99 Sat 5/1/99   0%
    8 Working FTT with drive chosen by serial media working group 0 wks Tue 6/1/99 Tue 6/1/99   0%
    9 Working Drives and Media in Robot for D0 0 wks Thu 7/1/99 Thu 7/1/99   0%
    10 Enstore V1 56.2 wks Mon 6/1/98 Mon 6/28/99 JPP[450%] 51%
    11 Organization and Methods of Working     56.2 wks Mon 6/1/98 Mon 6/28/99   69%
    12 Packaging methods 4 wks Mon 6/1/98 Fri 6/26/98   75%
    13 Coding standards 4 wks Mon 6/1/98 Fri 6/26/98   100%
    14 Development tools 4 wks Mon 6/1/98 Fri 6/26/98   100%
    15 Bug reporting and tracking procedure (GNATS) 4 wks Tue 6/1/99 Mon 6/28/99   0%
    16 Requirements                         33 wks Mon 10/26/98 Fri 6/11/99   69%
    17 Input and specification from experiments 12 wks Mon 10/26/98 Fri 1/15/99   75%
    18 Input and specification from mss groups and operators 6 wks Mon 12/7/98 Fri 1/15/99   75%
    19 Understand commonality of requirements and iterate 6 wks Mon 12/7/98 Fri 1/15/99   75%
    20 Understand interfaces to other Run II Projects 6 wks Mon 12/7/98 Fri 1/15/99   75%
    21 Understand testing dates and scope 6 wks Mon 12/7/98 Fri 1/15/99   75%
    22 Understand hardware constraints 6 wks Fri 1/22/99 Thu 3/4/99   100%
    23 Agree on change control mechanisms 6 wks Mon 5/3/99 Fri 6/11/99   0%
    24 Evolution of Prototype to Run II Product  51.2 wks Wed 7/1/98 Wed 6/23/99   47%
    25 Client server framework 6 wks Mon 8/3/98 Fri 9/11/98   100%
    26 Communications protocol and errors 6 wks Mon 8/3/98 Fri 9/11/98   100%
    27 Robustness 6 wks Mon 8/3/98 Fri 9/11/98   100%
    28 Error handling philosophy 16 wks Mon 8/3/98 Fri 11/20/98   75%
    29 Component Retries 4 wks Mon 8/3/98 Fri 8/28/98   75%
    30 End-to-end recovery 4 wks Mon 8/31/98 Fri 9/25/98   75%
    31 Fault tolerance and availability 4 wks Mon 9/28/98 Fri 10/23/98   75%
    32 Reliability 4 wks Mon 10/26/98 Fri 11/20/98   75%
    33 Encp framework  12.8 wks Tue 9/1/98 Fri 11/27/98   83%
    34 Design evaluation 4 wks Tue 9/1/98 Mon 9/28/98   75%
    35 Options and switch analysis 4 wks Tue 9/1/98 Mon 9/28/98   75%
    36 Optimization 4 wks Mon 11/2/98 Fri 11/27/98   75%
    37 Binary distribution studies 6 wks Thu 10/1/98 Wed 11/11/98   100%
    38 Improvements to Servers/Clients and Clerks Design 35 wks Tue 9/1/98 Mon 5/3/99   71%
    39 Configuration server and clients 6 wks Thu 10/1/98 Wed 11/11/98   75%
    40 Library manager and clients 20 wks Tue 12/15/98 Mon 5/3/99   75%
    41 Media Changer and clients 10 wks Tue 12/15/98 Mon 2/22/99   50%
    42 Volume clerk and clients 6 wks Mon 11/2/98 Fri 12/11/98   75%
    43 File clerk and clients 6 wks Mon 11/2/98 Fri 12/11/98   75%
    44 Log server and clients 2 wks Tue 9/1/98 Mon 9/14/98   100%
    45 Mover Modifications 19 wks Mon 11/2/98 Fri 3/12/99   38%
    46 File wrappering - self describing, different types, etc 8 wks Mon 11/2/98 Fri 12/25/98   50%
    47 Optimization 8 wks Mon 11/2/98 Fri 12/25/98   75%
    48 Read/Write Entire Volumes 6 wks Mon 2/1/99 Fri 3/12/99   0%
    49 FTT - new drives to support 4 wks Mon 2/1/99 Fri 2/26/99   0%
    50 Testing Framework 18.2 wks Mon 11/2/98 Mon 3/8/99   63%
    51 Debug and integration framework 12 wks Tue 12/15/98 Mon 3/8/99   50%
    52 Configure Test Hardware Platform 4 wks Mon 11/2/98 Fri 11/27/98   100%
    53 Database framework  17 wks Mon 11/2/98 Fri 2/26/99   45%
    54 Evaluation of underlying database choice 8 wks Mon 1/4/99 Fri 2/26/99   25%
    55 User Queries 4 wks Mon 11/2/98 Fri 11/27/98   25%
    56 Fault tolerance 4 wks Mon 11/2/98 Fri 11/27/98   75%
    57 Backup 4 wks Tue 12/15/98 Mon 1/11/99   75%
    58 Admin tools  48.6 wks Wed 7/1/98 Fri 6/4/99   26%
    59 Pnfs 8 wks Wed 7/1/98 Tue 8/25/98   75%
    60 Web status 22 wks Mon 11/2/98 Fri 4/2/99   25%
    61 User queries and reports 22 wks Mon 1/4/99 Fri 6/4/99   25%
    62 System and Tape Monitoring and Statistics (Patrol) 14 wks Mon 2/1/99 Fri 5/7/99   0%
    63 Volume Import/Export 8 wks Mon 2/1/99 Fri 3/26/99   0%
    64 Facility to Export/Eject Tapes from EMASS Robot 4 wks Mon 3/1/99 Fri 3/26/99   0%
    65 Facility to Import Foreign Tapes to EMASS Robot 4 wks Mon 2/1/99 Fri 2/26/99   0%
    66 Security, with respect to Fermilab Policy 12 wks Thu 4/1/99 Wed 6/23/99   0%
    67 Data protection, Authentication, Access 12 wks Thu 4/1/99 Wed 6/23/99   0%
    68 Accidents 12 wks Thu 4/1/99 Wed 6/23/99   0%
    69 Documentation 24 wks Mon 7/6/98 Fri 12/18/98 JPP[600%] 50%
    70 Integration 33.4 wks Fri 1/1/99 Mon 8/23/99 JPP 9%
    71 Integration with Experiment RIP and Production Farms 28 wks Fri 1/1/99 Thu 7/15/99   25%
    72 Integration with Experiment Data Handling and Analysis 28 wks Fri 1/1/99 Thu 7/15/99   0%
    73 Commissioning 12 wks Tue 6/1/99 Mon 8/23/99   0%
    74 Tuning 10 wks Tue 6/1/99 Mon 8/9/99   0%
    75 Enstore V2 16 wks Mon 8/23/99 Fri 12/10/99 JPP[250%] 0%
    76 Support for Commissioning of D0 before run starts 16 wks Mon 8/23/99 Fri 12/10/99   0%
    77 Addition of new features as discovered 16 wks Mon 8/23/99 Fri 12/10/99   0%
    78 Enstore V3 16 wks Tue 1/4/00 Mon 4/24/00 JPP[250%] 0%
    79 Support for run 16 wks Tue 1/4/00 Mon 4/24/00   0%
    80 New features discovered when there is beam 12 wks Tue 1/4/00 Mon 3/27/00   0%
    81 Ongoing Support 52 wks Mon 4/24/00 Fri 4/20/01 JPP 0%


    10 Year 2000 Issues

    Problems associated with two-digit year differences fall into 3 broad categories for the Enstore project: