A History of Supercomputing at Florida State University
Jeff Bauer
(Written in early 1991)
INTRODUCTION
As the result of an unsolicited proposal by the Florida State University (FSU)
to the U. S. Department of Energy (DOE), a collaborative agreement was initiated
in late 1984 between the State of Florida, FSU, DOE and Control Data Corporation
(CDC). This first brought about the creation of the FSU Supercomputer Computations
Research Institute (SCRI) followed by the delivery in March 1985 of interim
computer hardware to the FSU Computing Center (FSUCC).
By April, the CDC Cyber 205 supercomputer became available to SCRI associates,
other DOE researchers around the U. S. and the State University System of Florida.
The intention was to get both users and support personnel up-to-speed on the
Cyber 205, which was running the VSOS operating system, in advance of the installation
of ETA-10 Serial # 1. In its final configuration, the ETA-10 would provide twelve
times the processing power and a compatible VSOS environment. The plan assumed
that after a successful period of acceptance testing on the ETA-10, the Cyber
205 and its file server (a Cyber 835) would be returned to CDC.
The first ETA-10 processor was shipped from St. Paul, Minnesota on December
31, 1986. The next year, 1987, was principally a year of installations and monitor
mode testing before an operating system was available on the machine. The ETA
Operating System (EOS) VSOS environment had become sufficiently mature by January
1988 to allow early user access. However, late in that year, EOS was overtaken
in functionality by a native UNIX system, ETA System V, based on the UNIX System
V (release 3.0) operating system licensed by AT&T. In April of 1989 the four
processor ETA-10G, with the shortest clock cycle time yet (7 ns), was installed.
Ironically, the week after the completion of the installation of the G, the
parent company of ETA Systems, Control Data, shut down the ETA operations.
The demise of ETA Systems, Inc. forced a re-evaluation of the
continued use of ETA hardware. An agreement was reached between Control
Data, FSU, and Cray Research, Inc. to exchange the ETA-10G for a four
processor Cray Y-MP. The ETA-10G remained in production until March of 1990. A
two processor ETA-10Q provided an interim platform until the installation
of the Cray Y-MP was completed in early April of the same year. The ETA-10Q
was available until November of 1990 as an additional computing resource
alongside the Y-MP.
In February of 1990, a Connection Machine-2 was installed at SCRI, providing
a different style of supercomputing: massively parallel processing. Researchers
and scientists from a variety of disciplines are using the CM-2 to investigate
parallel algorithms in high-energy physics, lattice gauge theory, and materials
science.
All of these supercomputers provided unique features and abilities that have
enhanced the ``high-end'' computer capabilities at FSU for the past
six years. In addition, the FSU supercomputer experience gave rise to a
number of ``lessons learned'' that are mentioned later, especially with
respect to the ETA experiment.
The Control Data Cyber 205
1985 : Our First Supercomputer
In March, 1985, the Computing Center took delivery of its first
supercomputer, a CDC Cyber 205, along with a front-end file server
CDC Cyber 835. The Cyber 205 system included: CPU with 20-nanosecond
clock cycle, 2 vector pipelines, 32 megabytes of central memory and
7.2 gigabytes of on-line disk storage. The Cyber 205 had a theoretical
peak performance of 200 MFLOPS, and a LINPACK rating of 17 MFLOPS.
The Cyber 835 system added another 20 gigabytes of on-line disk. In
addition, four 6250 bpi on-line tape drives were shared between the
Cyber 205 and the Cyber 835. Communications between the Cyber 205,
its peripherals, and various front-end mainframes were handled by a
Loosely-Coupled Network (LCN) consisting of four separate coaxial
trunks. In its final configuration, the LCN connected the Cyber 205
to two CDC Cybers, two DEC VAX computers, the ETA-10 and an IBM
mainframe.
By April, 1985, the Cyber 205 was running production code from
local researchers and DOE researchers around the country. The
operating system software was soon upgraded to VSOS 2.2,
providing more features and increased stability. Languages
supported were: FORTRAN (with vectorizing pre-processor), C with
vector extensions, and the Cyber 205 assembly language. Numerous
mathematical packages such as CERN, IMSL and MAGEV were
installed, as well as the DI-3000 and NCAR graphics packages.
Local System Enhancements
During the first few months of production use, the operating
system was gradually tailored to better meet the needs of our
user base. Job categories with various system resource limits
were created to maximize throughput for our particular job mix.
Modifications to the operating system were made in job scheduling
and accounting. Utilities were written so that researchers could
better manage allocated CPU time and batch job execution.
In the area of job scheduling, two system modifications were
made to prevent any single user from monopolizing a job
category. The first mod changed the way in which jobs in the
input queue were processed. In standard VSOS, jobs were
processed on a first-in-first-out basis, thus allowing a user
who submitted a large number of jobs at once to block others
from running. This mod caused the input queue to be processed
on a round-robin basis by user number. The second mod, obtained
from Purdue University, limited the number of jobs a user could
execute simultaneously in a job category.
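The two scheduling changes can be illustrated with a short Python sketch. This is not the VSOS or Purdue code, neither of which is reproduced in this history. The function names, the tuple representation of a job and the limit of two concurrent jobs are all assumptions made for illustration.

```python
from collections import OrderedDict, deque


def round_robin_order(input_queue):
    """Reorder a FIFO input queue so that users take turns.

    input_queue is a list of (user_number, job_name) tuples in submission
    order. This representation is hypothetical, not the VSOS structure.
    """
    # Group jobs by user, keeping users in order of first submission.
    per_user = OrderedDict()
    for user, job in input_queue:
        per_user.setdefault(user, deque()).append(job)

    ordered = []
    while per_user:
        # One pass takes at most one job from each user still waiting.
        for user in list(per_user):
            ordered.append((user, per_user[user].popleft()))
            if not per_user[user]:
                del per_user[user]
    return ordered


def may_start(user, running, limit=2):
    """Cap the number of jobs one user runs at once in a job category.

    running lists the user numbers that own the executing jobs. The
    limit of 2 is an assumed value; the text does not give the real one.
    """
    return running.count(user) < limit


fifo = [(101, "a1"), (101, "a2"), (101, "a3"), (202, "b1"), (303, "c1")]
print(round_robin_order(fifo))
```

With the sample queue, user 101's three jobs no longer run ahead of the single jobs from users 202 and 303, which is the blocking behavior the first mod was meant to remove.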
In November, 1985, variable rate accounting was implemented.
Jobs of different classes were charged different rates and were
allocated resources accordingly. Three job classes were defined:
standby, normal and high priority. Standby class jobs received
one fifth the normal time slice and were charged one fifth the
normal rate of one System Billing Unit (SBU) per CPU second.
High priority jobs were automatically expedited by the system,
given a time slice five times normal, and charged five times
the normal rate. To help researchers keep track of their CPU
allocation, a message was written to the dayfile at job
termination giving the number of SBUs remaining.
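The charging rule can be written out as a small Python sketch. Only the three rate multipliers and the base rate of one SBU per CPU second come from the text; the class keys, function names and the wording of the message are hypothetical.

```python
from fractions import Fraction

# Charge multipliers relative to the normal rate of 1 SBU per CPU second.
# The key names are illustrative, not the actual VSOS class identifiers.
RATE = {
    "standby": Fraction(1, 5),
    "normal": Fraction(1),
    "high": Fraction(5),
}


def charge(job_class, cpu_seconds):
    """Return the SBUs charged for a job of the given class."""
    return RATE[job_class] * cpu_seconds


def remaining_message(allocation, job_class, cpu_seconds):
    """Build a dayfile-style line; the message format is an assumption."""
    left = allocation - charge(job_class, cpu_seconds)
    return f"SBUS REMAINING: {float(left):.1f}"


for job_class in ("standby", "normal", "high"):
    print(job_class, float(charge(job_class, 1000)))
print(remaining_message(10000, "high", 1000))
```

Under these rates, the same 1,000 CPU seconds cost 25 times more at high priority than at standby.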
Some stand-alone utilities were also written to aid in CPU
allocation management and batch job tracking. Companion
programs MOVETIME and LISTTIME allowed the ``master user'' of
an account to transfer time between sub-ordinals of that
account, and list the time remaining for the sub-ordinals.
SUPSTAT periodically sent snapshots of various console displays
to the front-end. The information included queue statuses, disk
and CPU utilization, and system uptime. SUPDROP allowed a user
to drop one of his supercomputer jobs from the front end after
supplying the appropriate validation data (e.g. user number and
password).
Availability and Usage
The Cyber 205 remained in service until October 24, 1989. During its 4 1/2
years at FSU, the Cyber 205 was available for use over 38,000 hours (95% of
wall-clock time, and 97.6% of scheduled uptime). The addition of an Uninterruptible
Power Source (UPS) in December, 1986, significantly decreased downtime due to
power outages, from 148 hours in fiscal 1986 to only 10 hours in fiscal 1987.
The mean-time-between-failure rate went from one failure every 35 hours before
the UPS was installed to one every 127 hours afterward.
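As a rough consistency check on these figures, the following Python sketch derives the totals they imply. It treats ``over 38,000 hours'' as exactly 38,000, so the results are approximate lower bounds, and all variable names are illustrative.

```python
# Figures quoted in the text; 38,000 is used as a lower bound.
AVAILABLE_HOURS = 38_000
WALL_CLOCK_FRACTION = 0.95     # available time / wall-clock time
SCHEDULED_FRACTION = 0.976     # available time / scheduled uptime
HOURS_PER_YEAR = 365.25 * 24

# Totals implied by the two percentages.
wall_clock_hours = AVAILABLE_HOURS / WALL_CLOCK_FRACTION
scheduled_hours = AVAILABLE_HOURS / SCHEDULED_FRACTION
years_in_service = wall_clock_hours / HOURS_PER_YEAR

# MTBF before and after the UPS, using the rounded values in the text.
mtbf_gain = 127 / 35
# Power-outage downtime, fiscal 1986 versus fiscal 1987.
outage_hours_saved = 148 - 10

print(f"implied wall-clock hours: {wall_clock_hours:,.0f}")
print(f"implied scheduled hours:  {scheduled_hours:,.0f}")
print(f"implied years in service: {years_in_service:.2f}")
print(f"MTBF improvement factor:  {mtbf_gain:.1f}")
print(f"outage hours saved:       {outage_hours_saved}")
```

The implied service life comes out close to the 4 1/2 years stated, so the availability percentages and the hour count agree with each other.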
From the day it came up for production until the day it was shut down, the
Cyber 205 CPU was in use 96% of the time it was available. After the implementation
of the standby job category in November, 1985, it was in use over 98% of the
time available. Though not cutting-edge technology (ours was the next to last
one built), the Cyber 205 was a consistent performer, and provided a reasonably
stable supercomputing environment for researchers at FSU and around the country.
The ETA Systems ETA-10
1987 : BEFORE AN OPERATING SYSTEM
Installation of the first prototype ETA-10 processor began at the FSU
Computing Center on January 5, 1987. The clock cycle time for this CPU was
12.5 nanoseconds. Within two weeks it was running, in monitor mode, a FORTRAN
job transferred from the Cyber 205. This process required the compilation of
source code and subsequent loading of object code on the Cyber 205 to create a
controllee, or executable file, which was then transferred to the ETA via an
Apollo workstation. The binary would then be loaded into memory on the ETA-10
and run directly on hardware, not under the control of an operating system.
Limited output could be obtained from the equivalent of core dumps.
A second CPU arrived in the Spring, and by Summer, a four processor (12.5 ns
clock) configuration was in place. No user access was available at this stage,
but the FSU installation team was able to perform some benchmarking and special
purpose testing.
It was in the fall of 1987 that the machine was upgraded to full ETA-10E
specifications with a 10.5 nanosecond clock cycle time, 4 million words of local
memory per CPU, 128 million words of shared memory and 14.4 billion bytes of
online disk. In October, an Ising model was running in multiprocessing monitor
mode, achieving a new world record for performance (6 giga-flips per spin).
Table 1 from the LINPACK report showed the ETA-10 leading the list for
performance on the full precision, all FORTRAN benchmark. The top three entries
were ETA-10E: 52 MFLOPS, NEC SX-2: 43 MFLOPS, and Cray X-MP/4: 39 MFLOPS.
Work had begun in St. Paul on the `W series' of the developmental EOS operating
system and, by the end of the year, prototypes were being evaluated at FSU. In
the meantime, supercomputer support personnel from FSUCC and SCRI were still
concentrating on the production Cyber 205 service. The user base had built up
while expertise had been gained and passed on. FSUCC consolidated operations in
December with the relocation to Sliger Building, Innovation Park, of the Cyber
205 and support staff.
It is worth noting that the new home for FSUCC had essentially been built
around the ETA-10 which was already in place.
1988 : ETA OPERATING SYSTEM - VSOS
By January 1988, the W15 pre-release of EOS was considered mature enough for
official release and an ``early user access'' program began. This system was
billed as providing a fairly full and relatively stable VSOS environment along
with local batch/interactive access, RHF file transfer and a state-of-the-art
vector preprocessor (ETA VAST-2). It was released on February 12th as EOS 1.0.
X and Y series versions of the operating system were under development by ETA
to become EOS 1.1 and 1.2, respectively, over the next two quarters.
At the end of April, EOS 1.1 was installed at FSU. Features included a
remote batch facility, large page support and improved handling of multiple
users per CPU. Interim EOS releases were received over the next few months with
EOS 1.1A in May, 1.1B in July and culminating in EOS 1.1C on September 1.
The main emphasis here was on operating system stability and some benefit was
observed. However, with more than two users on a processor simultaneously,
software crashes were still expected frequently under EOS 1.1C. Furthermore,
EOS 1.2 was now looking to be a distant prospect, this system being the first
expected to provide multiprocessing support.
It was around this time that plans were made to evaluate ETA System V, the
active port by ETA of the UNIX operating system licensed by AT&T. A task force
was assembled by FSU and dispatched to St. Paul on September 9, under the observation
of DOE. After two solid days of testing on a single liquid-cooled ETA-10 processor,
the team returned to Tallahassee to report its findings. The conclusion was that
the improved stability, response and usability of the pre-released UNIX system
outweighed the production performance losses, notably in the area of I/O, against
EOS/VSOS. FSU was assured that certain ``buggy features'' under UNIX would be
eliminated in the released version.
FSU support personnel were galvanized into action after the official release
on October 3 of UNIX 1.0. The EOS service was discontinued at FSU on
October 23 and UNIX installed. By then, UNIX 1.0a was available and
extensive testing began within the week. Users were given access on November
7 and brief FSUCC Technical Information Packets (TIPs) were distributed. Areas
covered included an introduction to UNIX on the ETA-10, the ftn77 compilation
system, ed and vi editors, telnet and FTP. Ftn77 provided an integrated system
with access to a pre-processor, the ETA VAST-2 vectorizer, the FTN77 compiler
(like FORTRAN 200 on the Cyber 205) and a library link editor.
A single processor, air cooled ETA-10Q (19 ns clock) was deployed on December
12 initially to help those ETA users with short term migration problems. The
``piper'', as the air cooled ETA-10 was known, ran the latest version of EOS,
namely 1.1C, and allowed two or three local researchers the chance to complete
calculations that would not otherwise have been possible on a busy Cyber 205.
The system proved to be a success for a user community of this size with production
VSOS jobs.
1989 : ETA SYSTEM V - UNIX
Before moving completely into 1989, we mention that UNIX 1.0b became available
at FSU on December 29. Some improvement in stability was observed, although few
``UNIX bugs'' were fixed. It is worth recalling that FSUCC maintained an ETA
UNIX bugs list from the outset which, by the end of the year, had grown to some
fifty local entries.
Despite its problems, UNIX was really working out much better for the FSU
supercomputer community at large. The system was more versatile and supported
more users at any one time than EOS was capable of. Often, a single processor
would be running a dozen interactive sessions, three NQS local batch jobs and
the occasional background process. As a result, system stability remained
much the same as with EOS, which could only support a couple of jobs concurrently
at best. The main difference has proven to be that, under UNIX, the ETA-10 was
an interactive supercomputer while, under EOS, the ETA-10 could only be used
as a remote batch machine.
Progress was being made in St. Paul on the ETA-10G, the four processor (7
ns clock) replacement for the ETA-10E, so plans were made for interim access
to UNIX. EOS was discontinued on the ETA-10Q on February 21, and the machine
removed, while installation of a dual processor ETA-10Q ``piper'' began February
27. A pre-release of UNIX 1.1 was mounted for testing on this new piper that
provided improved I/O performance, a factor roughly 5-10 times better. This
was mostly realized for shared memory to/from disk transfer, although paging
from local to/from shared memory had also become more efficient. Moving users
from a four processor E series machine to a two processor Q was obviously going
to be difficult due to a four-fold reduction in processing power. It was planned
that user disks would be moved temporarily onto the piper in advance of the
G series machine becoming available.
The ETA-10E was removed from service on March 16, and the next day users had
access to pre 1.1 on the piper. The system coped quite well but it was
fortuitous that several users took a break from supercomputing for a month.
Installation of the ETA-10G began on March 28, by which time its predecessor
was out of the way. During the first quarter of 1989, FSUCC supercomputer
support staff had been devising a new user guide known as the ``ETA-10 Quick
Book''. This was completed at the end of March and distributed to all principal
investigators.
As we were planning the announcement of the migration of users to the ETA-10G,
word was received on April 17 that Control Data Corporation had closed ETA Systems
and terminated its employees. This news came as a shock to FSU who had entered
formal negotiations with CDC regarding a potential hardware upgrade to the ETA
equipment. In the meantime, it was to be ``business as usual'' at FSUCC and
users were duly given access to UNIX on pre 1.1 on the ETA-10G on April 21.
The ETA-10G continued to provide supercomputing UNIX cycles until deinstallation
and replacement with a Cray Y-MP in March of 1990, detailed in the next section.
The ETA-10Q was used as an interim machine between the deinstall of the G and
the install of the Y-MP. The Q continued to provide supercomputing UNIX cycles
with no hardware or software maintenance until hardware problems forced a shutdown
in November of 1990. It proved to be a useful platform for researchers still
requiring ETA-10 cycles while porting their applications to the Y-MP and ran
for quite a few months under ``local support''. Some of the peripheral devices
taken from the Q have proven useful on other computers.
The Cray Y-MP
1990 : Installation
On November 15th, 1989 it was announced that an agreement had been reached
between FSU, Control Data, and Cray Research that the existing ETA-10G would
be exchanged for a comparably-equipped Cray Y-MP, to be manufactured and
delivered by Cray in late February and early March of 1990. Over the next few
months, FSUCC, SCRI and Cray personnel hammered out the fine details of the
exact configuration and generated a time line of events.
The first item on the time line was training on UNICOS installation for
Systems group members, which took place between February 19th and February
21st at Cray's training facility, located in Eagan, Minnesota. The
Systems group trainees then went to Cray's manufacturing checkout facility
in Chippewa Falls, Wisconsin and, along with Cray analysts, installed UNICOS
5.1 on the yet-to-be shipped Cray Y-MP, serial number 1513.
Events started to really pick up at the end of that week. On March 9th, the
disks and tape drives connected to the ETA-10G were moved onto the ETA-10Q2
and Control Data began the removal of the ETA-10G equipment. By March 10th,
plumbers and electricians were busy at work running power lines and pipes for
the Cray. This work required some machine room downtime due to the interruptions
of basic services, like chilled water and electric power, but all were completed
and the machine room back in shape by March 12th.
The next two weeks saw the orderly installation of the Cray support
equipment, including the motor generator and condensing unit. The
raised floor was rebuilt to support the Cray as the ETA-10G's footprint
was larger and did not use the raised floor for support. Since the Cray
is totally supported on the raised floor, additional pedestals and tiles
were installed.
The Cray and all of its peripherals arrived on Monday, March 26th. By
March 28th, the mainframe and peripherals were installed and powered up.
Engineers then spent the next few days going through the exhaustive hardware
checkout and testing process. Cray analysts then flocked to the machine
and, using the file systems and UNICOS kernel built earlier in Chippewa Falls,
quickly brought up the operating system for software checkout and testing.
The final piece put into place was the installation and checkout of the
Network Systems Hyperchannel gear. This would provide ethernet access to the
Cray, as well as a high speed link to the SCRI VAX 8700. FSUCC, Cray, SCRI,
and Network Systems personnel worked diligently over the weekend of
March 31st to achieve this milestone.
On April 5th Cray was satisfied with the installation and turned the machine
over to FSUCC analysts who began the customization process. The next
day files were copied from the ETA-10Q2 via magnetic tape and user names
created using the password files from piper0 and piper1. The machine was
almost ready for users.
On April 9th, as originally scheduled, the Cray became officially available for
production with a pre-installed user base of files and user names, although
some researchers had been running production programs since April 5th.
1990-1991 : A PRODUCTIVE FIRST YEAR
In August of 1990, an additional 10 GB of on-line mass storage was
added to the Cray.
In late March or early April of 1991, the Cray will be upgraded to
UNICOS 6.0.
The first year of supercomputing on the Cray Y-MP ends on a high note. The
machine has enjoyed better than 99% of scheduled uptime and usage has been
high, with greater than 90% of the time available being busy.
The dramatic difference between the maturity of UNICOS on the Cray versus
ETA System V has contributed to a much more stable software platform and
enhanced use of the supercomputing environment. All aspects of FSU
supercomputing have been aided by the Y-MP presence, including networking,
user training and documentation, vendor support, systems monitoring and
tuning, and overall hardware reliability.
The Connection Machine
In addition to the more traditional vector supercomputers mentioned
earlier, FSU installed in February of 1990 a massively parallel SIMD (Single
Instruction Multiple Data) Connection Machine-2 from Thinking Machines
Corporation. The CM-2 was installed on the fourth floor of the Dirac Science
Library, within the domain of SCRI.
The CM-2 is a 16 dimensional hypercube interconnect of 65,536 single-bit
processors, with an additional 2,048 64-bit floating point processors
available. It is connected to a front end machine, a VAX 6420 running
Ultrix. A 10 GB parallel disk array, the Data Vault, is available for
mass storage requirements and a high speed video frame buffer provides
real time graphic images.
The CM-2 is being used to solve problems in high-energy physics, lattice
gauge theory, and materials science.
The FSU Supercomputing Experience
Florida State University has gained much knowledge regarding the
installation, operations, administration, and applications of supercomputers.
Along the way, a variety of ``lessons learned'' are worth noting.
A user of a computer perceives the success of the equipment in many ways,
from how much application software is available to how often the computer is
accessible. A top level indicator of the success of providing supercomputer
accessibility over the past six years can be seen in the Mean Time Between
Failure (MTBF) rates:
Supercomputer              MTBF (in hours)
Cyber 205 (with no UPS)    34.7
Cyber 205                  127.2
Cray Y-MP                  2,064.2
ETA-10                     25.4
The dramatic differences between the failure rates reflect not only the different
vendors' ability to create and maintain a particular hardware solution, but they
also indicate the increasing aptitude of FSU and the FSU Computing Center in
particular for running large-scale computing facilities.
The ETA-10 experience was certainly unique and is worthy of additional comments:
Although not reflected in the MTBF rate, since it includes hardware and
software failures, in actuality the ETA-10 hardware was quite reliable. Even
the liquid-nitrogen cooled systems, using cryogenic techniques not traditionally
associated with computers (and apparently not since), enjoyed a high amount
of availability.
The largest stumbling block with the ETA-10 was the apparent late start
with serious operating system development. The ETA-10 could have been more
fully utilized from the beginning if a stable, robust operating system had been
available.
The ETA-10 demonstrated excellent use of state-of-the-art and emerging
technologies, with the use of custom CMOS VLSI, B.E.S.T. built-in self test
logic, a 40+ multilayer board, fiber optics connectivity to I/O devices, cryogenic
cooling, and the broad range of available configurations. In retrospect, however,
the continued use of the Cyber 205 abstract architecture, with memory-to-memory
long vector pipelines supplemented by a somewhat underpowered scalar processor,
did not seem justified with respect to the lack of wide acceptance of the
earlier 205. It is ironic that even with such careful attention to almost
identically matching the instruction set between the 205 and the ETA-10, the
approach to operating system development appeared to be an effort almost from
scratch, with the subsequent delays and unreliability that any major software
effort of that magnitude would experience.
It certainly did not help ETA that major components of their computer system,
such as the custom VLSI logic chips and the high speed memory chips, suffered
scheduling problems. This reliance upon other domestic and foreign firms that
were unable to produce either sufficient quantity or chips that were fast
enough made schedules slide.
The ETA-10 software experience occurred in the midst of the ``open standards''
hue and cry that arose in the mid to late '80s. ETA was late to jump on the
UNIX bandwagon, with software talent distracted in the early years doing EOS
development. It is pretty much accepted that if the UNIX effort had been the
original operating system then it would have been more timely and widely accepted,
perhaps to the point of ensuring ETA's success. Witness the ease of migration
between the relatively immature ETA System V UNIX and Cray's UNICOS during
the supercomputer switchout -- UNIX allowed the user's files and shell scripts
to port over to the Cray with little to no changes.
Environmental support for a cryogenic supercomputer is not without cost.
The original cryogenerator system, which recycled the nitrogen, experienced
a higher frequency of maintenance periods and proved to be more expensive
than just buying the liquid nitrogen in bulk and allowing the excess to vent
off. Even so, over 7,000 gallons a week were required to keep the two cryostats
containing the CPU boards at sufficient levels for daily operation.
Sufficient resources and expertise were not available at FSU to take over
software development on the ETA-10. Had Serial # 1 been placed at a large
government laboratory, resources may have been brought to bear at an earlier
stage to overcome the software problem, perhaps keeping ETA Systems in business.
Support for the U. S. supercomputer industry was considered an important element
of the FSU/DOE strategy, but it would appear that there is now only one domestic
vendor in the market place.
Updated April 12, 2000