Operational procedures and infrastructure
The Operations and Information Support (OIS) Section monitors SCD's
computers and UCAR's networks 24 hours a day, 365 days a year,
maintains the necessary environmental and hardware infrastructure,
manages user accounts and software licenses, develops and maintains
user databases, and provides an array of information to users and
SCD staff. These functions are provided by three groups within the
section: the Computer Production Group (CPG), the Infrastructure
Support Group (ISG), and the Database Services Group (DBSG).
The Computer Production Group (CPG) monitored and maintained SCD's computing
resources 24 hours a day, 365 days a year, during FY1998. The many system attributes
monitored included system utilization, load average, job queues, status of
network connections, Mass Storage System (MSS) accessibility and data transfer,
filesystems, memory errors, and disk errors. Prompt identification of
software, hardware, and network problems on the more than 35 machines in the
machine room and the hundreds of network connections led to timely resolutions
and minimized interruptions to service.
The timeliness of alerts about machine
failures was improved by introducing a pager that sounds when
production on systems in the SCD Computer Room halts. CPG staff added scripts
to the network monitoring system to sound alerts when connectivity to network
nodes fails. Staff updated, edited, and maintained the database containing
contact and problem resolution information for the machines and network nodes
monitored.
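
The alerting approach described above can be illustrated with a minimal sketch. It assumes a list of monitored nodes, each with a contact entry, and a probe function that reports whether a node is reachable. The node names, contact strings, and the find_alerts helper are hypothetical and are not taken from the scripts CPG wrote.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str     # hypothetical node name
    contact: str  # who to notify, as a contact database might record it


def find_alerts(nodes, is_reachable):
    """Return one alert line for each node whose reachability probe fails."""
    alerts = []
    for node in nodes:
        if not is_reachable(node.name):
            alerts.append(
                f"ALERT: {node.name} unreachable; contact {node.contact}"
            )
    return alerts


# Made-up inventory and a stubbed probe, so the sketch runs without a network.
nodes = [
    Node("node-a", "operator on duty"),
    Node("node-b", "network engineering"),
]
down = {"node-b"}
alerts = find_alerts(nodes, lambda name: name not in down)

for line in alerts:
    print(line)
```

In practice the stubbed probe would be replaced by a real connectivity check, and the alert lines would be routed to a pager or console rather than printed.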
CPG produced a detailed, daily report of system and network
activity, problems, and steps toward resolution. This online documentation
provided SCD with an efficient means of determining the time and cause of
system and network malfunctions as well as the time equipment and systems
were returned to production.
The Computer Production Group (CPG) developed Perl scripts that provide
statistical information on the use of Mass Storage System cartridges. MSS
cartridge mounts continued to increase, with over 419,000 robotic mounts.
Archive tapes mounted manually by SCD Operations staff and Tape Handler
temporary staff totaled more than 414,000, with daily manual mounts across
three shifts ranging up to 3,000. Additional Perl scripts produced daily
information on total cartridges written in each of two media categories and
the total available inventory in each media category. These totals showed
trends as well as sudden shifts in use from one media category to another.
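
The kind of daily per-category tally described above might look like the following sketch. It is written in Python rather than Perl and is not a reconstruction of the CPG scripts. The record layout, the category names (media_x, media_y), the day labels, and the 0.25 threshold are all assumptions chosen for illustration.

```python
from collections import Counter


def daily_totals(records):
    """Count cartridges written per day and media category.

    records is an iterable of (day, media_category) pairs, one per
    cartridge written (a hypothetical log format).
    """
    totals = {}
    for day, category in records:
        totals.setdefault(day, Counter())[category] += 1
    return totals


def category_share(counter, category):
    """Fraction of one day's cartridges that fall in the given category."""
    total = sum(counter.values())
    return counter[category] / total if total else 0.0


def sudden_shifts(totals, category, threshold=0.25):
    """Days on which the category's share moved by more than threshold
    compared with the previous day."""
    days = sorted(totals)
    flagged = []
    for prev, cur in zip(days, days[1:]):
        change = (category_share(totals[cur], category)
                  - category_share(totals[prev], category))
        if abs(change) > threshold:
            flagged.append(cur)
    return flagged


# Made-up sample data: usage moves from media_x to media_y on day3.
records = (
    [("day1", "media_x")] * 3 + [("day1", "media_y")] * 1
    + [("day2", "media_x")] * 3 + [("day2", "media_y")] * 1
    + [("day3", "media_x")] * 1 + [("day3", "media_y")] * 3
)
totals = daily_totals(records)
shift_days = sudden_shifts(totals, "media_x")
```

Comparing each day's share with the previous day's is only one simple way to surface a sudden shift; a longer baseline would be less sensitive to single-day noise.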
The Computer Production Group assumed responsibility for administering
dialup accounts for the entire NCAR organization. CPG coordinated
with Network Engineering staff and Technical Consulting staff
to develop procedures for adding and deleting dialup accounts.
CPG administered the addition and deletion of three different types of SCD
dialup accounts: RAS, PPP, and 1-800 connect accounts. CPG added more
than 300 RAS accounts, 100 PPP accounts, and seventy-five 1-800
connect accounts. More than 100 dialup accounts were deleted.
CPG worked with members of the Infrastructure Support Group on the design
of the web form used for requesting dialup accounts. Enhancements were
made to automate requests for divisional approval for accounts.
The Computer Production Group maintained the
machine dependencies diagram that depicts the sequential order
for rebooting SCD computer equipment. Updates were made each time equipment
was removed or installed and when system services were moved from one machine
to another. The dependencies diagram was maintained on the Windows NT
platform using Visio Professional. Visio software was selected
because of the enhanced quality of diagrams produced, and because
it offered flexibility for updates and edits.
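
A dependency diagram of this kind encodes an ordering that can also be computed. The sketch below assumes the dependencies are available as a mapping from each machine to the machines that must be up before it, and derives a valid boot order with a topological sort. The machine names are hypothetical, and graphlib requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical machines. Each key maps to the set of machines that must
# already be running before the key machine is rebooted.
depends_on = {
    "fileserver": set(),
    "nameserver": set(),
    "mass_store": {"fileserver", "nameserver"},
    "compute": {"mass_store", "nameserver"},
}


def boot_order(depends_on):
    """Return the machines in an order that respects every dependency."""
    return list(TopologicalSorter(depends_on).static_order())


order = boot_order(depends_on)
print(" -> ".join(order))
```

If the mapping ever contains a circular dependency, static_order raises graphlib.CycleError, which would flag an inconsistency introduced when services are moved from one machine to another.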
The rapid pace of technological change continued in FY1998 and brought many
changes to the computer room. A Cray J90se (chipeta) was added to the floor in the
spring. Also, a 128-processor Silicon Graphics Cray Origin2000 was delivered
and installed in the summer. Plans have been made to install a third
automated cartridge system and remove the T3D. These will be finalized by
the Infrastructure Support Group (ISG) early in FY1999.
The removal of the Cray YMP8-D (shavano) freed up a considerable amount of
space in the computer room. This space is designated for new
"supercomputer"-class equipment that may arrive in the future.
However, the continued growth
of the Mass Storage System has put considerable pressure on the available
floor space on the east side of the computer room. Unfortunately, no easy,
immediate solution is available. ISG is considering replacing smaller-capacity
racks with larger-capacity racks, but is concerned about the cost.
Additionally, ISG is considering moving some equipment or taking over
additional space adjacent to the computer room.
In FY1997, ISG requested funds for new UPS units and a backup
generator. It was decided that the UPS units could be purchased in mid-FY1998.
could be purchased. ISG put together an RFP and awarded the contract in July.
The units were ordered and will be installed in very early FY1999.
Additionally, along with FSS, SCD has again proposed a backup generator for
purchase with Director Reserve Funds.
In FY1998, ISG upgraded the DataTrax system and added a product
that allows for remote monitoring of the environmental conditions in the
room. This has been used very effectively to analyze the temperature,
humidity, and power consumption in the room. Several
additional sensors are planned for FY1999 to provide advance
warning on the cooling tower and temperature in the battery room.
The move from liquid-cooled to air-cooled equipment has put different
stresses on the existing cooling in the room. The air handling capacity in
the room is near maximum and will need to be augmented as new equipment
is added. However, the chilled water loops are underutilized.
ISG is researching some possible solutions, including augmenting the existing
air handlers with chilled water assistance.
The Trouble Ticket system will provide a uniform method for
different SCD sections to track and resolve user problems and
requests. The system will provide a knowledge base and a collective set of
solutions to known problems, thereby increasing the efficiency of the
division and its users.
The committee decided that the system must meet the following requirements:
- Provide help to staff without hindering productivity
- Provide a seamless interface to users so they don't need to know "who"
to talk to, only "what" their request is
- Provide the ability to transfer requests between organizational units
of SCD
- Eliminate separate systems used by different organizational units,
which will facilitate the sharing of information
- Provide managers with meaningful data they can use in decision-making
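
Two of these requirements, transferring requests between organizational units and giving managers usable data, can be illustrated with a small data-structure sketch. This is not the Remedy data model or any part of the actual implementation. The Ticket class, the "intake" default unit, and the tickets_by_unit helper are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Ticket:
    request: str          # "what" the user needs
    unit: str = "intake"  # unit currently responsible (hypothetical default)
    history: list = field(default_factory=list)  # (from_unit, to_unit) hops

    def transfer(self, new_unit):
        """Hand the ticket to another unit and record the hop."""
        self.history.append((self.unit, new_unit))
        self.unit = new_unit


def tickets_by_unit(tickets):
    """Count tickets currently held by each unit, as a simple report."""
    counts = {}
    for ticket in tickets:
        counts[ticket.unit] = counts.get(ticket.unit, 0) + 1
    return counts


# The user states only the request; routing between units happens afterward.
first = Ticket("cannot read a file from the mass store")
first.transfer("CPG")
first.transfer("ISG")
second = Ticket("request a new dialup account")
second.transfer("CPG")

report = tickets_by_unit([first, second])
```

Keeping the transfer history on the ticket itself is one way to let a single shared system replace separate per-unit systems while still showing where each request has been.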
The committee has made considerable progress in the development and
implementation of the trouble ticket system. By the end of FY1998,
the committee had:
- Completed and demonstrated the Remedy interface to SCD staff
- Started friendly user testing for TCG, ISG, SSG, and CPG staff
- Demonstrated a prototype web interface
Friendly user testing is progressing exceptionally well, with only minor
problems. DSG, OSG, and NETS staff will be phased into the system in
early FY1999.