1998 ASR Home
Back
SCD ASR Index
Next
SCD Home

Operational procedures and infrastructure

The Operations and Information Support (OIS) Section monitors SCD's computers and UCAR's networks 24 hours a day, 365 days a year; maintains the necessary environmental and hardware infrastructure; manages user accounts and software licenses; develops and maintains user databases; and provides an array of information to users and SCD staff. These functions are provided by three groups within the section: the Computer Production Group (CPG), the Infrastructure Support Group (ISG), and the Database Services Group (DBSG).

Monitoring and initial problem identification

The Computer Production Group (CPG) monitored and maintained SCD's computing resources 24 hours a day, 365 days a year during FY1998. The many system attributes monitored included system utilization, load average, job queues, status of network connections, Mass Storage System (MSS) accessibility and data transfer, filesystems, memory errors, and disk errors. Prompt identification of software, hardware, and network problems on the more than 35 machines in the machine room and across the hundreds of network connections led to timely resolutions and minimized interruptions to service.

Alerts to machine failures became more timely with the introduction of a pager that sounds when production on systems in the SCD Computer Room halts. CPG staff added scripts to the network monitoring system to sound alerts when connectivity to network nodes fails. Staff updated, edited, and maintained the database containing contact and problem resolution information for the machines and network nodes monitored.
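
As an illustration only, a connectivity check of the kind described above might be sketched as follows. This is not the CPG script: the node names, the TCP port, and the helper names `tcp_probe`, `check_nodes`, and `format_alert` are all assumptions made for the example. The probe is passed in as a function so that the alerting logic can be exercised without a live network.

```python
import socket
from typing import Callable, Iterable, List


def tcp_probe(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds.

    The port and timeout are assumed values, not taken from the report.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_nodes(nodes: Iterable[str], probe: Callable[[str], bool]) -> List[str]:
    """Return the nodes for which the probe reports no connectivity."""
    return [node for node in nodes if not probe(node)]


def format_alert(failed: List[str]) -> str:
    """Build the alert text; an empty string means there is nothing to report."""
    if not failed:
        return ""
    return "ALERT: no connectivity to " + ", ".join(failed)


if __name__ == "__main__":
    # Hypothetical node list; a stand-in probe marks one node as unreachable.
    nodes = ["node-a", "node-b", "node-c"]
    failed = check_nodes(nodes, lambda name: name != "node-b")
    print(format_alert(failed))
```

In a real deployment the stand-in probe would be replaced by `tcp_probe` or a similar check, and the alert text would be handed to whatever paging mechanism is in use.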

CPG produced a detailed, daily report of system and network activity, problems, and steps toward resolution. This online documentation provided SCD with an efficient means of determining the time and cause of system and network malfunctions as well as the time equipment and systems were returned to production.

MSS statistics

The Computer Production Group (CPG) developed Perl scripts that provide statistical information on the use of Mass Storage System cartridges. MSS cartridge mounts continued to increase, with over 419,000 robotic mounts. SCD Operations staff and Tape Handler temporary staff mounted more than 414,000 archive tapes, with daily manual mounts across three shifts ranging up to 3,000. Additional Perl scripts produced daily information on total cartridges written in each of two media categories and the total available inventory in each media category. These totals showed trends as well as sudden shifts in use from one media category to another.
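
The kind of daily tally described above can be sketched briefly. The original scripts were written in Perl and are not reproduced in this report; the Python sketch below only illustrates the counting step. The record layout (date, mount type, cartridge ID), the sample records, and the function name `tally_mounts` are hypothetical.

```python
from collections import Counter
from typing import Iterable


def tally_mounts(log_lines: Iterable[str]) -> Counter:
    """Count mounts per (date, mount_type) from whitespace-separated records.

    Assumed record layout (hypothetical): DATE MOUNT_TYPE CARTRIDGE_ID,
    where MOUNT_TYPE is "robotic" or "manual".
    """
    counts: Counter = Counter()
    for line in log_lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed records
        date, mount_type, _cartridge_id = fields
        counts[(date, mount_type)] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        "1998-06-01 robotic C00017",
        "1998-06-01 manual  T01234",
        "1998-06-01 robotic C00042",
        "1998-06-02 manual  T01300",
    ]
    for (date, mount_type), n in sorted(tally_mounts(sample).items()):
        print(date, mount_type, n)
```

Keying the counter on a (date, type) pair keeps the daily totals for each category side by side, which is what makes trends and sudden shifts between categories visible.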

Dialup account administration

The Computer Production Group assumed responsibility for administering dialup accounts for the entire NCAR organization. Efforts were coordinated with Network Engineering staff and Technical Consulting staff to develop procedures for the addition and deletion of dialup accounts. CPG administered the addition and deletion of three different types of SCD dialup accounts: RAS, PPP, and 1-800 connect accounts. CPG added more than 300 RAS accounts, 100 PPP accounts, and seventy-five 1-800 connect accounts. More than 100 dialup accounts were deleted. CPG worked with members of the Infrastructure Support Group on the design of the web form used for requesting dialup accounts. Enhancements were made to automate requests for divisional approval for accounts.

Machine dependencies diagram

The Computer Production Group maintained the machine dependencies diagram that depicts the sequential order for rebooting SCD computer equipment. Updates were made each time equipment was removed or installed and when system services were moved from one machine to another. The dependencies diagram was maintained on the Windows NT platform using Visio Professional. Visio software was selected because of the enhanced quality of diagrams produced, and because it offered flexibility for updates and edits.
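
A dependency diagram of this kind encodes an ordering constraint: each machine can be brought up only after the machines it depends on. As a minimal sketch, assuming the dependencies are written down as a mapping, such an order can be derived with a topological sort. The machine names and the function name `reboot_order` are hypothetical and do not come from the SCD diagram, which was maintained by hand in Visio.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List, Set


def reboot_order(depends_on: Dict[str, Set[str]]) -> List[str]:
    """Return a boot order in which every machine follows its dependencies.

    depends_on maps each machine to the set of machines that must be up
    before it. A circular dependency raises graphlib.CycleError.
    """
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    # Hypothetical dependencies, for illustration only.
    deps = {
        "nameserver": set(),
        "fileserver": {"nameserver"},
        "compute": {"fileserver", "nameserver"},
    }
    print(reboot_order(deps))
```

When services move from one machine to another, only the mapping needs to change; the order is recomputed from it.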

Equipment installations and removals

Technology continued to change rapidly in FY1998, and the computer room changed with it. A Cray J90se (chipeta) was added to the floor in the spring, and a 128-processor Silicon Graphics Cray Origin2000 was delivered and installed in the summer. Plans have been made to install a third automated cartridge system and remove the T3D. These plans will be finalized by the Infrastructure Support Group (ISG) early in FY1999.

Floor space

The removal of the Cray YMP8-D (shavano) freed up a considerable amount of space in the computer room. This space is designated for new "supercomputer"-class equipment that may arrive in the future. However, the continued growth of the Mass Storage System has put considerable pressure on the available floor space on the east side of the computer room, and no easy, immediate solution is available. ISG is considering replacing smaller-capacity racks with larger-capacity racks, but is concerned about the cost. ISG is also considering moving some equipment or taking over additional space adjacent to the computer room.

Infrastructure

In FY1997, ISG requested funds for new UPS units and a backup generator. It was decided that the UPS units could be purchased in mid-FY1998. ISG prepared an RFP and awarded the contract in July. The units were ordered and will be installed in very early FY1999. In addition, SCD and FSS have again proposed a backup generator for purchase with Director Reserve Funds.

In FY1998, ISG upgraded the DataTrax system and added a product that allows for remote monitoring of the environmental conditions in the room. This has been used very effectively to analyze the temperature, humidity, and power consumption in the room. Several additional sensors are planned for FY1999 to provide advance warning of problems with the cooling tower and with the temperature in the battery room.

The move from liquid-cooled to air-cooled equipment has put different stresses on the existing cooling in the room. The air handling capacity in the room is near maximum and will need to be augmented as new equipment is added. The chilled water loops, however, are under-utilized. ISG is researching possible solutions, including augmenting the existing air handlers with chilled water assistance.

Trouble Ticket Committee

The Trouble Ticket system will provide a homogeneous method for different SCD sections to track and provide solutions for user problems and requests. The system will provide a knowledge base and a collective set of solutions to known problems, thereby increasing the efficiency of the division and its users.

The committee decided that the system must meet the following requirements:

  1. Provide help to staff without hindering productivity
  2. Provide a seamless interface to users so they don't need to know "who" to talk to, only "what" their request is
  3. Provide the ability to transfer requests between organizational units of SCD
  4. Eliminate separate systems used by different organizational units, which will facilitate the sharing of information
  5. Provide managers with meaningful data they can use in decision-making

The committee has made considerable progress in the development and implementation of the trouble ticket system. By the end of FY1998, we had:

  1. Completed and demonstrated the Remedy interface to SCD staff
  2. Started friendly user testing for TCG, ISG, SSG, and CPG staff
  3. Demonstrated a prototype web interface

Friendly user testing is progressing exceptionally well, with only minor problems. DSG, OSG, and NETS staff will be phased into the system in early FY1999.
