Distributed Climate Simulation Laboratory

CO-OP 3D FINAL REPORT

                                  Final Report
                          COOP 3D ARPA Experiment 109
                    National Center for Atmospheric Research

EXPERIMENTAL OBJECTIVE - NETWORK PERSPECTIVE:

To advance the knowledge and understanding of next generation high data rate
telecommunications between widely distributed high performance computing 
systems via a terrestrial fiber/satellite hybrid testbed network architecture.

ABSTRACT: 

Coupled atmospheric and hydrodynamic forecast models were executed on the
supercomputing resources of the National Center for Atmospheric Research (NCAR)
in Boulder, Colorado and the Ohio Supercomputing Center (OSC) in Columbus, Ohio,
respectively.  The interoperation of the forecast models on these geographically
diverse, high performance Cray platforms required the transfer of large three 
dimensional data sets at very high information rates.  High capacity, 
terrestrial fiber optic transmission system technologies were integrated with 
those of an experimental high speed communications satellite in Geosynchronous 
Earth Orbit (GEO) to test the integration of the two systems.  Operation over a 
spacecraft in GEO required modification of the standard configuration of 
legacy data communications protocols to facilitate their ability to perform 
efficiently in the changing environment characteristic of a hybrid network.  The
success of this performance tuning enabled the use of such an architecture to 
facilitate high data rate, fiber optic quality data communications between high 
performance systems not accessible to standard terrestrial fiber transmission 
systems, thus obviating the performance degradation often found in contemporary
earth/satellite hybrids.

Introduction

Interworking of legacy supercomputer systems with the revitalized technologies 
embodied by the high power NASA Advanced Communications Technology Satellite 
(ACTS) required the integration of high performance terrestrial systems which 
were complementary with the spacecraft's expanded capabilities.  Dissimilar 
physical layer technologies required integration, providing the opportunity to 
investigate their interoperability over high speed terrestrial and satellite 
media.  Native High Performance Parallel Interface (HiPPI) traffic from the Cray
supercomputers was converted to Synchronous Optical Network (SONET) framing.  
Asynchronous Transfer Mode (ATM) cells generated by the high performance 
workstations controlling model interaction and visualizing output were converted
to HiPPI, then SONET.  SONET was the physical layer for the ACTS satellite 
transmission system; a key component in the hybrid architecture was the 
ability to facilitate interoperation between physical layer technologies. 

Another critical factor in the success of the hybrid model's interoperability 
was the optimization of Transmission Control Protocol/Internet Protocol (TCP/IP)
for the high performance satellite channel.  The high data rates afforded by 
SONET combined with the great latency (delay) of the space segment exceeded the 
domain of classical TCP functionality.  This is manifested in the delay for 
the acknowledgement of a transmitted packet being great enough for the source's
transmission window to be exhausted well before the first acknowledgement 
returns from the destination.
The net result is a very inefficient "stop and wait" state on the source, 
effectively shutting off the flow of data between the application, the kernel 
and the transmission media.  The link remains idle until an acknowledgement is 
received by the source which then transmits another packet.  Performance is 
impacted as throughput declines radically from the source's inability to keep 
the channel full; the bit length of transmitted data being far less than that 
which the channel can accommodate.  Performance enhancements to the TCP 
Automatic Retransmission Request (ARQ) or sliding window were taken advantage of
to match the performance of TCP to the hybrid network.

IP->HiPPI->SONET Physical Layer Configuration

In local environments where the time-distance separation between machines is 
slight, HiPPI interfaces and HiPPI switches may directly perform the 
interconnection of end systems.  Wide area terrestrial interoperability is then 
facilitated by layered protocol suites suited for longer distances such as 
TCP/IP which is mapped to the HiPPI stream providing either connectionless 
datagram service (UDP) or connection-oriented reliability (TCP).

Model data sets transferred by the Crays' HiPPI interfaces via a HiPPI switch 
were converted to a serial SONET stream by a HiPPI/SONET gateway developed at  
the Los Alamos National Laboratory (LANL).  This stream was framed in SONET 
OC-3c (concatenated pointer, 155.52 mbps) Synchronous Payload Envelopes (SPE) 
and transferred to the ACTS ground segment High Data Rate (HDR) terminal via 
Single Mode optical fiber.  SONET Section and Line Overhead was terminated by
the HDR prior to transmission and regenerated by the receiving HDR's SONET 
section. 

IP->ATM->HiPPI->SONET Physical Layer Configuration

The modeling components were linked using Parallel Virtual Machine (PVM).  PVM 
is a software library which provides a uniform, parallel computing architecture 
which is independent of the underlying hardware and network topology.  PVM also
linked the models to the data managers, software written to filter and route the
model output streams.  This data was fed to visualization components which were
executed on high performance Silicon Graphics Inc. (SGI) workstations.  Control
and evaluation of the simulation was afforded by this component.  Collaboration 
among the scientific researchers was facilitated by packet-based (IP) over ATM
video conferencing at each site.

The video conferencing and collaborative workstations were directly connected 
via ATM to Fore ASX-200 ATM switches.   IP packets segmented into 53 byte cells
by the workstation's ATM Network Interface Cards (NIC) were converted to HiPPI 
by a NetStar GigaRouter and routed through the HiPPI switch to the HiPPI/SONET 
gateway.  As with the Cray output, this HiPPI stream was framed in SONET OC-3c 
SPE and transferred to the HDR terminal over Single Mode fiber, with Section and
Line Overhead terminated by the transmitting HDR and regenerated by the 
receiver.  SONET frames from the workstations were transmitted along with model 
data via the single OC-3c ACTS channel, completing the terrestrial complement of
the hybrid. 

TCP Performance 

The ability of an end system to transmit data is ultimately limited by the 
information capacity of the transmission medium.  Efficient use of the medium is
achieved by maintaining transmission rates at or close to the maximum.  The 
combination of this data rate capability and the round trip time (RTT) between 
source and destination specifies how much data is flowing at any instant between
the sender and receiver.  

TCP is a reliable, end to end connection oriented transport layer protocol which
uses a sliding window based flow control system or ARQ to recover from loss or 
corruption of data over the medium.  To achieve this, TCP requires the source to
hold the data transmitted in buffer for a minimum of the time required to send 
the data to the destination and receive an acknowledgement from the receiver or 
the RTT.  Should data be corrupted or lost the entire contents of the source 
buffer is resent.

Maximum performance is obtained from TCP not just from high information rates
but from the product of the information rate and the RTT.  This "bandwidth-
delay" product is equivalent to the amount of unacknowledged data outstanding at
any instant on the transmission medium.  The bandwidth-delay product then 
corresponds to the minimum buffer size or window size which will keep the "pipe"
or link full and provide adequate recovery from congestion or loss. The larger 
the window, the more data can be outstanding and the closer the data link can 
be kept to its maximum capacity.  

TCP window size corresponds to the size of the socket buffer space or send and 
receive buffers in both the source and destination UNIX operating system 
kernels.  During connection establishment the source and destination negotiate
the size of this window, facilitating the smooth, continuous flow of data for 
the duration of the connection.  To provide efficient use of high capacity links
with high latency, very large window sizes are required. 

In the original TCP specification, RFC 793, the TCP header contains a 16 bit 
window size field corresponding to the receiver's window size. The 16 bit field 
can support a window size of 2^16, a maximum of 64 KBytes.  RFC 1323 prescribes
a window scalability option for the TCP header which can accommodate larger
window sizes, up to 1 Gbyte.  This option can improve the performance of modern
networks with high bandwidth-delay products.  The extension maps the standard 16
bit window size field to a 32 bit value and uses the window scale option to bit-
shift this value, producing a new maximum window size value.  
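
The cost of the unscaled 16 bit window can be made concrete with a short 
calculation.  The sketch below is illustrative only: the function name is 
hypothetical, and the round trip time of 0.5 seconds is an assumed round figure 
of the order derived later in this report for the satellite loopback.  It shows 
that a sender limited to one 64 KByte window per round trip cannot exceed 
roughly 1 mbps, whatever the capacity of the link.

```python
# Upper bound on TCP throughput when at most one window of data can be
# outstanding per round trip.  Names and the 0.5 s RTT are assumptions.

MAX_UNSCALED_WINDOW = 65535  # largest value a 16 bit window field can hold


def window_limited_throughput_bps(window_bytes, rtt_seconds):
    """Return the throughput ceiling in bits per second for a given window."""
    return window_bytes * 8 / rtt_seconds


if __name__ == "__main__":
    ceiling = window_limited_throughput_bps(MAX_UNSCALED_WINDOW, 0.5)
    print("ceiling with unscaled window: %.3f mbps" % (ceiling / 1e6))
```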

The window scale option occupies 3 bytes in the TCP header: the first byte
specifies the type of option as window scale, the second the length of the 
option and the third the shift count.  The window scale indicates the sender is 
able to accept send and receive buffer or window scaling and sends the scale 
factor to the receiver.  The window scale is a log base 2 value and the shift 
count is the number of bits by which the receiver's window value is right 
shifted to fit the 16 bit window field of the TCP header; the peer left shifts 
the field by the same count to recover the window.  Windows smaller than the 
2^16 maximum are shifted by the shift factor in the same way.  
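
As a minimal sketch of the option layout and the shifting described above, the 
following packs the three option bytes (kind 3, length 3 and the shift count, 
as given in RFC 1323) and applies the shift in both directions.  The function 
names are hypothetical and this is not code from any TCP implementation.

```python
import struct

WSCALE_KIND = 3    # option kind assigned to window scale in RFC 1323
WSCALE_LENGTH = 3  # total option length in bytes
MAX_SHIFT = 14     # largest shift count permitted by RFC 1323


def pack_window_scale(shift):
    """Build the 3 byte window scale option: kind, length, shift count."""
    if not 0 <= shift <= MAX_SHIFT:
        raise ValueError("shift count must be between 0 and 14")
    return struct.pack("!BBB", WSCALE_KIND, WSCALE_LENGTH, shift)


def advertised_field(true_window, shift):
    """Right shift the true window so it fits the 16 bit header field."""
    return min(true_window >> shift, 0xFFFF)


def effective_window(field, shift):
    """Left shift the received 16 bit field to recover the scaled window."""
    return field << shift


if __name__ == "__main__":
    field = advertised_field(10000000, 8)
    print(pack_window_scale(8).hex(), field, effective_window(field, 8))
```

Because the low order bits are discarded by the right shift, the recovered 
window is rounded down to a multiple of 2 raised to the shift count.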

An application may set a larger window size with the setsockopt call, based on 
the available buffer space of the operating system kernel.  The implementation 
of window scale will then determine the appropriate shift factor.  The maximum 
window shift could be obtained starting with a default maximum window size of 
2^16 and a scale factor of 14.  This results in the maximum window size of 
1 GByte (2^16 * 2^14 = 2^30 = 1.073 Gbyte).
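
A minimal sketch of requesting larger socket buffers through setsockopt is 
given below, using the Python socket module rather than the C interface 
available on the Crays.  The helper name is hypothetical.  The size actually 
granted depends on the operating system kernel, which may clamp, adjust or 
reject the request, so the granted sizes are read back rather than assumed.  
The request is made before the connection is established, since the window 
scale is negotiated at that time.

```python
import socket


def request_buffers(sock, nbytes):
    """Ask the kernel for send and receive buffers of nbytes each.

    Returns the (send, receive) sizes the kernel reports afterwards.
    """
    for option in (socket.SO_SNDBUF, socket.SO_RCVBUF):
        try:
            sock.setsockopt(socket.SOL_SOCKET, option, nbytes)
        except OSError:
            # Some kernels refuse oversized requests instead of clamping.
            pass
    granted_send = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    granted_recv = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return granted_send, granted_recv


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        print("granted buffers:", request_buffers(s, 10000000))
    finally:
        s.close()
```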

Application of this TCP performance extension requires that the operating system
kernel of the source and destination include the extensions to TCP performance
detailed in RFC 1323.   The maximum amount of socket buffer space available in 
the operating system kernel must be great enough to accommodate the window scale
factor anticipated.   The RTT of the data link and the maximum information rate 
must be known to facilitate performance enhancement using the window scale 
option to adjust window size.  

System Integration and Test Configurations

Due to the complexity of the architecture, various levels of system integration
were performed.  Commensurate continuity and performance tests were made to 
validate progress and functionality of the physical layer integration and the 
TCP performance enhancements prior to advancement to the next level.  The levels were:

	1. Earth Station Installation
	2. Network Hardware Installation of GigaRouter and HiPPI/SONET gateway
	3. Window Size Optimization
	4. Satellite Loopback Tests
	5. End-to-end connectivity over ACTS 
	
As all physical layer data streams were converted to SONET for transmission 
over the ACTS spacecraft, interoperation of the two HiPPI/SONET gateways was 
verified at NCAR prior to the shipment and installation of the second gateway 
at the OSC site.  Various configurations of both HiPPI/SONET gateways were 
tested in loopback with single Cray connectivity to test continuity and 
validate raw HiPPI and HiPPI/SONET performance for each.  The gateways were 
then tested between two local Crays to simulate end-to-end connectivity 
involving both gateways and different machines.  Finally each gateway was tested
between two local Crays via a local loopback at the NCAR HDR.  HiPPI performance
was validated by a simple program which writes 10 Mbyte buffers of raw HiPPI 
data across logical interfaces on an NSC PS-32 HiPPI switch to a single or pair 
of 
Crays.

HiPPI tests were made for the following HiPPI/SONET configurations. 

	1.  Single Cray looped through each gateway individually via HiPPI 
	    switch.
	2.  Single Cray looped through both gateways back-to-back via HiPPI
	    switch. 	  	
	3.  Cray to Cray via both HiPPI/SONET gateways back-to-back via HiPPI 
	    switch.
	4.  Cray to Cray via HiPPI switch, single HiPPI/SONET gateway and HDR 
	    digital terminal in loopback.  

Terrestrial and spacecraft TCP performance baselines were established prior to 
TCP window optimization.  Tests were conducted between two local Crays at NCAR 
interconnected similarly to HiPPI test configurations and to the spacecraft in 
loopback (bent pipe). 

TCP performance tests were made for the following HiPPI/SONET and spacecraft 
configurations.

	1.  Cray to Cray via single HiPPI/SONET gateway in loopback via HiPPI
	    switch.
	2.  Cray to Cray via both HiPPI/SONET gateways back-to-back via HiPPI
	    switch.
	3.  Cray to Cray via HiPPI switch, single HiPPI/SONET gateway and HDR 
	    digital terminal in loopback.
	4.  Cray to Cray via HiPPI switch, single HiPPI/SONET gateway over the 
	    ACTS OC-3c bent pipe.

Once validated, latency for the round trip spacecraft OC-3 channel was 
incorporated into a Long Link Emulator (LLE) to simulate the satellite delay 
for performance enhancement and application development during periods when 
spacecraft time was not available.  The LLE was tested in the following 
configurations.

	1.  Cray to Cray via both HiPPI/SONET gateways and Long Link Emulator 
	    via HiPPI switch.
	2.  Cray to Cray via single HiPPI/SONET gateway, LLE in loopback via 
	    HiPPI switch. 

Full end-to-end connectivity between NCAR and OSC was established after 
completion of all above integration tests.  Successful NCAR Cray to OSC Cray
connectivity was made through identical configurations at each site. 

TCP Performance Tuning

The required TCP window size for the Crays and hence the window shift was
determined from the bandwidth-delay product of the hybrid.  The round trip time 
was first calculated from the link geometry, then validated by measurement.    

A round trip for a packet and its corresponding acknowledgement requires two 
satellite hops.  The packet traverses the link once enroute to the destination
where if received correctly, elicits an acknowledgement.  The acknowledgement
then traverses the link back to the source, completing the round trip.  The 
calculation of the time required for this round trip between two Cray machines
at NCAR interconnected via the satellite loopback was made based on the 
one-way propagation time between the NCAR HDR and the spacecraft.  The 
documented location of the NCAR HDR terminal was used in this calculation.   

39 degrees 58 minutes 39 seconds north latitude
105 degrees 16 minutes 28 seconds west longitude
6113 feet above mean sea level

For the purpose of calculation latitude was rounded to 40 degrees, the radius
of the earth was taken to be 3960 miles and the orbital altitude of the 
spacecraft was assumed to be 22300 miles above the earth's equator.  The speed 
of light was taken to be 186400 miles/second both in the vacuum of space and in
the atmosphere.  The law of sines was applied to the triangle formed by the 
center of the earth (point A), the spacecraft (point B) and the earth segment 
antenna (point C).  The sides of triangle ABC opposite these angles are a, b and
c respectively, so that a is the distance from the antenna to the spacecraft, b 
is the radius of the earth and c is the distance from the center of the earth to
the spacecraft.

	a/sinA = b/sinB = c/sinC
	a/sin40 = 3960/sinB = (22300 + 3960)/sinC
	
	A + B + C = 180, if A = 40, then,	
	40 + B + C = 180
	B + C = 180 - 40
	B + C = 140
	C = 140 - B 

	a/sin40 = 3960/sinB = 26260/sin(140 - B)

	Angle B must be found to complete the equation.  To do so two right 
        triangles are superimposed on triangle ABC with sides b', c' and x.

	sin40 = b'/3960
	3960 (sin40) = b'
	b' = 2545

	c' = c - x
	cos40 = x/3960
	3960 (cos40) = x
	x = 3034
	c' = 26260 - 3034
	c' = 23226

	tanB = 2545/23226
	arctan 2545/23226 = B
	B = 6.25

	a/sin40 = 3960/sin6.25
	3960 (sin40) = sin6.25 (a)
	3960 (sin40)/sin6.25 = a
	a = 23381 miles

	Delay = 23381 miles / 186400 miles/sec
	      = .125434 sec to spacecraft one way

	One satellite hop = 2 (.125434 sec)
	                  = .250869 sec

	Round trip time = 2 (.250869 sec)
			= .501738 sec
			= 502 ms
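
As a cross-check on the arithmetic above, the same distance can be computed in 
one step with the law of cosines, using the same assumed constants.  This is a 
sketch for verification only, not the method used in the experiment, and the 
function names are hypothetical.  It gives a distance within about 0.1 percent 
of the 23381 miles above (the small difference appears to come from rounding 
the angle to 6.25 degrees) and a round trip time that agrees with 502 ms to 
within about a millisecond.

```python
import math

EARTH_RADIUS_MILES = 3960.0
ORBIT_ALTITUDE_MILES = 22300.0
LIGHT_SPEED_MILES_PER_SEC = 186400.0
LATITUDE_DEGREES = 40.0
LEGS_PER_ROUND_TRIP = 4  # up and down for the packet, up and down for the ack


def slant_range_miles(latitude_degrees):
    """Distance from the antenna to the spacecraft by the law of cosines."""
    r = EARTH_RADIUS_MILES
    d = EARTH_RADIUS_MILES + ORBIT_ALTITUDE_MILES
    angle = math.radians(latitude_degrees)
    return math.sqrt(r * r + d * d - 2.0 * r * d * math.cos(angle))


def round_trip_ms(latitude_degrees):
    """Propagation time in ms for a packet and its acknowledgement."""
    leg_seconds = slant_range_miles(latitude_degrees) / LIGHT_SPEED_MILES_PER_SEC
    return LEGS_PER_ROUND_TRIP * leg_seconds * 1000.0


if __name__ == "__main__":
    print("slant range: %.0f miles" % slant_range_miles(LATITUDE_DEGREES))
    print("round trip:  %.0f ms" % round_trip_ms(LATITUDE_DEGREES))
```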

Validation of the calculated RTT was made using a series of Internet Control 
Message Protocol (ICMP) queries (ping) sent between a Cray Y-MP8 and a Cray 
EL-92 over the spacecraft bent pipe.  

16-aztec% /etc/ping echo-h
PING echo-h.ucar.edu: 56 data bytes
64 bytes from 128.117.5.5: icmp_seq=0. time=539. ms
64 bytes from 128.117.5.5: icmp_seq=1. time=539. ms
64 bytes from 128.117.5.5: icmp_seq=2. time=541. ms
64 bytes from 128.117.5.5: icmp_seq=3. time=544. ms
64 bytes from 128.117.5.5: icmp_seq=4. time=543. ms
64 bytes from 128.117.5.5: icmp_seq=5. time=542. ms
64 bytes from 128.117.5.5: icmp_seq=6. time=540. ms
64 bytes from 128.117.5.5: icmp_seq=7. time=540. ms
64 bytes from 128.117.5.5: icmp_seq=8. time=539. ms
64 bytes from 128.117.5.5: icmp_seq=9. time=539. ms
64 bytes from 128.117.5.5: icmp_seq=10. time=539. ms
64 bytes from 128.117.5.5: icmp_seq=11. time=539. ms
64 bytes from 128.117.5.5: icmp_seq=12. time=539. ms
64 bytes from 128.117.5.5: icmp_seq=13. time=539. ms
64 bytes from 128.117.5.5: icmp_seq=14. time=539. ms
64 bytes from 128.117.5.5: icmp_seq=15. time=539. ms
64 bytes from 128.117.5.5: icmp_seq=16. time=540. ms
64 bytes from 128.117.5.5: icmp_seq=17. time=541. ms
64 bytes from 128.117.5.5: icmp_seq=18. time=539. ms
64 bytes from 128.117.5.5: icmp_seq=19. time=539. ms
64 bytes from 128.117.5.5: icmp_seq=20. time=539. ms

----echo-h.ucar.edu PING Statistics----
21 packets transmitted, 21 packets received, 0% packet loss
round-trip (ms)  min/avg/max = 539/539/544
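
The summary line can be reproduced from the individual replies.  The sketch 
below copies the 21 round trip times from the transcript above; the assumption 
that ping truncates the mean to a whole number of milliseconds is an inference 
from the reported figures, not something stated in the transcript.  The 
function names are hypothetical.

```python
# Round trip times in ms, copied from the 21 echo replies above.
RTT_MS = [539, 539, 541, 544, 543, 542, 540, 540, 539, 539, 539,
          539, 539, 539, 539, 539, 540, 541, 539, 539, 539]
CALCULATED_RTT_MS = 502  # propagation delay derived from the link geometry


def ping_summary(samples):
    """Return (min, avg, max), with the mean truncated to an integer."""
    return min(samples), sum(samples) // len(samples), max(samples)


def excess_delay_ms(samples, calculated_ms):
    """Measured average delay beyond the calculated propagation delay."""
    return ping_summary(samples)[1] - calculated_ms


if __name__ == "__main__":
    print("min/avg/max = %d/%d/%d" % ping_summary(RTT_MS))
    print("excess over calculated RTT: %d ms"
          % excess_delay_ms(RTT_MS, CALCULATED_RTT_MS))
```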

Local pings between a Cray Y-MP8 and a Cray EL-92 over the HiPPI/SONET gateway 
in loopback are shown for comparison.

63-echo% ping aztec2-h.acts
PING aztec2-h.acts.ucar.edu: 56 data bytes
64 bytes from 192.157.4.32: icmp_seq=0. time=4. ms
64 bytes from 192.157.4.32: icmp_seq=1. time=4. ms
64 bytes from 192.157.4.32: icmp_seq=2. time=6. ms
64 bytes from 192.157.4.32: icmp_seq=3. time=5. ms
64 bytes from 192.157.4.32: icmp_seq=4. time=3. ms
64 bytes from 192.157.4.32: icmp_seq=5. time=4. ms
64 bytes from 192.157.4.32: icmp_seq=6. time=4. ms
64 bytes from 192.157.4.32: icmp_seq=7. time=4. ms
64 bytes from 192.157.4.32: icmp_seq=8. time=4. ms
64 bytes from 192.157.4.32: icmp_seq=9. time=7. ms
64 bytes from 192.157.4.32: icmp_seq=10. time=4. ms
64 bytes from 192.157.4.32: icmp_seq=11. time=4. ms
64 bytes from 192.157.4.32: icmp_seq=12. time=5. ms
64 bytes from 192.157.4.32: icmp_seq=13. time=3. ms
64 bytes from 192.157.4.32: icmp_seq=14. time=4. ms
64 bytes from 192.157.4.32: icmp_seq=15. time=4. ms

----aztec2-h.acts.ucar.edu PING Statistics----
16 packets transmitted, 16 packets received, 0% packet loss
round-trip (ms)  min/avg/max = 3/4/7

The average RTT of 539 ms was compared to the calculated value of 502 ms and the
assumption was made that the additional 37 ms was the delay imposed somewhere 
between the Crays and the NCAR HDR terminal.  One-way delays observed in raw 
HiPPI tests between the Crays with the HDR digital terminal in loopback revealed
fairly consistent first packet delays of 20 ms.  Local pings across the 
HiPPI/SONET gateway in loopback, with the LLE set for 0 ms emulation delay, 
revealed very small delays, suggesting that the bulk of the additional 37 ms 
was the result of two trips through the HDR terminals: approximately 20 ms for 
the transmitted packet in one direction and another 20 ms for the 
acknowledgement in the opposite direction. 

The data below shows the delay for the first raw HiPPI packet to return from 
the HDR loopback consistently in the 20 ms range. 

Output channel -- HXCF_HIPPI is set
HXCF_HDR is set (user buffer has FP header)
HXCF_IND is zero (I-field not in user buffer)
HXCF_ISB is zero (Short burst at end)
 Data length = 32768 bytes.
 FP header: 8580001800007fe0
 I-field is 702a3d2
 HXC_SET for output OK.
 Write of 10485760 bytes completed successfully
 Elapsed microsecs = 935094
 Fastest single block was 112.267 Mbits/sec
 Slowest single block was 46.463 Mbits/sec
 Overall data rate was    89.709 Mbits/sec
Catch completed successfully
 ..First packet delay was 20876 microsec.
 ..Overall data rate with delay was    87.733 Mbits/sec
 ..Overall data rate without delay was    89.691 Mbits/sec

Output channel -- HXCF_HIPPI is set
HXCF_HDR is set (user buffer has FP header)
HXCF_IND is zero (I-field not in user buffer)
HXCF_ISB is zero (Short burst at end)
 Data length = 32768 bytes.
 FP header: 8580001800007fe0
 I-field is 702a3d1
 HXC_SET for output OK.
 Write of 10485760 bytes completed successfully
 Elapsed microsecs = 882882
 Fastest single block was 112.993 Mbits/sec
 Slowest single block was 29.008 Mbits/sec
 Overall data rate was    95.014 Mbits/sec
Catch completed successfully
 ..First packet delay was 19809 microsec.
 ..Overall data rate with delay was    91.847 Mbits/sec
 ..Overall data rate without delay was    93.883 Mbits/sec

Output channel -- HXCF_HIPPI is set
HXCF_HDR is set (user buffer has FP header)
HXCF_IND is zero (I-field not in user buffer)
HXCF_ISB is zero (Short burst at end)
 Data length = 65536 bytes.
 FP header: 858000180000ffe0
 I-field is 702a3d1
 HXC_SET for output OK.
 Write of 10485760 bytes completed successfully
 Elapsed microsecs = 749291
 Fastest single block was 122.269 Mbits/sec
 Slowest single block was 40.016 Mbits/sec
 Overall data rate was    111.954 Mbits/sec
Catch completed successfully
 ..First packet delay was 21504 microsec.
 ..Overall data rate with delay was    109.334 Mbits/sec
 ..Overall data rate without delay was    112.486 Mbits/sec
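
The "Overall data rate" figures in the three runs above follow directly from 
the bytes written and the elapsed time.  The sketch below reproduces them; it 
is inferred from the printed values rather than taken from the test program, 
and the function name is hypothetical.

```python
BYTES_WRITTEN = 10485760  # size of each write reported in the runs above


def data_rate_mbits(nbytes, elapsed_microsecs):
    """Bits per microsecond, which is numerically equal to Mbits/sec."""
    return nbytes * 8 / elapsed_microsecs


if __name__ == "__main__":
    for elapsed in (935094, 882882, 749291):
        rate = data_rate_mbits(BYTES_WRITTEN, elapsed)
        print("%d microsecs -> %.3f Mbits/sec" % (elapsed, rate))
```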
 
The SONET OC-3 information rate is 155.52 mbps.  Section and Line Overhead are 
terminated by the HDR, but the Path Overhead is still present in the SPE.  The 
nettest/nettestd performance measurements are made at the application 
level, so a more conservative rate of 135 mbps was used in the bandwidth-delay 
product calculation to account for this overhead.
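
For reference, the rates at each level of the OC-3c frame can be derived from 
the standard SONET frame dimensions, which are not given in this report and are 
stated here as assumptions: 9 rows by 270 columns of bytes, 8000 frames per 
second, 9 columns of Section and Line Overhead and 1 column of Path Overhead.  
The sketch shows how much capacity remains after each layer of overhead; the 
135 mbps working figure was chosen by the experimenters and lies below all of 
these.

```python
FRAMES_PER_SECOND = 8000
ROWS_PER_FRAME = 9
BITS_PER_BYTE = 8
TOTAL_COLUMNS = 270              # STS-3c frame width in bytes
TRANSPORT_OVERHEAD_COLUMNS = 9   # Section and Line Overhead
PATH_OVERHEAD_COLUMNS = 1


def rate_bps(columns):
    """Bit rate carried by the given number of frame columns."""
    return columns * ROWS_PER_FRAME * BITS_PER_BYTE * FRAMES_PER_SECOND


LINE_RATE = rate_bps(TOTAL_COLUMNS)
SPE_RATE = rate_bps(TOTAL_COLUMNS - TRANSPORT_OVERHEAD_COLUMNS)
PAYLOAD_RATE = rate_bps(
    TOTAL_COLUMNS - TRANSPORT_OVERHEAD_COLUMNS - PATH_OVERHEAD_COLUMNS)

if __name__ == "__main__":
    print("line rate:    %.3f mbps" % (LINE_RATE / 1e6))
    print("SPE rate:     %.3f mbps" % (SPE_RATE / 1e6))
    print("payload rate: %.3f mbps" % (PAYLOAD_RATE / 1e6))
```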

(135E6 b/s) * (539 ms) = 72765000 bits/8 bits/byte = 9095625 bytes

The bandwidth-delay product above specifies the minimum send/receive buffer or 
window size required for optimum TCP performance on the hybrid network.  This 
value specifies a window shift of 8 (a scale factor of 2^8).  While a window 
size of 9095625 bytes is optimal, the window shift must be set for the next 
larger increment.  A shift of 8 allows a buffer much larger than the computed 
value (roughly 16.8 million bytes), while a shift of only 7 would not provide 
one that is large enough (roughly 8.4 million bytes at most).  Specification of 
the correct buffer size in the nettest utility with the -b option will set the 
correct buffer size.  A window shift of 8 will accommodate that size and any
size up to the maximum for that shift of (2^16) * (2^8).
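
The calculation above can be sketched in a few lines.  The function names are 
hypothetical; integer arithmetic is used so that the byte count matches the 
hand calculation exactly, and the shift is chosen as the smallest value for 
which the largest 16 bit window field, once scaled, covers the required window.

```python
MAX_WINDOW_FIELD = 0xFFFF  # largest value of the 16 bit window field
MAX_SHIFT = 14             # largest shift count permitted by RFC 1323


def bandwidth_delay_bytes(rate_bits_per_sec, rtt_ms):
    """Bandwidth-delay product in bytes for a rate in b/s and an RTT in ms."""
    return rate_bits_per_sec * rtt_ms // 1000 // 8


def minimum_window_shift(window_bytes):
    """Smallest shift count whose scaled window can hold window_bytes."""
    for shift in range(MAX_SHIFT + 1):
        if (MAX_WINDOW_FIELD << shift) >= window_bytes:
            return shift
    raise ValueError("window exceeds the RFC 1323 maximum")


if __name__ == "__main__":
    window = bandwidth_delay_bytes(135000000, 539)
    print("window: %d bytes, shift: %d"
          % (window, minimum_window_shift(window)))
```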

A modified version of the Cray UNICOS TCP test utility nettest and nettestd was 
used to measure the effect of the window size and window size changes on 
throughput performance.  The nettest/nettestd utility performs client and server
functions for measuring network throughput of interprocess communication.  The 
nettest program establishes a connection with the nettestd program which 
performs the server function, waiting for the nettest client to initiate the 
process communications.  As with any TCP connection, the window scale option is
sent at connection establishment in the SYN segment.  Thus the window scale 
value is fixed when the connection is made and remains for the duration of the 
connection.  The nettest program writes a number of bytes to the nettestd 
program which reads a number of bytes and reports the throughput.  Nettestd in 
turn writes a number of bytes to nettest which reads them and also reports the
performance.  The performance of the two processes is averaged to disclose an 
average data throughput rate for the connection.  
 
Modification to the nettestd source code was required because Cray's nettestd 
does not have the capability to allow the user to set the window size or
shift factor.  Rather the default maximum window size defined in the operating 
system kernel (in this case 2^16 = 65536 bytes) is used by nettestd.  To overcome
this limitation it was put forth that perhaps during connection establishment 
nettestd "spoofs" the server that the size of its receive buffer is the same as 
the sender's, a requirement for smooth data flow in both directions.  The 
connection is established but an imbalance in buffer sizes between the source 
and destination may result in asymmetric transfer rates for the data sent 
between the client and the server. 

While unable to verify that this spoofing is in fact occurring, modification of
the nettestd code allows the user to define the window shift for both nettest
and nettestd by defining the -s and -b options in the nettest program.  Upon 
connection establishment the specified window sizes are set on both client and 
server, resulting in symmetric data transfers between client and server assuming
similar loads on the two machines.
 
Results

Performance tests using the modified nettest/nettestd programs were executed for
the various configurations.  The results of the tests between different Cray 
platforms were validated against each other for bent pipe tests as well as full 
connectivity end-to-end tests between NCAR and OSC.  Tests were made using 
various combinations of window shift and buffer size to determine and validate 
optimal TCP performance parameters. 

Data below describe tests between a Cray EL-92 and a Cray J-916 at NCAR.  The 
window scale was set for a shift of 8 and the send/receive buffer size is set
for 10000000 in the setsockopt call.  Large amounts of data are required to 
keep the "pipe full" and ensure link efficiency.  The number of read or write 
operations and the number of bytes to be read or written with each operation can be
varied to affect the amount of data. 

In the set of data below the number of times data was read/written was set to 
100 and the number of bytes for each of the operations was 3000000.  A disparity
in performance is evident between the results of these tests and tests 
immediately following where the number of times data was read/written was set to
1000.  It is not clear why more optimal performance is observed when the amount 
of data is increased by an order of magnitude.  The amount of data transmitted 
is sufficient to "fill the pipe" in either case.  While this is puzzling, the 
total 'Real' time to effect the transfer increases by only a factor of seven, 
indicating more efficient use of the channel.  

aztec=client
echo=server

64-aztec% rtx nettest -s 8 -b 10000000 echo-h.acts.ucar.edu 100 3000000 

Final SO_SNDBUF=10000000
Final SO_RCVBUF=10000000
Final TR_SENDWNDSHIFT = 800000000000, sendwinshift = 8, recvwindshift= 8.
Transfer: 100*3000000 bytes from     aztec to echo-h.acts.ucar.edu
           Real  System            User          Kbyte   Mbit(K^2) mbit(1+E6)
  write 28.9430  4.9760 (17.2%)  0.0013 ( 0.0%) 10122.25  79.080    82.921
   read 30.5650  4.3535 (14.2%)  0.0626 ( 0.2%) 9585.11  74.884    78.521
    r/w 59.5080  9.3295 (15.7%)  0.0639 ( 0.1%) 9846.36  76.925    80.661

echo=client
aztec=server

83-echo% rtx nettest -s 8 -b 10000000 aztec2-h.acts.ucar.edu 100 3000000 

Final SO_SNDBUF=10000000
Final SO_RCVBUF=10000000
Final TR_SENDWNDSHIFT = 800000000000, sendwinshift = 8, recvwindshift= 8.
Transfer: 100*3000000 bytes from      echo to aztec2-h.acts.ucar.edu
           Real  System            User          Kbyte   Mbit(K^2) mbit(1+E6)
  write 29.5932  9.1337 (30.9%)  0.0020 ( 0.0%) 9899.87  77.343    81.100
   read 30.3065  6.8523 (22.6%)  0.0977 ( 0.3%) 9666.85  75.522    79.191
    r/w 59.8997 15.9860 (26.7%)  0.0997 ( 0.2%) 9781.97  76.422    80.134
 
aztec=client
echo=server

72-aztec% rtx nettest -s 8 -b 10000000 echo-h.acts.ucar.edu 1000 3000000 

Final SO_SNDBUF=10000000
Final SO_RCVBUF=10000000
Final TR_SENDWNDSHIFT = 800000000000, sendwinshift = 8, recvwindshift= 8.
Transfer: 1000*3000000 bytes from     aztec to echo-h.acts.ucar.edu
           Real  System            User          Kbyte   Mbit(K^2) mbit(1+E6)
  write 197.3138 52.9243 (26.8%)  0.0124 ( 0.0%) 14847.86 115.999   121.634
   read 215.8563 43.6969 (20.2%)  0.6384 ( 0.3%) 13572.40 106.034   111.185
    r/w 413.1701 96.6213 (23.4%)  0.6507 ( 0.2%) 14181.51 110.793   116.175

echo=client
aztec=server

91-echo% rtx nettest -s 8 -b 10000000 aztec2-h.acts.ucar.edu 1000 3000000 

Final SO_SNDBUF=10000000
Final SO_RCVBUF=10000000
Final TR_SENDWNDSHIFT = 800000000000, sendwinshift = 8, recvwindshift= 8.
Transfer: 1000*3000000 bytes from      echo to aztec2-h.acts.ucar.edu
           Real  System            User          Kbyte   Mbit(K^2) mbit(1+E6)
  write 236.7876 95.9306 (40.5%)  0.0201 ( 0.0%) 12372.64  96.661   101.357
   read 196.6013 68.4667 (34.8%)  0.9884 ( 0.5%) 14901.67 116.419   122.075
    r/w 433.3889 164.3974 (37.9%)  1.0085 ( 0.2%) 13519.90 105.624   110.755

This phenomenon was observed in both bent pipe configurations between machines at
NCAR and in later end-to-end tests between NCAR and OSC Crays.

aztec=client
osca=server

20-aztec% rtx nettest -s 8 -b 10000000 osca3.acts.ucar.edu 100 3000000

Final SO_SNDBUF=10000000
Final SO_RCVBUF=10000000
Final TR_SENDWNDSHIFT = 800000000000, sendwinshift = 8, recvwindshift= 8.
Transfer: 100*3000000 bytes from     aztec to osca3.acts.ucar.edu
           Real  System            User          Kbyte   Mbit(K^2) mbit(1+E6)
  write 28.8057  4.9513 (17.2%)  0.0013 ( 0.0%) 10170.51  79.457    83.317
   read 30.0736  4.4901 (14.9%)  0.0623 ( 0.2%) 9741.73  76.107    79.804
    r/w 58.8793  9.4414 (16.0%)  0.0635 ( 0.1%) 9951.51  77.746    81.523

aztec=client
osca=server

22-aztec% rtx nettest -s 8 -b 10000000 osca3.acts.ucar.edu 1000 3000000 
-a temp

Final SO_SNDBUF=10000000
Final SO_RCVBUF=10000000
Final TR_SENDWNDSHIFT = 800000000000, sendwinshift = 8, recvwindshift= 8.
Transfer: 1000*3000000 bytes from     aztec to osca3.acts.ucar.edu
           Real  System            User          Kbyte   Mbit(K^2) mbit(1+E6)
  write 205.1117 53.1284 (25.9%)  0.0126 ( 0.0%) 14283.38 111.589   117.009
   read 194.9550 45.1127 (23.1%)  0.6294 ( 0.3%) 15027.50 117.402   123.105
    r/w 400.0667 98.2411 (24.6%)  0.6420 ( 0.2%) 14646.00 114.422   119.980

In spite of this anomaly, 1000 read/write operations appeared to be the minimum 
number of operations which would produce optimal throughput performance.  It was
used for all subsequent tests.
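
The three rate columns in the nettest output can be related to the byte count 
and the 'Real' time.  The sketch below is an inference from the printed 
figures, not taken from the nettest source, and the function name is 
hypothetical: Kbyte appears to be bytes per second divided by 1024, Mbit(K^2) 
bits per second divided by 1024 squared, and mbit(1+E6) bits per second divided 
by one million.  Applied to the write line of the first 100 * 3000000 byte 
aztec to echo run, it reproduces the printed values to within the rounding of 
the elapsed time.

```python
def nettest_rates(total_bytes, real_seconds):
    """Return (Kbyte/s, Mbit/s using 1024^2, mbit/s using 10^6)."""
    bytes_per_sec = total_bytes / real_seconds
    kbyte = bytes_per_sec / 1024
    mbit_binary = bytes_per_sec * 8 / (1024 * 1024)
    mbit_decimal = bytes_per_sec * 8 / 1e6
    return kbyte, mbit_binary, mbit_decimal


if __name__ == "__main__":
    kbyte, mbit_binary, mbit_decimal = nettest_rates(100 * 3000000, 28.9430)
    print("%.2f Kbyte  %.3f Mbit(K^2)  %.3f mbit(1+E6)"
          % (kbyte, mbit_binary, mbit_decimal))
```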

Batteries of nettest/nettestd suites were run to validate the computed optimum
of window shift 8 and a send/receive buffer size of 9095625 bytes.  Data below 
support the notion that window shifts and buffer sizes above or below the 
theoretical optimum result in inferior performance.

(135E6 b/s) * (539 ms) = 72765000 b /8 b/B = 9095625 Bytes requires a window
shift = 8 

Window size is 0.5 times optimal

311-aztec% rtx nettest -s 7 -b 4547813 echo-h 1000 1000000 

Final SO_SNDBUF=4547813
Final SO_RCVBUF=4547813
Final TR_SENDWNDSHIFT = 800000000000, sendwinshift = 7, recvwindshift= 7.
Transfer: 1000*1000000 bytes from     aztec to    echo-h
           Real  System            User          Kbyte   Mbit(K^2) mbit(1+E6)
  write 124.9139 16.5351 (13.2%)  0.0098 ( 0.0%) 7817.89  61.077    64.044
   read 124.9731 10.9561 ( 8.8%)  0.1682 ( 0.1%) 7814.18  61.048    64.014
    r/w 249.8870 27.4912 (11.0%)  0.1781 ( 0.1%) 7816.03  61.063    64.029

Window size is optimal

293-aztec% rtx nettest -s 8 -b 9095625 echo-h 1000 1000000 

Final SO_SNDBUF=9095625
Final SO_RCVBUF=9095625
Final TR_SENDWNDSHIFT = 800000000000, sendwinshift = 8, recvwindshift= 8.
Transfer: 1000*1000000 bytes from     aztec to    echo-h
           Real  System            User          Kbyte   Mbit(K^2) mbit(1+E6)
  write 72.0108 18.6173 (25.9%)  0.0098 ( 0.0%) 13561.34 105.948   111.094
   read 76.0022 10.5164 (13.8%)  0.1684 ( 0.2%) 12849.13 100.384   105.260
    r/w 148.0130 29.1338 (19.7%)  0.1782 ( 0.1%) 13195.63 103.091   108.099

Window size is 1.5 times optimal

331  rtx nettest -s 8 -b 13643438 echo-h 1000 1000000 

Final SO_SNDBUF=12000000
Final SO_RCVBUF=12000000
Final TR_SENDWNDSHIFT = 800000000000, sendwinshift = 8, recvwindshift= 8.
Transfer: 1000*1000000 bytes from     aztec to    echo-h
           Real  System            User          Kbyte   Mbit(K^2) mbit(1+E6)
  write 131.4753 15.5650 (11.8%)  0.0098 ( 0.0%) 7427.72  58.029    60.848
   read 138.7509 10.7622 ( 7.8%)  0.1711 ( 0.1%) 7038.24  54.986    57.657
    r/w 270.2263 26.3272 ( 9.7%)  0.1809 ( 0.1%) 7227.74  56.467    59.210
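
The half-window and optimal-window results above are consistent with the usual
window-limited bound on TCP throughput, in which at most one window of data can
be delivered per round trip.  The following is a minimal sketch of that bound,
assuming the 539 ms round trip time used in the calculation earlier; the 
function name is illustrative only.  It is an upper limit, so it does not 
account for the degraded result of the oversized buffer.

```python
def window_limited_mbps(window_bytes: int, rtt_ms: int) -> float:
    """Upper bound on throughput in Mbit/s (10**6 bits) for one window per round trip."""
    bits_per_ms = window_bytes * 8 / rtt_ms   # bits delivered per millisecond
    return bits_per_ms / 1000                 # 1 bit/ms = 1000 bit/s = 0.001 Mbit/s


for window in (4547813, 9095625):
    print(window, "bytes ->", round(window_limited_mbps(window, 539), 1), "Mbit/s at most")
```

The bounds work out to roughly 67.5 and 135 Mbit/s, against measured rates of 
about 64 and 108 Mbit/s in the mbit(1+E6) column.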
 
The results indicate that the optimal settings for the ACTS channel were a 
window shift of 8 and a send/receive buffer size of no less than 9095625 bytes,
validating the computed results.  10000000 bytes was chosen as a simple figure 
to work with and, as the data below illustrate, was a valid choice.

82-echo% rtx nettest -s 8 -b 10000000 aztec2-h 1000 3000000 

Final SO_SNDBUF=10000000
Final SO_RCVBUF=10000000
Final TR_SENDWNDSHIFT = 800000000000, sendwinshift = 8, recvwindshift= 8.
Transfer: 1000*3000000 bytes from      echo to  aztec2-h
           Real  System            User          Kbyte   Mbit(K^2) mbit(1+E6)
  write 202.3199 97.4109 (48.1%)  0.0202 ( 0.0%) 14480.47 113.129   118.624
   read 195.3330 69.4343 (35.5%)  0.9929 ( 0.5%) 14998.42 117.175   122.867
    r/w 397.6530 166.8452 (42.0%)  1.0131 ( 0.3%) 14734.90 115.116   120.708

Repeated nettest/nettestd suites were executed using the validated window scale
parameters in the bent pipe configuration.  Consistent performance in the 120
Mbit/s range was noted.
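
For readers comparing the tables, the three throughput columns appear to be 
derived from the total byte count and the Real time as sketched below.  This 
relationship is inferred from the figures printed above, not taken from the 
nettest source, and the function name is illustrative.

```python
def nettest_rates(total_bytes: int, real_seconds: float) -> tuple[float, float, float]:
    """Return (Kbyte/s, Mbit/s with K = 1024, Mbit/s with M = 10**6)."""
    bytes_per_s = total_bytes / real_seconds
    kbyte = bytes_per_s / 1024              # "Kbyte" column
    mbit_k2 = bytes_per_s * 8 / 1024 ** 2   # "Mbit(K^2)" column
    mbit_e6 = bytes_per_s * 8 / 1e6         # "mbit(1+E6)" column
    return kbyte, mbit_k2, mbit_e6


# Write row of the optimal-window test: 1000 * 1000000 bytes in 72.0108 s.
print(nettest_rates(1000 * 1000000, 72.0108))
```

The r/w row is reproduced by the same formula when the bytes written and read 
are added together and divided by the combined Real time.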

Having validated Cray to Cray performance in bent pipe configuration for the two
machines at NCAR, validation tests using nettest/nettestd between the Cray J-916
at NCAR and the Cray Y-MP8 at OSC in end-to-end configuration began.  This would
be the final test configuration and the one on which the coupled atmospheric and
hydrodynamic models would run.  The objective was to use all of the parameters 
from the bent pipe configuration tests to validate end-to-end connectivity 
performance.  

OSCA$ nettest -s 8 -b 10000000 aztec2-h.acts.ucar.edu 1000 3000000 

Final SO_SNDBUF=10000000
Final SO_RCVBUF=10000000
Final TR_SENDWNDSHIFT = 800000000000, sendwinshift = 8, recvwindshift= 8.
Transfer: 1000*3000000 bytes from      OSCA to aztec2-h.acts.ucar.edu
           Real  System            User          Kbyte   Mbit(K^2) mbit(1+E6)
  write 103.8431 11.1099 (10.7%)  0.0022 ( 0.0%) 14106.32 110.206   115.559
   read 104.6689  7.3552 ( 7.0%)  0.0896 ( 0.1%) 13995.03 109.336   114.647
    r/w 208.5119 18.4651 ( 8.9%)  0.0918 ( 0.0%) 14050.45 109.769   115.101

98-aztec% nettest -s 8 -b 10000000 osca3.acts.ucar.edu 1000 3000000 

Final SO_SNDBUF=10000000
Final SO_RCVBUF=10000000
Final TR_SENDWNDSHIFT = 800000000000, sendwinshift = 8, recvwindshift= 8.
Transfer: 1000*3000000 bytes from     aztec to osca3.acts.ucar.edu
           Real  System            User          Kbyte   Mbit(K^2) mbit(1+E6)
  write 103.8082 25.2396 (24.3%)  0.0063 ( 0.0%) 14111.06 110.243   115.598
   read 125.4580 22.2055 (17.7%)  0.3108 ( 0.2%) 11675.97  91.218    95.650
    r/w 229.2662 47.4451 (20.7%)  0.3171 ( 0.1%) 12778.54  99.832   104.682

As is evidenced above, TCP performance over the hybrid between the Cray J-916 at
NCAR and the Cray Y-MP8 at OSC was as expected, validating the TCP Performance 
Extensions outlined in RFC 1323.  These parameters were conveyed to the 
researchers for incorporation into the coupled model-PVM applications.
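
As an illustration of how an application might request the same buffer sizes,
the following sketch uses the standard sockets interface.  It is not the code 
used in the model-PVM applications; Python is used here only for brevity and 
the function names are illustrative.  The Cray -s window shift option has no 
portable socket equivalent, and on most current systems the window scale is 
negotiated by the kernel when the buffers are set before the connection is 
made.  The kernel may also cap or adjust the requested size, so the effective 
values are read back.

```python
import socket


def make_tuned_socket(buffer_bytes: int = 10_000_000) -> socket.socket:
    """Create an unconnected TCP socket with enlarged send/receive buffers."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Buffers must be requested before connect() or listen() so that the
    # window scale offered in the TCP handshake can reflect them.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buffer_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buffer_bytes)
    return sock


def effective_buffers(sock: socket.socket) -> tuple[int, int]:
    """Return the (send, receive) buffer sizes the kernel actually granted."""
    snd = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    rcv = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return snd, rcv
```

A caller would create the socket with make_tuned_socket(), check the values 
from effective_buffers() against the 9095625 byte minimum, and only then 
connect to the remote host.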

ATM->HiPPI Performance

Due to time constraints, ATM to HiPPI performance over the hybrid was not 
validated.  However, with the exception of the Maximum Transmission Unit (MTU) 
or Maximum Segment Size (MSS), the performance parameters were the same as those
between the Crays over the hybrid.  Optimum performance is obtained in any
environment by transmitting as large a packet (IP) as practicable.  An MTU of 
65 Kbytes for the HiPPI physical layer is used by the Crays while one of 9188 
bytes is used by ATM for the video conferencing and collaborative workstations.

ATM to HiPPI performance was validated from the NCAR Cray J916 to the NCAR SGI
Onyx.  The path extended from the Cray, over HiPPI to the NetStar GigaRouter, to
the SGI via ATM.  The HiPPI stream from the Cray was converted to ATM and 
SONET by the NetStar GigaRouter.  The success of this communication was the 
result of the use of dynamic MTU discovery on both machines.  The correct MTU 
for the physical layer (9100 bytes) was automatically selected by both machines 
based on the installed configuration.  

The following data examine the nettest/nettestd performance between the Cray 
J-916 and the SGI Onyx.  The noticeably smaller window sizes reflect the low 
latency of the terrestrial fiber network at NCAR.  

From the Cray J-916 to the SGI Onyx

magic.47: nettest -b 37500 aztec2-h.acts 100 1000000
Transfer: 100*1000000 bytes from     magic to aztec2-h.acts

          Real  System            User          Kbyte   Mbit(K^2) mbit(1+E6)
  write 15.9000  3.7700 (23.7%)  0.0100 ( 0.1%) 6141.90  47.984    50.314
   read 15.3600  5.2200 (34.0%)  0.1100 ( 0.7%) 6357.83  49.671    52.083
    r/w 31.2600  8.9900 (28.8%)  0.1200 ( 0.4%) 6248.00  48.813    51.184

magic.48: nettest -b 65535 aztec2-h.acts 100 1000000
Transfer: 100*1000000 bytes from     magic to aztec2-h.acts
           Real  System            User          Kbyte   Mbit(K^2) mbit(1+E6)
  write 11.2000  3.8100 (34.0%)  0.0100 ( 0.1%) 8719.31  68.120    71.429
   read 12.7500  5.0100 (39.3%)  0.1100 ( 0.9%) 7659.31  59.838    62.745
    r/w 23.9500  8.8200 (36.8%)  0.1200 ( 0.5%) 8155.01  63.711    66.806

From the SGI Onyx to the Cray J-916

aztec.6: nettest -b 37500 magic.acts 100 1000000

Final SO_SNDBUF=37500
Final SO_RCVBUF=37500
Final TR_SENDWNDSHIFT = 0, sendwinshift = 0, recvwindshift= 0.
Transfer: 100*1000000 bytes from     aztec to magic.acts
           Real  System            User          Kbyte   Mbit(K^2) mbit(1+E6)
  write 18.0001  6.5029 (36.1%)  0.0012 ( 0.0%) 5425.32  42.385    44.444
   read 15.7929  9.2299 (58.4%)  0.1281 ( 0.8%) 6183.55  48.309    50.656
    r/w 33.7930 15.7328 (46.6%)  0.1292 ( 0.4%) 5779.67  45.154    47.347

aztec.8: nettest -b 65535 magic.acts 100 1000000

Final SO_SNDBUF=65535
Final SO_RCVBUF=65535
Final TR_SENDWNDSHIFT = 0, sendwinshift = 0, recvwindshift= 0.
Transfer: 100*1000000 bytes from     aztec to magic.acts
           Real  System            User          Kbyte   Mbit(K^2) mbit(1+E6)
  write 12.4164  6.2488 (50.3%)  0.0013 ( 0.0%) 7865.09  61.446    64.431
   read 13.3255  9.0434 (67.9%)  0.0996 ( 0.7%) 7328.50  57.254    60.035
    r/w 25.7420 15.2922 (59.4%)  0.1009 ( 0.4%) 7587.32  59.276    62.155

The performance in both directions, alternating client and server, is consistent
and shows a marked increase with larger window sizes.  Notice that the
nettest/nettestd commands do not specify the -s option.  This is because the SGI
invokes the appropriate window shift when the -b option signals a socket buffer 
size greater than the default window size set in the kernel.  The Cray normally 
requires the -s option, but not in this case, as the window size does not exceed
the default.  Large socket buffers are not required here because the 
bandwidth-delay product is trivial.  For interactive control of the models via 
the hybrid by the SGI workstations, however, the larger window sizes were 
required.

Problems and Recommendations

A.  Difficulty integrating SONET components of the HiPPI/SONET gateway with the
    SONET section of the HDR digital terminals.  

A major objective of the project was to investigate the interoperation of
different physical layer technologies in the hybrid architecture.  A prototype
HiPPI/SONET gateway was integrated to serialize the parallel HiPPI datastreams 
for mapping to the SONET SPE of the earth station HDRs.  A great deal of time 
was spent researching problems establishing loopback as well as end-to-end 
connectivity through this device.  The result was a loss of scheduled time on 
the spacecraft due to earth segment unavailability.  This had an overall adverse
impact on the progress of the entire experiment.  Project milestones were not 
reached, and the experimental timeline slipped.

It was proposed that this phenomenon was the result of synchronization problems 
between the gateway and the HDR.  Numerous earth station components were 
replaced in an effort to locate this inconsistency.  Toward the end of the 
project BBN, LANL, NCAR and OSC engineers determined that the cause was a 
continuous cyclic reset between the two components.  A procedure was developed 
to circumvent what appeared to be a State Machine problem with the SONET 
section of the HDR terminals.

Upon spacecraft acquisition the HiPPI/SONET gateways must be reset in such a 
manner as to establish synchronization during a perceived time window of 
opportunity.  Timing must be set to source on BOTH HiPPI/SONET gateways, each 
recovering timing from the other.  The sequential order of synchronization was
found to be critical.  Once found, this procedure consistently established 
immediate end-to-end connectivity between the two sites.  It was used with 
complete success for the duration of the experiment.  Regrettably, this 
discovery was made late in the life cycle of the experiment and could not 
recover the time lost. 
 
The procedure is outlined as follows:

	1.  Terminate all traffic over spacecraft link.
	2.  Turn off HiPPI/SONET gateway and FIFO at first site.
	3.  Leave HiPPI/SONET gateway and FIFO down at first site while 
	    powering off the HiPPI/SONET gateway and FIFO at second site.
	4.  Restart FIFO at second site, and restart HiPPI/SONET gateway after 
	    FIFO.
	5.  Restart FIFO at first site, and restart HiPPI/SONET gateway after 
	    FIFO.
	6.  Do not attempt to transmit ANY data traffic until verification of  
	    HDR state and spacecraft (TDMA) acquisition at both sites.
	7.  If the state of either HDR is not OK, go back to step 2 and begin 
	    again.
	8.  If the state of both HDRs is OK, initiate data traffic and evaluate 
	    link performance.
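
The ordering constraints in the procedure above can be summarized in a short 
sketch.  The procedure was carried out by hand on the earth station equipment,
so every callable below is a hypothetical stand-in for an operator action, and 
the max_attempts limit is an assumption added here so that the loop terminates;
the original procedure simply repeats from step 2 until both HDRs report OK.

```python
from typing import Callable


def resync_gateways(
    stop_traffic: Callable[[], None],
    power_off: Callable[[str], None],
    restart_fifo: Callable[[str], None],
    restart_gateway: Callable[[str], None],
    hdr_ok: Callable[[str], bool],
    first: str = "first",
    second: str = "second",
    max_attempts: int = 5,
) -> int:
    """Run the reset sequence; return the attempt on which both HDRs were OK."""
    stop_traffic()                       # step 1
    for attempt in range(1, max_attempts + 1):
        power_off(first)                 # step 2
        power_off(second)                # step 3: first site stays down meanwhile
        restart_fifo(second)             # step 4: FIFO first, then gateway
        restart_gateway(second)
        restart_fifo(first)              # step 5: same order at the first site
        restart_gateway(first)
        # Steps 6 and 7: no data traffic until both HDRs check out,
        # otherwise repeat from step 2.
        if hdr_ok(first) and hdr_ok(second):
            return attempt               # step 8: caller may now start traffic
    raise RuntimeError("HDR state not OK after %d attempts" % max_attempts)
```

The essential points are that both sites are down at the same time, that the 
second site is restarted before the first, and that each FIFO is restarted 
before its gateway.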


B.  Poor reliability and MTBF for critical earth segment hardware components. 
    
Problems encountered with hardware reliability in both HDR earth stations 
resulted in the loss of time slots for critical end-to-end connectivity on the 
spacecraft.  The OSC site experienced Traveling Wave Tube Amplifier (TWTA) 
failure twice during the June-October '95 period.  Digital Terminal components 
at NCAR were frequently replaced to correct Network Processor, Timing and 
Control Card, and SONET section failures.  This component instability had an 
adverse impact on the schedule due to earth segment unavailability at critical
experimental phases.  Throughout these failures, BBN was very responsive and 
effective in troubleshooting and replacing failed components.

C.  Loss of NCAR earth segment availability due to rodent damage to outside 
    earth station components.

Significant spacecraft time was lost to earth segment unavailability as the 
result of rodent damage to the power cabling for the Low Noise Amplifier (LNA) 
on the HDR terminal.  The cabling connecting the outside LNA to the interior 
power supply was completely severed by rodents.  While it took a fair amount 
of time and effort to locate the source of earth station inoperability, repair 
was simple and complete functionality was restored quickly.  Outside components 
were protected as well as practicable; however, more thorough hardening of 
facilities and routine, periodic inspection for such damage in future HDR 
implementations may help avoid such downtime. 
 
D.  Difficulty meeting project milestones as a result of organizational
    boundaries. 

The magnitude of the experiment required a number of organizations with very 
different missions, goals, charters and management to interact.  Government, 
state and private/commercial entities cooperated to operate and manage the 
significant resources involved in the experiment.  While some bodies could 
dedicate sufficient numbers of personnel and resources to operate the various 
systems and objects, others struggled to provide even the minimum.  Differences 
in organizational procedures and policies could impede planning and execution of
the complex and interdependent facets of the experiment.  One entity's 
difficulty completing a task or procedure in a timely manner would have an 
adverse domino effect on the others. 

No overall project management for the experiment was possible due to the 
distributed nature of the system and diversity of the participants.  Loss of 
coordination of the separate and disparate project management entities could 
result in duplication of effort or loss of synergy.  While regular meetings and 
telephone conference calls were held, progress was sometimes irregular and 
strained.  The inability to cross organizational boundaries to effect action or
progress during a critical phase often left project managers feeling helpless, 
a feeling exacerbated by a nearly impossible schedule. 

While the intrusion into autonomous organizational domains by singular project 
management is neither possible nor desirable, future deployments of 
supercomputing experiments of the COOP 3D nature may benefit from a parallel or 
peer distributed project management architecture.  In this scenario each entity 
could designate project managers having a similar level of management authority 
in their respective organizations.  Temporary but similar management and 
reporting structures could be established at each organization which would 
mirror overall project structure, practices and goals.  Through this an 
automatic synergy would be realized, eliminating much duplicated effort and 
standardizing procedures by establishing a ubiquitous infrastructure common to 
critical participants.  Managers could then focus their teams' efforts on 
resolving technical or procedural problems crossing organizational boundaries 
without frustration while adhering to the commonly drawn plan, timeline and 
goals. 
  
Conclusion 

In spite of the difficulties, the goals of the experiment were met.  Adequate 
performance was obtained from a terrestrial/satellite hybrid network 
architecture to support interactive data communications between high performance
end systems heretofore requiring low latency, high bandwidth interconnections.

New ideas were developed and proven which revitalized existing technologies.  
The venerable technologies of geostationary satellites were enhanced to provide
high capacity, fiber optic quality communications in a new arena.  Mature and 
proven by its functionality on the Internet, the classical TCP/IP protocol suite
was successfully adapted to operate over modern, high capacity physical layer 
technologies.  Successful integration of dissimilar physical layer architectures
was achieved, permitting interoperability between legacy systems and modern 
fiber-based communications technologies.

Inexpensive and reliable high bandwidth satellite communications systems like  
ACTS could increase the utility of high performance computing.  Such 
systems could facilitate real-time, collaborative efforts in scientific research
between non-collocated researchers anywhere in the world at any time without 
reliance on terrestrial systems.  Earth and atmospheric scientists could access 
supercomputing resources immediately from sites in Lesser Developed Countries 
or remote areas not served by modern high capacity data communications, enabling
them to advance the state of scientific discovery sooner and at less expense.
 
Future deployment and availability of wide band satellite communications systems
will be an enabling technology of the emerging Broadband Integrated Services
Digital Network (B-ISDN).  These systems will provide cost effective, high 
capacity communications for the integrated voice, data and imaging applications 
which will make up the National Information Infrastructure (NII) and Global
Information Infrastructure (GII).  The presence of such systems will accelerate
the deployment of these backbones in the United States and worldwide. 


@(#)coop3d.html 1.1 96/01/19 fair@ncar.ucar.edu