The installation of WFO-Advanced workstations at the Denver NWS Forecast Office (WSFO) in May 1996 represents a significant milestone in the NWS modernization. Based on functional specifications for the Advanced Weather Interactive Processing System (AWIPS), WFO-Advanced (MacDonald and Wakefield, 1996) supports essentially all diagnostic and forecasting operations at the Denver WSFO.
Components of WFO-Advanced include data ingest and management, user interface, display, and text generation. Each of these components needs to be monitored to ensure that the system operates as planned, providing the required support to WSFO operations. Experience gained in this monitoring effort can also be applied to the operation of the AWIPS Network Control Facility (Thigpen, 1996), whose responsibilities include remote monitoring of operations at AWIPS sites.
In this paper, we describe three aspects of WFO-Advanced monitoring. A data monitor (covering the ultimate success of the data ingest and management system), designed for both forecaster and developer use, is described in Section 2. In Section 3, we discuss the process monitor and restart mechanism, intended primarily for forecaster use. System performance monitoring (addressing the general status of the computer system), of more interest to systems administrators and developers, is outlined in Section 4.
Figure 1.
Figure 2.
The monitor checks individual data directories, finding the time of the most recent data in each case. The summaries display the "weak link" state of each set.
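As an illustration only, the directory check and the "weak link" summary might be sketched as follows. The sketch assumes that the time of the most recent data can be taken from file modification times and that a set's summary is simply the worst state among its members; the function names and state labels are hypothetical, not taken from WFO-Advanced.

```python
import os
from typing import Iterable, Optional

# Hypothetical state labels, ordered from least to most severe.
SEVERITY = ["ok", "late", "missing"]


def newest_mtime(directory: str) -> Optional[float]:
    """Return the latest modification time (seconds since the epoch)
    among regular files in `directory`, or None if it holds no files."""
    with os.scandir(directory) as entries:
        times = [e.stat().st_mtime for e in entries if e.is_file()]
    return max(times) if times else None


def weak_link(states: Iterable[str]) -> str:
    """Summarize a set of datasets by its worst (most severe) member.
    An empty set is reported as "ok" (an assumption of this sketch)."""
    return max(states, key=SEVERITY.index, default="ok")
```

With this convention, a set containing one late dataset among many current ones is summarized as late, so a problem in any member is visible at the summary level.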
Figure 3.
The example dataset, the profiler plot, should arrive each hour. It therefore shows a green check mark if the latest data were received within the past hour, a yellow triangle if they were received between one and two hours ago, and an X in a red circle if more than two hours have passed since the last dataset was received.
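The threshold rule for an hourly dataset can be written as a small function. This is a minimal sketch: the function name, the returned labels, and the generalization to an arbitrary expected period are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def freshness_status(latest: datetime, now: datetime,
                     period: timedelta = timedelta(hours=1)) -> str:
    """Classify a dataset by the age of its most recent data.

    Up to one expected period old: green check mark.
    Between one and two periods:   yellow triangle.
    More than two periods:         X in a red circle.
    """
    age = now - latest
    if age <= period:
        return "green check"
    if age <= 2 * period:
        return "yellow triangle"
    return "red X"
```

For example, with `now` at 12:00 UTC, data stamped 11:30 map to the green check, 10:30 to the yellow triangle, and 09:30 to the red X.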

Because we store all grids from a specific model run in one file, noting that the current model run is "in" is not sufficient: it indicates only that at least one grid of perhaps thousands has arrived. To provide more information on the completeness of the data, an inventory is performed on each file, and the percentage of grids actually present is shown with the check mark, in increments of 5%.
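One way to reduce an inventory count to a percentage in 5% steps is sketched below. Rounding down, so that 100% appears only when every expected grid is present, is an assumption of this sketch; the text does not say how the increments are rounded, and the function name is hypothetical.

```python
def completeness_percent(grids_present: int, grids_expected: int,
                         step: int = 5) -> int:
    """Return the share of expected grids that are present, as a
    percentage rounded down to a multiple of `step` and capped at 100."""
    if grids_expected <= 0:
        raise ValueError("grids_expected must be positive")
    # Integer arithmetic: floor(100 * present / expected / step) * step
    percent = (100 * grids_present) // (grids_expected * step) * step
    return min(100, max(0, percent))
```

Under this rule, a file holding 999 of 1000 expected grids is shown as 95%, and a file holding a single grid is shown as 0%.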
Figure 4.
Our monitor uses scripts to restart the workstation ingest processes. These scripts provide an interface that lets the forecaster choose which process(es) to restart.
Figure 5.
The WFO-Advanced monitoring system has been used by systems administrators, software developers, and management for these purposes. The monitor was written using SAR (System Activity Reporter), a tool that is included with most UNIX operating systems, including Hewlett-Packard's HP-UX, as well as a standard Web server, Perl, and the PBM-Plus and gnuplot packages, all readily available on the Internet.
Data collection occurs on each data server, applications processor, and display workstation in the WFO-Advanced configuration. Striking a balance between reporting detail and system loading, we collect performance data averaged over 15-minute periods for each node. Data are stored for each CPU and disk present on the system, consuming 300 to 500 kilobytes per day for each WFO-Advanced host.
The WWW server gathers these data from each host and saves them for use by the performance monitor page. The script that copies these data also removes old data from those hosts. Currently, the server stores data for the past 30 days for each of seven hosts, for a grand total of approximately 100 MB.
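The quoted total can be checked against the per-host figures given earlier (300 to 500 kilobytes per day). The short calculation below assumes decimal units (1 MB = 1000 KB); the function name is illustrative.

```python
def archive_size_mb(kb_per_host_per_day: float,
                    days: int = 30, hosts: int = 7) -> float:
    """Total archive size in megabytes, assuming 1 MB = 1000 KB."""
    return kb_per_host_per_day * days * hosts / 1000


low = archive_size_mb(300)    # 63.0 MB
high = archive_size_mb(500)   # 105.0 MB
```

The result, 63 to 105 MB, is consistent with a total of approximately 100 MB when most hosts are near the upper end of the daily range.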
Each host's data files are stored in an appropriately named subdirectory on the WWW server. Once a day, the directory structure is scanned and a new main page for the performance monitor is generated. Newly monitored hosts are automatically added to the list of hosts with available performance data.
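The daily scan can be pictured with the following sketch, written in Python for illustration (the text names Perl among the tools used for the monitor). It assumes one subdirectory per host under a data root; the function names and the page layout are hypothetical.

```python
import os
from html import escape
from typing import List


def monitored_hosts(data_root: str) -> List[str]:
    """Return the names of the host subdirectories under `data_root`,
    sorted alphabetically. Plain files are ignored."""
    with os.scandir(data_root) as entries:
        return sorted(e.name for e in entries if e.is_dir())


def render_main_page(hosts: List[str]) -> str:
    """Build a minimal HTML page listing the hosts that have data."""
    items = "\n".join(f"<li>{escape(h)}</li>" for h in hosts)
    return ("<html><body><h1>Performance Monitor</h1>\n"
            f"<ul>\n{items}\n</ul>\n</body></html>\n")
```

Because the list is rebuilt from the directory structure on every run, a newly monitored host appears on the page as soon as its subdirectory exists, with no separate registration step.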
Figure 6.
From the main page, users select the data to display, choosing as many hosts, dates, and datasets as they wish. The request is submitted to the server by clicking on Show Data, and the requested data for each host appear on a new page generated by the script that handles the request. The data page includes appropriate identifying information, plus a back link and the time when the page was generated.
Data can be displayed in either a tabular or graphical format. The former is the raw output from SAR plus a legend to explain the abbreviations used.
For graphical display, gnuplot is used to create Portable Bitmap (PBM) format files. As seen in the sample in Fig. 7, the abscissa is time of day (UTC), with the actual collected data as the ordinate. Where data have no natural scale, the range is selected automatically by gnuplot.
Figure 7.
PBM-Plus filters are used to convert the graphics from PBM to Graphics Interchange Format (GIF) for display. Since these GIF files will not be displayed until the script has exited, the script cannot remove them. They and other temporary files used for constructing the displays are removed periodically by a cron script.
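A periodic cleanup of this kind might look like the sketch below, which removes regular files older than a given age from one directory. The age threshold, the single-directory layout, and the function name are assumptions; the text does not describe the actual cron script.

```python
import os
import time
from typing import List, Optional


def remove_stale_files(directory: str, max_age_seconds: float,
                       now: Optional[float] = None) -> List[str]:
    """Delete regular files in `directory` whose modification time is
    more than `max_age_seconds` before `now`; return their names."""
    now = time.time() if now is None else now
    # Collect entries first so that nothing is deleted while scanning.
    with os.scandir(directory) as entries:
        files = [e for e in entries if e.is_file()]
    removed = []
    for entry in files:
        if now - entry.stat().st_mtime > max_age_seconds:
            os.remove(entry.path)
            removed.append(entry.name)
    return sorted(removed)
```

Choosing a threshold comfortably longer than the time a user needs to load a page keeps the cleanup from deleting a graphic that has been generated but not yet fetched.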
SAR breaks CPU use into four states: user (time spent running user programs, including executing numeric and other calculations), system (time spent by the system executing kernel code on behalf of user programs, such as input/output (I/O) requests), I/O wait (time during which a process is waiting on a read from or write to physical memory or disk), and idle.
UNIX uses buffers in main memory to help improve disk I/O performance. Data written to disk by a user program will first be written to this cache, then later to physical disk. These data can be re-read from cache until the space is needed by other data. Similarly, when the system reads from disk, it uses an input cache, and will usually read more data than requested. SAR monitors the buffer caches, reporting the number of cache reads and writes, the number of disk reads and writes, and the cache hit ratio.
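The hit ratio relates the two kinds of counts: it is the share of logical (cache) requests that did not require a physical (disk) transfer. The sketch below follows the definition commonly given in System V sar documentation, 1 minus the ratio of physical to logical transfers, expressed as a percentage; the function name and the handling of edge cases are assumptions.

```python
from typing import Optional


def cache_hit_percent(physical: float, logical: float) -> Optional[float]:
    """Cache hit ratio in percent, from physical (disk) and logical
    (cache) transfer counts over the same interval.

    Returns None when there were no logical requests, since the ratio
    is then undefined. Negative values, which can arise when read-ahead
    makes physical transfers exceed logical ones, are clamped to 0.
    """
    if logical <= 0:
        return None
    return max(0.0, 100.0 * (logical - physical) / logical)
```

For instance, 10 disk reads against 100 cache reads over the same interval give a read hit ratio of 90%.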
Block device activity (file system I/O) is also monitored by SAR. Each file system is monitored independently for the portion of time it is busy servicing requests, the average number of outstanding I/O requests for that file system, transfers and bytes per second to and from the file system, the average time transfer requests waited idly on the queue, and the average time to service transfer requests. If these statistics show an imbalance in file system activity, the system administrator might wish to relocate some datasets in order to balance the load.
Another monitored operation is tty device activity, that is, traffic to and from modems and terminals. The numbers of input and output characters and the interrupt rates are monitored.
System calls occur when user programs request system services. Parameters that are monitored include reads, writes, forks, execs, and the number of characters transferred by system calls to block (random-access storage) devices.
System swapping and switching activity is one of the most important parameters that can be monitored on a UNIX system. A process is swapped in, or moved to primary memory, when it is ready to be run by the CPU. It remains in memory until the system needs the space for use by another process, at which point the contents of the space allocated to the process are moved to swap space on disk. SAR monitors the number of swap-ins and swap-outs per second and the amount of data transferred during these swaps. It also monitors the number of process context switches. The context of a process is generally defined as its state, including values of user variables and data structures and machine registers. A context switch occurs when the UNIX kernel decides to run (execute in the context of) another process (Bach, 1986, p. 29). Excessive swapping is an indication that the machine is memory-starved, and overall performance suffers.
UNIX run and swap queue lengths are monitored by SAR. The run queue contains the processes that are either running or waiting to run. The swap queue is the list of processes that are swapped out but ready to be run. In addition to the average lengths of these queues, SAR also reports the percentage of time these queues are occupied.
Information on several internal UNIX tables is reported by SAR. The current and maximum sizes of the process table, representing the number of processes that may be run on the system at any time, are displayed. When the process table is full, the system will not start any new processes until the ones that are currently running exit. The inode table is a cache of data structures that contain information about files in file systems, and is monitored for its size. When the inode table fills, the system removes older entries and replaces them with new ones. The file table limits the number of files that can be open at any given time. When it fills, subsequent opens on files fail until files that are currently open are closed and their table entries released. SAR also reports the number of times each of these tables filled (Loukides, 1990, pp. 74-75).
Messages and semaphores are System V interprocess communication facilities, and the number of calls to each is monitored by SAR.
Loukides, M., 1990: System Performance Tuning. O'Reilly & Associates, Inc., 312 pp.
MacDonald, A. E., and J. S. Wakefield, 1996: WFO-Advanced: An AWIPS-like Prototype Forecaster Workstation. Preprints Twelfth International Conference on Interactive Information and Processing Systems for Meteorology, Oceanography, and Hydrology, Atlanta, Amer. Meteor. Soc., 190-193.
Thigpen, R. K., 1996: The AWIPS Network Control Facility, An Introduction. Preprints Twelfth International Conference on Interactive Information and Processing Systems for Meteorology, Oceanography, and Hydrology, Atlanta, Amer. Meteor. Soc., 528-530.