OVIS is open source software for monitoring the performance, health, and efficiency of large-scale computing systems. Although OVIS was designed for HPC systems, its distributed, modular design makes it well suited to cloud environments as well. OVIS has been deployed on systems ranging in size from thousands to tens of thousands of cores.
OVIS is an open source project developed cooperatively by Open Grid Computing and Sandia National Laboratories. It has been deployed on large-scale HPC systems at Lawrence Livermore National Laboratory, Sandia National Laboratories, and the University of Illinois at Urbana-Champaign (NCSA). OVIS is distributed under the terms of a dual GPL/BSD license.
The principal components of OVIS are shown in the figure below. On each monitored element, a Sampler daemon runs and collects the data being monitored. In OVIS, each monitored data value is called a metric. Metrics are grouped together into Metric Sets.
Metric Sets are gathered and stored by one or more Aggregator daemons. The API and network protocol used to gather this data are together called the Lightweight Distributed Metric Service (LDMS).
LDMS is designed to minimize the monitoring overhead on the sampled element and to provide a structured mechanism to exchange metric data over networks of many different types.
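The relationship between metrics, Metric Sets, Samplers, and Aggregators can be illustrated with a small Python sketch. This is a conceptual model only, not the LDMS API: the class names, the `generation` counter, the "host/set" naming scheme, and the pull-style `gather` method are all assumptions made for illustration. The Sampler does nothing more than update values in place, while the Aggregator decides when to copy them, which mirrors the goal of keeping overhead low on the sampled element.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MetricSet:
    """A named group of metrics sampled together on one monitored element."""
    name: str                                   # hypothetical naming scheme: "host/set"
    metrics: Dict[str, float] = field(default_factory=dict)
    generation: int = 0                         # hypothetical counter, bumped on each update


class Sampler:
    """Stands in for a Sampler daemon: owns Metric Sets and refreshes them."""

    def __init__(self, host: str) -> None:
        self.host = host
        self.sets: Dict[str, MetricSet] = {}

    def update(self, set_name: str, values: Dict[str, float]) -> None:
        key = f"{self.host}/{set_name}"
        metric_set = self.sets.setdefault(key, MetricSet(key))
        metric_set.metrics.update(values)
        metric_set.generation += 1


class Aggregator:
    """Stands in for an Aggregator daemon: gathers Metric Sets and stores them."""

    def __init__(self) -> None:
        self.stored: List[MetricSet] = []
        self._seen: Dict[str, int] = {}         # last generation stored per set name

    def gather(self, samplers: List[Sampler]) -> int:
        """Store a copy of every Metric Set that changed; return how many were stored."""
        count = 0
        for sampler in samplers:
            for name, metric_set in sampler.sets.items():
                if self._seen.get(name) == metric_set.generation:
                    continue                    # unchanged since the last gather
                snapshot = MetricSet(name, dict(metric_set.metrics), metric_set.generation)
                self.stored.append(snapshot)
                self._seen[name] = metric_set.generation
                count += 1
        return count
```

For example, after `sampler = Sampler("node001")` and `sampler.update("meminfo", {"MemFree": 1024.0})`, a first call to `Aggregator().gather([sampler])` stores one snapshot, and a second call on the same aggregator stores nothing until the Sampler updates the set again.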
System log data is gathered via rsyslog by one or more Baler daemons. The Baler daemon analyses, indexes and stores this data in the Structured Object Store (SOS).
SOS is an object-based database that supports the indexing and storage of very large numbers of objects. The design and implementation of SOS were motivated by the need for the very high data ingest rates required when collecting data from tens of thousands of monitored elements.
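The idea of pairing cheap object storage with a separate index can be sketched in a few lines of Python. This is not the SOS implementation or its API: `IndexedStore`, its methods, and the choice of a timestamp as the indexed key are hypothetical, and a real store would use persistent, disk-backed structures rather than in-memory lists. The sketch only shows why appending objects and indexing them separately keeps ingest simple while still allowing range queries.

```python
import bisect
from typing import Any, Dict, List, Tuple


class IndexedStore:
    """Toy object store: an append-only object log plus a sorted timestamp index."""

    def __init__(self) -> None:
        self._objects: List[Dict[str, Any]] = []    # objects in arrival order
        self._index: List[Tuple[float, int]] = []   # sorted (timestamp, position) pairs

    def insert(self, timestamp: float, obj: Dict[str, Any]) -> None:
        """Append the object, then record its position in the index."""
        position = len(self._objects)
        self._objects.append(obj)
        bisect.insort(self._index, (timestamp, position))

    def range(self, start: float, end: float) -> List[Dict[str, Any]]:
        """Return objects with start <= timestamp < end, in timestamp order."""
        lo = bisect.bisect_left(self._index, (start, -1))
        hi = bisect.bisect_left(self._index, (end, -1))
        return [self._objects[pos] for _, pos in self._index[lo:hi]]
```

Objects may arrive out of order; the index keeps them sorted by timestamp, so `range(1.0, 3.0)` returns only the objects whose timestamps fall in that half-open interval.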