OVIS 3.3 User Guide

You can get the source code at https://github.com/ovis-hpc/ovis by cloning the OVIS repository.

$ git clone https://github.com/ovis-hpc/ovis.git
$ cd ovis

There are multiple sub-projects under the OVIS project.

  • Lightweight Distributed Metric Service (LDMS): a low-overhead, low-latency framework for collecting, transferring, and storing metric data on a large distributed computer system.
  • Baler: an aggregation of log message exploration and analysis tools. Please see baler documentation here.
  • OVIS utility libraries (lib): a collection of utility libraries used in the OVIS project. Some examples are:
    • Zap transport library: a library that makes the use of RDMA-like operations transparent across RDMA networks (IB, iWARP, RoCE), the Cray uGNI network, and sockets. Zap unifies how each transport-specific operation is used, so users do not need to worry about the underlying APIs and protocols of the different networks.
    • A collection of libraries for hashing and indexing objects, e.g., red-black tree and Fowler-Noll-Vo (FNV) hash.
  • Scalable Object Store (SOS): a high-performance, indexed, object-oriented database designed to efficiently manage structured data on persistent media. Please see more details here.

This guide does not cover SOS; you can find the SOS guide here.


Dependencies

  • autoconf (>=2.63), automake, libtool
  • For CentOS 6:
    • yum groupinstall "Development Tools"
  • libevent2 (>=2.0.21)
    • For recent Ubuntu and CentOS 7, libevent2 can be installed from the central repo.
    • If you want to install from source, please find it here. http://libevent.org/
  • openssl Development library for OVIS Authentication
  • For LDMS and Baler Python Interface:
    • Python-2.7
    • swig
    • python-dateutil
    • Python Development library
  • Doxygen for documentation

Quick Start

This section shows how to build, install, and set up the environment to get the basic OVIS functionality:

  • LDMSD with authentication, most of the samplers that collect metrics from /proc, the csv, function-csv, and sos storage plugins, and the LDMSD control interface (ldmsd_controller)
  • baler, bquery, and bclient
  • The OVIS socket transport

Assumptions

  • You have cloned the OVIS git repository or downloaded the source code from GitHub. The repository contains LDMS, Baler, and the OVIS libraries. SOSdb is a submodule; you need to run git submodule commands to get its source code. See the instructions below.
$ git clone https://github.com/ovis-hpc/ovis.git
  • You have installed all the dependencies above.
  • The source directory is ovis.
  • We will install OVIS to /opt/ovis.

Quick build guide

$ cd ovis

# Getting the SOSdb
$ git submodule init sos
$ git submodule update sos

# Start the build and install process
$ ./autogen.sh
$ mkdir build
$ cd build

# Add CFLAGS='-g -O0' to the configure line below
# if you want to have the debugging symbols.
$ ../configure --prefix=/opt/ovis --enable-baler --enable-sos --enable-swig
$ make
$ sudo make install
  • If `sudo make install` fails with an error about relinking a library, root might not have permission to write to your build directory. To solve the problem, run the following in the build directory:
    chmod -R o+w .

    The error message looks similar to the messages below.

libtool: install: warning: relinking `libcoll.la'
 libtool: install: (cd /home/foo/ovis/build/lib/src/coll; /bin/sh /home/foo/ovis/build/lib/libtool  --silent --tag CC --mode=relink gcc -I../../../../lib/src/coll/../ -I../../../../lib -g -O0 -fdiagnostics-color=auto -o libcoll.la -rpath /opt/ovis/lib rbt.lo idx.lo str_map.lo ovis-map.lo label-set.lo heap.lo ../third/libovis_third.la ../ovis_util/libovis_util.la )
 mv: cannot move 'libcoll.so.0.0.0' to 'libcoll.so.0.0.0U': Permission denied
 libtool: install: error: relink `libcoll.la' with the above command before installing it

More build options per OVIS feature can be found at Build options by features.

Quick Setup Environment Guide

$ OVIS_HOME=/opt/ovis
$ export PATH=${OVIS_HOME}/bin:${OVIS_HOME}/sbin:$PATH
$ export LD_LIBRARY_PATH=${OVIS_HOME}/lib:${OVIS_HOME}/lib64:$LD_LIBRARY_PATH
$ export PYTHONPATH=${OVIS_HOME}/lib64/python2.6/site-packages:${OVIS_HOME}/lib/python2.6/site-packages:$PYTHONPATH

# If you have python 2.7, change python2.6 to 
# python2.7.

Set up the OVIS plugin library path (ZAP_LIBPATH and LDMSD_PLUGIN_LIBPATH)

$ ls ${OVIS_HOME}/lib

# If there are ovis-lib and ovis-ldms, do the 
# following.

$ OVIS_LIB=${OVIS_HOME}/lib
# If ovis-lib and ovis-ldms are not in
# ${OVIS_HOME}/lib, try

$ ls ${OVIS_HOME}/lib64

# You should see ovis-lib and ovis-ldms.
# In this case, do the following.

$ OVIS_LIB=${OVIS_HOME}/lib64
$ export ZAP_LIBPATH=${OVIS_LIB}/ovis-lib
$ export LDMSD_PLUGIN_LIBPATH=${OVIS_LIB}/ovis-ldms
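
The manual check above can also be scripted. The following is a minimal sketch, not part of OVIS: the function name ovis_set_plugin_paths is hypothetical, and it assumes only the directory layout described above (ovis-lib and ovis-ldms together under either lib or lib64).

```shell
#!/bin/sh
# Sketch: pick lib or lib64 automatically and export the plugin paths.
# ovis_set_plugin_paths is a hypothetical helper, not an OVIS command.
ovis_set_plugin_paths() {
    ovis_home=$1
    for libdir in "$ovis_home/lib" "$ovis_home/lib64"; do
        # Both plugin directories must exist under the same library directory.
        if [ -d "$libdir/ovis-lib" ] && [ -d "$libdir/ovis-ldms" ]; then
            export ZAP_LIBPATH="$libdir/ovis-lib"
            export LDMSD_PLUGIN_LIBPATH="$libdir/ovis-ldms"
            return 0
        fi
    done
    echo "ovis-lib and ovis-ldms not found under $ovis_home" >&2
    return 1
}

# Usage:
#   ovis_set_plugin_paths /opt/ovis
```

Because the function exports variables, it has to be defined and called in the current shell (for example from a sourced profile script) rather than run as a separate program.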

Set up the LDMS authentication secret word file. Assume that the secret word is 'password'.

$ echo 'secretword=password' >~/.ldmsauth.conf
$ chmod 600 ~/.ldmsauth.conf
$ cat ~/.ldmsauth.conf
secretword=password

Alternatively, you could create the secret word file anywhere you want on your system and set the environment variable LDMS_AUTH_FILE. Assume that the secret word file is at /etc/ovis/ldms_secretword.conf.

$ cat /etc/ovis/ldms_secretword.conf
secretword=password

$ chmod 600 /etc/ovis/ldms_secretword.conf
$ export LDMS_AUTH_FILE=/etc/ovis/ldms_secretword.conf
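
Either way, the file should never be readable by other users, even briefly. A minimal sketch of one way to do that follows; write_ldms_secret is a hypothetical helper, not an OVIS command, and it assumes only the secretword=<word> file format shown above.

```shell
#!/bin/sh
# Sketch: create the secret word file with owner-only permissions.
# write_ldms_secret is a hypothetical helper, not an OVIS command.
write_ldms_secret() {
    secret_file=$1
    secret_word=$2
    # The subshell keeps the restrictive umask from leaking into the caller.
    ( umask 077 && printf 'secretword=%s\n' "$secret_word" > "$secret_file" ) || return 1
    # umask does not affect a file that already existed, so fix the mode too.
    chmod 600 "$secret_file"
}

# Usage:
#   write_ldms_secret "$HOME/.ldmsauth.conf" 'password'
```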

At this point, you should be ready to run LDMS Daemons or balerd.

To run an LDMS Daemon, please see here.

Build options by features

There are many modules in the OVIS project. All the build options are listed here by OVIS modules and features.

OVIS Sub-projects

  • OVIS-lib is built and installed by default because it is a dependency of all other OVIS sub-projects.
  • SOS can be built after the SOS submodule is initialized and updated:
$ git submodule init sos
$ git submodule update sos

Sub-project | Build option
LDMS/LDMSD  | --enable-ldms (default)
Baler       | --enable-baler
SOSdb       | --enable-sos

Documentation

Feature        | Build option      | Dependencies
Documentation  | --enable-doc      | Doxygen
HTML documents | --enable-doc-html | --enable-doc
Man pages      | --enable-doc-man  | --enable-doc

Note that if only --enable-doc is given, no documents will be built. Please choose HTML documents, man pages, or both.

Transports

Feature                | Build option              | Dependencies
socket                 | --enable-socket (default) | None
RDMA (IB, iWARP, RoCE) | --enable-rdma             | libibumad, libibmad, libibverbs, librdmacm
Cray uGNI              | --enable-ugni             | Cray uGNI-related libraries. These should already exist on your Cray system.

LDMSD authentication

Feature              | Build option                 | Dependencies
LDMSD authentication | --enable-ovis_auth (default) | openssl

To learn how to use authentication, run man 7 ldms_authentication after installation, or see ldms/man/ldms_authentication.man.

LDMSD Control Interface

Feature                                             | Build options                                     | Dependencies
LDMSD control interface, used to configure an LDMSD | --enable-ldms-python (default), --enable-swig***  | python 2.7, swig***

LDMSD Control Interface or ldmsd_controller is a python interface to configure an LDMS Daemon. There are two ways the interface connects to an LDMS Daemon: ovis TCP/IP or UNIX Domain socket.

*** swig is an optional dependency. Without swig you can still use the LDMSD control interface; however, there are some limitations depending on whether you build OVIS with or without authentication.

  • Build OVIS with authentication (default):
    • You will be able to use the LDMSD control interface only with the UNIX domain socket. This means that you cannot configure an LDMS Daemon remotely.
  • Build OVIS without authentication (--disable-ovis_auth):
    • There is no limitation; swig is required only for authentication.

LDMSD Store plugins

Feature                       | Build options                    | Dependencies
store csv, store function csv | --enable-csv (default)           | None
store sos                     | --enable-sos                     | SOS 3.3.0
store rabbit                  | --enable-rabbitv3, --enable-amqp | librabbitmq < 0.7, amqp

LDMSD Sampler plugins

Feature             | Build option                     | Dependencies
/proc/meminfo       | --enable-meminfo (default)       | None
/proc/vmstat        | --enable-vmstat (default)        | None
/proc/stat          | --enable-procstat (default)      | None
/proc/stat/util     | --enable-procstatutil (default)  | None
/proc/net/dev       | --enable-procnetdev (default)    | None
/proc/diskstats     | --enable-procdiskstats (default) | None
lustre metrics      | --enable-lustre (default)        | None
Infiniband metrics  | --enable-sysclassib              | libibmad-devel, libibumad, libibumad-devel
/proc/kgnilnd/stats | --enable-kgnilnd                 | None

LDMSD Cray Sampler plugins

Feature                | Build option                 | Dependencies
Cray system metrics    | --enable-cray_system_sampler | --enable-gemini-gpcdr or --enable-aries-gpcdr
Cray Gemini metrics    | --enable-gemini_gpcdr        | --enable-cray_system_sampler
Cray Aries metrics     | --enable-aries-gpcdr         | --enable-cray_system_sampler
Cray Aries mmr metrics | --enable-aries-mmr           | --with-aries-libgpcd=,

* You can find the source code of the libgpcdr here.

Build, Configure and Install Guidelines

Assume that the project root is at ovis.

  • Go to the project root directory and generate the configure script and Makefile templates.
$ cd ovis
$ ./autogen.sh
  • Create a build directory so that the build files are kept separate from the source files.
$ mkdir build
$ cd build
  • Configure the project
    • --prefix: The directory to install the OVIS binaries in
    • build options: Space-separated list of build options from the Build options by features section
$ ../configure --prefix=<install path> [build options]

For example,

To install LDMS, Baler, and SOS at $HOME/opt/ovis do the following.

$ ../configure --prefix=$HOME/opt/ovis \
               --enable-baler --enable-sos
  • Compile and install the code
$ make
$ make install # Use sudo if you want to install as root.
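
The configure line in the steps above is the install prefix followed by one --enable-<feature> flag per feature. As an illustration only, the sketch below assembles and prints such a line; ovis_configure_cmd is a hypothetical helper, and the feature names passed to it are assumed to be the option names from the Build options by features section without the --enable- prefix.

```shell
#!/bin/sh
# Sketch: compose a configure command line from a prefix and feature names.
# ovis_configure_cmd is a hypothetical helper; it only prints the command.
ovis_configure_cmd() {
    prefix=$1
    shift
    cmd="../configure --prefix=$prefix"
    for feature in "$@"; do
        # Each feature name becomes one --enable-<feature> flag.
        cmd="$cmd --enable-$feature"
    done
    printf '%s\n' "$cmd"
}

# Usage (from the build directory):
#   ovis_configure_cmd "$HOME/opt/ovis" baler sos
```

Printing the command instead of running it makes it easy to review the chosen options before starting a long build.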

Before Starting the OVIS Framework

There are some environment variables that you might need to set up.

Environment variables

Mandatory environment variables

  • PATH: Add the OVIS binary path
  • LD_LIBRARY_PATH: Add the OVIS library path.
  • PYTHONPATH: Add the OVIS python module path
  • ZAP_LIBPATH: Path to the Zap library. It is located at <OVIS install path>/lib/ovis-lib or <OVIS install path>/lib64/ovis-lib.
  • LDMSD_PLUGIN_LIBPATH: Path to the LDMSD plugin libraries. It is located at <OVIS install path>/lib/ovis-ldms or <OVIS install path>/lib64/ovis-ldms.
  • Which path should you use: /lib or /lib64?
    • This depends on which Linux distro you are using and how you install OVIS. For example, on RHEL 7, if you install OVIS as a regular user, the OVIS libraries will be in <OVIS install path>/lib. On the other hand, if you install OVIS system-wide, the libraries will be in <OVIS install path>/lib64.
    • The best thing to do is to look into your install path.
      • ls <OVIS install path>, e.g., ls /opt/ovis.
      • If you see lib64, you must set ZAP_LIBPATH and LDMSD_PLUGIN_LIBPATH to <OVIS install path>/lib64/ovis-lib and <OVIS install path>/lib64/ovis-ldms, respectively.
$ TOP=<OVIS install path> # a dummy variable
$ export PATH=$TOP/bin:$TOP/sbin:$PATH
$ export LD_LIBRARY_PATH=$TOP/lib:$TOP/lib64:$LD_LIBRARY_PATH
$ export PYTHONPATH=$TOP/lib/python2.7/site-packages/:$PYTHONPATH
$ export ZAP_LIBPATH=$TOP/lib/ovis-lib
# or: export ZAP_LIBPATH=$TOP/lib64/ovis-lib
$ export LDMSD_PLUGIN_LIBPATH=$TOP/lib/ovis-ldms
# or: export LDMSD_PLUGIN_LIBPATH=$TOP/lib64/ovis-ldms
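
Before starting a daemon, it can help to confirm that none of the mandatory variables is empty. The following is a minimal sketch under that assumption; ovis_missing_env is a hypothetical helper, not part of OVIS, and it checks only that each variable is non-empty, not that the paths are correct.

```shell
#!/bin/sh
# Sketch: report which mandatory OVIS environment variables are unset or empty.
# ovis_missing_env is a hypothetical helper, not an OVIS command.
ovis_missing_env() {
    missing=""
    for var in PATH LD_LIBRARY_PATH PYTHONPATH ZAP_LIBPATH LDMSD_PLUGIN_LIBPATH; do
        # Indirect lookup of the variable whose name is stored in $var.
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            missing="$missing $var"
        fi
    done
    # Print the names without the leading space; empty line if none is missing.
    printf '%s\n' "${missing# }"
    # Succeed only when nothing is missing.
    [ -z "$missing" ]
}

# Usage:
#   ovis_missing_env || echo "set the variables listed above first"
```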

Zap environment variables (Optional)

Zap is a non-blocking network library that provides RDMA-like operations, e.g., read, write, and sharing a memory map. Upon receiving an event from an underlying transport, Zap immediately consumes the event to release the underlying transport resource, makes a copy of the event, and enqueues the copy so that it can be processed and delivered to the application later. This alleviates network congestion for large-scale applications.

  • ZAP_EVENT_QDEPTH: Zap event queue depth. The default is 4096.
    • If the event queue is full, Zap will not receive new transport events, e.g., disconnect events, read completions, or new messages.
  • ZAP_EVENT_WORKERS: The number of threads that will dequeue Zap events in the event queue. The default is 4.

Zap Cray uGNI Environment Variables

If you are not using a Cray machine, you could skip this section.

Cray uGNI comes with a specific way to set up your environment to use its resources. Whether you need a PTAG and/or a cookie depends on the uGNI design your system has (Aries or Gemini).

  • ZAP_UGNI_PTAG (mandatory): Set your ptag.
  • ZAP_UGNI_COOKIE (mandatory): Set your cookie.

To handle some specific Cray uGNI scenarios, there are additional environment variables that you might want to set to values appropriate for your use cases.

  • ZAP_STATE_INTERVAL: If the environment variable is set, Zap uGNI gets the states of all nodes in the uGNI network every N seconds. Otherwise, Zap uGNI will not check the node states.
  • ZAP_UGNI_CQ_DEPTH: The uGNI CQ depth. The default is 2048.
    • For an ldmsd that only creates sets (sampler daemon), there is no need to change the number. For an ldmsd that collects sets from a remote ldmsd (aggregator), the number depends on the number of sets and the update interval. The largest value we have used is 65536, which was more than enough to collect 24000 sets every second.
  • ZAP_UGNI_UNBIND_TIMEOUT: Zap uGNI will try to unbind the endpoint every N seconds (5 is the default) after it receives a disconnected event from the underlying library.
  • ZAP_UGNI_DISCONNECT_EV_TIMEOUT: Zap uGNI flushes all outstanding requests after it has tried to unbind a disconnected endpoint for N seconds and failed. The default is 3600, i.e., 1 hour.

LDMS/LDMSD/ldms_ls Environment Variables (Optional)

Lastly, there are the environment variables for setting up ldmsd and ldms_ls.

  • LDMS_AUTH_FILE: Set the path to the shared secret file.
  • LDMSD_MAX_CONFIG_STR_LEN: The maximum length of a configuration line sent through ldmsd_controller. For more information about ldmsd_controller, please see man /opt/ovis/share/man/man8/ldmsd_controller.8.
  • LDMSD_MEM_SZ: The memory pre-allocated for all LDMS sets in an LDMSD. The default is 512kB. This is equivalent to the -m option of the ldmsd command line.
    • For a sampler daemon, this is usually enough.
    • For an aggregator, it depends on the number of sets it is collecting and the size of each set. You can get the size of each set from ldms_ls. For more information about ldms_ls, please see man /opt/ovis/share/man/man8/ldms_ls.8.
  • LDMS_LS_MEM_SZ: The memory pre-allocated for all LDMS sets in an ldms_ls process. It is similar to LDMSD_MEM_SZ but applies to ldms_ls. This is equivalent to the -m option of the ldms_ls command line.
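
For an aggregator, a rough lower bound for LDMSD_MEM_SZ is the number of sets multiplied by the size of one set. The sketch below does that arithmetic; ldms_mem_estimate_kb is a hypothetical helper, it assumes all sets have the same size, and it ignores any per-set bookkeeping overhead inside ldmsd, so the result should be treated as a starting point and rounded up generously.

```shell
#!/bin/sh
# Sketch: lower-bound estimate of the set memory an aggregator needs, in kB.
# ldms_mem_estimate_kb is a hypothetical helper, not an OVIS command.
# Arguments: number of sets, size of one set in bytes (e.g. as seen in ldms_ls).
ldms_mem_estimate_kb() {
    nsets=$1
    set_bytes=$2
    # Integer arithmetic, rounded up to whole kilobytes.
    echo $(( (nsets * set_bytes + 1023) / 1024 ))
}

# Usage: 24000 sets of 4096 bytes each
#   ldms_mem_estimate_kb 24000 4096    # prints 96000
```

The set size of 4096 bytes in the usage line is an illustrative assumption, not a measured value.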

LDMSD User Guide

LDMSD is designed for collecting, transferring, and storing numeric data on large-scale systems, so it has very low computational overhead. It also works well on a single machine.

There are two types of LDMSD plugins: sampler plugin and storage plugin. Sampler plugins are plugins that sample metric values from their source. Storage plugins are plugins that store the metric values to a destination.

Available Sampler Plugins

  • meminfo: Source /proc/meminfo
  • vmstat: Source /proc/vmstat
  • procnetdev: Source /proc/net/dev
  • procdiskstat: Source /proc/diskstats
  • procstat: Source /proc/stat
  • sysclassib: Infiniband metrics. See dependencies here.
  • lustre: Lustre metric values for client, MDS, and OSS.
  • Cray system metrics: Various Cray-specific metrics. See dependencies here.
  • Cray uGNI metrics: Cray uGNI-specific metrics. See dependencies here.

Available Storage Plugins

  • store_csv: Store the metric data to files in a comma-separated format.
  • store_function_csv: Similar to store_csv, but the metric values can be derived before being stored, e.g., rate, average, and ratio.
  • store_sos: Store the metric data to a Scalable Object Store (SOS) database.

Build LDMSD separately

  • Build the OVIS utility libraries (lib) first
$ cd ovis/lib
$ ./autogen.sh
$ mkdir build
$ cd build
$ ../configure --prefix=<install path> [build options]
$ make
$ make install # use sudo if you install as root
  • Build LDMS, LDMSD, and friends
    • For the build options, please see the build options section and skip the sub-project build option.
$ cd ovis/ldms
$ ./autogen.sh
$ mkdir build
$ cd build
$ ../configure --prefix=<install path> [build options]
$ make
$ make install # use sudo if you install as root

Setup the environment variables

Please see the environment variables section.

Starting an LDMSD

LDMSD bundles related metric data into a set and transfers it over the network one set at a time. For example, all metrics read from /proc/meminfo are grouped together in one set. How metrics are grouped depends on the sampler plugin.

In general, you would start two kinds of LDMS daemons (ldmsd):

  • sampler daemon: An ldmsd that samples metric sets of your choice. You will configure it to load and start sampler plugins.
  • aggregator: An ldmsd that aggregates metric sets from one or more sampler daemons. You will configure it to connect to sampler daemons or other aggregators, get set updates, and/or store the metric sets.
  • Note that an ldmsd can only store metric sets it collected from another ldmsd. Therefore, if you have a single machine and want to store the metric sets, you will need both a sampler daemon and an aggregator. This limitation is being addressed, and the fix is planned for the next version.

There are two approaches to start LDMS daemons:

Manually Start LDMS Daemons

For all available options,

$ ldmsd --help
# Or, consult the man page
$ man $PREFIX/share/man/man8/ldmsd.8

How to start LDMSD: the basics

These are the options that almost all users will use.

  • -x: Listener transport and port (mandatory)
  • -l: Path to the log file
  • -S: Path to the UNIX domain socket (alternative to -p)
  • -p: Listener port for ldmsd_controller (alternative to -S)
  • -v: Log verbosity level, from most to least verbose: INFO, ERROR, CRITICAL, and QUIET.
  • -c: Path to the configuration file

Simplest start command
$ ldmsd -x sock:10001
  • The ldmsd listens on port 10001 using the socket transport.
  • The log messages are printed to standard output.
  • There is no way to configure this ldmsd.

Basic start command
$ ldmsd -x rdma:10001 \
    -l $HOME/var/log/ovis/samplerd.log \
    -S $HOME/var/run/ovis/samplerd.sock -p 20001
  • The ldmsd listens on port 10001 using the rdma transport.
  • The log messages are written to the $HOME/var/log/ovis/samplerd.log file.
  • ldmsd_controller can configure the ldmsd either locally, through the UNIX domain socket $HOME/var/run/ovis/samplerd.sock, or remotely, through port 20001.
$ ldmsd -x rdma:10001 \
    -l $HOME/var/log/ovis/samplerd.log \
    -p 20001 -c $HOME/etc/conf/ovis/samplerd.conf
  • Similar to the above case, but the configuration file $HOME/etc/conf/ovis/samplerd.conf is given at start.
  • The ldmsd can be reconfigured later through the controller port 20001. The -S option could be given here as well without any errors.

How to start LDMSD: at scale

There are additional command-line options you might want to consider for aggregators that collect a large number of metric sets (>10000) from many sampler daemons (>10000).

  • -P: Number of ldmsd worker threads.
    • These threads are responsible for sending requests to sampler daemons to get the list of all available metric sets and to get updates of the metric sets.
    • When you want to collect and update a large number of sets, you might want more than one worker thread to avoid delays in sending update requests to the remote ldmsd's.
  • -m: Size of pre-allocated memory for all LDMS sets on this ldmsd. This is equivalent to the environment variable LDMSD_MEM_SZ.
$ ldmsd -x rdma:10001 \
    -l $HOME/var/log/ovis/ldmsd.log \
    -S $HOME/var/run/ovis/ldmsd.sock \
    -p 20001 -P 4 -m 1GB
  • There are 4 worker threads that are responsible for connecting to remote ldmsd's and getting set updates. The maximum value of -P depends on the number of cores your system has.
  • 1GB of memory will be pre-allocated. If you build with --enable-mmap (default), the memory chunk is mmapped.
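
Since the useful maximum for -P depends on the core count, one simple policy is to cap the requested number of worker threads at the number of online cores. This is a sketch of that policy only, not a rule taken from ldmsd; ldmsd_worker_count is a hypothetical helper, and capping at exactly the core count is an assumption.

```shell
#!/bin/sh
# Sketch: cap a requested -P value at the number of online cores.
# ldmsd_worker_count is a hypothetical helper, not an OVIS command.
# Arguments: requested worker count, optional core count (detected if omitted).
ldmsd_worker_count() {
    requested=$1
    # getconf is only consulted when no core count is passed in.
    cores=${2:-$(getconf _NPROCESSORS_ONLN)}
    if [ "$requested" -gt "$cores" ]; then
        echo "$cores"
    else
        echo "$requested"
    fi
}

# Usage:
#   ldmsd -x rdma:10001 -P "$(ldmsd_worker_count 4)" -m 1GB
```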

Frequent error codes at start

Error code | Possible cause | What to do
2   | The configuration file given to the -c option does not exist. | Check that you gave the correct path.
4   | Failed to initialize the UNIX domain socket file. | Does the directory given to the -S option exist? Does the file already exist? If it does, remove it before starting a new ldmsd.
6   | Failed to create the listening transport endpoint. | Is ZAP_LIBPATH set correctly? Is the given xprt valid? The valid transports are sock, rdma, and ugni.
7   | Failed to listen on the listener port. | Is the given port valid? Is the port in use?
9   | Failed to open the log file. | Does the directory exist?
15  | Failed to get the secret word. | Have you created a file containing the secret word? See man 7 ldms_authentication for the locations ldmsd searches for the secret word if LDMS_AUTH_FILE is not set.
22  | Failed to process the configuration file. | Check the correctness of the configuration file given to the -c option.
104 | Failed to open the listener port for ldmsd_controller. | Is the port given to the -p option valid? Is the port in use?
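
Two of the errors above (4 and 9) come from missing directories or a leftover socket file, which can be handled before ldmsd is started. The following is a minimal sketch of such a pre-flight step; ldmsd_preflight is a hypothetical helper, not part of OVIS, and it assumes that no other ldmsd is currently using the given socket path.

```shell
#!/bin/sh
# Sketch: prepare the log and socket locations before starting ldmsd.
# ldmsd_preflight is a hypothetical helper, not an OVIS command.
# Arguments: log file path, UNIX domain socket path.
ldmsd_preflight() {
    log_file=$1
    sock_file=$2
    # Error 9: the directory of the log file must exist.
    mkdir -p "$(dirname "$log_file")" || return 1
    # Error 4: the directory of the socket file must exist ...
    mkdir -p "$(dirname "$sock_file")" || return 1
    # ... and a stale socket file must be removed before a new start.
    rm -f "$sock_file"
}

# Usage:
#   ldmsd_preflight "$HOME/var/log/ovis/samplerd.log" \
#                   "$HOME/var/run/ovis/samplerd.sock"
```

Removing the socket file is safe only when the previous daemon has really exited; the sketch does not check for that.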

LDMSD Configuration

There are two ways to configure an ldmsd.

  • Use the -c option at the command line.
    • You will need to prepare a file containing the configuration commands you need.
  • Use the ldmsd_controller program to configure an ldmsd.

LDMSD configuration interface: ldmsd_controller

ldmsd_controller is an interactive Python program to control and configure an ldmsd. There are two ways for ldmsd_controller to connect to an ldmsd.

  • Using the control listener port. It is available only when the target ldmsd is started with the -p option. If OVIS is built with --enable-ovis_auth, the --auth_file option must be given with the path to the file containing the shared secret word. This approach allows users to configure an ldmsd remotely.
$ ldmsd_controller --host node1 \
    --port 20001 --auth_file ~/.ldmsauth.conf
  • Using the UNIX domain socket. It is available only when the target ldmsd is started with the -S option. ldmsd_controller must run on the same host as the target ldmsd. The --auth_file option is not needed regardless of how OVIS is built.
$ ldmsd_controller \
    --sockname $HOME/var/run/ovis/samplerd.sock

ldmsd_controller can be used without the interactive session by giving either the --source or the --script option.

  • --source: Path to a configuration file.
# Example of a configuration file to sample 
# meminfo metrics
$ cat samplerd.conf
load name=meminfo
config name=meminfo instance=samplerd/meminfo \
       producer=samplerd
start name=meminfo interval=1000000 offset=0
# 
# Pass the configuration file to ldmsd_controller
$ ldmsd_controller --host node1 --port 20001 \
    --auth_file ~/.ldmsauth.conf \
    --source samplerd.conf
  • --script: Path to a script that generates configuration commands.
# Example of a script that generates
# configuration commands for an aggregator
# to connect to 9 sampler daemons on node[1-9]
# and get updates for all sets.
$ cat agg-conf-gen.sh
#!/bin/bash
# One producer per sampler daemon.
for i in {1..9}; do
    echo "prdcr_add name=sampler$i host=node$i port=10001 xprt=sock type=active interval=20000000"
    echo "prdcr_start name=sampler$i"
done
# A single updater covers all producers, so it is added once, outside the loop.
echo "updtr_add name=all_sets interval=1000000 offset=100000"
echo "updtr_prdcr_add name=all_sets regex=.*"
echo "updtr_start name=all_sets"
#
# Pass the script to ldmsd_controller
$ ldmsd_controller --host node0 --port 20001 \
    --auth_file ~/.ldmsauth.conf \
    --script agg-conf-gen.sh

Configuration commands

Sampler daemons: sample metrics of interest

The first step is to load the sampler plugins that sample the metrics of interest, then configure them, and lastly start them. Each sampler plugin creates one or more metric sets, each containing multiple metrics, and periodically samples the metric values at the specified interval.

  • load: Load a sampler plugin
    • name: Sampler plugin library name. lib<name>.so must exist in LDMSD_PLUGIN_LIBPATH.
load name=<sampler_plugin_name>
  • config: Configure the loaded sampler plugin
    • name: Sampler plugin name
    • producer: Arbitrary string representing the ldmsd that samples the metric set. Later this is referred to as the set origin.
    • instance: Metric set name. This must be unique in the whole system. If two sets in a system have the same name, they might collide on an aggregator that collects both, even if they come from different sampler daemons.
    • component_id [optional]: A numeric ID representing the component that corresponds to the metric set.
    • schema [optional]: Metric set schema name. Every existing sampler plugin comes with a predefined metric set schema name. If all nodes in your system have identical kernels and hardware, you can ignore this argument.
      • A metric set schema is the definition of a metric set. It contains the metric name list, the metric type, and the metric value type. Every metric set corresponds to exactly one metric set schema, which cannot be modified after it is created.
      • In a system, a metric set schema name must have exactly one definition. For example, if two kernel versions produce different numbers of metrics in /proc/meminfo, then when you configure the meminfo sampler plugin you must give the nodes running one kernel version a different schema name than the nodes running the other.
    • plugin-specific arguments: Some sampler plugins require additional arguments, e.g., procnetdev requires the names of the network devices you want to collect metrics from. Use the usage command to get the usage of all loaded plugins.
config name=<sampler_plugin_name> producer=<arbitrary name for this ldmsd> instance=<metric_set_name> [Plugin-specific arguments]
  • start: Start sampling the metrics with the configured sampler plugin
    • name: Sampler plugin name
    • interval: Sample interval in microseconds
    • offset [optional]: Offset from second 0 of the minute, in microseconds. For example, with a 60-second interval, offset=10000000 makes the sampler plugin sample the metrics at second 10 of each minute.
start name=<sampler_plugin_name> interval=<sample_interval_in_microseconds> offset=<offset_from_second_0>
  • stop: Stop a running sampler plugin
    • name: Sampler plugin name
stop name=<running_sampler_plugin_name>
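
The load, config, and start commands above follow the same pattern for every plugin, so a sampler configuration file can be generated. The sketch below prints one such triple per plugin; gen_sampler_conf is a hypothetical helper, not part of OVIS, the instance naming scheme <producer>/<plugin> simply mirrors the meminfo example earlier in this guide, and plugins that need plugin-specific arguments (such as procnetdev) are not handled.

```shell
#!/bin/sh
# Sketch: generate load/config/start commands for several sampler plugins.
# gen_sampler_conf is a hypothetical helper, not an OVIS command.
# Arguments: producer name, sample interval in microseconds, plugin names...
gen_sampler_conf() {
    producer=$1
    interval=$2
    shift 2
    for plugin in "$@"; do
        echo "load name=$plugin"
        # Instance names must be unique system-wide, so prefix the producer.
        echo "config name=$plugin producer=$producer instance=$producer/$plugin"
        echo "start name=$plugin interval=$interval offset=0"
    done
}

# Usage:
#   gen_sampler_conf samplerd 1000000 meminfo vmstat > samplerd.conf
```

The resulting file could then be given to ldmsd with the -c option or to ldmsd_controller with --source.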

Aggregators: Collect metric sets from other ldmsd's

First, a connection must be established between the aggregator and the ldmsd that you want to collect metric sets from. Next, the aggregator must be configured to update the collected sets.

  • prdcr_add [mandatory]: Create a connection
    • name: Arbitrary string for the connection name. We call it a Producer, but it is independent of the producer in the sampler plugin config command and of the producer string stored in metric sets.
    • host: Hostname of the ldmsd you want to collect metric sets from.
    • port: Listener port of the ldmsd you want to collect metric sets from.
    • xprt: Transport the ldmsd uses to listen for connection requests.
    • type: Producer type: active or passive.
      • active: the aggregator sends a connection request to the ldmsd.
      • passive: the aggregator waits for a connection request from the ldmsd.
    • interval: Re-connect interval in microseconds, used if the aggregator fails to connect to the ldmsd or if the connection is lost.
prdcr_add name=<producer_name> host=<hostname> xprt=<transport_name> port=<listener_port> type=<producer_type> interval=<re-connect_interval>
  • prdcr_start: Start a connection (alternative to prdcr_start_regex).
    • name: Producer name given in prdcr_add
    • interval [optional]: Re-connect interval in microseconds.
prdcr_start name=<producer_name>
  • prdcr_start_regex: Start multiple producers. (alternative to prdcr_start)
    • regex: Regular expression to match to the producer names given in multiple prdcr_add commands
prdcr_start_regex regex=.*    # .* matches all strings.

You can have as many Producers as you want. Multiple Producers can connect to the same ldmsd.

A running Producer can be stopped and the connection will be closed. After it is stopped, the re-connect interval could be changed when it is started again or it can be deleted completely.

  • prdcr_stop: Stop a running Producer. The metric sets collected from the ldmsd on the other side of the connection will not be available on the aggregator any more.
    • name: Producer name
prdcr_stop name=<running_Producer_name>

Multiple Producers can be stopped at the same time using prdcr_stop_regex with the same arguments as prdcr_start_regex.

  • prdcr_del: Delete a Producer. The Producer must not be running.
    • name: Producer name.
prdcr_del name=<Producer_name>

Now we have created Producers and connected to the ldmsd's (either sampler daemons or aggregators). Next, we will configure this aggregator to get set updates by creating an Updater.

  • updtr_add [mandatory]: Add an Updater
    • name: Arbitrary string representing the Updater name.
    • interval: Update interval in microseconds.
    • offset [optional]: Offset from second 0 of the minute, in microseconds. It is the same as the offset in the start command that starts a sampler plugin.
updtr_add name=<Updater_name> interval=<update_interval> offset=<update_offset>
  • updtr_prdcr_add [mandatory]: Add a Producer filter that selects which Producers to get updates from.
    • name: Updater name
    • regex: Regular expression matching the names of the Producers that this Updater gets set updates from. The Producer name here refers to the name given in the prdcr_add command, which is independent of the producer name stored in the metric sets.
      • The Updater only gets updates for the sets collected from Producers whose names match the given regular expression.
updtr_prdcr_add name=<Updater_name> regex=<regex>
  • updtr_match_add [optional]: Add a set schema name or set name filter to the Updater. If you want to get updates for all sets from the Producers matched in updtr_prdcr_add, you can skip this command.
    • name: Updater name
    • regex: Regular expression to match set schema names or set names. The Updater gets updates only for the sets whose name or schema name matches the given regular expression.
    • match: Either schema or instance
updtr_match_add name=<Updater_name> regex=<regex> match=< schema | instance >
  • updtr_start [mandatory]: Start an Updater
    • name: Updater name
    • interval [optional]: Update interval. You could give a new value if you want to change it before starting the Updater.
    • offset [optional]: Update offset. You could give a new value if you want to change it before starting the Updater.
updtr_start name=<Updater_name>

A running Updater can be stopped and deleted, similar to Producers. A stopped Updater can be started again with the same or a different update interval and offset. To change the update interval or offset, give the desired value when calling updtr_start.

  • updtr_stop: Stop a running Updater. The metric sets updated by the Updater remain available, but their metric values are no longer updated.
    • name: Name of a running Updater
updtr_stop name=<Updater_name>
  • updtr_del: Delete an Updater. The Updater must not be running.
    • name: Updater name
updtr_del name=<Updater_name>

Aggregators: Store metric sets

You probably want to store the metric values you collect. Usually the metrics are stored at the aggregators that no other aggregator collects metric sets from. However, this is not a requirement.

As with sampler plugins, the storage plugins that will store metrics must be loaded and configured first. Then a Storage Policy must be created. A Storage Policy tells the aggregator to store the metric sets of a specific set schema name with the given storage plugin. After the Storage Policy is ready, it can be started, and the metric sets that match the criteria defined in the Storage Policy will be stored accordingly.

  • load: Load a storage plugin
    • name: Storage plugin name
load name=<storage plugin name>
  • config: Configure the loaded storage plugin
    • name: Storage plugin name
    • storage-plugin-specific arguments: Most storage plugins require the path argument, which is the path to the directory where the metrics will be stored. Some storage plugins provide more arguments for flexibility. Use the usage command to get the usage of all loaded plugins.
config name=<storage_plugin_name> [path=<path_to_store_directory>] [additional storage-plugin-specific arguments]

After all needed storage plugins are loaded, the necessary Storage Policies can be created.

  • strgp_add [mandatory]: Create a Storage Policy
    • name: Storage Policy name.
    • schema: Metric set schema name. The Storage Policy will store the metric sets with this schema name.
    • plugin: Storage plugin that will store the metric sets with the given schema name.
    • container: Arbitrary string. Most storage plugins use this string for the file name or directory name containing the metric data.
strgp_add name=<Storage_Policy_name> plugin=<storage_plugin_name> schema=<set_schema_name> container=<container_name>
  • strgp_prdcr_add [optional]: Add a Producer filter, similar to updtr_prdcr_add. The Storage Policy will only store the sets collected from Producers whose names match the given regular expression. You can skip this command if you want to store all sets of the schema name specified in strgp_add.
    • name: Storage Policy name
    • regex: Regular expression to match the Producer name. Note that the Producer name here refers to the string given in prdcr_add.
strgp_prdcr_add name=<Storage_policy_name> regex=<regex>
  • strgp_metric_add [optional]: Add a metric filter. The Storage Policy will store only the metric with the given name. This command can be given many times to tell ldmsd to store multiple metrics in the schema. If you want to store all metrics in the schema, you can skip this command.
    • name: Storage Policy name
    • metric: Metric name.
strgp_metric_add name=<Storage_Policy_name> metric=<metric_name>
  • strgp_start: Start a Storage Policy
    • name: Storage Policy name
strgp_start name=<Storage_Policy_name>
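
Putting the storage commands together, the sketch below prints a configuration that loads one storage plugin and creates and starts one Storage Policy per schema. gen_store_conf is a hypothetical helper, not part of OVIS; the policy naming scheme <schema>_policy is an arbitrary choice, and the sketch assumes the plugin needs only the path argument, which may not hold for every storage plugin (check the usage command).

```shell
#!/bin/sh
# Sketch: generate storage plugin and Storage Policy commands.
# gen_store_conf is a hypothetical helper, not an OVIS command.
# Arguments: storage plugin, store path, container name, schema names...
gen_store_conf() {
    plugin=$1
    path=$2
    container=$3
    shift 3
    echo "load name=$plugin"
    echo "config name=$plugin path=$path"
    for schema in "$@"; do
        # One Storage Policy per schema; the policy name is arbitrary.
        echo "strgp_add name=${schema}_policy plugin=$plugin schema=$schema container=$container"
        echo "strgp_start name=${schema}_policy"
    done
}

# Usage:
#   gen_store_conf store_csv "$HOME/var/ovis/csv" metrics meminfo vmstat > store.conf
```

No strgp_prdcr_add or strgp_metric_add commands are generated, so each policy stores all metrics of its schema from all Producers.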

Baler

For the Baler documentation, please visit here.
