*****************************************
  Monitoring the ARC Information System
*****************************************

The main configuration section for these probes is ``arcinfosys``, see
:ref:`configuration-files`.


EGIIS Check
===========

To monitor an EGIIS service, use ::

    check_egiis -H <HOST> [-P <PORT>] --index=<INDEX-NAME>

This will do an LDAP query of the EGIIS service on ``<HOST>:<PORT>``.  The
default port is 2135.  The base DN of the query is ``Mds-Vo-name=<INDEX-NAME>,
o=grid``.  The probe will also fetch the subschema at ``cn=subschema`` and
check the presence of attributes against MAY and MUST specifications in the
schema.  In addition, some type conversions are attempted to catch invalid
data.

Any validation error will give a CRITICAL Nagios status.  If the index is
empty, a WARNING Nagios status is reported.  Otherwise, the status is OK and
counts for the different registration states are printed.


CE Health State using EMIES
===========================

The following probe contacts the EMIES service of the compute element and
checks the ``HealthState`` element in the reply::

    check_arcservice -u <url> [-k <key-file> -c <cert-file>] [-t <timeout>]

``arcinfo -c <host>`` shows whether a CE supports EMIES and which URL to use.
EMIES uses SSL client authentication.  By default the host certificate is
used.  To use a grid proxy, pass it as both key and certificate.  Example::

    check_arcservice -u https://arcce.example.org:60000/arex \
                     -k /tmp/x509up_1000 -c /tmp/x509up_1000


CE Infosys Validation for the GLUE 2 LDAP Schema
================================================

You can test the GLUE 2 LDAP records published by a CE with ::

    check_arcglue2 -H <HOST> [-P <PORT>] \
            [--glue2-schema PATH] [--if-dependent-schema STATUS] \
            [--warn-if-missing OBJECTCLASS,...,OBJECTCLASS] \
            [--critical-if-missing OBJECTCLASS,...,OBJECTCLASS] \
            [--hierarchical-foreign-keys FOREIGN-KEY,...,FOREIGN-KEY] \
            [--hierarchical-aggregates]

See ``check_arcglue2 --help`` for a full list of options.

This probe will do a full query under ``o=glue`` on the provided host and port,
and perform the checks described below.  The default port is 2135.

As a basic check that the information system contains data,
``--warn-if-missing`` and ``--critical-if-missing`` may be passed a
comma-separated list of LDAP objectclasses for which there should be at least
one entry in the information system.  By default, a warning is raised if the
system has no entries of type ``GLUE2AdminDomain``, ``GLUE2Service``, or
``GLUE2Endpoint``.
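
The presence check can be pictured with the following sketch.  It is an
illustration only and not the probe's code: the function name
``missing_objectclasses`` and the representation of entries as dictionaries of
attribute lists are assumptions made for the example.

.. code-block:: python

    def missing_objectclasses(entries, required):
        """Return the required objectclasses that no entry carries."""
        present = set()
        for entry in entries:
            present.update(entry.get("objectClass", []))
        return [oc for oc in required if oc not in present]

    # Hypothetical search result lacking any GLUE2Endpoint entry.
    entries = [
        {"objectClass": ["GLUE2Domain", "GLUE2AdminDomain"]},
        {"objectClass": ["GLUE2Service"]},
    ]
    required = ["GLUE2AdminDomain", "GLUE2Service", "GLUE2Endpoint"]
    missing = missing_objectclasses(entries, required)

A non-empty ``missing`` list would then translate to a warning or a critical
status, depending on which option listed the objectclass.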

The probe will verify each entry using the GLUE 2 LDAP schema.  By default,
the GLUE 2 schema is expected at ``/etc/ldap/schema/GLUE20.schema``.  An
alternative path may be specified with the ``--glue2-schema`` option.  If the
schema is not found, a warning is raised and the schema is fetched from
``cn=subschema``.  The rationale behind this warning is that the content
should be checked independently of what the remote end claims it should be.
Another Nagios status can be specified with ``--if-dependent-schema``,
including ``OK`` to disable the warning.

As GLUE 2 is relational in nature, the probe does further checks on
connections which cannot be specified in the LDAP schema.  It checks
uniqueness of the ``*ID`` attributes, and the outgoing and incoming
multiplicities of ``*ForeignKey`` attributes as specified in the GLUE
Specification v2.0 [GLUE2]_ and the LDAP schema reference implementation
[GLUE2L]_.
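
As a rough illustration of the uniqueness part of this check, the sketch below
reports identifier values which occur more than once.  It is not taken from
the probe: the function name ``duplicate_ids``, the rule of matching every
attribute whose name ends in ``ID``, and the sample values are assumptions,
and the multiplicity checks on foreign keys are left out.

.. code-block:: python

    from collections import Counter

    def duplicate_ids(entries):
        """Return (attribute, value) pairs seen on more than one entry."""
        counts = Counter()
        for entry in entries:
            for attr, values in entry.items():
                if attr.endswith("ID"):
                    counts.update((attr, value) for value in values)
        return sorted(key for key, n in counts.items() if n > 1)

    # Hypothetical entries; two services share the same identifier.
    entries = [
        {"GLUE2ServiceID": ["urn:example:svc1"]},
        {"GLUE2ServiceID": ["urn:example:svc1"]},
        {"GLUE2EndpointID": ["urn:example:ep1"]},
    ]
    duplicates = duplicate_ids(entries)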

Further, the probe checks some of the constraints on the directory information
tree (DIT) [GLUE2L]_.  A critical condition is raised if the following
conditions are not met.

* All ``GLUE2Extension`` objects must appear immediately below the object they
  extend.

* Objects which are aggregates of a ``GLUE2Service`` must appear somewhere
  below that service.

* Services which link to a ``GLUE2AdminDomain`` cannot reside under a
  different domain.

Optionally you can require the DIT to reflect additional foreign keys, either
passing an explicit list to ``--hierarchical-foreign-keys``, or passing
``--hierarchical-aggregates`` to include all keys which represent aggregation
or composition.  Note that the latter will fail unless services are structured
under their administrative domain, if any.


CE Infosys Validation for the NorduGrid and GLUE 1 Schemas
==========================================================

The ARIS probe is invoked with ::

    check_aris -H <HOST> [-P <PORT>] [--cluster <CLUSTER>...] \
            [--cluster-test <testname>...] [--queue-test <testname>...] \
            [OTHER-OPTIONS...]

See ``check_aris --help`` for the full list of options.
It will query ``Mds-Vo-name=local, o=grid`` on ``<HOST>:<PORT>``.  The default
port is 2135.  If one or more clusters are specified with the ``--cluster``
option, only those will be checked (``nordugrid-cluster-name=<CLUSTER>``), and
it is considered an error for any of them to be missing.  The probe validates
attributes of entries against MAY and MUST of the schema, and attempts some
type conversions.  For each found cluster, the probe will query and validate
queues.

If no clusters are found, or if no queues are found for a given cluster, it
will be reported as a warning.  You can change this by passing a Nagios status
to the option ``--if-no-clusters`` or ``--if-no-queues``, respectively.
Valid statuses are ``ok``, ``warning``, ``critical``, and ``unknown``, though
only the first three make sense here.

This probe can also do custom checks on the LDAP data, either numeric limits
or regular-expression matches.  A custom test, defined in the configuration
file under a section ``arcinfosys.aris.<testname>``, can be enabled by passing
any number of ``--cluster-test <testname>`` and ``--queue-test <testname>``
options to the probe.  The tests are run on entries of the type
``nordugrid-cluster`` and ``nordugrid-queue``, respectively.

The ARIS infosystem contains an attribute ``nordugrid-cluster-contactstring``
which provides the interface for job submission.  You can check that this URL
is accessible by passing ``--check-contact``.  This will do a list operation
and, if the logging level is ``INFO`` or lower, will report the number of
entries.  If the attribute is missing or the URL is inaccessible, the service
goes CRITICAL with an appropriate message.


Limit Checks
------------

A limit check takes the form:

.. code-block:: ini

    [arcinfosys.aris.<testname>]
    type = limit
    value = <expr>
    critical.min = <value>
    critical.max = <value>
    critical.message = <message>
    warning.min = <value>
    warning.max = <value>
    warning.message = <message>

The ``type`` and ``value`` variables are required, and at least one of the
``min`` or one of the ``max`` variables should be given for the test to be
useful.  There are reasonable defaults for the messages, though if your
``<expr>`` is complex, you may want to provide a more human-readable version.
The probe will

* Evaluate ``<expr>`` using Python's ``eval`` function, in an environment which
  maps variable names to the converted values of the corresponding LDAP
  attributes.  The variable names are obtained from the attribute names by
  replacing "``-``" with "``_``" and stripping common prefixes including
  "``nordugrid-cluster-``", "``nordugrid-queue-``", and "``Mds-``".

* If ``critical.min`` is given and the result is below this value, or if
  ``critical.max`` is given and the result is above this value, report it as a
  critical error.

* Similarly for ``warning.min`` and ``warning.max``, reported as a warning.
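
The steps above can be sketched in Python as follows.  This is an
illustration under stated assumptions and not the probe's implementation: the
helper names ``to_varname`` and ``check_limit``, the exact prefix list, and
the sample attribute values are made up for the example.

.. code-block:: python

    # Assumed prefix list; the probe may strip further prefixes.
    PREFIXES = ("nordugrid-cluster-", "nordugrid-queue-", "Mds-")

    def to_varname(attr):
        """Strip a known prefix and replace "-" with "_"."""
        for prefix in PREFIXES:
            if attr.startswith(prefix):
                attr = attr[len(prefix):]
                break
        return attr.replace("-", "_")

    def check_limit(expr, entry, crit_min=None, crit_max=None,
                    warn_min=None, warn_max=None):
        """Evaluate expr over the entry and compare it to the limits."""
        env = {to_varname(attr): value for attr, value in entry.items()}
        result = eval(expr, {}, env)
        if crit_min is not None and result < crit_min:
            return "CRITICAL"
        if crit_max is not None and result > crit_max:
            return "CRITICAL"
        if warn_min is not None and result < warn_min:
            return "WARNING"
        if warn_max is not None and result > warn_max:
            return "WARNING"
        return "OK"

    # Hypothetical queue entry with already converted integer values.
    queue = {"nordugrid-queue-running": 40, "nordugrid-queue-maxrunning": 50}
    status = check_limit("maxrunning - running", queue,
                         crit_min=1, warn_min=20)

Here the expression evaluates to 10, which is above the critical minimum but
below the warning minimum, so the sketch yields a warning.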


Regular Expression Checks
-------------------------

A regular expression check takes the form:

.. code-block:: ini

    [arcinfosys.aris.<testname>]
    type = regex
    variable = <varname>
    critical.pattern = <python-regex>
    critical.message = <message>
    warning.pattern = <python-regex>
    warning.message = <message>

The ``type`` and ``variable`` settings are required, and you should specify at
least one of ``critical.pattern`` and ``warning.pattern``.  The variable name
is obtained the same way as for the limit checks.  The probe will consider all
values for the LDAP attribute corresponding to ``<varname>``.

* If ``critical.pattern`` is specified and none of the values match it, then a
  critical condition is reported, else

* if ``warning.pattern`` is specified and none of the values match it, then a
  warning is reported.

The following example will issue a critical state if a queue is not active:

.. code-block:: ini

    [arcinfosys.aris.queue-active]
    type = regex
    variable = status
    critical.pattern = ^active$
    critical.message = Inactive queue


Glue Schema Checks
------------------

Some CEs publish cluster and queue information in the Glue schema in addition
to the NorduGrid schema.  You can enable schema checks for these if present by
passing ``--enable-glue``.

The information in the Glue entries should match information in the ARC
entries as described in [ARCIS2011]_.  You can enable a partial comparison of
GlueCE, GlueCluster, and GlueSubCluster records by passing ``--compare-glue``.


Checking Expiration of Host Certificates
========================================

A separate probe is provided for checking the host certificate as reported by
the information system::

    check_archostcert -H <HOST> [-p <PORT>] \
                      [-c <CRITDAYS>] [-w <WARNDAYS>] [-t <TIMEOUT>]

The suggestion is to run this for each compute element at a low frequency,
like once or a few times a day.  A command definition like ::

    define command {
        command_name check_archostcert
        command_line $USER1$/check_archostcert -H $HOSTNAME$ -c 7 -w 31
    }

will warn about a certificate one month before it expires and report a
critical status one week before.  The port number defaults to 2135, but can be
changed with ``-p <port>``, and a timeout of ``<T>`` seconds is specified as
``-t <T>``.  See also ``check_archostcert --help``.
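
The effect of the two thresholds can be illustrated with the sketch below.  It
is not the probe's code: the function name ``expiry_status`` is hypothetical,
the default values are copied from the command definition above rather than
from the probe, and treating the thresholds as exclusive bounds is an
assumption.

.. code-block:: python

    from datetime import datetime, timedelta

    def expiry_status(not_after, now, critdays=7, warndays=31):
        """Map the remaining certificate lifetime to a Nagios status."""
        remaining = not_after - now
        if remaining < timedelta(days=critdays):
            return "CRITICAL"
        if remaining < timedelta(days=warndays):
            return "WARNING"
        return "OK"

    # Fixed reference time so that the example is reproducible.
    now = datetime(2024, 1, 1)
    soon = expiry_status(datetime(2024, 1, 5), now)
    later = expiry_status(datetime(2024, 1, 20), now)
    far = expiry_status(datetime(2024, 3, 1), now)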

The lifetime of the host certificate can also be checked using a generic HTTPS
probe against the EMIES service, as long as the probe supports client
authentication and lifetime checks.


.. [GLUE2]
    "GLUE Specification v2.0";
    Sergio Andreozzi (ed.), et al.;
    http://www.ogf.org/documents/GFD.147.pdf
.. [GLUE2L]
    "GLUE v. 2.0 – Reference Implementation of an LDAP Schema";
    Sergio Andreozzi (ed.), et al.;
    https://forge.ogf.org/sf/docman/do/downloadDocument/projects.glue-wg/docman.root.drafts/doc15526
.. [ARCIS2011]
    "The NorduGrid-ARC Information System";
    Balázs Kónya and Daniel Johansson;
    NORDUGRID-TECH-4;
    http://www.nordugrid.org/documents/arc_infosys.pdf
