***********************************
  Monitoring ARC Compute Elements
***********************************

General Configuration
=====================

For each CE to monitor, run ::

    check_arcce_submit -H <HOST>

This should be run at a relatively low frequency in order to let one job
finish before the next is submitted.  The probe keeps track of submitted jobs,
and will hold the next submission if necessary.  Subsequent sections describe
additional options for testing data-staging, running custom scripts, etc.

On a more regular basis, every 5 minutes or so, run ::

    check_arcce_monitor

which monitors the status of all jobs for each host and passively submits the
results to a service matching the host name and the service description "ARCCE
Job Termination".  The passive service name can be configured.

Finally, a probe is provided to tidy the ARC job list after unsuccessful
attempts by ``check_arcce_monitor`` to clean jobs.  This is also set up as a
single service, and only needs to run occasionally, e.g. once a day::

    check_arcce_clean

For additional options, see ::

    check_arcce_submit --help
    check_arcce_monitor --help
    check_arcce_clean --help


Plugin Configuration
--------------------

The main configuration section for this probe is ``arcce``, see
:ref:`configuration-files`.  This probe requires an X509 proxy, see
:ref:`x509-proxy`.

Connection URLs for job submission (the ``--ce`` option) may be specified in
the section ``arcce.connection_urls``.

Example::

    [arcce]
    voms = ops
    user_cert = /etc/nagios/globus/robot-cert.pem
    user_key = /etc/nagios/globus/robot-key.pem
    loglevel = DEBUG

    [arcce.connection_urls]
    arc1.example.org = ARC1:https://arc1.example.org:443/ce-service
    arc0.example.org = ARC0:arc0.example.org:2135/nordugrid-cluster-name=arc0.example.org,Mds-Vo-name=local,o=grid

The ``user_key`` and ``user_cert`` options may be better placed in the common
``gridproxy`` section.
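
For example, the credentials could be moved into the common section like this
(a sketch; it assumes the ``gridproxy`` section accepts the same ``user_cert``
and ``user_key`` option names as the ``arcce`` section):

.. code-block:: ini

    [gridproxy]
    user_cert = /etc/nagios/globus/robot-cert.pem
    user_key = /etc/nagios/globus/robot-key.pem

    [arcce]
    voms = ops
    loglevel = DEBUG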


Nagios Configuration
--------------------

You will need command definitions for monitoring and submission::

    define command {
        command_name check_arcce_monitor
        command_line $USER1$/check_arcce_monitor -H $HOSTNAME$
    }
    define command {
        command_name check_arcce_clean
        command_line $USER1$/check_arcce_clean -H $HOSTNAME$
    }
    define command {
        command_name check_arcce_submit
        command_line $USER1$/check_arcce_submit -H $HOSTNAME$ \
                        [--test <test_name> ...]
    }

For monitoring and cleanup, add services like ::

    define service {
        use                     monitoring-service
        host_name               localhost
        service_description     ARCCE Monitoring
        check_command           check_arcce_monitor
    }
    define service {
        use                     monitoring-service
        host_name               localhost
        service_description     ARCCE Cleaner
        check_command           check_arcce_clean
        normal_check_interval   1440
        retry_check_interval    120
    }

For each host, add something like ::

    define service {
        use                     submission-service
        host_name               arc0.example.org
        service_description     ARCCE Job Submission
        check_command           check_arcce_submit
    }
    define service {
        use                     passive-service
        host_name               arc0.example.org
        service_description     ARCCE Job Termination
        check_command           check_passive
    }

The ``--test <test_name>`` option enables tests to run in addition to a plain
job submission.  Tests are specified in individual sections of the
configuration files, as described below.  Such a test may optionally submit its
results to a named passive service instead of the above termination service.
To do so, add the Nagios configuration for the service and duplicate the
``service_description`` in the section defining the test.
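
For example, if a test section sets ``service_description = ARCCE Python
version``, the matching passive Nagios service could look like this sketch,
modeled on the passive service above::

    define service {
        use                     passive-service
        host_name               arc0.example.org
        service_description     ARCCE Python version
        check_command           check_passive
    }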

See ``arcce-example.cfg`` for a more complete Nagios configuration.


Running Multiple Job Services on the Same Host
----------------------------------------------

By default, running jobs are tracked on a per-host basis.  To define multiple
job submission services for the same host, pass ``--job-tag`` a tag which
identifies the service uniquely on that host.  Remember to also add a passive
service and pass the corresponding ``--termination-service`` option.

The scheme for configuring an auxiliary submission/termination service is::

    define command {
        command_name check_arcce_submit_<test_name>
        command_line $USER1$/check_arcce_submit -H $HOSTNAME$ \
            --job-tag <test_name> \
            --termination-service 'ARCCE Job Termination for <Test-Description>' \
            [--test <test1> ...]
    }
    define service {
        use                     submission-service
        host_name               arc0.example.org
        service_description     ARCCE Job Submission for <Test-Description>
        check_command           check_arcce_submit_<test_name>
    }
    define service {
        use                     passive-service
        host_name               arc0.example.org
        service_description     ARCCE Job Termination for <Test-Description>
        check_command           check_passive
    }


Custom Job Descriptions
-----------------------

If the generated job scripts and job descriptions are not sufficient, you can
provide hand-written ones by passing the ``--job-description`` option to the
``check_arcce_submit`` command.  This option is incompatible with ``--test``.

Currently no substitutions are done in the job description file, other than
what may be provided by ARC.
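
For illustration, a hand-written description could be submitted as follows;
the file name and its xRSL content are hypothetical::

    check_arcce_submit -H arc0.example.org \
        --job-description /etc/nagios/arcce/custom-job.xrsl

where ``custom-job.xrsl`` might contain a minimal description such as::

    &(executable="/bin/echo")
     (arguments="hello")
     (stdout="echo.out")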


Job Tests
=========

Scripted Checks
---------------

It is possible to add custom commands to the job scripts and do a regular
expression match on the output.  E.g. to test that Python is installed and
report the version, add the following section to the plugin configuration
file::

    [arcce.python]
    jobplugin = scripted
    required_programs = python
    script_line = python -V >python.out 2>&1
    output_file = python.out
    output_pattern = Python\s+(?P<version>\S+)
    status_ok = Found Python version %(version)s.
    status_critical = Python version not found in output.
    service_description = ARCCE Python version

The options are

required_programs
    Space-separated list of programs to check for before running the script.
    If one of the programs is not found, it's reported as a critical error.

script_line
    A one-line shell command to run, restricted to features commonly supported
    by ``/bin/sh`` on your CEs.

output_file
    The name of the file your script produces.  This is mandatory, and the same
    file will be used to communicate errors back to ``check_arcce_monitor``.
    The reason standard output is not used is to allow multiple job tests to
    publish independent passive results.

output_pattern
    This is a Python regular expression which is searched for in the output of
    the script.  The search stops at the first matched line.  You cannot match
    more than one line, so distill the output in ``script_line`` if necessary.
    Named regular expression groups of the form ``(?P<v>...)`` capture their
    match in a variable *v*, which can be substituted into the status messages.

status_ok
    The status message if the above regular expression matches.  A named regular
    expression group captured in a variable *v* can be substituted with
    ``%(v)s``.

status_critical
    Status message if the regular expression does not match.  Obviously you
    cannot substitute RE groups here.  If the test for required programs
    fails, the status message will instead indicate which programs are
    missing.

service_description
    The ``service_description`` of the passive Nagios service to which results
    are reported.

See :ref:`example-ini` for more examples.

It is possible to give more control over the probe status to the remote
script.  Instead of ``output_pattern`` the script may pass status messages and
an exit code back to Nagios.  This is done by printing certain magic strings
to the file specified by ``output_file``:

  * ``__status <status-code> <status-message>`` sets the exit code and status
    line of the probe.
  * ``__log <level> <message>`` emits an additional status line which will be
    shown if the log level set in the probe configuration is at least
    ``<level>``, a numeric value from the Python ``logging`` module.
  * ``__exit <exit-code>`` is used to report the exit code of a script.
    Anything other than 0 will cause a CRITICAL status.  You probably don't
    want to use this yourself.
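
For instance, a remote script could report back by appending lines like these
to its output file (a sketch, assuming ``output_file = myjob.out``):

.. code-block:: sh

    # Emit a detail line shown at log level 20 (INFO) or lower.
    echo '__log 20 Checked 3 mount points.' >> myjob.out
    # Set the probe exit code (0 = OK) and the final status line.
    echo '__status 0 All file systems mounted.' >> myjob.out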

The ``__status`` line may occur before, between, or after ``__log`` lines.
This can be convenient for logging detailed check results and issues before
the final status is known.

It is possible to adapt this to a Nagios-style probe ``check_foo`` by wrapping
it in some shell code:

.. code-block:: sh

    script_line = (/bin/sh check_foo 2>&1; echo __status $?) | \
        (read msg; sed -e 's/^/__log 20 /' -e '$s;^__log 20 \(.*\);\1 '"$msg;") \
        > check_foo.out
    output_file = check_foo.out
    staged_inputs = file:////path-to/check_foo


Staging Checks
--------------

The "staging" job plug-in checks that file staging works in connection with
job submission.  It is enabled with ``--test <test-name>`` where the
plugin configuration file contains a corresponding section::

    [arcce.<test-name>]
    jobplugin = staging
    staged_inputs = <URL> ... <URL>
    staged_outputs = <URL> ... <URL>
    service_description = <TARGET-FOR-PASSIVE-RESULT>

Note that the URLs are space-separated.  They can be placed on separate
indented lines.  Within the URLs, the following substitutions may be useful:

``%(hostname)s``
    The argument to the ``-H`` option if passed to the probe, else "localhost".
``%(epoch_time)s``
    The integer number of seconds since Epoch.
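
As a sketch, a staging section exercising both substitutions could look like
this (the storage element host, path, and service name are placeholders):

.. code-block:: ini

    [arcce.srm-staging]
    jobplugin = staging
    staged_outputs = srm://se.example.org/testing/%(hostname)s-%(epoch_time)s.txt
    service_description = ARCCE SRM Staging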

If a staging check fails, the whole job fails, so its status cannot be
submitted to an individual passive service as with scripted checks.  For this
reason, it may be preferable to create one or more individual submission
services dedicated to file staging.  Remember to pass unique names to
``--job-tag`` to isolate them.


Custom Substitutions in Job Test Sections
=========================================

Additional variables for substitution into the fields of job test sections can
be defined.  You enable this by adding a field ``variables`` containing a
space-separated list of extra variables or bundles of variables to add:

.. code-block:: ini

    [arcce.<test-name>]
    jobplugin = ...
    variables = <var-1> ... <var-n>
    ...

Each variable is defined using one of the following methods.  A variable can
always be overridden by passing ``-O <var>=<value>`` to the probe.  Variables
may themselves refer to other variables, so it is implied that you can add a
field

.. code-block:: ini

    variables = <var-1> ... <var-n>

to any of the sections below.  Cyclic references are reported and cause an
UNKNOWN status.

**Probe Option**.
A section of the form

.. code-block:: ini

    [variable.<var>]
    method = option
    default = <default-value>

declares ``<var>`` as an option which can be passed to the probe with ``-O
<var>=<value>``.  The ``default`` field may be omitted, in which case the
probe option becomes mandatory for any tests using the variable.
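
For example, a hypothetical ``vo_dir`` variable could be declared with a
default and then overridden per service:

.. code-block:: ini

    [variable.vo_dir]
    method = option
    default = ops

and on the command line::

    check_arcce_submit -H arc0.example.org --test mytest -O vo_dir=atlas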

**UNIX Environment**.
A section of the following form declares that ``<var>`` shall be imported from
the UNIX environment.  If no default value is provided, then the environment
variable must be exported to the probe.

.. code-block:: ini

    [variable.<var>]
    method = getenv

**Custom Time Stamp.**
This method provides a custom time stamp format as an alternative to
``%(epoch_time)s``.  It takes the form

.. code-block:: ini

    [variable.<var>]
    method = strftime
    format = <escaped-strftime-style-format>

Note that the ``%`` characters in the ``format`` field must be escaped as
``%%``, so as to avoid attempts to parse them as interpolations.  An
alternative ``raw_format`` field can be used, which is interpreted literally.
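
For example, the following two declarations (with illustrative variable names)
produce the same ISO-like time stamp, once with escaped interpolation
characters and once literally:

.. code-block:: ini

    [variable.now]
    method = strftime
    format = %%Y-%%m-%%dT%%H:%%M:%%S

    [variable.now_raw]
    method = strftime
    raw_format = %Y-%m-%dT%H:%M:%S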

**Random Line from File.**
A section of the following form picks a random line from ``<path>``.  A low
entropy system source is used for seeding.

.. code-block:: ini

    [variable.<var>]
    method = random_line
    input_file = <path>

Leading and trailing spaces are trimmed, and empty lines are ignored.
Referring to a section of this kind in a ``variables`` field causes the
file to be read, regardless of whether the variable is used or not.

**Example.**
In the following staging test, ``%(se_host)s`` is replaced by a random host
name from the file ``/var/lib/gridprobes/ops/goodses.conf``, and ``%(now)s``
is replaced by a customized time stamp.

.. code-block:: ini

    [arcce.srm]
    jobplugin = staging
    variables = se_host now
    staged_outputs = srm://%(se_host)s/testing/%(hostname)s-%(now)s.txt
    service_description = Test Service

    [variable.se_host]
    method = random_line
    input_file = /var/lib/gridprobes/ops/goodses.conf

    [variable.now]
    method = strftime
    raw_format = %FT%T
