***********************************
  Monitoring ARC Compute Elements
***********************************

For each CE to monitor, run ::

    check_arcce -H <HOST> submit

This should be run at a relatively low frequency in order to let one job
finish before the next is submitted.  The probe keeps track of submitted jobs,
and will hold the next submission if necessary.  Subsequent sections describe
additional options for testing data-staging, running custom scripts, etc.

More frequently, every 5 minutes or so, run ::

    check_arcce -H <NAGIOS-HOST> monitor

which checks the status of all jobs on each host and submits the result
passively to the service that matches the host name and has the service
description "ARCCE Job Termination".  The passive service name can be
configured.

Finally, a probe is provided to tidy the ARC job list after unsuccessful
attempts by ``check_arcce monitor`` to clean jobs.  This is also set up as a
single service, and only needs to run occasionally, for example once a day::

    check_arcce -H <NAGIOS-HOST> clean

For additional options, see ::

    check_arcce --help
    check_arcce submit --help
    check_arcce monitor --help
    check_arcce clean --help


Plugin Configuration
--------------------

The main configuration section for this probe is ``arcce``; see
:ref:`configuration-files`.  This probe requires an X509 proxy; see
:ref:`x509-proxy`.

Connection URLs for job submission (the ``--ce`` option) may be specified in
the section ``arcce.connection_urls``.

Example::

    [arcce]
    voms = ops
    user_cert = /etc/nagios/globus/robot-cert.pem
    user_key = /etc/nagios/globus/robot-key.pem
    loglevel = DEBUG

    [arcce.connection_urls]
    arc1.example.org = ARC1:https://arc1.example.org:443/ce-service
    arc0.example.org = ARC0:arc0.example.org:2135/nordugrid-cluster-name=arc0.example.org,Mds-Vo-name=local,o=grid

The ``user_key`` and ``user_cert`` options may be better placed in the common
``gridproxy`` section.
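
To sanity-check such a file by hand, the sections can be read with Python's
standard ``configparser`` module.  This is only a sketch for inspecting a
configuration, not the probe's own parsing code; the inlined sample text and
the variable names are assumptions made for illustration::

    import configparser
    import textwrap

    # Sample text mirroring the example above (inlined for illustration).
    SAMPLE = textwrap.dedent("""\
        [arcce]
        voms = ops
        loglevel = DEBUG

        [arcce.connection_urls]
        arc1.example.org = ARC1:https://arc1.example.org:443/ce-service
        """)

    # Split keys from values on '=' only, since the URLs contain ':', and
    # disable interpolation so that any '%' in a value is kept literally.
    parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
    parser.read_string(SAMPLE)

    connection_urls = dict(parser["arcce.connection_urls"])
    for host, url in connection_urls.items():
        print(host, "->", url)

To inspect a real file, replace ``read_string`` with ``parser.read(path)``.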


Nagios Configuration
--------------------

You will need command definitions for monitoring, cleaning, and submission::

    define command {
        command_name check_arcce_monitor
        command_line $USER1$/check_arcce -H $HOSTNAME$ monitor
    }
    define command {
        command_name check_arcce_clean
        command_line $USER1$/check_arcce -H $HOSTNAME$ clean
    }
    define command {
        command_name check_arcce_submit
        command_line $USER1$/check_arcce -H $HOSTNAME$ submit \
                        [--test <test_name> ...]
    }

Monitoring and cleaning are each set up as a single service, like ::

    define service {
        use                     monitoring-service
        host_name               localhost
        service_description     ARCCE Monitoring
        check_command           check_arcce_monitor
    }
    define service {
        use                     monitoring-service
        host_name               localhost
        service_description     ARCCE Cleaner
        check_command           check_arcce_clean
        normal_check_interval   1440
        retry_check_interval    120
    }

For each host, add something like ::

    define service {
        use                     submission-service
        host_name               arc0.example.org
        service_description     ARCCE Job Submission
        check_command           check_arcce_submit
    }
    define service {
        use                     passive-service
        host_name               arc0.example.org
        service_description     ARCCE Job Termination
        check_command           check_passive
    }

The ``--test <test_name>`` option enables tests to run in addition to a plain
job submission.  Tests are specified in individual sections of the
configuration files as described below.  Such a test may optionally submit its
results to a named passive service instead of the above termination service.
To do so, add the Nagios configuration for the service and repeat its
``service_description`` in the section defining the test.

See ``arcce-example.cfg`` for a more complete Nagios configuration.


Running Multiple Job Services on the Same Host
----------------------------------------------

By default, running jobs are tracked on a per-host basis.  To define multiple
job submission services for the same host, pass to ``--job-tag`` a tag which
identifies the service uniquely on this host.  Remember to also add a passive
service and pass the corresponding ``--termination-service`` option.

The scheme for configuring an auxiliary submission/termination service is::

    define command {
        command_name check_arcce_submit_<test_name>
        command_line $USER1$/check_arcce -H $HOSTNAME$ submit \
            --job-tag <test_name> \
            --termination-service 'ARCCE Job Termination for <Test-Description>' \
            [--test <test1> ...]
    }
    define service {
        use                     submission-service
        host_name               arc0.example.org
        service_description     ARCCE Job Submission for <Test-Description>
        check_command           check_arcce_submit_<test_name>
    }
    define service {
        use                     passive-service
        host_name               arc0.example.org
        service_description     ARCCE Job Termination for <Test-Description>
        check_command           check_passive
    }


Scripted Checks
---------------

It is possible to add custom commands to the job scripts and do a regular
expression match on the output.  E.g. to test that Python is installed and
report the version, add the following section to the plugin configuration
file::

    [arcce.python]
    jobplugin = scripted
    required_programs = python
    script_line = python -V >python.out 2>&1
    output_file = python.out
    output_pattern = Python\s+(?P<version>\S+)
    status_ok = Found Python version %(version)s.
    status_critical = Python version not found in output.
    service_description = ARCCE Python version

The options are

required_programs
    Space-separated list of programs to check for before running the script.
    If one of the programs is not found, it's reported as a critical error.

script_line
    One-line shell code to run.  It may use features commonly supported by
    ``/bin/sh`` on your CEs.

output_file
    The name of the file your script produces.  This is mandatory, and the same
    file will be used to communicate errors back to ``check_arcce``.

output_pattern
    This is a Python regular expression which is searched for in the output of
    the script.  The search stops at the first matching line.  You cannot match
    more than one line, so distill the output in ``script_line`` if necessary.
    Named regular expression groups of the form ``(?P<v>...)`` capture their
    match in a variable *v*, which can be substituted in the status messages.

status_ok
    The status message if the above regular expression matches.  A named regular
    expression group captured in a variable *v* can be substituted with
    ``%(v)s``.

status_critical
    The status message if the regular expression does not match.  Since there
    is no match, RE groups cannot be substituted here.  If the test for
    required programs fails, then the status message will indicate which
    programs are missing instead.

service_description
    The ``service_description`` of the passive Nagios service to which results
    are reported.

See ``arcnagios.ini`` for more examples.
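
The interplay of ``output_pattern``, ``status_ok`` and ``status_critical`` can
be reproduced in a few lines of Python.  This is a sketch of the documented
behaviour, not the plugin's implementation; the function name ``evaluate`` and
the sample outputs are made up for illustration::

    import re

    # Values copied from the [arcce.python] example section above.
    output_pattern = r"Python\s+(?P<version>\S+)"
    status_ok = "Found Python version %(version)s."
    status_critical = "Python version not found in output."

    def evaluate(output_text):
        """Return the status message for the first line matching the pattern."""
        for line in output_text.splitlines():
            match = re.search(output_pattern, line)
            if match:
                # Named groups become the substitution variables.
                return status_ok % match.groupdict()
        return status_critical

    print(evaluate("Python 3.11.4\n"))
    print(evaluate("sh: python: not found\n"))

The first call reports the captured version, while the second falls back to
the critical message because no line matches the case-sensitive pattern.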


Staging Checks
--------------

The "staging" job plug-in checks that file staging works in connection with
job submission.  It is enabled with ``--test <test-name>`` where the
plugin configuration file contains a corresponding section::

    [arcce.<test-name>]
    jobplugin = staging
    staged_inputs = <URL> ... <URL>
    staged_outputs = <URL> ... <URL>
    service_description = <TARGET-FOR-PASSIVE-RESULT>

Note that the URLs are space-separated.  They can be placed on separate
indented lines.  Within the URLs, the following substitutions may be useful:

``%(hostname)s``
    The argument to the ``-H`` option if passed to the probe, else "localhost".
``%(epoch_time)s``
    The integer number of seconds since Epoch.
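
The ``%(name)s`` placeholders follow the syntax of Python's ``%``-formatting
with a mapping.  As a sketch of the resulting URLs, assuming the two variables
above are the only ones used, the expansion can be imitated as follows.  The
storage host ``se.example.org``, the paths and the helper ``expand`` are
hypothetical, and the probe's own expansion code may differ::

    import time

    def expand(template, hostname="localhost"):
        """Fill in the two documented substitutions using %-formatting."""
        values = {"hostname": hostname, "epoch_time": int(time.time())}
        return template % values

    # Hypothetical option value; the URLs are separated by whitespace.
    staged_outputs = (
        "gsiftp://se.example.org/nagios/%(hostname)s-%(epoch_time)s.txt "
        "gsiftp://se.example.org/nagios/%(hostname)s-latest.txt"
    )
    urls = [expand(u, hostname="arc0.example.org")
            for u in staged_outputs.split()]
    for url in urls:
        print(url)

Including ``%(epoch_time)s`` in an output URL gives each run a distinct target
name, which avoids collisions with files left over from earlier jobs.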

If a staging check fails, the whole job will fail, so its status cannot be
submitted to an individual passive service as with scripted checks.  For this
reason, it may be preferable to create one or more individual submission
services dedicated to file staging.  Remember to pass unique names to
``--job-tag`` to isolate them.


Custom Job Descriptions
-----------------------

If the generated job scripts and job descriptions are not sufficient, you
can provide hand-written ones by passing the ``--job-description`` option to
the submit subcommand.  This option is incompatible with ``--test``.

Currently no substitutions are done in the job description file, other than
what may be provided by ARC.
