Mon Dec 13 18:11:16 2010                        Michael Jennings (mej)

Initial check-in.
----------------------------------------------------------------------
Mon Dec 13 19:12:54 2010                        Michael Jennings (mej)

Work in progress:  Initial skeleton for node health check.
----------------------------------------------------------------------
Tue Dec 14 11:15:23 2010                        Michael Jennings (mej)

Completed driver script for health check.  Now to write checks and
test it.
----------------------------------------------------------------------
Wed Dec 15 17:01:43 2010                        Michael Jennings (mej)

Add packaging goop.
----------------------------------------------------------------------
Wed Dec 15 19:10:33 2010                        Michael Jennings (mej)

Added some checks and a sample config.  In the process of debugging.
----------------------------------------------------------------------
Wed Dec 15 20:51:38 2010                        Michael Jennings (mej)

Debugging errors in the FS check.
----------------------------------------------------------------------
Wed Dec 15 21:19:35 2010                        Michael Jennings (mej)

Thanks to Greg, I fixed the handling of subshell (pipeline)
vs. non-subshell while loops.  Everything appears to be working now.
----------------------------------------------------------------------
Thu Dec 16 20:51:57 2010                        Michael Jennings (mej)

Allow for more readable config files by stripping surrounding
whitespace from target and check values.

Convert comment check to native bash to avoid spawning grep and a
subshell.

Fix bug with regexp targets in config.  Forgot to *actually* strip off
the slashes...
----------------------------------------------------------------------
Thu Dec 16 21:24:58 2010                        Michael Jennings (mej)

Fix output redirection and debugging.
----------------------------------------------------------------------
Thu Mar 31 15:50:46 2011                        Michael Jennings (mej)

Fix spec to package all installed files and directories.

Wrap I/O in "eval" to make sure $LOGFILE redirection symbols are used
properly.

Fix typo in $SILENT check.

Fix config file parsing to be compatible with bash >= 3.2.
----------------------------------------------------------------------
Thu Mar 31 17:46:42 2011                        Michael Jennings (mej)

Don't log timestamp; we're trying to avoid subprocesses.

On error, syslog the reason.
----------------------------------------------------------------------
Thu Apr 21 18:46:17 2011                        Michael Jennings (mej)

This is a work in progress.  I'm testing a bunch of stuff, so some or
all of it may end up not working.  We'll see.

 - Added routine to gather /etc/passwd data into arrays
 - Added userid-to-UID mapping function
 - Consolidated process checks to use a single spawning of "ps"
 - Added routine to gather list from TORQUE of users who currently
   have jobs running on the node
 - Added check for unauthorized processes running on the node
 - Added timeout in background to kill nhc if it hangs to avoid
   hanging pbs_mom
 - Eliminated several unnecessary forks
----------------------------------------------------------------------
Fri Apr 22 14:03:32 2011                        Michael Jennings (mej)

Still needs some debugging, but I've successfully eliminated all but 1
subprocess (the "ps" command).  Quite good given all the script does
so far.

Also added the beginnings of a test script for making sure the
individual functions work as advertised.
----------------------------------------------------------------------
Mon Apr 25 13:00:10 2011                        Michael Jennings (mej)

Final fixups for UID check.  Everything appears to be working well
now.
----------------------------------------------------------------------
Wed Apr 27 12:19:11 2011                        Michael Jennings (mej)

Added check to verify user processes descend from pbs_mom.

Added flexible regexp/glob match check.

Renamed utility functions to nhc_* so that only user-usable checks
start with check_*.

Added syslog function to save syslog messages until script
termination.
----------------------------------------------------------------------
Mon May  2 18:51:27 2011                        Michael Jennings (mej)

Added checks for CPU socket/core/thread counts and total/free
RAM/swap/memory.
----------------------------------------------------------------------
Tue May  3 19:07:02 2011                        Michael Jennings (mej)

Missed a file.
----------------------------------------------------------------------
Wed May  4 17:49:59 2011                        Michael Jennings (mej)

Minor cleanups to check_ps_kswapd().
----------------------------------------------------------------------
Fri May  6 13:12:24 2011                               Yong Qin (yqin)

Added check for Infiniband.
----------------------------------------------------------------------
Fri May  6 14:41:23 2011                               Yong Qin (yqin)

A minor bug fix.
----------------------------------------------------------------------
Tue May 10 01:26:14 2011                        Michael Jennings (mej)

Bump version.
----------------------------------------------------------------------
Tue May 10 13:01:16 2011                               Yong Qin (yqin)

Added checks for Myrinet and Ethernet. Minor bug fix.
----------------------------------------------------------------------
Thu May 12 08:12:48 2011                        Michael Jennings (mej)

Try alternate mechanism for IB port checks.
----------------------------------------------------------------------
Wed May 18 15:57:49 2011                        Michael Jennings (mej)

Fix parsing bug.  Due to bash not properly "escaping" expanded
variables inside ${VAR#...} constructs, config file lines must not
contain more than one occurance of "||" any more.

Direct output to /dev/null, then redirect if $LOGFILE is set.
----------------------------------------------------------------------
Tue May 24 15:33:25 2011                        Michael Jennings (mej)

Support older single-core, non-HT CPUs in /proc/cpuinfo.
----------------------------------------------------------------------
Thu Sep  1 16:16:38 2011                        Michael Jennings (mej)

Fixed status reporting and added NHC label to offline message.
----------------------------------------------------------------------
Mon Sep 19 17:13:11 2011                        Michael Jennings (mej)

Bump version to 1.1.

This release adds the ability to detect previously-set notes for nodes
and not overwrite them.

It will also clear notes and online nodes if all checks pass for a
node that had previously had check errors.  It will only do this for
nodes whose notes begin with "NHC" to avoid bringing nodes online
which were manually offlined.  Nodes marked offline which have no note
are not distinguished from down nodes and may be brought online if the
error(s) clear.
----------------------------------------------------------------------
Thu Oct 13 16:04:20 2011                        Michael Jennings (mej)

Output onlining/offlining of nodes to log (with timestamp).

Log failure of health check to logfile as well as syslog.
----------------------------------------------------------------------
Wed Jan 25 15:40:58 2012                        Michael Jennings (mej)

Convert to autoconf/automake for build.
----------------------------------------------------------------------
Tue Feb  7 11:11:12 2012                        Michael Jennings (mej)

Various fixes and release of 1.1.4.
----------------------------------------------------------------------
Tue Mar 13 11:22:45 2012                        Michael Jennings (mej)

Bump version.  More consistency/cleanups.
----------------------------------------------------------------------
Fri May  4 11:31:47 2012                        Michael Jennings (mej)

Remove debugging stuff for UIDs > 100.  I don't really use it anyway,
and some people may want to run as other users.

Convert node online/offline scripts to use variables and $PATH to
identify where the "pbsnodes" command is and what arguments it should
take.

Add an "eval" to the execution of the check so that shell variables
can be used or altered in config files.
----------------------------------------------------------------------
Wed May  9 12:12:54 2012                        Michael Jennings (mej)

Always use [[ ]] instead of [ ] (primarily for consistency).

Add customization of resource manager daemon match expression and
greater control over pbsnodes commands in online/offline helpers.
----------------------------------------------------------------------
Wed May  9 12:22:20 2012                        Michael Jennings (mej)

Fix a couple conditional expressions from the last commit.
----------------------------------------------------------------------
Tue May 15 09:05:23 2012                        Michael Jennings (mej)

Make sure nodes with no job files still work.
----------------------------------------------------------------------
Wed May 16 13:50:20 2012                        Michael Jennings (mej)

Fix bug pointed out by Ole Holm Nielsen <ole.h.nielsen@fysik.dtu.dk>
which caused the new "eval" of config file lines to barf on the
regular expression with parentheses in the sample config.  Going
forward, users will need to take care to escape shell metacharacters
appropriately in config files.
----------------------------------------------------------------------
Fri Jun 22 15:26:40 2012                        Michael Jennings (mej)

Add stubs for unit test and benchmarking scripts.

Convert main NHC driver script to use functions so that it can be
loaded without needing to be executed and to facilitate testing of
some of its functionality.
----------------------------------------------------------------------
Fri Jun 29 17:52:30 2012                        Michael Jennings (mej)

I smell unit tests!
----------------------------------------------------------------------
Mon Jul  2 16:33:54 2012                        Michael Jennings (mej)

Move unit test framework to separate file.  Now called "SHUT."

Override output functions to suppress normal NHC I/O and exception
handling.

Major refactoring of test framework to allow named tests and progress
output.

Added lots more unit tests for main nhc script.
----------------------------------------------------------------------
Mon Jul  2 17:17:38 2012                        Michael Jennings (mej)

Initial test files for each check script.
----------------------------------------------------------------------
Thu Aug 16 18:12:36 2012                        Michael Jennings (mej)

More work on unit tests:
 - Report number of tests skipped, if any.
 - Add tests for "common.nhc" module.
 - Add tests for "ww_fs.nhc" module.
 - Fix typos in external match checks.
----------------------------------------------------------------------
Fri Aug 24 15:29:17 2012                        Michael Jennings (mej)

Finished hardware unit tests.
----------------------------------------------------------------------
Mon Aug 27 14:52:27 2012                        Michael Jennings (mej)

Unit tests are finally done!  Should be 100% coverage on end-user
checks too, though I don't know of any "gcov" equivalents for
bash....  ;-)

TODO:  More checks!
----------------------------------------------------------------------
Mon Aug 27 16:39:47 2012                        Michael Jennings (mej)

Build fixes, alternative skip syntax, and unit test changes to allow
"make test" in the spec file.  Tested on RHEL4, 5, and 6 and in chroot
jails and VNFS images.
----------------------------------------------------------------------
Tue Sep  4 13:01:10 2012                        Michael Jennings (mej)

Initial support for the nVidia HealthMon tool for checking the status
of nVidia CUDA GPU devices.  More information can be found with the
Tesla Deployment Kit version 3 (currently in RC status).
----------------------------------------------------------------------
Tue Sep  4 17:50:00 2012                        Michael Jennings (mej)

Add check for blacklisted processes.
----------------------------------------------------------------------
Wed Sep  5 16:39:59 2012                        Michael Jennings (mej)

New checks for filesystem size/used/free limits based on "df" output.
Refactored check_fs_mount() to only read /proc/mounts once and
populate central array set (just like all the other modules).
Refactored unit tests accordingly.
----------------------------------------------------------------------
Thu Sep  6 09:57:41 2012                        Michael Jennings (mej)

Added support for detached mode.  Runs all checks in the background,
saves state to filesystem and checks it on the next run.
----------------------------------------------------------------------
Fri Sep  7 14:14:41 2012                        Michael Jennings (mej)

Added unit tests for new disk space checks.  Tweaked detached mode to
detach sooner.  Fixed some faulty logic.
----------------------------------------------------------------------
Fri Sep  7 14:52:49 2012                        Michael Jennings (mej)

A couple minor bugfixes/cleanups.  This is now officially 1.2 beta.
----------------------------------------------------------------------
Wed Oct  3 16:09:05 2012                        Michael Jennings (mej)

Finalized 1.2 release.
----------------------------------------------------------------------
Thu Oct 25 18:29:13 2012                        Michael Jennings (mej)

Add support for NHC log rotation.
----------------------------------------------------------------------
Fri Oct 26 17:17:51 2012                        Michael Jennings (mej)

Add support and unit tests for an "authorized users" whitelist.
----------------------------------------------------------------------
Mon Oct 29 17:58:30 2012                        Michael Jennings (mej)

By default, don't touch nodes that are offline but have no note.  Not
every site uses notes as religiously as we do, nor wants to!
----------------------------------------------------------------------
Tue Oct 30 14:17:40 2012                        Michael Jennings (mej)

Fix job file location fallback handling, and look up userids for
processes where only UID is given as this may indicate a userid >8
characters rather than an unknown user.
----------------------------------------------------------------------
Tue Nov  6 13:28:13 2012                        Michael Jennings (mej)

Add nhc.cron script contributed by Ole Holm Nielsen
<Ole.H.Nielsen@fysik.dtu.dk> to help minimize excessive messages from
NHC when executed via cron.
----------------------------------------------------------------------
Wed Nov  7 14:23:44 2012                        Michael Jennings (mej)

Finalized 1.2.1 release.
----------------------------------------------------------------------
Wed Nov  7 15:24:26 2012                        Michael Jennings (mej)

Found a bug.  Re-releasing 1.2.1.
----------------------------------------------------------------------
Tue Nov 27 16:56:25 2012                        Michael Jennings (mej)

Despite being specified by POSIX, apparently bash's built-in "kill"
command doesn't support signaling process groups.  The watchdog timer
has been rewritten to just kill the nhc script itself.  Unit tests for
the watchdog timer were also added.
----------------------------------------------------------------------
Thu Nov 29 17:57:23 2012                        Michael Jennings (mej)

New check:  check_hw_mcelog

This check will run "mcelog --client" by default and fail if any
output is received.  If the mcelog daemon is not running, this will be
noted in the log file and syslog, but the check will pass.
----------------------------------------------------------------------
Mon Dec 17 11:03:13 2012                        Michael Jennings (mej)

Reset IFS in die() handler and add quotes to traps.  This should
prevent newlines being added to failure messages when certain
subcommands cause timeouts.
----------------------------------------------------------------------
Wed Jan 16 12:14:23 2013                        Michael Jennings (mej)

Patch from John Hanks <john.hanks@usu.edu> for basic pdsh-style node
range support.  Node ranges are now permitted and must be surrounded
by braces (e.g., "{n00[00-99].cluster}").  Multiple ranges may be
specified by separating them with commas (e.g., "{node[0-5],node8}"),
but commas may NOT be used inside the brackets ("{node[0-5,8]}").

This feature should be considered experimental at this point.  Please
report any mismatches.
----------------------------------------------------------------------
Wed Jan 16 14:24:02 2013                        Michael Jennings (mej)

I've added fallback support for LDAP, NIS, etc. via "getent" based on
a suggestion and proposed patch from John Hanks <john.hanks@usu.edu>.

For users using any solution for passwd resolution other than local
/etc/passwd, there are now 2 possible alternatives.

One, you can override the use of /etc/passwd as the source of passwd
data.  You can reference any file on the filesystem that's locally
accessible by setting PASSWD_DATA_SRC to the filename you want NHC to
use.  This could be used to read from a cache file generated by, e.g.,
"ypcat passwd" or a similar command.  This should also work with
process substitution, so you can specify something like
PASSWD_DATA_SRC='<(ypcat passwd)' instead of using a file.  (Note,
though, that this will block.  Insert associated caveats here.)

Two, if reading from PASSWD_DATA_SRC fails, NHC will use "getent" on
an as-needed basis to populate its internal data structures.  Note
that this will mean 1 execution of getent PER missing userid or UID.
Once a particular passwd entry has been retrieved, the information
will be cached and used throughout that NHC run.  Subsequent
executions of NHC will have to execute getent again for each missing
entry.

I do not have such a system, so these changes are largely untested.
Feedback is humbly requested.  :-)
----------------------------------------------------------------------
Tue Jan 22 13:32:47 2013                        Michael Jennings (mej)

Bare parenthesized regular expressions are incompatible with bash 3 in
RHEL4 and RHEL5.  Quoted regular expressions are incompatible with
bash 4 in RHEL6.  Why?  Because someone decided in bash 4 to allow
parts of regular expressions to be quoted and matched as strings
instead.  Why?  Beats me.

Way to go, bash developers.  That's the kind of incompatibility I'd
expect from Python. :-/

Thankfully, storing the regular expressions in variables works just
fine, so we'll do that.
----------------------------------------------------------------------
Tue Jan 22 16:56:03 2013                        Michael Jennings (mej)

1.2.2 has been released.
----------------------------------------------------------------------
Tue Jan 22 16:58:44 2013                        Michael Jennings (mej)

Setting up the tree for 1.2.3 development.
----------------------------------------------------------------------
Mon Feb  4 11:20:34 2013                        Michael Jennings (mej)

Fix from Ole Holm Nielsen <Ole.H.Nielsen@fysik.dtu.dk> for his
nhc.cron script to place transient files in /var/lib/nhc rather than
/tmp to avoid potential symlink issues.  Based on suggestions on the
hpc-monitoring Google Group from Stuart Barkley <google@4gh.net> and
Jesse Becker <hawson@gmail.com>.
----------------------------------------------------------------------
Mon Feb 11 14:22:29 2013                        Michael Jennings (mej)

Added variable $DETACHED_MODE_FAIL_NODATA which can be set to 1 to
cause detached mode to return failure by default, instead of success,
when no results file is present from a previous run.

Also made sure that results files which are older than /proc,
indicating a reboot since the last run, are considered stale and
removed.
----------------------------------------------------------------------
Mon Mar 11 14:47:19 2013                        Michael Jennings (mej)

Add DMI data gatherer and corresponding checks.  Definitely not
speedy, but there's an awful lot of very valuable information that can
be gleaned from it.  And, as always, once you take the initial hit,
you can write as many DMI checks as you like with minimal added
overhead.

Still need to write the unit tests for the new checks.
----------------------------------------------------------------------
Mon Mar 11 17:43:40 2013                        Michael Jennings (mej)

Added unit tests and auto-fu for DMI stuff.
----------------------------------------------------------------------
Tue Mar 12 16:39:17 2013                        Michael Jennings (mej)

Merged and tweaked patch from Aleksey Senin <aleksey@senin.name> to
support Infiniband device names in check_hw_ib.
----------------------------------------------------------------------
Tue Mar 12 16:39:19 2013                        Michael Jennings (mej)

Add .gitignore file.
----------------------------------------------------------------------
Wed Mar 13 15:29:34 2013                        Michael Jennings (mej)

Auto-detect resource manager on startup for later use.
----------------------------------------------------------------------
Wed Mar 13 15:29:37 2013                        Michael Jennings (mej)

Move nhc_fs_[un]parse_size() to common for use in other checks.
----------------------------------------------------------------------
Wed Mar 13 15:29:39 2013                        Michael Jennings (mej)

Convert existing checks to support multiple resource managers via
$NHC_RM.
----------------------------------------------------------------------
Thu Mar 14 21:08:53 2013                        Michael Jennings (mej)

Merged in contributions from Dustin Rice <dustin@alaska.edu> to add
SLURM support to the node online/offline scripts.  Also added SGE
support (sort of, since it's not actually necessary).
----------------------------------------------------------------------
Thu Mar 14 21:14:11 2013                        Michael Jennings (mej)

Fix "make test" by assuming PBS for testing purposes.
----------------------------------------------------------------------
Fri Mar 15 16:38:28 2013                        Michael Jennings (mej)

Bump to version 1.3.
----------------------------------------------------------------------
Mon Mar 18 15:53:15 2013                        Michael Jennings (mej)

Simulate svnversion with git to fix RPM release numbers.
----------------------------------------------------------------------
Wed Mar 20 11:58:08 2013                        Michael Jennings (mej)

Fix issues with SLURM support found during testing.  Since $(...) does
not treat quotes as metacharacters, we'll need to hard-code the sinfo
arguments for obtaining the node status listing.  If someone has a
better way, I'm all ears (eyes?)!

As far as I can tell, SLURM support is now fully functional.
----------------------------------------------------------------------
Wed Mar 20 12:03:37 2013                        Michael Jennings (mej)

Oops; missed a few spots.
----------------------------------------------------------------------
Wed Mar 20 13:37:30 2013                        Michael Jennings (mej)

Preliminary work to integrate Grid Engine support into NHC.  This
currently still requires an external input loop, but I am planning to
do that in nhc as well.
----------------------------------------------------------------------
Wed Mar 20 14:41:55 2013                        Michael Jennings (mej)

Finish merging the Grid Engine wrapper into NHC.  The nhc script is
now directly callable as an SGE/UGE/*GE "load sensor."
----------------------------------------------------------------------
Thu Mar 21 11:48:40 2013                        Michael Jennings (mej)

Use "DRAIN" state instead of "DOWN" as the latter will terminate
running jobs on the node.  We don't want that.  :-)
----------------------------------------------------------------------
Thu Mar 21 11:48:42 2013                        Michael Jennings (mej)

More tweaks to handling of SLURM node states.
----------------------------------------------------------------------
Thu Mar 21 14:42:56 2013                        Michael Jennings (mej)

Another node state I didn't know about.
----------------------------------------------------------------------
Fri Mar 22 12:35:26 2013                        Michael Jennings (mej)

Add online/offline support for IBM Platform LSF.

NOTE:  This is based entirely on documentation and has not been tested
at all.  If you are an LSF user and are willing to help test, please
contact me!  I haven't found a way to have LSF run NHC on the nodes
yet, so for now it would need to run out of cron or similar.
----------------------------------------------------------------------
Mon Apr 01 16:27:08 2013                        Michael Jennings (mej)

Add support for both command line options and arbitrary environment
variable setting on the command line.  A few limited options are
available; see "nhc -h" for details.  For example, you can turn on
debugging and set an alternate config file location using:

    # nhc -d -c /etc/nhc/alternate.conf

All other configuration settings can be manipulated on the command
line using an env-like VAR=value syntax.  So for example, if you
wanted to disable marking online/offline of nodes and set the maximum
system UID to 499, you can now do:

    # nhc MARK_OFFLINE=0 MAX_SYS_UID=499

Note that these parameters WILL be overridden by the config file if
they're set there!
----------------------------------------------------------------------
Mon Apr 01 18:09:03 2013                        Michael Jennings (mej)

Added 3 new checks:  check_fs_inodes(), check_fs_ifree(), and
check_fs_iused().  The perform the same tasks as their
check_fs_{size,used,free}() counterparts except using inode count
instead of byte count.
----------------------------------------------------------------------
Mon Apr 01 18:09:06 2013                        Michael Jennings (mej)

Add support for byte suffixes to all check_hw_{physmem,swap,mem}{,_free}() tests.
----------------------------------------------------------------------
Tue Apr 02 17:09:16 2013                        Michael Jennings (mej)

Added new check:  check_file_contents() will scan through a file
looking for matches to one or more patterns (regular expressions or
globs).  The check will succeed iff all patterns are successfully
matched against individual lines in the file.

Some real-world usage examples:
    check_file_contents /etc/passwd '/^root:x:0:0:[^:]*:/root:/bin/[a-z]*sh$/'
    check_file_contents /etc/passwd 'adminusr:*' 'slurm:*' 'sshd:*'
    check_file_contents /var/spool/torque/mom_priv/config '$pbsserver master'
    check_file_contents /etc/hosts '10.0.0.10*master'
    check_file_contents /proc/cgroups 'cpuset*1'
----------------------------------------------------------------------
Wed Apr 03 15:10:12 2013                        Michael Jennings (mej)

More minor issues found during testing.
----------------------------------------------------------------------
Wed Apr 03 15:10:14 2013                        Michael Jennings (mej)

I'll take "Things Missed by Unit Tests for $100, Alex."
----------------------------------------------------------------------
Wed Apr 03 15:10:17 2013                        Michael Jennings (mej)

Note to self:  Write more unit tests based on potential stupid
mistakes an admin could make rather than just real-world usage.
----------------------------------------------------------------------
