Mpiexec Frequently Asked Questions (FAQ)

Copyright (C) Pete Wyckoff, 2000-8.

Here are some notes collected from solving various installation and usage
problems with mpiexec, organized into a FAQ format.

1.  Does mpiexec work with OpenPBS 2.4?

    There is no OpenPBS 2.4.  Veridian changed the code in 2.3.16 so that it
    claims to be "OpenPBS_2.4".  Type "l s" at a qmgr prompt to see this.
    The code is still 2.3.16 in spirit, since it is hardly different from
    2.3.15, or from the last couple of years of earlier versions for that
    matter.

2.  The configure script can't find my PBS library, but I gave it the
    correct path.

    You probably need to compile mpiexec using whatever compiler you used to
    build PBS, otherwise some symbols may not be defined.  This will show up
    as configure complaining "PBS library not found ...".  Check config.log
    to verify whether the library really was not found, or whether you chose
    a different compiler.  Override the compiler choice at configure time by
    setting the environment variables CC and CFLAGS.

    To figure out what's wrong, you can run "bash -x ./configure ..." to see
    everything the script does.

3.  Mpiexec exits immediately with the message
    "mpiexec: Error: get_hosts: tm_init: tm: system error".

    This is the very first line in the code where mpiexec attempts to talk
    to the local PBS mom.  Lots of things can go wrong so that PBS will not
    let that happen.

    One problem could be that name resolution is not working correctly.  You
    need to have entries in /etc/hosts (or a working DNS resolver) for both
    localhost and for your PBS server, like this:

        127.0.0.1     localhost
        10.0.0.254    front-end fe  # pbs server

    Other variations might work too.  On the server, you probably need hosts
    entries for all the other nodes, too, but I suspect you'd notice
    something else broken before mpiexec.

    Don't forget to restart pbs_mom or pbs_server as appropriate after
    changing a system configuration file like /etc/hosts.

4.  Are there any debugging tools to figure out why the entire mess does not work?
    Especially this confusing "system error" message?

    There are lots of bits that must cooperate to run a parallel job:  PBS
    server, PBS mother superior, other PBS moms, mpiexec, mpich library, and
    your application code.  It's tough to figure out where the fault lies
    when something fails.

    PBS problems are frequently logged.  On the mother superior node (the
    compute node which holds process #0 of your parallel job), see the file

        /var/spool/pbs/mom_logs/20021030

    or whatever the date is today.  On the PBS server machine, you'll find
    log messages in

        /var/spool/pbs/server_logs/20021030

    If you installed PBS into a different location you'll have to change the
    path prefix, of course.

    The "big hammer" of debugging tools here is strace.  If mpiexec
    complains when talking to the PBS mom, run mpiexec under strace and
    watch what it's doing right before it prints out the error message:

        strace -vfF -s 400 -o /tmp/strace.mpiexec.out mpiexec myjob

    Look through the output file for the error message, then back up a few
    lines and try to guess what went wrong.

    If the mpiexec trace looks harmless, maybe the PBS mom is causing the
    problem.  As root, find the pid of the pbs_mom on the node, then attach
    to it with strace in a different terminal session:

        strace -vfF -s 400 -o /tmp/strace.mom.out -p <pid>

    then start your job and watch what happens.

5.  When I do "mpiexec