High Throughput Computing using Condor

Condor Troubleshooting

Prerequisites

Condor provides a very powerful way of running large numbers of jobs concurrently, but when jobs fail to run to completion it can be difficult to find the source of the problem. When jobs repeatedly fail, this can lead to large amounts of processor time being consumed without any progress being made. This means that users wait longer for their results and (from the "Green IT" viewpoint) electricity is wasted unnecessarily.

To track down the cause of the problem, a record of the job's progress needs to be stored to a log file which can be examined after a problem has been identified. Since Condor fills its own log files extremely quickly, the required information has usually been overwritten in these system files by the time any debugging is attempted. In order to maintain a record of the required information, users must therefore keep their own log files.

For batches (or more accurately clusters) of jobs, it is useful to have a separate log file for each individual job (called a process in Condor terminology). This can be achieved by using the log attribute in the job submission file; for example:

log = trace$(PROCESS).log 

Using the simplified job submission process, the same effect can be achieved by using the indexed_log attribute e.g.
indexed_log = trace.log
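
As a minimal sketch, a conventional job submission file for a cluster of ten jobs might combine this with the usual attributes as follows (the executable name, myprog.exe, is just a placeholder):

universe   = vanilla
executable = myprog.exe
log        = trace$(PROCESS).log
queue 10

Condor then records the progress of process 0 in trace0.log, process 1 in trace1.log and so on up to trace9.log.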

It is also useful to capture any output that would normally be written to the screen as the job proceeds. Although it may not be apparent to the casual user, this output is divided into two parts called streams. One stream, called the standard output (stdout), by convention, contains progress messages whereas the other, called the standard error (stderr), contains any error messages.

Under Condor, these streams can be redirected to different files. To capture the standard output, the output attribute is used in the job submission file e.g.

output = output$(PROCESS).out
and for the standard error:
error = errors$(PROCESS).err
The corresponding attributes for simplified job submission files are:
indexed_stdout = output.out
indexed_stderr = errors.err
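
Taken together, the relevant fragment of a conventional job submission file might look something like this (the filenames are purely illustrative):

log    = trace$(PROCESS).log
output = output$(PROCESS).out
error  = errors$(PROCESS).err

For process 0 this produces trace0.log, output0.out and errors0.err; for process 1, trace1.log, output1.out and errors1.err; and so on.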

Errors which cause jobs to fail tend to fall into two categories. First there are systematic errors which cause all jobs to fail. These can arise when the required input files are not transferred with the job, or from errors in the input files themselves.


MATLAB Aside

In the case of MATLAB jobs, it is possible for an M-file that is syntactically incorrect to compile and still produce a standalone executable. The error can then remain undetected until the standalone executable is run on the Condor pool. To avoid this, it is always best to check that the M-file is correct before compiling the standalone executable (even if only minor changes have been made to it). The full MATLAB package is available on the Condor server and can be used to check M-files.
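
As a rough illustration, assuming an M-file called sim.m that takes no arguments (the filename is just a placeholder), it could first be exercised in a non-interactive MATLAB session on the server and only then compiled into a standalone executable with the MATLAB compiler, mcc:

$ matlab -nodisplay -r "sim; exit"
$ mcc -m sim.m

If the first command reports syntax or run-time errors, these should be fixed before running mcc.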

The second source of problems is randomly occurring run-time errors which affect only a small proportion of the jobs in any given cluster. There are many possible causes of these, including jobs crashing because of memory allocation faults and jobs being interrupted by logged-in PC users abruptly shutting off the power to the PC whilst a job is running. It is often the case that these jobs will complete if re-run.

To identify which jobs are failing, it is extremely useful to tell Condor which output files are expected to be produced. For normal job submission files this can be achieved using the transfer_output_files attribute e.g.

transfer_output_files = output_data$(PROCESS).dat, other_data$(PROCESS).txt
For simplified job submission files, the indexed_output_files attribute is used e.g.
indexed_output_files = output_data.dat, other_data.txt
The reason for this is that Condor will place any jobs for which it cannot retrieve all of the output files into a held state (indicated by an 'H' in the condor_q output). In the held state, jobs are still present in the Condor queue but will not be run (even if there are sufficient resources available for them) until released. To identify which jobs are held, use the command:
$ condor_q -held <your_username>
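
If more detail is needed on why a particular job was held, its HoldReason attribute can be inspected, for example (the job id here is taken from the example logs below and should be replaced with your own):

$ condor_q -l 513136.0 | grep HoldReason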

The default Condor behaviour (that is, when no output files are specified) is to return all of the files that are either modified or created by the job. This can be useful in some contexts (see below) but, since Condor does not know which output files the user expects, it cannot flag an error if any are missing. This means that the user may need to sift through all of the output files to ensure that they have all been created properly. For large clusters of jobs there can easily be thousands of output files, making this a rather painstaking operation. On the other hand, by explicitly stating which output files are needed in the job submission file, it becomes much easier to identify those jobs which have failed (they are just those that are held).


Debugging failed jobs

If all of the jobs in a given cluster have failed and become held then the likely cause is a systematic error common to all jobs. By examining the log files it should become apparent just how long each job has run for before failing. In this example, the job has failed almost immediately after starting:

000 (513136.000.000) 02/09 15:28:44 Job submitted from host: \
<138.253.100.27:43424>
...
001 (513136.000.000) 02/09 15:31:46 Job executing on host: \
<138.253.237.80:49159>
...
007 (513136.000.000) 02/09 15:31:46 Shadow exception!
        Error from slot1@BLTC-30.livad.liv.ac.uk: STARTER at 138.253.237.80 \
failed to send file(s) to <138.253.100.27:65180>: error reading from \
c:\tmp\dir_2628\output.mat: (errno 2) No such file or directory; \
SHADOW failed to receive file(s) from <138.253.237.80:63263>
        44  -  Run Bytes Sent By Job
        1987  -  Run Bytes Received By Job
...
012 (513136.000.000) 02/09 15:31:46 Job was held.
        Error from slot1@BLTC-30.livad.liv.ac.uk: STARTER at 138.253.237.80 \
failed to send file(s) to <138.253.100.27:65180>: error reading from \
c:\tmp\dir_2628\output.mat: (errno 2) No such file or directory; \
SHADOW failed to receive file(s) from <138.253.237.80:63263>
        Code 13 Subcode 2
(The backslashes, '\', indicate where lines have been broken to fit on the page and are not present in the log file. Notice that the job began executing at 15:31:46 and failed within one second.)

The most probable cause is that either the required input files have not been transferred with the job or that the input files themselves contain errors (e.g. the wrong filename for some auxiliary file or a syntax error in an M-file in the case of MATLAB). By examining the standard output and standard error files, the actual error should become clearer. If this still does not offer any clues then it may be worth resubmitting the jobs but without using the output files attribute. In this case all of the files that are either created or modified by the job will be returned. These will hopefully provide more information on the source of the problem.
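
With a conventional submission file, one simple way to do this is to comment out the output file list (Condor treats lines beginning with '#' as comments) and resubmit, e.g.:

# transfer_output_files = output_data$(PROCESS).dat, other_data$(PROCESS).txt

$ condor_submit myjob.sub

(Here myjob.sub is just a placeholder for the name of the submission file.) Once the source of the problem has been found, the line should be reinstated.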

If the jobs have run for an appreciable time then the log file may look more like this:

000 (513139.000.000) 02/09 15:58:15 Job submitted from host: \
<138.253.100.27:43424>
...
001 (513139.000.000) 02/09 15:59:39 Job executing on host: \
<138.253.234.17:59709>
...
007 (513139.000.000) 02/09 16:01:39 Shadow exception!
        Error from slot2@CDTC-07.livad.liv.ac.uk: STARTER at 138.253.234.17 \
failed to send file(s) to <138.253.100.27:43030>: error reading from \
c:\tmp\dir_5020\prod.mat: (errno 2) No such file or directory; \
SHADOW failed to receive file(s) from <138.253.234.17:53240>
        44  -  Run Bytes Sent By Job
        97628  -  Run Bytes Received By Job
...
012 (513139.000.000) 02/09 16:01:39 Job was held.
        Error from slot2@CDTC-07.livad.liv.ac.uk: STARTER at 138.253.234.17 \
failed to send file(s) to <138.253.100.27:43030>: error reading from \
c:\tmp\dir_5020\prod.mat: (errno 2) No such file or directory; \
SHADOW failed to receive file(s) from <138.253.234.17:53240>
        Code 13 Subcode 2
(Notice that the job began executing at 15:59:39 and ran for around two minutes before failing at 16:01:39.)

Here it is likely that the job has not saved the required output file(s) or that these do not correspond to the ones specified in the job submission file. Again it may be worth resubmitting the jobs without specifying the output files so that all of the files created and/or modified by the job are returned. A check can then be made to see if these files correspond to the outputs specified in the job submission file.

If only a small number of jobs fail then the easiest course of action may just be to re-run them by using:

$ condor_release -all
Although this is clearly a very simplistic approach, it is often the most effective. When thousands of jobs are run under Condor, there will always be a chance that some will fail for reasons that are extremely difficult to predict (e.g. transient network problems) and even more difficult to analyse. Here a condor_release may quickly solve the problem.
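
Note that condor_release -all releases every held job that you own. To release just one cluster, or a single process within it, the job id shown by condor_q can be given instead, e.g. (the ids here are taken from the example logs and should be replaced with your own):

$ condor_release 513139

$ condor_release 513139.0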

If the same jobs repeatedly fail, then this may be down to a bug in the application code itself. The log files should provide some information on how long the application code ran for before crashing. Unfortunately the only really effective way of diagnosing problems of this type is to run the application on a local PC with exactly the same input data.

Some points to consider are:

Since randomly occurring errors will often disappear when jobs are re-run, Condor has been configured so that jobs will automatically be released after twenty minutes in the held state. This can be repeated up to six times for each job. If the log file indicates that a job has repeatedly been held and then released, then the cause is probably one of the errors described above. The following log file excerpt exemplifies this:

000 (513139.000.000) 02/09 15:58:15 Job submitted from host: \
<138.253.100.27:43424>
...
001 (513139.000.000) 02/09 15:59:39 Job executing on host: \
<138.253.234.17:59709>
...
007 (513139.000.000) 02/09 16:01:39 Shadow exception!
	Error from slot2@CDTC-07.livad.liv.ac.uk: STARTER at 138.253.234.17 \
failed to send file(s) to <138.253.100.27:43030>: error reading from \
c:\tmp\dir_5020\prod.mat: (errno 2) No such file or directory; SHADOW \
failed to receive file(s) from <138.253.234.17:53240>
	44  -  Run Bytes Sent By Job
	97628  -  Run Bytes Received By Job
...
012 (513139.000.000) 02/09 16:01:39 Job was held.
	Error from slot2@CDTC-07.livad.liv.ac.uk: STARTER at 138.253.234.17 \
failed to send file(s) to <138.253.100.27:43030>: error reading from \
c:\tmp\dir_5020\prod.mat: (errno 2) No such file or directory; SHADOW \
failed to receive file(s) from <138.253.234.17:53240>
	Code 13 Subcode 2
...
013 (513139.000.000) 02/09 16:21:55 Job was released.
	The system macro SYSTEM_PERIODIC_RELEASE expression \
'( ( JobRunCount <= 6 ) && ( CurrentTime - EnteredCurrentStatus > 1200 ) && \
( HoldReasonCode != 1 ) )' evaluated to TRUE
...
001 (513139.000.000) 02/09 16:22:00 Job executing on host: \
<138.253.233.110:52445>
...
006 (513139.000.000) 02/09 16:22:09 Image size of job updated: 6792
...
007 (513139.000.000) 02/09 16:24:14 Shadow exception!
	Error from slot1@ETC1-10.livad.liv.ac.uk: STARTER at 138.253.233.110 \
failed to send file(s) to <138.253.100.27:50484>: error reading from \
c:\tmp\dir_764\prod.mat: (errno 2) No such file or directory; SHADOW \
failed to receive file(s) from <138.253.233.110:55335>
	44  -  Run Bytes Sent By Job
	97628  -  Run Bytes Received By Job
...
012 (513139.000.000) 02/09 16:24:16 Job was held.
	Error from slot1@ETC1-10.livad.liv.ac.uk: STARTER at \
138.253.233.110 failed to send file(s) to <138.253.100.27:50484>: \
error reading from c:\tmp\dir_764\prod.mat: (errno 2) No such file or \
directory; SHADOW failed to receive file(s) from <138.253.233.110:55335>
	Code 13 Subcode 2
...
013 (513139.000.000) 02/09 16:45:00 Job was released.
	The system macro SYSTEM_PERIODIC_RELEASE expression \
'( ( JobRunCount <= 6 ) && ( CurrentTime - EnteredCurrentStatus > 1200 ) && \
( HoldReasonCode != 1 ) )' evaluated to TRUE
...
001 (513139.000.000) 02/09 16:50:23 Job executing on host: \
<138.253.231.27:54904>
...
007 (513139.000.000) 02/09 16:52:23 Shadow exception!
	Error from slot1@DLC1-17.livad.liv.ac.uk: STARTER at \
138.253.231.27 failed to send file(s) to <138.253.100.27:57652>: \
error reading from c:\tmp\dir_3584\prod.mat: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <138.253.231.27:56734>
	44  -  Run Bytes Sent By Job
	97628  -  Run Bytes Received By Job
...
012 (513139.000.000) 02/09 16:52:23 Job was held.
	Error from slot1@DLC1-17.livad.liv.ac.uk: STARTER at 138.253.231.27 \
failed to send file(s) to <138.253.100.27:57652>: error reading from \
c:\tmp\dir_3584\prod.mat: (errno 2) No such file or directory; SHADOW \
failed to receive file(s) from <138.253.231.27:56734>
	Code 13 Subcode 2
...
(Note that the job was automatically released by Condor at 16:21:55 and again around twenty minutes later at 16:45:00. Condor will release held jobs up to six times at twenty-minute intervals.)


Dealing with jobs that run "forever"

In some cases, jobs may appear to run indefinitely, or at least run for far longer than expected (until ultimately evicted from an execute host for whatever reason). The most likely cause is that the application itself is waiting for an event that never occurs, e.g. a mouse click on a pop-up dialogue window (these do not appear on the classroom machines even if someone is logged in).


MATLAB Aside

MATLAB has a tendency to create pop-up windows for error messages rather than writing them to the standard error stream. A common reason is that it cannot locate the run-time libraries. If all of the jobs in a cluster of MATLAB jobs seem to run indefinitely then it is worth checking that the .manifest file is present and has been specified in the transfer_input_files list (this is taken care of automatically when using the simplified job submission process). Another possible reason is that the main program is waiting for a pop-up figure window to be closed.
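
For a conventional submission file, this amounts to checking that the .manifest file appears in the input file list alongside the other inputs, along the lines of (all of the filenames here are placeholders, and the exact name of the .manifest file will depend on how the executable was compiled):

transfer_input_files = sim.manifest, input_data$(PROCESS).mat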

If only a small number of jobs seem to be running indefinitely, these should be re-run to see if the problem recurs. This can be achieved by holding and then later releasing the job e.g.

$ condor_hold <job_id>

$ condor_release <job_id>
If the same jobs seem to run indefinitely, then there is probably a bug in the application itself or an error in the data supplied to it.


Summary

The following list summarises some of the points to consider when running large numbers of jobs under Condor:
  1. For MATLAB jobs: Check that the M-file does work correctly before compiling a standalone executable. If you are not using the simplified job submission process, check that the correct .manifest file is present and has been included in the input file list.

  2. Check that all of the input files are present and that they correspond to those listed in the job submission file. Do all of them contain the correct data?

  3. Submit a small cluster of test jobs (say on the order of 10 jobs) before carrying out a large scale "production" run.

  4. If all of the jobs fail then temporarily remove/comment out the output file list in the job submission file. Try resubmitting the jobs - are all of the expected output files returned? Check the log files and the standard output and error files. When things are working, reinstate the output file list.

  5. When satisfied that the basic process is working submit the large cluster of jobs.

  6. If a few jobs fail, release them so that they run again (use condor_release).

  7. If jobs repeatedly fail then check the log files and standard output/error files. It may be useful to remove the output file list so that all of the output files are returned.

  8. If some jobs seem to be running indefinitely (or much longer than expected) try re-running them by holding and then releasing them (condor_hold followed by condor_release). If some jobs repeatedly fail then try running them on a local PC with exactly the same input data.