High Throughput Computing using Condor

Frequently Asked Questions (FAQ)



How do I ensure that different sequences of random numbers are generated by different jobs ?

How do I create a standalone executable from multiple M-files ?

How many machines are there in the ARC Condor pool ?

Why are my jobs sitting idle while other users' jobs are running ?

How many jobs can I submit in one go ?

What is the longest time that an individual job can run for ?

How do I remove my jobs from the queue ?

My jobs all seem to go into the Held ('H') state - why is this ?

I would like to run jobs that may take days or weeks to complete - is this possible ?

Why are some Condor commands taking ages to work (e.g. condor_q, condor_rm) ?

Is it possible to run jobs over the summer vacation - there don't appear to be any machines available ?

How do I get some of my jobs to run before others ?

How much space can I use up on the /condor_data filesystem ?

I'm trying to run Monte Carlo based simulations but I get the same results for each job - why is that ?

How much memory can Condor jobs use ?

Why do my jobs stop running at midnight every night ?

Can Condor jobs run on a PC while someone is logged in to it ?

Why aren't my checkpointing jobs returning any checkpoint files when jobs are evicted ?



How do I ensure that different sequences of random numbers are generated by different jobs ?

The key to getting Condor jobs to generate (approximately) independent sequences of numbers is to use a different seed value for each job. The seed values can be generated via another random number generator and stored one value per file. These files can then be used as indexed input files for the jobs in a cluster so that each job receives a different seed value to initialise its random number generator with.
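As a minimal sketch in MATLAB (assuming a reasonably recent MATLAB release and a per-job input file named seed.txt - both the file name and the transfer mechanism depend on your submission setup), each job could initialise its generator like this:

seed = load('seed.txt');   % the indexed input file holds a single integer
rng(seed, 'twister');      % seed the random number generator with it
% ... Monte Carlo / simulation code using rand, randn etc. as normal ...

Run twice with the same seed file, the simulation reproduces its results exactly, while jobs given different seed files produce different sequences.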

More information on using random numbers with Condor can be found in this excellent detailed article by Mike Croucher of Manchester University:

Parallel Random Numbers in MATLAB.



How do I create a standalone executable from multiple M-files ?

Place the "main" M-file in a directory on the Condor server and create another directory called dependencies below it. Then place all of the other M-files (i.e. the ones containing functions used by the main M-file) in the dependencies directory. Be careful not to include any other files in the dependencies directory or these will be "compiled-in" as well. Once this is in place run:

$ matlab_build <MyMainMfile> 
in the directory containing the main M-file.

You can also build a standalone executable on your PC using the same method. To compile the code use the mcc command to run the MATLAB compiler:

>> mcc -mv <MyMainMfile> -a dependencies
An example of this can be found here.
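By way of illustration (the file names here are purely hypothetical), a suitable layout would look something like this when listed from the directory containing the main M-file:

$ ls -R
.:
dependencies  mysim.m

./dependencies:
integrate.m  plotresults.m

Running matlab_build mysim (or the mcc command above) from that directory would then compile mysim.m together with the two helper functions.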


How many machines are there in the ARC Condor pool ?

This varies depending on the time of day, day of the week and time of year (e.g. during term time or the vacations) but the absolute maximum number of machines running 64-bit Windows is around 600. This equates to around 1200 job slots since each PC can run at least two Condor jobs at once (one on each core). To get an idea of how the size of the pool fluctuates, check out the recent Condor machine statistics.


Why are my jobs sitting idle while other users' jobs are running ?

The most obvious answer is that there are no Unclaimed (i.e. free) job slots available. To check this, run the command:

$ condor_status -totals
and look for the machines with the X86_64 architecture. There may be a small number of Unclaimed slots corresponding to machines that have recently gone into hibernation and which are no longer available.

If there do appear to be job slots available, the next thing to do is to check that your job's requirements are specified correctly. This will be taken care of automatically if you are using the simplified job submission tools, but if you have your own Condor job submission files, check that they have the correct requirements attribute, namely:

requirements = ( Arch == "X86_64" ) && ( OpSys == "WINNT61" )
If Condor always seems to be running other users' jobs in preference to your own, then it could be that your priority is too low. Condor employs a "fair share" scheduling policy which aims to balance usage between different users. If you have run a lot of jobs recently, Condor will have reduced your priority to allow other users' jobs to run in preference to yours. You can see where you are in the "pecking order" using the command:
$ condor_userprio
Lower figures imply that jobs are more likely to run.

How many jobs can I submit in one go ?

There are no limits placed on the number of jobs which can be submitted in a single batch (i.e. cluster) but please bear in mind that submitting (and removing) large numbers of jobs can place a high load on the Condor scheduler which may degrade the performance for other users. We have found that Condor can cope with clusters of 50,000 jobs but batches of this size should not be submitted regularly unless absolutely necessary.


What is the longest time that an individual job can run for ?

The absolute maximum run time for a single (non-checkpointing) job is 24 hours, since all of the Condor pool PCs are rebooted nightly at midnight. It is highly likely that jobs will be evicted by local use of the PCs during the day, so a full 24-hour run is extremely difficult to achieve in practice. Bearing this in mind, a more realistic limit is 8-12 hours.

The optimum run time is around 15-30 minutes but jobs of a few hours duration should work fine. Longer running jobs can waste large amounts of CPU time and hence electricity. You can check the efficiency of your Condor jobs from the log file analysis statistics. If you need to run jobs for longer than 12 hours then some form of checkpointing mechanism will be needed (see section on Support for Long Running Jobs).


How do I remove my jobs from the queue ?

The jobs in a single cluster can be removed from the queue with the command:

$ condor_rm <cluster_id>
where <cluster_id> is the job ID for a given cluster (i.e. the part before the decimal point). To remove all of your jobs use
$ condor_rm -all
This should be used with care as Condor will not ask you to confirm whether you want to remove the jobs - it will just go straight ahead and do it.

When removing jobs, Condor will attempt to close them down "cleanly", which may take a significant time if there are many jobs in the queue. To speed things up, you can use the -f (force) option e.g.

$ condor_rm -f -all
This can place a significant load on the Condor scheduler so please use it sparingly. If your jobs resolutely refuse to go away, please contact the Condor administrator for assistance.


My jobs all seem to go into the Held ('H') state - why is this ?

The Held ('H') state indicates that some sort of error has occurred with the job(s) in question. The job is not removed from the queue since the error may have been caused by a transient problem (e.g. a network timeout) and the job will run to completion if released from the hold. The most likely source of a hold error is that the specified output files are not returned to the server when the job has finished. The underlying reason for this may be that there is a problem with the code that the job is running (e.g. a syntax error in the case of a compiled MATLAB M-file).

The Condor server has been configured so that held jobs are released after a period of time to allow them to run again. Jobs may therefore move from the held state to the idle state, then to running and back to held again. If jobs are repeatedly held in this manner, then the problem is almost certainly down to the job itself. The section on Troubleshooting can help you track down these faults but if you are still stuck, please contact the Condor administrator.


I would like to run jobs that may take days or weeks to complete - is this possible ?

Yes - but only if you can build some form of save/restart mechanism into your software. For MATLAB applications this is fairly straightforward. For details, see the section on Support for Long Running Jobs.



Why are some Condor commands taking ages to work (e.g. condor_q, condor_rm) ?

If another user is currently submitting or removing a large number of jobs, then the Condor scheduler can become overloaded, which slows down any commands that access it (typically condor_submit, condor_q and condor_rm). In the ARC Condor pool, we have attempted to allocate a different scheduler to each user to avoid overload problems, but since there are many more users than schedulers, they may sometimes have to be shared. If you find that your scheduler is also being used by a Condor "power user" and this is slowing things down, please contact the Condor administrator who can move you to another scheduler.


Is it possible to run jobs over the summer vacation - there don't appear to be any machines available ?

Yes, this is possible: during the summer vacation machines are automatically woken up to meet the current demand, although there may be a delay of up to 30 minutes before this happens. Since all of the pool PCs are rebooted at midnight, jobs should in any case start to run in the early hours of the morning as the PCs boot up again.


How do I get some of my jobs to run before others ?

Use the command:

$ condor_prio -p <priority> <job_id>
where <priority> is the new priority and <job_id> is the job's ID (a cluster ID can also be used). Note that smaller priorities mean that jobs are more likely to run. This can be useful if you have a mixture of short and long running jobs and would like to have the short ones complete first. Usually Condor will attempt to run jobs in the order in which they were submitted, but this ordering may change as jobs are evicted. If jobs must be performed in a specific order, then this can be enforced by using Condor DAGMan (see the Condor manual for details and the sketch below).
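For the DAGMan case, a minimal DAG input file might look like the following sketch (order.dag and the two submit file names are hypothetical); it ensures that job B only starts once job A has completed successfully:

JOB A stage_one.submit
JOB B stage_two.submit
PARENT A CHILD B

The DAG is then submitted with:

$ condor_submit_dag order.dag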



How much space can I use up on the /condor_data filesystem ?

There are no quota limits imposed on the /condor_data filesystem, unlike the ordinary Sun UNIX /home filestore, but please bear in mind that the storage is shared between all Condor users. The total size of the filestore is around 7 TB, which should easily accommodate most Condor applications. However, Condor can generate large amounts of data (and log files) in a short period of time, which can soon exhaust the available storage if left around on the server. To prevent the Condor server from running out of storage, please delete any files you no longer need.

You can find how much space is left with the df (disk free) command:

$ df -vh /condor_data
and how much space you are using with the du (disk used) command e.g.
$ cd /condor_data/<username>
$ du -sh .
(this may take a long time if you have a lot of files as it works its way through all of them).



I'm trying to run Monte Carlo based simulations but I get the same results for each job - why is that ?

This is almost certainly due to the same sequences of random numbers being used for each job. Monte Carlo based simulations make extensive use of randomly generated numbers, but computers are poor at generating truly random numbers since they are built from combinational and sequential logic circuits which are inherently deterministic. Only a truly random process (e.g. amplified electronic noise or the detection of particles from a radioactive source) can provide a source of truly random numbers.

Most software instead generates pseudo-random numbers, in which the sequence of numbers eventually repeats itself - albeit after a very long period. A distribution graph of these numbers would show a close resemblance to a uniform distribution, but the fact remains that the sequence of numbers is completely predictable if one knows the starting point. If the same software random number generator is run multiple times, then exactly the same sequence of numbers will be produced. To get around this, most random number generators allow a seed value to be specified which changes the starting point in the sequence.

The key to getting Condor jobs to generate (approximately) independent sequences of numbers is therefore to use a different seed value for each job. The seed values can be generated via another random number generator and stored one value per file. These files can then be used as indexed input files for the jobs in a cluster.
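As a rough sketch of how the seed files might be generated beforehand on the server (nJobs and the seed<N>.txt naming are assumptions, chosen here to match Condor's $(Process) numbering), a one-off MATLAB script could be used:

rng('shuffle');                      % seed this one-off generator from the clock
nJobs = 100;                         % number of jobs in the cluster
seeds = randi(2^31 - 1, nJobs, 1);   % one large integer seed per job
for k = 0:nJobs-1
    dlmwrite(sprintf('seed%d.txt', k), seeds(k+1));   % write one seed per file
end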

Some users have attempted to employ timestamps as seed values for the random number generators on the basis that these should be unique to each job. The problem with this is that timestamps generally have a resolution of only one second, whilst computer processors operate on nanosecond timescales (and network devices on millisecond timescales). This makes it highly likely that multiple jobs will receive the same seed value, hence spoiling the statistical independence of the jobs. Generating the seed values first should always work much better and has the advantage that the same simulation run with the same input data will produce the same results as before (a good way of checking that the code works correctly) - this would not be the case if timestamps were used.



How much memory can Condor jobs use ?

All of the Condor pool PCs have 1 GB of memory per core and therefore 1 GB per job slot; however, not all of this will be available to Condor jobs. To be on the safe side, it is probably best to work to a limit of 500 MB per job. If you are likely to need more than this, please contact the Condor administrator for advice.



Why do my jobs stop running at midnight every night ?

All of the Condor pool PCs are rebooted nightly at midnight which will cause all jobs running to be evicted. In addition, most of the pool PCs are re-imaged in the early hours of Friday morning meaning that jobs may not start to run again until the teaching centres become occupied from around 9 am onwards.



Can Condor jobs run on a PC while someone is logged in to it ?

Yes, provided that the logged-in user is only making light use of the PC (e.g. browsing the web) rather than running compute-intensive applications such as MATLAB, AutoCAD etc. If the local load becomes significant or available memory becomes low, then the job will be evicted. You can see which machines have users logged in with:

$ condor_status -constraint UserLoggedIn==True



Why aren't my checkpointing jobs returning any checkpoint files when jobs are evicted ?

The default is for Condor NOT to transfer files on eviction - this is to prevent bandwidth and storage being wasted by non-checkpointing jobs. If you are using matlab_submit, you will need to specify the checkpoint files in the simplified job description file with:

checkpoint_files = ...
For ordinary job submission files, add the checkpoint files to the job's list of output files (the transfer_output_files attribute) and add this attribute to the job submission file:
+CheckpointJob = True
(the leading plus sign "+" is needed). The following two lines are also needed:
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
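Putting these together, the relevant part of an ordinary job submission file might look something like the following sketch, where results.mat is the normal output file and checkpoint.mat is the (purely illustrative) checkpoint file:

transfer_output_files = results.mat, checkpoint.mat
+CheckpointJob = True
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT

With ON_EXIT_OR_EVICT set, the listed files are sent back to the server whenever the job is evicted as well as when it exits normally.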