High Throughput Computing using Condor

Condor Limitations

Unlike high performance computing systems, which are based on specialised hardware, the Condor service uses commodity desktop PCs. This places some limitations on the types of application that can successfully be run under Condor and a list of these is given below. Please don't be put off by this. Many users have found the Condor service to be an extremely effective research tool (see Condor Successes for examples) and in some cases it has enabled research that otherwise would not have been possible. If you are unsure whether your problem is suitable for Condor, please contact the ARC Condor Service administrator Ian C. Smith for advice.


Interactivity

Condor does not provide any means of interacting with a job once it has started running and therefore any input needed by it will have to be read from a file. Similarly any output that needs to be saved will have to be written to a file rather than printed to the screen. Clearly this rules out applications which rely on interaction with a graphical user interface (GUI) e.g. via the mouse.
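As a rough illustration of the file-based pattern described above, the sketch below takes the names of an input file and an output file on the command line instead of prompting the user or printing results. Python is used purely for illustration and is not assumed to be installed on the pool PCs; the same structure applies to C, FORTRAN, R or compiled MATLAB programs. The function name run_job, the file names and the "one number per line" input format are all hypothetical.

```python
import sys
from pathlib import Path


def run_job(input_path, output_path):
    """Read whitespace-separated numbers from input_path and write a summary to output_path."""
    values = [float(token) for token in Path(input_path).read_text().split()]
    total = sum(values)
    mean = total / len(values) if values else 0.0
    # Results go to a file: nothing printed to the screen can be seen while the job runs.
    Path(output_path).write_text(
        f"count {len(values)}\nsum {total}\nmean {mean}\n"
    )
    return total, mean


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python job.py input.txt output.txt
    run_job(sys.argv[1], sys.argv[2])
```

Everything the program needs arrives through the input file and everything worth keeping leaves through the output file, so no keyboard, mouse or screen is involved at any point.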


Communication between jobs

The ARC Condor service does not provide any way for jobs to communicate with each other. Therefore if a problem is to be run successfully on Condor it must be broken down into tasks which run completely independently of each other. If your computing problem does require communication between jobs (for example it uses MPI), you may find that the HPC service provides a more suitable environment.


Job run times

Breaking down a computational problem into a large group of smaller parts is essential to making effective use of Condor. In general the smaller the parts, the more efficient the use of Condor will be. This is because Condor jobs (the smallest units of work in Condor) can only run when the pool PCs are not otherwise in use. If someone logs into a teaching centre machine while a job is running, then usually all of the work done by the job will be lost.

The limited availability of pool PCs makes shorter jobs more likely to run to completion than longer jobs. This means that the combined results are arrived at more quickly. Jobs are also more likely to run to completion during the summer vacation when teaching centre activity tends to be very low.

What constitutes "shorter jobs"? Experimentally we've found that jobs with run times of 15-30 minutes are the most efficient. Overheads such as file transfer can be significant for shorter jobs, so efficiency actually falls for jobs lasting just a few minutes. Run times of a few hours can work reasonably well but, for most practical purposes, 8-12 hours is about the limit. Since all of the pool PCs are rebooted nightly (at midnight), this places an absolute maximum of 24 hours on the run time of a given job. In practice this is very difficult to achieve.
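If a problem consists of many independent work items and you have a rough timing for one item, the arithmetic for hitting the 15-30 minute range can be sketched as follows. This is only a planning aid run on your own machine, not part of Condor itself; the function names and the default target of 20 minutes are assumptions made for the example.

```python
import math


def plan_jobs(total_items, seconds_per_item, target_minutes=20):
    """Return (items_per_job, number_of_jobs) for jobs of roughly target_minutes each."""
    if total_items <= 0 or seconds_per_item <= 0:
        raise ValueError("total_items and seconds_per_item must be positive")
    # How many items fit into the target run time (always at least one).
    items_per_job = max(1, int(target_minutes * 60 // seconds_per_item))
    n_jobs = math.ceil(total_items / items_per_job)
    return items_per_job, n_jobs


def job_ranges(total_items, items_per_job):
    """Yield half-open (start, stop) index ranges, one per job."""
    for start in range(0, total_items, items_per_job):
        yield start, min(start + items_per_job, total_items)
```

For example, 100,000 items at about 1.5 seconds each gives 800 items per job and 125 jobs in total. Timings measured on your own machine may differ from those on the pool PCs, so it is worth checking the estimate with a small trial run first.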

Would-be Condor users may not have any means of dividing up problems in a way that will reduce run times sufficiently. Worse still, for some problems (e.g. those based on iterative methods), run times may vary widely making it difficult - if not impossible - to get an estimate of what the overall spread of times might be. In this case, it may still be possible to use Condor if there is some way of restarting the job from a previously saved state. See the section on checkpointing for details.
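The general idea of restarting from a saved state can be sketched as below. This is a minimal, generic illustration and not the mechanism described in the checkpointing section: the state file name, the JSON format and the Newton iteration (standing in for a real iterative method) are all assumptions. How the state file is returned to the server when a job is interrupted is not covered here.

```python
import json
import os


def load_state(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"iteration": 0, "x": 1.0}


def save_state(path, state):
    """Write to a temporary file and rename it, so an interrupted write cannot corrupt the checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(path, max_iterations, checkpoint_every=100):
    """Newton iteration for sqrt(2), resuming from the checkpoint at path if one exists."""
    state = load_state(path)
    while state["iteration"] < max_iterations:
        state["x"] = 0.5 * (state["x"] + 2.0 / state["x"])
        state["iteration"] += 1
        if state["iteration"] % checkpoint_every == 0:
            save_state(path, state)
    save_state(path, state)  # final state, whether or not it fell on a checkpoint boundary
    return state
```

With this structure, a job that is restarted loses at most the work done since the last checkpoint rather than everything since the start.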


Application Software

When run under Windows, Condor does not provide any way of sharing files between the pool machines and the server. Because of this, any files that are required by a job will already need to be present on the PC or will need to be transferred to it as part of the job. The second method normally provides the job's input (data) files but it may be possible to install software packages in this way if they are compact enough and do not need admin rights.

The easiest way of running application software under Condor is to employ programs written by users themselves in "traditional" programming languages (such as C/C++ and FORTRAN). Once compiled and linked, the resulting executable can be uploaded to the Condor server and sent to the pool PCs as part of the job. If you are heading down this route, it is strongly recommended that you use static linking so that the executable can run in "stand-alone" mode (there will generally be a linker option to enforce this). If this is not possible, any dynamic-link libraries (DLLs) needed by the job will need to be bundled with it (tip: try running the executable on a teaching centre PC first to see if it will run without complaining about missing library functions).

Many users prefer to write their software using interpreted languages such as Perl, R and MATLAB. Here the software comes in two distinct parts - the program written by the user (often called a script) and the interpreter that runs it (as well as any library functions/modules needed).

MATLAB is available on all of the PCs in the Condor pool but, because of licensing limitations, it is not possible to run scripts (i.e. M-files) themselves under Condor. Instead, the M-files first need to be compiled into a stand-alone executable. Since MATLAB is widely used on the Condor service, local support is available - for details see the section on Running MATLAB jobs.

The R interpreter is also installed on all of the Condor pool machines albeit a rather old version (2.6.2). Fortunately R is small enough for it to be installed on-the-fly as part of a job and tools are available to do precisely that. If you intend to use R on Condor and need a particular version of R or need additional packages, please contact the Condor administrator as this can usually be arranged.

The final type of software is that provided by third parties normally as commercial packages. These tend to be large and their use is often restricted by a licensing agreement. It is unlikely that jobs needing this software can be run under Condor unless the package is preinstalled on the MWS and does not require a license to be checked out. For advice please contact the Condor administrator.


Memory requirements

Condor is not suited to running jobs with large memory requirements. Since all pool PCs are configured to allow a Condor job to run on each processor core, the amount of memory available to each job is around 2 GB. Since there may be other processes running on a PC (even while no one is logged in) this may be reduced to around 500 MB for practical purposes. If your jobs are likely to need more memory than this, the HPC service may prove to be a better option. Contact the Condor administrator for further details.


Disk space

The amount of disk space available on each PC will vary from one centre to another depending on the actual machine specification and the amount of software preinstalled on it. Condor jobs should aim to take up no more than a few GB of disk and if you require more than this please contact the Condor administrator for advice. In the past, Windows has performed very slowly with large files (bigger than 2 GB) although the situation may have improved recently.


Batch sizes

There is no hard limit on the number of jobs that can be submitted in one batch (called a cluster in Condor) but please be aware that submitting very large numbers of jobs places a heavy load on the scheduler, which may degrade performance for other users. We have found that Condor can cope with a cluster of 50,000 jobs but clusters of this size should not be used any more than is absolutely necessary. In particular, please do not submit very large clusters if all of the jobs are unlikely to complete within a week or two.


Storage space on the Condor server

The Condor data filestore resides on a large 7 TB RAID system which can hold around two million files and directories (i.e. inodes). Unlike the standard Sun UNIX service home filestore there are no quota restrictions and the entire filesystem is shared between users. Please bear this in mind and delete any files which are no longer needed. The amount of storage may sound a lot but it can quickly be exhausted by the very large numbers of files generated by Condor.
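Because the limit on the number of files and directories can be reached before the disk itself fills up, it can help to check both figures for your own directories before deciding what to delete. The sketch below is one way of doing this and assumes only that Python is available wherever you run it; the function name usage is hypothetical.

```python
import os


def usage(root):
    """Return (entries, total_bytes) for everything under root, without following symlinks."""
    entries = 0
    total_bytes = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Each file and each directory counts as one entry (one inode).
        entries += len(dirnames) + len(filenames)
        for name in filenames:
            try:
                total_bytes += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return entries, total_bytes
```

A directory holding many thousands of tiny output files may report a small total size while still accounting for a large share of the available entries.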

Note that the Condor filesystem is not backed up and in the unlikely (but possible) event of a crash it may not be possible to restore any of the data on it. It is therefore unsuited to long term storage of data and important results should be copied to somewhere safe.