High Throughput Computing using HTCondor

HTCondor Limitations

Unlike high performance computing systems, which are based on specialised hardware, the HTCondor service uses commodity desktop PCs. This places some limitations on the types of application that can be run successfully under HTCondor, and these are listed below. Please don't be put off by this. Many users have found the HTCondor service to be an extremely effective research tool (see HTCondor Successes for examples) and in some cases it has enabled research that would not otherwise have been possible. If you are unsure whether your problem is suited to HTCondor, please contact the Research IT HTCondor Service administrator Ian C. Smith for advice.


Interactivity

HTCondor does not provide any means of interacting with a job once it has started running and therefore any input needed by it will have to be read from a file. Similarly any output that needs to be saved will have to be written to a file rather than printed to the screen. Clearly this rules out applications which rely on interaction with a graphical user interface (GUI) e.g. via the mouse.
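As a minimal sketch of what "non-interactive" means in practice, the Python fragment below takes everything it needs from an input file and writes everything it produces to an output file. The file format (JSON), the "values" key and the sum-of-squares calculation are purely illustrative assumptions and are not part of the HTCondor service.

```python
import json
import sys
from pathlib import Path


def run_job(input_path: str, output_path: str) -> None:
    """Read every parameter from a file and write every result to a file."""
    params = json.loads(Path(input_path).read_text())

    # Placeholder computation (an assumption for illustration only).
    result = sum(x * x for x in params["values"])

    # Nothing is printed to the screen; the result goes to a file instead.
    Path(output_path).write_text(json.dumps({"sum_of_squares": result}))


if __name__ == "__main__" and len(sys.argv) == 3:
    # Example invocation: python job.py input.json output.json
    run_job(sys.argv[1], sys.argv[2])
```

A program structured like this never waits for keyboard or mouse input, so it can run unattended on a pool PC.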


Communication between jobs

The Research IT HTCondor service does not provide any way for jobs to communicate with each other. Therefore, if a problem is to be run successfully on HTCondor, it must be broken down into tasks which run completely independently of each other. If your computing problem does require communication between jobs (for example it uses MPI), you may find that the HPC service provides a more suitable environment.


Job run times

Breaking down a computational problem into a large group of smaller parts is essential to making effective use of HTCondor. In general, the smaller the parts, the more efficient the use of HTCondor will be (within the limits described below). This is because HTCondor jobs (the smallest units of work in HTCondor) can only run when the pool PCs are not otherwise in use. If someone logs in to a teaching centre machine while a job is running, then usually all of the work done by the job will be lost.

The limited availability of pool PCs makes shorter jobs more likely to run to completion than longer jobs. This means that the combined results are arrived at more quickly. Jobs are also more likely to run to completion during the summer vacation, when teaching centre activity tends to be very low.

What constitutes "shorter jobs"? Experimentally we've found that jobs with run times of 15-30 minutes are the most efficient. Overheads such as file transfer can be significant for shorter jobs, so efficiency actually falls for jobs of just a few minutes' duration. Run times of a few hours can work reasonably well but, for most practical purposes, 8-12 hours is about the limit. Since all of the pool PCs are rebooted nightly (at midnight), this places an absolute maximum of 24 hours on the run time of a given job. In practice this is very difficult to achieve.
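The following Python sketch shows one way the run-time guidance might be turned into a job plan. It assumes the problem consists of independent work items that each take roughly the same, known time. The function name plan_jobs, the 20 minute default (chosen from the 15-30 minute range quoted above) and the per-item timing are all assumptions made for illustration.

```python
from typing import List, Tuple


def plan_jobs(n_items: int, seconds_per_item: float,
              target_minutes: float = 20.0) -> List[Tuple[int, int]]:
    """Split n_items into (start, stop) index ranges, one range per job.

    Each job is sized so that its estimated run time is close to
    target_minutes, given an estimated cost per work item.
    """
    # How many items fit into the target run time (always at least one).
    items_per_job = max(1, int(target_minutes * 60 / seconds_per_item))

    return [(start, min(start + items_per_job, n_items))
            for start in range(0, n_items, items_per_job)]


# 10,000 items at about 3 seconds each gives 25 jobs of 400 items.
jobs = plan_jobs(10_000, 3.0)
```

Each (start, stop) pair could then be passed to a separate job, for example as command line arguments, so that the jobs remain completely independent of each other.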

Would-be HTCondor users may not have any means of dividing up problems in a way that will reduce run times sufficiently. Worse still, for some problems (e.g. those based on iterative methods), run times may vary widely making it difficult - if not impossible - to get an estimate of what the overall spread of times might be. In this case, it may still be possible to use HTCondor if there is some way of restarting the job from a previously saved state. See the section on checkpointing for details.
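The idea of restarting from a previously saved state can be sketched in Python as follows. This is only an illustration of the general pattern, not the mechanism described in the checkpointing section: the file name checkpoint.json, the JSON state layout, the placeholder iteration and the stop_after argument (used here to simulate an interruption) are all assumptions. How the saved file gets back to the server when a job is interrupted is not shown.

```python
import json
import os

CHECKPOINT = "checkpoint.json"  # hypothetical file name


def load_state(path: str) -> dict:
    """Return the saved state if one exists, otherwise a fresh state."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"iteration": 0, "total": 0.0}


def save_state(state: dict, path: str) -> None:
    """Write to a temporary file, then rename it over the old checkpoint,
    so that an interruption never leaves a half-written checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(n_iterations: int, path: str = CHECKPOINT, every: int = 100,
        stop_after=None) -> dict:
    """Iterate from the last checkpoint, saving the state every `every` steps."""
    state = load_state(path)
    steps = 0
    while state["iteration"] < n_iterations:
        i = state["iteration"]
        state["total"] += 1.0 / (i + 1) ** 2  # placeholder for the real work
        state["iteration"] = i + 1
        steps += 1
        if state["iteration"] % every == 0:
            save_state(state, path)
        if stop_after is not None and steps >= stop_after:
            return state  # simulate the job being interrupted
    save_state(state, path)
    return state
```

If a run is interrupted, calling run() again with the same checkpoint file repeats at most `every` iterations rather than starting from scratch.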


Application Software

When run under Windows, HTCondor does not provide any way of sharing files between the pool machines and the server. Because of this, any files that are required by a job will either need to be present on the PC already or will need to be transferred to it as part of the job. The second method is normally used for the job's input (data) files, but it may also be possible to install software packages in this way if they are compact enough and do not need admin rights.

The easiest way of running application software under HTCondor is to use programs written by users themselves in "traditional" programming languages (such as C/C++ and FORTRAN). Once compiled and linked, the resulting executable can be uploaded to the HTCondor server and sent to the pool PCs as part of the job. If you are heading down this route, it is strongly recommended that you use static linking so that the executable can run in "stand-alone" mode (there will generally be a linker option to enforce this). If this is not possible, any Dynamic Link Libraries (DLLs) needed by the job will need to be bundled with it (tip: try running the executable on a teaching centre PC first to see if it will run without complaining about missing library functions).

Many users prefer to write their software using interpreted languages such as Python, R and MATLAB. Here the software comes in two distinct parts - the program written by the user (often called a script) and the interpreter that runs it (as well as any library functions/modules needed).

MATLAB is available on all of the PCs in the HTCondor pool but, because of licensing limitations, it is not possible to run scripts (i.e. M-files) themselves under HTCondor. Instead, the M-files first need to be compiled into a stand-alone executable. Since MATLAB is widely used on the HTCondor service, local support is available - for details see the section on Running MATLAB jobs. As well as MATLAB, the Research IT HTCondor Service provides support for programs written in Python and R.

The final type of software is that provided by third parties normally as commercial packages. These tend to be large and their use is often restricted by a licensing agreement. It is unlikely that jobs needing this software can be run under HTCondor unless the package is preinstalled on the MWS and does not require a license to be checked out. For advice please contact the HTCondor administrator.


Memory requirements

HTCondor is not suited to running jobs with large memory requirements as the maximum amount of memory available on each PC is around 16 GB. Since there may be other processes running on a PC (even while no one is logged in), the amount available in practice may be lower. If your jobs are likely to need more memory than this, the HPC service may prove to be a better option. Contact the HTCondor administrator for further details.
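A rough memory estimate can be made before submitting anything. The Python sketch below estimates the size of a dense matrix of double precision numbers and compares it with the 16 GB figure quoted above. The headroom factor of 0.5 (allowing for the operating system and other processes) is an arbitrary assumption, not a figure from the service, and the function names are hypothetical.

```python
def matrix_bytes(rows: int, cols: int, bytes_per_element: int = 8) -> int:
    """Memory needed for a dense matrix (8 bytes per double precision value)."""
    return rows * cols * bytes_per_element


def fits_on_pool_pc(n_bytes: int, available_gb: float = 16.0,
                    headroom: float = 0.5) -> bool:
    """True if n_bytes fits within the assumed usable share of a PC's memory."""
    usable_bytes = available_gb * headroom * 1024 ** 3
    return n_bytes <= usable_bytes


# A 40,000 x 40,000 matrix of doubles needs 12.8 billion bytes: too large.
big = matrix_bytes(40_000, 40_000)
# A 10,000 x 10,000 matrix needs 800 million bytes: comfortably small.
small = matrix_bytes(10_000, 10_000)
```

Estimates like this only cover the largest data structures; the true peak usage of a program is best confirmed by a test run on a teaching centre PC.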


Disk space

The amount of disk space available on each PC will vary from one centre to another depending on the actual machine specification and the amount of software preinstalled on it. HTCondor jobs should aim to take up no more than a few GB of disk and if you require more than this please contact the HTCondor administrator for advice.


Batch sizes

There is no hard limit on the number of jobs that can be submitted in one batch (called a cluster in HTCondor), but please be aware that submitting very large numbers of jobs places a heavy load on the scheduler, which may degrade performance for other users. We have found that HTCondor can cope with a cluster of 50,000 jobs but clusters of this size should not be used any more than is absolutely necessary. In particular, please do not submit very large clusters if all of the jobs are unlikely to complete within a week or two.


Storage space on the HTCondor server

The HTCondor data filestore resides on a large 7 TB RAID system which can hold around two million files and directories (i.e. inodes). Unlike the home filestore on the standard Linux service (on lxb/lxc), there are no quota restrictions and the entire filesystem is shared between users. Please bear this in mind and delete any files which are no longer needed. The amount of storage may sound like a lot but it can quickly be exhausted by the very large numbers of files generated by HTCondor.
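Since both the number of files and the total space are limited, it can be useful to check how much a job directory is actually using. The Python sketch below counts the files and directories under a given directory and adds up the file sizes. The function name usage is hypothetical, and the count is only an approximation of inode usage (hard links, for example, are counted once per name).

```python
import os


def usage(root: str):
    """Return (number of files and directories, total file size in bytes)
    for everything below root."""
    n_entries = 0
    n_bytes = 0
    for dirpath, dirnames, filenames in os.walk(root):
        n_entries += len(dirnames) + len(filenames)
        for name in filenames:
            try:
                # lstat so that symbolic links are not followed
                n_bytes += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                pass  # the file disappeared or is unreadable; skip it
    return n_entries, n_bytes
```

For example, usage("results") would report the totals for a hypothetical directory called results, which gives an idea of how much could be freed by deleting it once the data has been copied elsewhere.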

Note that the HTCondor filesystem is not backed up and, in the unlikely (but possible) event of a crash, it may not be possible to restore any of the data on it. It is therefore unsuited to long term storage of data, and important results should be copied to somewhere safe.