High Throughput Computing using HTCondor

The Research IT HTCondor Service in Detail

PCs included in the Research IT HTCondor pool

The Research IT HTCondor service makes extensive use of classroom PCs located in teaching and learning centres across the campus, including libraries. Taken together, these PCs comprise the HTCondor pool. The size of the pool varies widely depending on the time of day, day of the week and time of year, but peaks at around 1900 cores.

Since each PC is configured to allow HTCondor jobs to run on all of its cores at the same time, this gives a theoretical maximum capacity of 1900 HTCondor jobs running in parallel. Normally the figure will be much lower than this, and it may drop to around 600 during the summer vacation when some centres are closed for refurbishment or upgrades.
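As a rough illustration of what these figures mean for a batch of jobs, the sketch below estimates how many back-to-back rounds of jobs are needed when every available core is busy. It uses only the 1900 and 600 core figures quoted above. The function names are hypothetical, and the estimate is a best case that ignores evictions, scheduling overhead and other users' jobs.

```python
import math

PEAK_SLOTS = 1900      # peak pool size quoted above (one job per core)
VACATION_SLOTS = 600   # approximate figure during the summer vacation


def waves_needed(n_jobs, slots):
    """Rounds of jobs needed if all slots run one job each per round."""
    return math.ceil(n_jobs / slots)


def best_case_hours(n_jobs, slots, hours_per_job):
    """Best-case elapsed time, assuming equal-length jobs and no evictions."""
    return waves_needed(n_jobs, slots) * hours_per_job


if __name__ == "__main__":
    for label, slots in (("peak", PEAK_SLOTS), ("vacation", VACATION_SLOTS)):
        print(label, waves_needed(5000, slots), "rounds for 5000 jobs")
```

For example, 5000 equal-length jobs would need three rounds at the peak figure but nine rounds at the vacation figure, so the same batch could take roughly three times as long.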

All of the HTCondor pool machines run the standard Managed Windows Service (MWS) based on the 64-bit version of Windows 10. A typical specification is:
  • Intel Core i5 (six-core) processor running at 3 GHz
  • 16 GB of installed memory
  • 330 GB of disk space

The HTCondor server

Access to the pool is exclusively via a single high-performance server with the hostname condor.liv.ac.uk (referred to here as the HTCondor server). Users must register for the HTCondor service to gain access to it. The HTCondor server runs Scientific Linux, a UNIX-like operating system, and therefore some knowledge of UNIX will be helpful in using the Research IT HTCondor service. However, it is also possible to use HTCondor directly from the (Windows) desktop without any knowledge of UNIX by running Desktop Condor (see the Desktop Condor (DTCondor) web page for details).

The main reason for using a single submission point is security. However, there are also very significant advantages in employing a central server to submit HTCondor jobs rather than allowing personal desktop PCs to be used. Firstly, a large (7 TB), fast RAID filestore is provided to store users' HTCondor files so that they do not take up valuable space on the fairly small hard drives of typical desktop PCs.

Secondly, large numbers of jobs can easily be submitted at once and the server left to work through them after the user has logged out. Doing the same from a desktop PC would place a very significant strain on it, and the PC would need to be kept powered on until all of the jobs had completed.
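To illustrate how many jobs can be described in one go, the sketch below builds the text of a standard HTCondor submit description file that queues a batch of jobs, each receiving its own index through the $(Process) macro. The submit commands shown are standard HTCondor ones, but the helper function, the executable name and the file names are hypothetical, and the Research IT service may expect different settings or provide its own submission tools.

```python
def make_submit_description(executable, n_jobs):
    """Return the text of a submit description file that queues n_jobs jobs.

    Each job gets its index (0 .. n_jobs-1) as its only argument and writes
    to its own output and error files.
    """
    lines = [
        "universe = vanilla",
        f"executable = {executable}",        # hypothetical program name
        "arguments = $(Process)",            # job index passed to the program
        "output = job_$(Process).out",
        "error = job_$(Process).err",
        "log = batch.log",
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
        f"queue {n_jobs}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Write the file; it could then be passed to condor_submit on the server.
    with open("batch.sub", "w") as f:
        f.write(make_submit_description("analyse.exe", 500))
```

With stock HTCondor, a file like this would be submitted once with condor_submit, after which the queue is worked through without the user needing to stay logged in.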

Availability of HTCondor pool PCs

It is extremely important that HTCondor jobs do not impact in any way on the use of the pool PCs by ordinary teaching centre users. For this reason, HTCondor jobs will generally not run while someone is logged into a pool PC. Furthermore, if a job is already running on a PC when a teaching centre user logs into it, the job will be removed immediately and returned to the job queue (HTCondor calls this evicting or vacating a job). Usually none of the files created by the job will be returned to the server, and any work done so far will be lost. For long-running jobs that exploit user-level checkpointing, it is possible to force HTCondor to return all of the output files created so that these can be used to restart the job from where it left off on a different PC.
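User-level checkpointing means that the program itself periodically saves enough state to resume later. The sketch below is a minimal illustration, not the service's own method: the file name, function names and the stop_after parameter (used only to simulate an eviction) are all hypothetical. In stock HTCondor, returning files on eviction is commonly requested with when_to_transfer_output = ON_EXIT_OR_EVICT in the submit description; whether this service relies on that setting is an assumption.

```python
import json
import os


def load_state(path):
    """Resume from a checkpoint file if one exists, else start afresh."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0, "total": 0}


def save_state(path, state):
    """Write to a temporary file, then rename, so a checkpoint is never half-written."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(path, n_steps, every=100, stop_after=None):
    """Sum 0..n_steps-1, checkpointing after every `every` completed steps.

    stop_after simulates an eviction: the function returns None after that
    many steps in this run, and any work since the last checkpoint is lost.
    """
    state = load_state(path)
    done_this_run = 0
    while state["next_step"] < n_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return None  # simulated eviction
        state["total"] += state["next_step"]  # the "real work" of one step
        state["next_step"] += 1
        done_this_run += 1
        if state["next_step"] % every == 0:
            save_state(path, state)
    save_state(path, state)
    return state["total"]
```

If a run is interrupted after 250 steps with a checkpoint every 100 steps, the next run restarts from step 200 rather than from the beginning, so only the last 50 steps are repeated.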

During term time, the pool machines are in very frequent use by teaching centre users but HTCondor jobs should have a chance to start on most of the PCs after the last users have logged out and gone home. This allows jobs to run without interruption overnight and at weekends. All but the shortest of HTCondor jobs will therefore tend to run to completion outside normal teaching centre hours during term time.

Sustainable use of HTCondor

All of the PCs in the HTCondor pool take part in power saving and normally go into hibernation when not in use. Hibernation does not occur while an HTCondor job is running on a PC, so jobs can run for significant periods without the PC powering down.