High Throughput Computing using Condor

The ARC Condor Service in Detail

PCs included in the ARC Condor pool

The ARC Condor service makes widespread use of classroom PCs located in teaching and learning centres across the campus (including some of the PCs in the libraries). Taken together, the PCs comprise the Condor pool. The size of the pool varies widely depending on time of day, day of the week and time of year but peaks at around 1900 cores.

Since each PC is configured to allow a Condor job to run on every one of its cores at the same time, this gives a theoretical maximum capacity of 1900 Condor jobs running in parallel. Normally the figure will be much lower than this and may drop to around 600 during the summer vacation, when some centres are closed for refurbishment or upgrades.
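To give a feel for what these figures mean in practice, the following Python sketch estimates a best-case completion time for a batch of jobs at the peak (1900) and summer (600) pool sizes. The batch size of 5000 jobs and the run time of half an hour per job are purely illustrative assumptions, and the estimate ignores evictions, queueing delays and file transfer time, so real turnaround will be longer.

```python
import math

PEAK_SLOTS = 1900    # approximate peak pool size quoted above
SUMMER_SLOTS = 600   # approximate summer vacation pool size quoted above


def waves_needed(num_jobs: int, slots: int) -> int:
    """Number of back-to-back batches needed to run num_jobs on `slots` cores."""
    if slots <= 0:
        raise ValueError("slots must be positive")
    return math.ceil(num_jobs / slots)


def best_case_hours(num_jobs: int, slots: int, hours_per_job: float) -> float:
    """Idealised wall-clock time: every wave runs uninterrupted on all slots."""
    return waves_needed(num_jobs, slots) * hours_per_job


if __name__ == "__main__":
    # Hypothetical workload: 5000 jobs of 0.5 hours each.
    print(best_case_hours(5000, PEAK_SLOTS, 0.5))    # 3 waves
    print(best_case_hours(5000, SUMMER_SLOTS, 0.5))  # 9 waves
```

Because the whole pool is rarely free at once, this is best treated as a lower bound rather than a prediction.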

All of the Condor pool machines run the standard Managed Windows Service (MWS) based on the 64-bit version of Windows 7. A typical specification is:
  • Intel Core i3 (quad-core) processor running at 3.3 GHz
  • 8 GB of installed memory
  • 120 GB of disk space

The Condor server

Access to the pool is exclusively via a single high-performance server with the hostname condor.liv.ac.uk (referred to here as the Condor server). Users must register for the Condor service to gain access to it. The Condor server runs the Scientific Linux version of the UNIX operating system, so some knowledge of UNIX will be helpful in using the ARC Condor service. However, it is also possible to use Condor directly from the (Windows) desktop without any knowledge of UNIX by running Desktop Condor (see the Desktop Condor (DTCondor) web page for details).

The main reason for using a single submission point is security; however, there are also very significant advantages in employing a central server to submit Condor jobs rather than allowing personal desktop PCs to be used. Firstly, a large (7 TB), fast RAID filestore is provided to store users' Condor files so that they do not take up valuable space on the fairly small hard drives of typical desktop PCs.

Secondly, large numbers of jobs can easily be submitted at once and the server left to work through them after the user has logged out. This would place a very significant strain on a desktop PC and the PC would need to be kept powered-on until all of the jobs had completed.
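As an illustration of how many jobs can be queued in one go, the Python sketch below builds a generic HTCondor submit description in which a single `queue N` statement creates N jobs, each receiving its own index through the `$(Process)` macro. The commands shown are standard HTCondor submit commands, but the executable name, file names and job count are hypothetical, and the ARC service may require different settings or provide its own simplified submission tools, so treat this as a sketch rather than the service's actual procedure.

```python
def make_submit_description(executable: str, num_jobs: int) -> str:
    """Return the text of a generic HTCondor submit description.

    All file names are illustrative assumptions, not ARC conventions.
    """
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        "arguments = $(Process)",          # each job gets its own index 0..N-1
        "output = job_$(Process).out",     # per-job standard output
        "error = job_$(Process).err",      # per-job standard error
        "log = jobs.log",                  # shared event log for the batch
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
        f"queue {num_jobs}",               # one statement queues all the jobs
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical Windows executable and batch size.
    print(make_submit_description("myprog.exe", 1000))
```

Saved to a file, a description like this would normally be passed to `condor_submit` on the submission host, after which the user can log out and leave the server to work through the queue.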

Availability of Condor pool PCs

It is extremely important that Condor jobs do not impact in any way on use of the pool PCs by ordinary teaching centre users. For this reason, Condor jobs will generally not run while someone is logged into a pool PC. Furthermore, if a job is already running on a PC when a teaching centre user logs into it, the job will be removed immediately and returned to the job queue (Condor calls this evicting or vacating a job). Usually none of the files created by the job will be returned to the server and any work done so far will be lost. For long-running jobs that exploit user-level checkpointing, it is possible to force Condor to return all of the output files created so that these can be used to restart the job from where it left off on a different PC.
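User-level checkpointing simply means that the program itself saves its progress periodically and, on start-up, resumes from the saved state if one exists. The Python sketch below shows one minimal way this could be done; the checkpoint file name, the state layout and the placeholder "work" are all assumptions made for illustration. In standard HTCondor, returning files from an evicted job is typically requested with `when_to_transfer_output = ON_EXIT_OR_EVICT` in the submit description, although the ARC service may expose this option differently.

```python
import json
import os

CHECKPOINT_FILE = "checkpoint.json"  # hypothetical file name


def load_state(path: str = CHECKPOINT_FILE) -> dict:
    """Resume from an existing checkpoint, or start from scratch."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0, "total": 0}


def save_state(state: dict, path: str = CHECKPOINT_FILE) -> None:
    """Write to a temporary file, then rename it over the old checkpoint,
    so an eviction part-way through a write does not corrupt the checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(total_steps: int, checkpoint_every: int = 100,
        path: str = CHECKPOINT_FILE) -> int:
    """Run the steps not yet completed, checkpointing at regular intervals."""
    state = load_state(path)
    for step in range(state["next_step"], total_steps):
        state["total"] += step            # placeholder for the real work
        state["next_step"] = step + 1
        if state["next_step"] % checkpoint_every == 0:
            save_state(state, path)
    save_state(state, path)               # final state doubles as the result
    return state["total"]


if __name__ == "__main__":
    print(run(1000))
```

If the job is evicted and its checkpoint file is returned and sent out again with the rerun, at most `checkpoint_every` steps of work are repeated on the next PC.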

During term time, the pool machines are in very frequent use by teaching centre users but Condor jobs should have a chance to start on most of the PCs after the last users have logged out and gone home. This allows jobs to run without interruption overnight and at weekends. All but the shortest of Condor jobs will therefore tend to run to completion outside normal teaching centre hours during term time.

Sustainable use of Condor

All of the PCs in the Condor pool take part in power saving and normally go into hibernation when not in use. Hibernation does not occur while a Condor job is running on a PC, so jobs can run for significant periods without the PC powering down. During the summer vacation there is almost no teaching centre use, and PCs are automatically woken up as needed by the Condor server in response to demand from Condor users.