High Throughput Computing with Condor

Introduction to High Throughput Computing


A (very) brief history of research computing


For many years, researchers faced with computationally demanding problems would resort to powerful bespoke systems called supercomputers (a classic example is shown left) to solve their problems more quickly. Often these were confined to regional and national computing centres and offered only limited availability. More recently, systems called computing clusters - often forming part of central university and departmental IT services - have taken over much of the work once done by supercomputers. Access to these clusters is usually more open but, like their supercomputer counterparts, they are still based on fairly exotic hardware and require specialised software to make best use of them.

This style of computing is referred to as High Performance Computing (HPC) - the goal of which is to speed up programs as much as possible so that results are achieved more quickly. In some applications, all-out speed is of paramount importance, making HPC essential (think of next-day weather prediction, for example - an extremely accurate forecasting program is not much use if it takes longer than 24 hours to run).


High throughput vs high performance

By contrast, High Throughput Computing (HTC) does not concern itself with speeding up individual programs - rather, it allows many copies of the same program to run in parallel (i.e. at the same time). Running multiple copies of exactly the same problem would be a fairly pointless exercise; the power of HTC lies in its ability to feed different data to each program copy (features in Condor make this particularly straightforward). The multiple copies are referred to as jobs. Applications might involve jobs processing:

  • different patient data in large scale biomedical trials
  • different parts of a genome or protein sequence in bioinformatics applications
  • different random numbers in simulations based on Monte Carlo methods
  • different model parameters in ensemble simulations or explorations of parameter spaces

The computer science community often refers to these types of computational problem as embarrassingly parallel since - as far as they are concerned - they can be solved with embarrassing ease (unlike many other classes of problem). Many people prefer the, perhaps less pejorative, term pleasantly parallel but the former is still widely used.
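
To make the idea of different data per job concrete, the sketch below shows what a Condor submit description for such a run might look like. The $(Process) macro is a standard Condor feature that expands to the job number (0, 1, 2, ...); the executable name and file-naming scheme here are purely hypothetical.

```
# Hypothetical submit description: 100 jobs, each reading its own input file.
# $(Process) expands to 0..99, so job 0 reads input_0.dat, job 1 input_1.dat, etc.
executable = analyse
arguments  = input_$(Process).dat
output     = result_$(Process).out
error      = job_$(Process).err
log        = trial.log
queue 100
```

The same program runs in every job; only the data it is given differs.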


Condor's version of high throughput computing

Condor, which is used widely in HTC, is further removed from most HPC systems in that it can exploit commodity hardware such as desktop PCs. This has drawbacks compared with a dedicated computing cluster (the PCs may only be available to Condor outside the working day, for example) but these are usually outweighed by the sheer number of PCs in places like universities which can be harnessed for research computing - at virtually no extra cost.


Divide and rule - how to tackle problems using Condor

Since Condor does not provide any means of making a given program run faster, there is no need for programs to be modified to exploit specialised hardware (as is the case with HPC). Instead, it is the manner in which a given problem is tackled that needs to be altered. The approach is nearly always the same:

  1. divide the problem up into smaller independent parts;
  2. get Condor to process as many of these small parts as possible in parallel (i.e. at the same time);
  3. combine the partial results produced by Condor to give the overall result.

The middle step provides the overall speed-up. Instead of tackling the problem in strict sequence (i.e. serially), many parts of it are tackled at the same time (i.e. in parallel). As an analogy, you could think of a number of people solving a jigsaw puzzle together (see footnote).
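The three steps can be sketched in ordinary Python using a Monte Carlo estimate of pi as the example problem (one of the application types mentioned earlier). Here the "jobs" run one after another on a single machine; under Condor each call to run_job would instead execute as a separate job on a different PC, with the job number playing the role of $(Process). The function names are illustrative, not part of Condor.

```python
import random

def run_job(job_id, samples):
    """One independent 'job': estimate pi by sampling random points in the
    unit square and counting those inside the quarter circle.
    Each job uses its own seed, so every copy works on different data."""
    rng = random.Random(job_id)  # different random stream per job
    hits = sum(1 for _ in range(samples)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return hits

def combine(partials, samples_per_job):
    """Step 3: merge the partial results into the overall answer."""
    total_hits = sum(partials)
    total_samples = samples_per_job * len(partials)
    return 4.0 * total_hits / total_samples

# Step 1: divide the work into 10 independent parts
samples_per_job = 100_000
# Step 2: run the parts (serially here; Condor would run them in parallel)
partials = [run_job(job_id, samples_per_job) for job_id in range(10)]
# Step 3: combine the partial results
print(combine(partials, samples_per_job))  # prints an estimate close to 3.14
```

Because the jobs share no state, they can run in any order and on any machine, which is exactly the property Condor exploits.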

The program code used in the actual Condor jobs will probably be close to that originally used on a single PC, but the first and last steps will need additional software to be created. This should not be too difficult, but it is crucially important to have a very clear idea of how the problem is to be divided up before starting to write any software.

If you are unsure of how to proceed or even where to start, please contact Ian C. Smith in the ARC team for advice (email: i.c.smith@liverpool.ac.uk). Not all applications can be run successfully on Condor (see Condor limitations) and some may perform better on the ARC High Performance Computing Service.



Footnote: solving a jigsaw puzzle in parallel

Solving a jigsaw puzzle is one analogy to an HTC application that may be familiar. Rather than starting with a single piece and working outwards, most people will usually begin by working on different parts of the puzzle at the same time (e.g. starting with the corners and edges). Only later are these parts combined to form the entire picture.

By enlisting the help of some friends, the puzzle could be solved much faster if each person worked on a different part of the puzzle at the same time (i.e. in parallel). This method of dividing up work can be applied to many computing problems and, for certain problems, will allow Condor to speed up their solution.

Strictly speaking, the jigsaw problem is not an exact fit to HTC since each of the puzzle solvers cannot work completely independently of the others. For example, what would happen if one person needed a piece another was holding?