High Throughput Computing using Condor

Running Python Applications under Condor

All of the files for this example can be found on condor.liv.ac.uk in /opt1/condor/examples/python. The current version of Python used by Condor is 2.7.15.


Creating the Python files
Creating the simplified job submission file
Running the Python Jobs


Condor is well suited to running large numbers of Python jobs concurrently. If the application applies the same kind of analysis to large data sets (so-called "embarrassingly parallel" applications) or carries out similar calculations based on different random initial data (e.g. applications based on Monte Carlo methods), Condor can significantly reduce the time needed to generate the results by processing the data in parallel on different hosts. In some cases, simulations and analyses that would have taken years on a single PC can be completed in a matter of days.

The application will need to perform three main steps:

  1. Create the initial input data and store it to files which can be processed in parallel.
  2. Process the input data in parallel using the Condor pool and write the outputs to corresponding files.
  3. Combine the data in the output files to generate the overall results.
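The three steps above can be sketched on a single machine in plain Python (no Condor involved; numpy is assumed to be installed). The "parallel" step is just a loop here, but each iteration depends only on its own input file, which is exactly what allows Condor to run the iterations on different hosts:

```python
import os
import shelve
import tempfile

import numpy

size = 4    # matrix order n (kept small for illustration)
p = 3       # number of independent products

workdir = tempfile.mkdtemp()

# Step 1: create the input files input0.dat .. input{p-1}.dat
for i in range(p):
    shelf = shelve.open(os.path.join(workdir, "input%d.dat" % i), 'n')
    shelf['A'] = numpy.random.rand(size, size)
    shelf['B'] = numpy.random.rand(size, size)
    shelf.close()

# Step 2: process each input file independently -- under Condor each
# iteration of this loop would run as a separate job on a separate host
for i in range(p):
    shelf = shelve.open(os.path.join(workdir, "input%d.dat" % i))
    A, B = shelf['A'], shelf['B']
    shelf.close()
    out = shelve.open(os.path.join(workdir, "output%d.dat" % i), 'n')
    out['C'] = numpy.dot(A, B)
    out.close()

# Step 3: combine the partial results into the overall sum
total = numpy.zeros((size, size))
for i in range(p):
    shelf = shelve.open(os.path.join(workdir, "output%d.dat" % i))
    total = total + shelf['C']
    shelf.close()
```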

Creating the Python files

As a trivial example, to see how Condor can be used to run Python jobs in parallel, consider the case where we wish to form the sum of p matrix-matrix products, i.e. calculate C where:

   C = A1*B1 + A2*B2 + ... + Ap*Bp

and the Ai, Bi and C are square matrices of order n. It is easy to see that the p matrix products can be calculated independently and therefore potentially in parallel. (This is in fact the Python equivalent of the Condor MATLAB example.)
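The independence of the products can be demonstrated with a few lines of numpy (the values of n and p here are arbitrary): the partial products can be formed separately and combined afterwards in any order.

```python
import numpy

n, p = 5, 4   # order of the matrices and number of products (illustrative)

As = [numpy.random.rand(n, n) for _ in range(p)]
Bs = [numpy.random.rand(n, n) for _ in range(p)]

# All p products formed and summed in one go
C_direct = sum(numpy.dot(A, B) for A, B in zip(As, Bs))

# Each product formed on its own (as a separate Condor job would do),
# then combined afterwards -- the order of combination does not matter
partials = [numpy.dot(As[i], Bs[i]) for i in range(p)]
C_combined = sum(reversed(partials))
```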

In the first step we need to store A and B to Python data files which can be distributed by Condor for processing. Each Condor job will then need to read A and B from file, form the product AB and write the output to another Condor data file. The final step will sum all of the partial sums read from the output files to form the complete sum C.

The first step can be accomplished using a Python script such as the one below, called initialise.py (this just fills the input matrices with random values, but in a more realistic application the data would come from elsewhere):


import shelve
import numpy

size = 10

for i in range(0, 10):
   filename = "input%s.dat" % i

   A = numpy.random.rand(size, size)
   B = numpy.random.rand(size, size)

   my_shelf = shelve.open(filename, 'n')
   my_shelf['A'] = A
   my_shelf['B'] = B
   my_shelf.close()

The elements of A and B are given random initial values and are saved to files using the Python shelve module. Condor needs the input files to be indexed from 0 to p-1, so the above code generates ten input files named input0.dat, input1.dat .. input9.dat. This Python script can be run on the Condor server using python initialise.py to generate the inputs prior to submitting the Condor jobs.

The second script will need to form the matrix-matrix products and will eventually be run as an application on the Condor pool. A suitable Python file (product.py) is:


import shelve
import numpy

filename = 'input.dat'
my_shelf = shelve.open(filename)
A = my_shelf['A']
B = my_shelf['B']
my_shelf.close()

C = numpy.dot(A, B)

filename = 'output.dat'
my_shelf = shelve.open(filename, 'n')
my_shelf['C'] = C
my_shelf.close()
The shelve module is again used to restore the A and B matrix variables and to save the product C.
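The shelve round trip used by these scripts can be illustrated in isolation (the filename and array values below are arbitrary):

```python
import os
import shelve
import tempfile

import numpy

filename = os.path.join(tempfile.mkdtemp(), "roundtrip.dat")

# Save: shelve pickles each value under a string key
my_shelf = shelve.open(filename, 'n')
my_shelf['A'] = numpy.arange(9.0).reshape(3, 3)
my_shelf.close()

# Restore: the array comes back under the same key
my_shelf = shelve.open(filename)
A = my_shelf['A']
my_shelf.close()
```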

The final step is to combine all of the output files together to form the sum. This can be achieved using another Python script (combine.py), run on the server, such as this:


import shelve
import numpy
import numpy.linalg

size = 10
sum = numpy.zeros((size, size))

for i in range(0, 10):
   filename = "output%s.dat" % i

   my_shelf = shelve.open(filename)
   sum = numpy.add(sum, my_shelf['C'])
   my_shelf.close()

print numpy.linalg.norm(sum)

This loads each output file in turn and forms the sum of the matrix-matrix products in the sum variable. The norm of the final sum is printed at the end as a checksum which can be compared against the result of running a serial version (serial.py).
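serial.py itself is not listed here, but a serial check along the same lines could read the input files directly, form all of the products in one process and return the same norm (the function name is illustrative):

```python
import shelve

import numpy
import numpy.linalg

def serial_norm(input_files):
    """Sum the A*B products stored in the given shelve files
    and return the norm of the result."""
    total = None
    for filename in input_files:
        shelf = shelve.open(filename)
        A, B = shelf['A'], shelf['B']
        shelf.close()
        product = numpy.dot(A, B)
        total = product if total is None else total + product
    return numpy.linalg.norm(total)
```

After initialise.py has generated the inputs, serial_norm(["input%d.dat" % i for i in range(10)]) should agree with the value printed by combine.py.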

Creating the simplified job submission file

Each Condor job needs a submission file to describe how the job should be run. These can appear rather arcane to new users, so to help simplify the process, tools have been created which work with more user-friendly job submission files. These are automatically translated into files which Condor understands and which users need not worry about. For this example a submission file such as the one below, called run_product, can be used:

python_script = product.py
indexed_input_files = input.dat
indexed_output_files = output.dat
log = log.txt
total_jobs = 10

The Python script is specified in the python_script line. The lines starting indexed_input_files and indexed_output_files specify the input and output files which differ for each individual job. The total number of jobs is given in the total_jobs line. The underlying job submission processes will ensure that each individual job receives the correct input file (input0.dat .. input9.dat) and that the output files are indexed in a manner corresponding to the input files (e.g. output file output1.dat will correspond to input1.dat).
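The naming convention can be illustrated with a short, purely hypothetical helper (the real substitution is performed by the submission tools, not by user code): the job index is inserted just before the file extension.

```python
def indexed_name(template, index):
    """Insert a job index before the file extension,
    e.g. 'input.dat' -> 'input3.dat' for job 3."""
    stem, dot, ext = template.rpartition('.')
    if not dot:                       # no extension: just append the index
        return "%s%d" % (template, index)
    return "%s%d.%s" % (stem, index, ext)

# The ten jobs of this example would receive input0.dat .. input9.dat
names = [indexed_name("input.dat", i) for i in range(10)]
```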

It is also possible to provide a list of non-indexed files (i.e. files that are common to all jobs), for example:

input_files = common1.dat,common2.dat,common3.dat

This is useful if the common (non-indexed) files are relatively large.

For testing, the indexed_output_files line can be omitted so that all of the output files are returned (the default). It is also useful to capture the standard output and error (which would normally be printed to the screen) using the indexed_stdout and indexed_stderr attributes respectively. For production runs the output files should always be specified so that, if a run-time problem means they are not created, Condor will place the job in the held ('H') state. To release these jobs and run them elsewhere use:

$ condor_release -all

To find out why jobs have been held use:

$ condor_q -held
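As an example, a test version of run_product which also captures the screen output of each job might look like the following (the stdout.txt and stderr.txt filenames are arbitrary, and are assumed here to be indexed per job in the same way as the data files):

python_script = product.py
indexed_input_files = input.dat
indexed_stdout = stdout.txt
indexed_stderr = stderr.txt
log = log.txt
total_jobs = 10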

Running the Python Jobs

The Condor jobs are submitted from the Condor server using the command:

$ python_submit run_product

It should return with something like:

$ python_submit run_product
Submitting job(s)..........
10 job(s) submitted to cluster 1185.
You can monitor the progress of all of your jobs using:

$ condor_q

Initially the Condor jobs will remain in the idle state until machines become available, e.g.:

$ condor_q

-- Schedd: Q1@condor1 : < @ 06/18/19 11:50:39
OWNER   BATCH_NAME            SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
smithic CMD: run_product.bat   6/18 11:50      _      _     10     10 1185.0-9

10 jobs; 0 completed, 0 removed, 10 idle, 0 running, 0 held, 0 suspended

The overall state of the pool can be seen using the command:

$ condor_status

Once the jobs have completed (as indicated by the lack of any jobs in the queue) the directory should contain ten output files named output0.dat, output1.dat, .. output9.dat which can be processed using the combine.py script to generate the final result. There will also be ten log files (logfile*). These are not usually of any great interest, but they can be useful in tracking down problems when things have gone wrong, and in finding out when and where each job ran and for how long.

The job submission file can be edited using the ned editor; however, all of the options in it can be overridden temporarily from the command line (the file itself is not changed). For example, to submit five jobs instead of ten use:

$ python_submit run_product -total_jobs=5

and to also change the script to be used:

$ python_submit run_product -total_jobs=5 -python_script=otherapp.py

(use the --help option to see all of the options available)

This is useful for making small changes without the need to use the UNIX system editors to change the job submission file.

last updated on Tuesday, 06-Aug-2019 10:32:34 BST by I.C. Smith

ARC, University of Liverpool, Liverpool L69 3BX, United Kingdom, +44 (0)151 794 2000