Running Python Applications under Condor using numpy
All of the files for this example can be found on
condor.liv.ac.uk in
/opt1/condor/examples/python. The current version of Python used
by Condor is 2.7.15; however, 3.7.4 is also available
(see below for details).
Contents
Introduction
Creating the Python files
Creating the simplified job submission file
Running the Condor Jobs
Python 3
Saving multiple arrays to a single file
A word about compatibility
Using pickle for non-array objects
Using shelve for multiple objects
Introduction
Condor is well suited to running large numbers of Python jobs concurrently. If the application applies the same kind of analysis to large data sets (so-called "embarrassingly parallel" applications) or carries out similar calculations based on different random initial data (e.g. applications based on Monte Carlo methods), Condor can significantly reduce the time needed to generate the results by processing the data in parallel on different hosts. In some cases, simulations and analyses that would have taken years on a single PC can be completed in a matter of days.
The application will need to perform three main steps:
- Create the initial input data and store it to files which can be processed in parallel.
- Process the input data in parallel using the Condor pool and write the outputs to corresponding files.
- Combine the data in the output files to generate the overall results.
Creating the Python files
As a trivial example, to see how Condor can be used to run Python jobs in parallel, consider the case where we wish to form the sum of p matrix-matrix products, i.e. calculate C where:

C = A1B1 + A2B2 + ... + ApBp

and the Ai, Bi and C are square matrices of order n. It is easy to see that the p matrix products can be calculated independently and therefore potentially in parallel. (This is in fact the Python equivalent of the Condor MATLAB example.)
In the first step we need to store the A and B matrices in Python data files which can be distributed by Condor for processing. Each Condor job will then read its A and B matrices from file, form the product AB and write the result to another data file. The final step will sum all of the products read from the output files to form the complete sum C.
The first step can be accomplished using a Python script such as the one below called initialise.py (this just fills the input matrices with random values but in a more realistic application this data would come from elsewhere):
import numpy as np

size = 1000
no_of_jobs = 100

# Create a pair of (size x size) random input matrices for each job
for i in range(0, no_of_jobs):
    filenameA = "inputA%s.npy" % i
    filenameB = "inputB%s.npy" % i
    A = np.random.rand(size, size)
    B = np.random.rand(size, size)
    np.save(filenameA, A)
    np.save(filenameB, B)
The elements of A and B are given random initial values and are saved to files using np.save (from the numpy module). Condor needs the input files to be indexed from 0 to p-1, so the above code generates 100 input files for the A matrix named inputA0.npy, inputA1.npy .. inputA99.npy and 100 input files for the B matrix named inputB0.npy, inputB1.npy .. inputB99.npy. This Python script can be run on the Condor server using
$ python initialise.py

to generate the inputs prior to submitting the Condor jobs.
The second script will need to form the matrix-matrix products and will eventually be run as an application on the Condor pool. A suitable Python file (product.py) is:
import numpy as np

filenameA = "inputA.npy"
filenameB = "inputB.npy"
filenameC = "output.npy"

A = np.load(filenameA)
B = np.load(filenameB)
C = np.matmul(A, B)
np.save(filenameC, C)

The numpy module is again used to restore the A and B matrix variables (using np.load) and to save the product C. Note that the script refers to the plain (un-indexed) file names; the submission tools described below ensure that each job receives its own indexed input files under these names.
The final step is to combine all of the output files together to form the sum. This can be achieved using another Python script (combine.py) such as this:
import numpy as np

size = 1000
no_of_jobs = 100

# Accumulate the matrix products from the individual job outputs
sum = np.zeros((size, size))
for i in range(0, no_of_jobs):
    filename = "output%s.npy" % i
    C = np.load(filename)
    sum = sum + C
print(np.linalg.norm(sum))

This is run on the server once all the jobs have completed using:
$ python combine.py
This loads each output file in turn and forms the sum of the matrix-matrix products in the sum variable. The Frobenius norm of the final sum is printed at the end as a checksum which can be compared against the result of running a serial version (serial.py).
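The serial version is not listed here, but a minimal sketch of what serial.py might look like (an assumption based on the scripts above: it reads the same input files and computes the whole sum in a single process) is:

# serial.py - sketch of a single-process version for checking the result
import numpy as np

size = 1000
no_of_jobs = 100

# Form the complete sum of products in one process (no Condor involved)
sum = np.zeros((size, size))
for i in range(0, no_of_jobs):
    A = np.load("inputA%s.npy" % i)
    B = np.load("inputB%s.npy" % i)
    sum = sum + np.matmul(A, B)
print(np.linalg.norm(sum))

The printed norm should match the one produced by combine.py (up to floating point rounding, since the products may be summed in a different order).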
Creating the simplified job submission file
Each Condor job needs a submission file to describe how the job should be run. These can appear rather arcane to new users and therefore, to help simplify the process, tools have been created which work with more user-friendly job submission files. These are automatically translated into files which Condor understands and which users need not worry about. For this example, a submission file called run_product such as the one below can be used:
python_script = product.py
indexed_input_files = inputA.npy, inputB.npy
indexed_output_files = output.npy
log = log.txt
total_jobs = 100
The Python script is specified in the python_script line. The lines starting indexed_input_files and indexed_output_files specify the input and output files which differ for each individual job. The total number of jobs is given in the total_jobs line. The underlying job submission processes will ensure that each individual job receives the correct input files (inputA0.npy .. inputA99.npy and inputB0.npy .. inputB99.npy) and that the output files are indexed in a corresponding manner to the input files (e.g. output file output1.npy will correspond to inputA1.npy and inputB1.npy).
It is also possible to provide a list of non-indexed files (i.e. files that are common to all jobs), for example:
input_files = common1.npy, common2.npy, common3.npy

This is useful if the common (non-indexed) files are relatively large or if your application calls functions in other module files. These can be listed as common input files so that they are available to each individual job, e.g.:
input_files = module1.py, module2.py, module3.py
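For example, if product.py called functions from a helper module then that module file would need to be listed in input_files so that it is transferred with each job. A hypothetical sketch (the mymatrixutils.py file name and scaled_product function are illustrative only, not part of this example):

# mymatrixutils.py - hypothetical helper module listed in input_files
import numpy as np

def scaled_product(A, B, alpha=1.0):
    # Return alpha * (A matrix-multiplied by B)
    return alpha * np.matmul(A, B)

product.py could then use "from mymatrixutils import scaled_product" as normal, since the module file will be present in each job's working directory.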
Aside:
For testing, the indexed_output_files line can be omitted so that all of the output files are returned (the default). It is useful to also capture the standard output and error (which would normally be printed to the screen) using the indexed_stdout and indexed_stderr attributes respectively (see the sketch at the end of this aside). For production runs, the output files should always be specified in case there is a run-time problem and they are not created; in this case Condor will place the job in the held ('H') state. To release these jobs and run them elsewhere use:
$ condor_release -all
To find out why jobs have been held use:
$ condor_q -held
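For example, a version of the run_product submission file used during testing might look like the sketch below (the stdout.txt and stderr.txt file names are only examples; indexed_output_files is omitted so that all output files are returned):

python_script = product.py
indexed_input_files = inputA.npy, inputB.npy
indexed_stdout = stdout.txt
indexed_stderr = stderr.txt
log = log.txt
total_jobs = 100

Each job's standard output and error would then be returned in correspondingly indexed files (e.g. stdout0.txt and stderr0.txt for the first job), in the same way as the indexed input and output files.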
Running the Condor Jobs
The Condor jobs are submitted from the Condor server using the command:
$ python_submit run_product
It should return with something like:
$ python_submit run_product
Submitting job(s)....................................................................................................
100 job(s) submitted to cluster 15768.

You can monitor the progress of all of your jobs using:
$ condor_q

The Condor jobs should soon start to run and complete, e.g.:
$ condor_q

OWNER    BATCH_NAME            SUBMITTED    DONE  RUN  IDLE  TOTAL  JOB_IDS
smithic  CMD: run_product.bat  2/14 17:03   11    87   2     100    15768.0-99

89 jobs; 0 completed, 0 removed, 2 idle, 87 running, 0 held, 0 suspended

The overall state of the pool can be seen using the command:
$ condor_status
Once the jobs have completed (as indicated by the lack of any jobs in the queue), the directory should contain 100 output files named output0.npy, output1.npy, .. output99.npy which can be processed using the combine.py script to generate the final result. There will also be a log file (log.txt) which is of little interest in itself, but can be useful for tracking down problems when things have gone wrong and for finding out when and where each job ran and for how long.
The job submission file can be edited using the nedit editor (see the Linux beginners guide), however all of the options in it can be overridden temporarily from the command line (the file itself is not changed). For example, to submit five jobs instead of 100 use:
$ python_submit run_product -total_jobs=5

and to also change the script to be used:
$ python_submit run_product -total_jobs=5 -python_script=otherapp.py

(use the --help option to see all of the options available)
This is useful for making small changes without the need to use the UNIX system editors to change the job submission file.
Python 3
You can run jobs requiring Python 3 on the Condor pool as version 3.7.4 is available. To select this you will need to include this line in your Condor job description file:
python_version = python_3.7.4

To use Python 3 to run Python scripts on the server, use the command 'python3' instead of 'python'.
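Note that a script submitted this way must of course be valid Python 3. If you want the same script to run under both Python 2 and Python 3 (for example while migrating), one common approach, shown here as a suggestion rather than a Condor requirement, is to use the print() function form in both:

# Makes print() behave as a function under Python 2.7 as well as Python 3
from __future__ import print_function

import numpy as np
print(np.__version__)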
Saving multiple arrays to a single file
In the interests of simplicity, the above example saved each input array to its own file so that there were two input files for each job process. For larger numbers of arrays this is likely to become tiresome and, fortunately, it is possible to save multiple arrays to a single file using numpy.savez. For further information, refer to the numpy API Reference.
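As a brief sketch of how this works (the inputs0.npz file name and the A and B keyword names below are only examples), the two input matrices for a job could be stored in, and restored from, a single file:

import numpy as np

A = np.random.rand(1000, 1000)
B = np.random.rand(1000, 1000)

# Save both arrays into a single archive file; the keyword
# arguments become the keys used to retrieve the arrays later
np.savez("inputs0.npz", A=A, B=B)

# Load them back; np.load returns a dictionary-like archive object
data = np.load("inputs0.npz")
A = data["A"]
B = data["B"]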
A word about compatibility
The above example has been found to work without problems between Python 2 and Python 3, and between the Linux and Windows operating systems used on the Condor server and the Condor pool PCs respectively. According to the numpy documentation though, it is possible that np.save may use pickle, which is known to have compatibility problems between Python 2 and 3. If you are likely to switch between versions it is probably best to disallow the use of pickle with allow_pickle=False:
numpy.save("myoutputfile.npy", MyArray, allow_pickle=False)For more information refer to the numpy API Reference.