Running concurrent MATLAB jobs under Condor

Contents

Introduction
Creating the M-files
Creating the standalone application
Creating the Condor files
Running the Condor Jobs
Moving on to other applications
Other sources of information


   Introduction

Condor is well suited to running large numbers of Matlab jobs concurrently. If the application applies the same kind of analysis to large data sets (so-called "embarrassingly parallel" applications) or carries out similar calculations based on different random initial data (e.g. applications based on Monte Carlo methods), Condor can significantly reduce the time needed to generate the results by processing the data in parallel on different hosts. In some cases, simulations and analyses that would have taken years on a single PC can be completed in a matter of days.

The application will need to perform three main steps:

  1. Create the initial input data and store it to file.
  2. Process the input data using the Condor pool and write the output to file.
  3. Collate the data in the output files to generate the results.

The usual way of executing Matlab jobs is to create an M-file and then run it through the Matlab interpreter. This poses problems for a parallel implementation under Condor, since each running job would need a Matlab licence to be checked out, which is inefficient. To circumvent this, the Matlab M-file needs to be compiled into a standalone application which can run without the Matlab interpreter and without the need for a Matlab licence. This is described later.


   Creating the M-files

As a trivial example, to see how Condor can be used to run Matlab jobs in parallel, consider the case where we want to form the sum of p matrix-matrix products, i.e. calculate

   C = A1*B1 + A2*B2 + ... + Ap*Bp

where the Ai, Bi and C are square matrices of order n. It is easy to see that the p matrix products could be calculated independently and therefore potentially in parallel.
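For reference, a purely serial Matlab version of this calculation might look like the short sketch below (the function name serial_sum is just illustrative; as in the M-files that follow, the matrices are filled with random values):

function C = serial_sum(n, p)
  % Form C = A1*B1 + A2*B2 + ... + Ap*Bp with random matrices of order n
  C = zeros(n);
  for i=1:p
    A = rand(n,n);
    B = rand(n,n);
    C = C + A*B;
  end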

In the first step we need to store the A and B matrices to Matlab data files which can be distributed by Condor for processing. Each Condor job will then need to read its A and B from file, form the product AB and write the output to another Matlab data file. The final step will sum all of the products read from the output files to form the complete sum C.

The first step can be accomplished using an M-file such as the one below (initialise.m):

function initialise(n)
  % Generate ten pairs of random matrices of order n and save each
  % pair (together with its index) to input0.mat .. input9.mat
  for index=0:9
    A=rand(n,n);
    B=rand(n,n);
    filename=strcat('input',int2str(index));
    save( filename, 'A', 'B', 'index');
  end

The elements of A and B are given random initial values and are saved to files using the Matlab 'save' command. Condor needs the input files to be indexed from 0 to p-1, so the above code generates ten input files named input0.mat, input1.mat .. input9.mat. The M-file also saves the index variable to file as this will be needed by the Condor job. This M-file can be run on the Sun UNIX service or the Condor server (which also has Matlab available) to generate the inputs prior to submitting the Condor job.
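For example, to generate the ten input files for matrices of order 1000 (an illustrative value only), enter the following at the Matlab prompt:

>> initialise(1000)

This writes input0.mat .. input9.mat to the current directory.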

The second script will need to form the matrix-matrix products and will eventually be run as a standalone application on the teaching centre machines. A suitable M-file (product.m) is:

function product
  % Rename whichever input file Condor has transferred to a fixed name
  % (the index is not known until the data have been loaded)
  system( 'rename input*.mat input.mat' );
  load input.mat;
  C=A*B;
  % Use the index saved by initialise.m to give the output file a unique name
  filename = strcat( 'output', int2str( index ) );
  save( filename, 'C' );
Note that the function name must be the same as the M-file name (minus the extension).

Since the same executable is used for each parallel Condor job, the job will not know the specific input filename it has to deal with. The DOS 'rename' command (issued using Matlab's 'system' command) is therefore used to rename the input file to a fixed name prior to the data being loaded (note that the index is not known at this stage). Once the matrices have been loaded and the product formed, the output is written to file. The index is used to give the file a unique name and ensures that the output files are named output0.mat, output1.mat, .. output9.mat corresponding to the inputs input0.mat, input1.mat, .. input9.mat.

The final step is to collect all of the output files together to form the sum. This can be achieved using another M-file (collect.m) such as this:

function S = collect(n)
  % Accumulate the sum of the partial products returned by the Condor jobs
  S = zeros(n);
  for index=0:9
    filename = strcat( 'output', int2str( index ) );
    load( filename );
    S=S+C;
  end

This loads each output file in turn and accumulates the sum of the matrix-matrix products in the S variable, which is returned as the result.
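Once all ten output files are present, the final result can then be obtained at the Matlab prompt with, for example (the matrix order must match the one used in initialise.m):

>> C = collect(1000);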


   Creating the standalone application

As indicated earlier, the M-file used by the Condor job needs to be compiled into a standalone application. Much more information can be found on this in the 'Matlab Compiler' section of the Matlab help system, but suffice it to say here that this Matlab command will generate the standalone application files:

>> mcc -mv product.m 
If the main M-file calls other functions stored in different M-files then these should be listed after the main M-file. Alternatively the -a option can be used to specify the directory they are held in. Note that before using the compiler for the first time it needs to be configured using:
>> mbuild -setup 
The Matlab built-in compiler can be selected or, if you have a third-party compiler installed such as Visual Studio, that can be used instead.
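For instance, if product.m called helper functions held in util1.m and util2.m (hypothetical file names used purely for illustration), the compilation step could take either of these forms:

>> mcc -mv product.m util1.m util2.m
>> mcc -mv product.m -a ./helpers

where ./helpers is a directory containing the additional M-files.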

The standalone application actually consists of two files: the binary executable code (.exe file) and an XML file called a manifest which helps the executable locate various run-time libraries. The manifest should be downloaded from the link below:

http://www.liv.ac.uk/csd/escience/condor/matlab/manifest.txt

You will need to right click on the link, click "save link as..." and save the file in the same folder/directory as the standalone executable under the filename <myexecutable>.exe.manifest, where <myexecutable> is the name assigned to the standalone application ("product" in this case). An example manifest can also be found on the Condor server in the examples directory (/opt1/condor/examples/matlab2009).

It is well worth testing the standalone application before running it under Condor, as Condor run-time errors can be very difficult to debug. To do this, copy the executable file and the manifest to an empty directory/folder along with any input files and run it under the DOS shell. To start a DOS shell window, click on the "Command Prompt" icon on the desktop. Then move to the folder containing the standalone executable (use the cd command).

The application needs access to the Matlab run-time libraries in order to run. If you have installed Matlab using the MWS these will already have been installed on the hard disk (C: drive). To access them, set the DOS path thus:

set path=c:\matlab2009\bin\win32;%path%

The standalone application can then be tested from the command line by entering:

product
This should create the required output file(s) which can be examined by loading them into Matlab.
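For example, if the test directory contained input0.mat then running product should produce output0.mat, which can be inspected at the Matlab prompt with something like:

>> load('output0.mat')
>> size(C)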

The executable file and manifest should then be transferred to your Condor "home" directory on the Condor server (i.e. /condor_data/<your_username>). To transfer files between your PC and the Condor server, using CoreFTP Lite is recommended. On the MWS this can be found in the Internet section of the "Install University Applications" item on the Start menu. Log into condor.liv.ac.uk using your normal username and password and ensure that the SSH/SFTP box is ticked. Help on Core FTP Lite is available on the CSD website at:

http://www.liv.ac.uk/csd/mobile/myfilestore/coreftp.htm.

To start an interactive login session on the Condor server, PuTTY is recommended when using the MWS. To install this go to the Internet section of the "Install University Applications" item on the Start menu. Log into condor.liv.ac.uk using your normal username and password and ensure that the SSH box is ticked. Help on PuTTY can be found at:

http://www.liv.ac.uk/csd/mobile/myfilestore/putty.htm


   Creating the Condor files

Each Condor job needs a submission file to describe how the job should be run. For this example a submission file such as the one below can be used (product.sub):

universe = vanilla
transfer_files=always
requirements = ( Arch=="Intel") && ( OpSys=="WINNT61" )
transfer_input_files = product.exe,product.exe.manifest,input$(PROCESS).mat
transfer_output_files = output$(PROCESS).mat
executable = prod.bat
output = product$(PROCESS).out
log = product$(PROCESS).log
error = product$(PROCESS).err
notification = Error
queue 10

The $(PROCESS) macro takes on the values 0 to 9 for the different Condor processes run on the teaching centre machines, so that each job receives one input file and the corresponding output file is returned to the Condor submit host. The input and output file lists can be modified to suit other applications. For testing, the transfer_output_files line can be omitted so that all of the output files are returned (the default). For production runs the output files should always be specified so that, if a run-time problem means they are not created, Condor will place the job in the held ('H') state. To release these jobs and run them elsewhere use:

$ condor_release -all

To find out why jobs have been held use:

$ condor_q -held
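
The same template scales to a different number of jobs: only the queue statement and the index ranges used in initialise.m and collect.m need to change. For a hypothetical 100-job run the submit file would end with:

queue 100

and initialise.m would need to generate input0.mat .. input99.mat.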

A DOS batch (.bat) program is used to set up the environment for the Matlab standalone application to run. This was specified in the executable line in the submission file and is shown below (prod.bat):

set path=c:\matlab2009\bin\win32
product
The first line sets the path to the Matlab run-time libraries (note that these are pre-installed on the hard disk (C: drive) of the Condor pool machines). The second line runs the standalone application.

Before running the Condor job, it is worth checking that the directory contains the required files, namely:

initialise.m
The M-file used to generate the input data files.
collect.m
The M-file used to post-process the output data files.
product.exe
Binary executable code containing the standalone application.
product.exe.manifest
XML file which points Matlab to the run-time libraries.
input0.mat .. input9.mat
The input data files generated using the initialise.m M-file.
product.sub
The Condor job submission script specifying the input and output files and executable to be run.
prod.bat
A DOS batch file run by Condor on the teaching centre machines which sets up the environment for the standalone application.

   Running the Condor Jobs

The Condor jobs are submitted by logging into the Condor server (condor.liv.ac.uk) and using the command:

$ condor_submit product.sub

It should return with something like:

Submitting job(s).......... 
Logging submit event(s).......... 
10 job(s) submitted to cluster 536261.
You can monitor the progress of all of your jobs using:
$ condor_q your_unix_username
Initially the Condor jobs will remain in the idle state until machines become available, e.g.

smithic(ulgp4)matlab$ condor_q smithic 
-- Submitter: root@ulgp4.liv.ac.uk : <138.253.100.177:65351> : ulgp2.liv.ac.uk 
 ID        OWNER      SUBMITTED     RUN_TIME ST PRI SIZE CMD
536261.0   smithic    3/21 12:15   0+00:00:00 I  0   0.0  prod.bat
536261.1   smithic    3/21 12:15   0+00:00:00 I  0   0.0  prod.bat
536261.2   smithic    3/21 12:15   0+00:00:00 I  0   0.0  prod.bat
536261.3   smithic    3/21 12:15   0+00:00:00 I  0   0.0  prod.bat
536261.4   smithic    3/21 12:15   0+00:00:00 I  0   0.0  prod.bat
536261.5   smithic    3/21 12:15   0+00:00:00 I  0   0.0  prod.bat
536261.6   smithic    3/21 12:15   0+00:00:00 I  0   0.0  prod.bat
536261.7   smithic    3/21 12:15   0+00:00:00 I  0   0.0  prod.bat
536261.8   smithic    3/21 12:15   0+00:00:00 I  0   0.0  prod.bat
536261.9   smithic    3/21 12:15   0+00:00:00 I  0   0.0  prod.bat

10 jobs; 10 idle, 0 running, 0 held
There may only appear to be a few machines in the pool on occasion as machines are only woken up on demand once jobs are submitted to the queue. If you submit a large number of jobs you should find that after an hour or so the pool size has increased dramatically to meet the demand.

Once the jobs have completed (as indicated by the lack of any processes in the queue) the directory should contain ten output files named output0.mat, output1.mat, .. output9.mat which can be processed using the collect.m M-file to generate the final result. There will also be ten log files (*.log), ten stdout files (*.out) and ten error files (*.err). These are not of any great interest but can be useful in tracking down problems when things have gone wrong. The log files can be useful in finding when and where each process ran and how long it actually took.
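Since the output files are returned to the submit directory on the Condor server (which has Matlab available), the final step can also be run there non-interactively if preferred. A sketch, assuming matrices of order 4000 as in the benchmark below and that collect.m is in the current directory:

$ matlab -nodisplay -r "S = collect(4000); save('result.mat','S'); exit"

This leaves the final sum in result.mat.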

This is of course a very artificial example and even for large matrices the overall execution time is likely to be dominated by the time taken to read the input data from disk (disk access is orders of magnitude slower than in-memory computation). However, as a very rough benchmark, with matrices of order n=4000, a serial implementation required approximately 86 minutes of wall clock time whereas the Condor implementation required around 8 minutes - pretty much a perfect linear speed-up across the ten jobs.


   Moving on to other applications

The files presented above can be used as templates in order to get your own Matlab applications to run under Condor. The following series of steps is suggested as a way of tackling the problem.

  1. Determine which part of the Matlab application is taking the majority of the compute time (the Matlab 'etime' function is useful here) and place this code into a well defined function.
  2. Create an M-file for the function above (say process.m). Create two other M-files for the code to be executed before process.m (say initialise.m) and after (say collect.m).
  3. Configure the M-files so that process.m reads its input variables from file and writes its output to file. It should now be possible to run the three M-files independently (initialise.m followed by process.m followed by collect.m). A skeleton for process.m is sketched after this list.
  4. On the MWS, create a standalone application for the process.m M-file.
  5. Copy the manifest and rename it according to your own application (e.g. myapp.exe.manifest).
  6. Test out the standalone application from the DOS command line.
  7. When the standalone application seems to be working OK, copy it (including the manifest) to the Condor server under /condor_data/<your_username>.
  8. Create the .sub and .bat files for Condor using the ones shown above for guidance.
  9. Create the input data files using MATLAB on the Condor server using initialise.m or upload your own input data.
  10. Login to the Condor submit host and submit the Condor jobs.
  11. When all of the Condor jobs have completed run the collect.m M-file to collect the output data and create the results.
  12. Steps 9-11 can be repeated for different input data sets. The UNIX ned editor provides a convenient way of tailoring files such as the job submission file and batch file.
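
As a concrete illustration of steps 2 and 3, the skeleton below shows the general shape that process.m usually takes; the variable names x and result and the function some_calculation are placeholders for your own code:

function process
  % Rename whichever input file Condor has transferred to a fixed name
  % (the index is not known until the data have been loaded)
  system( 'rename input*.mat input.mat' );
  load input.mat;                           % restores x and index
  result = some_calculation( x );           % the compute-intensive part (placeholder)
  filename = strcat( 'output', int2str( index ) );
  save( filename, 'result' );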

Some research applications that have made use of Matlab jobs run under Condor are:


   Other sources of information

The Condor manual contains pretty much everything there is to know about Condor. Although there is far too much information for the casual user to take in, the User's Manual (sections 2.1 - 2.7) is well worth reading, as are some of the Condor Reference Manual (man) pages (see particularly the descriptions of condor_submit, condor_q, condor_status, condor_rm, condor_vacate and condor_release). You can find links to slides and a paper describing our experiences on the main CSD Condor page. Any queries regarding Condor can be addressed to Ian C. Smith in CSD: e-mail i.c.smith@liverpool.ac.uk.