Condor is well suited to running large numbers of Matlab jobs concurrently. If the application applies the same kind of analysis to large data sets (so-called "embarrassingly parallel" applications) or carries out similar calculations based on different random initial data (e.g. applications based on Monte Carlo methods), Condor can significantly reduce the time needed to generate the results by processing the data in parallel on different hosts. In some cases, simulations and analyses that would have taken years on a single PC can be completed in a matter of days.
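The application will need to perform three main steps: create the input data files, process them in parallel as Condor jobs, and collect the output files to form the final result.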
The usual way of executing Matlab jobs is to create an M-file and then run it through the Matlab interpreter. This causes problems for a parallel implementation using Condor: each running job would require a licence to be checked out, and the approach is inefficient. To circumvent this, the M-file needs to be compiled into a standalone application which can run without the Matlab interpreter and without the need for a Matlab licence. This is described later.
As a trivial example of how Condor can be used to run Matlab jobs in parallel, consider the case where we want to form the sum of p matrix-matrix products, i.e. calculate C where:
$$C = \sum_{i=0}^{p-1} A_i B_i$$
and each A_i and B_i, and C, are square matrices of order n. It is easy to see that the p matrix products could be calculated independently and therefore potentially in parallel.
In the first step we need to store A and B to Matlab data files which can be distributed by Condor for processing. Each Condor job will then need to read A and B from file, form the product AB and write the output to another Condor data file. The final step will sum all of the partial sums read from the output files to form the complete sum C.
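The decomposition described above can be sketched as a small serial reference in Matlab. This is only an illustration, not part of the Condor workflow: the values of n and p are arbitrary small choices, and the matrices are held in 3-D arrays rather than in files.

```matlab
% Serial reference for the sum of p matrix-matrix products.
% n and p are illustrative values, not those used in the Condor example.
n = 4;                      % matrix order
p = 3;                      % number of products
A = rand(n, n, p);          % A(:,:,i) holds the i-th matrix
B = rand(n, n, p);

% Each partial product depends only on its own pair of matrices,
% so the p iterations could run in any order or on different hosts.
P = zeros(n, n, p);
for i = 1:p
    P(:,:,i) = A(:,:,i) * B(:,:,i);
end

% The final step is a plain sum of the partial results.
C = sum(P, 3);
```

In the Condor version, each loop iteration becomes one job and the final sum becomes the collection step.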
The first step can be accomplished using an M-file such as the one below (initialise.m):
    function initialise(n)
    for index=0:9
        A=rand(n,n);
        B=rand(n,n);
        filename=strcat('input',int2str(index));
        save( filename, 'A', 'B', 'index');
    end
The elements of A and B are given random initial values and are saved to files using the Matlab 'save' command. Condor needs the input files to be indexed from 0:p-1 so the above code generates ten input files named input0.mat, input1.mat .. input9.mat. The M-file also saves the index variable to file as this will be needed by the Condor job. This M-file can be run on the Sun UNIX service or the Condor server (which also has Matlab available) to generate the inputs prior to submitting the Condor job.
The second script will need to form the matrix-matrix products and will eventually be run as a standalone application on the teaching centre machines. A suitable M-file (product.m) is:
    function product
    system( 'rename input*.mat input.mat' );
    load input.mat;
    C=A*B;
    filename = strcat( 'output', int2str( index ) );
    save( filename, 'C' );

Note that the function name must be the same as the M-file name (minus the extension).
Since the same executable is used for each parallel Condor job, the job will not
know the specific input filename it has to deal with.
The DOS 'rename' command (issued using Matlab's 'system' command) is therefore used
to rename the input file prior to the data being loaded
(note that the index is not known at this stage).
Once the matrices have been loaded and the product formed, the output is written to file.
The index is used to give the file a unique name and ensures that the output files are named
output0.mat, output1.mat, .. output9.mat
corresponding to the inputs input0.mat, input1.mat, .. input9.mat.
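As an alternative to the DOS rename, the input file could be located from within Matlab using the dir function. The variant below is a sketch only: product_dir is a hypothetical name, it assumes exactly one file matching input*.mat has been transferred to the job's working directory, and it has not been tested as a compiled application on the Condor pool.

```matlab
function product_dir
% Sketch of product.m without the DOS 'rename' step (hypothetical variant).
% Assumes exactly one input*.mat file is present in the working directory.
listing = dir('input*.mat');
if numel(listing) ~= 1
    error('product_dir:inputCount', ...
          'Expected one input file, found %d.', numel(listing));
end
data = load(listing(1).name);        % struct with fields A, B and index
C = data.A * data.B;                 %#ok<NASGU> saved by name below
save(strcat('output', int2str(data.index)), 'C');
end
```

Loading into a struct avoids relying on variables appearing implicitly in the workspace, and the explicit check gives a clearer error if the input file was not transferred.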
The final step is to collect all of the output files together to form the sum. This can be achieved using another M-file (collect.m) such as this:
    function collect(n)
    S = zeros(n);
    for index=0:9
        filename = strcat( 'output', int2str( index ) );
        load( filename );
        S=S+C;
    end
This loads each output file in turn and forms the sum of the matrix-matrix products in the S variable.
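If some jobs fail, the corresponding output files will be missing and a plain load would stop with an error. A more defensive collection step might look like the sketch below. The function name collect_checked and its second argument p (the number of Condor processes) are assumptions made for illustration and are not part of the original example.

```matlab
function [S, missing] = collect_checked(n, p)
% Sketch of a collect step that tolerates missing output files.
% n is the matrix order, p the number of Condor processes (hypothetical argument).
S = zeros(n);
missing = [];
for index = 0:p-1
    filename = strcat('output', int2str(index), '.mat');
    if exist(filename, 'file') ~= 2
        missing(end+1) = index; %#ok<AGROW> record the gap and carry on
        continue
    end
    data = load(filename, 'C');      % load only the variable we need
    S = S + data.C;
end
if ~isempty(missing)
    warning('collect_checked:missingOutputs', ...
            '%d output file(s) missing.', numel(missing));
end
end
```

For the example in this section it would be called as [S, missing] = collect_checked(n, 10); any indices returned in missing identify the jobs that need to be re-run.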
As indicated earlier, the M-file used by the Condor job needs to be compiled into a standalone application. Much more information can be found on this in the 'Matlab Compiler' section of the Matlab help system, but suffice it to say here that this Matlab command will generate the standalone application files:
    >> mcc -mv product.m

If the main M-file calls other functions stored in different M-files then these should be listed after the main M-file. Alternatively the -a option can be used to specify the directory they are held in. Note that before using the compiler for the first time it needs to be configured using:
    >> mbuild -setup

The Matlab built-in compiler can be selected or, if you have a third party compiler installed, such as Visual Studio, that can be used instead.
The standalone application actually consists of two files:
the binary executable code (.exe file) and a
file called a manifest, which is in XML format and helps the executable locate various
run-time libraries. The manifest should be downloaded from the link below:
You will need to right click on the link, click "save link as..." and save the file in the same folder/directory as the standalone executable under the filename <myexecutable>.exe.manifest, where <myexecutable> is the name assigned to the standalone application ("product" in this case). An example manifest can also be found on the Condor server in the examples directory (/opt1/condor/examples/matlab2009).
It is well worth testing the standalone application first before running it using Condor as Condor run-time errors can be very difficult to debug. To do this, copy the executable file and the manifest to an empty directory/folder along with any input files and run it under the DOS shell. To start a DOS shell window, click on the "Command Prompt" icon on the desktop. Then move to the folder containing the standalone executable (use the cd command).
The application needs access to the Matlab run-time libraries in order to run. If you have installed Matlab using the MWS these will already have been installed on the hard disk (C: drive). To access them, set the DOS path thus:
The standalone application can then be tested from the command line by entering:
    product

This should create the required output file(s), which can be examined by loading them into Matlab.
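One way to examine the result is to recompute the product in Matlab and compare it with the saved output. The helper below is a sketch under stated assumptions: check_product is a hypothetical name, and the file names passed to it depend on the test run (after product has run, the input file will have been renamed to input.mat).

```matlab
function relErr = check_product(inputFile, outputFile)
% Sketch: compare a job's saved result with a product recomputed in Matlab.
% check_product is a hypothetical helper, not part of the original example.
in  = load(inputFile);               % expects variables A and B
out = load(outputFile);              % expects variable C
expected = in.A * in.B;
% Relative error in the Frobenius norm; guard against a zero denominator.
relErr = norm(out.C - expected, 'fro') / max(norm(expected, 'fro'), eps);
end
```

For example, relErr = check_product('input.mat', 'output0.mat') should return a value at or very close to zero if the standalone application ran correctly.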
The executable file and manifest should then be transferred to
your Condor "home" directory on the Condor server
(i.e. /condor_data/<your_username>). To transfer files
between your PC and the Condor server, Core FTP Lite is recommended.
On the MWS this can be found in the Internet section of the
"Install University Applications" item on the Start menu.
Log into condor.liv.ac.uk using your normal username and password
and ensure that the SSH/SFTP box is ticked. Help on
Core FTP Lite is available on the CSD website at:
To start an interactive login session on the Condor server, PuTTY is recommended when using the MWS. To install
this, go to the Internet section of the "Install University Applications" item on the Start menu.
Log into condor.liv.ac.uk using your normal username and password
and ensure that the SSH box is ticked. Help on PuTTY can be found at:
Each Condor job needs a submission file to describe how the job should be run. For this example a submission file such as the one below can be used (product.sub):
    universe = vanilla
    transfer_files = always
    requirements = ( Arch=="Intel") && ( OpSys=="WINNT61" )
    transfer_input_files = product.exe,product.exe.manifest,input$(PROCESS).mat
    transfer_output_files = output$(PROCESS).mat
    executable = prod.bat
    output = product$(PROCESS).out
    log = product$(PROCESS).log
    error = product$(PROCESS).err
    notification = Error
    queue 10
The $(PROCESS) macro takes on the values 0:9 for different Condor processes
run on the teaching centre machines so that each receives one input file and
the corresponding output files are returned to the Condor submit host. The input
and output file lists can be modified to suit other applications. For testing the transfer_output_files
line can be omitted so that all of the output files are returned (the default). For production
runs, the output files should always be specified just in case there is a run-time problem and they
are not created. In this case Condor will place the job in the held ('H') state. To release these jobs
and run them elsewhere use:
    $ condor_release -all
To find out why jobs have been held use:
$ condor_q -held
A DOS batch (.bat) program is used to set up the environment for the Matlab standalone application to run. This was specified in the executable line in the submission file and is shown below (prod.bat):
    set path=c:\matlab2009\bin\win32
    product

The first line sets the path to the Matlab run-time libraries (note that these are pre-installed on the hard disk (C: drive) of the Condor pool machines). The last command actually runs the standalone application.
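Before running the Condor job, it is worth checking that the directory contains the required files, namely: product.exe, product.exe.manifest, prod.bat, product.sub and the input files input0.mat to input9.mat.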
The Condor jobs are submitted by logging into the Condor server (condor.liv.ac.uk) and using the command:
$ condor_submit product.sub
It should return with something like:
    Submitting job(s)..........
    Logging submit event(s)..........
    10 job(s) submitted to cluster 536261.

You can monitor the progress of all of your jobs using:
    $ condor_q your_unix_username

Initially the Condor jobs will remain in the idle state until machines become available, e.g.
    smithic(ulgp4)matlab$ condor_q smithic

    -- Submitter: email@example.com : <188.8.131.52:65351> : ulgp2.liv.ac.uk
     ID        OWNER    SUBMITTED    RUN_TIME    ST PRI SIZE CMD
     536261.0  smithic  3/21 12:15   0+00:00:00  I  0   0.0  prod.bat
     536261.1  smithic  3/21 12:15   0+00:00:00  I  0   0.0  prod.bat
     536261.2  smithic  3/21 12:15   0+00:00:00  I  0   0.0  prod.bat
     536261.3  smithic  3/21 12:15   0+00:00:00  I  0   0.0  prod.bat
     536261.4  smithic  3/21 12:15   0+00:00:00  I  0   0.0  prod.bat
     536261.5  smithic  3/21 12:15   0+00:00:00  I  0   0.0  prod.bat
     536261.6  smithic  3/21 12:15   0+00:00:00  I  0   0.0  prod.bat
     536261.7  smithic  3/21 12:15   0+00:00:00  I  0   0.0  prod.bat
     536261.8  smithic  3/21 12:15   0+00:00:00  I  0   0.0  prod.bat
     536261.9  smithic  3/21 12:15   0+00:00:00  I  0   0.0  prod.bat

    10 jobs; 10 idle, 0 running, 0 held

There may only appear to be a few machines in the pool on occasion, as machines are only woken up on demand once jobs are submitted to the queue. If you submit a large number of jobs you should find that after an hour or so the pool size has increased dramatically to meet the demand.
Once the jobs have completed (as indicated by the lack of any processes in the queue) the directory should contain ten output files named output0.mat, output1.mat, .. output9.mat which can be processed using the collect.m M-file to generate the final result. There will also be ten log files (*.log), ten stdout files (*.out) and ten error files (*.err). These are not of any great interest but can be useful in tracking down problems when things have gone wrong. The log files can be useful in finding when and where each process ran and how long it actually took.
This is of course a very artificial example and even for large matrices the overall execution time is likely to be dominated by the time taken to read the input data from disk (disk access is on the order of a million times slower than CPU speed). However, as a very rough benchmark, with matrices of order n=4000, a serial implementation required approximately 86 minutes of wall clock time whereas the Condor implementation required around 8 minutes, which is close to a perfect linear speed-up for ten jobs.
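The speed-up implied by these timings can be worked out directly. The short calculation below uses only the approximate figures quoted above and the ten jobs from the submission file; the variable names are illustrative.

```matlab
% Speed-up implied by the quoted (approximate) timings.
tSerial = 86;                        % minutes, serial run
tCondor = 8;                         % minutes, Condor run
p = 10;                              % number of Condor jobs
speedup = tSerial / tCondor;         % 10.75
efficiency = speedup / p;            % 1.075
fprintf('speed-up %.2f, efficiency %.3f\n', speedup, efficiency);
```

An efficiency slightly above 1 most likely reflects the rounding in the quoted timings rather than genuine super-linear scaling.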
The files presented above can be used as templates in order to get your own Matlab applications to run under Condor. The following series of steps is suggested as a way of tackling the problem.
Some research applications that have made use of Matlab jobs run under Condor are: