High Throughput Computing using Condor

Running MATLAB Applications under Condor

NOTE: These instructions make use of the new local Condor job submission tools. If you are not a member of the University of Liverpool and are visiting this site for information on how to run MATLAB jobs under Condor, please refer to these historical instructions. Those in a hurry can skip to the quick summary section.


All of the files for this example can be found on condor.liv.ac.uk in
/opt1/condor/examples/matlab

Contents

Introduction
Creating the M-files
Testing the M-file
Creating the standalone application
Creating the simplified job submission file
Running the Condor Jobs
Moving on to other applications
Quick Summary
Using MATLAB on the chadwick HPC cluster

Introduction

Condor is well suited to running large numbers of MATLAB jobs concurrently. If the application applies the same kind of analysis to large data sets (so-called "embarrassingly parallel" applications) or carries out similar calculations based on different random initial data (e.g. applications based on Monte Carlo methods), Condor can significantly reduce the time needed to generate the results by processing the data in parallel on different hosts. In some cases, simulations and analyses that would have taken years on a single PC can be completed in a matter of days.

The application will need to perform three main steps:

  1. Create the initial input data and store it to files which can be processed in parallel.
  2. Process the input data in parallel using the Condor pool and write the outputs to corresponding files.
  3. Collate the data in the output files to generate the overall results.

The usual way of running MATLAB applications is to create an M-file and then to run this through the MATLAB interpreter. This poses problems for a parallel implementation using Condor as each running job will require a licence to be checked out. To avoid this, the MATLAB M-file can be compiled into a standalone application which can run without the MATLAB interpreter and without the need for a MATLAB licence.


Creating the M-files

As a trivial example, to see how Condor can be used to run MATLAB jobs in parallel, consider the case where we wish to form the sum of p matrix-matrix products, i.e. calculate C where:

  C = \sum_{i=0}^{p-1} A_i B_i

and each A_i and B_i, and C, are square matrices of order n. It is easy to see that the p matrix products could be calculated independently and therefore potentially in parallel.

In the first step we need to store A and B to MATLAB data files which can be distributed by Condor for processing. Each Condor job will then need to read A and B from file, form the product AB and write the output to another MATLAB data file. The final step will sum all of the products read from the output files to form the complete sum C.

The first step can be accomplished using an M-file such as the one below (initialise.m). This just fills the input matrices with random values; in a more realistic application the data would come from elsewhere:

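function initialise(n)
  % Create ten pairs of random n-by-n input matrices, one pair per Condor job.
  for index=0:9
    A=rand(n,n);
    B=rand(n,n);
    % Files are named input0 .. input9; save appends the .mat extension.
    filename=strcat('input',int2str(index));
    save( filename, 'A', 'B');
  end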

The elements of A and B are given random initial values and are saved to files using the MATLAB 'save' command. Condor needs the input files to be indexed from 0:p-1 so the above code generates ten input files named input0.mat, input1.mat .. input9.mat. This M-file can be run on the Condor server using matlab_run to generate the inputs prior to submitting the Condor job.
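
The loop bounds in initialise.m are fixed at ten files. As a minimal sketch, assuming you would rather pass the number of jobs p as an argument, a variant along the following lines could be used. The name initialise_p and the argument p are hypothetical and are not part of the example files on the server.

```matlab
function initialise_p(n, p)
  % Hypothetical variant of initialise.m: write p input files instead of ten.
  % Each file input<index>.mat holds two random n-by-n matrices A and B.
  for index = 0:p-1
    A = rand(n, n);
    B = rand(n, n);
    filename = sprintf('input%d.mat', index);   % input0.mat .. input<p-1>.mat
    save(filename, 'A', 'B');
  end
end
```

Calling initialise_p(100, 10) produces the same ten file names as initialise(100). The number of Condor jobs submitted later would need to match p.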

The second script will need to form the matrix-matrix products and will eventually be run as a standalone application on the Condor pool. A suitable M-file (product.m) is:

function product
  % Read A and B from input.mat, form their product and write it to output.mat.
  load input.mat;
  C=A*B;
  save( 'output.mat', 'C' );

(There is nothing special about the filenames input.mat and output.mat - they could be called anything but it's a good idea to stick to the MATLAB standard of using .mat as an extension).

Note that the function name must be the same as the M-file name (minus the extension). The M-file does not need to manipulate the filenames to give them unique indexes since this will be taken care of by the job submission tools.
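
To check the M-files together before involving Condor, the indexing can be imitated by hand. The sketch below only assumes the visible effect of the job submission tools (each job sees its own input<index>.mat as input.mat, and its output.mat comes back as output<index>.mat). It is not a description of how the tools are implemented, and run_locally is a hypothetical helper name.

```matlab
function run_locally(p)
  % Hypothetical local stand-in for the per-job file indexing.
  % Assumes product.m is on the MATLAB path and does NOT call quit.
  for index = 0:p-1
    % Present this job's input under the generic name product.m expects.
    copyfile(sprintf('input%d.mat', index), 'input.mat');
    product;                                   % reads input.mat, writes output.mat
    % Give the generic output the index of the job that produced it.
    movefile('output.mat', sprintf('output%d.mat', index));
  end
  if exist('input.mat', 'file') == 2
    delete('input.mat');                       % remove the last temporary copy
  end
end
```

This runs the products one after another on a single machine, so it is only sensible for small test matrices.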

The final step is to collect all of the output files together to form the sum. This can be achieved using another M-file (collect.m) such as this:

function S = collect(n)
  % Sum the ten n-by-n products stored in output0.mat .. output9.mat.
  S = zeros(n);
  for index=0:9
    filename = strcat( 'output', int2str( index ) );
    load( filename );   % defines C
    S=S+C;
  end

This loads each output file in turn and forms the sum of the matrix-matrix products in the S variable.
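
collect.m assumes that all ten output files are present. As a hedged sketch, a more defensive variant could first check for missing files and then save the final sum. The names collect_p and result.mat are hypothetical choices for illustration.

```matlab
function S = collect_p(p)
  % Hypothetical variant of collect.m: sum the products stored in
  % output0.mat .. output<p-1>.mat, stopping early if any file is missing.
  missing = [];
  for index = 0:p-1
    if exist(sprintf('output%d.mat', index), 'file') ~= 2
      missing(end+1) = index;   %#ok<AGROW>
    end
  end
  if ~isempty(missing)
    error('collect_p:missingOutput', ...
          'No output file for job indices: %s', mat2str(missing));
  end

  S = 0;                                           % scalar zero expands to the size of C
  for index = 0:p-1
    data = load(sprintf('output%d.mat', index));   % struct with field C
    S = S + data.C;
  end
  save('result.mat', 'S');                         % keep the final sum for later use
end
```

Stopping with an error rather than returning a partial sum is a design choice: a silently incomplete total is harder to spot than a failed collection step.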


Testing the M-file

MATLAB is available on the Condor server and can be used to do a quick "sanity check" to ensure that the M-file does in fact work as expected. Once an M-file (e.g. product.m) has been uploaded or created on the server along with a suitable input (MAT) file, it can be processed by the MATLAB interpreter using:

$ matlab_run product.m
(This command ensures that the MATLAB graphical interface does not start and that the script is run non-interactively using the command line interface). If the M-file works properly then the required output (MAT) file should have been created. Note that MATLAB on the server should be used sparingly and not for M-files which are likely to require significant CPU use over long periods as this can impact badly on the performance of Condor.

It is also possible to run the M-file directly on the Condor pool by creating a simplified job submission file such as the one below:
M_file = product.m
input_file = input.mat
output_file = output.mat
In this case product.m needs to contain a quit command otherwise the job will never complete. If the job submission file is saved under the file name product (an "extension" is best avoided), then to run the M-file on a PC in the pool use:
$ m_file_submit product
The command will return with the job ID of the M-file job and on completion the output file output.mat should have been created (see below for details of how to monitor the progress of jobs). Note that this should only be used for testing and not as a way of submitting large numbers of jobs since this will place an unnecessary load on the MATLAB licence server.
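
As a minimal sketch of an M-file suitable for this kind of test, the version of product.m below ends the MATLAB session whether or not the calculation succeeds. The name product_test is hypothetical, and the error handling is an assumption about what is useful here rather than a requirement of m_file_submit.

```matlab
function product_test
  % Hypothetical test version of product.m for use with m_file_submit.
  try
    data = load('input.mat');        % struct with fields A and B
    C = data.A * data.B;
    save('output.mat', 'C');
  catch err
    disp(err.message);               % report the problem in the job's output
  end
  quit;                              % end the MATLAB session so the job can complete
end
```

The quit command should be left out of the version that is compiled with matlab_build or called from other M-files in an interactive session, since it would close that session as well.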


Creating the standalone application

As indicated earlier, the M-file used by the Condor job needs to be compiled into a standalone application. You can create this via the Condor server using the following command which will submit a Condor job that compiles the M-file on one of the PCs in the pool:

$ matlab_build product.m
The command will return a job ID and on completion the standalone executable product.exe should have been created (you can monitor the progress by examining the contents of the log file build.log).


Aside:

To create a standalone executable from multiple M-Files, first place the "main" M-file in a directory on the Condor server and create another directory called dependencies below it. Then place all of the other M-files (i.e. the ones containing functions used by the main M-file) in the dependencies directory. Be careful not to include any other files in the dependencies directory or these will be "compiled-in" as well. Once this is in place run:

$ matlab_build <MyMainMfile>
in the directory containing the main M-file. An example of this can be found here.



Creating the simplified job submission file

Each Condor job needs a submission file to describe how the job should be run. These can appear rather arcane to new users and therefore, to help simplify the process, tools have been created which work with more user-friendly job submission files. These are automatically translated into files which Condor understands and which users need not worry about. For the sum of products example, a submission file (product) such as the one below can be used:

indexed_input_files = input.mat
indexed_output_files = output.mat
executable = product.exe
indexed_log = logfile
total_jobs = 10

(Save this file under the filename product - an "extension" is best avoided).

The compiled executable is specified in the executable line. The lines starting indexed_input_files and indexed_output_files specify the input and output files which differ for each individual job. The total number of jobs is given in the total_jobs line. The underlying job submission processes will ensure that each individual job receives the correct input file (input0.mat .. input9.mat) and that the output files are indexed in a corresponding manner to the input files (e.g. output file output1.mat will correspond to input1.mat).

It is also possible to provide a list of non-indexed files (i.e. files that are common to all jobs), for example:

input_files = common1.mat,common2.mat,common3.mat
This is useful if the common (non-indexed) files are relatively large.
Aside:

For testing, the indexed_output_files line can be omitted so that all of the output files are returned (the default). For production runs, the output files should always be specified just in case there is a run-time problem and they are not created. In this case Condor will place the job in the held ('H') state. To release these jobs and run them elsewhere use:

$ condor_release -all

To find out why jobs have been held use:

$ condor_q -held



Running the Condor Jobs

The Condor jobs are submitted from the Condor server using the command:

$ matlab_submit product

It should return with something like:

Submitting job(s).......... 
Logging submit event(s).......... 
10 job(s) submitted to cluster 536261.
You can monitor the progress of all of your jobs using:
$ condor_q your_unix_username
Initially the Condor jobs will remain in the idle state until machines become available, e.g.:

smithic(ulgp5)matlab$ condor_q smithic 
-- Submitter: root@ulgp5.liv.ac.uk : <138.253.100.177:65351> : ulgp2.liv.ac.uk 
 ID        OWNER     SUBMITTED     RUN_TIME ST PRI SIZE CMD
536261.0   smithic   3/21 12:15  0+00:00:00 I  0   0.0  prod.bat
536261.1   smithic   3/21 12:15  0+00:00:00 I  0   0.0  prod.bat
536261.2   smithic   3/21 12:15  0+00:00:00 I  0   0.0  prod.bat
536261.3   smithic   3/21 12:15  0+00:00:00 I  0   0.0  prod.bat
536261.4   smithic   3/21 12:15  0+00:00:00 I  0   0.0  prod.bat
536261.5   smithic   3/21 12:15  0+00:00:00 I  0   0.0  prod.bat
536261.6   smithic   3/21 12:15  0+00:00:00 I  0   0.0  prod.bat
536261.7   smithic   3/21 12:15  0+00:00:00 I  0   0.0  prod.bat
536261.8   smithic   3/21 12:15  0+00:00:00 I  0   0.0  prod.bat
536261.9   smithic   3/21 12:15  0+00:00:00 I  0   0.0  prod.bat

10 jobs; 10 idle, 0 running, 0 held

The overall state of the pool can be seen using the command:
$ condor_status

Once the jobs have completed (as indicated by the lack of any jobs in the queue) the directory should contain ten output files named output0.mat, output1.mat, .. output9.mat which can be processed using the collect.m M-file to generate the final result. There will also be ten log files (logfile*) which are not of any great interest but can be useful in tracking down problems when things have gone wrong and are also useful in finding out when and where each job ran and for how long.
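
Before collecting the results of a small test run, it can be reassuring to confirm that each returned product matches one recomputed locally. The sketch below assumes the input and output files are in the current directory and use the variable names A, B and C from the example. The function name check_outputs is hypothetical.

```matlab
function maxerr = check_outputs(p)
  % Hypothetical sanity check: recompute each product locally and compare it
  % with the matrix C returned by the corresponding Condor job.
  % Returns the largest relative error (Frobenius norm) over all p jobs.
  maxerr = 0;
  for index = 0:p-1
    in  = load(sprintf('input%d.mat',  index));   % fields A and B
    out = load(sprintf('output%d.mat', index));   % field C
    expected = in.A * in.B;
    relerr = norm(out.C - expected, 'fro') / max(norm(expected, 'fro'), eps);
    maxerr = max(maxerr, relerr);
  end
end
```

A small non-zero value is to be expected, since the pool PCs and the server may round floating-point results slightly differently. Recomputing every product defeats the purpose of using Condor, so this is only intended for small test cases.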

Note that matlab_submit will automatically create a "manifest" file so there is no need to provide one as was the case previously. The job submission file can be edited using the ned editor, however all of the options in it can be overridden temporarily from the command line (the file itself is not changed). For example to submit five jobs instead of ten use:

$ matlab_submit product -total_jobs=5
and to also change the executable to be used:
$ matlab_submit product -total_jobs=5 -executable=otherapp.exe
This is useful for making small changes without the need to use the UNIX system editors to change the job submission file.

This is of course a very artificial example and even for large matrices the overall execution time is likely to be dominated by the time taken to read the input data from disk (disk access is on the order of a million times slower than CPU speed). However, as a very rough benchmark, with matrices of order n=4000, a serial implementation required approximately 86 minutes of wall clock time whereas the Condor implementation required around 8 minutes - this is a pretty much perfect linear speed up.
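
As a rough check of that claim, and assuming the benchmark used the ten jobs of this example (the text does not say so explicitly), the speed-up S and the parallel efficiency E follow directly from the two quoted timings:

```latex
S = \frac{T_{\text{serial}}}{T_{\text{Condor}}}
  \approx \frac{86\ \text{min}}{8\ \text{min}} \approx 10.8,
\qquad
E = \frac{S}{p} \approx \frac{10.8}{10} \approx 1.08
```

An efficiency slightly above 1 is within the rounding of the approximate timings, so the figures are consistent with linear speed-up rather than evidence of anything better.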


Moving on to other applications

The files presented above can be used as templates in order to get your own MATLAB applications to run under Condor. The following series of steps is suggested as a way of tackling the problem.

  1. Determine which part of the MATLAB application is taking the majority of the compute time (the MATLAB profiler is useful here) and place this code into a well defined function.

  2. Create an M-file for the function above (say process.m). Create two other M-files for the code to be executed before process.m (say initialise.m) and after (say collect.m).

  3. Configure the M-files so that process.m reads its input variables from file and writes its output to file. It should now be possible to run the three M-files independently (initialise.m followed by process.m followed by collect.m).

  4. Test the M-file on the MATLAB server using:
    $ matlab_run M-file_function_name
  5. Test the M-file on the pool using:
    $ m_file_submit m_file_job_description_file
  6. Build the standalone application using:
    $ matlab_build M-file
  7. Submit the standalone jobs using
    $ matlab_submit job_description_file
  8. Step 7 can be repeated for different input data sets.

Some research applications that have made use of MATLAB jobs run under Condor are:


Quick Summary

  1. Make sure that all of your MATLAB code is syntactically correct and - if possible - run it with some test data on a Windows 7 PC (this may seem obvious but it is very easy to make seemingly small changes to M-file code which lead to difficult-to-spot problems later on).
  2. Upload the MATLAB M-files to the Condor server (condor.liv.ac.uk) in a directory under /condor_data/<your username>. If the main M-file calls other M-files, place these called M-files in a separate directory named dependencies below the main M-file directory.
  3. Log in to the Condor server, locate the directory containing the main M-file and run matlab_build <M-file name> to build a standalone executable to be run on the pool PCs.
  4. Create a job description file then submit the concurrent MATLAB jobs using matlab_submit <job_submission_file>. It is a good idea to test the application with a small number of test jobs before attempting to run thousands of "production" jobs.
  5. In case of difficulty refer to the troubleshooting guide.


Appendix: Using MATLAB on the HPC cluster


Although Condor provides a very useful platform for running large numbers of MATLAB jobs, it may be unsuited to some applications where jobs have significant memory or storage requirements (see Condor Limitations). In these cases, use of the ARC HPC cluster (ulgbc5) should be more effective.

The ulgbc5 cluster contains 58 dual-processor, quad-core, 2.4 GHz nodes. Each node has 8 GB RAM and 200 GB of local disk space. There are also 50 dual-processor, quad-core nodes, each with 32 GB RAM and 73 GB of disk space, and 5.1 TB of network attached storage. The availability of powerful multi-core processors allows jobs to exploit multi-threading parallelism using MATLAB's Parallel Computing Toolbox to speed up the execution of individual jobs.

Tools similar to those available on the Condor server can be used to build and run jobs on the cluster. Unlike on the Condor pool, either standalone executables or M-files may be run, although the compiled standalone version may be quicker. To build a standalone application use the command:

$ matlab_build <M-filename>
The resulting executable will have the same name as the M-file minus the .m suffix. For applications made up of multiple M-files, place the M-files that the main M-file depends on in a directory below it called dependencies (as with Condor). Note that standalone executables built for the MWS Condor Pool cannot be used on the cluster since it runs the Linux operating system rather than Windows.

A single executable can be run on the cluster using the command:

$ matlab_submit <standalone_executable>
You can similarly submit a single M-file job with:
$ matlab_submit <M-file_name>
This will submit a single job to the cluster batch queue which will eventually run (exclusively) on a quad-core processor of one of the compute (slave) nodes.

To see the status of your job use the qstat command e.g.:

$ qstat
job-ID  prior   name       user         state submit/start at     queue    
------------------------------------------------------------------------
 233760 10.54512 run_matlab smithic      qw    05/20/2013 10:29:17                       
The qw state indicates that the job is in the queue and waiting for a free compute node to become available. Once the job starts to run the state will change to r. When the job has completed, it will vanish from the qstat output.

By default qstat only lists jobs submitted from your own account. To see all of the jobs that are queued, including those of other users, the following command is needed:

$ qstat -u '*'

Most users will want to submit a group of jobs together in the same way as with Condor. On the cluster these groups of jobs are called array jobs rather than clusters as in Condor. Array jobs can be submitted easily by first creating a job submission file and using the command:

$ matlab_submit <job_submission_file>

Job submission files for array jobs are very similar to those used for Condor but can only contain these attributes:

M_file
executable
indexed_input_files 
input_files
indexed_output_files
indexed_stdout
indexed_stderr
cores_per_node
total_jobs
(Only the executable and total_jobs attributes are compulsory but realistic applications will at the very least need the input files specifying.)

There are a few important points to be aware of:

  1. Executables in UNIX do not have a .exe "extension" as in Windows. The standalone executable will instead have the same name as the M-file minus the .m part.

  2. The scheduler on the cluster (called Sun Grid Engine) maintains a unique task ID for each individual job. These are numbered 1..N rather than 0..N-1 as in Condor, however the MATLAB job submission tools ensure that the indexed filenames still have indices in the range 0..N-1 to ensure compatibility with Condor.

  3. For compatibility, the indexing scheme for both input and output files is the same as for Condor with the index inserted between filename and "extension" e.g. input0.mat, input1.mat, ... input<N-1>.mat. This is despite the fact that UNIX does not really support filename extensions in the same way as Windows.

  4. The indexed_output_files attribute is optional as with Condor. If it is omitted, then all of the output files will be indexed and returned to the job submission directory. The attribute is useful for avoiding output file "clutter" but, unlike with Condor, there is no mechanism for reporting an error if an output file is missing.

  5. The optional cores_per_node attribute is new and allows users to specify how many processor cores will be assigned to each individual job. This is useful when running jobs with large memory requirements or jobs that make use of multi-threading parallelism. A value of 4 will ensure that each job has exclusive use of a 4-core processor whilst a value of 8 will assign each individual job to a whole dual-processor (8-core) node. The default is to use the whole four cores of one processor.

  6. For jobs that use M-files rather than a standalone executable, place all of the M-files in one directory and specify the main M-file using the M_file attribute in the job submission file (note the underscore rather than a hyphen).

The qstat command can again be used to track the progress of jobs e.g.:

 $ qstat
job-ID  prior   name       user         state submit/start at     queue     slots ja-task-ID 
---------------------------------------------------------------------------------------------
 233766 10.54512 run_matlab smithic      qw    05/20/2013 10:42:17           1      1-19:1                     
The final column now contains the task ID of individual jobs. In the above case all of the 19 individual jobs are queued and waiting (qw state). Later on some jobs will begin to run e.g.:
 $ qstat
job-ID  prior   name       user         state submit/start at     queue              slots ja-task-ID 
-----------------------------------------------------------------------------------------------------
 233766 10.54512 run_matlab smithic      r     05/20/2013 10:42:28 serial@node139      1      1
 233766 10.54512 run_matlab smithic      r     05/20/2013 10:42:38 serial@node211      1      2
 233766 6.62909 run_matlab smithic       qw    05/20/2013 10:42:17                     1      3-19:1
This shows that jobs with task IDs 1 and 2 are running (r state) whilst 3 to 19 are queued and waiting (qw state). The corresponding file IDs of the running jobs will be 0 and 1. If you want to track the output from the jobs as they run, look in the directories called tmp_execute*. In the above example the output from task ID 1 will be in tmp_execute0 and from task ID 2 in tmp_execute1.
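
The off-by-one relationship between task IDs and file indices is easy to get wrong when tracking down the output of a particular job. As a small sketch, the mapping described above can be written as follows. The function name task_to_file_index is hypothetical.

```matlab
function [file_index, tmp_dir] = task_to_file_index(task_id)
  % Hypothetical helper: map a Sun Grid Engine task ID (1..N) to the
  % file index (0..N-1) and to the tmp_execute directory described above.
  file_index = task_id - 1;
  tmp_dir = sprintf('tmp_execute%d', file_index);
end
```

For task ID 2 this gives file index 1 and the directory tmp_execute1, in line with the example above.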

Whereas the Condor schedulers can cope with clusters of 10,000 or more jobs, the Sun Grid Engine scheduler does not scale well to large numbers of jobs and submitting very large numbers of jobs can degrade the job scheduler performance to the detriment of other users. Cluster/array sizes should therefore be limited to no more than a few hundred jobs.

The matlab_run command is also available and can be used to run small M-file scripts on the head node. This should be used extremely sparingly and not as an alternative to running jobs on the compute nodes. Excessive use is likely to put a significant strain on the head node which can degrade the job scheduler performance to the detriment of other users.