High Throughput Computing using Condor

Simplified Job Submission

A number of tools have been developed locally with the aim of making Condor job submission more user-friendly and are described here. One difficulty users face with Condor is creating Condor job submission files (whose syntax can be fairly obscure) and editing these files using the UNIX system editors (which are not exactly known for their ease of use). These tools do not do away with the need for job submission files altogether but strive to make them easier to create and use.

General Purpose Tools

The mws_submit command can be used instead of condor_submit to submit jobs using simplified job submission files, e.g.:

$ mws_submit simplified_description_file
(Where simplified_description_file is the name of the "user-friendly" job submission file).

A typical job submission file might contain:
executable = myapp.exe
input_files = myinput_common, other_input_common
indexed_input_files = input_data, other_input_data
indexed_output_files = output_data
total_jobs = 10
All of the attributes are optional apart from executable, which must be specified (the default for total_jobs is a single job). The executable attribute specifies the main executable file to be run on a Condor pool PC - this will generally be a .bat file or a .exe file. The input_files attribute lists input files which are common to all jobs, whilst indexed_input_files lists input files which are different for each individual job. In this example, each job will get its own input_data file from the set input_data0 ... input_data9 (the same is true for other_input_data).

The indexed_output_files attribute will ensure that the output files are retrieved following the same indexing as the input files (i.e. output_data0 ... output_data9). It is the responsibility of the user's application code to ensure that the output files are named in the correct manner (e.g. output file output_data1 corresponds to input file input_data1). For small test runs, it is useful to omit indexed_output_files so that (by default) all of the output files are returned. For production use though, the use of indexed_output_files is recommended as a way of catching run-time errors where the output files may not have been created.
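Since each job picks up its own indexed input file, those files must all exist before submission. A minimal sketch (assuming a POSIX shell; the file contents are purely illustrative placeholders) that generates the ten indexed input files for the example above:

```shell
#!/bin/sh
# Generate input_data0 .. input_data9 to match total_jobs = 10.
# The contents written here are placeholders; a real run would write
# each job's actual input parameters.
total_jobs=10
i=0
while [ "$i" -lt "$total_jobs" ]; do
    printf 'parameters for job %d\n' "$i" > "input_data$i"
    i=$((i + 1))
done
```

A similar loop run after the jobs complete can be used to check that every indexed output file was in fact returned.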

All of the values given in the job description may be temporarily overridden from the command line (although the job description file is left unchanged). For example to change the number of submitted jobs from ten to five:
$ mws_submit simplified_description_file -total_jobs=5
and to also use a different executable:
$ mws_submit simplified_description_file -total_jobs=5 -executable=otherapp.exe
This makes it easy to make small changes without the need to edit the job description file.

The mws_submit command creates the job description file used by Condor which will have the same name as the simplified job description file but with a .sub extension. The Condor job description file corresponding to the above example is:

universe = vanilla
executable = myapp.exe 
transfer_input_files = myinput_common, other_input_common, input_data$(PROCESS), other_input_data$(PROCESS)
transfer_output_files = output_data$(PROCESS)
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
requirements = ( Arch=="X86_64") && ( OpSys=="WINDOWS" )
notification = never
queue 10
Clearly this is a good deal more complicated. Several other attributes can be specified in the simplified job description and all of these are detailed in a later section.


Tools for Submitting MATLAB Jobs

Running MATLAB jobs on the Condor pool is made more difficult by the need to build standalone executables to be run as jobs. The tools described below are designed to assist in testing, building and running MATLAB applications.

Since the Condor server has MATLAB installed in the same way as the central Sun UNIX service, it is possible to test out M-files on it. The command matlab_run can be used to pass an M-file to the MATLAB interpreter without the need for the graphical interface to be started (it is therefore suitable for use with PuTTY or other terminal emulators), for example:
$ matlab_run product.m
Here product.m would need to contain a MATLAB function called product and be able to run without input from the user. MATLAB on the server should be used sparingly and not for M-files which are likely to require significant CPU use over long periods as this can impact badly on the performance of Condor.

In the past it was not possible to run M-files directly on the Condor PCs; however, this can now be achieved by using a special job description file, e.g.
M_file = product.m
input_file = input.mat
output_file = output.mat
This can be submitted to Condor using the command m_file_submit e.g.
$ m_file_submit product
(Where product is the name of the job description file.)

The command will return the job ID of the M-file job and, on completion, the output file output.mat will have been created. This should only be used for testing and not as a way of submitting large numbers of jobs, since this would place an unnecessary load on the MATLAB license server.

Once the M-file is found to work properly, it is possible to build the standalone application directly on a pool PC without the need to build it locally and then upload it. This is accomplished by using the command matlab_build, e.g.

$ matlab_build product.m
The command will return a job ID and on completion, the standalone executable product.exe should have been created. The file build.log can be used to track the progress of the job which should only take a few minutes to run.
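Since matlab_build returns immediately with a job ID, a simple way to follow progress from a script is to check whether the executable has appeared yet. A one-shot sketch (assuming a POSIX shell, and the file names product.exe and build.log from the example above):

```shell
#!/bin/sh
# One-shot progress check for a matlab_build job: report whether the
# standalone executable has appeared, and show recent log output if
# the build is still in progress.
if [ -f product.exe ]; then
    echo "build complete: product.exe"
elif [ -f build.log ]; then
    echo "build still running; recent log output:"
    tail -n 5 build.log
else
    echo "no build output yet"
fi
```

Running this periodically (or from a loop with a short sleep) avoids having to watch build.log by hand.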

Note that if the M-file contains any syntax errors, the MATLAB compiler will not catch these and will blindly compile the code into an executable which will fail when run under Condor. It is extremely difficult to locate these errors later on, so please always check that the M-file works correctly before compiling it.

MATLAB standalone applications can be submitted to the Condor pool using a simplified job submission file and the command matlab_submit e.g.

$ matlab_submit simplified_job_description_file
The job description file can make use of the same attributes as mws_submit, for example:
indexed_input_files = input.mat
indexed_output_files = output.mat
executable = product.exe
indexed_log = logfile
total_jobs = 10

Another important feature is that matlab_submit will automatically create a manifest file so that MATLAB can locate the required run-time libraries; there is therefore no need for the user to worry about this. In this example the manifest file would be product.exe.manifest. The actual job submission file passed to Condor in this example is:

universe = vanilla
executable = product.bat 
arguments = product.exe $(PROCESS) input output
transfer_input_files = product.exe.manifest, product.exe, input$(PROCESS).mat
transfer_output_files = output$(PROCESS).mat
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
requirements = ( Arch=="X86_64") && ( OpSys=="WINDOWS" )
notification = never
queue 10
which again is a good deal more complicated than the simplified job description file.


Summary of Job Description Attributes

A complete list of the attributes which can be used in simplified job description files with mws_submit and matlab_submit is given below. For attributes with multiple values, a comma-separated list is used which may contain spaces; however, spaces may not be used on the command line. For example:

$ mws_submit -indexed_input_files=input1,input2
will work but
$ mws_submit -indexed_input_files=input1, input2
will not.
centre
Teaching centre on which the job can be run.
indexed_input_files
A comma-separated list of file names for input files unique to each job. If, for example, input.mat is given as an indexed input file, this would correspond to the set of files input0.mat .. input(n-1).mat with the ith job receiving inputi.mat as an input file.
indexed_log
Similar to the log attribute (below) but different log files are used for each job making it easier to track down information.
indexed_output_files
A comma-separated list of file names for output files unique to each job. If, for example, output.mat is given as an indexed output file, this would correspond to the set of files output0.mat .. output(n-1).mat with the ith job producing outputi.mat as an output file.
indexed_stdout
File to which each individual job's standard output is to be redirected. The file names will be indexed in a similar manner to the indexed input/output files so that the standard output of each individual job can be seen. This can sometimes be useful in determining where things have gone wrong.
indexed_stderr
File to which each individual job's standard error stream is to be redirected. The file names will be indexed in a similar manner to the indexed input/output files so that the standard error of each individual job can be seen. This can sometimes be useful in determining where things have gone wrong.
input_files
Comma-separated list of input files common to all jobs
log
File to which Condor logs information about the progress of jobs. For multiple jobs, all of the log information is merged into one file and a better choice may be to use indexed_log. This can be useful in determining where and for how long jobs ran.
max_run_time
The maximum time (in minutes) that a job will be allowed to run for. After this time has elapsed, the job will be held then released causing it to go back into the Condor queue. This is useful to prevent jobs getting "stuck".
memory
This attribute can be used to ensure that jobs run only on machines with at least a given amount of memory. The memory size is specified in MB, so that memory = 1024 would ensure that jobs run only on PCs with at least 1 GB of memory (per core). Note that 1 GB = 1 024 MB.
prefer
Indicates that jobs should run preferentially on some machines rather than others. Using prefer = speed will cause jobs to choose the fastest processors (based on the MIPS benchmark) first, and prefer = memory will cause jobs to seek out the machines with the largest amount of memory installed first.
output_files
List of output files common to all jobs. This is only really useful for single jobs.
run_during
This attribute allows users to specify during which periods their jobs can start. For jobs requiring more than, say, 60 minutes of run time, it may be better to run jobs outside of office hours to avoid them being interrupted by users logging into the teaching centre machines. This is beneficial to everyone for two reasons. Firstly, energy (and money!) is not wasted by running jobs during the daytime only for them to be killed off by user logins with no useful results produced. Secondly, Condor assumes that all usage counts towards a user's "fair share", and if large numbers of jobs are killed by logins during the daytime then Condor users will find that a large proportion of their "fair share" is in fact wasted and their jobs are less likely to run. The values which can be specified are office_hours (Mon - Fri, 9 am - 6 pm), out_of_hours (any time apart from office_hours) and weekends (all day Sat and Sun).
speed
This attribute can be used to ensure that jobs run only on machines with a minimum clock speed. The speed value is based on the MIPS rating of the processor (not an individual core), so that for a dual-core processor with a 2.33 GHz clock the "speed" would actually be approximately 2 * 2 300 = 4 600 MIPS.
stdout
File to which the job's standard output is to be redirected. This is only really useful for single jobs - for multiple parallel jobs use indexed_stdout.
stderr
File to which the job's standard error stream is to be redirected. This is only really useful for single jobs - for multiple parallel jobs use indexed_stderr.
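As an illustration (the file names and values here are hypothetical), a simplified job description combining several of the attributes above might look like:

executable = myapp.exe
indexed_input_files = input_data
indexed_output_files = output_data
indexed_log = logfile
total_jobs = 20
memory = 1024
max_run_time = 120
run_during = out_of_hours
prefer = speed

This would run twenty jobs, each restricted to machines with at least 1 GB of memory per core, starting only outside office hours and preferring the fastest processors first.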

Summary of Commands

matlab_build M-file
Builds a standalone executable by compiling the M-file on a PC in the Condor pool.
matlab_run M-file
Runs an M-file using MATLAB on the Condor server without the need to start the graphical interface.
matlab_submit simplified_job_description_file
Submits a standalone MATLAB executable to the pool. The executable does not need to manipulate the input and output filenames to give the correct indexes.
m_file_submit simplified_job_description_file
Uses a Condor pool PC to run the specified M-file. A job description needs to be supplied which contains (at a minimum) the name of the M-file and the input file it reads.
mws_submit simplified_job_description_file
Submits a generic Condor job to the pool using a simplified job description. Users' applications must ensure that all file indexing is taken care of.