A "Hands-On" Condor Tutorial
Available on-line at: http://condor.liv.ac.uk/tutorial
Getting Started
-
To access the Condor server, use the PuTTY terminal emulator. If this is not already available
on the desktop you will need to install it as follows:
Click "Start" | "Install University Applications"
Select "Internet" from the "Category" drop down menu
Select "Putty 0.060 Install" from the progam list
Click "Run"
-
Start PuTTY and specify condor.liv.ac.uk as the hostname. Make sure that SSH
is selected as the connection type and click "Open". Then login with your usual
(MWS) username and password.
-
You should now be logged into the Condor server (note that the hostname will
be ulgp5 - condor is an alias for this). Condor jobs need to be submitted from
a local filesystem (/condor_data) so change directory to this:
$ cd /condor_data/<your_username>
You can think of this as HOME for Condor jobs. The filesystem sits on a large (7 TB) fast RAID system. This is not backed up so please copy any important results to somewhere safe when using Condor in earnest.
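For example, once a set of runs has finished, the results could be copied back to your home directory on the server or downloaded to your M: drive (see the end of this tutorial). A minimal illustration - results_dir is just a placeholder name:
$ cp -r /condor_data/<your_username>/results_dir ~/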
-
Unpack the examples to your Condor "home" directory:
$ tar xvf /opt1/condor/examples/handson.tar
-
You should now see the following files installed:
$ ls -R
.:
build  hello  matlab  multiple  r_example

./build:
input.mat  product  product.m

./hello:
hello.bat  hello.sub

./matlab:
input0.mat  input1.mat  input2.mat  input3.mat  input4.mat  prodsum.m  product  product.exe

./multiple:
hello.bat  multiple

./r_example:
analyse.R     observations0.dat  observations3.dat  observations6.dat  observations9.dat
collect.R     observations1.dat  observations4.dat  observations7.dat  run_analysis
initialise.R  observations2.dat  observations5.dat  observations8.dat
Some Condor commands to begin with
-
To view the current state of the Condor pool use
$ condor_status
This is likely to scroll off the screen so to see things more clearly pipe the output to more using:
$ condor_status | more
(press the space bar to see the next screenful).
condor_status lists the current state of all machines in the pool. In the first column are the machine names which include the teaching centre to which they belong (some of these may be familiar). Note that some machines have two or more slots associated with them meaning that they can run multiple jobs concurrently on their multi-core processor. The State column is possibly the most important and can take one of several values. Owner means that the machine (actually the job slot for a multi-core PC) is in use by someone logged into it and hence unavailable to run Condor jobs. Claimed means that the machine is already running (or is about to run) a Condor job and Unclaimed means that a machine is not currently in use and therefore is free to run Condor jobs.
A summary of the current pool state can be found by using:
$ condor_status -totals
Note that there are a small number of 32-bit machines in the pool alongside the majority of 64-bit machines. The 32-bit machines are very low specification and best avoided for Condor use (all of the examples here will run on the 64-bit machines).
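If you want to pick out particular machines, condor_status also accepts filters. The options below are standard Condor ones and are shown purely as an illustration:
$ condor_status -avail
$ condor_status -constraint 'Arch == "X86_64"'
The first lists only machines currently able to run jobs; the second restricts the listing to the 64-bit machines.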
- During the summer vacation, there may be many machines in the pool in the Unclaimed state which have an [Unknown] entry in the last column. These represent machines which are currently powered down and offline. Condor automatically wakes these machines up according to the current demand.
A Simple "Hello World" example
-
The first example is extremely simple and just runs a "Hello World" program
on a machine in the pool. First change to the required directory with:
$ cd hello
Take a look at the job submission file hello.sub - it contains all of the information necessary to run the job:
$ cat hello.sub
universe=vanilla
executable=hello.bat
requirements = ( Arch=="X86_64" ) && ( OpSys=="WINDOWS" )
notification=never
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
output = hello.out
error = hello.err
log = hello.log
queue
In particular the executable attribute specifies the program to run - in this case a batch file which contains:
$ cat hello.bat
echo hello world from ...
%WINDIR%\system32\ipconfig
%WINDIR%\system32\sleep 30
The second line prints out the IP (network) address of the machine that the job runs on as a quick sanity check. This batch file will run to completion extremely quickly, so a 30 s delay is introduced in the last line to make it possible to see it running. Note that we need to give the full path to both of these commands since the PATH (and other environment variables) is not defined when the Condor job runs on the PC.
-
To submit the job use:
$ condor_submit hello.sub
Condor should return with something like:
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 129538.
The number at the end is a unique identifier called the Job ID. You can see the current state of all jobs with:
$ condor_q -global
and to see just your own jobs:
$ condor_q <your_username>
or just a particular job:
$ condor_q <Job ID>
The state is shown in the 6th (ST) column. R indicates running jobs and I those that are idle. The H state indicates jobs which are held - possibly due to an error.
To see all of your own jobs which are running (and where) use:
$ condor_q -run <your_username>
Condor is aimed at running large numbers of jobs over perhaps a few hours and you may find that your job does not run immediately. If this is the case then simply move on to the other examples and return to this later (if necessary after the course).
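If a job stays idle for longer than expected, Condor can give an indication of why it has not yet been matched to a machine. The option below is a standard Condor one and is shown as an illustration:
$ condor_q -analyze <Job ID>
This prints a summary of how many machines in the pool match, or reject, the job's requirements.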
-
Once the job has completed, as indicated by its disappearance from the
condor_q output, it is worth having a look at the output files. The standard
output file (hello.out) should contain something like:
$ cat hello.out
hello world from ...

Windows IP Configuration

Ethernet adapter Liverpool University Network:

   Connection-specific DNS Suffix . : livad.liv.ac.uk
   IP Address. . . . . . . . . . . . : 138.253.233.42
   Subnet Mask . . . . . . . . . . . : 255.255.255.0
   Default Gateway . . . . . . . . . : 138.253.233.1
The IP address may be different depending on where the job ran. The error file (hello.err) should be empty but - had the job failed - may contain useful debugging information. The log file (hello.log) will contain details about the job's progress:
$ cat hello.log
000 (129539.000.000) 05/20 15:05:14 Job submitted from host: <138.253.100.27:64217>
...
001 (129539.000.000) 05/20 15:07:23 Job executing on host: <138.253.233.42:1060>
...
005 (129539.000.000) 05/20 15:07:54 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        353  -  Run Bytes Sent By Job
        81   -  Run Bytes Received By Job
        353  -  Total Bytes Sent By Job
        81   -  Total Bytes Received By Job
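Instead of repeatedly running condor_q, it is also possible to wait for a job to finish by watching its log file. condor_wait is a standard Condor utility and is shown here as an illustration:
$ condor_wait hello.log
This returns as soon as all of the jobs recorded in the log have left the queue.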
Submitting multiple jobs
-
The next example builds on the "hello world" example above to show how
multiple jobs can easily be submitted with a single submission. First change
directory to the one containing the example:
$ cd ../multiple
-
Instead of using a Condor job description file similar to that used in the previous
example, this example will use a simplified job description file which
makes use of locally provided job submission features. The simplified
job description file is called multiple and contains:
$ cat multiple
executable=hello.bat
indexed_stdout=hello.out
indexed_stderr=hello.err
indexed_log=hello.log
total_jobs=2
The executable line is as before but now the attributes prefixed by indexed_ indicate that different stdout, stderr and log files should be used for each of the multiple jobs. The total number of jobs is given by total_jobs - here two.
-
To submit the job use:
$ mws_submit multiple
This should return with something like:
Submitting job(s)..
Logging submit event(s)..
2 job(s) submitted to cluster 129540.
-
If you check the queue state quickly you should see that two jobs
have now been submitted:
$ condor_q

-- Submitter: ulgp4.liv.ac.uk : <138.253.100.27:64217> : ulgp4.liv.ac.uk
 ID        OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
129540.0   fred     5/20 15:16   0+00:00:00 I  0   0.0  hello.bat
129540.1   fred     5/20 15:16   0+00:00:00 I  0   0.0  hello.bat

2 jobs; 2 idle, 0 running, 0 held
Notice carefully the Job IDs. The integer part (here 129540) is called the cluster ID and the decimal part (0 or 1 here) is called the process ID. These allow fine control over jobs, as will be seen later.
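For example, to query just the second job of the cluster above, give the Job ID in its full cluster.process form:
$ condor_q 129540.1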
-
The mws_submit command creates a Condor job description file which is the
one actually passed to condor_submit. It is worth taking a look at this and
comparing it to the simplified job description above:
$ cat multiple.sub
universe = vanilla
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
executable = hello.bat
output = hello.out$(PROCESS)
error = hello.err$(PROCESS)
log = hello.log$(PROCESS)
requirements = ( Arch=="X86_64") && ( OpSys=="WINDOWS" )
notification = never
queue 2
Clearly it is a good deal more complicated.
-
Once all of the jobs in the cluster have completed, multiple output,
error and log files will have been created e.g.
$ ls -l hello.*
total 112
-rw------- 1 fred csd   81 May 20 15:16 hello.bat
-rw------- 1 fred csd    0 May 20 15:18 hello.err0
-rw------- 1 fred csd    0 May 20 15:18 hello.err1
-rw------- 1 fred csd  622 May 20 15:18 hello.log0
-rw------- 1 fred csd 1038 May 20 15:18 hello.log1
-rw------- 1 fred csd  353 May 20 15:18 hello.out0
-rw------- 1 fred csd  353 May 20 15:18 hello.out1
By examining these it is possible to see where the jobs ran. Depending on the current pool state this may have been on different machines.
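One quick way to check is to pull the relevant lines out of all of the log files at once - a simple grep over the log format shown earlier:
$ grep "executing on host" hello.log*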
-
It is easy to extend this to large numbers of jobs (possibly thousands). If time
permits, change the number of jobs to be submitted using say:
$ mws_submit multiple -total_jobs=5
so that now five jobs will be submitted. It is possible to override all of the attributes in the simplified job description temporarily using command line options. If you prefer to edit the file directly, ned is recommended as an easy to use editor:
$ ned multiple
-
Condor jobs can be removed from the queue at any time using the condor_rm command.
To remove a cluster of jobs, just give the cluster ID part of the Job ID e.g.
$ condor_rm 129540
or to remove an individual job, add its process ID e.g.
$ condor_rm 129540.3
To remove all of your jobs (use with care!):
$ condor_rm -all
Condor will attempt to remove jobs "gracefully" but you can speed things along by supplying the -f (force) option to any of the above commands. You may need to submit the previous example again in order to try and remove it.
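For example, to force the removal of the whole cluster immediately:
$ condor_rm -f 129540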
Working with MATLAB Applications
-
This section gives examples of how to use Condor to run MATLAB based jobs. If you are more interested
in R applications you can skip over this to the next section.
A number of Condor additions have been provided locally to help with running MATLAB applications on the pool. These examples use an extremely simple application. Consider that we want to form the sum of p matrix-matrix products:

S = A_1*B_1 + A_2*B_2 + ... + A_p*B_p

where the A_i and B_i (and each individual product C_i = A_i*B_i) are square matrices of order n. It is easy to see that the p matrix-matrix products could be calculated independently and therefore potentially in parallel. To locate the first example files change directory:
$ cd ../build
The M-file, product.m, is extremely simple and just loads the input matrices, A and B, from the input file (input.mat), calculates the product and saves the product, C, to an output file (output.mat):
$ cat product.m
function product
load input.mat;
C=A*B;
save( 'output.mat', 'C' );
quit;
The input file, input.mat, contains two small dense matrices (A and B) of order n=100.
-
MATLAB is available on the Condor server (although it should be used sparingly)
and can be used as a quick sanity check to ensure that the M-files work properly.
To run the M-file without the graphical interface use:
$ matlab_run product.m
The output.mat file should be created by this:
$ ls -l
total 496
-rw------- 1 fred csd 151507 May 21 10:27 input.mat
-rw------- 1 fred csd  74617 May 21 10:31 output.mat
-rw------- 1 fred csd     44 May 21 10:31 product
-rw------- 1 fred csd     86 May 21 10:31 product.m
-
To run MATLAB jobs on the Condor pool, the M-file needs to be compiled into
a standalone executable which will run without the need for a MATLAB license.
The standalone executable can be built using the Condor pool and without the
need to have Windows on your desktop machine (another way is to use Apps Anywhere).
To build the executable, just submit a 'build' job using:
$ matlab_build product.m
When the job completes, a standalone executable called product.exe should have been created. To use this with the next example, copy it over to another directory (if time is short don't worry - a standalone executable has already been put there):
$ cp product.exe ../matlab
-
To see how to run concurrent MATLAB jobs using standalone executables, change to
the next example directory:
$ cd ../matlab
This now contains five input files containing the matrices to be multiplied:
$ ls -l input*.mat
-rw------- 1 fred csd 151507 May 21 11:45 input0.mat
-rw------- 1 fred csd 151507 May 21 11:45 input1.mat
-rw------- 1 fred csd 151507 May 21 11:45 input2.mat
-rw------- 1 fred csd 151507 May 21 11:45 input3.mat
-rw------- 1 fred csd 151507 May 21 11:45 input4.mat
as well as a simplified job description file:
$ cat product
executable=product.exe
indexed_input_files=input.mat
indexed_output_files=output.mat
indexed_stdout=product.out
indexed_stderr=product.err
indexed_log=product.log
total_jobs=5
Note that now we have specified the output files produced by each job using the indexed_output_files attribute.
-
To submit the jobs use:
$ matlab_submit product
It is worth taking a look at the underlying Condor submission file which is created by this:
$ cat product.sub
universe = vanilla
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
executable = product.bat
arguments = product.exe $(PROCESS) input output
output = product.out$(PROCESS)
error = product.err$(PROCESS)
log = product.log$(PROCESS)
transfer_input_files = product.exe.manifest, product.exe, input$(PROCESS).mat
transfer_output_files = output$(PROCESS).mat
requirements = ( Arch=="X86_64") && ( OpSys=="WINDOWS" )
notification = never
queue 5
There is a good deal more going on here than is immediately apparent from the simplified description file. In particular note the executable and arguments attributes. These ensure that a wrapper script (product.bat) - created automatically - sorts out the filename indexing so there is no need for the user to do this in the original M-file.
-
Once all of the jobs have completed, five output files should have been created
$ ls -l output*.mat
-rw------- 1 fred csd 74536 May 21 12:04 output0.mat
-rw------- 1 fred csd 74536 May 21 12:03 output1.mat
-rw------- 1 fred csd 74536 May 21 12:04 output2.mat
-rw------- 1 fred csd 74536 May 21 12:04 output3.mat
-rw------- 1 fred csd 74536 May 21 12:05 output4.mat
The final step is just to combine the partial products using another M-file:
$ cat prodsum.m
function prodsum
S = zeros(100);
for index=0:4
  filename = strcat( 'output', int2str( index ) );
  load( filename );
  S=S+C;
end
save( 'sum.mat', 'S' );
quit;
This can be run on the Condor server:
$ matlab_run prodsum.m
The final sum will be stored in sum.mat.
Working with R Applications
-
A number of Condor additions have been provided locally to help with running
R applications on the pool. This example uses an extremely simple problem.
Consider the case where observers are stationed on a number of roads and record the colour of cars which pass by them. Unfortunately the observers have not come across the idea of a tally chart and just record the colour of each car in a list as one of: red, white, blue, green, black, pink.
The problem is to combine this data in order to calculate the relative frequency of each car colour. It is easy to see that the frequency distributions for each observer's data could be calculated independently and therefore potentially in parallel.
-
To see how to run the analysis concurrently on the Condor pool, change to
the next example directory:
$ cd ../r_example
This contains ten input files containing the data recorded by the observers and stored in 'native' R format:
$ ls -l observations*.dat
-rw-r--r-- 1 fred csd 5458439 Oct 30 15:29 observations0.dat
-rw-r--r-- 1 fred csd 5456760 Oct 30 15:29 observations1.dat
-rw-r--r-- 1 fred csd 5456971 Oct 30 15:29 observations2.dat
-rw-r--r-- 1 fred csd 5456489 Oct 30 15:29 observations3.dat
-rw-r--r-- 1 fred csd 5458133 Oct 30 15:29 observations4.dat
-rw-r--r-- 1 fred csd 5458496 Oct 30 15:29 observations5.dat
-rw-r--r-- 1 fred csd 5457296 Oct 30 15:29 observations6.dat
-rw-r--r-- 1 fred csd 5456744 Oct 30 15:29 observations7.dat
-rw-r--r-- 1 fred csd 5457377 Oct 30 15:29 observations8.dat
-rw-r--r-- 1 fred csd 5458846 Oct 30 15:29 observations9.dat
There are actually a million observations in each file, although the analysis script does not need to be aware of this and in fact the number of observations could be different for each file. The analyse.R file is the R script which will be run on the Condor pool and contains the following:
$ cat analyse.R
load("observations.dat")
observf <- factor( observed_colours )
observt <- table( observf )
save( observt, file="results.dat" )
The script loads the input data from observations.dat, calculates the frequency table for it and stores this to an output file called results.dat. Note that there is no need to worry about the filename indexing - this is automatically taken care of. There is also a simplified job submission file (run_analysis):
$ cat run_analysis
R_script = analyse.R
indexed_input_files = observations.dat
indexed_output_files = results.dat
indexed_log = log
indexed_stdout = output
indexed_stderr = error
total_jobs = 10
Note that now we have specified the output files produced by each job using the indexed_output_files attribute.
-
To submit the jobs use:
$ r_submit run_analysis
It is worth taking a look at the underlying Condor submission file which is created by this:
universe = vanilla
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
executable = analyse.bat
arguments = analyse.R $(PROCESS) observations.dat results.dat
output = output$(PROCESS)
error = error$(PROCESS)
log = log$(PROCESS)
transfer_input_files = analyse.R, /opt1/condor/apps/matlab/index.exe, /opt1/condor/apps/matlab/unindex.exe, /opt1/condor/apps/matlab/dummy.exe, observations$(PROCESS).dat
transfer_output_files = results$(PROCESS).dat
requirements = ( Arch=="X86_64") && ( OpSys=="WINDOWS" )
notification = never
queue 10
There is a good deal more going on here than is immediately apparent from the simplified description file. In particular note the executable and arguments attributes. These ensure that a wrapper script (analyse.bat) - created automatically - sorts out the filename indexing so there is no need for the user to do this in the original R script.
-
Once all of the jobs have completed, ten output files should have been created
$ ls -l results*.dat
-rw-r--r-- 1 fred csd 190 Oct 30 16:09 results0.dat
-rw-r--r-- 1 fred csd 191 Oct 30 16:10 results1.dat
-rw-r--r-- 1 fred csd 190 Oct 30 16:09 results2.dat
-rw-r--r-- 1 fred csd 191 Oct 30 16:10 results3.dat
-rw-r--r-- 1 fred csd 191 Oct 30 16:10 results4.dat
-rw-r--r-- 1 fred csd 189 Oct 30 16:10 results5.dat
-rw-r--r-- 1 fred csd 191 Oct 30 16:10 results6.dat
-rw-r--r-- 1 fred csd 191 Oct 30 16:10 results7.dat
-rw-r--r-- 1 fred csd 190 Oct 30 16:10 results8.dat
-rw-r--r-- 1 fred csd 191 Oct 30 16:10 results9.dat
The final step would be to download the partial results files (see below) and combine them using a script such as this:
load( "results0.dat" )
global_observt <- observt
for( i in 1:9 ) {
  filename = paste( "results", i, ".dat", sep="" )
  load( filename )
  global_observt <- global_observt + observt
}
freq_data <- transform( global_observt, relative = prop.table(Freq) )
plot( freq_data$observf, freq_data$relative )
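Assuming R is installed on the machine you download the results to, the combining script could be saved to a file (collect.R, say) and run in batch mode with the standard R interface:
$ R CMD BATCH collect.R
By default, any plot produced in batch mode ends up in the file Rplots.pdf.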
Retrieving the output
-
In real applications, Condor is likely to generate large numbers of output files
which will generally need to be transferred elsewhere for post-processing or archiving.
Rather than transfer each individual file it is far easier to bundle the output files into
a single ZIP format file. To continue with the previous MATLAB example, create a ZIP
bundle of the outputs using
$ zip output.zip output*.mat
The R example could use:
$ zip results.zip results*.dat
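The contents of a bundle can be checked before downloading it (assuming the unzip utility is installed alongside zip on the server) with:
$ unzip -l output.zip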
-
To download the files to your MWS PC or M: drive, the easiest way is to use CoreFTP Lite.
If this is not already available on the desktop you will need to install it as follows:
Click "Start" | "Install University Applications"
Select "Internet" from the "Category" drop down menu
Select "Core FTP LE21 Install" from the program list
Click "Run"
Enter condor.liv.ac.uk as the hostname and login with your MWS username and password, ensuring that the connection type is set to SSH/SFTP. In the right-hand pane navigate to the directory on the Condor server containing the output (i.e. /condor_data/<your_username>/matlab) and in the left-hand pane navigate to a directory that you have write permission for, e.g. on the M: drive. Select output.zip and then click the left-pointing arrow. The ZIP file should now have been downloaded. You can double click on this in Windows Explorer to extract the files. The reverse procedure can be used to upload large numbers of input files.
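If you prefer the command line to CoreFTP, PuTTY's companion tool pscp can copy the bundle directly from a Windows command prompt. This is a sketch only - adjust the paths to your own case:
pscp <your_username>@condor.liv.ac.uk:/condor_data/<your_username>/matlab/output.zip M:\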