High Throughput Computing using Condor

Running R Applications under Condor


All of the files for the examples can be found on condor.liv.ac.uk in
/opt1/condor/examples/r_example


The R installation used on Condor (version 3.0.2) can be downloaded as a self-extracting ZIP file using FTP from: condor.liv.ac.uk in
/opt1/condor/apps/R_3.0.2.exe

You can find a list of all of the installed R packages here

You can work through the example described here step by step using the Hands On Tutorial

Contents

Introduction
Preparing the input files
Creating the Condor R script
Collecting the partial results
Creating the simplified job submission file
Submitting the Condor Jobs
Summary

Introduction

Condor is well suited to running large numbers of R jobs at the same time (i.e. concurrently). If the application applies the same kind of analysis to large data sets (so-called "embarrassingly parallel" applications) then Condor can significantly reduce the time needed to generate the results by processing the data in parallel on different PCs. In some cases, simulations and analyses that would have taken months on a single PC can be completed in a matter of days.

A very simple example is given here to illustrate the way in which R applications can be adapted to make use of Condor. Consider the case where observers are stationed on a number of roads and record the colour of cars which pass by them. Unfortunately the observers have not come across the idea of a tally chart and just record the colour of each car in a list as one of:

red, white, blue, green, black, pink
The problem is to combine this data in order to calculate the relative frequency of each car colour.

To tackle the problem with Condor, it will need to be divided into three separate stages namely:

  1. Gather the input data and store it in files in a form which can be processed in parallel by Condor.
  2. Process the input data in parallel using the Condor pool and write the outputs to corresponding files.
  3. Collate the data in the output files to generate the overall results.
We'll create three R scripts to carry out these tasks: initialise.R, analyse.R and collect.R respectively. Each of these is described below.


Preparing the input files

In a "real world" application, the input data would most likely already be stored in files which hopefully can be loaded directly into R (either in CSV or "native" R format). To distinguish between files, the filenames need to have a common name plus an index value i.e they must be of the form:

basename<index>.extension
where <index> takes values in the range [0..N-1] if there are N files in total and the extension is optional. For the purposes of this example we'll just create ten input files filled with (pseudo-) random data as follows:
colours =   c( "red", "white", "blue", "green", "black", "pink"  )
occurence = c(  100,   200,    50,      500,    200,     10     )
 
for( i in 0:9 )
{
  filename <- paste( "observations", i, ".dat", sep="" )
  observed_colours <- sample( colours, 10000000, TRUE, occurence )
  save( observed_colours, file=filename )
}
This is the file initialise.R.

Since each input data file contains ten million observations, this corresponds to some extremely industrious observers; however, this is necessary for the analysis to run for a measurable amount of time. The script can be run on a PC to generate the ten input files, which can then be uploaded to the Condor server (or you can download them directly in ZIP format as observations.zip). The input files will be:

observations0.dat
observations1.dat
observations2.dat
...
observations9.dat
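If the files were generated on your own PC, they can be transferred to the Condor server with any SCP or SFTP client. For example, from a UNIX-like machine (and assuming SCP access to condor.liv.ac.uk is available), a command along these lines would copy them to your home directory:

$ scp observations*.dat <your_unix_username>@condor.liv.ac.uk: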
It is also possible to use the version of R installed on the Condor server using:
$ Rscript initialise.R


Creating the Condor R script

Next up we'll need a script to do the actual analysis - this is the script that will run concurrently on the Condor pool. Often this script will have started life being used to run the analysis on a desktop PC and will need adapting for Condor-style execution. Here we'll just start from scratch by creating a script that will read the input data, calculate a frequency table for it and store the results to a file as follows:

load("observations.dat")

observf <- factor( observed_colours )
observt <- table( observf )

save( observt, file="results.dat" )
This is the file analyse.R.

Note that the filenames do not need to include an index value - this will automatically be taken care of by the simplified Condor job submission process.

The script can easily be tested on a PC by copying one of the input files to observations.dat and running the script via the R graphical interface. It should produce results.dat. It is always a good idea to test scripts first locally to check that they work (even if only minor changes have been made) as it can be difficult to diagnose errors later under Condor.
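A quick command-line test on the Condor server (or any machine with Rscript available) might look like this, using one of the generated input files:

$ cp observations0.dat observations.dat
$ Rscript analyse.R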

There is one fairly subtle point here which is worth making. By using R's high level built-in functions, it is not necessary to know how much data is stored in each input file. Thus each input file could contain a different number of observations (possibly with "NA" values) and this will be handled transparently. Building this kind of flexibility in early on is always a better idea than trying to adapt a more complicated script later.
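As a quick (hypothetical) illustration of this behaviour, table() silently drops NA values by default, so a file containing missing observations would still produce a sensible frequency table:

# illustrative data containing a missing observation
obs <- c( "red", "white", NA, "red" )
table( factor(obs) )                    # NA is excluded by default
table( factor(obs), useNA="ifany" )     # or counted explicitly if required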


Collecting the partial results

The final stage will be to combine the partial results to arrive at a combined frequency distribution from which the relative frequencies can be found. A script such as the following will perform this task:

load( "results0.dat" )
global_observt <- observt

for( i in 1:9 )
{
 filename = paste( "results", i, ".dat", sep="" )
 load( filename )
 global_observt <- global_observt + observt 
}

freq_data <- transform( observt, relative = prop.table(Freq) )

plot( freq_data$observf, freq_data$relative )
This is the file collect.R.

The script simply loops over each output file and sums the absolute frequencies for each category before calculating the overall relative frequencies. This script can be run on a desktop PC once all of the output files have been downloaded from the Condor server.

Creating the simplified job submission file

Each batch of Condor jobs needs a job submission file to describe how the jobs should be run. These can appear rather arcane and difficult for new users to understand, so to help simplify the process, tools have been created to provide a more user-friendly job submission process. The simplified job submission files are automatically translated into files which Condor understands and which users need not worry about. For the present example, a suitable job submission file (run_analysis) is:

R_script = analyse.R
indexed_input_files = observations.dat
indexed_output_files = results.dat
indexed_log = log
total_jobs = 10

The R script to be run on the Condor pool is specified in the R_script attribute. Note that under UNIX, filenames are case sensitive so that the script name needs an upper case R. Another UNIX hazard that may trip up users more familiar with Windows is that spaces should not be used in filenames. If you have files which have spaces in their names, the spaces are best converted to either the underscore ('_') or hyphen ('-') character before uploading them to the Condor server.
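For example, a file (the name here is purely illustrative) could be renamed from the command line before uploading:

$ mv "my observations.dat" my_observations.dat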

The input data files are given in the indexed_input_files attribute. Multiple sets of input files should be provided as a comma-separated list e.g.

indexed_input_files = first_set.dat, second_set.dat, third_set.dat
The index values do not need to be included as the simplified job submission process will take care of this for you.

If there are input files which are common to all jobs then these can be specified using the input_files attribute. This is useful if the common input files are relatively large. Where there are multiple R scripts needed by the application, those that are called from the main R script (given in R_script) can be included in the list of common input files in input_files.
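As a sketch (the filenames helper_functions.R and reference_data.dat are purely illustrative), a submission file using common input files alongside the indexed ones might look like:

R_script = analyse.R
input_files = helper_functions.R, reference_data.dat
indexed_input_files = observations.dat
indexed_output_files = results.dat
total_jobs = 10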

The output files which the jobs are expected to produce are listed in indexed_output_files, in a similar way to the indexed input files.
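For example, if each job also produced a second output file (summary.txt here is purely illustrative), and assuming indexed_output_files accepts a comma-separated list in the same way as indexed_input_files, the line would become:

indexed_output_files = results.dat, summary.txt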

The indexed_log attribute is optional, but it is useful for tracking the progress of jobs and for diagnosing problems, so its use is recommended.

Finally, the total number of jobs is given in the total_jobs line. The underlying job submission process will ensure that each individual job receives the correct input file (observations0.dat .. observations9.dat) and that the output files are indexed in a corresponding manner to the input files (e.g. output file results1.dat will correspond to observations1.dat).


Submitting the Condor Jobs

The Condor jobs are submitted from the Condor server using the command:

$ r_submit run_analysis

It should return with something like:

[condoruser@ulgp5 r_example]$ r_submit run_analysis
Submitting job(s)..........
10 job(s) submitted to cluster 657.
You can monitor the progress of all of your jobs using:
$ condor_q <your_unix_username>
where <your_unix_username> is the username that you logged in with. Initially the Condor jobs will remain in the idle state until PCs become available, e.g.:

[condoruser@ulgp5 r_example]$ condor_q condoruser


-- Schedd: Q6@ulgp5.liv.ac.uk : <138.253.100.17:50534>
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
 657.0   condoruser     10/30 10:38   0+00:00:00 I  0   0.0  analyse.bat analys
 657.1   condoruser     10/30 10:38   0+00:00:00 I  0   0.0  analyse.bat analys
 657.2   condoruser     10/30 10:38   0+00:00:00 I  0   0.0  analyse.bat analys
 657.3   condoruser     10/30 10:38   0+00:00:00 I  0   0.0  analyse.bat analys
 657.4   condoruser     10/30 10:38   0+00:00:00 I  0   0.0  analyse.bat analys
 657.5   condoruser     10/30 10:38   0+00:00:00 I  0   0.0  analyse.bat analys
 657.6   condoruser     10/30 10:38   0+00:00:00 I  0   0.0  analyse.bat analys
 657.7   condoruser     10/30 10:38   0+00:00:00 I  0   0.0  analyse.bat analys
 657.8   condoruser     10/30 10:38   0+00:00:00 I  0   0.0  analyse.bat analys
 657.9   condoruser     10/30 10:38   0+00:00:00 I  0   0.0  analyse.bat analys

10 jobs; 10 idle, 0 running, 0 held
The state is shown in the ST column and I indicates an idle job. As some of the jobs start to run, the state will change to running, denoted by an R, e.g.
[condoruser@ulgp5 r_example]$ condor_q condoruser


-- Schedd: Q6@ulgp5.liv.ac.uk : <138.253.100.17:50534>
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
 658.0   condoruser     10/30 10:43   0+00:00:01 R  0   0.0  analyse.bat analys
 658.1   condoruser     10/30 10:43   0+00:00:00 I  0   0.0  analyse.bat analys
 658.2   condoruser     10/30 10:43   0+00:00:01 R  0   0.0  analyse.bat analys
 658.3   condoruser     10/30 10:43   0+00:00:00 I  0   0.0  analyse.bat analys
 658.4   condoruser     10/30 10:43   0+00:00:00 I  0   0.0  analyse.bat analys
 658.5   condoruser     10/30 10:43   0+00:00:00 I  0   0.0  analyse.bat analys
 658.6   condoruser     10/30 10:43   0+00:00:00 I  0   0.0  analyse.bat analys
 658.7   condoruser     10/30 10:43   0+00:00:00 I  0   0.0  analyse.bat analys
 658.8   condoruser     10/30 10:43   0+00:00:00 I  0   0.0  analyse.bat analys
 658.9   condoruser     10/30 10:43   0+00:00:00 I  0   0.0  analyse.bat analys

10 jobs; 8 idle, 2 running, 0 held

Once the jobs have completed (as indicated by the lack of any jobs in the queue), the directory should contain ten output files named results0.dat, results1.dat, .. results9.dat which can be processed using the collect.R script to generate the final result. This can be performed on the Condor server using:

$ Rscript collect.R
(it will not be possible to plot the results unless you are running a graphical interface such as eXceed). The files can also be downloaded to a PC and processed there.
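If a graphical display is not available, one possible workaround (a sketch, not part of the original collect.R) is to write the plot to an image file at the end of the script and download that instead. Depending on how R was built on the server, the png() device may itself require an X11 display, in which case downloading the results files and plotting on a PC is simpler.

# write the plot to a file instead of displaying it (results_plot.png is an illustrative name)
png( "results_plot.png" )
plot( freq_data$observf, freq_data$relative )
dev.off()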

There will also be ten log files (log0 .. log9). These are not of any great interest in themselves, but they can be useful in tracking down problems when things have gone wrong, and in finding out when and where each job ran and for how long.

The job submission file can be edited using the ned editor; however, all of the options in it can be overridden temporarily from the command line (the file itself is not changed). For example, to submit five jobs instead of ten use:

$ r_submit run_analysis -total_jobs=5
This is useful for making small changes without the need to use the UNIX system editors to change the job submission file.

This is of course a very simple example and the overall execution time is likely to be dominated by the time taken to read the input data and write the output data to/from disk (disk access is on the order of a million times slower than CPU speed). If you look at the log files you will see that the actual run time was only on the order of a few seconds e.g.

[condoruser@ulgp5 r_example]$ cat log0
000 (659.000.000) 10/30 11:50:44 Job submitted from host: <138.253.100.17:50534>
...
001 (659.000.000) 10/30 11:51:07 Job executing on host: <138.253.234.33:9718>
...
006 (659.000.000) 10/30 11:51:16 Image size of job updated: 11104
...
005 (659.000.000) 10/30 11:51:27 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:02  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:02  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        1406  -  Run Bytes Sent By Job
        66254120  -  Run Bytes Received By Job
        1406  -  Total Bytes Sent By Job
        66254120  -  Total Bytes Received By Job
Here the job used just two seconds of CPU time (Total Remote Usage).

The resulting relative frequency plot is shown below. You can see that it corresponds to the weightings given in the original initialisation script viz:

colours =   c( "red", "white", "blue", "green", "black", "pink"  )
occurence = c(  100,   200,    50,      500,    200,     10     )

Summary

Although the example application presented here is a very artificial one, the overall method can be used to tackle more realistic problems having significant computing requirements. The input data may consist of one or more large multivariate data frames which will first need to be divided up in some way so that the data can be processed in parallel using the Condor pool. The analysis of the data may also be far more complicated, leading to much longer job run times. For many problems, it is possible to divide the input data into an arbitrary number of parts to reduce the individual job run times. As a rule of thumb, Condor works best with run times of around 15-30 minutes.
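As an illustration of the kind of preparation step involved, the following sketch splits a large data frame into ten indexed input files of the form expected by the simplified job submission process. The data frame big_data and the file base name "input" are hypothetical stand-ins for the real data and naming scheme:

# hypothetical stand-in for the real multivariate data
big_data <- data.frame( x = runif(100000), y = runif(100000) )

total_jobs <- 10

# assign each row to one of total_jobs roughly equal, contiguous chunks
chunk_id <- cut( seq_len(nrow(big_data)), total_jobs, labels=FALSE )
chunks   <- split( big_data, chunk_id )

# save each chunk as input0.dat .. input9.dat
for( i in 0:(total_jobs - 1) )
{
  input_data <- chunks[[ i + 1 ]]
  save( input_data, file=paste( "input", i, ".dat", sep="" ) )
}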