High Throughput Computing using Condor

A "Hands-On" Condor Tutorial


Available on-line at: http://condor.liv.ac.uk/tutorial

Getting Started

  1. To access the Condor server, use the PuTTY terminal emulator. If this is not already available on the desktop you will need to install it as follows:

    Click "Start" | "Install University Applications"

    Select "Internet" from the "Category" drop down menu

    Select "Putty 0.060 Install" from the progam list

    Click "Run"

  2. Start PuTTY and specify condor.liv.ac.uk as the hostname. Make sure that SSH is selected as the connection type and click "Open". Then login with your usual (MWS) username and password.

  3. You should now be logged into the Condor server (note that the hostname will be ulgp5 - condor is an alias for this). Condor jobs need to be submitted from a local filesystem (/condor_data) so change directory to this:
      $ cd /condor_data/<your_username>
       
    You can think of this as HOME for Condor jobs. The filesystem sits on a large (7 TB) fast RAID system. This is not backed up so please copy any important results to somewhere safe when using Condor in earnest.
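
    For example, once a run has finished you might copy the results back to a backed-up location such as your home directory with something like this (the results directory name here is purely illustrative):
      $ cp -r /condor_data/<your_username>/results ~/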

  4. Unpack the examples to your Condor "home" directory:
       $ tar xvf /opt1/condor/examples/handson.tar
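
    You can also list the contents of the archive without unpacking it:
       $ tar tvf /opt1/condor/examples/handson.tar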
       
  5. You should now see the following files installed:
    $ ls -R
      .:
      build  hello  matlab  multiple  r_example
      
      ./build:
      input.mat  product  product.m
      
      ./hello:
      hello.bat  hello.sub
      
      ./matlab:
      input0.mat  input1.mat  input2.mat  input3.mat  
      input4.mat  prodsum.m  product  product.exe
      
      ./multiple:
      hello.bat  multiple
      
      ./r_example:
      analyse.R     observations0.dat  observations3.dat  
      observations6.dat  observations9.dat
      collect.R     observations1.dat  observations4.dat  
      observations7.dat  run_analysis
      initialise.R  observations2.dat  observations5.dat  
      observations8.dat
       
       

Some Condor commands to begin with

  1. To view the current state of the Condor pool use
       $ condor_status 
       
    This is likely to scroll off the screen, so to see things more clearly pipe the output to more using:
       $ condor_status | more
       
    (press the space bar to see the next screenful).

    condor_status lists the current state of all machines in the pool. The first column gives the machine names, which include the teaching centre to which they belong (some of these may be familiar). Note that some machines have two or more slots associated with them, meaning that they can run multiple jobs concurrently on their multi-core processors. The State column is possibly the most important and can take one of several values. Owner means that the machine (actually the job slot, for a multi-core PC) is in use by someone logged into it and hence unavailable to run Condor jobs. Claimed means that the machine is already running (or is about to run) a Condor job, and Unclaimed means that the machine is not currently in use and is therefore free to run Condor jobs.

    A summary of the current pool state can be found by using:
       $ condor_status -totals   
       
    Note that there are a small number of 32-bit machines in the pool alongside the majority of 64-bit machines. The 32-bit machines are very low specification and best avoided for Condor use (all of the examples here will run on the 64-bit machines).
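
    condor_status also accepts ClassAd constraints via its standard -constraint option so, for example, the summary can be restricted to just the 64-bit machines:
       $ condor_status -constraint 'Arch == "X86_64"' -totals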

  2. During the summer vacation, there may be many machines in the pool in the Unclaimed state which have an [Unknown] entry in the last column. These are machines which are currently powered down and offline; Condor automatically wakes them up according to the current demand.

A Simple "Hello World" example

  1. The first example is extremely simple and just runs a "Hello World" program on a machine in the pool. First change to the required directory with:
       $ cd hello
       
    Take a look at the job submission file hello.sub - it contains all of the information necessary to run the job:
       $ cat hello.sub
       universe=vanilla
       executable=hello.bat
       requirements = ( Arch=="X86_64" ) && ( OpSys=="WINDOWS" )
       notification=never   
       should_transfer_files = YES
       when_to_transfer_output = ON_EXIT
       output = hello.out
       error = hello.err
       log = hello.log
       queue
       
    In particular the executable attribute specifies the program to run - in this case a batch file which contains:
       $ cat hello.bat
       echo hello world from ...
       %WINDIR%\system32\ipconfig
       %WINDIR%\system32\sleep 30 
       
    The second line prints out the IP (network) address of the machine that the job runs on as a quick sanity check. This batch file would otherwise run to completion extremely quickly, so a 30 s delay is introduced in the last line to make it possible to see it running. Note that we need to give the full path to both of these commands since the PATH (and other environment variables) is not defined when the Condor job runs on the PC.

  2. To submit the job use:
       $ condor_submit hello.sub
       
    Condor should return with something like:
       Submitting job(s).
       Logging submit event(s).
       1 job(s) submitted to cluster 129538.
       
    The number at the end is a unique identifier called the Job ID. You can see the current state of all jobs with:
       $ condor_q -global 
       
    and to see just your own jobs:
       $ condor_q <your_username>
       
    or just a particular job:
       $ condor_q <Job ID>
       
    The state is shown in the 6th (ST) column. R indicates running jobs and I those that are idle. The H state indicates jobs which are held - possibly due to an error.
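
    If one of your jobs stays idle for a long time, or ends up held, condor_q can also help to diagnose why via its standard -analyze option:
       $ condor_q -analyze <Job ID>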

    To see all of your own jobs which are running (and where) use:
       $ condor_q -run <your_username>
       
    Condor is aimed at running large numbers of jobs over perhaps a few hours and you may find that your job does not run immediately. If this is the case then simply move on to the other examples and return to this later (if necessary after the course).

  3. Once the job has completed, as indicated by its disappearance from the condor_q output, it is worth having a look at the output files. The standard output file (hello.out) should contain something like:
        $ cat hello.out
        hello world from ...
    
        Windows IP Configuration
    
    
        Ethernet adapter Liverpool University Network:
    
            Connection-specific DNS Suffix  . : livad.liv.ac.uk
            IP Address. . . . . . . . . . . . : 138.253.233.42
            Subnet Mask . . . . . . . . . . . : 255.255.255.0
            Default Gateway . . . . . . . . . : 138.253.233.1  
       
    The IP address may be different depending on where the job ran. The error file (hello.err) should be empty but - had the job failed - may contain useful debugging information. The log file (hello.log) will contain details about the job's progress:
        $ cat hello.log
        000 (129539.000.000) 05/20 15:05:14 Job submitted from host: <138.253.100.27:64217>
        ...
        001 (129539.000.000) 05/20 15:07:23 Job executing on host: <138.253.233.42:1060>
        ...
        005 (129539.000.000) 05/20 15:07:54 Job terminated.
            (1) Normal termination (return value 0)
                    Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                    Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                    Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                    Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
            353  -  Run Bytes Sent By Job
            81  -  Run Bytes Received By Job
            353  -  Total Bytes Sent By Job
            81  -  Total Bytes Received By Job
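
    As an aside, instead of repeatedly running condor_q you can wait for a job to finish by watching its log file with the standard condor_wait tool, which returns once the job terminates:
       $ condor_wait hello.log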
       

Submitting multiple jobs

  1. The next example builds on the "hello world" example above to show how multiple jobs can easily be submitted with a single submission. First change directory to the one containing the example:
        $ cd ../multiple
        
  2. Instead of using a Condor job description file similar to that used in the previous example, this example will use a simplified job description file which makes use of locally provided job submission features. The simplified job description file is called multiple and contains:
        $ cat multiple
        executable=hello.bat
        indexed_stdout=hello.out 
        indexed_stderr=hello.err
        indexed_log=hello.log
        total_jobs=2
        
    The executable line is as before but now the attributes prefixed by indexed_ indicate that different stdout, stderr and log files should be used for each of the multiple jobs. The total number of jobs is given by total_jobs - here two.

  3. To submit the job use:
        $ mws_submit multiple
        
    This should return with something like:
        Submitting job(s)..
        Logging submit event(s)..
        2 job(s) submitted to cluster 129540.
        
  4. If you check the queue state quickly you should see that two jobs have now been submitted:
         $ condor_q
    
         -- Submitter: ulgp4.liv.ac.uk : <138.253.100.27:64217> : ulgp4.liv.ac.uk
          ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
         129540.0   fred         5/20 15:16   0+00:00:00 I  0   0.0  hello.bat         
         129540.1   fred         5/20 15:16   0+00:00:00 I  0   0.0  hello.bat         
    
         2 jobs; 2 idle, 0 running, 0 held
       
    Notice carefully the Job IDs. The integer part (here 129540) is called the cluster ID and the decimal part (0 or 1 here) is called the process ID. These allow fine control over jobs, as will be seen later.
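
    For example, to query just the second job of the cluster above, give the full Job ID:
        $ condor_q 129540.1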

  5. The mws_submit command creates a Condor job description file which is the one actually passed to condor_submit. It is worth taking a look at this and comparing it to the simplified job description above:
        $ cat multiple.sub
        universe = vanilla
        should_transfer_files = YES
        when_to_transfer_output = ON_EXIT
        executable = hello.bat 
        output = hello.out$(PROCESS)
        error = hello.err$(PROCESS)
        log = hello.log$(PROCESS)
        requirements = ( Arch=="X86_64") && ( OpSys=="WINDOWS" )
        notification = never
        queue 2
        
    Clearly it is a good deal more complicated. Note in particular the $(PROCESS) macro: for queue 2 it expands to 0 and 1, which is how the per-job output, error and log filenames are generated.

  6. Once all of the jobs in the cluster have completed, multiple output, error and log files will have been created e.g.
        $ ls -l hello.*
         total 112
         -rw-------   1 fred  csd           81 May 20 15:16 hello.bat
         -rw-------   1 fred  csd            0 May 20 15:18 hello.err0
         -rw-------   1 fred  csd            0 May 20 15:18 hello.err1
         -rw-------   1 fred  csd          622 May 20 15:18 hello.log0
         -rw-------   1 fred  csd         1038 May 20 15:18 hello.log1
         -rw-------   1 fred  csd          353 May 20 15:18 hello.out0
         -rw-------   1 fred  csd          353 May 20 15:18 hello.out1
       
    By examining these it is possible to see where the jobs ran. Depending on the current pool state, this may have been on different machines.

  7. It is easy to extend this to large numbers of jobs (possibly thousands). If time permits, change the number of jobs to be submitted using, say:
        $ mws_submit multiple -total_jobs=5
        
    so that now five jobs will be submitted. Any of the attributes in the simplified job description can be overridden temporarily using command line options in this way. If you prefer to edit the file directly, ned is recommended as an easy-to-use editor:
        $ ned multiple
        
  8. Condor jobs can be removed from the queue at any time using the condor_rm command. To remove a cluster of jobs, just give the cluster ID part of the Job ID e.g.
        $ condor_rm 129540
        
    or to remove an individual job, add its process ID e.g.
        $ condor_rm 129540.3
        
    To remove all of your jobs (use with care !):
        $ condor_rm -all
        
    Condor will attempt to remove jobs "gracefully" but you can speed things along by supplying the -f (force) option to any of the above commands. You may need to submit the previous example again in order to have some jobs to remove.
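
    Two related commands, not needed for these examples but worth knowing about, are condor_hold, which moves jobs into the held (H) state, and condor_release, which returns held jobs to the queue, e.g.
        $ condor_hold 129540
        $ condor_release 129540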

Working with MATLAB Applications

  1. This section gives examples of how to use Condor to run MATLAB based jobs. If you are more interested in R applications you can skip over this to the next section.

    A number of Condor additions have been provided locally to help with running MATLAB applications on the pool. These examples use an extremely simple application. Consider that we want to form the sum of p matrix-matrix products:

         S = A_1*B_1 + A_2*B_2 + ... + A_p*B_p

    where the A_i and B_i (and each partial product C_i = A_i*B_i) are square matrices of order n. It is easy to see that the p matrix-matrix products could be calculated independently and therefore potentially in parallel. To locate the first example files, change directory:
        $ cd ../build
        
    The M-file, product.m, is extremely simple and just loads the input matrices, A and B, from the input file (input.mat), calculates the product and saves the product, C, to an output file (output.mat):
        $ cat product.m 
        function product 
            load input.mat;
            C=A*B;
            save( 'output.mat', 'C' ); 
            quit;
        
    The input file, input.mat, contains two small dense matrices (A and B) of order n=100.

  2. MATLAB is available on the Condor server (although it should be used sparingly) and can be used as a quick sanity check to ensure that the M-files work properly. To run the M-file without the graphical interface use:
        $ matlab_run product.m
        
    The output.mat file should be created by this:
        $ ls -l
        total 496
        -rw-------   1 fred  csd       151507 May 21 10:27 input.mat
        -rw-------   1 fred  csd        74617 May 21 10:31 output.mat
        -rw-------   1 fred  csd           44 May 21 10:31 product
        -rw-------   1 fred  csd           86 May 21 10:31 product.m
        
  3. To run MATLAB jobs on the Condor pool, the M-file needs to be compiled into a standalone executable which will run without the need for a MATLAB license. The standalone executable can be built using the Condor pool and without the need to have Windows on your desktop machine (another way is to use Apps Anywhere). To build the executable, just submit a 'build' job using:
        $ matlab_build product.m
        
    When the job completes, a standalone executable called product.exe should have been created. To use this with the next example, copy it over to another directory (if time is short don't worry - a standalone executable has already been put there):
         $ cp product.exe ../matlab
        
  4. To see how to run concurrent MATLAB jobs using standalone executables, change to the next example directory:
         $ cd ../matlab
        
    This now contains five input files containing the matrices to be multiplied:
         $ ls -l input*.mat
         -rw-------   1 fred  csd       151507 May 21 11:45 input0.mat
         -rw-------   1 fred  csd       151507 May 21 11:45 input1.mat
         -rw-------   1 fred  csd       151507 May 21 11:45 input2.mat
         -rw-------   1 fred  csd       151507 May 21 11:45 input3.mat
         -rw-------   1 fred  csd       151507 May 21 11:45 input4.mat
        
    as well as a simplified job description file:
         $ cat product
         executable=product.exe
         indexed_input_files=input.mat
         indexed_output_files=output.mat
         indexed_stdout=product.out
         indexed_stderr=product.err
         indexed_log=product.log
         total_jobs=5
        
    Note that now we have specified the output files produced by each job using the indexed_output_files attribute.

  5. To submit the jobs use:
         $ matlab_submit product
         
    It is worth taking a look at the underlying Condor submission file which is created by this:
         $ cat product.sub
         universe = vanilla
         should_transfer_files = YES
         when_to_transfer_output = ON_EXIT
         executable = product.bat 
         arguments = product.exe $(PROCESS) input output
         output = product.out$(PROCESS)
         error = product.err$(PROCESS)
         log = product.log$(PROCESS)
         transfer_input_files = product.exe.manifest, product.exe, input$(PROCESS).mat
         transfer_output_files = output$(PROCESS).mat
         requirements = ( Arch=="X86_64") && ( OpSys=="WINDOWS" )
         notification = never
         queue 5
        
    There is a good deal more going on here than is immediately apparent from the simplified description file. In particular note the executable and arguments attributes. These ensure that a wrapper script (product.bat) - created automatically - sorts out the filename indexing so there is no need for the user to do this in the original M-file.

  6. Once all of the jobs have completed, five output files should have been created
         $ ls -l output*.mat
         -rw-------   1 fred  csd        74536 May 21 12:04 output0.mat
         -rw-------   1 fred  csd        74536 May 21 12:03 output1.mat
         -rw-------   1 fred  csd        74536 May 21 12:04 output2.mat
         -rw-------   1 fred  csd        74536 May 21 12:04 output3.mat
         -rw-------   1 fred  csd        74536 May 21 12:05 output4.mat
        
    The final step is just to sum the partial products using another M-file:
         $ cat prodsum.m 
         function prodsum 
            S = zeros(100);
            for index=0:4 
               filename = strcat( 'output', int2str( index ) );
               load( filename );
               S=S+C; 
            end
            save( 'sum.mat', 'S' );
         quit;
        
    This can be run on the Condor server:
         $ matlab_run prodsum.m
        
    The final sum will be stored in sum.mat.

Working with R Applications

  1. A number of Condor additions have been provided locally to help with running R applications on the pool. This example uses an extremely simple problem.

    Consider the case where observers are stationed on a number of roads and record the colour of cars which pass by them. Unfortunately the observers have not come across the idea of a tally chart and just record the colour of each car in a list as one of:
     red, white, blue, green, black, pink
        
    The problem is to combine this data in order to calculate the relative frequency of each car colour. It is easy to see that the frequency distributions for each observer's data could be calculated independently and therefore potentially in parallel.

  2. To see how to run the analysis concurrently on the Condor pool, change to the next example directory:
         $ cd ../r_example
        
    This contains ten input files containing the data recorded by the observers and stored in 'native' R format:
          $ ls -l observations*.dat
          
          -rw-r--r-- 1 fred csd 5458439 Oct 30 15:29 observations0.dat
          -rw-r--r-- 1 fred csd 5456760 Oct 30 15:29 observations1.dat
          -rw-r--r-- 1 fred csd 5456971 Oct 30 15:29 observations2.dat
          -rw-r--r-- 1 fred csd 5456489 Oct 30 15:29 observations3.dat
          -rw-r--r-- 1 fred csd 5458133 Oct 30 15:29 observations4.dat
          -rw-r--r-- 1 fred csd 5458496 Oct 30 15:29 observations5.dat
          -rw-r--r-- 1 fred csd 5457296 Oct 30 15:29 observations6.dat
          -rw-r--r-- 1 fred csd 5456744 Oct 30 15:29 observations7.dat
          -rw-r--r-- 1 fred csd 5457377 Oct 30 15:29 observations8.dat
          -rw-r--r-- 1 fred csd 5458846 Oct 30 15:29 observations9.dat
         
    There are actually a million observations in each file, although the analysis script does not need to be aware of this; in fact, the number of observations could be different for each file.

    The analyse.R file is the R script which will be run on the Condor pool and contains the following:

         
         $ cat analyse.R     
          load("observations.dat")
          observf <- factor( observed_colours )
          observt <- table( observf )
          
          save( observt, file="results.dat" )
          
    The script loads the input data from observations.dat, calculates the frequency table for it and stores this to an output file called results.dat. Note that there is no need to worry about the filename indexing - this is automatically taken care of. There is also a simplified job submission file (run_analysis):
          $ cat run_analysis 
          R_script = analyse.R
          indexed_input_files = observations.dat
          indexed_output_files = results.dat
          indexed_log = log
          indexed_stdout = output
          indexed_stderr = error
          total_jobs = 10
        
    As in the MATLAB example, the output file produced by each job is specified using the indexed_output_files attribute.

  3. To submit the jobs use:
         $ r_submit run_analysis 
         
    It is worth taking a look at the underlying Condor submission file which is created by this:
          universe = vanilla
          should_transfer_files = YES
          when_to_transfer_output = ON_EXIT
          executable = analyse.bat 
          arguments = analyse.R $(PROCESS) observations.dat results.dat
          output = output$(PROCESS)
          error = error$(PROCESS)
          log = log$(PROCESS)
          transfer_input_files = analyse.R, /opt1/condor/apps/matlab/index.exe, /opt1/condor/apps/matlab/unindex.exe, /opt1/condor/apps/matlab/dummy.exe, observations$(PROCESS).dat
          transfer_output_files = results$(PROCESS).dat
          requirements = ( Arch=="X86_64") && ( OpSys=="WINDOWS" )
          notification = never
          queue 10
    
        
    There is a good deal more going on here than is immediately apparent from the simplified description file. In particular note the executable and arguments attributes. These ensure that a wrapper script (analyse.bat) - created automatically - sorts out the filename indexing so there is no need for the user to do this in the original R script.

  4. Once all of the jobs have completed, ten output files should have been created
          $ ls -l results*.dat
          -rw-r--r-- 1 fred csd 190 Oct 30 16:09 results0.dat
          -rw-r--r-- 1 fred csd 191 Oct 30 16:10 results1.dat
          -rw-r--r-- 1 fred csd 190 Oct 30 16:09 results2.dat
          -rw-r--r-- 1 fred csd 191 Oct 30 16:10 results3.dat
          -rw-r--r-- 1 fred csd 191 Oct 30 16:10 results4.dat
          -rw-r--r-- 1 fred csd 189 Oct 30 16:10 results5.dat
          -rw-r--r-- 1 fred csd 191 Oct 30 16:10 results6.dat
          -rw-r--r-- 1 fred csd 191 Oct 30 16:10 results7.dat
          -rw-r--r-- 1 fred csd 190 Oct 30 16:10 results8.dat
          -rw-r--r-- 1 fred csd 191 Oct 30 16:10 results9.dat
        
    The final step would be to download the partial results files (see below) and combine them using a script such as this:
          load( "results0.dat" )
          global_observt <- observt
          
          for( i in 1:9 )
          {
            filename = paste( "results", i, ".dat", sep="" )
            load( filename )
            global_observt <- global_observt + observt 
          }
          
          freq_data <- transform( global_observt, relative = prop.table(Freq) )
                
          plot( freq_data$observf, freq_data$relative )
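
    (This is presumably the purpose of the collect.R file supplied in the example directory.) If R is installed on your own machine, the combining script can be run non-interactively with, for example:
        $ Rscript collect.R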
        

Retrieving the output

  1. In real applications, Condor is likely to generate large numbers of output files which will generally need to be transferred elsewhere for post-processing or archiving. Rather than transfer each individual file, it is far easier to bundle the output files into a single ZIP format file. To continue with the previous MATLAB example, create a ZIP bundle of the outputs using
        $ zip output.zip output*.mat
        
    The R example could use:
        $ zip results.zip results*.dat
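
    Before downloading a bundle, you can check its contents with, for example:
        $ unzip -l output.zip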
        
  2. The easiest way to download the files to your MWS PC or M: drive is to use CoreFTP Lite. If this is not already available on the desktop you will need to install it as follows:

    Click "Start" | "Install University Applications"

    Select "Internet" from the "Category" drop down menu

    Select "Core FTP LE21 Install" from the program list

    Click "Run"

    Enter condor.liv.ac.uk as the hostname and login with your MWS username and password, ensuring that the connection type is set to SSH/SFTP. In the right-hand pane navigate to the directory on the Condor server containing the output (i.e. /condor_data/<your_username>/matlab) and in the left-hand pane navigate to a directory for which you have write permission, e.g. on the M: drive. Select output.zip and then click the left-pointing arrow. The ZIP file should now have been downloaded; you can double-click on it in Windows Explorer to extract the files. The reverse procedure can be used to upload large numbers of input files.
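
    If you have a command-line SSH client available, an alternative to CoreFTP Lite is scp. A sketch, run from your own machine rather than the server, might look like:
        $ scp <your_username>@condor.liv.ac.uk:/condor_data/<your_username>/matlab/output.zip .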