<
Cloud Computing with HTCondor

Cloud Computing with HTCondor



1. What is Cloud Computing ?

2. The AWS HTCondor Pool
2.1 Basic Operation
2.2 R Applications
2.3 MATLAB Applications
2.4 Python Applications
2.5 Generic Applications

1. What is Cloud Computing ?

To begin with it is probably worth explaining a little of what is meant by "cloud computing" as there seem to be as many definitions as there are shapes of real clouds themselves. Many people will be familiar with cloud computing via commercial services such as Dropbox, Google Drive and iCloud. These allow users to archive and share data with other users in a transparent and seamless manner. The actual physical location of the data and the way in which it is accessed is usually unknown and probably of no real interest to the average user who just expects it to be always acessible on-demand. A useful analogy to cloud computing is the mains electricity supply. We expect it just be there 24/7 when we need it without worrying about how the electricity was produced or how it got to our homes. The same is true of the cloud computing - it is just there when we need it although, like the electricity supply, this comes at a cost.

The services described above are all examples of data clouds however there are other resources available called computing clouds. These allow users to run programs on remote computers rather than just store data on them. Here the term "programs" is deliberately vague and encompasses almost any application software including database engines, mail servers and web servers. Of more interest to researchers is the fact that a computing cloud can be used to efficiently run the kinds of computationally demanding software used extensively in science, engineering and medical applications. With cloud computing, these are usually serial applications however it is possible to provision an entire computing cluster (similar to the new barkla cluster) to run truly parallel codes "in the cloud".

There are a wide variety of cloud resources on offer from vendors such as IBM and Microsoft and the Research IT term have set up a pilot service service using Amazon Web Services (AWS) cloud. This has a monthly credit of a few thousand dollars. AWS started life as part of Amazon' online retail business but now can be considered a separate entity. It provides access to a huge amount of online services including computational resources (called EC2) and data storage facilities (called S3). The elastic computing cloud (EC2) service comprises a large amount of hardware from small single core virtual machines to large memory multi-core servers running a variety of Linux and Windows operating systems.

Users can create their own AWS "instances" (essentially virtual computers in the cloud) then login to them and run applications on them. From a research point of view, what is more useful is the ability to run multiple jobs in the cloud in a similar way to the MWS HTCondor Pool or clusters such as barkla. This is is where HTCondor comes in as it provides an efficient way of running multiple serial jobs on remote resources which may not be permanently available.

The HTCondor server (condor.liv.ac.uk) provides an easy-to-use way of accessing resources available through cloud computing as well as the MWS Pool. This can provide access to wide a range of dedicated hardware whose capabilities far exceed those of the standard MWS HTCondor Pool. There are two important points to make here:

The are number of other advantages to using the AWS HTCondor pool over the MWS HTCondor pool (and the barkla cluster to some extent), namely:

There is a one disadvantage that prevents the AWS HTCondor pool being used as a direct alternative to the MWS HTCondor pool and that is cost. During a farily typical month, approximately half a million core hours were clocked up on the MWS HTCondor Pool which would have cost at the very least $5,000 on the AWS Cloud. This is based on a cost of $0.01 per core hour and most instances cost at least a few times more than this. The AWS HTCondor service is therefore best suited to relatively small numbers of resource intensive or specialist serial jobs.

2. The AWS HTCondor Pool

The AWS HTCondor pool is only in its infancy and at the moment is more of an experimental setup which can adapted to suit individual users. We have deliberately limited usage to a small number of low specification machines to prevent users accidentally eating up large chunks of the AWS credits (this may sound a lot but could easily be spent on large scale HTCondor applications). In particular, the pool makes use of "t2.micro" instances which are single core, have 1 GB of memory and have 8 GB of disk storage.

The instances run Amazon's own version of Linux (Amazon AMI) which is based closely on Red Hat 6. Each instance has R version 3.6.1 pre-installed (including all of the libraries used on HTCondor) and the MATLAB 2019a component runtime.

These instances should be capable of running very simple jobs however more demanding problems will probably need more powerful instances which the Research IT team can set up for you (email: arc-support<@>liverpool<.>ac.<.>uk). There are also a variety of operating systems available. To see a list of instances go to: https://aws.amazon.com/ec2/instance-types/. A list of the current pricing can be found at https://aws.amazon.com/ec2/pricing/on-demand/.

2.1 Basic Operation

The commands used with AWS HTCondor are directly equivalent to those used for the standard MWS HTCondor pool with condor replaced by aws. To see a list of available resources use the command:

$ aws_status

The output from this will generally be blank if there are no active jobs being processed as machine instances are only created on demand. However, there are a number of "dummy" offline instances which are used internally by HTCondor to "spin-up" instances on demand. These can be listed using:

$ aws_status_off
e.g.
$ aws_status_off
Name                                                     OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime  

slot1@t2_micro_offline1.eu-west-2.compute.amazonaws.com  LINUX      X86_64 Unclaimed Idle      0.000 810218278+10:43:36
slot1@t2_micro_offline2.eu-west-2.compute.amazonaws.com  LINUX      X86_64 Unclaimed Idle      0.000 810218278+10:43:36
slot1@t2_micro_offline3.eu-west-2.compute.amazonaws.com  LINUX      X86_64 Unclaimed Idle      0.000 810218278+10:43:36
...
Most users need not be aware of these.

To list all of the jobs in the queue use:

$ aws_q 
e.g.
$ aws_q
-- Schedd: AWS@condor1.liv.ac.uk : <138.253.100.17:31608?... @ 03/28/18 11:52:45
OWNER   BATCH_NAME         SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
smithic CMD: wrapper.sh   3/27 15:40      _      2      _      2 739.0 ... 744.0

2 jobs; 0 completed, 0 removed, 0 idle, 2 running, 0 held, 0 suspended
You can list individual jobs using the -nobatch option:
$ aws_q -nobatch
-- Schedd: AWS@condor1.liv.ac.uk : <138.253.100.17:31608?... @ 03/28/18 11:51:56
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
 739.0   smithic         3/27 15:40   0+20:11:25 R  0   733. wrapper.sh analyseLiverCP.R 0 observationsLiver.dat result
 744.0   smithic         3/28 10:01   0+01:50:08 R  0   733. wrapper.sh analyseLiverCP.R 0 observationsLiver.dat result

2 jobs; 0 completed, 0 removed, 0 idle, 2 running, 0 held, 0 suspended

To submit jobs, you will first need to create a job submission file e.g.

$ cat whathost.sub 
universe = vanilla
when_to_transfer_output=ON_EXIT
executable = /bin/hostname
output = output$(PROCESS)
error = error$(PROCESS)
requirements = ( Arch=="X86_64") && ( OpSys=="LINUX" )
notification = never
log = logfile
queue 10
Example files can be found in /opt1/condor/examples/aws on the HTCondor server.

This will run 10 jobs (specified in the queue attribute) on the AWS Cloud. The standard output and error are directed to individual files according to their Job-ID viz output* and error* respectively. $(PROCESS) performs a similar role to $SLURM_ARRAY_TASK_ID in SLURM on barkla in that it takes on different values for each job task (called a "process" in HTCondor) in the range 0..N-1.

The optional log attribute specifies a logfile where progress information will be written. The executable to be run on the remote machine is given in the executable attribute (note this will be the hostname executable copied from the HTCondor server rather than the one pre-installed on the host). Most users will find it much easier to use the simplified job submission commands described in the following sections.

To submit the job(s) use:
$ aws_submit whathost.sub
and the command will return with a Job-ID.

To list the queued jobs use:
$ aws_q

The job may wait in the idle state (denoted by "I") for several minutes as the AWS cloud resources are only made available on-demand (in order to save money). What happens in the background is that every five minutes the HTCondor scheduler will check if there are jobs that need running and start up some cloud instances to meet the demand. Once there are no more jobs to run, the instances should fall idle and will automatically power down after an idle period an hour. This should give users enough time to correct any errors and resubmit their jobs before the machines power-down. On the next submission jobs should start to run much more quickly.

Eventually the cloud machines will start up and the jobs should quickly run (denoted by the "R" state). On completion the error* files should be empty whereas each of the output* files should contain a hostname similar to:

ip-10-0-0-88.eu-west-2.compute.internal

You can look in the log file to see how long the jobs took to run.

The executable can also be a shell script containing commands to run on the remote machine. Here's one such example which will sleep for five minutes before runing the (local) hostname command - this should make it easier to see the jobs actually running.

$ cat sleeper.sub 
universe = vanilla
when_to_transfer_output=ON_EXIT
executable = myscript.sh
output = output$(PROCESS)
error = error$(PROCESS)
requirements = ( Arch=="X86_64") && ( OpSys=="LINUX" )
notification = never
log = logfile
queue 10

$ cat myscript.sh 
#!/bin/bash 
sleep 300
myhostname=`hostname` 
echo "hostname is $myhostname"

To submit the job use:

$ aws_submit sleeper.sub 
and again aws_q can be used to track the progress of the jobs.

2.2 R Applications

Jobs that make use of the R programming language can be submitted using simplified job submission commands that make the whole process easier. Here is an example of a simplifed job submission file:

$ cat run_analysis
R_script = analyse.R
indexed_input_files = observations.dat
indexed_output_files = results.dat
indexed_log = log
total_jobs = 10

There are ten input files observations0.dat, observations1.dat ... observations9.dat which are processed by the analyse.R R script to give ten output files results0.dat, results1.dat ... results9.dat. You can find all of the required files for this in /opt1/condor/examples/aws/R on the HTCondor server and the full background to the example can be found at: http://condor.liv.ac.uk/r_apps.

To submit the jobs use:

$ aws_r_submit run_analysis

and you can then can track their progress using aws_q.

The standard error and output from the jobs can be directed to files using the indexed_stdout and indexed_stderr attributes respectively e.g.

indexed_stdout = my_output
indexed_stderr = my_errors

This would direct the standard error and output from each individual job process to a separate file. Where the application uses multiple R files containing functions called from the main one, these should be specified as common input files using the input_files attribute e.g.

input_files = my_function1.R, my_funtion2.R

Data that is common to all jobs can also be stored in common input files e.g.

input_files = common_data1.RData, common_data2.csv

2.3 MATLAB Applications

Jobs that make use of MATLAB can be submitted using simplified job submission commands that make the whole process easier however M-files must be compiled into a standalone executable first which will run on the AWS cloud. Fortunately this can be created on the HTCondor server itself. First, start up a MATLAB session using:

$ matlab_run

Then invoke the MATLAB compiler using:

>> mcc -mv myscript.m

Where myscript.m should be replaced with your name of your M-file. This should eventually produce a standalone executable with the same filename as the M-file minus the .m "extension".

If your application uses other M-files containing functions used by the main one, place all of these in an empty directory called for example dependencies and use a command similar to:

>> mcc -mv myscript.m -a dependencies

Once you have finished, type "exit" to quit MATLAB session.

An example of a simplified MATLAB job submission file is:

$ cat run_product
indexed_input_files = input.mat
indexed_output_files = output.mat
executable = product
log = logfile
total_jobs = 10

There are ten input files input0.mat, input1.mat ... input9.mat which are processed by the product executable to give ten output files output0.mat, output1.mat ... output9.mat. You can find all of the required files for this in /opt1/condor/examples/aws/matlab on the HTCondor server and the full background to the example can be found at: http://condor.liv.ac.uk/matlab

To submit the jobs:

$ aws_matlab_submit run_product

and you can then can track their progress using aws_q.

The standard error and output from the jobs can be directed to files using the indexed_stdout and indexed_stderr attributes respectively e.g.

indexed_stdout = my_output
indexed_stderr = my_errors

This would direct the standard error and output from each individual job process to a separate file.

Data that is common to all jobs can be stored in common input files e.g.

input_files = common_data1.mat, common_data2.mat

2.4 Python Applications

Jobs that make use of the Python programming language can be submitted using simplified job submission commands that make the whole process easier. Note that the default version of Python on the AWS HTCondor pool is 3.7.4. Here is an example of a simplifed job submission file:

$ cat run_product 
python_script = product.py
indexed_input_files = input.dat
indexed_output_files = output.dat
log = log.txt
total_jobs = 10

There are ten input files input0.dat, input1.dat ... input9.dat which are processed by the product.py Python script to give ten output files output0.dat, output1.dat ... output9.dat. You can find all of the required files for this in /opt1/condor/examples/aws/python on the HTCondor server and the full background to the example can be found at: http://condor.liv.ac.uk/python.

To submit the jobs use:

$ aws_python_submit run_product

and you can then can track their progress using aws_q.

The standard error and output from the jobs can be directed to files using the indexed_stdout and indexed_stderr attributes respectively e.g.

indexed_stdout = my_output
indexed_stderr = my_errors

This would direct the standard error and output from each individual job process to a separate file. Where the application uses multiple Python files containing functions called from the main one, these should be specified as common input files using the input_files attribute e.g.

input_files = my_module1.py, my_module2.py

Data that is common to all jobs can also be stored in common input files e.g.

input_files = common_data1.dat, common_data2.json

2.5 Generic Applications

If you have your own (or third party) application software that will run on an Amazon AMI instance then you can submit jobs based on this using another simplified job submission command. Here is an example of a simplified job submission file for the hostname example given above:

$ cat gethost
executable = /bin/hostname
indexed_stdout = output
indexed_stderr = error
log = logfile
total_jobs = 10
To submit the jobs you would use the aws_array_submit command:
$ aws_array_submit gethost
Submitting job(s)..........
10 job(s) submitted to cluster 757.

and you can then use aws_q to monitor their progress. This would submit ten job processes and write the standard output and standard error to separate files. You can replace the hostname executable with your own.

Where you need to access input and output files, you will need to pass these as arguments to the executable or shell script. The INDEX placeholder can be used to tag the files with indices in 0..N-1 as in this example which simply copies the input file to the output:

$ cat copyfile
executable = /bin/cp
arguments = inputINDEX.txt outputINDEX.txt
indexed_input_files =  input.txt
indexed_output_files =  output.txt
indexed_stdout = output
indexed_stderr = error
log = logfile
total_jobs = 10

[smithic@condor1 test]$ aws_array_submit copyfile
Submitting job(s)..........
10 job(s) submitted to cluster 758.

On completion, the ten output files output0.txt, output1.txt ... output9.txt should have the same contents as the input files input0.txt, input1.txt ... input9.txt which is the correspinding index value 0..9. You could replace /bin/cp with your own executable or shell script "wrapper".

The files for these examples can be can be found in /opt1/condor/examples/generic on the HTCondor server.



last updated on Thursday, 21-Sep-2023 11:27:02 BST by I.C. Smith

Research IT RSE, University of Liverpool, Liverpool L69 3BX, United Kingdom, +44 (0)151 794 2000