Cloud Computing with HTCondor
1. What is Cloud Computing?
2. The AWS HTCondor Pool
2.1 Basic Operation
2.2 R Applications
2.3 MATLAB Applications
2.4 Python Applications
2.5 Generic Applications
1. What is Cloud Computing?
To begin with, it is probably worth explaining a little of what is meant by "cloud computing", as there seem to be as many definitions as there are shapes of real clouds themselves. Many people will be familiar with cloud computing via commercial services such as Dropbox, Google Drive and iCloud. These allow users to archive and share data with other users in a transparent and seamless manner. The actual physical location of the data and the way in which it is accessed is usually unknown, and probably of no real interest to the average user, who just expects it to be always accessible on-demand. A useful analogy to cloud computing is the mains electricity supply. We expect it just to be there 24/7 when we need it, without worrying about how the electricity was produced or how it got to our homes. The same is true of cloud computing - it is just there when we need it although, like the electricity supply, this comes at a cost.
The services described above are all examples of data clouds; however, there are other resources available called computing clouds. These allow users to run programs on remote computers rather than just store data on them. Here the term "programs" is deliberately vague and encompasses almost any application software, including database engines, mail servers and web servers. Of more interest to researchers is the fact that a computing cloud can be used to efficiently run the kinds of computationally demanding software used extensively in science, engineering and medical applications. With cloud computing these are usually serial applications; however, it is possible to provision an entire computing cluster (similar to the new barkla cluster) to run truly parallel codes "in the cloud".
There is a wide variety of cloud resources on offer from vendors such as IBM and Microsoft, and the Research IT team have set up a pilot service using the Amazon Web Services (AWS) cloud. This has a monthly credit of a few thousand dollars. AWS started life as part of Amazon's online retail business but can now be considered a separate entity. It provides access to a huge range of online services including computational resources (called EC2) and data storage facilities (called S3). The Elastic Compute Cloud (EC2) service comprises a large range of hardware, from small single-core virtual machines to large-memory multi-core servers, running a variety of Linux and Windows operating systems.
Users can create their own AWS "instances" (essentially virtual computers in the cloud), then log in to them and run applications on them. From a research point of view, what is more useful is the ability to run multiple jobs in the cloud in a similar way to the MWS HTCondor Pool or clusters such as barkla. This is where HTCondor comes in, as it provides an efficient way of running multiple serial jobs on remote resources which may not be permanently available.
The HTCondor server (condor.liv.ac.uk) provides an easy-to-use way of accessing resources available through cloud computing as well as the MWS Pool. This can provide access to a wide range of dedicated hardware whose capabilities far exceed those of the standard MWS HTCondor Pool. There are two important points to make here:
- (MWS HTCondor users note:) the AWS HTCondor service provides access to dedicated computing resources that will be available for the lifetime of the job, and there is almost no chance of a job being evicted part way through execution. It is therefore much more efficient than the MWS HTCondor Pool in terms of throughput, since jobs are not interrupted by ordinary teaching centre users.
- (Cluster users note:) in the AWS HTCondor pool, HTCondor just provides a simple way of submitting multiple serial jobs in the same way as job arrays using SLURM on barkla. Serial jobs can therefore be easily migrated from barkla to the AWS HTCondor Pool, and the simplified job submission commands (e.g. matlab_submit, r_submit) work in the same way.
There are a number of other advantages to using the AWS HTCondor pool over the MWS HTCondor pool (and the barkla cluster to some extent), namely:
- There are no restrictions on job run times although jobs running for several months do run a risk of a failure in the underlying infrastructure.
- There is a huge variety of processor, memory and storage options available, meaning that jobs can be run on the most appropriate platform, unconstrained by the limits of the commodity PCs used in the MWS HTCondor Pool.
- A variety of operating systems are available including several flavours of Linux (UNIX) and Windows.
- Third party application software can be pre-installed in bespoke cloud instances with no need to do this on-the-fly as with the MWS HTCondor Pool.
There is one disadvantage that prevents the AWS HTCondor pool being used as a direct alternative to the MWS HTCondor pool, and that is cost. During a fairly typical month, approximately half a million core hours were clocked up on the MWS HTCondor Pool, which would have cost at the very least $5,000 on the AWS Cloud. This is based on a cost of $0.01 per core hour, and most instances cost at least a few times more than this. The AWS HTCondor service is therefore best suited to relatively small numbers of resource-intensive or specialist serial jobs.
2. The AWS HTCondor Pool
The AWS HTCondor pool is only in its infancy and at the moment is more of an experimental setup which can be adapted to suit individual users. We have deliberately limited usage to a small number of low-specification machines to prevent users accidentally eating up large chunks of the AWS credits (these may sound a lot but could easily be spent on large-scale HTCondor applications). In particular, the pool makes use of "t2.micro" instances, which are single core, have 1 GB of memory and have 8 GB of disk storage.
The instances run Amazon's own version of Linux (Amazon AMI) which is based closely on Red Hat 6. Each instance has R version 3.6.1 pre-installed (including all of the libraries used on HTCondor) and the MATLAB 2019a component runtime.
These instances should be capable of running very simple jobs; however, more demanding problems will probably need more powerful instances which the Research IT team can set up for you (email: arc-support<@>liverpool<.>ac<.>uk). There is also a variety of operating systems available. To see a list of instance types go to: https://aws.amazon.com/ec2/instance-types/. A list of the current pricing can be found at https://aws.amazon.com/ec2/pricing/on-demand/.
2.1 Basic Operation
The commands used with AWS HTCondor are directly equivalent to those used for the standard MWS HTCondor pool with condor replaced by aws. To see a list of available resources use the command:
$ aws_status
The output from this will generally be blank if there are no active jobs being processed as machine instances are only created on demand. However, there are a number of "dummy" offline instances which are used internally by HTCondor to "spin-up" instances on demand. These can be listed using:
$ aws_status_off

e.g.

$ aws_status_off

Name                                                     OpSys   Arch    State      Activity  LoadAv  Mem  ActvtyTime
slot1@t2_micro_offline1.eu-west-2.compute.amazonaws.com  LINUX   X86_64  Unclaimed  Idle      0.000   810  218278+10:43:36
slot1@t2_micro_offline2.eu-west-2.compute.amazonaws.com  LINUX   X86_64  Unclaimed  Idle      0.000   810  218278+10:43:36
slot1@t2_micro_offline3.eu-west-2.compute.amazonaws.com  LINUX   X86_64  Unclaimed  Idle      0.000   810  218278+10:43:36
...

Most users need not be aware of these.
To list all of the jobs in the queue use:
$ aws_q

e.g.

$ aws_q

-- Schedd: AWS@condor1.liv.ac.uk : <138.253.100.17:31608?... @ 03/28/18 11:52:45
OWNER    BATCH_NAME        SUBMITTED   DONE   RUN   IDLE  TOTAL  JOB_IDS
smithic  CMD: wrapper.sh   3/27 15:40     _     2      _      2  739.0 ... 744.0

2 jobs; 0 completed, 0 removed, 0 idle, 2 running, 0 held, 0 suspended

You can list individual jobs using the -nobatch option:
$ aws_q -nobatch

-- Schedd: AWS@condor1.liv.ac.uk : <138.253.100.17:31608?... @ 03/28/18 11:51:56
 ID      OWNER     SUBMITTED   RUN_TIME    ST  PRI  SIZE   CMD
739.0    smithic   3/27 15:40  0+20:11:25  R   0    733.   wrapper.sh analyseLiverCP.R 0 observationsLiver.dat result
744.0    smithic   3/28 10:01  0+01:50:08  R   0    733.   wrapper.sh analyseLiverCP.R 0 observationsLiver.dat result

2 jobs; 0 completed, 0 removed, 0 idle, 2 running, 0 held, 0 suspended
To submit jobs, you will first need to create a job submission file e.g.
$ cat whathost.sub

universe = vanilla
when_to_transfer_output = ON_EXIT
executable = /bin/hostname
output = output$(PROCESS)
error = error$(PROCESS)
requirements = ( Arch=="X86_64" ) && ( OpSys=="LINUX" )
notification = never
log = logfile
queue 10

Example files can be found in /opt1/condor/examples/aws on the HTCondor server.
This will run 10 jobs (specified in the queue attribute) on the AWS Cloud. The standard output and error from each job are directed to individual files, output* and error* respectively, numbered by process. $(PROCESS) performs a similar role to $SLURM_ARRAY_TASK_ID in SLURM on barkla in that it takes on a different value for each job task (called a "process" in HTCondor), in the range 0..N-1.
The optional log attribute specifies a logfile where progress information will be written. The executable to be run on the remote machine is given in the executable attribute (note this will be the hostname executable copied from the HTCondor server rather than the one pre-installed on the host). Most users will find it much easier to use the simplified job submission commands described in the following sections.
To submit the job(s) use:

$ aws_submit whathost.sub

and the command will return with a Job-ID.
To list the queued jobs use:
$ aws_q
The job may wait in the idle state (denoted by "I") for several minutes as the AWS cloud resources are only made available on-demand (in order to save money). What happens in the background is that every five minutes the HTCondor scheduler will check if there are jobs that need running and start up some cloud instances to meet the demand. Once there are no more jobs to run, the instances should fall idle and will automatically power down after an idle period of an hour. This should give users enough time to correct any errors and resubmit their jobs before the machines power down. On the next submission, jobs should start to run much more quickly.
Eventually the cloud machines will start up and the jobs should quickly run (denoted by the "R" state). On completion the error* files should be empty whereas each of the output* files should contain a hostname similar to:
ip-10-0-0-88.eu-west-2.compute.internal
You can look in the log file to see how long the jobs took to run.
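As a quick sanity check once the jobs have finished, you could inspect the per-process files from the submit directory. This is only a sketch; the grep assumes the standard "Job terminated" event lines that HTCondor writes to user log files:

$ cat output*                        # each file should contain one AWS-internal hostname
$ ls -l error*                       # these should all be zero length
$ grep -c "Job terminated" logfile   # should report 10 once every process has completed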
The executable can also be a shell script containing commands to run on the remote machine. Here's one such example which will sleep for five minutes before running the (local) hostname command - this should make it easier to see the jobs actually running.
$ cat sleeper.sub

universe = vanilla
when_to_transfer_output = ON_EXIT
executable = myscript.sh
output = output$(PROCESS)
error = error$(PROCESS)
requirements = ( Arch=="X86_64" ) && ( OpSys=="LINUX" )
notification = never
log = logfile
queue 10

$ cat myscript.sh

#!/bin/bash
sleep 300
myhostname=`hostname`
echo "hostname is $myhostname"
To submit the job use:
$ aws_submit sleeper.sub

and again aws_q can be used to track the progress of the jobs.
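While the processes are sleeping, you should be able to catch them in the running ("R") state by polling the queue. One simple way of doing this (assuming the standard Linux watch utility is available on the HTCondor server) is:

$ watch -n 30 aws_q -nobatch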
2.2 R Applications
Jobs that make use of the R programming language can be submitted using simplified job submission commands that make the whole process easier. Here is an example of a simplified job submission file:
$ cat run_analysis

R_script = analyse.R
indexed_input_files = observations.dat
indexed_output_files = results.dat
indexed_log = log
total_jobs = 10
There are ten input files observations0.dat, observations1.dat ... observations9.dat which are processed by the analyse.R R script to give ten output files results0.dat, results1.dat ... results9.dat. You can find all of the required files for this in /opt1/condor/examples/aws/R on the HTCondor server and the full background to the example can be found at: http://condor.liv.ac.uk/r_apps.
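Before submission, the contents of the submit directory for this example would therefore look something like this (file names taken from the description above):

$ ls
analyse.R   observations0.dat   observations1.dat  ...  observations9.dat   run_analysis

Once the jobs have completed, the output files results0.dat ... results9.dat should appear alongside them.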
To submit the jobs use:
$ aws_r_submit run_analysis
and you can then track their progress using aws_q.
The standard error and output from the jobs can be directed to files using the indexed_stdout and indexed_stderr attributes respectively e.g.
indexed_stdout = my_output
indexed_stderr = my_errors
This would direct the standard error and output from each individual job process to a separate file. Where the application uses multiple R files containing functions called from the main one, these should be specified as common input files using the input_files attribute e.g.
input_files = my_function1.R, my_function2.R
Data that is common to all jobs can also be stored in common input files e.g.
input_files = common_data1.RData, common_data2.csv
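Putting these options together, a fuller submission file might look like the following sketch (run_analysis_full, my_function1.R and common_data1.RData are purely illustrative names and are not part of the standard examples):

$ cat run_analysis_full

R_script = analyse.R
input_files = my_function1.R, common_data1.RData
indexed_input_files = observations.dat
indexed_output_files = results.dat
indexed_stdout = my_output
indexed_stderr = my_errors
indexed_log = log
total_jobs = 10

This would be submitted with aws_r_submit in exactly the same way as the simpler example above.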
2.3 MATLAB Applications
Jobs that make use of MATLAB can be submitted using simplified job submission commands that make the whole process easier; however, M-files must first be compiled into a standalone executable which will run on the AWS cloud. Fortunately this can be created on the HTCondor server itself. First, start up a MATLAB session using:
$ matlab_run
Then invoke the MATLAB compiler using:
>> mcc -mv myscript.m
where myscript.m should be replaced with the name of your M-file. This should eventually produce a standalone executable with the same filename as the M-file minus the .m "extension".
If your application uses other M-files containing functions used by the main one, place all of these in an empty directory called for example dependencies and use a command similar to:
>> mcc -mv myscript.m -a dependencies
Once you have finished, type "exit" to quit the MATLAB session.
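For reference, the whole compilation step for a single M-file called myscript.m (an illustrative name) might look like this; note that the compiler typically also writes a few auxiliary files (such as a run_myscript.sh wrapper) which are not needed for HTCondor:

$ matlab_run
>> mcc -mv myscript.m
>> exit
$ ls myscript*
myscript  myscript.m

The myscript executable is what you would then reference in the executable attribute of the submission file described below.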
An example of a simplified MATLAB job submission file is:
$ cat run_product

indexed_input_files = input.mat
indexed_output_files = output.mat
executable = product
log = logfile
total_jobs = 10
There are ten input files input0.mat, input1.mat ... input9.mat which are processed by the product executable to give ten output files output0.mat, output1.mat ... output9.mat. You can find all of the required files for this in /opt1/condor/examples/aws/matlab on the HTCondor server and the full background to the example can be found at: http://condor.liv.ac.uk/matlab.
To submit the jobs:
$ aws_matlab_submit run_product
and you can then track their progress using aws_q.
The standard error and output from the jobs can be directed to files using the indexed_stdout and indexed_stderr attributes respectively e.g.
indexed_stdout = my_output
indexed_stderr = my_errors
This would direct the standard error and output from each individual job process to a separate file.
Data that is common to all jobs can be stored in common input files e.g.
input_files = common_data1.mat, common_data2.mat
2.4 Python Applications
Jobs that make use of the Python programming language can be submitted using simplified job submission commands that make the whole process easier. Note that the default version of Python on the AWS HTCondor pool is 3.7.4. Here is an example of a simplified job submission file:
$ cat run_product

python_script = product.py
indexed_input_files = input.dat
indexed_output_files = output.dat
log = log.txt
total_jobs = 10
There are ten input files input0.dat, input1.dat ... input9.dat which are processed by the product.py Python script to give ten output files output0.dat, output1.dat ... output9.dat. You can find all of the required files for this in /opt1/condor/examples/aws/python on the HTCondor server and the full background to the example can be found at: http://condor.liv.ac.uk/python.
To submit the jobs use:
$ aws_python_submit run_product
and you can then track their progress using aws_q.
The standard error and output from the jobs can be directed to files using the indexed_stdout and indexed_stderr attributes respectively e.g.
indexed_stdout = my_output
indexed_stderr = my_errors
This would direct the standard error and output from each individual job process to a separate file. Where the application uses multiple Python files containing functions called from the main one, these should be specified as common input files using the input_files attribute e.g.
input_files = my_module1.py, my_module2.py
Data that is common to all jobs can also be stored in common input files e.g.
input_files = common_data1.dat, common_data2.json
2.5 Generic Applications
If you have your own (or third party) application software that will run on an Amazon AMI instance then you can submit jobs based on this using another simplified job submission command. Here is an example of a simplified job submission file for the hostname example given above:
$ cat gethost

executable = /bin/hostname
indexed_stdout = output
indexed_stderr = error
log = logfile
total_jobs = 10

To submit the jobs you would use the aws_array_submit command:
$ aws_array_submit gethost
Submitting job(s)..........
10 job(s) submitted to cluster 757.
and you can then use aws_q to monitor their progress. This would submit ten job processes and write the standard output and standard error to separate files. You can replace the hostname executable with your own.
Where you need to access input and output files, you will need to pass these as arguments to the executable or shell script. The INDEX placeholder can be used to tag the files with indices in 0..N-1 as in this example which simply copies the input file to the output:
$ cat copyfile

executable = /bin/cp
arguments = inputINDEX.txt outputINDEX.txt
indexed_input_files = input.txt
indexed_output_files = output.txt
indexed_stdout = output
indexed_stderr = error
log = logfile
total_jobs = 10

[smithic@condor1 test]$ aws_array_submit copyfile
Submitting job(s)..........
10 job(s) submitted to cluster 758.
On completion, the ten output files output0.txt, output1.txt ... output9.txt should have the same contents as the input files input0.txt, input1.txt ... input9.txt, namely the corresponding index value 0..9. You could replace /bin/cp with your own executable or shell script "wrapper".
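A quick way to verify the results once all ten processes have finished is to compare each output file with its corresponding input, for example with a simple shell loop run from the submit directory:

$ for i in $(seq 0 9); do diff input${i}.txt output${i}.txt && echo "process ${i} OK"; done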
The files for these examples can be found in /opt1/condor/examples/generic on the HTCondor server.