# HTCondor Case Studies

The following examples illustrate just a few projects which have benefited from using the Research IT HTCondor High Throughput Computing Service.

## Biostatistics projects

My current work focuses on computational methods for complex Bayesian models in large datasets, using Variational Bayes as an approximation technique to allow fast computation. This work is described in the 2021 Biostatistics paper listed below; HTCondor was used to run the simulations that tested our algorithms during development.

- Hughes, D.M., Garcia-Finana, M., & Wand, M.P. (2021) Fast approximate inference for multivariate longitudinal data. (Biostatistics, in press)

We developed longitudinal discriminant analysis methods which allow us to classify individuals into groups based on their disease history as measured by a set of longitudinal biomarkers (variables measured repeatedly over time). The methods are described and tested in the four papers listed below. HTCondor was vital in allowing us to run a large number of simulations quickly to test our approach.

- Hughes, D. M., Komárek, A., Czanner, G., & García-Fiñana, M. (2018) Dynamic Longitudinal discriminant analysis using multiple longitudinal markers of different types. Statistical Methods in Medical Research, 27(7) 2060-2080 (Received Honourable mention in 2017 Young Biometrician of the year award from the British and Irish Biometric Society and Fisher Memorial Trust)
- Hughes, D. M., Czanner, G., Bonnett, L.J., Komárek, A., & García-Fiñana, M. (2017) Dynamic classification using credible intervals in longitudinal discriminant analysis. Statistics in Medicine, 36(24) 3858-3874
- Hughes, D.M., El Saeiti, R. & García-Fiñana, M. (2017) A comparison of group prediction approaches in longitudinal discriminant analysis. Biometrical Journal, 60(2) 307-322
- El Saeiti, R., García-Fiñana, M. & Hughes, D.M. (2020) The effect of random-effects misspecification on classification accuracy (International Journal of Biostatistics in press)

We then applied these methods to a range of clinical areas, developing clinical prediction models to identify high-risk patients in the fields of liver cancer, diabetic retinopathy, epilepsy and necrotising enterocolitis in preterm babies. These are described in the papers below. HTCondor allowed us to test different models and settle on the best-performing ones.

- Hughes, D.M., Berhane, S., Degroot, C.A., Toyoda, H., Tada, T., Kumada, T., Stomura, S., Nishida, N., Kudo, M., Kimura, T., Osaki, Y., Kolamunage-Dona, R., Amoros-Salvador, R., Bird, T., Garcia-Fiñana, M. & Johnson, P. (2020) A validated risk stratification approach to hepatocellular carcinoma surveillance using serial AFP estimations. (Clinical Gastroenterology and Hepatology, in press)
- Probert, C.S., Greenwood, R., Mayor, A., Hughes, D.M., Aggio, R., Jackson, R., Simcox, E., Barrow, H., García-Fiñana, M. & Ewer, A.K. (2019) Faecal volatile organic compounds in preterm babies at risk of necrotising enterocolitis: the DOVE study (Archives of Disease in Childhood, available online)
- Hughes, D. M., Bonnett, L. J., Marson, A., & García-Fiñana, M. (2019) Identifying drug-resistance after breakthrough seizures. Epilepsia, 60(4) 774-782
- García-Fiñana, M., Hughes, D.M., Cheyne, C.P., Broadbent D.M., Wang, A., Komárek, A., Stratton, I.M., Mobayen-Rahni, M, Alshukri, A, Vora, J.P. & Harding, S.P. (2018) Prediction of sight threatening diabetic retinopathy: a multivariate prediction approach vs. the use of stratification rules. Diabetes, Obesity and Metabolism, 21(3), 560-568
- Hughes, D. M., Bonnett, L. J., Czanner, G., Komárek, A., Marson, A., & García-Fiñana, M. (2018) Early identification of patients who will not achieve seizure remission on AEDs within 5 years of starting treatment. Neurology 91(22) e2035-e2044

**Dr David Hughes, Department of Health Data Science**

## Uncertain Heterogeneous Algorithmic Teamwork

A PhD project in the EPSRC CDT in Distributed Algorithms, co-funded by EPSRC and IBM.

Commonly used state-of-the-art (SOTA) algorithms for numerical Bayesian inference (such as Markov Chain Monte Carlo [MCMC]) are inherently sequential and poorly suited to modern computing architectures. Sequential Monte Carlo (SMC) samplers, on the other hand, are capable of exploiting modern computing architectures and have the potential to outperform the current SOTA by several orders of magnitude.

Previous collaboration between the University of Liverpool, IBM Research and the STFC Hartree Centre has developed SMC samplers capable of exploiting homogeneous super-computing hardware through the Big Hypotheses research project. In spite of their performance, supercomputers are expensive and are not widely accessible: Uncertain Heterogeneous Algorithmic Teamwork (UHAT) aims to increase the accessibility of SMC samplers by distributing them on a collection of crowd-sourced commodity compute hardware. Thus far, the project has culminated in an asynchronous SMC sampler that interfaces with Stan (a Probabilistic Programming Language) and is distributed on the University of Liverpool's HTCondor pool.

**Matthew Carter, EPSRC Centre for Doctoral Training in Distributed Algorithms**

## Simulation of a Randomised Clinical Trial into Lowering Cholesterol

I used HTCondor to simulate a randomised clinical trial for the cholesterol-lowering drug simvastatin.
Using my own MATLAB code, I generated virtual patients whose cholesterol was then simulated
twenty times under a standard dosing protocol and twenty times under a new dosing protocol.
With ten patients, the number of jobs submitted to HTCondor totalled around 400. The
standard dosing protocol simulations took around 15 minutes and the new dosing protocol
simulations took around an hour and a half.
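
The job count follows from the design: 10 patients, 2 protocols and 20 repeats give 10 × 2 × 20 = 400 independent simulations. A minimal Python sketch of enumerating such a job list is shown here purely as an illustration; the original work used MATLAB, and the function name, tuple layout and protocol labels are hypothetical.

```python
from itertools import product

# Design values taken from the description above.
N_PATIENTS = 10
PROTOCOLS = ("standard", "new")  # hypothetical labels for the two dosing protocols
N_REPEATS = 20


def build_jobs(n_patients=N_PATIENTS, protocols=PROTOCOLS, n_repeats=N_REPEATS):
    """Return one (job_id, patient, protocol, repeat) tuple per independent simulation."""
    combos = product(range(n_patients), protocols, range(n_repeats))
    return [(job_id, patient, protocol, repeat)
            for job_id, (patient, protocol, repeat) in enumerate(combos)]


jobs = build_jobs()
print(len(jobs))  # 400
```

Each tuple could then be passed to one job as its arguments, so that every simulation runs independently of the others.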

**Dr Ben Francis, Institute of Translational Medicine**

## Mathematical Modelling of Disease Transmission in Animal Herds

I have used HTCondor to model the transmission of *E. coli O157* between individual animals in a herd and the transmission
of bluetongue virus between individual farms in Suffolk and Norfolk. I write my own MATLAB code and use HTCondor
to run multiple simulations.

**E. coli O157 [1]:**
Here I needed to run lots of simulations for each parameter set/scenario, so I ran 1500 jobs with 20 simulations per job.

**Bluetongue virus [2]:**
In this case, a single simulation could take up to an hour to run and used a
large amount of memory in MATLAB, so I ran 100 jobs with 1 simulation per job.

### Publications:

[1] Turner J, Bowers RG, Clancy D, Behnke MC, Christley RM (2008)

*A network model of E. coli O157 transmission within a typical UK dairy herd: The effect of heterogeneity and clustering on the prevalence of infection.* Journal of Theoretical Biology 254, 45-54 (doi:10.1016/j.jtbi.2008.05.007).

[2] Turner J, Bowers RG, Baylis M (2012)

*Modelling bluetongue virus transmission between farms using animal and vector movements*. Scientific Reports 2:319 (DOI:10.1038/srep00319).

**Dr Joanne Turner, Department of Epidemiology and Public Health**

## Simulation of Radiotherapy Treatment using Monte Carlo Methods

The EGSnrc-BEAMnrc-DOSXYZnrc Monte Carlo analysis software (a suite of open source code developed by the National Research Council, Canada and written using the MORTRAN extension of FORTRAN) is used for radiotherapy treatment planning with DICOM computed tomography (CT) datasets. A typical simulation, which contains billions of particle histories, is split into 5000 parallel jobs. This code is currently used for a PhD project but will be expanded to clinical use as a Quality Assurance (QA) tool in the near future (in collaboration with Clatterbridge Cancer Centre, Wirral).
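
Splitting a Monte Carlo run this way amounts to dividing the particle histories among the jobs and giving each job its own random stream. The Python sketch below illustrates the idea only: it is not part of EGSnrc, the total of 2 billion histories is an assumed figure (the text says only "billions"), and the seed scheme is a hypothetical placeholder.

```python
def split_histories(total_histories, n_jobs, base_seed=1):
    """Divide total_histories as evenly as possible over n_jobs.

    Returns a list of (job_index, histories, seed) tuples. The seed
    (base_seed + job_index) is a placeholder scheme that simply gives
    every job a different value.
    """
    if n_jobs <= 0:
        raise ValueError("n_jobs must be positive")
    base, extra = divmod(total_histories, n_jobs)
    # The first `extra` jobs take one additional history each.
    return [(i, base + (1 if i < extra else 0), base_seed + i)
            for i in range(n_jobs)]


# Assumed total for illustration; 5000 jobs as in the text.
plan = split_histories(2_000_000_000, 5000)
print(plan[0])  # (0, 400000, 1)
```

The per-job results would then need to be combined afterwards, which is outside the scope of this sketch.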

**Dr Mekala Chandrasekaran, Clatterbridge Centre for Oncology**

## Classification of Complex Signals/Images using Genetic Programming

We use HTCondor to evolve a set of potential cost functions that are used as projection indices for feature extraction in classification problems. The evolution process is implemented via genetic programming (GP), using cross-validation as the fitness function. At the i-th iteration (also known as the i-th generation), the GP creates a new population by mixing the chromosomes of the existing population via genetic operators such as crossover or mutation. The fitness value of each offspring is then evaluated and the best-performing ones are allowed to survive into the next generation. We use HTCondor to compute the fitness of each offspring, with each evaluation executed as a single job. A typical population consists of 100 individuals/jobs which are evolved for 20 generations. This system was implemented using MATLAB with checkpointing enabled.
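
The generational structure described above can be sketched in Python as follows. This is a simplified evolutionary loop on a toy real-valued chromosome, not genetic programming over cost-function expressions and not the MATLAB system itself: the fitness function is a stand-in for the cross-validation score, and the chromosome length, mutation settings and survivor fraction are all assumptions. The point it illustrates is that the fitness evaluations within a generation are independent of one another, which is what allows each to run as a separate job.

```python
import random

POP_SIZE = 100       # individuals per generation, as in the text
N_GENERATIONS = 20   # generations, as in the text
N_GENES = 5          # hypothetical chromosome length


def fitness(chromosome):
    """Toy stand-in for a cross-validation score (higher is better)."""
    return -sum(g * g for g in chromosome)


def crossover(a, b, rng):
    """One-point crossover of two parent chromosomes."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def mutate(chromosome, rng, rate=0.1, scale=0.5):
    """Add Gaussian noise to each gene with probability `rate`."""
    return [g + rng.gauss(0, scale) if rng.random() < rate else g
            for g in chromosome]


def evolve(seed=0):
    """Run the loop and return the best fitness seen in each generation."""
    rng = random.Random(seed)
    population = [[rng.uniform(-5, 5) for _ in range(N_GENES)]
                  for _ in range(POP_SIZE)]
    best_per_generation = []
    for _ in range(N_GENERATIONS):
        # Independent evaluations: each one could be submitted as its own job.
        scores = list(map(fitness, population))
        ranked = [c for _, c in sorted(zip(scores, population),
                                       key=lambda pair: pair[0], reverse=True)]
        best_per_generation.append(fitness(ranked[0]))
        survivors = ranked[:POP_SIZE // 2]  # assumed: the top half survives
        children = [mutate(crossover(rng.choice(survivors),
                                     rng.choice(survivors), rng), rng)
                    for _ in range(POP_SIZE - len(survivors))]
        population = survivors + children
    return best_per_generation


history = evolve()
print(len(history), history[-1] >= history[0])
```

Because the survivors are carried over unchanged, the best fitness can never get worse from one generation to the next in this sketch.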

Keywords: machine learning, classification, feature extraction, projection pursuit, model selection, genetic programming.

**Dr Eduardo Rodriguez Martinez, Department of Electrical Engineering and Electronics**

## Simulation of Disease Transmission in the Poultry and Aquaculture Industries

Initial work concentrated on analysing the effect of an incursion of H5N1 avian influenza into UK poultry flocks [1]. This was performed by running large numbers (on the order of thousands) of simulations written using our own MATLAB code. HTCondor provided an ideal platform for achieving this in a reasonable time frame. Subsequent work has shifted to a similar analysis of the aquaculture industry in England and Wales [2].

**Publications:**

[1] Kieran J. Sharkey, Roger G Bowers, Kenton L. Morgan, Susan E. Robinson and Robert M. Christley,

*Epidemiological consequences of an incursion of highly pathogenic H5N1 avian influenza into the British poultry flock*, Proc. R. Soc. B (2008) 275, 19-28 (doi:10.1098/rspb.2007.1100)

[2] A.R.T. Jonkers, K. J. Sharkey, M. A. Thrush, J. F. Turnbull and K. L. Morgan,

*Epidemics and control strategies for diseases of farmed salmonids: A parameter study*, Epidemics, v. 2, issue 4 (December 2010), pp 195-206 (doi:10.1016/j.epidem.2010.08.001)

**Dr Kieran Sharkey, Department of Mathematical Sciences**