Using the cloud as a supercomputer: How to analyse 63 billion genetic data points in three days

In the good old days (you know, like five years ago) you needed a supercomputer to do massive data analysis jobs. University research departments either had to build their own, or buy precious schedule time on somebody else’s supercomputer. You had to be pretty sure that your research was important, and going to deliver a valuable result, before you could contemplate committing such a major investment of computing time.

These days, you can often replace a supercomputer with cloud services – meaning supercomputers are all around and anybody with a credit card can rent them by the hour as a simple cloud service. My colleague Steve Clayton has just written about a series of projects from Microsoft Research where they are using the Microsoft Windows Azure cloud to analyse massive volumes of data as they research deep medical problems, such as diabetes, Crohn’s disease and coronary artery disease:


Research in these areas is notoriously tricky due to the requirement for a large amount of data and the potential for false positives arising from data sourced from related individuals. A technique and algorithm known as linear mixed models (LMMs) can eliminate this issue but they take an enormous amount of compute time and memory to run. To avoid this computational roadblock, Microsoft Research developed the Factored Spectrally Transformed Linear Mixed Model (better known as FaST-LMM), an algorithm that extends the ability to detect new biological relations by using data that is several orders of magnitude larger. It allows much larger datasets to be processed and can, therefore, detect more subtle signals in the data. Utilizing Windows Azure, MSR ran FaST-LMM on data from the Wellcome Trust, analyzing 63,524,915,020 pairs of genetic markers for the conditions mentioned above.

27,000 CPU’s were used over a period of 72 hours. 1 million tasks were consumed —the equivalent of approximately 1.9 million compute hours. If the same computation had been run on an 8-core system, it would have taken 25 years to complete.

That’s supercomputing on demand and it’s available to everyone – as is the result of this job in Epistasis GWAS for 7 common diseases in the Windows Azure Marketplace.


There’s a short video case study on YouTube (and with possibly the most intelligent set of comments on a YouTube video I’ve ever seen!).

Learn MoreRead more…
The Microsoft Research Connections blog has more detailed information on this and other research projects where projects are able to replace a supercomputer with cloud services.