Scaling Genomics in the Cloud with Microsoft Azure

A deep-dive into building scalable architectures in Azure for bioinformatics workloads.

Colby T. Ford, PhD
May 3, 2022


The following is a blog-friendly extract from my upcoming book, Genomics in the Azure Cloud. This post summarizes the overall purpose of the book, which provides design considerations for a genomics-centric architecture that harnesses the cloud’s power for organizing and scaling bioinformatics.

Genomics has the power to unlock biological insights from human health to infectious diseases to agriculture. Given the plummeting costs of genomic sequencing, we now need to come up with ways to store and organize the data and scale our compute capabilities.

Why cloud?

The cloud offers some key benefits for research organizations.

For genomics specifically, the cloud allows us to take our existing processes, pipelines, and code and “cloudify” them to make them scale to meet our growing needs. Plus, having your data and compute infrastructure in a cloud environment allows for the automation of standardized processes (like processing data through a standard RNAseq pipeline) and provides a natural space for collaboration.

In addition to these features, the cloud reduces our effort to manage the underlying infrastructure, security, and access.

Why Azure?

Microsoft Azure is one of the leaders in the cloud space, providing a ton of platform services (PaaS) that allow you to create performant architectures to fit your unique needs.

In addition to simply being a great cloud provider, Microsoft has a rich network of partners to help your organization succeed in the cloud. Plus, Microsoft Research has specific teams devoted to genomics, life sciences research, and more. This allows Microsoft to continually innovate and contribute scientifically, which shows in the many genomics-specific solutions that are unique to the Azure cloud.

What’s in the book?

Part 1: Data Platform

The first half of the book focuses on setting up the data platform architecture for housing your genomics data. This includes the creation of your genomics data lake and building out a variant data warehouse in Azure Synapse Analytics. We round off this half of the book by covering data orchestration using Azure Data Factory.

Part 2: Compute

The latter half focuses on the compute side of things, specifically how to scale your bioinformatics pipelines and machine learning capabilities. I start by covering tools like Azure Databricks and Azure Machine Learning, with examples of how they can be used to scale bioinformatics-specific tasks.

I then cover other scalable compute services like Azure Batch and CycleCloud. These services are useful for “cloudifying” other pipeline tools like Snakemake, Nextflow, or Cromwell on Azure.
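To make the "scale out" idea a little more concrete, here is a minimal sketch of the general pattern these services rely on: splitting a pipeline run into independent per-sample tasks and spreading them across compute nodes. It is plain Python and does not call any Azure API. The `Task` class, the `build_tasks` and `assign_to_nodes` helpers, and the `run_rnaseq_pipeline` command are all hypothetical names made up for illustration, and a real scheduler will distribute work in its own way rather than the simple round-robin assumed here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Task:
    """One independent unit of work, e.g. one sample through one pipeline."""
    task_id: str
    command: str


def build_tasks(sample_ids: List[str], command_template: str) -> List[Task]:
    """Create one task per sample by filling the sample ID into the template."""
    return [
        Task(task_id=f"task-{i:04d}", command=command_template.format(sample=s))
        for i, s in enumerate(sample_ids)
    ]


def assign_to_nodes(tasks: List[Task], node_count: int) -> List[List[Task]]:
    """Distribute tasks round-robin across a fixed number of compute nodes."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    nodes: List[List[Task]] = [[] for _ in range(node_count)]
    for i, task in enumerate(tasks):
        nodes[i % node_count].append(task)
    return nodes


samples = ["sampleA", "sampleB", "sampleC", "sampleD", "sampleE"]
# Hypothetical command line; substitute whatever your pipeline actually runs.
tasks = build_tasks(samples, "run_rnaseq_pipeline --sample {sample}")
plan = assign_to_nodes(tasks, node_count=2)
for n, node_tasks in enumerate(plan):
    print(f"node {n}: {[t.task_id for t in node_tasks]}")
```

Because each sample is its own task, adding more nodes shortens the wall-clock time without changing the pipeline code itself, which is the property that makes per-sample bioinformatics workloads a good fit for this kind of service.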

At the end of the book, I also provide some considerations for service deployment, security, and managing costs — all things to make your IT team happy.

How can I get the book? #iwantitnow

The book is being published by O’Reilly Media this fall, and it will be available here and on Amazon. However, if you want to get your hands on the first few chapters now, you can join O’Reilly’s Early Release Program!

Visit https://www.oreilly.com/library/view/genomics-in-the/9781098139032/ to get access to the first few chapters (and get the upcoming chapters as they’re released).

Stay Curious…


Colby T. Ford, PhD

Cloud genomics and AI guy and aspiring polymath. I am a recovering academic from machine learning and bioinformatics and I sometimes write things here.