Automated Alignment and Variant Calling in Azure using the Microsoft Genomics service and the msgen R package

Colby T. Ford, Ph.D.
Jan 24, 2021


The Microsoft Genomics service in Azure is a cloud-based implementation of the Burrows-Wheeler Aligner (BWA) and the Genome Analysis Toolkit (GATK) for alignment and variant calling. The service was released a few years ago and currently provides a Python 2.7 CLI to submit workflows. It takes either a pair of .FASTQ files or a .BAM file as input and outputs a .VCF based on a human reference genome of your choice.

A benefit of this service is that it provides access to this automated pipeline at a low cost ($1 for the first 10 gigabases, plus $0.10 per gigabase after that) and is compliant with multiple regulatory standards (ISO 27001, ISO 27018, ISO 9001, and HIPAA). Learn more about the Microsoft Genomics service here.
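
To get a feel for what that pricing means for a given run, here is a minimal sketch of the arithmetic in R. It assumes the $1 is a flat charge that covers anything up to the first 10 gigabases and that every gigabase beyond that costs $0.10; the function name estimate_msgen_cost is made up for illustration, and the real bill may be computed differently.

```r
# Rough cost estimate based on the pricing quoted above.
# Assumption: a flat $1 covers the first 10 gigabases, and each
# additional gigabase is billed at $0.10. Zero input costs nothing.
estimate_msgen_cost <- function(gigabases) {
  stopifnot(is.numeric(gigabases), all(gigabases >= 0))
  extra <- pmax(gigabases - 10, 0)          # gigabases beyond the first 10
  ifelse(gigabases > 0, 1 + 0.10 * extra, 0)
}

# Estimated cost in USD for runs of 5, 10, and 90 gigabases
estimate_msgen_cost(c(5, 10, 90))
```

Under these assumptions, a run of 90 gigabases comes out to $1 + 80 × $0.10 = $9.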

However, since I know that many bioinformaticians use R (or want to avoid Python 2.7), I wrote a complementary package in R called msgen.

Check out my msgen R package here: https://github.com/colbyford/msgen

Getting Started

To begin, you’ll need to create a Genomics account in your Azure tenant.

Once you have the Genomics account up and running, make a note of the name you gave for the service and grab the service’s access key.

Installing and Using the R Package

Installation of the R package is quite simple. You can install the msgen package directly from GitHub.

remotes::install_github("colbyford/msgen")
library(msgen)

Submit a Workflow

To start, you’ll need a pair of .FASTQs (or a .BAM file) and some account information. Upload these files to an Azure Storage account either directly in the Azure Portal or using a tool like Azure Storage Explorer. If you’re interested in uploading a file to Azure Storage from within an R script, check out the AzureStor package on CRAN.
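
If you go the AzureStor route, the upload could look something like the sketch below. The helper names (blob_endpoint_url, upload_fastqs) are made up for this example, and the endpoint URL assumes a storage account on the default public Azure cloud. The AzureStor calls themselves (blob_endpoint, blob_container, upload_blob) come from that package, which needs to be installed separately.

```r
# Build the blob endpoint URL for a storage account.
# Assumption: the account lives in the public Azure cloud.
blob_endpoint_url <- function(account_name) {
  sprintf("https://%s.blob.core.windows.net", account_name)
}

# Upload local files into a blob container, keeping their file names.
# Requires the AzureStor package: install.packages("AzureStor")
upload_fastqs <- function(account_name, account_key, container_name, files) {
  endpoint  <- AzureStor::blob_endpoint(blob_endpoint_url(account_name),
                                        key = account_key)
  container <- AzureStor::blob_container(endpoint, container_name)
  for (f in files) {
    AzureStor::upload_blob(container, src = f, dest = basename(f))
  }
  invisible(basename(files))
}

# Usage (placeholder account, key, and file names):
# upload_fastqs("genomicsdls", "U7lAIWxJ...", "myinputdata",
#               c("chr21_1.fq.gz", "chr21_2.fq.gz"))
```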

You’ll need to have the name of the storage account where your file(s) are held, the storage account’s key, and the region of your Genomics service.

Hint: You can find your storage account’s name and key from the Azure Portal.
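
Since these keys grant access to your data, you may not want them pasted into a script. One option, sketched below, is to keep them in environment variables and read them at run time. The helper get_required_env and the variable names are illustrative, not part of msgen.

```r
# Read a required secret from an environment variable and fail loudly
# if it is missing, rather than passing an empty key along.
get_required_env <- function(name) {
  value <- Sys.getenv(name, unset = "")
  if (!nzchar(value)) {
    stop("Environment variable ", name, " is not set.", call. = FALSE)
  }
  value
}

# Usage (variable names are illustrative; set them in ~/.Renviron or your shell):
# subscription_key    <- get_required_env("MSGEN_SUBSCRIPTION_KEY")
# storage_account_key <- get_required_env("MSGEN_STORAGE_KEY")
```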

Then, choose either the “snapgatk” or “gatk4” pipeline, pick a human reference genome (“b37m1”, “hg19m1”, etc.), and give the workflow a description.

submit_workflow(subscription_key = "b999a0...",
                region = "eastus",
                process = "snapgatk",
                reference = "b37m1",
                description = "Breast cancer analysis.",
                input_storage_account_name = "genomicsdls",
                input_storage_account_key = "U7lAIWxJ...",
                input_container_name = "myinputdata",
                blob_name_1 = "chr21_1.fq.gz",
                blob_name_2 = "chr21_2.fq.gz",
                output_container_name = "myoutputdata")

Once submitted, this function returns a data.frame that includes this run’s workflow ID.

Figure: Output from the `submit_workflow` function. Note the workflow ID.
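
If you have several samples to run with the same settings, you could wrap the submission in a small helper. The sketch below is one way to do that. The name submit_samples and the layout of the samples list are invented for this example, and it passes only the submit_workflow arguments shown above. The submit_fn argument is there so the helper can be tried with a stand-in function before sending anything to Azure.

```r
# Submit one workflow per sample, reusing the shared settings.
# `samples` is a named list; each element holds the two blob names and a
# description. `shared_args` holds every other submit_workflow argument.
submit_samples <- function(samples, shared_args,
                           submit_fn = msgen::submit_workflow) {
  lapply(samples, function(s) {
    args <- c(shared_args,
              list(blob_name_1 = s$r1,
                   blob_name_2 = s$r2,
                   description = s$description))
    do.call(submit_fn, args)
  })
}

# Usage (placeholder keys and file names):
# shared <- list(subscription_key = "b999a0...",
#                region = "eastus",
#                process = "snapgatk",
#                reference = "b37m1",
#                input_storage_account_name = "genomicsdls",
#                input_storage_account_key = "U7lAIWxJ...",
#                input_container_name = "myinputdata",
#                output_container_name = "myoutputdata")
# samples <- list(
#   chr21 = list(r1 = "chr21_1.fq.gz", r2 = "chr21_2.fq.gz",
#                description = "Breast cancer analysis.")
# )
# results <- submit_samples(samples, shared)
```

The result is a list with one submit_workflow return value per sample, named after the entries in samples.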

Check a Workflow’s Status

You can check on this specific workflow using the get_workflow_status function. Simply enter the workflow ID from the previous submission step.

get_workflow_status(subscription_key = "b999a0...",
                    region = "eastus",
                    workflow_id = "10001")
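
Workflows can take a while, so rather than re-running this by hand you might poll until the workflow finishes. Here is a minimal polling sketch. The helper wait_for_workflow is hypothetical, and because the exact status values and the shape of the get_workflow_status result are not covered here, the caller supplies both a function that returns the current status as a string and the set of status values that count as finished.

```r
# Poll a status function until it reports one of the final states.
# `get_status` is any zero-argument function returning a status string.
# `done_states` is the caller's list of statuses that mean "finished".
wait_for_workflow <- function(get_status, done_states,
                              interval_sec = 60, max_checks = 120) {
  for (i in seq_len(max_checks)) {
    status <- get_status()
    if (status %in% done_states) {
      return(status)
    }
    if (i < max_checks) {
      Sys.sleep(interval_sec)  # wait before the next check
    }
  }
  stop("Workflow did not reach a final state after ", max_checks,
       " checks.", call. = FALSE)
}

# Usage (the `status` field name and the final-state labels are
# hypothetical; inspect the real return value first):
# get_status <- function() {
#   res <- get_workflow_status(subscription_key = "b999a0...",
#                              region = "eastus",
#                              workflow_id = "10001")
#   as.character(res$status)
# }
# wait_for_workflow(get_status, done_states = c("<finished>", "<failed>"))
```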

Cancel a Workflow

If you made a mistake and want to cancel a workflow, you can do so using the cancel_workflow function.

cancel_workflow(subscription_key = "b999a0...",
                region = "eastus",
                workflow_id = "10001")

List all Workflows

To see all of your workflows, use the list_workflows function:

list_workflows(subscription_key = "b999a0...",
               region = "eastus")

Accessing the Results

Once the workflow process has completed, navigate back to your Azure Storage account. You should now see the output .VCF file (and, if you used a pair of .FASTQs, you’ll see the .BAM and .BAI alignment files).
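
If you would rather check for the results from R than from the portal, one option is to list the output container with AzureStor and pick out the result files by extension. The sketch below is one way to do it. The helper find_outputs and its default suffix list are illustrative, and the actual output file names depend on the service.

```r
# Keep only the blob names that look like pipeline outputs.
# Matching is case-insensitive and based purely on the file suffix.
find_outputs <- function(blob_names,
                         suffixes = c(".vcf", ".vcf.gz", ".bam", ".bai")) {
  lower <- tolower(blob_names)
  keep <- vapply(lower,
                 function(n) any(endsWith(n, suffixes)),
                 logical(1), USE.NAMES = FALSE)
  blob_names[keep]
}

# Usage with AzureStor (placeholder account, key, and container names):
# endpoint  <- AzureStor::blob_endpoint(
#   "https://genomicsdls.blob.core.windows.net", key = "U7lAIWxJ...")
# container <- AzureStor::blob_container(endpoint, "myoutputdata")
# find_outputs(AzureStor::list_blobs(container, info = "name"))
```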

Future Enhancements

Today, this package is quite simple but provides a full interface to the Microsoft Genomics service in Azure. The R package has commands analogous to the submit, list, status, and cancel functionality seen in the Python 2.7-based CLI.

Future enhancements will include different security options (today, the service uses the storage account’s key to make a SAS token), output compression options (like making a compressed .VCF output), and the ability to use the config.txt file from the Genomics service on the Azure Portal.

If you use the package and have any issues or want to request a feature, let me know on the Issues page on GitHub: https://github.com/colbyford/msgen/issues.

