Automated Alignment and Variant Calling in Azure using the Microsoft Genomics service and the msgen R package

The Microsoft Genomics service in Azure is a cloud-based implementation of the Burrows-Wheeler Aligner (BWA) and the Genome Analysis Toolkit (GATK) for alignment and variant calling. The service was released a few years ago and currently provides a Python 2.7 CLI for submitting workflows. It takes either a pair of .FASTQ files or a .BAM file as input and outputs a .VCF file based on a human reference genome of your choice.

A benefit of this service is that it provides access to this automated pipeline at a low cost ($1 for the first 10 gigabases, plus $0.10 per gigabase after that) and is compliant with multiple regulatory standards (ISO 27001, ISO 27018, ISO 9001, and HIPAA). Learn more about the Microsoft Genomics service here.

However, since I know that many bioinformaticians use R (or want to avoid Python 2.7), I wrote a complementary R package called msgen.

Check out my msgen R package here: https://github.com/colbyford/msgen

Getting Started

To begin, you’ll need to create a Genomics account in your Azure tenant.

Once you have the Genomics account up and running, make a note of the name you gave for the service and grab the service’s access key.

Installing and Using the R Package

Installation of the R package is quite simple. You can install the msgen package directly from GitHub.

remotes::install_github("colbyford/msgen")
library(msgen)

To start, you’ll need a pair of .FASTQ files (or a .BAM file) and some account information. Upload these files to an Azure Storage account either directly in the Azure Portal or using a tool like Azure Storage Explorer. If you’re interested in uploading a file to Azure Storage from within an R script, check out the AzureStor package on CRAN.
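If you would rather do the upload from R, here is a minimal sketch using AzureStor. The helper name `upload_fastqs` and its arguments are my own placeholders, and the sketch assumes the container already exists and that your storage account uses the standard `blob.core.windows.net` endpoint.

```r
# Sketch: upload local read files to an existing blob container with AzureStor.
# `upload_fastqs` is a hypothetical helper, not part of msgen or AzureStor.
upload_fastqs <- function(account_name, account_key, container_name, local_files) {
  # Fail early if any local file is missing.
  missing_files <- local_files[!file.exists(local_files)]
  if (length(missing_files) > 0) {
    stop("File(s) not found: ", paste(missing_files, collapse = ", "))
  }
  if (!requireNamespace("AzureStor", quietly = TRUE)) {
    stop("Install AzureStor first: install.packages('AzureStor')")
  }
  # Assumes the standard public-cloud blob endpoint URL.
  endpoint <- AzureStor::blob_endpoint(
    sprintf("https://%s.blob.core.windows.net", account_name),
    key = account_key
  )
  container <- AzureStor::blob_container(endpoint, container_name)
  # Upload each file under its own file name.
  for (path in local_files) {
    AzureStor::upload_blob(container, src = path, dest = basename(path))
  }
  invisible(basename(local_files))
}

# Example call (not run here; needs real credentials and files):
# upload_fastqs("genomicsdls", "U7lAIWxJ...", "myinputdata",
#               c("chr21_1.fq.gz", "chr21_2.fq.gz"))
```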

You’ll need to have the name of the storage account where your file(s) are held, the storage account’s key, and the region of your Genomics service.

Hint: You can find your storage account’s name and key from the Azure Portal.
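Since these keys grant access to your data, you may prefer not to paste them into scripts. One option, sketched below, is to read them from environment variables. The helper `get_required_env` and the variable names are hypothetical; use whatever names suit your setup.

```r
# Sketch: read a secret from an environment variable and fail loudly if it is unset.
# `get_required_env` is a hypothetical helper, not part of msgen.
get_required_env <- function(name) {
  value <- Sys.getenv(name, unset = "")
  if (!nzchar(value)) {
    stop("Environment variable '", name, "' is not set.", call. = FALSE)
  }
  value
}

# Example (the variable names below are placeholders):
# subscription_key <- get_required_env("MSGEN_SUBSCRIPTION_KEY")
# storage_key      <- get_required_env("MSGEN_STORAGE_KEY")
```

You can set the variables in your shell or in a `.Renviron` file that stays out of version control.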

Then, choose either the “snapgatk” or the “gatk4” pipeline, pick a human reference genome (“b37m1”, “hg19m1”, etc.), and give the workflow a description.

submit_workflow(subscription_key = "b999a0...",
                region = "eastus",
                process = "snapgatk",
                reference = "b37m1",
                description = "Breast cancer analysis.",
                input_storage_account_name = "genomicsdls",
                input_storage_account_key = "U7lAIWxJ...",
                input_container_name = "myinputdata",
                blob_name_1 = "chr21_1.fq.gz",
                blob_name_2 = "chr21_2.fq.gz",
                output_container_name = "myoutputdata")

Once submitted, this function returns a data.frame that includes this run’s workflow ID.

Output from the `submit_workflow` function. Note the workflow ID.
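Because you need the workflow ID later, it can help to keep a small local log of submissions. The sketch below appends whatever data.frame you pass in to a CSV file, with a timestamp. The helper `log_submission` is hypothetical, and I make no assumption about the column names in the data.frame returned by `submit_workflow`.

```r
# Sketch: append a submission record (any data.frame) to a CSV log.
# `log_submission` is a hypothetical helper, not part of msgen.
log_submission <- function(result, path = "msgen_submissions.csv") {
  stopifnot(is.data.frame(result))
  record <- cbind(
    logged_at = format(Sys.time(), "%Y-%m-%d %H:%M:%S"),
    result
  )
  file_exists <- file.exists(path)
  # Write the header only when creating the file; append afterwards.
  write.table(record, file = path, sep = ",", qmethod = "double",
              row.names = FALSE, col.names = !file_exists,
              append = file_exists)
  invisible(record)
}

# Example (not run here):
# result <- submit_workflow(...)
# log_submission(result)
```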

You can check on this specific workflow using the get_workflow_status function. Simply enter the workflow ID from the previous submission step.

get_workflow_status(subscription_key = "b999a0...",
                    region = "eastus",
                    workflow_id = "10001")
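Workflows can take a while, so you may want to poll rather than check by hand. Below is a generic polling sketch. `poll_until` is a hypothetical helper of mine, and the `is_done` test is left to you because it depends on what `get_workflow_status` returns for your run (inspect the result with `str()` first).

```r
# Sketch: call `check()` repeatedly until `is_done(result)` is TRUE.
# `poll_until` is a hypothetical helper, not part of msgen.
poll_until <- function(check, is_done, interval_sec = 60, max_tries = 120) {
  for (attempt in seq_len(max_tries)) {
    result <- check()
    if (isTRUE(is_done(result))) {
      return(result)
    }
    # Wait before the next check, but not after the last one.
    if (attempt < max_tries) Sys.sleep(interval_sec)
  }
  stop("Gave up after ", max_tries, " checks without a finished workflow.")
}

# Example (not run here). The completion test is an assumption you must adapt:
# final <- poll_until(
#   check = function() get_workflow_status(subscription_key = "b999a0...",
#                                          region = "eastus",
#                                          workflow_id = "10001"),
#   is_done = function(res) FALSE  # replace with a test on the returned status
# )
```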

If you made a mistake and want to cancel a workflow, you can do so using the cancel_workflow function.

cancel_workflow(subscription_key = "b999a0...",
                region = "eastus",
                workflow_id = "10001")

To see all of your workflows, you can list them using this function:

list_workflows(subscription_key = "b999a0...",
               region = "eastus")
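Every call above repeats the same subscription key and region. If that gets tedious, one option is a small wrapper that fills them in for you. This is only a sketch: `with_account` is a hypothetical helper, and it relies solely on the `subscription_key` and `region` argument names shown in the calls above.

```r
# Sketch: build a caller that supplies the shared account arguments.
# `with_account` is a hypothetical helper, not part of msgen.
with_account <- function(subscription_key, region) {
  shared <- list(subscription_key = subscription_key, region = region)
  # The returned function passes the shared arguments plus any extras to `fn`.
  function(fn, ...) do.call(fn, c(shared, list(...)))
}

# Example (not run here; needs msgen and a real key):
# msgen_call <- with_account(Sys.getenv("MSGEN_SUBSCRIPTION_KEY"), "eastus")
# msgen_call(list_workflows)
# msgen_call(get_workflow_status, workflow_id = "10001")
```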

Accessing the Results

Once the workflow process has completed, navigate back to your Azure Storage account. You should now see the output .VCF file (and, if you used a pair of .FASTQs, you’ll see the .BAM and .BAI alignment files).
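To pull the results down from R instead of the Portal, something like the following AzureStor sketch may work. The helper names are hypothetical, and the file-extension filter is an assumption about how the output files are named, so check the container listing before relying on it.

```r
# Sketch: download result files from the output container with AzureStor.
# Both helpers are hypothetical and not part of msgen or AzureStor.

# Assumption: outputs end in .vcf, .bam, or .bai, optionally followed by .gz.
is_result_file <- function(blob_names) {
  grepl("\\.(vcf|bam|bai)(\\.gz)?$", blob_names, ignore.case = TRUE)
}

download_results <- function(account_name, account_key, container_name,
                             dest_dir = ".") {
  if (!requireNamespace("AzureStor", quietly = TRUE)) {
    stop("Install AzureStor first: install.packages('AzureStor')")
  }
  # Assumes the standard public-cloud blob endpoint URL.
  endpoint <- AzureStor::blob_endpoint(
    sprintf("https://%s.blob.core.windows.net", account_name),
    key = account_key
  )
  container <- AzureStor::blob_container(endpoint, container_name)
  # info = "name" asks for a plain character vector of blob names.
  blob_names <- AzureStor::list_blobs(container, info = "name")
  wanted <- blob_names[is_result_file(blob_names)]
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  for (blob in wanted) {
    AzureStor::download_blob(container, src = blob,
                             dest = file.path(dest_dir, basename(blob)),
                             overwrite = TRUE)
  }
  invisible(wanted)
}

# Example (not run here):
# download_results("genomicsdls", "U7lAIWxJ...", "myoutputdata", "results")
```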

Future Enhancements

Today, this package is quite simple but provides a full interface to the Microsoft Genomics service in Azure. The R package has commands analogous to the submit, list, status, and cancel functionality seen in the Python 2.7-based CLI.

Future enhancements will include different security options (today, the service uses the storage account’s key to make a SAS token), output compression options (like making a compressed .VCF output), and the ability to use the config.txt file from the Genomics service on the Azure Portal.

If you use the package and have any issues or want to request a feature, let me know on the Issues page on GitHub: https://github.com/colbyford/msgen/issues.

Computational Biomathematician and Cloud AI Guy. I research machine learning and genomics and I sometimes write things here.
