Automated Alignment and Variant Calling in Azure using the Microsoft Genomics service and the msgen R package
The Microsoft Genomics service in Azure is a cloud-based implementation of the Burrows-Wheeler Aligner (BWA) and the Genome Analysis Toolkit (GATK) for alignment and variant calling. The service was released a few years ago and currently provides a Python 2.7 CLI for submitting workflows. It takes either a pair of .FASTQ files or a .BAM file as input and outputs a .VCF based on a human reference genome of your choice.
A benefit of this service is that it provides access to this automated pipeline at a low cost ($1 for the first 10 gigabases + $0.10/Gb after that) and is compliant with multiple regulatory standards (ISO 27001, ISO 27018, ISO 9001, and HIPAA). Learn more about the Microsoft Genomics service here.
However, since I know that many bioinformaticians use R (or want to avoid Python 2.7), I wrote a complementary package in R called msgen.

Check out my msgen R package here: https://github.com/colbyford/msgen
Getting Started
To begin, you’ll need to create a Genomics account in your Azure tenant.

Once you have the Genomics account up and running, make a note of the name you gave for the service and grab the service’s access key.

Installing and Using the R Package
Installation of the R package is quite simple. You can install the msgen package directly from GitHub.
# install.packages("remotes")  # run once if the remotes package is not installed
remotes::install_github("colbyford/msgen")
library(msgen)
Submit a Workflow
To start, you’ll need a pair of .FASTQ files (or a .BAM file) and some account information. Upload these files to an Azure Storage account, either directly in the Azure Portal or using a tool like Azure Storage Explorer. If you’re interested in uploading a file to Azure Storage from within an R script, check out the AzureStor package on CRAN.
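For readers who prefer to stay in R, the upload step might look roughly like the sketch below. It assumes the AzureStor package is installed and relies on its storage_endpoint, storage_container, and storage_upload functions. The helper names blob_endpoint_url and upload_inputs are illustrative only, and the endpoint URL assumes the standard public-cloud blob.core.windows.net suffix.

```r
# Build the blob endpoint URL for a storage account.
# Assumption: the account lives in the public Azure cloud.
blob_endpoint_url <- function(account_name) {
  sprintf("https://%s.blob.core.windows.net", account_name)
}

# Upload local files into an existing container, keeping their file names.
# upload_inputs() is an illustrative helper, not part of msgen or AzureStor.
upload_inputs <- function(account_name, account_key, container_name, local_files) {
  if (!requireNamespace("AzureStor", quietly = TRUE)) {
    stop("Please install the AzureStor package first.")
  }
  endpoint <- AzureStor::storage_endpoint(blob_endpoint_url(account_name),
                                          key = account_key)
  container <- AzureStor::storage_container(endpoint, container_name)
  for (path in local_files) {
    AzureStor::storage_upload(container, src = path, dest = basename(path))
  }
  # Return the blob names, which are what the workflow submission needs.
  invisible(basename(local_files))
}

# Example call (needs a real account name and key to run):
# upload_inputs("genomicsdls", "U7lAIWxJ...", "myinputdata",
#               c("chr21_1.fq.gz", "chr21_2.fq.gz"))
```

Returning the blob names makes it easy to pass them straight into the workflow submission afterwards.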

You’ll need to have the name of the storage account where your file(s) are held, the storage account’s key, and the region of your Genomics service.

Then, choose the pipeline (“snapgatk” or “gatk4”) and the human reference genome (“b37m1”, “hg19m1”, etc.), and give the workflow a description.
submit_workflow(subscription_key = "b999a0...",
                region = "eastus",
                process = "snapgatk",
                reference = "b37m1",
                description = "Breast cancer analysis.",
                input_storage_account_name = "genomicsdls",
                input_storage_account_key = "U7lAIWxJ...",
                input_container_name = "myinputdata",
                blob_name_1 = "chr21_1.fq.gz",
                blob_name_2 = "chr21_2.fq.gz",
                output_container_name = "myoutputdata")
Once submitted, this function returns a data.frame that includes this run’s workflow ID.
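Because the submission call embeds two secrets, it may be worth keeping them out of the script itself. The sketch below reads them from environment variables instead. The variable names MSGEN_SUBSCRIPTION_KEY and MSGEN_STORAGE_KEY and the helpers require_env and submit_fastq_pair are hypothetical; the submit_workflow arguments are the same ones used in the example above.

```r
# Read a required environment variable, failing loudly if it is missing.
require_env <- function(name) {
  value <- Sys.getenv(name, unset = "")
  if (!nzchar(value)) {
    stop(sprintf("Environment variable %s is not set.", name))
  }
  value
}

# Illustrative wrapper around msgen::submit_workflow for a FASTQ pair.
# The account, container, and region values mirror the example above.
submit_fastq_pair <- function(blob_name_1, blob_name_2, description) {
  msgen::submit_workflow(
    subscription_key = require_env("MSGEN_SUBSCRIPTION_KEY"),
    region = "eastus",
    process = "snapgatk",
    reference = "b37m1",
    description = description,
    input_storage_account_name = "genomicsdls",
    input_storage_account_key = require_env("MSGEN_STORAGE_KEY"),
    input_container_name = "myinputdata",
    blob_name_1 = blob_name_1,
    blob_name_2 = blob_name_2,
    output_container_name = "myoutputdata"
  )
}

# Example call (needs msgen installed and both variables set):
# run <- submit_fastq_pair("chr21_1.fq.gz", "chr21_2.fq.gz", "Breast cancer analysis.")
# str(run)  # inspect the returned data.frame to locate the workflow ID
```

Calling str() on the result is a quick way to see which column holds the workflow ID before passing it to the other functions.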

Check a Workflow’s Status
You can check on this specific workflow using the get_workflow_status function. Simply enter the workflow ID from the previous submission step.
get_workflow_status(subscription_key = "b999a0...",
                    region = "eastus",
                    workflow_id = "10001")
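Since a workflow can take a while, you may want to poll the status rather than re-run the call by hand. The helper below is a generic sketch: wait_for_workflow is a hypothetical name, and because the exact shape of the object returned by get_workflow_status is not documented here, the caller supplies both the status check and the test for "finished". The "Running" and "Completed" strings in the demo are made up for illustration.

```r
# Repeatedly call check_status() until is_finished(status) is TRUE,
# waiting interval_secs between checks and giving up after max_checks.
wait_for_workflow <- function(check_status, is_finished,
                              interval_secs = 60, max_checks = 120) {
  for (i in seq_len(max_checks)) {
    status <- check_status()
    if (is_finished(status)) {
      return(status)
    }
    # Only sleep if another check will follow.
    if (i < max_checks) {
      Sys.sleep(interval_secs)
    }
  }
  stop("Workflow did not finish within the allotted number of checks.")
}

# Demo with a stand-in status function that "completes" on the third check.
demo_calls <- 0
demo_check <- function() {
  demo_calls <<- demo_calls + 1
  if (demo_calls < 3) "Running" else "Completed"
}
demo_result <- wait_for_workflow(demo_check,
                                 function(s) identical(s, "Completed"),
                                 interval_secs = 0)
```

With msgen, check_status would wrap the get_workflow_status call shown above, and is_finished would inspect whichever field of its result reports the workflow state.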
Cancel a Workflow
If you made a mistake and want to cancel a workflow, you can do so using the cancel_workflow function.
cancel_workflow(subscription_key = "b999a0...",
                region = "eastus",
                workflow_id = "10001")
List all Workflows
To see all of your workflows, use the list_workflows function:
list_workflows(subscription_key = "b999a0...",
               region = "eastus")
Accessing the Results
Once the workflow process has completed, navigate back to your Azure Storage account. You should now see the output .VCF file (and, if you used a pair of .FASTQs, you’ll see the .BAM and .BAI alignment files).
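If you would like a quick look at the variant calls from R, a minimal base-R sketch is shown below. It assumes the .VCF has already been downloaded locally and follows the usual layout of "##" metadata lines followed by a "#CHROM" header row. read_vcf_records is a hypothetical helper, and the two records in the demo are invented purely to exercise it.

```r
# Read the tab-separated records of a VCF into a data.frame of strings,
# skipping the "##" metadata lines and using the "#CHROM" row as the header.
read_vcf_records <- function(path) {
  lines <- readLines(path)
  body <- lines[!startsWith(lines, "##")]
  if (length(body) == 0 || !startsWith(body[1], "#CHROM")) {
    stop("No #CHROM header line found; is this a VCF file?")
  }
  body[1] <- sub("^#", "", body[1])
  read.delim(text = paste(body, collapse = "\n"), header = TRUE,
             quote = "", comment.char = "", check.names = FALSE,
             colClasses = "character", stringsAsFactors = FALSE)
}

# Demo on a tiny made-up VCF written to a temporary file.
demo_path <- tempfile(fileext = ".vcf")
writeLines(c(
  "##fileformat=VCFv4.2",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
  "21\t9411239\t.\tG\tA\t50\tPASS\tDP=10",
  "21\t9411245\t.\tC\tT\t99\tPASS\tDP=12"
), demo_path)
vcf <- read_vcf_records(demo_path)
vcf$POS <- as.integer(vcf$POS)  # positions are easier to filter as integers
```

For anything beyond a quick inspection, a dedicated parser such as readVcf from the Bioconductor VariantAnnotation package is likely a better fit.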

Future Enhancements
Today, this package is quite simple but provides a full interface to the Microsoft Genomics service in Azure. The R package has commands analogous to the submit, list, status, and cancel functionality seen in the Python 2.7-based CLI.
Future enhancements will include different security options (today, the service uses the storage account’s key to make a SAS token), output compression options (like making a compressed .VCF output), and the ability to use the config.txt file from the Genomics service on the Azure Portal.
If you use the package and have any issues or want to request a feature, let me know on the Issues page on GitHub: https://github.com/colbyford/msgen/issues.