How Azure OneLake will revolutionize the way we manage and use -omics data
Earlier today, OneLake was announced at Microsoft Build. This platform, built on Azure Storage and existing data lake protocols, sets out to help enterprises organize their data in data lakes. The goal is to give groups more flexibility to manage and use their own data while enabling better security and governance capabilities.
As always, I wanted to give my 2¢ and provide some early thoughts as to how this will be useful in the bioinformatics and genomics space.
As an MVP, I got a sneak preview of this stuff last month, but couldn’t share until today. Do you know how hard it was for me to keep it all a secret?!?
The Need for OneLake
A common thing that happens in enterprises today is that groups within an organization go rogue. Don’t worry, I’ll explain.
IT usually owns the Azure Tenant for an organization and may create centralized resources such as a data lake or data warehouse for everyone to use. However, they often then restrict access, permissions, and usage of those resources, and may provide insufficient support. So, because non-IT groups still need to get their work done, they go rogue and start deploying their own cloud services to meet their needs.
On one side, now the non-IT groups can get their work done. On the other side, it turns into a management and governance mess for IT. This also creates data silos that block the flow of data through the organization unless multiple copies of it are made.
OneLake attempts to solve this common occurrence by creating a “unified management and governance” layer on top of Workspaces. Workspaces can be managed by teams or groups in an org, called “domains”, while still allowing for IT oversight and visibility.
Here are a few feature quotes/themes from today’s announcement…
“OneLake comes automatically provisioned with every Fabric tenant with no infrastructure to manage.” I think this means that you’ll have a single OneLake per tenant, similar to how Azure Purview works today.
“Any data in OneLake works with out-of-the-box governance such as data lineage, data protection, certification, catalog integration, etc.”
“OneLake enables distributed ownership. Different Workspaces allow different parts of the organization to work independently while still contributing to the same data lake. Each Workspace can have its own administrator, access control, region, and capacity for billing.”
The goal: “No Silos!” 🙃
The Need for OneLake in Genomics
While the above features are interesting for enterprise data management in general, there are some other features I’d like to share that will truly change how we operate with our -omics data in the cloud.
One Copy of Data and Shared Compute
As we all know, -omics data can get huge. Previously, we would have to save data in the data lake, then again in tables in a data warehouse, then again in a Dataset for a Power BI dashboard, etc., potentially in different formats. This is no longer the case. OneLake enables multiple different compute engines to access the same data in a variety of ways.
- A data engineer can load variant data or clinical trial results using a Databricks notebook.
- A bioinformatician could query the data using T-SQL from Synapse and run some analyses.
- A trial manager or BI analyst could build a Power BI dashboard off the same data (using DirectLake) to showcase the information visually.
I foresee this causing considerable savings on storage costs. Not to mention, this will alleviate the common headache where you try to figure out which version of the data was used to perform some analysis a month later. #scientificreproducibility
The backend is still ADLS under the hood with some updated pathing. So, it’ll still work with the existing ADLS Gen2 APIs and SDKs.
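As a rough sketch of what that "updated pathing" might look like: based on the announcement, OneLake paths appear to follow the familiar abfss:// convention with a OneLake endpoint. The `onelake.dfs.fabric.microsoft.com` endpoint and the workspace/item layout here are my assumptions from the early docs, and the helper function and names are entirely hypothetical; verify against the OneLake documentation before relying on them.

```python
# Hypothetical helper: build an ADLS-Gen2-style URI for a OneLake file.
# The onelake.dfs.fabric.microsoft.com endpoint and the
# <workspace>/<item>.Lakehouse/Files layout are assumptions based on the
# announcement, not a confirmed API contract.
def onelake_uri(workspace: str, lakehouse: str, path: str) -> str:
    return (
        f"abfss://{workspace}@onelake.dfs.fabric.microsoft.com/"
        f"{lakehouse}.Lakehouse/Files/{path}"
    )

uri = onelake_uri("genomics-ws", "trials", "vcf/sample01.parquet")
print(uri)
```

If this holds up, a URI like that should drop straight into the existing ADLS Gen2 SDKs, which is the whole point.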
Parquet to the Floor
Parquet is now the preferred format that these supported engines will use to read/write data. So, if we focus our efforts on converting our -omics data into Parquet, using the same data across various compute engines will be seamless.
This differs from the norm today in that we often store our data in its native format (VCF files, CSVs, etc.). Other services, like Power BI or Azure Synapse, also store data in proprietary formats that are unusable by other tools.
Having all the compute engines use the same format will cut down on data conversion exercises, erroneous versions of data, and the overall need to house duplicative copies of the same info in multiple formats. (Note that this doesn’t mean you have to use Parquet. This means that other Azure services will use Parquet and you can, too, making a very cohesive way to store and use data.)
For more information about the Parquet format, visit: https://github.com/apache/parquet-format
Support for non-Azure Data
Despite being an Azure guy, I do understand that many life science orgs use a blend of both Azure and Amazon Web Services (and maybe some GCP) across different pockets of the company. Plus, it seems that many sequencing/lab vendors like to deliver data via S3 buckets.
With OneLake, “shortcuts virtualize data across domains and clouds,” which will let Amazon S3 buckets that are managed externally still surface in OneLake without copying the data. I feel that this will be very powerful for multi-cloud organizations.
I’ve noticed that, at many life science companies, data often needs to remain private to the group that manages it. This means that perhaps a non-clinical group wouldn’t be able to freely access clinical data and vice versa. Plus, when data sharing does need to happen, it often means copying data from one silo to another just to make the access process easy.
With OneLake, we’ll be able to manage the security and access of the data within our Workspaces and also share data easily when needed…without making a copy of the data.
Final Early Thoughts
It’s obviously too soon to know all the capabilities of OneLake and how it will affect the way we manage our data. From what I’ve seen so far, it abstracts a lot of the struggles that we see today in implementing data lake-centric architectures in large organizations.
OneLake Documentation: https://learn.microsoft.com/en-us/fabric/onelake/
My hope is that this platform allows democratization of data within an organization while still keeping the IT security and governance people happy. This balance between flexibility and structure will be how organizations can thrive by using their data to its fullest potential.
Pretty soon, I’ll be playing with this platform hands-on. I also want to get my hands on some of the other things that were announced at Build (Synapse vNext and Azure AI Studio, to name a couple).
Expect some more ramblings by yours truly soon!
Stay curious… ⛵🧬