Building a data warehouse on Azure

by Gerben Butter

Part 1: The love for the Blob

When architecting a greenfielddata warehouse on the Azure platform, the first step is to determine which features of Azure you want to start building on. In most cases, the need for a data warehouse is driven by generating insights out of validated data that resides in various systems and sources. And, because Zamro is a scale-up, it should be cheap and highly scalable.

Ok, so we have a few types of source databases, some cloud sources and files, an Azure subscription, and an insight-hungry management team. The end result stating one conformed, time-variant and integrated database with applied business rules so anyone can use that data to gain insights with whatever tool you use.

But where to start? And with what components to use?

There are quite some options:

  •        The safe but old fashioned way: an entire data warehouse on a VM with SQL Server and SSIS
  •        The new kid: Azure Data Lake with Data Factory and Data Lake Analytics and USQL
  •        The big data option: a complete HDInsight Hadoop cluster with Hive, Spark, Airflow etc
  •        The massive parallel option: Azure Data Warehouse
  •        Or maybe a mix of all above?

They all have their pro’s and cons, but they also have one thing in common: they all need data. And then the start becomes an easy choice: The Azure Blob Storage. Of course Amazon and Google are providing similar packages with different options and price tags, but we’ve chosen Azure and the options and prices are very competitive.

To get all required data into the data warehouse platform, you want (at least):

  •        Flexibility of ingestion methods (C#, Java, Python, SSIS, Powershell, etc)
  •        Flexibility on storage types (tsv, xml, json, avro etc)
  •        Good security options and data at rest encryption
  •        Accessibility by (at least) Azure components mentioned earlier
  •        With very limited cost for hot and cold storage

And that is exactly what the blob store offers. Other features are queues, table storage and file share.

This low cost and flexible storage solution enabled a swift ingestion of various sources and to quickly start building up history on data. Because of various ingestion options, other Zamroni’s were able to push relevant data into the Blob storage with their own beloved method (in our case, Java), relieving the data warehouse team from building a data feed.

With the data flowing into our blob store, Zamro chose for building a data warehouse on a VM with SQL Server and SSIS, but also use the power Azure Data Lake and Data Lake Analytics. More on the choices made, in the next blog.

Resources:

Amazon S3, MS Azure and Google Cloud Storage Pricing Comparison
Azure Storage Documentation
How to use Blob storage from Java

PS: We are looking for a Data Warehouse engineer. Apply here!

Leave a Reply

Your email address will not be published. Required fields are marked *