Data warehouses, data vaults, lake houses, where do I even start???


It often seems like there’s always someone coming up with something ‘new’ that somewhat solves something someone somehow thought needed solving, but actually probably didn’t 😂

Yeah, someone needs to solve your awful writing skills…

Yes, apologies for that opening sentence, it’s about the best my poor literary skills can muster, but I thought I’d dive into the murky depths of the data architectures, methodologies and other data terminology that’s knocking around, and that sentence captures some of my feelings on the subject. Before we get into it though, we need to keep in mind that all of these architectures, methodologies etc… arose to solve just one single problem. Data is being collected and someone needs to report on it and/or analyse it for some reason. Simple as that…

So, why are there so many different architectures and methodologies trying to solve the same thing? Well, some are driven by new technologies that open up solutions that were previously impractical, while others come about because someone is trying to improve a particular area they think can be done more efficiently this way or that. Some are just business processes that you follow, and aren’t necessarily technical at all. To be honest, all of them, and I mean all, have pros and cons, and there isn’t a one size fits all. A lot of these approaches will try to make out that previous methodologies or architectures are too slow to adapt to change, or overcomplicate things… but as is the way with pros and cons, changing one part might speed things up or simplify them, but you’ll almost certainly be trading one thing off for another.

But can I connect to it using Excel?


After working with data for many years, and switching between various methodologies and architectures, I’ve realised that none of them will absolve you of doing some kind of in-depth modelling work at some point, despite what they might promise. If you’ve got several systems storing customer data, you will at some point, no matter the architecture or methodology, need to get it into (or put on top of it) some kind of decent data model that is clean and efficient in terms of business user querying, reporting and analysis. People reporting on the raw data will often come up with the same measures but different calculations, and thus different results. That’s obviously a problem, and I’ve seen it many times…

However, it doesn’t matter how cool, efficient or simple your data solution is if the end users cannot make head nor tail of it. Literally, I’ve had end users throwing a hissy fit and trashing a whole data platform because they said they couldn’t use Excel to connect to the data (even though in almost every case they could, they just didn’t know the server name, connection string or something like that 🤦‍♂️).

You mean I have to actually do some work?


I’m afraid so… 😮

We’ll go over some of these architectures and methodologies briefly, but first I want to focus on what I think almost any data project like this should (IMHO) be focusing on before choosing an architecture or methodology. Some of these do tie nicely into various architectures and/or methodologies though. Obviously this is not an exhaustive list, but I hope it provides a decent base to start from, and I’d imagine anyone who has been at this for a while could roll off a similar list too.

FYI – for want of a better word going forward, I’m going to refer to the final destination for data just as a ‘warehouse’, no matter the architecture, underlying technology, methodology or platform…

  • Timeliness – We need to get the data into our warehouse in a timely fashion, of course. In some cases we’re constrained by factors beyond our control, such as when a data file is posted on a website, for example. In other cases, data might be streamed into the warehouse in near real-time or loaded nightly, or both, depending on business need. Generally, even though some data may flow into the warehouse in near real-time, a warehouse doesn’t often offer real-time reporting or analytics. It can sometimes be possible, but don’t put extra effort into this if there is no business need.
  • Accuracy – I’ve seen a number of posts from people saying data accuracy shouldn’t be a top priority because of the resources and time it takes to get 100% accurate data into a warehouse. I respectfully disagree… for rough directions, trends and some types of analytics you could argue that 100% accuracy isn’t a requirement. However, for MI (Management Information) style reporting, especially reporting that directly affects the amount people are paid, there is no wiggle room. In the UK, the NHS works like this: hospitals record the number of operations, appointments, procedures etc… then submit this information and get paid on those very numbers. If your figures are lower than they should be, some very expensive operations could end up getting missed, and that can amount to thousands, if not tens of thousands, in some cases.
  • Efficiency – Ideally any solution should be as efficient as possible, and the simplicity of a solution often correlates closely with its efficiency. You should always aim for maximum efficiency, but remember you might want (or need) to sacrifice some of it if the trade-off is worth it.
  • Flexibility – Any warehouse design or architecture should be flexible enough to allow for changes further down the line. Obviously you can’t account for every possible change, but things like using textual fields for business keys, despite some performance hit, might save you a headache later on. For example, further down the line you might find you need to ingest data from a system that uses GUIDs instead of integers as its business keys. It would be nice if everyone just used integers for their business keys, but some crazy developers out there insist on using non-integers like GUIDs, despite the issues with sorting and indexing…
  • Scalability – Unless you know exactly how much data you’ll ever process, you’ll also need to make sure you’ve got a scalable platform so that you can keep processing as the total amount of data grows. Maybe you archive some of it, partition it (if possible) or model tables to split the data; or maybe you can scale your compute up and down. Sometimes you may need to pre-process some data before using it in another large operation to improve the speed of that larger operation.
  • Reliability – Any architecture or platform needs to be reliable, obviously, but how far do you take it? If a pipeline fails, do you re-run it a specific number of times or fail straight away? If a dataset process fails that subsequent processes depend on, do you fail everything or just that branch? Even the infrastructure might need failover or high availability in place, although within reason and budget.
  • Security and governance – Will your model support security by design or do you need to put another layer over the model to provide the correct security? Do you need row level or even object level permissions?
  • Cost – Obviously you want the solution that incurs the least cost, but sometimes an initial capital outlay, say for physical hardware, might look expensive (or indeed cheap) next to a cloud solution. But what if you’re scaling your cloud resources up and down, does that even out the costs? And there’s the added overhead of data centre costs if you’re going down the on-prem route.
  • Performance – Any solution is pointless if it’s too slow… but too slow for what? A solution that takes time to process data into its final form might be seen as slow to build, but if it’s very fast to query, that’s fine. Obviously you’d hope to get processing/build time down as much as you can with the tools and technology you’ve got, but in a data platform the end users only care that it’s fast to query, interrogate and analyse, and has the data they expect at the time it should be there. On the flip side, if your solution is far too slow to process data, even if it’s still fast to query, users might not be happy with the delay in data delivery.
  • User friendliness – If your solution works brilliantly but none of the end users can connect to it easily or understand it, then you’ve got a problem. It’s always a good idea to engage end users early in development to make sure they’re aware of how it works, how to connect to it, and that it’s straightforward to use. You’ll sometimes find you need to compromise on areas you might ‘like’ to make it easy for users to actually use.
  • Monitoring – Finally, if your system is failing but you don’t know about it, that’s a big problem. So make sure you’ve got some kind of monitoring in place, even if it’s just a simple email when something fails.
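To make the reliability point above a bit more concrete, here’s a minimal Python sketch of the ‘re-run a specific number of times or fail straight away’ decision. The `run_with_retries` wrapper and the `flaky_load` step are entirely made up for illustration; a real pipeline would use whatever retry settings your orchestration tool provides.

```python
import time

def run_with_retries(step, max_attempts=3, delay_seconds=0):
    """Run a pipeline step, retrying on failure before giving up."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: fail the pipeline for real
            time.sleep(delay_seconds)  # back off before the next try

# Hypothetical flaky step: fails twice, then succeeds on the third go.
calls = {"n": 0}
def flaky_load():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "loaded"

result = run_with_retries(flaky_load)  # succeeds on attempt 3
```

Setting `max_attempts=1` gives you the ‘fail straight away’ behaviour instead; which one you want usually depends on whether downstream processes can tolerate the delay.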

In terms of the actual data entities in any solution, I usually try to keep these in mind…

  • Clear and concise business entities – The original source system may have several tables, but do these really represent just one business ‘entity’? Should you denormalise them into a single new object that clearly represents that entity? I’ve seen systems with several tables just for the ‘customer’; is that going to be easy for business end users and analysts to query, or do you need to build a better model they can query easily and quickly?
  • Data granularity – What level of aggregation or summarisation do your end users want or need? You don’t want to mix different levels of granularity in the same table, otherwise you’ll get problems when users run aggregation functions.
  • Point in time – Do you need to know the historical attributes of your data at a particular time, or will the latest attribute values do? Think of people’s addresses: they change over time, so does your reporting need to account for this or not? Don’t build it in if it’s not something you need.
  • Conformed tables (or dimensions) – We may have several fact tables, but often we want to slice them using the same attributes, such as number of purchases by customer location. We don’t want a repeated dimension for each fact table if they share a common relationship, so try to share dimensions across fact tables. For example, in a medical scenario you don’t want many different patient dimensions; ideally you have just the one ‘master’ patient dimension.
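The ‘point in time’ bullet above is essentially what Kimball calls a type 2 slowly changing dimension: keep a validity period on each row so you can ask what an attribute was on a given date. A minimal Python sketch, using the address example (the table, column names and dates are all made up):

```python
from datetime import date

# Hypothetical customer address history: each row records the period
# during which that address was current (SCD type 2 style).
address_history = [
    {"customer_id": 1, "address": "1 Old Street",
     "valid_from": date(2020, 1, 1), "valid_to": date(2023, 6, 30)},
    {"customer_id": 1, "address": "2 New Road",
     "valid_from": date(2023, 7, 1), "valid_to": date(9999, 12, 31)},
]

def address_as_of(customer_id, as_of):
    """Return the address that was current on a given date."""
    for row in address_history:
        if (row["customer_id"] == customer_id
                and row["valid_from"] <= as_of <= row["valid_to"]):
            return row["address"]
    return None

# Reporting on a 2022 order picks up the address at that time,
# while current reporting picks up the latest one.
historic = address_as_of(1, date(2022, 3, 15))
current = address_as_of(1, date(2024, 1, 10))
```

If you only ever need the latest values, you can skip the validity columns entirely, which is exactly the ‘don’t build it in if you don’t need it’ point.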

So, with all that in mind, let’s actually take a look at this from the point of view of some of these architectures, methodologies and concepts:

  • Data warehouse – This is a type of data store usually built using either the Kimball methodology (i.e. star schemas) or the Inmon methodology (a normalised enterprise warehouse, typically with star/snowflake schemas built on top). Both ultimately involve fact tables that hold your measures, whilst dimension tables hold the ‘reference’ attributes you might want to split those measures by.
  • Data vault – This was developed by a guy called Dan Linstedt several years ago. He was essentially trying to solve slow data ingestion and adaptation to schema changes in traditional data warehouse design approaches like Kimball. Those arguments have been somewhat nullified by data lakes these days; there’s an interesting blog post on using data vault inside Databricks here. You’ll still, at some point, need to put something over the vault, like a Kimball style star schema or similar, even if it’s just a set of ‘views’, because for end reporting purposes it’s not usually going to match the speed of a star schema based approach.
  • Data mesh – This is more of an approach to data ownership, attempting to decentralise control over data engineering away from a central team and instead push it out to business domain teams. The idea being that those teams ‘know’ their business domain data better than a more general data engineer, who is less concerned with knowing the data and more focused on how to manipulate it. The downside is that being a data engineer and intimately knowing a specific business domain’s data are not necessarily two things that go hand in hand. I know loads of analysts who know the data, but not many are at the same level as a data engineer who can expertly manipulate it. There’s a more detailed description on Wikipedia here.
  • Data fabric – This is more of a concept on how to better integrate data from different sources, using metadata and existing infrastructure. Like many other concepts though, it does not absolve you of doing any real work. Short of some highly advanced AI (i.e. more than just ChatGPT), it will still need skilled people working over the top of it adding governance, otherwise it will just become a ‘wild west’. See this link here for a more detailed description.
  • Lakehouse / data lakehouse – This is what a lot of organisations are currently building or investing in. It essentially combines the speed and flexibility of data lakes with a Kimball style approach to get the ‘best of both worlds’. Because a data lake is essentially a file system with bonus features, you can load raw structured or unstructured files directly into it, which means you can get your data in quickly and easily. With most data platforms nowadays, you can usually query/report directly on the files themselves. Databricks and Microsoft Fabric both support this kind of structure and design.
  • Medallion architecture – This originated from the people at Databricks and is less of an architecture and more of a way to organise the quality of your data: bronze is raw data, silver is cleaned data and gold is ready for end user query and consumption. See here.
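To make the fact/dimension split mentioned above a bit more tangible, here’s a toy star schema query in plain Python: a purchases fact table joined to a shared (‘conformed’) customer dimension, then aggregated by location. All the table and column names are invented for the example; in practice this would be SQL or a semantic model doing the join.

```python
# Toy star schema: one fact table plus a shared customer dimension,
# keyed on a surrogate key. Names and data are purely illustrative.
dim_customer = {
    1: {"name": "Alice", "location": "Leeds"},
    2: {"name": "Bob",   "location": "York"},
    3: {"name": "Cara",  "location": "Leeds"},
}

fact_purchases = [
    {"customer_key": 1, "amount": 10.0},
    {"customer_key": 2, "amount": 25.0},
    {"customer_key": 3, "amount": 5.0},
    {"customer_key": 1, "amount": 15.0},
]

# 'Number of purchases by customer location': join each fact row to
# its dimension row on the key, then count per location attribute.
purchases_by_location = {}
for row in fact_purchases:
    location = dim_customer[row["customer_key"]]["location"]
    purchases_by_location[location] = purchases_by_location.get(location, 0) + 1
```

Because the same customer dimension can sit next to any other fact table (orders, complaints, returns…), every report slices by the same location values, which is the whole point of conforming it.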

So how would you do it?

Well, as I talked about in part 9 of my series on using Microsoft Fabric for NHS data processing (see here), I personally like to take the bits I like from wherever I find them. With all these different terms knocking around these days, and all the various approaches and architectures, it can get pretty confusing, almost to the point where you can’t see the wood for the trees, which is why I take the approach I do. Saying that, I always try to keep things as simple as possible and don’t use something just for the sake of it.

However, this approach could be a problem if you’re working within a team, or if you need to adhere to a particular architecture and/or methodology for some reason. And surely, if everyone worked like this, it’d end up a complete mess? Yes, although I’d argue that in a lot of cases the horse has already bolted… I also like to think that if I’ve named something clearly enough, it should tell other people what it does. For example, if I created a ‘landing’ lakehouse, would you say it’s obvious that this is where data lands? Hopefully it’s not just me 😉

In terms of my personal favourite approach, I usually bring it down to what technology is being used. Generally, if your end users need to use Power BI or most other reporting tools, a star schema (i.e. Kimball style) is usually the way to go, as this will give you the best query performance. You do trade off some processing time to get the data into this structure, but that’s the idea: the work is done beforehand, so that the final structures are in the most optimal form for reporting and querying. A data vault, for example, might be quicker to load data into, although to be honest, with data lakes nowadays this is less of a problem than in the old days of loading into a relational database. The whole architecture and platform will be more complex though, and will never be as fast to query as a star schema (at least with most tools). In fact, I’ve seen some places put a Kimball style layer on top of a data vault anyway! So, with that out of the way, in most cases I’d go with the following…

  • Land your data in a lakehouse – these days this is usually the fastest way to get data into a platform, as write speeds into data lakes are generally much faster than loading into a database. Get the data from the source systems whilst keeping an almost identical schema structure to the source; this helps mitigate errors and makes it easier to track down issues between the source and the extracted data.
  • Merge incoming data into history tables – merge any incoming ‘landing’ data into full history tables, again keeping the schema almost the same as the source. This way you’ve got all the data you need without having to go back to the various source systems each time. Source system data experts and analysts can often use this for testing purposes, as they’ll usually know the data better than a data engineer does.
  • Load data from your history tables into a well designed data model – the key phrase here is ‘data model’. You need to take some time and care to design a data model (usually in collaboration with your analysts) that will tick all the boxes you need. Don’t be fooled by any of the strategic ‘data [insert buzzword here]’ concepts discussed above; source systems are rarely in a structure that will be quick and easy for end users to query. There is usually no shortcut, and in 99% of cases you will have to design some kind of better data model and transform/load your raw data into it. For most data platforms and end user reporting tools, a Kimball style star schema will usually give users the fastest query performance, so facts and dimensions, combined with aggregated or summarised views, should cover most cases.
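The ‘merge into history tables’ step above is essentially an upsert keyed on the business key: update rows that already exist, insert the rest. In Databricks or Fabric you’d normally do this with a `MERGE` over Delta tables, but the logic itself is simple enough to sketch in plain Python (the table and column names here are made up):

```python
def merge_into_history(history, landing, key="customer_id"):
    """Upsert landing rows into a history table keyed on the business key:
    rows that already exist are updated, new rows are inserted."""
    by_key = {row[key]: row for row in history}
    for row in landing:
        # Overlay incoming values on any existing row for the same key.
        by_key[row[key]] = {**by_key.get(row[key], {}), **row}
    return list(by_key.values())

history = [{"customer_id": 1, "name": "Alice", "city": "Leeds"}]
landing = [
    {"customer_id": 1, "name": "Alice", "city": "York"},  # changed row
    {"customer_id": 2, "name": "Bob",   "city": "Hull"},  # new row
]
history = merge_into_history(history, landing)
# history now holds the updated Alice row plus the new Bob row
```

A real implementation would also stamp each row with a load date so the history tables record when each version arrived, but that’s a detail for your platform of choice.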

Note that I’m not advocating a stage in between the history tables and the data model transformations. In the medallion architecture there would be a stage there where the data has been cleansed, deduplicated and mildly transformed. I wouldn’t waste time on this unless it’s a business requirement though; don’t just build it because it’s in the ‘medallion architecture’! You’ll often find most analysts want to see the raw data as it was in the source system(s), including duplicates and other data issues. Obviously, if they do want you to clean and deduplicate tables whilst keeping the structure etc… then it looks like you’ll need this stage after all 😂
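If that in-between stage does turn out to be a requirement, the typical deduplication is ‘keep the most recent row per business key’. A minimal sketch, with entirely hypothetical column names:

```python
def deduplicate_latest(rows, key="patient_id", order_by="updated_at"):
    """Keep only the most recent row per business key, a typical
    cleansing step. Column names here are made up for illustration."""
    latest = {}
    for row in rows:
        existing = latest.get(row[key])
        if existing is None or row[order_by] > existing[order_by]:
            latest[row[key]] = row
    return list(latest.values())

raw = [
    {"patient_id": 7, "postcode": "LS1 1AA", "updated_at": "2024-01-01"},
    {"patient_id": 7, "postcode": "LS2 2BB", "updated_at": "2024-03-01"},
    {"patient_id": 8, "postcode": "YO1 1CC", "updated_at": "2024-02-01"},
]
clean = deduplicate_latest(raw)
# one row per patient; patient 7 keeps the newer March record
```

Note this throws the older duplicates away, which is exactly why the analysts who want to see the raw data as-was should query the history tables instead.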


You still with me? 😉

Most modern platforms that work with data lakes, such as Databricks, and now Microsoft Fabric, will handle this with ease. So, call these stages or layers whatever you want, but just make sure the names mean something clear to everyone you work with (not just what you think sounds cool), and to anyone who might be employed in the near future 🫡

Hopefully this has given you some assistance, let me know your thoughts. Anyway, till next time, thanks for reading my blog…