
Ah, that old chestnut. If I had a bitcoin for every time someone said something like this to me, I’d be able to retire. Actually, even if I retired I’d still do stuff like this, because I just enjoy doing it anyway! Obviously a single flat table is easy to understand, especially if it doesn’t have many fields or rows. For small tables it’s not really an issue, but once you start getting lots of columns, and especially lots of rows, we’re back to the same argument I use almost every time when it comes to data warehousing and why we do (or don’t do) things a certain way…
Performance…

This is stupid, a single flat table is easier to make and easier for users to understand
Well, a small one, yes: small flat tables can be easy to make and understand. However, as your datasets grow larger and larger, what is your primary goal when designing your data warehouse? Apart from data modelling to make sure you can answer the questions the business will want answers to, I’d argue the primary driver is (and should always be) query performance. That doesn’t mean the warehouse can’t also be easy to understand and relatively easy to build. Plus, for business users we can do some other things to help them later on, such as creating aggregated or even materialised views.

Let’s first define what we’re talking about and then look at the pros and cons… If you’re not familiar with the Kimball methodology when it comes to data warehousing you might want to read up on that first (like maybe here), otherwise carry on. If you are familiar, then you’ll know that Kimball is the leading design methodology used when building a data warehouse, so that’s what we’ll focus on here. I’d also say this methodology still makes sense if we’re talking about a lakehouse – at least in terms of structured data. With that in mind, I’d personally say a well designed and structured data warehouse should really only have 3 types of tables. You could argue there may be a fourth, but more on that later:-
- Fact tables – These tables contain the measures (i.e. the facts, such as length of stay, cost, size etc…) and dimension ‘keys’, which should be foreign keys (usually just integer fields) to dimension tables. There would also usually be a ‘business key’ (or natural key) field, which tends to be the original key or ‘identifier’ for the corresponding row from the source system (there could be several different source systems). They may also contain some basic auditing date information and maybe a deleted flag (or sometimes a deleted date) to indicate whether the row should be considered ‘deleted’ or not. Note that in a good data warehouse no row should be physically deleted; it should only ever be flagged as deleted (i.e. soft deleted). Physically deleting rows may save space, but you lose auditing and traceability, and you might have problems merging new or updated data into the fact table later on.
- Dimension tables – You could think of these as special ‘lookup’ tables, but they serve more purposes than a simple lookup table. Many (well designed) dimensions contain hierarchies; for example, a category dimension may have fields named ‘parent category’, ‘main category’ and ‘sub category’ etc… This enables aggregation (e.g. count, sum, etc…) at the different levels of the hierarchy. They might also be ‘slowly changing’, which means a different version of an existing row can be added to the table with a specific ‘from’ and ‘to’ date range within which the data in that row applies. There should also be a primary key (the dimension key), which should almost always be a surrogate key such as an integer. Yes, even for a date dimension! There is also a business key, which is usually the original source system’s row ‘identifier’ or ‘key’ field, followed by several descriptive attribute columns (e.g. name, description, category etc…). Note that there are many different types of dimension tables, such as junk dimensions, flag dimensions, slowly changing dimensions etc… I’ve done posts on some of these, but I’ll cover all of them eventually.
- Bridging tables – These serve the purpose their name suggests: they form a bridge between a fact table and a dimension table when there is a many-to-many relationship. For example, imagine a group therapy session with many patients in attendance; in a ‘flat’ group session fact table, how would you store this list of patients? Maybe comma separated, or an XML column, or even several columns? Instead, the bridge table sits in between the fact table and the patient dimension and tells you, for each row in the fact table, which patients are related. This approach means you don’t need several ‘patient’ dimension key columns in your fact table, as we’ve turned them into several rows in the bridge table. Thinning out the fact table like this by reducing columns means it requires less disk space (or memory), and that usually means better performance (i.e. less data to scan).
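To make the three table types concrete, here’s a minimal sketch of that group session example using SQLite via Python. All table and column names (`dim_patient`, `fact_session`, `bridge_session_patient` etc.) are invented for illustration, not taken from any real warehouse:

```python
import sqlite3

# Tiny in-memory star schema: one fact, one dimension, one bridge table.
con = sqlite3.connect(":memory:")
cur = con.cursor()

cur.executescript("""
CREATE TABLE dim_patient (
    patient_key  INTEGER PRIMARY KEY,  -- surrogate key (the 'dimension key')
    business_key TEXT,                 -- original source system identifier
    name         TEXT                  -- descriptive attribute
);
CREATE TABLE fact_session (
    session_key   INTEGER PRIMARY KEY,
    business_key  TEXT,
    duration_mins INTEGER,             -- a measure (a 'fact')
    deleted_flag  INTEGER DEFAULT 0    -- soft delete, never physical
);
-- The bridge resolves the many-to-many: one session, many patients,
-- as extra rows rather than extra columns on the fact table.
CREATE TABLE bridge_session_patient (
    session_key INTEGER REFERENCES fact_session(session_key),
    patient_key INTEGER REFERENCES dim_patient(patient_key)
);
""")

cur.executemany("INSERT INTO dim_patient VALUES (?,?,?)",
                [(1, "P001", "Alice"), (2, "P002", "Bob")])
cur.execute("INSERT INTO fact_session VALUES (1, 'S100', 60, 0)")
cur.executemany("INSERT INTO bridge_session_patient VALUES (?,?)",
                [(1, 1), (1, 2)])

# Which patients attended session S100? Walk fact -> bridge -> dimension.
rows = cur.execute("""
    SELECT p.name
    FROM fact_session f
    JOIN bridge_session_patient b ON b.session_key = f.session_key
    JOIN dim_patient p ON p.patient_key = b.patient_key
    WHERE f.business_key = 'S100'
    ORDER BY p.name
""").fetchall()
print([r[0] for r in rows])  # ['Alice', 'Bob']
```

Note how the fact table stays one row per session with no patient columns at all; adding a third patient to the session is just one more bridge row.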

Show me the money…
So now we know the basics of what tables should form a Kimball style data warehouse, why might we want to use them instead of the simple flat tables we talked about before? Let’s look at some of the pros and cons…
Pros of Kimball methodology
- Table size and query speed – A flat table with lots of rows can be converted to a fact table with the same number of rows. ‘So what?’ you say? Well, if your flat table is very wide (i.e. it has many fields, especially text or comment fields), it takes up a lot of space on disk (or in memory). A query over that much data will obviously take longer to complete than the same query over a much slimmer fact table, simply because there is less data to scan.
- Fast report filters – In most reports and dashboards you will have options to filter the data, usually a drop-down menu or something similar. How do you get the list for that drop-down from a flat table? You would either SELECT DISTINCT on the column, or SELECT the column with a GROUP BY. How well would that work on a flat table with 1 million rows, 10 million rows, or even 100 million rows or more? Would it not be quicker to have just the distinct rows in their own lookup (dimension) table, query that first to find a small set of integer ‘dimension key’ values, and then search for those keys in the fact table?
- Hierarchies and drill-downs – What if we want sub-categories for a product so users can ‘drill down’ from a higher level to a lower one? With a single flat table you can only add more fields, and remember that if we add, say, 3 more fields, each of them is repeated on every single row. In a Kimball design we would instead add those 3 fields to the relevant (much smaller) dimension table, keeping the size of the fact table unchanged.
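The filter and drill-down points above can be sketched in a few lines of SQLite. Again, every name here (`dim_category`, `fact_sales`, the sample values) is made up purely to illustrate the pattern:

```python
import sqlite3

# Small category dimension with a two-level hierarchy, plus a fact table.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE dim_category (
    category_key  INTEGER PRIMARY KEY,  -- surrogate key
    main_category TEXT,                 -- higher hierarchy level
    sub_category  TEXT                  -- lower hierarchy level
);
CREATE TABLE fact_sales (
    category_key INTEGER REFERENCES dim_category(category_key),
    amount       REAL                   -- the measure
);
""")
cur.executemany("INSERT INTO dim_category VALUES (?,?,?)",
                [(1, "Food", "Fruit"), (2, "Food", "Veg"), (3, "Drink", "Juice")])
cur.executemany("INSERT INTO fact_sales VALUES (?,?)",
                [(1, 10.0), (1, 5.0), (2, 7.5), (3, 2.5)])

# Report filter: scan the tiny dimension, never the (potentially huge) fact.
filter_values = [r[0] for r in cur.execute(
    "SELECT DISTINCT main_category FROM dim_category ORDER BY 1")]

# Drill-down: aggregate the fact at whichever hierarchy level we want,
# by joining to the dimension and grouping on that level's column.
totals = cur.execute("""
    SELECT d.main_category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_category d ON d.category_key = f.category_key
    GROUP BY d.main_category
    ORDER BY d.main_category
""").fetchall()
print(filter_values)  # ['Drink', 'Food']
print(totals)         # [('Drink', 2.5), ('Food', 22.5)]
```

Switching the drill-down to `sub_category` only changes the GROUP BY column; the fact table is untouched either way.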
Cons of Kimball methodology
- Extra work – Someone (ideally a collaborative group) needs to actually do some data modelling to find out what questions and answers are relevant to the business. Then there’s the transformation work to take the raw data, cleanse it and structure it into dimension and fact tables, instead of just using the original flat table.
- Slower processing time – It will of course eat up processing time performing the required operations to transform the raw data into its final form. However, part of data warehousing is that we sacrifice some additional time up front processing data so that it ends up in the most optimal structure for query speed later on.
- End user understanding – Depending on the size of the warehouse, it’s likely end users will struggle to understand lots of different fact and dimension tables. This can be mitigated, though, with considered use of views and/or data marts.
It should also be mentioned that many end user reporting tools (such as Power BI) work much faster with data in a star schema format (i.e. Kimball design).
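On the ‘end user understanding’ con: a view can hide all the star-schema joins so users query one flat-looking object while the warehouse keeps its proper structure underneath. A minimal sketch (all names invented for illustration):

```python
import sqlite3

# A view that presents the star schema as a single flat shape to end users.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_order  (product_key INTEGER, quantity INTEGER);

-- End users only ever see this view; the joins live here, not in their SQL.
CREATE VIEW v_orders_flat AS
SELECT d.name AS product, f.quantity
FROM fact_order f
JOIN dim_product d ON d.product_key = f.product_key;
""")
cur.executemany("INSERT INTO dim_product VALUES (?,?)", [(1, "Widget")])
cur.executemany("INSERT INTO fact_order VALUES (?,?)", [(1, 3), (1, 2)])

# A user's query against the view looks just like querying a flat table.
flat = cur.execute(
    "SELECT product, SUM(quantity) FROM v_orders_flat GROUP BY product"
).fetchall()
print(flat)  # [('Widget', 5)]
```

In a production warehouse you’d typically materialise or aggregate such views for heavy use, as mentioned earlier, but the principle is the same.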
TL;DR: if you’ve only got a few small, thin tables then maybe this is overkill; if you’ve got large datasets then this should be the way to go if you want the fastest queries and reports – just be aware that you’ll need to put some effort into it and it will eat up import/processing time.

Here are some useful references and links to techniques and best practices
Dimensional modelling – From Ralph Kimball, one of the pioneers of data warehousing and the man behind dimensional modelling (with some help from others). Despite being fairly old and no longer updated, this link has some great definitions, descriptions and tips/suggestions for data warehousing. I’ve directly linked some of these below, but they can all be accessed via this ‘top’ level link 😉
https://www.kimballgroup.com/2009/05/the-10-essential-rules-of-dimensional-modeling/
Wikipedia’s take on dimensions and facts
https://en.wikipedia.org/wiki/Dimension_(data_warehouse)
https://en.wikipedia.org/wiki/Fact_table
Power BI and the star schema – why the star schema (i.e. facts and dimensions) matters to Power BI
https://docs.microsoft.com/en-us/power-bi/guidance/star-schema
https://youtu.be/vZndrBBPiQc
‘Flag’ dimension tables
https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/flag-indicator-attribute/
‘Text/comment’ dimension table
https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/text-comment/
No NULLs in your fact table!
https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/fact-table-null/