Data warehousing – what the heck is a flag dimension?

girl in red crew neck shirt looking confused — Photo by Anna Shvets on Pexels.com

Ah yes, what the heck indeed… The Kimball group website has an old page about this here, but basically a flag dimension is a special type of dimension that you can (and should) put all your ‘flag’ type fields into instead of leaving them in a fact table or putting them in another dimension. I know that Kimball refers to this kind of dimension more generally as a ‘junk’ dimension, which can allow other types of fields and not just flags. However, in some cases it might be worth while creating a specific dimension just for flags. But why would anyone do this though, and what’s the point? Isn’t it just complicating things? I can hear the groans of the single flat table brigade from a mile away already… well, read on and lets see why I’ve not gone mad (yet).

I feel the need, the need for speed…
Maverick

The primary driving reason for doing this, like most other stuff in data warehousing, is to get the most performance when people (and their client tools) are querying the data. You might incur some time for the processing and building, but we pay that price for improved performance later on. Hang on though, how does moving the flags to a separate dimension improve performance? Well, before we go any further, lets clear up what I mean by a ‘flag’, these are the fields in the source data that are either Boolean, which means they only have 2 possible values e.g. true/false, yes/no, y/n etc… or are similar and have very low cardinality, for example yes/no/maybe. I think I’d personally draw the line at that last example, but hopefully that makes sense? Yes? Ok, let’s move on…

To illustrate the advantages lets just make up a basic example. Imagine we’ve got a relatively large table in our raw source data, maybe say it’s a list of hospital appointments with about 50 million rows. Each row might have the following fields (see below). Note that below I’ve put any flag type fields in bold and listed the possible values that the field could contain (not necessarily that it will, just that it could). So these ‘flag’ fields might be BIT or CHAR or something else, they could also potentially be NULL too, but hopefully you get the idea:-

Appointment Date
Start Time
End Time
Staff Code
Room Code
Is Private Appointment – 1, 0
Requires Support Staff – 1, 0
Requires Double Room – y, n
Requires Monitoring Equipment – yes, no
Not To Be Disturbed – yes, no
Will Respond To Advice – yes, no, maybe

Ok, so we’ve got 6 flag fields, some with just 2 possible values others with 3. Now obviously the ‘flag’ fields will take up a certain amount of space on disk or in memory just like any other field in a table. The exact amount isn’t relevant here, as this can vary between database providers and whether compression or other settings are used. What we’re interested in is the fact that obviously 6 columns will take up more space than say 1 column? So, for our 50 million row table we’ve got 6 columns * 50 million rows = 30 million potential pieces of information we need to store somewhere, whether on disk or in memory.

I’ve seen people do this (insert face palm here) many times, but when we start to build our fact table we don’t really want to just ‘bung’ these 6 flags onto the end of it – maybe at a pinch if there’s literally only ever going to be 1 flag field. Just remember, in a fact table we typically/ideally want to just have three types of fields:-

Measures – values like ‘cost’, ‘amount’ or ‘number of minutes waiting’
Dimension keys – just foreign keys (not necessarily enforced ones) to our dimension tables, usually integers
Metadata – e.g. primary key for the fact table, business key(s), date of insert, deleted date etc…

So what do we do if we can’t or don’t want to put them in the fact table? We put them in a flag dimension of course (clue is in the title of this post after all). We replace the 6 flag fields with a single dimension key field in our fact table… but wait, isn’t that just like splitting the fact table in 2? How’s that making things better? Well, this is where the good part comes in…

japanese lucky coin cat — Photo by Miguel Á. Padriñán on Pexels.com

The flag dimension will not actually be 50 million rows in total… what, what?!? Think about it, the 6 flag fields can only ever have a certain limited number of combinations. So, if they were all Boolean type fields then this would be 2 to the power of however many flag fields you have. In our example, its a little different as we’ve also got a field with 3 potential values. But just for example (ignoring our 3 value field) say it was just 2 to the power of 6, then this is just 64 rows! That’s quite a bit less than 50 million… but wait, it gets even better, as most of the time in the real world the data will probably not contain all 64 possible combinations, so your flag dimension table row count could be even less!

man jumping over white fence — Photo by Ady April on Pexels.com

Jumping back to the fact table for a moment, we’ve now got just a single field which is for our flag dimension key and which can only have values in the range of 1 to 64 (if we’re using an integer as our dimension key column type, which I highly suggest for best performance reasons). Naturally then, this dimension key field is going to have very low cardinality (which means it only has a small number of distinct values), and as such it can be highly compressed, resulting in much less disk space… and, if there is less space needed to scan, that means a query will run faster (are you starting to see the benefits now?). Not withstanding the fact that first you only need to scan the flag dimension (just those 64 rows) to find the flag combination you want, then get the relevant key value, and then finally scan the larger fact table for that single particular dimension key value. In most cases, the performance boost from this could be tremendous, especially with large fact tables.

Hopefully by now I’ve illustrated the benefits, granted most of these benefits you’ll only see for large tables but still, its something worth keeping in mind when you’re trying (or need) to squeeze the most performance out of your data warehouse.

Ok, but how do you create one?

My favourite part of all of this, is that these flag dimensions are stupidly simple to create. At the very least, you can just do a ‘SELECT DISTINCT’ (or ‘GROUP BY’) to get a list all the flag field combinations from your source table and that’s pretty much it. Although often I will (and strongly suggest you also) add some additional descriptive string fields to contain nicer textual descriptions for the flags, think ‘yes/no’ instead of ‘1/0’. After all, a good dimension is meant to contain clear descriptive values for business user reports and dashboards. Ideally you don’t want a dashboard designer then also having to add a custom field in the report to translate your ‘1’ and ‘0’ into ‘yes’ and ‘no’ right? Imagine showing an executive a report with ‘1/0’ or ‘true/false’ instead of a nice ‘yes/no’, it’s not very professional looking is it? By all means keep the ‘BIT’ fields though as you’ll probably need to use them to join onto your source data when you’re updating your dimension and also to find the relevant dimension key for your fact table rows. Also, you could use them for filtering internally on a report… just make sure you ‘hide’ them in any report view so business users can’t see them and definitely do not show them in something like a slicer!

So, an example of a procedure to create this flag dimension might be something like this below:-

CREATE PROCEDURE [uspProcessDimFlag] AS
BEGIN
	-- Load the flags into a temp table first, could use a CTE for a small
	-- source table maybe and performance would probably be ok/same ;-)
	DROP TABLE IF EXISTS [#Flags]
	CREATE TABLE [#Flags] (
		[IsPrivateAppointment]			BIT NULL
		,[RequiresSupportStaff]			BIT NULL
		,[RequiresDoubleRoom]			CHAR(1) NULL
		,[RequiresMonitoringEquipment]	VARCHAR(10) NULL
		,[NotToBeDisturbed]				VARCHAR(10) NULL
		,[WillRespondToAdvice]			VARCHAR(10) NULL
	)

	INSERT INTO [#Flags] (
		[IsPrivateAppointment]
		,[RequiresSupportStaff]
		,[RequiresDoubleRoom]
		,[RequiresMonitoringEquipment]
		,[NotToBeDisturbed]
		,[WillRespondToAdvice]
	)

	SELECT
	[IsPrivateAppointment]			-- Could be 0, 1, NULL
	,[RequiresSupportStaff]			-- Could be 0, 1, NULL
	,[RequiresDoubleRoom]			-- Could be Y, N, NULL
	,[RequiresMonitoringEquipment]	-- Could be 'YES', 'NO', NULL
	,[NotToBeDisturbed]				-- Could be 'YES', 'NO', NULL
	,[WillRespondToAdvice]			-- Could be 'YES', 'NO', 'MAYBE', NULL

	FROM
	[MyBigSourceTable]

	-- Group by so we get a distinct list of all flag combinations!
	GROUP BY
	[IsPrivateAppointment]
	,[RequiresSupportStaff]
	,[RequiresDoubleRoom]
	,[RequiresMonitoringEquipment]
	,[NotToBeDisturbed]
	,[WillRespondToAdvice]

	-- Merge flag combinations into flag dimension
	MERGE
	[DimFlag] [trg]

	USING
	[#Flags] [src]
	ON
	-- We always need to join on every flag field to make sure we get all combinations! Note 
	-- the use of EXISTS/SELECT/INTERSECT as this takes NULL's into account when matching 
	-- fields in [trg] and [src]. Otherwise we'd need to use something like 'ISNULL' to 
	-- replace all the NULL's with a value, this is more robust and should perform better
		EXISTS(SELECT [trg].[IsPrivateAppointment]			INTERSECT SELECT [src].[IsPrivateAppointment])
	AND EXISTS(SELECT [trg].[RequiresSupportStaff]			INTERSECT SELECT [src].[RequiresSupportStaff])
	AND EXISTS(SELECT [trg].[RequiresDoubleRoom]			INTERSECT SELECT [src].[RequiresDoubleRoom])
	AND EXISTS(SELECT [trg].[RequiresMonitoringEquipment]	INTERSECT SELECT [src].[RequiresMonitoringEquipment])
	AND EXISTS(SELECT [trg].[NotToBeDisturbed]				INTERSECT SELECT [src].[NotToBeDisturbed])
	AND EXISTS(SELECT [trg].[WillRespondToAdvice]			INTERSECT SELECT [src].[WillRespondToAdvice])

	-- Insert any new flag combinations. Note that as we're joining on all flag fields 
	-- above, there is NO update part to the merge! This will allow us to work with both
	-- full refreshes and incremental data changes coming from source, as either way any
	-- new combinations will just get merged into the flag dimension. We 'could' mark
	-- non-existent combinations as deleted in the dimension, however this would ONLY be
	-- possible with a full load and not if you're doing incremental loads! Hence this is
	-- why there is no UPDATE (or soft delete update) part to this merge statement.
	WHEN NOT MATCHED BY TARGET THEN INSERT(
		[IsPrivateAppointment]
		,[RequiresSupportStaff]
		,[RequiresDoubleRoom]
		,[RequiresMonitoringEquipment]
		,[NotToBeDisturbed]
		,[WillRespondToAdvice]
	) VALUES (
		[src].[IsPrivateAppointment]
		,[src].[RequiresSupportStaff]
		,[src].[RequiresDoubleRoom]
		,[src].[RequiresMonitoringEquipment]
		,[src].[NotToBeDisturbed]
		,[src].[WillRespondToAdvice]
	);

	-- Done, get rid of temp table for good measure (although of course it would be dropped
	-- when the session finishes anyway ;-)
	DROP TABLE IF EXISTS [#Flags]
END

Then when you come to build your fact table you’d join the flag dimension in a similar manner, maybe something like this below. Note that I’ve excluded other parts for brevity:-

-- Select all fields for our fact table, probably then load all this into a temp table
-- and finally merge the temp table into the destination fact table ;-)
SELECT
...The primary key and other metadata fields would be here probably
,[dim_flag].[FlagDimensionKey] -- Get our flag dimension key for this row
...and so on for other dimension tables, measure fields etc...

FROM
[MyBigSourceTable] [fact_source]

-- Why INNER JOIN? Since all combinations will exist there will never be a non-matching
-- or 'unknown' row so there is no point in doing a LEFT JOIN...
INNER JOIN
[DimFlag] [dim_flag]
ON
-- We always need to join on every flag field to make sure we get all combinations! Note 
-- the use of EXISTS/SELECT/INTERSECT as this takes NULL's into account when matching 
-- fields in the 2 tables. Otherwise we'd need to use something like 'ISNULL' to 
-- replace all the NULL's with a value, this is more robust and should perform better
	EXISTS(SELECT [fact_source].[IsPrivateAppointment]			INTERSECT SELECT [dim_flag].[IsPrivateAppointment])
AND EXISTS(SELECT [fact_source].[RequiresSupportStaff]			INTERSECT SELECT [dim_flag].[RequiresSupportStaff])
AND EXISTS(SELECT [fact_source].[RequiresDoubleRoom]			INTERSECT SELECT [dim_flag].[RequiresDoubleRoom])
AND EXISTS(SELECT [fact_source].[RequiresMonitoringEquipment]	INTERSECT SELECT [dim_flag].[RequiresMonitoringEquipment])
AND EXISTS(SELECT [fact_source].[NotToBeDisturbed]				INTERSECT SELECT [dim_flag].[NotToBeDisturbed])
AND EXISTS(SELECT [fact_source].[WillRespondToAdvice]			INTERSECT SELECT [dim_flag].[WillRespondToAdvice])

-- Other joins to other dimensions etc...

Anyway, on that note I’m going to end this post here. I hope this has got you thinking about how you might use flag dimensions. I’m amazed that the number of data warehouses I’ve seen where people either don’t know about them or just don’t use them, despite the potential performance (and readability) benefits. Anyway…

I’ll discuss some other types of ‘junk’ dimensions in further posts, for now, till next time…

Data warehousing – what the heck is a flag dimension?

Like this:

Related

Leave a ReplyCancel reply

Data warehousing – what the heck is a flag dimension?

Share this:

Like this:

Related

Leave a ReplyCancel reply

Discover more from Aventius