Data warehousing – bridge tables, what, why, eh?

a tuxedo cat on a hanging wooden bridge — Photo by Arina Krasnikova on Pexels.com

Are bridge tables scary? Maybe… but once you clearly understand the concept and the use cases they’re a lot less terrifying. After that there’s not much to it really, and in fact you’ll probably find you’ve already created something similar somewhere along the line in your career. So, what are they and why might you want to use one? The simple answer is that they’re a table that sits between a fact table and a dimension table, like a ‘bridge’ (hence the name), which lets us model many to many relationships between facts and dimensions. For another more formal description, see the Kimball groups definition here.

Sometimes the source data might even have something like a bridge table already, sometimes it won’t. Obviously if there is one already, that means you don’t have to do much else, but often you’ll see a row in a source table with something like the following columns:-

AppointmentID
AppointmentDate
PatientID
ClinicianID1
ClinicianID2
ClinicianID3
and so on…

This is where we probably need to, or should, create a bridge table. Obviously you could just add several dimension keys fields to the fact table such as ‘ClinicianKey1’, ‘ClinicianKey2’, ‘ClinicianKey3’ and so on (you get the idea). Many people do take this approach… However, what if each appointment has a varying number of clinicians present? Meaning that sometimes the ‘ClinicianID’ fields were NULL and sometimes all 3 were populated? Or, what if there were 10 ‘ClinicianID’ fields instead of just 3?

lined up of rubber duck on shelf — Photo by Designecologist on Pexels.com

Say for example you edge on the side of caution and add 10 clinician dimension key columns to your fact table to cover all bases. Then checking the source data and you find in most cases the appointments only involved 1 or 2 clinicians? For a start, that’s a lot of unused or redundant columns in the fact table. More columns in a table obviously also makes a table wider, thus taking up more disk space which then results in slower query performance (more data to scan, slower to finish). In addition, this makes the fact table very rigid and not at all flexible or scalable. For example, what if the source system gets updated and now has 15 ‘ClinicianID’ fields instead of the previous 10? That would mean its time to modify your fact table structure, ouch! Also not forgetting aggregation, what if you wanted to count the average number of clinicians per appointment? Erm, ah… yes, ugly… good luck with that 😉

person showing white and orange helicopter — Photo by Inge Wallumrød on Pexels.com

Thankfully, this is where bridge tables comes to the rescue… we can take the ’10’ (or however many, it doesn’t matter) dimension key columns out of the fact table, turn them into rows in a single column and put them in a separate table (the bridge table). This approach now means we can have a varying number of clinicians for any appointment and they’ll only take up as many rows as required. The other advantage is that since we’ve narrowed the 10 columns down to 1 single column and the fact that this new column will have relatively low cardinality, this means that the compression for the column should be pretty good… thus, resulting in less disk space needed, which again means less data to scan and faster query performance.

So, what structure should our bridge table be designed with? Well, it’s pretty simple, for each row in the bridge table we’ll need to know 2 things at the very least…

Which appointment the bridge table row corresponds to in the fact table
What clinician the bridge table corresponds to in the clinician dimension

This means we’ll essentially need our bridge table to contain something like the following information:-

Appointment fact table	Clinician dimension table
Appointment 1	Dr Strangelove
Appointment 1	Dr Evil
Appointment 2	Dr Colossus
Appointment 3	Dr Manhattan
Appointment 3	Dr Feelgood
Appointment 3	Dr Watson

Pretty straightforward yes? We can see for appointment 1, there were 2 clinicians. For appointment 2, just the one. Finally, for appointment 3 there were 3 clinicians. So, if we now replace the above and use the fact table primary key to relate to the fact table, and the clinician dimension key to relate to the clinician dimension, we might end up with something like this:-

AppointmentFactKey	ClinicianDimensionKey
1	2
1	1
2	8
3	6
3	9
3	12

This is now the core of our bridge table, our fact table hasn’t been polluted with unnecessary additional columns and the clinician dimension stays exactly as it was. The only difference is that now we have the information to determine for any appointment, which clinicians (and how many) were present. It’s always a good idea to keep a bridge table as ‘thin’ (i.e. the least amount of columns) as possible, since we’ve traded columns for rows this means the number of rows in a bridge table will always contain more rows (or at the very least the same number of rows) than the source table it is built from.

two man holding white paper — Photo by Anamul Rezwan on Pexels.com

Ok, fine, but how do I build and maintain one?

Ok, go on then. I’ve written a little rough T-SQL as an example of one way to do this. It’s a fair bit of code, but there’s quite a lot of sample data and tables to create at the start, so don’t be too put off by that. The main chunk of the processing after that isn’t too tricky. I’ve put comments to explain everything in the code, but essentially what we’re doing is:-

Unpivoting the columns into rows and loading into a temp table
Joining the dim and fact tables onto this temp table to get dim and fact keys, then loading the result of that into another temp table
Merging the rows from our temp table into the final bridge table, whilst deleting any rows for appointments/clinicians which are not valid anymore. Note the ‘EXISTS’ condition on the ‘DELETE’ part where we limit the deletion to only rows in the bridge table where we’ve got a corresponding appointment in the source data. If we didn’t do this, then the delete would remove all other rows in the bridge table for appointments not in the incoming source data! 😉

-- We don't need row counts...
SET NOCOUNT ON;

-- Create an example source data table
DROP TABLE IF EXISTS [#SourceTable];
CREATE TABLE [#SourceTable] (
	[AppointmentID] INT NOT NULL
	,[ClinicianID1] INT NULL
	,[ClinicianID2] INT NULL
	,[ClinicianID3] INT	NULL
);  
GO  

-- Insert some test rows into our source data table
INSERT INTO [#SourceTable] VALUES (1,4,3,5);  
INSERT INTO [#SourceTable] VALUES (2,4,1,5);  
INSERT INTO [#SourceTable] VALUES (3,4,3,5);  
INSERT INTO [#SourceTable] VALUES (4,4,2,5);  
INSERT INTO [#SourceTable] VALUES (5,5,1,3);  
GO

-- Create a 'fake' fact table
DROP TABLE IF EXISTS [#FactAppointment];
CREATE TABLE [#FactAppointment] (
	[AppointmentKey] INT NOT NULL PRIMARY KEY
	,[AppointmentID] INT NOT NULL
	,[SomeOtherDimensionKey] INT NOT NULL DEFAULT(-1)
);
GO

INSERT INTO [#FactAppointment] VALUES (1,1,1);
INSERT INTO [#FactAppointment] VALUES (2,2,6); 
INSERT INTO [#FactAppointment] VALUES (3,3,3); 
INSERT INTO [#FactAppointment] VALUES (4,4,8); 
INSERT INTO [#FactAppointment] VALUES (5,5,7);
INSERT INTO [#FactAppointment] VALUES (6,6,2);
GO

-- Create a 'fake' clinician dimension table
DROP TABLE IF EXISTS [#DimClinician];
CREATE TABLE [#DimClinician] (
	[ClinicianKey] INT NOT NULL PRIMARY KEY
	,[ClinicianID] INT NOT NULL
	,[Name] VARCHAR(50) NOT NULL
);
GO

INSERT INTO [#DimClinician] VALUES (1,1,'Bob');  
INSERT INTO [#DimClinician] VALUES (2,2,'Sue');  
INSERT INTO [#DimClinician] VALUES (3,3,'Jim');  
INSERT INTO [#DimClinician] VALUES (4,4,'Jane');  
INSERT INTO [#DimClinician] VALUES (5,5,'Pete'); 
GO

-- Create a 'fake' bridge table
DROP TABLE IF EXISTS [#Bridge];
CREATE TABLE [#Bridge] (
	[AppointmentKey] INT NOT NULL
	,[ClinicianKey] INT NOT NULL	
);
GO

INSERT INTO [#Bridge] VALUES (1,1);  
INSERT INTO [#Bridge] VALUES (1,2);  
INSERT INTO [#Bridge] VALUES (2,1);  
INSERT INTO [#Bridge] VALUES (3,3);  
INSERT INTO [#Bridge] VALUES (3,5);
INSERT INTO [#Bridge] VALUES (3,4);
INSERT INTO [#Bridge] VALUES (4,5);
INSERT INTO [#Bridge] VALUES (4,1); 
INSERT INTO [#Bridge] VALUES (6,1);
GO

-- Now the real parts, first we'll unpivot the data and load 
-- into a temp table for ease of use later on
DROP TABLE IF EXISTS [#temp_unpivoted];
CREATE TABLE [#temp_unpivoted] (
	[AppointmentID] INT NOT NULL	
	,[ClinicianID] INT NOT NULL
	,[ClinicianNumber] INT NOT NULL
)

INSERT INTO
[#temp_unpivoted] (
	[AppointmentID]
	,[ClinicianID]
	,[ClinicianNumber]	
)

SELECT
-- We'll need the business key field so we can join 
-- onto the fact table later on...
[src].[AppointmentID]
	
-- These columns we'll need so we can join onto
-- the dimension table later on...
,[unpvt].[ClinicianID]
,[unpvt].[ClinicianNumber] -- This is optional, but might be useful in some cases

FROM
[#SourceTable] [src]

-- Unpivot the columns into rows, you could use UNPIVOT (see below) here instead of 
-- using 'CROSS APPLY'. However, in this example I've added an additional column using
-- the VALUES statement for the 'ClinicianNumber' which might be useful (or not) later
-- on. This also preserves NULL's as CROSS APPLY applies the join to all rows (which 
-- is why it keeps NULL values from columns). However, in a bridge table this isn't 
-- something needed or wanted. It's not a problem here though as the INNER JOIN to the 
-- dimension later on will eliminate these NULL's. If you're being picky or don't
-- need additional columns like my example, just use UNPIVOT instead (see below for
-- commented out example)
CROSS APPLY (
	VALUES
	(1, [src].[ClinicianID1])
	,(2, [src].[ClinicianID2])
	,(3, [src].[ClinicianID3])
	-- and so on for however many columns in the source table...
) [unpvt] (
	-- You might want to use 'ClinicianNumber' if you need/want to know the ordinal 
	-- number in relation to the source data table, for instance if it has business 
	-- relevance (e.g. 1 = primary clinician, 2 = others, etc...). This is why I often
	-- end up using CROSS APPLY instead of UNPIVOT for certain bridge tables ;-)
	[ClinicianNumber]
	,[ClinicianID]
)

-- Alternatively, this would be the equivalent using UNPIVOT. Note that we can't specify 
-- additional columns like we can with CROSS APPLY (which allows us to use the VALUES 
-- statement). We could still derive the 'ClinicianNumber' from the name of the original 
-- column saved in the 'ClinicianIDColumnName' field later on, but too much effort ;-)
/*
UNPIVOT (
	[ClinicianID] FOR [ClinicianIDColumnName] IN (
		[src].[ClinicianID1]
		,[src].[ClinicianID2]
		,[src].[ClinicianID3]
	)
) [unpvt] 
*/

WHERE
[unpvt].[ClinicianID] IS NOT NULL

-- Now we can join this unpivoted data onto the fact/dim tables to get the keys we need
-- and build a temp table ready for merging into the final 'real' bridge table
DROP TABLE IF EXISTS [#temp_bridge];
CREATE TABLE [#temp_bridge] (
	[AppointmentKey] INT NOT NULL
	,[ClinicianKey] INT NOT NULL
)

INSERT INTO
[#temp_bridge] (
	[AppointmentKey]
	,[ClinicianKey]
)

SELECT
[FactKey] = [fact].[AppointmentKey] -- Fact table primary key
,[DimensionKey] = [dim].[ClinicianKey] -- Dimension table primary key

FROM
[#temp_unpivoted] [unpvt_bridge]

-- Join our unpivoted data onto the fact table so 
-- we can get a fact table primary key
INNER JOIN 
[#FactAppointment] [fact]
ON
[unpvt_bridge].[AppointmentID] = [fact].[AppointmentID] -- Join on the fact business key

-- Join our unpivoted data onto the dimension table so 
-- we can get a dimension table primary key
INNER JOIN 
[#DimClinician] [dim]
ON 
[unpvt_bridge].[ClinicianID] = [dim].[ClinicianID] -- Join on the dimension business key

-- Finally, we merge the changes into the bridge table (note that we don't/can't do any
-- update part as we're joining on the 'key' fields of the table, so there are no fields
-- to update. We only insert new rows or delete rows no longer applicable to the range
-- of data we're now inserting
MERGE
[#Bridge] [trg]

USING
[#temp_bridge] [src]
ON
[trg].[AppointmentKey] = [src].[AppointmentKey]
AND [trg].[ClinicianKey] = [src].[ClinicianKey]

-- Insert any new rows
WHEN NOT MATCHED BY TARGET THEN INSERT (
	[AppointmentKey]
	,[ClinicianKey]
) VALUES (
	[src].[AppointmentKey]
	,[src].[ClinicianKey]
)

-- We only want to delete rows in the target bridge table where we've got appointments that are
-- in the source data, but where there's a different clinician. If we just deleted all rows here 
-- with no condition then we'd delete all the other rows in the entire bridge table for appointments
-- that are not in the incoming source data! ;-)
WHEN NOT MATCHED BY SOURCE AND EXISTS(	
	SELECT 
	1 
	
	FROM 
	[#temp_bridge] [src_data]

	WHERE
	[src_data].[AppointmentKey] = [trg].[AppointmentKey] -- Only for appointments in the target AND source!
)
THEN DELETE;

-- TEST ONLY - check what happened after the merge. We should have ALL rows from the source in there
-- where there was an 'AppointmentID' match, with any existing rows that matched but had a different
-- 'ClinicianID' now gone. Secondly, we need to check any 'AppointmentID' rows that were not in the 
-- source are still there. Which means for our example, there should still be a single row in the final
-- bridge table with 'AppointmentKey' = 6 and 'ClinicianKey' = 1, as that appointment wasn't in the 
-- incoming source data so we must leave it! ;-)
SELECT
[b].[AppointmentKey]
,[b].[ClinicianKey]

FROM
[#Bridge] [b]

ORDER BY
[b].[AppointmentKey]
,[b].[ClinicianKey]

-- Drop temp tables...
DROP TABLE IF EXISTS [#SourceTable];
DROP TABLE IF EXISTS [#FactAppointment];
DROP TABLE IF EXISTS [#DimClinician];
DROP TABLE IF EXISTS [#Bridge];

DROP TABLE IF EXISTS [#temp_unpivoted];
DROP TABLE IF EXISTS [#temp_bridge];

Ok, and that’s it. Hope this helps someone out there, give me a shout if you’ve got any questions or if you spot any mistakes I’ve made 😉

Data warehousing – bridge tables, what, why, eh?

Like this:

Related

Leave a ReplyCancel reply

Data warehousing – bridge tables, what, why, eh?

Share this:

Like this:

Related

Leave a ReplyCancel reply

Discover more from Aventius