
Ok, so we’ve got our history tables with the raw data in. Now we need to get any new or updated data out of them and into the dimensional model in our warehouse; this is where the real transforming happens. We also need to check whether any data has been flagged as deleted in the history tables, since we’ll need to make sure it’s flagged as deleted in the warehouse too…
The next logical question is: how do we model the data in our dimensional model? Well, that’s a subject we could spend a long time on, but for our case let’s just go over some brief outlines. First and foremost, the whole reason we’re doing all of this is to help the business answer questions about its past, current and future state. So, from this high-level perspective, we need to create something which can provide all these answers, and ideally it should do so simply, quickly and accurately, whilst also being scalable and flexible enough to cope with potential change. In terms of the ‘business questions’, you could break this down a little further to better understand what you’ll need to provide. For example, almost every solution would likely need to provide at least the following:-
- General MI (Management Information) reports e.g. number of sales, performance of staff etc…
- Data marts which are simple and easy to understand for users (i.e. most likely analysts) to query further for more detailed information
- Clear and concise dashboard visualisations to provide quick insightful overviews to high level managers or executives
- Some predictive analytics to answer questions about where the business is heading, or should be heading
The better the data model we design, the easier it should be for the business to quickly find answers to all these questions. And for us as developers, remember a good model should also be easier to scale and maintain, adapting to change without too much fuss.

However, before we go any further down this rabbit hole, let’s take a look at the actual raw data structure we’ve already got in the lakehouse…

Obviously we’ve got several fields covering the period, the organisations and some other attributes. However, pay careful attention to all the ‘Gt**To**WeeksSUM1’ fields. These go on for quite a while, all the way from ‘Gt00To01WeeksSUM1’ right up to ‘Gt103To104WeeksSUM1’ (see screenshot below).
This means on each row we’ve got over 100 fields, each representing the number of patients waiting for the specified number of weeks! We’ve also got several ‘total’ fields (see screenshot below) at the bottom of the column list. That’s quite a lot of columns for a single row! 😮

So what do we do? Well, our final structure needs to be optimised for fast queries, easy to understand and modify, and flexible enough for potential change. We could keep all these 100+ measures as fields in our final data model, but that would make it quite unwieldy. Do we keep the total fields, or do all the ‘Gt’ fields add up to the same total values anyway? And why are there two ‘total’ fields? Then we’ve got the ‘unknown clock start date’ measure too. This is all a bit head-scratching, and actually finding the definitions for these fields wasn’t easy…
Basically, the ‘Gt’ fields add up to the ‘Total’ field, while the ‘unknown clock start’ field is kept separate. ‘TotalAll’ is then everything combined, including the unknown clock starts. So what questions should we keep in mind when deciding on our model? Looking back to the earlier posts we came up with some questions, but let’s add one more to the top of the list as it’ll be pertinent to our modelling:-
- Breakdown of number of patients waiting by weeks waiting
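As a quick aside, the relationship between the ‘Gt’ fields and the two totals described above can be sanity-checked in a few lines. This is a minimal sketch using one made-up row in plain Python (the ‘unknown clock start’ field name here is an assumption, and the row is trimmed to just three ‘Gt’ fields for brevity):

```python
# One hypothetical row from the raw table, trimmed to 3 'Gt' fields.
row = {
    "Gt00To01WeeksSUM1": 3,
    "Gt01To02WeeksSUM1": 1,
    "Gt02To03WeeksSUM1": 8,
    "UnknownClockStartSUM1": 2,  # hypothetical field name
    "Total": 12,
    "TotalAll": 14,
}

# The 'Gt' fields should add up to 'Total'...
gt_sum = sum(v for k, v in row.items() if k.startswith("Gt"))
assert gt_sum == row["Total"]

# ...and 'TotalAll' should be 'Total' plus the unknown clock starts.
assert row["Total"] + row["UnknownClockStartSUM1"] == row["TotalAll"]
print("totals reconcile")
```

A check like this is worth running against the real history tables before building the model, just to confirm the field definitions behave as documented.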
So, given the data in its current structure, can we answer our ‘new’ question easily? Not exactly… why? Well, whilst we’ve got a column for each number of weeks waiting, these are columns, not rows, so we’d need to include every one of them in a query. And what if we need or want to aggregate at a different level? This is where it gets tricky…

Let’s imagine a different scenario: if we unpivot the columns into rows, we’d have a table like this:-
| Weeks Waiting Banding | Number Of Patients |
|---|---|
| Greater than 0 to 1 weeks | 3 |
| Greater than 1 to 2 weeks | 1 |
| Greater than 2 to 3 weeks | 8 |
We’ve now reduced the number of columns from 100+ to just 2. We could now simply sum the measure field and group by the banding field: a dead simple query. Also, remember, for a fast and efficient fact table we want it to be as ‘thin’ (i.e. with as few columns as needed) as possible. If we left all 100+ columns as measure fields in a fact table, we’d have fewer rows than this pivoted alternative, but a much ‘wider’ table, and summing up all those columns would be much slower.
The other reason we might want this alternative structure is aggregating at different levels. For example, if we wanted to know how many patients have been waiting between 0 and 3 weeks, we can simply sum the ‘Number Of Patients’ measure for these 3 rows and get a total of 12 patients. Imagine if we wanted to know how many patients have been waiting between 2 and 80 weeks: again, a single ‘sum’… however, with the 100+ column table, we’d have to write a query listing almost 80 individual columns! I think this should raise a red flag if nothing else.
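To make the unpivot idea concrete, here’s a minimal sketch in plain Python (in Fabric you’d more likely do this in T-SQL or PySpark; the banding label format is an illustrative assumption, and only three of the 100+ ‘Gt’ columns are shown):

```python
# One 'wide' row from the raw table, trimmed to 3 of the 100+ 'Gt' columns.
wide_row = {"Gt00To01WeeksSUM1": 3, "Gt01To02WeeksSUM1": 1, "Gt02To03WeeksSUM1": 8}

def banding_label(field: str) -> str:
    # 'Gt00To01WeeksSUM1' -> 'Greater than 0 to 1 weeks'
    # (only handles the 2-digit week fields, for illustration)
    lower, upper = int(field[2:4]), int(field[6:8])
    return f"Greater than {lower} to {upper} weeks"

# Unpivot: one row per banding instead of one column per banding.
long_rows = [
    {"WeeksWaitingBanding": banding_label(k), "NumberOfPatients": v}
    for k, v in wide_row.items()
]

# Aggregating over a range of weeks is now just a sum over rows.
total_0_to_3 = sum(r["NumberOfPatients"] for r in long_rows)
print(total_0_to_3)  # 12
```

The same sum would work for any range of weeks just by filtering which rows are included, which is exactly the flexibility the wide table lacks.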
But how would you do that using this alternate table structure, I hear you say? Well, this is where we get into the guts of data modelling. We can create a dimension table which we can then use to ‘break down’ this measure at whatever level of aggregation the user needs. Let’s expand the table from above a little bit…
| Weeks Waiting Banding Of 1 Week | Weeks Waiting Banding Of 3 Weeks | Number Of Patients |
|---|---|---|
| Greater than 0 to 1 weeks | Greater than 0 up to 3 weeks | 3 |
| Greater than 1 to 2 weeks | Greater than 0 up to 3 weeks | 1 |
| Greater than 2 to 3 weeks | Greater than 0 up to 3 weeks | 8 |
Now if we want a 3-week banding (or however many weeks) instead of the fixed 1-week fields in the original raw table, we can simply group our ‘sum’ by this extra column instead of needing to specify the 3 rows using the 1-week banding column. You can take this as far as you like or need, depending on how many bandings you require.
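These coarser bandings don’t need to be typed out by hand either: they can be derived from the lower bound of each 1-week band. A rough sketch (the label format is just an assumption to match the tables above):

```python
def three_week_band(lower_week: int) -> str:
    # Map a 1-week band's lower bound (0, 1, 2, ...) to its 3-week band.
    start = (lower_week // 3) * 3
    return f"Greater than {start} up to {start + 3} weeks"

# The first three 1-week bands all roll up to the same 3-week band.
print(three_week_band(0))  # Greater than 0 up to 3 weeks
print(three_week_band(2))  # Greater than 0 up to 3 weeks
print(three_week_band(4))  # Greater than 3 up to 6 weeks
```

The same pattern works for any band width, so adding, say, a 4-week or 12-week banding later is a one-line change.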
But isn’t this just creating more columns like before? Surely we’re just going to end up with hundreds of columns this way too? I don’t see the point…
You’d be correct… however, this is where the magic happens, as we now move these banding columns out to a separate ‘dimension’ table and replace them in the ‘fact’ table with a single ‘dimension key’ field which relates (i.e. a foreign key) to the dimension table holding all the bandings.
Our fact table should now look a little bit like this after the change…
| Weeks Waiting Banding Dimension Key | Number Of Patients |
|---|---|
| 1 | 3 |
| 2 | 1 |
| 3 | 8 |
Our dimension table holding the bandings would look something like this. Separating the bandings into another table makes them much easier to maintain, as we can now add additional rows and columns to the dimension table without changing the structure of the fact table at all.
| Weeks Waiting Banding Dimension Key | Weeks Waiting Banding Of 1 Week | Weeks Waiting Banding Of 3 Weeks | Under Or Over 18 Weeks |
|---|---|---|---|
| 1 | Greater than 0 to 1 weeks | Greater than 0 up to 3 weeks | Under 18 weeks |
| 2 | Greater than 1 to 2 weeks | Greater than 0 up to 3 weeks | Under 18 weeks |
| 3 | Greater than 2 to 3 weeks | Greater than 0 up to 3 weeks | Under 18 weeks |
But there’s still going to be 100+ columns in the dimension table, aren’t we just robbing Peter to pay Paul instead?
Potentially yes, there could be 100+ columns in the dimension table, but it depends on how many bandings you want: if you only need 1-week bandings then you only need the one column. It’s the rows that now contain the individual numbers of weeks instead. So the dimension table will only ever be around 100 or so rows (which is tiny), so even if it had 500 columns it would still be very fast to query…
This leads us onto the next point: in a query we’d first hit the dimension table for the required banding column and the corresponding dimension keys. Remember, the dimension table will only have around 100 rows, one per number of weeks (i.e. one per original ‘Gt’ field), so it’s a very small table and very fast to query. Once we’ve got our dimension keys, we can then search just a single column in the fact table for these key values, and finally sum up the measure for only those rows. And boom, we’ve got our answer! So even if the fact table had 100 million rows, filtering on a low-cardinality column (i.e. one with a small number of distinct values) like our dimension key will be super fast. Let’s also remember that Fabric uses delta tables under the hood, which are highly compressed files in parquet format; this means even less data to scan, which further increases query speed. In fact, with compression, even if our fact table ends up with hundreds of millions of rows, the actual size on disk (in OneLake) will be tiny…
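That two-step query pattern can be sketched in a few lines of Python using the tiny example tables from above (in the warehouse this would really be a join or an `IN` filter in T-SQL; the column names are illustrative):

```python
# A small dimension table: one row per original 'Gt' field.
dim_weeks_waiting = [
    {"Key": 1, "Band1Week": "Greater than 0 to 1 weeks", "Band3Weeks": "Greater than 0 up to 3 weeks"},
    {"Key": 2, "Band1Week": "Greater than 1 to 2 weeks", "Band3Weeks": "Greater than 0 up to 3 weeks"},
    {"Key": 3, "Band1Week": "Greater than 2 to 3 weeks", "Band3Weeks": "Greater than 0 up to 3 weeks"},
]

# A thin fact table: just the dimension key and the measure.
fact = [
    {"Key": 1, "NumberOfPatients": 3},
    {"Key": 2, "NumberOfPatients": 1},
    {"Key": 3, "NumberOfPatients": 8},
]

# Step 1: filter the small dimension table for the banding we want.
keys = {d["Key"] for d in dim_weeks_waiting
        if d["Band3Weeks"] == "Greater than 0 up to 3 weeks"}

# Step 2: sum the measure in the fact table for just those keys.
total = sum(f["NumberOfPatients"] for f in fact if f["Key"] in keys)
print(total)  # 12
```

Note that the expensive part (scanning the fact table) only ever touches one key column and one measure column, which is what keeps it fast at scale.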

Ok, so that seems to be the optimal way to handle our weeks waiting: it’ll provide speed, simplicity and scalability. What about the rest of the fields? Well, we follow a similar approach. Most of the other fields are business keys (i.e. fields which together form a unique identifier for a particular row). For any fields where we’ve got a code/ID and a description (like the organisation fields), we can move those to their own dimensions too. That just leaves the period, which we can replace with a calendar dimension: a table which, for each date, specifies the year, month, day name, quarter name etc… so users can break down the data by aggregating at any periodic level they need.
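As a rough idea of what a calendar dimension might contain, here’s a minimal sketch (the attribute names are assumptions; a real date dimension would usually carry many more columns, e.g. fiscal periods and holiday flags):

```python
from datetime import date, timedelta

def build_calendar_dim(start: date, end: date) -> list[dict]:
    """One row per date between start and end (inclusive)."""
    rows = []
    current = start
    while current <= end:
        rows.append({
            "DateKey": int(current.strftime("%Y%m%d")),  # e.g. 20240101
            "Year": current.year,
            "MonthName": current.strftime("%B"),
            "DayName": current.strftime("%A"),
            "QuarterName": f"Q{(current.month - 1) // 3 + 1}",
        })
        current += timedelta(days=1)
    return rows

calendar = build_calendar_dim(date(2024, 1, 1), date(2024, 1, 3))
print(len(calendar))  # 3
```

The fact table then just stores the `DateKey`, and users can group by year, quarter, month or day name without any date arithmetic in their queries.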
With this in mind, we’ll look to create the tables shown below in the warehouse. This will help break the data down so we’ve got nice clean dimensions which are easily expandable, plus a thin fact table which will be efficient and fast to query.

In the next post we’ll look at creating these tables and coding some stored procedures to populate them… Hope that wasn’t too bad, let me know if you’ve got any questions.
See you next time 🫡