
After doing a number of POCs I thought I’d put together a series of posts detailing how you might build a metadata (table) driven ingestion process using Microsoft Fabric. Why another one? Well, I’m sure you’ve seen quite a few example walkthroughs out there, but nothing seemed ‘real world’ enough for me. So, in this series we’ll go over the steps to create a full metadata (table) driven, end-to-end (E2E) process which will be able to load, transform and finally visualise some data. We’ll use Data Factory for orchestration, notebooks and a bit of PySpark for downloading the files, and then try out the warehouse capabilities for our final ‘data warehouse’.
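Just to make the ‘metadata (table) driven’ idea concrete before we dive in, here’s a rough Python sketch of what a control table might look like. In Fabric this would live as an actual lakehouse or warehouse table read by the orchestration pipeline; all the column names, URLs and values below are hypothetical placeholders, not the final design:

```python
# Sketch of a metadata "control" table for the ingestion process.
# A list of dicts stands in for the real table; every name and value
# here is illustrative only (the URLs are placeholders).
ingestion_metadata = [
    {
        "dataset_name": "rtt_incomplete_pathways",
        "source_url": "https://example.org/rtt-incomplete.csv",  # placeholder
        "file_format": "csv",
        "target_table": "bronze.rtt_incomplete_history",
        "enabled": True,
    },
    {
        "dataset_name": "rtt_completed_admitted",
        "source_url": "https://example.org/rtt-admitted.csv",  # placeholder
        "file_format": "csv",
        "target_table": "bronze.rtt_admitted_history",
        "enabled": False,  # switched off with a row edit, not a code change
    },
]

def datasets_to_load(metadata):
    """Return only the datasets flagged as enabled."""
    return [row for row in metadata if row["enabled"]]

for row in datasets_to_load(ingestion_metadata):
    print(f"would ingest {row['dataset_name']} from {row['source_url']}")
```

The point being that adding, disabling or re-pointing a dataset becomes an edit to a metadata row rather than a change to the pipeline itself.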

We’ll also see what visuals we can produce using Power BI, a little bit like the dashboard below, which we’ll build off a custom-made semantic model connected to our warehouse using the ‘Direct Lake’ connection method (i.e. fast, but no refresh required, yey!).

We’ll use some publicly available NHS data (the UK’s National Health Service) for this example. Anyone can download this data themselves from the NHS England website here if you want to play around with it yourself. There are many different datasets available for download on the site, but for now we’ll just take a look at the ‘referral to treatment’ (RTT) dataset. This tracks how many weeks patients have been waiting for treatment since they were referred. The primary target is that a patient should be referred and then receive treatment within 18 weeks (or maybe 2 weeks I think for cancer?). The data is listed by hospital provider and also by commissioner.

Provider and commissioner, what? who? eh?

Just for those unfamiliar with how the NHS works internally, it’s a little convoluted, but for our purposes we only need to know that the hospital ‘providers’ in the UK are commissioned (and paid) by ‘commissioning’ organisations to do the actual work. These commissioning organisations used to be called “CCGs” (Clinical Commissioning Groups), but as per usual every several years the NHS goes through name changes, and now it’s all been restructured and the organisations are called “ICBs” (Integrated Care Boards). They’re both NHS organisations. It’s strange, I know, don’t ask me why, but the NHS commissions itself (and sometimes private organisations) to do work. Anyway, enough of that, on to the fun stuff…


They’ve changed the bloody column names again!

Another reason I chose NHS data for this particular example (apart from my personal experience working within the NHS and with data like this) is that it’s often quite challenging to work with. By that I don’t mean it’s difficult to understand or import, but rather that the structures can be quite inconsistent and can change quite often over time. I remember trying to build processes to ingest this kind of data back in the day when I worked in some of my permanent roles in the NHS; it felt like almost every month either the column names in the files we received were changing, or they’d add/remove a column, or even change its format. Why this happened I do not know, but it was a right pain in the a** to deal with…

It gets even better though… apart from the schema difficulties mentioned above, another challenge we may have to deal with is that many of the data files from the NHS also contain multiple levels of aggregation, in the same file! So, for example, you might have several rows for individual organisations, then a row for the total of all those organisations, but then also yet another total row for all the totals! Oh yes, I kid you not…
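To illustrate the problem, here’s a minimal Python sketch that separates detail rows from a pre-aggregated total row living in the same file. It assumes totals can be spotted from the organisation code; the real RTT files mark them differently, so treat the rule (and the numbers) as purely illustrative:

```python
# Sketch: split detail rows from pre-aggregated "total" rows in one file.
# Assumes totals are flagged by the org code -- in the real NHS files the
# marker differs, so this rule is a placeholder, not the actual format.
rows = [
    {"org_code": "RX1", "org_name": "Some Trust", "waiting": 120},
    {"org_code": "RY2", "org_name": "Another Trust", "waiting": 80},
    {"org_code": "TOTAL", "org_name": "All providers", "waiting": 200},
]

def is_total_row(row):
    """Hypothetical rule for spotting a pre-aggregated row."""
    return row["org_code"].upper() == "TOTAL"

detail_rows = [r for r in rows if not is_total_row(r)]
total_rows = [r for r in rows if is_total_row(r)]

# Sanity check: the detail rows should add up to the file's own total,
# otherwise we've misclassified something (or the file is broken).
assert sum(r["waiting"] for r in detail_rows) == total_rows[0]["waiting"]
```

If you don’t separate these levels out, any sums you do downstream will quietly double-count.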

Well, it wouldn’t be any fun just creating a simple ingestion process here for nice clean flat CSV files would it? I did say I wanted to build something a bit more ‘real world’ didn’t I?


For this little project, we won’t be exactly following the popular medallion architecture design pattern that’s floating around at the moment, but we will be using some relatively similar steps. Just for a quick overview, the medallion part refers to 3 stages or layers (listed below), at least at a high level that is:-

  1. Bronze – holds the raw data
  2. Silver – cleaned, deduplicated and enriched data
  3. Gold – likely dimensions and fact tables that are optimised for reporting, maybe even some aggregated or snapshot tables

Now there are various opinions on what exactly should happen within each stage, and some don’t even like the idea of this bronze/silver/gold pattern. Watch enough videos about this on YouTube and you’ll see what I mean. What I would say though is that this architecture is actually pretty similar to how most ingestion (ETL or ELT) has already been done many times in the past in the world of data warehousing. To be honest, it kind of feels like just a new set of buzzwords to put over it; every other year or so there’s something like this. Although, coming from the people over at Databricks has given it some weight, so now numerous people in the industry are adopting the parlance and the methodology. Essentially though, whatever you want to call it, it works out a little bit like the steps I’ve listed below.


Note, as I said earlier, this process I’m suggesting is a little different to the medallion architecture, but I’ve highlighted where I think you could argue there might be some crossover:-

  1. Download files (or import source data) to the lake – Sometimes referred to as a landing zone, you could place this before your ‘bronze’ layer or make it part of it. I’ve seen people suggest both ways; I’d personally suggest it sits just before the ‘bronze’ layer. If you’re extracting data from a source database as opposed to downloading files, you’d still want to drop that data in the lake as files (I’d suggest using parquet). For file sources though, you’d probably want to keep the existing format instead of translating say a CSV into parquet in transit (for auditing purposes alone). Why use parquet? Well, it’s due to the numerous advantages it provides over something like CSV, such as compression, query speed and my personal favourite… not having to worry about bl**dy delimiters, quotes and escape characters!
  2. Check for schema changes – As I mentioned before, NHS data is rife with this kind of thing. So here we need to either check for schema changes and throw an error, or potentially map the new schema to the existing schema (if possible). If you’re sticking to the medallion architecture, then I’d say this should be part of the bronze layer. It’s not something which should concern the ‘landing’ part!
  3. Merge the file data into archive or history tables – This is to keep a historical archive of all the raw data in our lake. The tables here should have pretty much the same schema and content as the original files (or close enough). I’d personally say this should still be part of the bronze layer. You need this as there may be numerous occasions where you need to reload or refer to historical and unchanged data, such as a reload due to a bug, or if analysts need to check some of the raw data.
  4. Transform the incoming data into clearly defined and well optimised structures – This is where designing a good data model beforehand comes in, as we’ll want to fit our incoming data into our data model appropriately so we can easily and efficiently answer the questions the business is asking. This is especially true if you’ve also got data from several different sources flowing into the same model; you’ll need to make sure the model can accommodate this. Often the tables here will be dimension, fact and maybe bridge tables (if you need to model many-to-many relationships). Again, as per good data modelling, make sure your facts and dimensions are modelled well and not necessarily just carbon copies of the source data structure, unless that structure is appropriate. This layer would be the main data warehouse in traditional terms. As for the medallion architecture, I’d say this would be part silver and part gold. You could have a ‘silver’ step before this to cleanse and/or deduplicate data if need be, which would (or could) be classified as part of the ‘silver’ layer. I’d argue it’s a step you only need if it’s required for performance or business purposes.
  5. Create additional specialised or aggregated tables for business purposes – Once you’ve built your core fact and dimension tables, sometimes you may have to create additional tables such as snapshots or tables at specific aggregation levels for end user consumption. In classic terminology this would still be your data warehouse, or maybe a cube. In terms of the medallion architecture though, I guess it would be similar to the ‘gold’ layer. End users potentially will be able to query this layer, or part of it, perhaps specific data marts for various business areas.
  6. Visualisations – Here we’ll try and create clear and concise visualisations that show meaningful data to business users. This would mainly visualise data in the ‘gold’ layer in terms of the medallion architecture. Or, in the old days, the visualisations would have looked at one or more cubes, as users would be unlikely to have had their reporting tools querying the data warehouse directly; for query purposes you’d likely have created some specific data marts for different business domain needs. The general rule of thumb is that dashboards should make it simple and quick to see what they’re trying to tell you; if they don’t, they’re not good dashboards. More detailed reports and visualisations can drill into more detail. Visualisations are meant to make it easier to spot patterns or trends, instead of looking at raw tables of data, which people can’t quickly scan for patterns. The old “a picture paints a thousand words” type of thinking…
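As a taster for step 2, here’s a minimal Python sketch of a schema check that maps known column renames back to the expected names and fails loudly on anything else. The expected columns and the rename map are illustrative, not the real RTT schema:

```python
# Sketch of a schema check (step 2): compare incoming columns against the
# expected schema recorded in metadata, mapping known renames and raising
# on genuine drift. All column names here are illustrative.
expected_columns = ["org_code", "treatment_function", "weeks_waited", "patients"]

# Known historical renames, mapped back to the expected names (hypothetical).
column_renames = {"provider_code": "org_code", "wks_waiting": "weeks_waited"}

def check_schema(incoming_columns, expected, renames):
    """Return normalised column names, or raise if the schema has drifted."""
    normalised = [renames.get(c, c) for c in incoming_columns]
    unexpected = [c for c in normalised if c not in expected]
    missing = [c for c in expected if c not in normalised]
    if unexpected or missing:
        raise ValueError(f"schema drift: unexpected={unexpected}, missing={missing}")
    return normalised

# A file where two columns were renamed, but nothing was added or removed,
# maps cleanly back to the expected schema:
cols = check_schema(
    ["provider_code", "treatment_function", "wks_waiting", "patients"],
    expected_columns, column_renames)
```

In the metadata-driven version, both the expected schema and the rename map would live in metadata tables rather than in code.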
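And for step 3, a tiny sketch of the merge-into-history idea, keyed on period and organisation so that reloading a month replaces that month’s rows rather than duplicating them. In Fabric this would be a Delta table merge, not a Python dict; the keys and columns are purely illustrative:

```python
# Sketch of step 3: merge an incoming file into a history table, keyed on
# (period, org_code). Reloading a period overwrites its rows rather than
# appending duplicates. The dict stands in for a real lakehouse table.
history = {
    ("2024-01", "RX1"): {"patients": 100, "loaded_from": "jan_v1.csv"},
}

incoming = [
    {"period": "2024-01", "org_code": "RX1", "patients": 105, "loaded_from": "jan_v2.csv"},
    {"period": "2024-02", "org_code": "RX1", "patients": 90, "loaded_from": "feb_v1.csv"},
]

def merge_into_history(history, incoming):
    """Upsert incoming rows into history, keyed on (period, org_code)."""
    for row in incoming:
        key = (row["period"], row["org_code"])
        history[key] = {k: v for k, v in row.items()
                        if k not in ("period", "org_code")}
    return history

merge_into_history(history, incoming)
```

Keeping the source file name against each row (as above) makes those “reload due to a bug” investigations a lot less painful.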

Lastly for this part 1 post, we need to just keep in mind that all the architecture and other stuff discussed above is there to help provide answers to the questions the business has. I suppose I should have really put this part at the top of this post eh? In addition, ideally we should be able to use the data to also find things of ‘added value’ such as being able to potentially answer or even pose questions the business doesn’t even know about.

Starting simple, I’ve looked through some of the documentation surrounding this dataset for potential questions we can ask of the data. Here’s what I’ve got, so we’ll try and see if we can build something to answer these questions:-

  • How many patients are still waiting at the end of each month, quarter, year (maybe by fiscal periods as well)
  • How many patients are still waiting at the end of each month who have been waiting over the 18 weeks target (number total and percentage?)
  • How many patients at the end of each month whose waiting has finished (completed) and came in under the 18 weeks waiting target
  • How many patients are still waiting at the end of each month who have been waiting over 52 weeks, 104 weeks etc…
  • How many new referral to treatment (RTT) pathways were started at the end of each period
  • How many patients whose wait has completed who were admitted to hospital, plus (or vs) how many were not admitted
  • The average waiting time (for completed and non-completed waits)
  • The 92nd percentile waiting time (for completed and non-completed waits)
  • The waiting time vs same period last month, quarter, year (maybe by fiscal periods as well)
  • The trajectory of waiting times (is it heading up or down)
  • Benchmarking one organisation against another
  • Maybe also some predictions…
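To show that a couple of these measures are straightforward once the model is in place, here’s a small Python sketch computing the percentage of waits over the 18-week target and a simple percentile from banded (weeks waited, patients) counts. The numbers are made up, and the percentile is a basic cumulative-count version rather than whatever exact method NHS England uses:

```python
# Sketch of two measures from the list above, using banded wait counts:
# (weeks_waited, patients) pairs. All figures are made up.
waits = [(4, 50), (10, 30), (17, 10), (20, 8), (30, 2)]

total = sum(p for _, p in waits)
over_18 = sum(p for w, p in waits if w > 18)
pct_over_18 = 100 * over_18 / total  # % of waits breaching the 18-week target

def percentile_weeks(waits, pct):
    """Smallest week band at which the cumulative count reaches pct% of patients."""
    target = pct / 100 * sum(p for _, p in waits)
    cumulative = 0
    for weeks, patients in sorted(waits):
        cumulative += patients
        if cumulative >= target:
            return weeks
    return None

p92 = percentile_weeks(waits, 92)  # the 92nd percentile wait, in weeks
```

The trickier questions (trajectories, benchmarking, predictions) will need the well-modelled fact and dimension tables from the steps above to stand on.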

So, that’s it for part 1, we’ll start looking at the actual implementation of this in the next post and what parts of Fabric we’ll use to do what. Till next time…
