Microsoft Fabric – NHS data processing… part 9, using ‘task flows’ and the medallion architecture

This post is a late addition to the NHS data processing series, sorry. In my defence, the task flow feature arrived in the middle of my series. I figured this might be a nice opportunity to show how the architecture (or data flow) I’ve used in this series kind of aligns with the popular medallion architecture pushed by the Databricks people. You could of course create tasks with completely different names, and I could have just used the names of the stages I’ve used throughout this series…

Anyway, task flows in Microsoft Fabric are a nice way to visually tie artifacts to how your data actually flows (and is processed) through Fabric. Note that a task flow doesn’t provide any technical facility, such as data transformation or orchestration (despite looking kind of like a data flow); it is purely visual and is there as a useful reference. Anyone can simply look at the diagram, click on a task and see the related artifacts below. I think it’s quite useful for, say, new developers coming in, so they can quickly get a sense of how your particular architecture works. Also, I think it might be useful for managers or execs so they can get their heads around what you’ve developed 😉

So, how does our example architecture relate to the medallion architecture? Is there actually any correlation at all? Well, as you can see below, it aligns pretty closely. This is because (as mentioned in earlier posts) the medallion architecture is really just some new words to describe a common general layering of data that’s been used for many years. If your business wants to call it ‘medallion’, fair enough; if not, use whatever terms mean something to people in your organisation. There’s always someone coming up with some new ‘architecture’ that gets popularised over the years, but the general workflows and layers tend to stay quite similar IMHO…

Come on, get on with it

Orchestration – ok, let’s start with this, since without this nothing else works… the ‘Orchestration’ task contains the control warehouse (with the data source metadata tables) and of course the data factory pipeline which actually runs the whole process and calls items in each of the different layers. The original medallion architecture template that Microsoft provide in Fabric task flows doesn’t come with this task, though. I’ve modified their template and removed/added some bits, with this being one of the additions I’ve made. Hopefully this illustrates how the orchestration controls, relates to, and indeed ‘calls’ in most cases, the other items within the architecture.
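To make the ‘metadata drives the pipeline’ idea a bit more concrete, here’s a minimal sketch in plain Python. It is not the actual data factory pipeline from this series: the metadata columns (`name`, `enabled`), the stage functions and the per-source loop are all assumptions made up for illustration.

```python
# Hypothetical stand-in for the data source metadata table in the control warehouse.
# The column names and values here are invented for illustration.
DATA_SOURCES = [
    {"name": "source_a", "enabled": True},
    {"name": "source_b", "enabled": False},
]


def get_data(source):
    # Stands in for the notebook that downloads files from the web.
    return "downloaded " + source["name"]


def initial_process(source):
    # Stands in for the 'merge landing into staging' notebook.
    return "merged " + source["name"]


def load_warehouse(source):
    # Placeholder for whatever populates the final data warehouse.
    return "loaded " + source["name"]


# The order of this list is the order in which the layers are called.
STAGES = [
    ("get data", get_data),
    ("initial process", initial_process),
    ("load warehouse", load_warehouse),
]


def run_pipeline(sources, stages):
    """Run every stage, in order, for each enabled source and return a run log."""
    log = []
    for source in sources:
        if not source["enabled"]:
            continue  # disabled sources never reach any layer
        for stage_name, stage in stages:
            log.append((source["name"], stage_name, stage(source)))
    return log


run_log = run_pipeline(DATA_SOURCES, STAGES)
```

The point is only that the orchestration owns the list of sources and the order of the stages, while each stage lives in its own layer.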

Get data – next we move on to the first real ‘task’ which is of course to get the data. This I’ve related to the notebook which we developed that downloads our files from the web. In your architecture though, this might well be a data factory pipeline or something else like an event stream maybe… In Microsoft’s original template there was also a ‘low volume’ task for getting data, but since in our example we’re only importing data from one source I’ve removed it. In a more complex architecture though, this might be something you’d keep and relate maybe real-time feeds and overnight feeds to the relevant ‘get data’ task.

Bronze data – after that we’ve got the bronze data store. In our example architecture, this is the landing lakehouse where we’re ‘landing’ raw data files in their original format and then creating some transient tables. This is where confusion sometimes arises with the medallion architecture, as some people will create a landing zone and place it before bronze, while others might land data in bronze and merge or append into tables there too. This is one of the reasons I’ve used the artifact names I have in our example: I personally prefer naming each artifact by its function, not by the quality of data in it. Hence, ‘landing’ is for landing raw data, ‘staging’ is for staging that data and preparing it for ingestion into the warehouse, and ‘data warehouse’ is the final data warehouse… The actual quality of the data might be the same in landing and staging in some cases, but could be different in others.

Initial process – this is where we take the data from bronze (i.e. landing) and get it into the silver data store, which in our example I’ve referred to as the ‘staging’ lakehouse. In our example, it merges landing data into history tables in ‘staging’. So obviously I’ve associated the ‘merge landing into staging’ notebook here to this task. However, if you had a warehouse instead of a lakehouse for this part, then you’d likely use stored procedures to read from the ‘bronze’ / ‘landing’ layer. So in that case this task wouldn’t apply and could be deleted…
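As a rough illustration of what ‘merging landing data into history tables’ means, here’s a minimal sketch in plain Python. The real notebook works on lakehouse tables and is in the repo; the function, the upsert-by-key behaviour and the column names (`period`, `provider`, `total`) below are assumptions made up for this example.

```python
def merge_into_history(history, landing, key_columns):
    """Upsert landing rows into history.

    Rows whose key already exists in history are replaced; all other
    landing rows are appended. The inputs are left unmodified.
    """
    merged = [dict(row) for row in history]
    # Map each key to the position of its row in the merged list.
    index = {
        tuple(row[c] for c in key_columns): i for i, row in enumerate(merged)
    }
    for row in landing:
        key = tuple(row[c] for c in key_columns)
        if key in index:
            merged[index[key]] = dict(row)  # matched: update in place
        else:
            index[key] = len(merged)
            merged.append(dict(row))  # not matched: insert
    return merged


# Hypothetical rows, same structure in landing and history.
history_rows = [
    {"period": "2024-01", "provider": "AAA", "total": 10},
    {"period": "2024-01", "provider": "BBB", "total": 20},
]
landing_rows = [
    {"period": "2024-01", "provider": "BBB", "total": 25},
    {"period": "2024-02", "provider": "AAA", "total": 12},
]

merged_rows = merge_into_history(history_rows, landing_rows, ["period", "provider"])
```

Note that the history rows keep exactly the same columns as the landing rows, which is the property that matters for this layer.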

Silver data – as mentioned previously, the ‘initial process’ task is associated with the notebook which takes data from the ‘landing’ / ‘bronze’ and merges it into full history tables in the ‘staging’ lakehouse… or as shown here, the ‘silver’ data store. Ideally you want to keep this history data in near identical structure to the original data so that you’ve essentially got a history table which matches the source data in structure and content (i.e. so you don’t have to keep going back to the source to make queries). This might mean it contains duplicates and/or other data quality issues, which is why I don’t really love calling this ‘silver’, but it kind of aligns the closest… Again, my naming refers to function rather than the data quality of the content 😉

Golden data – this is the place where all the final data is held in its most optimal format for further queries. For our example, this is the data warehouse and also a semantic model for use with Power BI that points to the referral to treatment data inside the data warehouse.

Data visualise – finally, we’ve got the visualisations. Well, in our example, just the one dashboard in Power BI 😎
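To recap the associations described above, here they are as a plain Python mapping. This is only a sketch: the artifact labels are descriptive placeholders rather than the actual item names in the workspace, and a task flow itself isn’t code, it’s just a diagram.

```python
# Task name -> artifacts associated with it (labels are descriptive placeholders).
TASK_FLOW = {
    "Orchestration": ["control warehouse", "data factory pipeline"],
    "Get data": ["download files notebook"],
    "Bronze data": ["landing lakehouse"],
    "Initial process": ["merge landing into staging notebook"],
    "Silver data": ["staging lakehouse"],
    "Golden data": ["data warehouse", "semantic model"],
    "Data visualise": ["Power BI dashboard"],
}


def artifacts_for(task):
    """Return the artifacts tied to a task, like clicking a task in the diagram."""
    return TASK_FLOW.get(task, [])


def task_for(artifact):
    """Return the task an artifact belongs to, or None if it isn't associated."""
    for task, artifacts in TASK_FLOW.items():
        if artifact in artifacts:
            return task
    return None
```

For example, `artifacts_for("Golden data")` gives the data warehouse and the semantic model, and `task_for("landing lakehouse")` gives ‘Bronze data’.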

And we’re done… 😎

This is just one way of doing it. We could of course ignore the medallion architecture terminology, like I’ve done with the names of the artifacts we’ve already created. However, I think the task flow feature in Fabric is a nice way to show some sort of visual representation of how data is flowing within your architecture, whatever that might be. As I mentioned at the start of this post, it’s probably quite useful for showing to managers or execs when they ask how everything works 😊

Here’s a link to the previous post, part 8, here. Don’t forget also that all the source code for the Fabric workspace and all these artifacts is available on my GitHub repo here too.

See you next time…
