Azure data factory

Ah the latest new fancy shiny thing! Last month Microsoft released Fabric to public preview and are currently promoting it hard with all sorts of articles, videos and posts. If you’re already using Azure Synapse though, or thinking about using it, what’s the story? What should you do, if anything? Well, that’s up to you. Here I’ll just give you my opinions on the topic, so don’t blame me either way!

First off I’m going to assume if you’re reading this then you’re aware of Fabric and Synapse, it’s hard to miss if you’re involved in the world of data and data warehousing. If you don’t know what Fabric is, why are you even reading this? 😉 There are tons of articles on the internet and YouTube which explain it all, so I’m not going to go over that here, have a search on the web and enjoy…

This article is targeted more at anyone currently using Azure Synapse or starting to use it, as there are a number of questions I’m sure you’ve got. These are just my questions, but hopefully I’m not alone! Some of them came straight to mind on first hearing of Fabric, others from playing around with the public preview for a while now.

I thought Microsoft were pushing Azure and all its data processing tools, so how come a large chunk of the data platform side of Azure is now being moved outside of it?

I’ve got to say, the release of Fabric and its approach kind of took me by surprise, and I’m sure I’m not the only one. However, when I say ‘by surprise’ I mean I was surprised that this wasn’t in Azure, but was instead an evolution of the PowerBI service. Instead of adding this as a new version of Synapse in Azure, they’ve taken parts of Azure Synapse and built them on top of the PowerBI service. While I remember, the PowerBI service is turning into Microsoft Fabric on the 1st of July, so mark that for your diary.

I can’t answer the reasons behind this decision as I don’t work for Microsoft, but I would imagine that they’re just trying to build on the success of PowerBI and appeal to those users who only use PowerBI and not Azure. I guess this way they’re hoping they can tempt people away from using their on-prem data warehouses and ETL/ELT processes. They’ve been kind of doing this with the data flows functionality which came before Fabric. I remember having a similar thought back then for those, confusing…

Not everyone might agree with this decision. I for one never understood why they kept PowerBI sitting outside Azure anyway, I always felt it should have and could have been brought in. The current SaaS approach that they’re taking with Fabric could have remained, I just think it could have been more integrated with Azure considering that lots of organisations have already invested heavily in Azure and its existing data platform technologies like Synapse and data factory.

Is this why Azure Synapse has been floating around in a sort of half built/finished state for such a while?

I’d be guessing that the simple answer to this question is yes… This one for me is a little frustrating, as I’m sure it is for other Azure Synapse developers. We’ve been asking for all these kinds of features like delta, source control etc… for ages and now quietly they’ve gone and more or less implemented all of them, but in a different place!?! What gives Microsoft? What makes this even more frustrating is that it’s not like Fabric is a totally new product or the next big shiny thing. Fabric is the next version of Synapse (gen3 if you will), Microsoft have even referred to it as such. Even using Fabric is very similar to Synapse, except there’s now less configuration required (as per the SaaS approach).

I’m sure the Microsoft developers were working on Fabric instead of Synapse features. It wouldn’t make sense to spend the time to implement the same features in two places, even with a shared code base there’d still be a lot of work. They’ve also said that Synapse will not be getting any new feature updates from this point onwards, so that’s something to keep in mind. Maybe some features already in the works or promised before Fabric was announced could still get released, but I don’t know – don’t quote me on that! They’ve also stated that Synapse is not being discontinued (yet), but I’d say the writing is on the wall and it will go the way of the dodo in the not too distant future. Remember all the talk of U-SQL back in the day anyone?

Hang on, so there’s now three slightly different versions of data factory? Azure data factory, Azure Synapse pipelines and Fabric data factory? Eh?

This is starting to get messy now, a lot of the reasons for using the Azure based data factory have been stripped away, at least in a certain sense. You must admit it does seem a bit odd having three slightly different versions of the same product all over the place. One of the differences, for example, is that datasets and linked services have gone from Fabric, replaced with ‘connections’. I actually think this is a good move as datasets did seem a bit pointless. However, the Fabric ‘connections’ which have replaced linked services are not parameterised (maybe they’ll implement that in future releases). So, if you’ve created a generic set of processing pipelines and parameterised linked services in Azure data factory to do your ETL, how does this translate to Fabric? Erm, it doesn’t really… For now at least you’d need to create a different pipeline for each individual data source or source system you’re importing from.
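
To make the ‘generic pipeline’ idea a bit more concrete, here’s a rough sketch in plain Python. It doesn’t use any data factory or Fabric API; the `Source` class, the `connection_string` and `copy_pipeline` functions and the server and table names are all made up for illustration. The point is only that one definition plus a list of parameters covers every source system, which is what you lose when connections can’t be parameterised.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """Hypothetical description of one source system."""
    server: str
    database: str
    table: str


def connection_string(source: Source) -> str:
    # Stand-in for a parameterised linked service: one template,
    # with the server and database supplied at run time.
    return f"Server={source.server};Database={source.database}"


def copy_pipeline(source: Source) -> str:
    # Stand-in for a single generic copy pipeline.
    return f"copy {source.table} via [{connection_string(source)}]"


# Made-up source systems, purely for illustration.
SOURCES = [
    Source("sales-sql", "SalesDb", "dbo.Orders"),
    Source("hr-sql", "HrDb", "dbo.Employees"),
]

# One pipeline definition, run once per source.
runs = [copy_pipeline(s) for s in SOURCES]
for run in runs:
    print(run)
```

Without parameters you’d be writing the equivalent of `copy_pipeline` once per entry in `SOURCES`, each with its connection details hard-coded.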

What’s the deal with Azure databricks, if all the data platform stuff is now in Fabric but that’s still left behind in Azure?

To be honest I don’t know, although I suspect its similarity in terms of functionality to Azure Synapse, which already provides Apache Spark clusters and notebooks to do all sorts of processing, is something to do with it. As with most of the other parts of Azure Synapse, Microsoft have kind of copied and pasted almost verbatim (with some little improvements like cluster startup times) the Synapse Spark experience into Fabric and called it Synapse Data Engineering. Bringing databricks in as well would be like doing the same thing all over again.

There are of course differences between the Azure Synapse Spark experience and databricks, such as some more focus on machine learning in some databricks scenarios, plus other bits and bats. Although you could argue that Microsoft are ‘countering’ this in Fabric with Synapse Data Science. I think the functionality overlap if they brought databricks in would be too confusing, so I personally can see why they left it in Azure. Although saying that, we’re now left in this strange situation where most of the data platform tools are outside Azure but some are still in it. I repeat Microsoft, why wasn’t Fabric part of Azure and PowerBI brought inside too? This would solve this and some of the other issues. I do think it’s fragmented the Microsoft data platform landscape a bit…

One thing for sure is that setting up and configuring databricks can be a royal pain in the ass… sorting out source control and mounting datalakes just for starters. In Fabric, on the other hand, there is virtually no setup or configuration to worry about. Unless you really need some of the extras in databricks, I’m starting to see a potential future where I don’t need to faff around with databricks anymore…

If I’ve just started working with Azure Synapse should I switch over to Fabric?

This is a big one. I’d say that you need to look into the differences between the two, how much work you’ve put into Synapse testing or prototyping already, plus potential cost analysis. These are all a little bit up in the air at the moment as Fabric has only recently been released and it’s still in public preview (so some bits don’t work at all and others sometimes work and sometimes don’t). It’s still in a fit state to do an initial PoC, although it is likely some features may change, and some will be added later. You could create something in your PoC that is then rendered pointless by a new feature or the removal of one! Just be aware!

With that in mind, let’s have a look at a quick comparison of some key features. We’ll have some descriptions, pros and cons, plus a rough score for each one out of 10. I won’t go into super detail, but these would be my key points for evaluating:

Feature	Azure Synapse	Score	Microsoft Fabric	Score
Setup and configuration	Spin up the service and configure firewall. Some people have had to enable a service provider before creating a workspace.	7	SaaS so nothing to set up or configure, it just exists. Some issues connecting to firewalled Azure resources.	8
Cost	Dedicated SQL pools have fixed cost depending on DWU level chosen. Serverless SQL queries charged per query (based on the amount of data processed). Spark charged for size of the pool and how long it’s running. Can be cheap if you only use for short periods. Dedicated SQL can pause and resume saving costs, but this is *not* automatic pause and resume like it is with Azure SQL serverless databases, it’s manual (pain).	9	Still early days but you purchase Fabric capacity, which is the same thing as PowerBI capacity if you already have that. Note that the Fabric capacities have a different SKU name but they do map to each other. Could do better, more clarity required.	8
Source control	Some, but quite poor support… any views, procedures and functions that are in the lake databases and serverless SQL databases are *not* included. In fact nothing in the serverless SQL databases is under source control. Dedicated SQL pool databases can be kept under source control using Visual Studio database projects.	4	All objects everywhere (eventually).	10
On-prem connectivity	Yes, installing the self hosted integration runtime service on a machine on-prem. Although you need a separate service – and thus a separate server or machine to install it on – for each Synapse workspace as they can’t share integration runtimes unlike normal Azure data factory.	7	Much simpler, just uses the on-prem gateway service installed on a server or machine on-prem.	10
Data warehousing	You can use the dedicated SQL pool (i.e. the evolution of the old parallel data warehouse) with a limited T-SQL instruction set. Depending on the DWU tier you can do some massive amounts of processing in a short time. You can do some ‘logical’ data warehousing using the serverless engine. However, this cannot update, delete or truncate tables (i.e. files in the data lake). There is limited write functionality to write tables/files to the data lake in both dedicated and serverless, but there isn’t much control. Also, the dedicated SQL pool has some issues writing parquet files with correct file extensions (we had to use the REST API to rename files after export!).	8	Parts of dedicated SQL pool and serverless have been combined into a new iteration of the old Synapse serverless SQL engine. So now we can create tables and update, insert, delete etc… using the underlying delta files. There is still a limited T-SQL instruction set (for now), similar to the dedicated SQL pool, but it’s a kind of best of both worlds. One big plus point is that you don’t need to write or export your data back to the data lake after processing as all the data is *already* in the data lake (OneLake).	9
Client tools	You can use good old SSMS if you’re working with dedicated SQL pool databases. It’s a different story with serverless SQL though as it is quite limited, no DML allowed. It’s similar for the lake databases too, these can *only* be built using the Synapse studio designer which is very limited (and is likely to stay that way now). As I mentioned in the source control section you *can* add your own procs, views and functions to serverless or lake databases using SSMS or Azure data studio but these objects are *not* kept under source control.	6	Currently quite similar to Azure Synapse. You can only use external client tools like SSMS or Azure data studio for DDL, so create or drop tables. You cannot insert, update, delete etc… for this you need to be in Fabric and use the GUI. All the SQL endpoints are just read only for now. The Fabric designers and editors are a bit more evolved than the Azure Synapse studio designer but there is still some work required to make it all like a proper IDE or editor. There is talk that Microsoft are working to enable external tools so developers can eventually use their tool of choice, but there’s nothing I could find that states what parts of Fabric this applies to (is it just some or all?).	7
Spark (called Synapse data engineering in Fabric)	The only real pain here is the startup time of the clusters, it’s like it was in data factory where it would take several minutes to spin up a cluster before you could do any work. They fixed this in data factory ages ago, but Synapse still has this issue.	8	It’s almost a copy and paste job for the Spark GUI in Fabric compared to Synapse. Some small improvements have been made, but the big plus here is cluster startup times, which have been reduced to mere seconds now.	9
Data factory	The data factory (or pipelines) element of Azure Synapse is virtually identical to the normal data factory. Although it seems to lag behind in terms of certain features. So no global parameters for instance, and the issue with the self hosted integration runtimes and sharing is a real pain. Also a lack of dark mode, which for a lot of developers nowadays is a must. I remember the days when I had three monitors all glaring at me and it felt like I was getting a sun tan 😉 So considering its similarity to normal data factory, the feature lag is frustrating.	7	This one is interesting as data factory has largely stayed the same except for a few crucial differences. Microsoft have completely binned off the old data factory Spark powered data flows and replaced them with the ‘PowerBI’ data flows and called them ‘gen2’ data flows. This could be a learning curve for developers who are used to the old data factory data flows. Linked services and datasets have gone, replaced with just ‘connections’. For me this is good as datasets were almost pointless, especially if you were doing generic linked services and pipelines. However, currently the connections functionality in Fabric lacks any kind of dynamic functionality or parameterisation, and if this doesn’t change it could be a bit of a blocker for some migrations. Two steps forward, one step back I guess.	8
PowerBI	Synapse was sold to us as a more integrated data platform that was bringing together several related parts of a modern data platform, which included PowerBI. In reality though it’s not much more than viewing PowerBI reports in an iframe and links to download PowerBI desktop. Kind of useful, but not full integration.	5	PowerBI has had some major integration into the elements of Fabric. Just creating a warehouse or lakehouse for example now automatically creates a PowerBI dataset that incorporates all the tables you add to the warehouse or lakehouse. The *real* bonus here though is the new direct lake mode, PowerBI can now query the OneLake (the data lake where all the delta tables are) directly so there isn’t a need to import or refresh your data model (or use direct query). This means time saved, no in memory storage required and very fast query performance. The nightmare of scheduling refreshes is almost at an end!	10
Totals	Azure Synapse	61	Microsoft Fabric	79
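
If you want to sanity check the totals, or re-run them with your own priorities, here’s a small Python sketch. The scores are copied straight from the table; the `weights` argument is my own addition for illustration (it isn’t part of the scoring above) and simply multiplies a feature’s score so that something like cost can count for more.

```python
# Scores copied from the comparison table: (feature, Synapse, Fabric)
SCORES = [
    ("Setup and configuration", 7, 8),
    ("Cost", 9, 8),
    ("Source control", 4, 10),
    ("On-prem connectivity", 7, 10),
    ("Data warehousing", 8, 9),
    ("Client tools", 6, 7),
    ("Spark", 8, 9),
    ("Data factory", 7, 8),
    ("PowerBI", 5, 10),
]


def totals(scores, weights=None):
    """Sum each column; features missing from `weights` get a weight of 1."""
    weights = weights or {}
    synapse = sum(s * weights.get(name, 1) for name, s, _ in scores)
    fabric = sum(f * weights.get(name, 1) for name, _, f in scores)
    return synapse, fabric


print(totals(SCORES))               # (61, 79), matching the Totals row
print(totals(SCORES, {"Cost": 3}))  # cost counted three times: (79, 95)
```

Swap in whatever weights reflect your own situation; the scores themselves are only my rough opinion anyway.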

I’m sure I’ve probably missed some stuff, let me know if you’ve got any thoughts on this. However, even taking into account that Fabric is in preview, it’s looking very tempting to try a PoC. I would definitely hold off developing and putting anything production ready in it though, as there could be changes before Fabric goes GA. My main concern with Fabric right now is potential cost, which I guess only a PoC would be able to determine.

I’ve got all this Azure Synapse infrastructure, should we migrate?

If you’ve got all your Azure Synapse processing set up and working now then I’d say there’s not much incentive to migrate, at least not right now. Microsoft have promised some Synapse to Fabric migration tools and there’s the dedicated SQL pool ‘link’ to Fabric warehouse feature which might be useful. You could take advantage of that if your organisation uses PowerBI so you don’t need to schedule data model refreshes. However, the rest of what Fabric has over Synapse, such as improved source control, won’t give you any visible performance improvements. It also remains to be seen whether a high scale DWU dedicated SQL pool can outperform the data lake (OneLake) or not. It could be very close, or not close at all.

Hope that helps, and if Microsoft are reading, next time speak to me before doing anything 😉

You’re using Azure Synapse, so where does Microsoft Fabric fit in?