
I figured I’d jump into the ring on this debate, even though there are plenty of comparisons out there already. This isn’t going to be a lengthy post, though, as I think most of those comparisons get too detailed. I just want to look at things at a high level here. So, if you want more detail, go and find one of those posts that delves into it. Otherwise, read on…
TLDR…
- Microsoft Fabric – If you like virtually zero configuration and administration, and you’re coming primarily from a traditional SQL and/or data warehousing background but want the option to try some data engineering oriented tools eventually – this might be for you. This is especially true if you’re already using PowerBI, which is now deeply integrated into Fabric – in fact, if you’ve got PowerBI you’ve already got Fabric. Did I mention you can also use SQL Server Management Studio to write queries and create tables and procedures on the warehouse side of Fabric? That last bit alone managed to win over some people – ah, the love for SSMS is still out there…
- Databricks – If you like (or don’t mind) more configuration/administration, love coding, want the most control, and/or have a strong interest or need for advanced machine learning (or data science), this is probably the choice for you. Basically, this is the full power of a mature and advanced data platform. Just be aware you’re probably not going to get the most out of it if your team is only at a junior data engineer level. Also, make sure you’ve got a good IT department and possibly an external visualisation tool…

Ok, all done, thanks and good night… What, was that a bit shorter than you expected? You want a bit more? Ok, go on then… Note that I’m going to leave costs out of this comparison, since cost will vary depending on your use case. I’d recommend a proof of concept with either of these technologies to get a more realistic idea of what you might need to pay to get the performance you want.
So, why not choose…
Just to be a little different, let’s start with a ‘why not to choose’ one over the other. So, why would I not choose Microsoft Fabric (over Databricks)?
- (lack of) Maturity – It’s no secret that Microsoft Fabric is a new product, despite being kind of an evolution of Azure Synapse. This means that not everything is present, at least not yet. There are (at the time of writing) some key features missing from Fabric that could be deal breakers, or at least pain points. For example, you currently cannot parameterise connections in Data Factory (unlike in the old Azure Data Factory linked services), so if you’ve built a table/metadata-driven ingestion process in Azure, you’re going to struggle to rebuild it in Fabric (you could use Dataflows Gen2, but that’s another story). You’d literally need a connection for each data source you’re pulling from (even on the same server!). If you’ve got 50 different data sources, this is going to get messy, especially in Data Factory. Hold the press though – I’m assured this feature is in the works; see the response to a comment here on a Microsoft blog regarding building a Fabric metadata-driven pipeline.
- Limitations – This is partly related to the product’s maturity, but there are also a number of feature/functionality limitations compared to Databricks. The data engineering side of Fabric is kind of a Databricks ‘lite’ environment, so if you’re heavy into data engineering as opposed to data warehousing, some of the missing functionality/features could be a dealbreaker for you. That said, for most developers out there I’d say the data engineering side of Fabric should cover many of their needs. Data scientists might have to check first, though. As for the Fabric warehouse, there are a few limitations at the time of writing: you cannot alter tables to add/remove/change columns, there are no temp tables, and you can’t create scalar functions. I’ve no doubt these features will come eventually, but for now…
- Cost clarity – Yes, I said I wasn’t going to talk about cost, but this is less about the actual cost and more about how the costs are represented. Unlike Databricks, where you’ve got a specific cost for spinning up a cluster of a particular level of power (plus some additional ‘little’ cloud costs), Fabric is priced using ‘capacity units’, which work kind of like the ‘DTU’ model of Azure SQL databases. What do I mean? Well, several aspects of the processing power and hardware are hidden behind a single cost, and it’s not super clear how that cost is divided up between them. For example, if you’re running some Spark clusters, warehouse queries and PowerBI reporting, which part is going to consume the largest part of your ‘Fabric capacity’? Some parts could be more equal than others…
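To make the metadata-driven ingestion point a bit more concrete, here’s a minimal Python sketch of the pattern (all names, the control-table rows and the connection-string format are made up for illustration): one parameterised connection definition serves every row of a control table, which is exactly what Fabric’s Data Factory connections can’t currently express.

```python
# Illustrative sketch of a metadata-driven ingestion pattern.
# All names below are hypothetical -- in Azure Data Factory you'd keep
# this metadata in a control table and parameterise one linked service;
# in Fabric (today) each server/database would need its own connection.

SOURCES = [  # the kind of control metadata a driven pipeline loops over
    {"server": "sql01", "database": "sales",   "table": "orders"},
    {"server": "sql01", "database": "sales",   "table": "customers"},
    {"server": "sql02", "database": "finance", "table": "ledger"},
]

def connection_string(server: str, database: str) -> str:
    """Build a connection string from metadata -- one template, many sources."""
    return f"Server={server};Database={database};Authentication=ActiveDirectoryMSI"

def plan_ingestion(sources):
    """Return (connection, table) pairs a generic copy activity could iterate."""
    return [(connection_string(s["server"], s["database"]), s["table"])
            for s in sources]

jobs = plan_ingestion(SOURCES)
# One parameterised template covers every row; without parameterisation
# you'd hand-create one connection per distinct server/database pair.
distinct_connections = {conn for conn, _ in jobs}
```

With 50 sources, the set of hand-built connections grows with every distinct server/database pair, which is the messiness described above.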
So, let’s now look at Databricks… why would I not choose that?
- Administration and configuration – Unlike Fabric, which can require almost zero configuration, there is significantly more configuration/administration needed to get things up and running, such as sorting out storage account connections. Try mounting a data lake, then compare that to Fabric, where you can just create shortcuts as long as you’ve got the correct permissions. One other small thing of note: when you spin up a cluster, it takes significantly longer to start than in Fabric. Most of the time you’ll get a cluster allocated and spun up in a few seconds in Fabric. Here, though, it can take minutes, just like it did/does in Azure Synapse (i.e. go and make a brew, then come back).
- Complexity – With great power comes great responsibility, and this is definitely true for Databricks. It’s essentially a full data platform (though maybe you could argue over its visualisations vs PowerBI). This means you’ve really got to know what you’re doing. So, if you’ve got a team of junior or lower-skilled developers, I’d steer clear of Databricks for this reason alone. That is, unless you’re willing to invest heavily in training, are confident your team can improve, and are willing to wait until your team levels up in skill.
- Potential cost – If you don’t already know, Databricks sits on top of Spark (just like data engineering in Fabric), which means you have to create and spin up clusters to run all your processing. If you’re using several of these clusters for data processing and also for querying and analysis, then running them is going to cost you. You can spin them down when not in use, which could save you a lot, but if that isn’t an option for your use case, keep this in mind.
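The cluster-cost point is really just cluster-hours times a rate. Here’s a back-of-the-envelope sketch in Python – the rates below are made-up placeholders, not real Databricks pricing – showing the shape of the calculation and how much spinning clusters down when idle can change it.

```python
# Back-of-the-envelope cluster cost sketch. The rates are hypothetical
# placeholders, NOT real Databricks pricing; the point is the shape of
# the sum: node-hours x (compute-unit cost + VM cost).

DBU_RATE = 0.40           # hypothetical $ per DBU
DBUS_PER_NODE_HOUR = 1.5  # hypothetical DBU consumption per node-hour
VM_RATE = 0.50            # hypothetical $ per node-hour for underlying VMs

def monthly_cost(nodes: int, hours_per_day: float, days: int = 30) -> float:
    """Estimate a month of cluster cost from node count and daily uptime."""
    node_hours = nodes * hours_per_day * days
    return node_hours * (DBUS_PER_NODE_HOUR * DBU_RATE + VM_RATE)

always_on = monthly_cost(nodes=4, hours_per_day=24)    # never spun down
office_hours = monthly_cost(nodes=4, hours_per_day=8)  # auto-terminated overnight
saving = always_on - office_hours
```

Even with toy numbers, leaving a cluster running 24/7 costs three times what an 8-hours-a-day schedule does – which is why auto-termination matters so much here.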

Anyway, I really don’t want to get any more detailed than that. Personally, I like the power of Databricks, but coming from a long data warehousing background, I’m quite drawn to Fabric, especially with how they’ve integrated PowerBI. Once Fabric matures, this will be a much closer race, especially if they keep improving its data engineering side…
Hopefully that’s given you some high-level insight into both products – there is definitely some heavy crossover between them (and I’m not getting into Snowflake, at least not yet). Let me know your thoughts, whether you think I’ve missed any key points, or if you disagree and why…
Till next time 🙂