List files in a data lake using Azure Synapse serverless SQL (no REST API required)


Ah the magic of the filepath function…

Whilst it's possible to get a list of files from the data lake using the ADLS gen2 REST API, if you're running queries in Synapse serverless SQL you can stay where you are and just use some T-SQL to get a similar result.

Say in your data lake container you’ve got a folder structure something like ‘SomeRootFolder/Organisation/SourceSystem/TableName‘ and under each ‘TableName‘ folder you’ve got some parquet (or CSV maybe) files. We can use the filepath function in conjunction with wildcards to easily get the names of each folder in the path without having to resort to splitting the path and picking out individual parts using string functions.

Note that the parameterised form of this function only works with wildcards (the * characters): filepath(n) returns whatever matched the nth wildcard, while calling filepath() without a parameter returns the full path of the file. The wildcard positions in the path correspond to the filepath parameter, so the first wildcard in the example code below (after 'SomeRootFolder') corresponds to filepath(1), and so on for the remaining wildcard positions. This means you can hard code the base or root folder path up to the point where you want to start picking out the different folder names, and use wildcards from there.

Note in the code example below I've just used some additional code to sort the results by various columns. This could be useful if you need to process each file in name order or something, but it's not required…

WITH [cteFilePaths] AS (
	SELECT DISTINCT
    [Path] = [result].[filepath]()
    ,[OrganisationName] = [result].[filepath](1)
    ,[SourceSystemName] = [result].[filepath](2)	
    ,[TableName] = [result].[filepath](3)

    FROM
    OPENROWSET(
        BULK 'https://storageaccount.dfs.core.windows.net/container/SomeRootFolder/*/*/*/*.parquet',
        FORMAT = 'PARQUET'
    ) AS [result]
), [cteSort] AS (
	SELECT
	[OrganisationName]
	,[SourceSystemName]
	,[TableName]
	,[Path]
	
	,[SortOrder] = ROW_NUMBER() OVER (
		PARTITION BY 
		[OrganisationName]
		,[SourceSystemName]
		
		ORDER BY 
		[TableName]		
	)
	
	FROM
	[cteFilePaths]
)

SELECT
[OrganisationName]
,[SourceSystemName]
,[TableName]
,[Path]
,[SortOrder]

FROM
[cteSort]

Hope that comes in useful, well, until next time…

Using Azure Synapse dedicated SQL pool to get a list of columns from a delta table in the data lake


Ok, stick with me for this. I admit it's a little hacky/messy, but as (at the time of writing) Synapse dedicated SQL pools don't support proper reading of delta tables from a data lake, we have to do some fiddling. There are some posts knocking around from Microsoft with some SQL scripts that you can use as a workaround to load delta tables (see here). However, in my case I was just after a list of the fields, not the data, in a particular delta table, so I could then create a 'matching' table in the dedicated SQL pool database. So this is what I came up with; maybe it will come in handy for someone…

Whilst there's a fair bit of documentation on delta and how to use it, there's not much I could find on how delta actually stores the metadata it uses to keep track of the changes in the underlying parquet files (short of digging through the GitHub code etc…). Although I must admit I didn't look super hard, as it's reasonably straightforward once you realise it's pretty much a bunch of JSON files (check out the '_delta_log' folder for these). If anyone knows a link to somewhere that details this metadata structure 'officially', please leave a comment below. Anyway, if we dig into these files we can see some of this metadata, and luckily for us part of it is schema information about the fields. We can use this in a rough way to pull out a list of fields and data types. Just for reference, here is a link to the docs for delta.
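
Just to give a flavour of what we'll be digging for, the 'metaData' entry inside the delta log looks roughly like the simplified, made-up example below (treat the exact property names and values as illustrative rather than a real log entry). The field names and types are tucked inside the escaped 'schemaString' value, which is what the script further down pulls apart:

{"metaData":{
    "id":"00000000-0000-0000-0000-000000000000",
    "format":{"provider":"parquet","options":{}},
    "schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"Id\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"Name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}",
    "partitionColumns":[],
    "configuration":{},
    "createdTime":1687000000000
}}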


Small word of warning… I'm not guaranteeing this code will consistently work in the future. Like I said at the start of this post, it's a little hacky and messy, but for now it appears to work ok. If the structure of the JSON files changes in the future, though, this could of course break. So, consider that fair warning if you want to use this! I accept no blame for anything that goes wrong with the use of the code below…

-- Reads some of the delta table JSON files and gets a list of columns and data
-- types. Note that the types are not SQL server data types so if you want you could
-- add to this code something to translate them into SQL server data types ;-)
DECLARE @storageAccountName VARCHAR(32) = 'nameofstorageaccount'
    ,@containerName VARCHAR(32) = 'nameofcontainer'
    ,@folderPath VARCHAR(8000) = 'Some/Folder/Path/Here'
    ,@deltaTableName VARCHAR(128) = 'deltatablename'
    ,@destinationTableName VARCHAR(128) = NULL -- Set to NULL to just output the result or name of destination table

-- To build T-SQL for the copy, note that Synapse dedicated SQL pools
-- do not support blob data types for dynamic SQL (e.g. NVARCHAR(MAX))
-- so we can only use small strings of T-SQL!
DECLARE @sql NVARCHAR(4000)

-- Create the url pointing to the storage account and folder 
-- location for delta log folder
DECLARE @storageLocationUrl VARCHAR(8000) = CONCAT(
    'https://'
    ,@storageAccountName
    ,'.dfs.core.windows.net/'
    ,@containerName
    ,'/'
    ,@folderPath
    ,'/'
    ,@deltaTableName
    ,'/_delta_log'
)

-- Lets get some information from the last checkpoint file
IF OBJECT_ID('tempdb..#last_checkpoint') IS NOT NULL
DROP TABLE [#last_checkpoint]

CREATE TABLE [#last_checkpoint] (
    [FileContent] NVARCHAR(MAX) NOT NULL
) WITH (
    DISTRIBUTION = ROUND_ROBIN
    ,HEAP
)

SET @sql = ''
SET @sql += '
COPY INTO [#last_checkpoint] FROM ''' + @storageLocationUrl + '/_last_checkpoint'' 
WITH (
 FILE_TYPE = ''CSV''
 ,FIELDTERMINATOR = ''0x0b''
 ,FIELDQUOTE = ''0x0b''
 ,ROWTERMINATOR = ''0x0d''
)
';

PRINT @sql;
EXEC sp_executesql @stmt = @sql;

-- Now pick out the version from the file data
DECLARE @version INT = (
    SELECT
    JSON_VALUE([FileContent], '$.version')

    FROM
    [#last_checkpoint]
)

-- Now we can use this version number to build the filename for the checkpoint file. It
-- has leading zeros so we can use the FORMAT function to sort that out for us ;-)
DECLARE @checkpointFilename VARCHAR(500) = FORMAT(@version, '00000000000000000000') + '.checkpoint.parquet'	

-- Build the SQL command to load the data for the checkpoint file
SET @sql = ''
SET @sql += '
COPY INTO [#checkpoint] 
FROM ''' + @storageLocationUrl + '/' + @checkpointFilename + ''' 
WITH (
 FILE_TYPE = ''PARQUET''
 ,AUTO_CREATE_TABLE = ''ON''
)
';

-- Load the checkpoint data
IF OBJECT_ID('tempdb..#checkpoint') IS NOT NULL
DROP TABLE [#checkpoint]

PRINT @sql;
EXEC sp_executesql @stmt = @sql;	

-- Now we can get the schema string from the checkpoint 'metaData' field (JSON contents)
-- and this contains field information (at least for the specific version)
DECLARE @schemaString NVARCHAR(MAX) = ''

SELECT TOP (1)
@schemaString = JSON_VALUE([metaData], '$.schemaString')

FROM 
[#checkpoint]

WHERE
[metaData] IS NOT NULL

ORDER BY
[metaData]

-- Load into destination table if a name has been supplied. If no destination
-- table name has been supplied then just execute the T-SQL select statement
IF @destinationTableName IS NULL SET @sql = ''
ELSE SET @sql = 'INSERT INTO [' + @destinationTableName + '] '

-- Use some JSON functions to pick out the field names and types ;-)
SET @sql += '
SELECT
[cols].[ColumnName]
,[cols].[DataType]

FROM
OPENJSON(@fieldJSON) WITH (
    [Columns] NVARCHAR(MAX) ''$.fields'' AS JSON
) [f]

OUTER APPLY 
OPENJSON([f].[Columns]) WITH (
    [ColumnName] NVARCHAR(128) ''$.name''
    ,[DataType] NVARCHAR(128) ''$.type''
) [cols]	
';

DECLARE @parameterDefinition NVARCHAR(500) = '@fieldJSON NVARCHAR(MAX)';
EXEC sp_executesql @stmt = @sql
    ,@params = @parameterDefinition
    ,@fieldJSON = @schemaString;
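
As the comment at the top of the script says, the types that come back are delta/Spark type names (string, long, timestamp and so on) rather than SQL Server data types. If you do want to translate them, a rough sketch of a mapping might look something like the below. The mappings are my own assumptions and are not exhaustive (decimals in particular need extra handling for precision and scale), so adjust to suit:

-- Rough sketch only: map a delta/Spark type name to an approximate SQL Server type
DECLARE @deltaDataType NVARCHAR(128) = 'long'

SELECT
[SqlDataType] = CASE @deltaDataType
    WHEN 'string'    THEN 'NVARCHAR(4000)'
    WHEN 'integer'   THEN 'INT'
    WHEN 'long'      THEN 'BIGINT'
    WHEN 'short'     THEN 'SMALLINT'
    WHEN 'byte'      THEN 'TINYINT'
    WHEN 'double'    THEN 'FLOAT'
    WHEN 'float'     THEN 'REAL'
    WHEN 'boolean'   THEN 'BIT'
    WHEN 'date'      THEN 'DATE'
    WHEN 'timestamp' THEN 'DATETIME2'
    WHEN 'binary'    THEN 'VARBINARY(MAX)'
    ELSE 'NVARCHAR(4000)' -- fallback, e.g. complex or decimal types
END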

Have fun, I’ll be back with some more data platform code, tips and tricks soon…

Using a single calendar with multiple date fields in a fact table dynamically in PowerBI (with a twist)


Why am I covering this? It's the age-old argument in PowerBI: do I have a single calendar or multiple calendars, one for each date in my fact table? Many people have already written posts on how to solve this issue, mostly either using the USERELATIONSHIP function in DAX or just by creating multiple calendars (i.e. copies), one calendar for each date… ok, so all done then, post complete? Hang on, wait a sec, this is a little twist on the normal proceedings, so stick with me for a moment, this might be useful… maybe…


First a little background, for anyone who remembers the old days of multidimensional SSAS: this was a problem usually solved using 'true' role playing dimensions. Role playing dimensions are, or were, one real dimension masquerading as several via multiple relationships to a fact table. This is different to how you would probably implement 'role playing' dimensions in PowerBI, where your only real choices are physical copies of the dimension (usually calculated tables), one for each date field in your fact table, or a single dimension with several relationships to the same fact table, employing the good old USERELATIONSHIP function. See what I mean here in the Microsoft PowerBI documentation regarding star schemas and dimensional modelling. Unfortunately this nice feature from the old days of multidimensional SSAS was never implemented in SSAS tabular (i.e. the technology that sits behind PowerBI), so we're left with the usual options above. It's mainly preference I guess, but if you've only got a few dates then creating a few calendars isn't too bad and doesn't bloat the data model too much or increase its size massively. However, if you've got a lot of date fields in your fact table then this can quickly spiral out of control. I think most people would (at this point) just have a single calendar and (as detailed above) rely on the USERELATIONSHIP function. You'd probably then create several measures which just activate the relevant calendar relationship. Well, maybe we can simplify this by using a single measure and also gain some other added value benefits…


This is going to work by using a combination of a SWITCH statement in the measure and a static table with some hard-coded values in it. This table is going to list each of the date names; it could then serve as a slicer or filter to allow someone to choose which date relationship to use, and the SWITCH statement in the measure decides which relationship to activate. The data in this table could be entered manually (like below) or even defined in the data source (such as a SQL table), either way the principle is the same.
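
If you do want to just type the values in, one quick way is a DAX calculated table. This is only a minimal sketch, assuming the table is called 'Active Date' with a single 'Date Name' column and that the names match the ones used in the measure below:

Active Date = DATATABLE(
    "Date Name", STRING,
    {
        {"Open Date"},
        {"Closed Date"}
    }
)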

Now we can just create a single measure and use the SWITCH statement to figure out which date the user wants to use depending on the selection of the date name in the active date table. We can figure out which date has been selected by using the SELECTEDVALUE function. So the DAX for the measure becomes something a bit like this:-

Total Sales = SWITCH(
    SELECTEDVALUE('Active Date'[Date Name]),
    
    "Open Date",
    CALCULATE(
        SUM(MyFactTable[Some Value]),
        USERELATIONSHIP(
            'Calendar'[Date],
            MyFactTable[Open Date]
        )
    ),
    
    "Closed Date",
    CALCULATE(
        SUM(MyFactTable[Some Value]),
        USERELATIONSHIP(
            'Calendar'[Date],
            MyFactTable[Closed Date]
        )
    ),
    
    // If nothing selected, default to just use
    // the 'Open Date' relationship
    CALCULATE(
        SUM(MyFactTable[Some Value]),
        USERELATIONSHIP(
            'Calendar'[Date],
            MyFactTable[Open Date]
        )
    )
)

Not super clean code, I admit. In a perfect world I'd want to have a single CALCULATE statement and then use the SWITCH statement inside it to choose which relationship to use; that'd look much cleaner in my eyes. Unfortunately that's not supported at the time of writing, so we've got to live with the several CALCULATE statements. Ok, but what's the benefit of all this?


Well, the main point of using this method is that we can now let the user choose which dates to use by using a slicer for example, like this:-

Another cool, but maybe not super useful, feature is that you can select all the dates and show them on the same chart. I guess this could be helpful for some comparisons, maybe, or it could be too much overload – so use only where appropriate. To do this just drop the 'Date Name' field from the static table onto the 'legend' panel for the visualisation, then select all the dates on the slicer and hey presto:-

Ok there we go. That’s all for today from me, hope you found this interesting and possibly useful. I’ll be back with some more random tidbits soon…

Using PowerQuery to automatically clean column names in PowerBI


What's your favourite naming convention for tables or columns? Pascal case? Camel case? Snake case? Kebab case? Well, as a developer it could be any of them depending on your preference, or even a combination. However, just pause for a moment and take off your developer hat. Now imagine you're a manager viewing a report: would you be expecting to see Pascal case, camel case or some other case anywhere on that report? Probably not. I mean, imagine a report with a title of 'this_is_my_report'; that wouldn't look very professional, would it?

What you want to see is nice clear English names for all your columns and tables. Obviously we could solve this by just renaming columns manually. So, for example ‘some_field_in_some_table‘ would become ‘Some field in some table‘ or ‘Some Field In Some Table‘ (depending on how you like to capitalise words). However, this would of course become a pain in the butt if you’ve got loads of tables and columns. Luckily though, we can use PowerQuery here to help us out a bit…


We'll use a couple of the splitter functions here. First, the Splitter.SplitTextByCharacterTransition function to handle Pascal case, where we've got transitions from lower case to upper case. We'll also need another function, Splitter.SplitTextByAnyDelimiter, which will handle any delimited text that's using separators (i.e. snake case or kebab case). Using a combination of both these functions we should be able to split apart any text into groups of characters, depending on certain criteria. So with this in mind, let's just create a little function to do this text cleaning. We'll start simple then improve on this later:-

let
    SplitFunction = (TextToSplit) => let
        // Another function to split non-delimited text
        SplitNonDelimitedText = (x) => Splitter.SplitTextByCharacterTransition({"a".."z"}, {"A".."Z"})(x),
        
        // Another function to split delimited text
        SplitDelimitedText = (x) => Splitter.SplitTextByAnyDelimiter({"_", "-", " "})(x),
        
        // Split the supplied text
        SplitResult = if Text.Contains(TextToSplit, "_") 
            or Text.Contains(TextToSplit, "-")
            or Text.Contains(TextToSplit, " ") 
            then SplitDelimitedText(TextToSplit)
            else SplitNonDelimitedText(TextToSplit)

        in 
        SplitResult
    
    in
    SplitFunction

Running this function with a parameter value of say either ‘ExampleText’, ‘Example-Text’ or ‘Example_Text’ produces the following output. Nice, just what we need:-

{"Example", "Text"}

Ok, let's build on this a little by cleaning and capitalising the text correctly. We'll use the Text.Clean function to remove any dodgy control characters (unlikely, but…) and Text.Proper to capitalise, then finally a bit of code to get the column names from the supplied table and replace each of them with our 'cleaned' names. This is what we end up with:-

let
    CleanTableColumnNames = (TableToFormatColumnNames as table) => let                
        // Create a little function to split and clean the text
        CleanColumnName = (ColumnName) => let
            // Another function to split non-delimited text
            SplitNonDelimitedText = (x) => Splitter.SplitTextByCharacterTransition({"a".."z"}, {"A".."Z"})(x),
            
            // Another function to split delimited text
            SplitDelimitedText = (x) => Splitter.SplitTextByAnyDelimiter({"_", "-", " "})(x),
            
            // Split the supplied text
            SplitResult = if Text.Contains(ColumnName, "_") 
                or Text.Contains(ColumnName, "-")
                or Text.Contains(ColumnName, " ") 
                then SplitDelimitedText(ColumnName)
                else SplitNonDelimitedText(ColumnName),
                
            // Now recombine any parts using space as a word separator
            ParsedName = Text.Combine(SplitResult, " "),

            // Finally clean the text so we get nice capitalisation
            NewColumnName = Text.Clean(Text.Proper(ParsedName))

        in 
        NewColumnName,
                
        // Get the existing columns names from the table
        ExistingColumnNames = Table.ColumnNames(TableToFormatColumnNames),

        // Loop through each column name in the table and build new 
        // 'clean' names using the little split function we defined above
        NewColumnNames = List.Transform(
            ExistingColumnNames,
            each CleanColumnName(_)
        ),

        // We need to create a new list of lists with each original column
        // paired with its replacement column name
        OldAndNewColumnNamePairs = List.Zip({ExistingColumnNames, NewColumnNames}),

        // Now finally we can use the renamecolumns function to rename 
        // the columns in the supplied table with our replacement names ;-)
        RenamedColumns = Table.RenameColumns(TableToFormatColumnNames, OldAndNewColumnNamePairs)

    in
    RenamedColumns

in
CleanTableColumnNames
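
As a quick example of calling it as a step in another query (assuming the query above has been saved as a function called CleanTableColumnNames, and "Promoted Headers" is just a placeholder for whatever your previous step is called):

= CleanTableColumnNames(#"Promoted Headers")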

Now you can just apply this function to your table as another step (as in the little example above) and watch the column names magically change to nice readable text. This will be much friendlier and more professional-looking than Pascal, snake or kebab case column names, especially when designers are dragging these fields onto visualisations or slicers. Anyway, that's all for this one, hope someone finds it useful…

You’re using Azure Synapse, so where does Microsoft Fabric fit in?


Ah, the latest fancy shiny new thing! Last month Microsoft released Fabric to public preview and are currently promoting it hard with all sorts of articles, videos and posts. If you're already using Azure Synapse, though, or thinking about using it, what's the story, what should you do, if anything? Well, that's up to you; here I'll just give you my opinions on the topic, just don't blame me either way!

First off, I'm going to assume that if you're reading this then you're aware of Fabric and Synapse; they're hard to miss if you're involved in the world of data and data warehousing. If you don't know what Fabric is, why are you even reading this? 😉 There are tons of articles on the internet and YouTube which explain it all, so I'm not going to go over that here, have a search on the web and enjoy…

This article is targeted more at anyone currently using Azure Synapse or starting to use it, as there are a number of questions I’m sure you’ve got. These are just my questions, but hopefully I’m not alone! Some of them came straight to mind on first hearing of Fabric, others from playing around with the public preview for a while now.


I thought Microsoft were pushing Azure and all its data processing tools, so how come a large chunk of the data platform side of Azure is now being moved outside of it?

I've got to say, the release of Fabric and its approach kind of took me by surprise, and I'm sure I'm not the only one. However, when I say 'by surprise' I mean I was surprised that this wasn't in Azure, but was instead just an evolution of the PowerBI service. Instead of adding this as a new version of Synapse in Azure, they've taken parts of Azure Synapse and built them on top of the PowerBI service. While I remember, the PowerBI service is turning into Microsoft Fabric on the 1st of July, so mark that in your diary.

I can’t answer the reasons behind this decision as I don’t work for Microsoft, but I would imagine that they’re just trying to build on the success of PowerBI and appeal to those users who only use PowerBI and not Azure. I guess this way they’re hoping they can tempt people away from using their on-prem data warehouses and ETL/ELT processes. They’ve been kind of doing this with the data flows functionality which came before Fabric. I remember having a similar thought back then for those, confusing…

Not everyone might agree with this decision. I for one never understood why they kept PowerBI sitting outside Azure anyway; I always felt it should have and could have been brought in. The current SaaS approach that they're taking with Fabric could have remained, I just think it could have been more integrated with Azure, considering that lots of organisations have already invested heavily in Azure and its existing data platform technologies like Synapse and data factory.


Is this why Azure Synapse has been floating around in a sort of half built/finished state for such a while?

I'd be guessing that the simple answer to this question is yes… This one for me is a little frustrating, as I'm sure it is for other Azure Synapse developers. We've been asking for all these kinds of features, like delta, source control etc… for ages, and now they've quietly gone and more or less implemented all of them, but in a different place!?! What gives, Microsoft? What makes this even more frustrating is that it's not like Fabric is a totally new product or the next big shiny thing; Fabric is the next version of Synapse (gen3 if you will), and Microsoft have even referred to it as such. Even using Fabric is very similar to Synapse, except there's now less configuration required (as per the SaaS approach).

I'm sure the Microsoft developers were working on Fabric instead of Synapse features. It wouldn't make sense to spend the time implementing the same features in two places; even with a shared code base there'd still be a lot of work. They've also said that Synapse will not be getting any new feature updates from this point onwards, so that's something to keep in mind. Maybe some features already in the works or promised before Fabric was announced could still get released, but I don't know – don't quote me on that! They've also stated that Synapse is not being discontinued (yet), but I'd say the writing is on the wall and it will go the way of the dodo in the not too distant future. Remember all the talk of U-SQL back in the day anyone?


Hang on, so there’s now three slightly different versions of data factory? Azure data factory, Azure Synapse pipelines and Fabric data factory? Eh?

This is starting to get messy now; a lot of the reasons for using the Azure based data factory have been stripped away, at least in a certain sense. You must admit it does seem a bit odd having three slightly different versions of the same product all over the place. Some of the differences, for example, are datasets and linked services, which have gone from Fabric, replaced with 'connections'. I actually think this is a good move as datasets did seem a bit pointless. However, in terms of the linked services, the Fabric 'connections' which have replaced them are not parameterised (maybe they'll implement that in future releases). So, if you've created a generic set of processing pipelines and parameterised linked services in Azure data factory to do your ETL, how does this translate to Fabric? Erm, it doesn't really… For now at least you'd need to create a different pipeline for each individual data source or source system you're importing from.


What’s the deal with Azure databricks, if all the data platform stuff is now in Fabric but that’s still left behind in Azure?

To be honest I don't know, although I suspect its similarity in terms of functionality to Azure Synapse, which already provides Apache Spark clusters and notebooks to do all sorts of processing, has something to do with it. As with most of the other parts of Azure Synapse, Microsoft have kind of copied and pasted the Synapse Spark experience into Fabric almost verbatim (with some little improvements like cluster startup times) and called it Synapse Data Engineering; it would be like doing this all over again for databricks.

There are of course differences between the Azure Synapse Spark experience and databricks, such as more focus on machine learning in some databricks scenarios, plus other bits and bats. Although you could argue that Microsoft are 'countering' this in Fabric with Synapse Data Science. I think the functionality overlap if they brought databricks in would be too confusing, so I personally can see why they left it in Azure. Although saying that, we're now left in this strange situation where most of the data platform tools are outside Azure but some are still in it. I repeat, Microsoft: why wasn't Fabric made part of Azure, with PowerBI brought inside too? That would solve this and some of the other issues. I do think it's fragmented the Microsoft data platform landscape a bit…

One thing for sure is that setting up and configuring databricks can be a royal pain in the ass… sorting out source control and mounting data lakes are just for starters, whilst in Fabric there is virtually no setup or configuration to worry about. Unless you really need some of the extras in databricks, I'm starting to see a potential future where I don't need to faff around with databricks anymore…


If I’ve just started working with Azure Synapse should I switch over to Fabric?

This is a big one. I'd say that you need to look into the differences between the two, how much work you've put into Synapse testing or prototyping already, plus a potential cost analysis. These are all a little bit up in the air at the moment, as Fabric has only recently been released and it's still in public preview (so some bits don't work at all and others sometimes work and sometimes don't). It's still in a state to do an initial PoC, although it is likely some features may change, and some will be added later. You could create something in your PoC that is then rendered pointless by a new feature or the removal of one! Just be aware!

With that in mind, let's have a look at a quick comparison of some key features. We'll have some descriptions, pros and cons, plus a rough score for each one out of 10. I won't go into super detail, but these would be my key points for evaluating:-

Setup and configuration
Azure Synapse (score 7/10): Spin up the service and configure the firewall. Some people have had to enable a service provider before creating a workspace.
Microsoft Fabric (score 8/10): SaaS, so nothing to set up or configure, it just exists. Some issues connecting to firewalled Azure resources.

Cost
Azure Synapse (score 9/10): Dedicated SQL pools have a fixed cost depending on the DWU level chosen. Serverless SQL queries are charged per query. Spark is charged by the size of the pool and how long it's running. Can be cheap if you only use it for short periods. Dedicated SQL can pause and resume to save costs, but this is not automatic pause and resume like it is with Azure SQL serverless databases, it's manual (pain).
Microsoft Fabric (score 8/10): Still early days, but you purchase Fabric capacity, which is the same thing if you already have PowerBI capacity. Note that the Fabric capacities have a different SKU name but they do map to each other. Could do better, with more clarity required.

Source control
Azure Synapse (score 4/10): Some, but quite poor support… any views, procedures and functions that are in the lake databases and serverless SQL databases are not included. In fact nothing in the serverless SQL databases is under source control. Dedicated SQL pool databases can be kept under source control using Visual Studio database projects.
Microsoft Fabric (score 10/10): All objects everywhere (eventually).

On-prem connectivity
Azure Synapse (score 7/10): Yes, by installing the self hosted integration runtime service on a machine on-prem. Although you need a separate service – and thus a separate server or machine to install it on – for each Synapse workspace, as they can't share integration runtimes unlike normal Azure data factory.
Microsoft Fabric (score 10/10): Much simpler, it just uses the on-prem gateway service installed on a server or machine on-prem.

Data warehousing
Azure Synapse (score 8/10): You can use the dedicated SQL pool (i.e. the evolution of the old parallel data warehouse) with a limited T-SQL instruction set. Depending on the DWU tier you can do massive amounts of processing in a short time. You can do some 'logical' data warehousing using the serverless engine; however, this cannot update, delete or truncate tables (i.e. files in the data lake). There is limited write functionality to write tables/files to the data lake in both dedicated and serverless, but there isn't much control. Also, the dedicated SQL pool has some issues writing parquet files with correct file extensions (we had to use the REST API to rename files after export!).
Microsoft Fabric (score 9/10): Parts of the dedicated SQL pool and serverless engines have been combined into a new iteration of the old Synapse serverless SQL engine, so now we can create tables and update, insert, delete etc… using the underlying delta files. There is still a limited T-SQL instruction set (for now) similar to the dedicated SQL pool, but it's a kind of best of both worlds. One big plus point is that you don't need to write or export your data back to the data lake after processing, as all the data is already in the data lake (OneLake).

Client tools
Azure Synapse (score 6/10): You can use good old SSMS if you're working with dedicated SQL pool databases. It's a different story with serverless SQL though, as it is quite limited, with no DML allowed. It's similar for the lake databases too; these can only be built using the Synapse studio designer, which is very limited (and is likely to stay that way now). As I mentioned in the source control section, you can add your own procs, views and functions to serverless or lake databases using SSMS or Azure data studio, but these objects are not kept under source control.
Microsoft Fabric (score 7/10): Currently quite similar to Azure Synapse. You can only use external client tools like SSMS or Azure data studio for DDL, so create or drop tables; you cannot insert, update, delete etc… for this you need to be in Fabric and use the GUI. All the SQL endpoints are just read only for now. The Fabric designers and editors are a bit more evolved than the Azure Synapse studio designer, but there is still some work required to make it all feel like a proper IDE or editor. There is talk that Microsoft are working to enable external tools so developers can use their preferred tool of choice eventually, but there's nothing I could find that states what parts of Fabric this applies to (is it just some or all?).

Spark (called Synapse data engineering in Fabric)
Azure Synapse (score 8/10): The only real pain here is the startup time of the clusters; it's like it was in data factory, where it would take several minutes to spin up a cluster before you could do any work. They fixed this in data factory ages ago, but Synapse still has this issue.
Microsoft Fabric (score 9/10): The Spark GUI in Fabric is almost a copy and paste job compared to Synapse, and some small improvements have been made, but the big plus here is the issue of cluster startup times, which have been reduced to mere seconds now.

Data factory
Azure Synapse (score 7/10): The data factory (or pipelines) element of Azure Synapse is virtually identical to the normal data factory, although it seems to lag behind in terms of certain features. So no global parameters for instance, and the issue with the self hosted integration runtimes and sharing is a real pain. Also a lack of dark mode, which for a lot of developers nowadays is a must. I remember the days when I had three monitors all glaring at me and it felt like I was getting a sun tan 😉 So considering its similarity to, but feature lag compared with, normal data factory, this is frustrating.
Microsoft Fabric (score 8/10): This one is interesting as data factory has largely stayed the same except for a few crucial differences. Microsoft have completely binned off the old data factory Spark powered data flows and replaced them with the 'PowerBI' data flows, calling them 'gen2' data flows. This could be a learning curve for developers who are used to the old data factory data flows. Linked services and datasets have gone, replaced with just 'connections'. For me this is good, as datasets were almost pointless, especially if you were doing generic linked services and pipelines. However, currently the connections functionality in Fabric lacks any kind of dynamic functionality or parameterisation; if this doesn't change it could be a bit of a blocker for some migrations. Two steps forward, one step back I guess.

PowerBI
Azure Synapse (score 5/10): Synapse was sold to us as a more integrated data platform that was bringing together several related parts of a modern data platform, which included PowerBI. In reality though it's not much more than viewing PowerBI reports in an iframe and links to download PowerBI desktop. Kind of useful, but not full integration.
Microsoft Fabric (score 10/10): PowerBI has had some major integration into the elements of Fabric. Just creating a warehouse or lakehouse, for example, now automatically creates a PowerBI dataset that incorporates all the tables you add to the warehouse or lakehouse. The real bonus here though is the new direct lake mode: PowerBI can now query the OneLake (the data lake where all the delta tables are) directly, so there isn't a need to import or refresh your data model (or use direct query). This means time saved, no in-memory storage required and very fast query performance. The nightmare of scheduling refreshes is almost at an end!

Totals: Azure Synapse 61, Microsoft Fabric 79

I'm sure I've probably missed some stuff, so let me know if you've got any thoughts on this. However, even taking into account that Fabric is in preview, it's looking very tempting to try a PoC. I would definitely hold off developing and putting anything production-ready in it though, as there could be changes before Fabric goes GA. My main concern with Fabric now is potential cost, which I guess only a PoC would be able to determine.


I’ve got all this Azure Synapse infrastructure, should we migrate?

If you've got all your Azure Synapse processing set up and working now then I'd say there's not much incentive to migrate, at least not right now. Microsoft have promised some Synapse to Fabric migration tools, and there's the dedicated SQL pool 'link' to Fabric warehouse feature which might be useful. You could take advantage of that if your organisation uses PowerBI so you don't need to schedule data model refreshes; however, the rest of the functionality Fabric has over Synapse, such as improved source control, won't give you any visible performance improvements. It also remains to be seen whether a high scale DWU dedicated SQL pool can outperform the speed of the data lake (OneLake) or not. It could be very close, or not at all.

Hope that helps, and if Microsoft are reading, next time speak to me before doing anything 😉

Azure data factory – rename a file in the data lake using the gen2 REST API

I recently needed to rename a file in a data lake, and ideally I wanted to do this using Azure data factory as it was part of a large import running in data factory. Searching the web there are quite a few posts about how to use the ADLS gen2 REST API to rename files in a data lake, however I struggled to find any examples of how to do this using Azure data factory and a web activity. I got stung on a number of things in data factory whilst trying to set this up, mostly related to authorisation and the odd missing slash! So I figured I'd do a quick post here to help anyone else encountering similar issues and hopefully save others some of the pain I experienced 😉

Before we go any further, for reference purposes we’re using the ADLS data lake gen2 REST API, more specifically the ‘path create’ method. See the link here to the Microsoft reference documentation https://learn.microsoft.com/en-us/rest/api/storageservices/datalakestoragegen2/path/create

Okay, now that's out of the way, let's look at the authorisation part. The easiest way I found to do this in data factory is to just use the managed identity of the data factory you're doing this with. To do this you need to go to the data lake in question and add the relevant data factory identity as a 'storage blob data contributor' so it can read/write/create the files. On my particular lake the data factory also has 'contributor' permissions as it's doing a lot more than just renaming files.

Now, back in data factory, I've created a new pipeline and added the following parameters for this example. Hopefully the names are self-explanatory! The only thing to note is that the file path parameters should not contain the name of the container/filesystem; these are just the paths underneath it. So you might have the following parameter values for example:-

  • StorageAccountName = mydatalake
  • ContainerName = somecontainername
  • OldFilePath = somefolder/subfolder/myfile.csv
  • NewFilePath = somefolder/subfolder/newfilename.csv

Next, we add a 'web activity' onto the pipeline canvas. I've called mine 'Rename file in datalake' but of course you could name it whatever you like. Now let's change some of the web activity's settings:

  • URL = this is the full url of the new file in the data lake, NOT the location of the current existing file!
  • Method = this is the web method for the request, we need to use 'PUT'
  • Body = this particular request does not require a body, however data factory complains about an empty body so we use a workaround. Click the 'add dynamic content' under the Body value and set the expression to be @toLower('') then this will produce an empty body and stop data factory complaining 🙂
  • Authentication = set this to use the 'System Assigned Managed Identity' as this will make the web request to the API as the data factory's own managed identity (make sure you've added permissions for the data factory's account on the data lake!) and we don't need to worry about getting or setting tokens and all that jazz
  • Resource = set this to 'https://storage.azure.com'
  • Headers = we can get away with just a single header here, the 'x-ms-rename-source' header. This is the folder location of the current existing file. In addition, and although not technically required for the call to work, we should maybe add the API version header in case things change in the future. If you want to do this add another header called 'x-ms-version' and set the value to the current version shown in the Microsoft documentation link given previously (at the time of writing the API version is '2021-10-04')

Now we'll take a look at the URL setting and the dynamic expression we're using to build the URL. Here I've used the 'concat' function to build the string value we need:
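
Something along these lines should do it (this is just a sketch using the pipeline parameters defined earlier, so adjust the names to match your own):

@concat('https://', pipeline().parameters.StorageAccountName, '.dfs.core.windows.net/', pipeline().parameters.ContainerName, '/', pipeline().parameters.NewFilePath)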

Now we need a slightly different expression for the 'x-ms-rename-source' header value. This is not a full URL, it's just the location of the existing file (but it includes the name of the container too). See below:
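
Again just a sketch using the same pipeline parameters; note the leading slash before the container name:

@concat('/', pipeline().parameters.ContainerName, '/', pipeline().parameters.OldFilePath)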

That's it. As long as you've set the correct permissions for the data factory to access the data lake and got the file locations and URLs correct, you're good to go. Hope that helps someone; let me know in the comments if I've missed anything or if you found this useful. See you next time…

T-SQL comparing columns in two tables

Just another little T-SQL nugget here which I hope might help others. I recently had to do a quick check between two tables which were both copies of the same original table but had been modified slightly. I wasn't interested in the data, just which columns had been added or changed between the two tables. Luckily there is a handy way to do this using the T-SQL set operator EXCEPT. Quite often you'll see this operator used to compare the result sets of two tables to find where rows exist in one but not the other, a bit like this:-

SELECT * FROM table1

EXCEPT

SELECT * FROM table2

However, this is quite simple to modify to let us check which columns exist in one table but not the other: we can use the INFORMATION_SCHEMA.COLUMNS view to get the columns from both tables and, in combination with the EXCEPT operator, see where there are differences:-

-- Set the names of the tables we want to check
DECLARE @tableSchema1 VARCHAR(500) = 'dbo'
DECLARE @tableName1 VARCHAR(500) = 'SomeTableName'
DECLARE @tableSchema2 VARCHAR(500) = 'dbo'
DECLARE @tableName2 VARCHAR(500) = 'OtherTableName'

-- Find columns in table 1 that are NOT in table 2
SELECT
[ColumnsInTable1ThatAreNotInTable2] = [COLUMN_NAME]

FROM
[INFORMATION_SCHEMA].[COLUMNS]

WHERE
[TABLE_SCHEMA] = @tableSchema1
AND [TABLE_NAME] = @tableName1

EXCEPT

SELECT
[COLUMN_NAME]

FROM
[INFORMATION_SCHEMA].[COLUMNS]

WHERE
[TABLE_SCHEMA] = @tableSchema2
AND [TABLE_NAME] = @tableName2

ORDER BY
[ColumnsInTable1ThatAreNotInTable2]

-- Find columns in table 2 that are NOT in table 1
SELECT
[ColumnsInTable2ThatAreNotInTable1] = [COLUMN_NAME]

FROM
[INFORMATION_SCHEMA].[COLUMNS]

WHERE
[TABLE_SCHEMA] = @tableSchema2
AND [TABLE_NAME] = @tableName2

EXCEPT

SELECT
[COLUMN_NAME]

FROM
[INFORMATION_SCHEMA].[COLUMNS]

WHERE
[TABLE_SCHEMA] = @tableSchema1
AND [TABLE_NAME] = @tableName1

ORDER BY
[ColumnsInTable2ThatAreNotInTable1]

Data Warehousing Snippets – a slightly different ISDATE() function

The T-SQL function ISDATE() is great for determining if a string is a valid date or not. I imagine you've used it many times if you've been writing T-SQL for a long time. However, there is something that can make it act quite unexpectedly… the language settings!

Eh? The language settings? What has that to do with dates???

Well, it has to do with the various formats that constitute a valid date in different parts of the world. For example, in the U.S.A. they often refer to dates by month first followed by day, like April 1st. In the U.K. we're the other way around, so you'd likely see 1st April. Taking this back to T-SQL, the string '2019-25-01' can be a perfectly valid date (year, day, month), and '2019-01-25' can also be a perfectly valid date (year, month, day). The problem is that the ISDATE() function takes the language setting into account when deciding. Try the T-SQL below and you'll see which dates are valid or not. Note that we're using SET LANGUAGE to explicitly set the language (duh!) for the session:-

DECLARE @invalidDateInUK VARCHAR(10) = '2019-25-01';
DECLARE @validDateInUK VARCHAR(10) = '2019-12-01';  
  
SET LANGUAGE italian;  
SELECT 
ISDATE(@invalidDateInUK)
,ISDATE(@validDateInUK)

SET LANGUAGE english;  
SELECT 
ISDATE(@invalidDateInUK)
,ISDATE(@validDateInUK)

Hopefully now you can see what might happen if you don't know (or don't set) the language. Say, for example, you ran the code below; you would see different results from the ISDATE() function depending on your language setting.

DECLARE @invalidDateInUK VARCHAR(10) = '2019-25-12'; 

SELECT 
ISDATE(@invalidDateInUK)

This of course makes the use of ISDATE() potentially risky if you aren't explicit about the language settings. Is there any way we can improve or fix this? Well yes, there are several options. The simplest being to just set the language explicitly, I hear you cry! Yes, in most cases this should suffice, and you should only write the least amount of code you can to solve the problem. Unfortunately this isn't possible in some scenarios; for example, you cannot set the language in a view. We can get around that by using a function. So how can we write the function so that the language doesn't matter? Let's try this:-

CREATE FUNCTION [Chronological].[IsValidDate] (
	@year INT
	,@month INT
	,@day INT
)
RETURNS BIT AS
BEGIN
	-- this function checks for a specific year, month and day to see
	-- if its a valid date

	-- first get the year as a varchar
	DECLARE @yearStr VARCHAR(4) = CAST(@year AS VARCHAR(4))

	-- generate a 2 character leading zero month
	DECLARE @monthStr VARCHAR(2) = CASE
		WHEN @month < 10 THEN '0' + CAST(@month AS VARCHAR(1))
		ELSE CAST(@month AS VARCHAR(2))
	END

	-- generate a 2 character leading zero day
	DECLARE @dayStr VARCHAR(2) = CASE
		WHEN @day < 10 THEN '0' + CAST(@day AS VARCHAR(1))
		ELSE CAST(@day AS VARCHAR(2))
	END

	-- now we can use the standard ISDATE() function
	-- note we are using the YYYYMMDD format of the ISO 8601 
	-- standard here as this should prevent problems with 
	-- dates being interpreted differently depending on the
	-- culture or language settings
	RETURN ISDATE(@yearStr + @monthStr + @dayStr)
END

Pretty simple and straightforward; the important part really is the last bit. We can use one of the ISO 8601 formats with the original ISDATE() function to achieve our goal. The preceding code is really only there to generate the string in that specific format. If we explicitly ask for the year, month and day as simple integer parameters then we're sorted 😉
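
As a quick sanity check, here's how you might call it (2019 wasn't a leap year, 2020 was):

SELECT [Chronological].[IsValidDate](2019, 2, 29)  -- returns 0
SELECT [Chronological].[IsValidDate](2020, 2, 29)  -- returns 1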

Hope you like this one. It's not something that you should use if there's a simpler solution to the problem; however, if you find yourself stuck needing ISDATE() in a view, then this, or something like it, may be the solution.

Data Warehousing Snippets – Fiscal year function

So, I figured I’d start a little series and post various little (mainly) data warehousing related functions, procedures and other bits of (hopefully) useful T-SQL code. I’ll put these all in a complete Visual Studio SQL project and publish to our GitHub repository (eventually) if you just want to download the whole thing.

Note that for some/most of the series, I will use specific schemas to categorise each function or procedure. This should lead to a nice clean database with a logical and easy to follow structure.

Since our function today is about dates, I've created a schema called 'Chronological' to make it clear what this function, and any other related functions, procedures or tables etc., are for. If you want to create this schema, just run the T-SQL script below on your database. Or if you prefer, alter the function code further down to use another schema, such as 'dbo'.

CREATE SCHEMA [Chronological]

So, without further ado, the first in the series (and today's snippet) will be about calculating the fiscal year:-

CREATE FUNCTION [Chronological].[FiscalYear] (
	@date DATE
	,@separator VARCHAR(1)
)
RETURNS VARCHAR(7)
AS
BEGIN
	-- first calculate the starting year, if the month is 
	-- is before April then the starting year is the previous year
	-- otherwise its just the current year
	DECLARE @startingFiscalYear INT = CASE 
		WHEN MONTH(@date) < 4 THEN YEAR(@date) - 1
		ELSE YEAR(@date)
	END	

	-- return our fiscal year, note that we just add 1 to 
	-- the starting year to get the ending year
	RETURN CAST(@startingFiscalYear AS VARCHAR(4)) + @separator + RIGHT(CAST(1 + @startingFiscalYear AS VARCHAR(4)), 2)
END

Note that I've used a DATE data type for the first parameter, although this function will work if you pass a DATETIME or SMALLDATETIME value. The second parameter is the separator you would like to use; it's likely this will be either a '-' or '/', as most fiscal years are displayed like 2019-20 or 2019/20.
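
As a quick example of calling it (assuming the UK style April to March fiscal year, which is what the function above implements):

SELECT [Chronological].[FiscalYear]('2020-03-31', '/')  -- returns '2019/20'
SELECT [Chronological].[FiscalYear]('2020-04-01', '/')  -- returns '2020/21'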

Anyway, that’s it for today. I’ll try and post regular entries in this series so that eventually you should have a decent amount of infrastructure that you could use for all sorts of projects.

SSIS and Excel drivers 32bit vs 64bit

I recently had to revisit an SSIS package that required some changes to import data from an Excel file. It had been a long time since I’d last done one of those, I still remember the pain now…

If you've ever tried to read Excel files using SSIS then it's likely you've come across the old 32-bit vs 64-bit driver problem. So in my case, I've got 64-bit Office installed, but when developing SSIS packages using SSDT you can only do it if you've got the 32-bit driver installed. Don't get me started on why…

This wouldn't be a problem, but unfortunately the 32-bit driver will complain if you try and install it, saying something like:-

You cannot install the 32-bit version of Microsoft Access Database Engine as you have 64-bit Office products installed etc…

Well that’s great…

But wait, there is a way to install it without getting any complaints! Open up a command line (usually just type “cmd” and hit Enter) and change directory to wherever you downloaded the driver. Once you’re there just type the name of the file followed by a space then “/quiet” (at least for the latest version of the drivers as of 2019, or “/passive” for older versions of the drivers). So in your command line window/console you should have something like:-

C:\Users\SomeUser\Downloads\AccessDatabaseEngine.exe /quiet

Hit Enter and after about 30 seconds it should have installed; note that it will not tell you that it has finished. Open up your SSIS project and try again. Hopefully this time it should work, with no more errors like:-

The 'Microsoft.ACE.OLEDB.12.0' provider is not registered on the local machine.

So, assuming that you are developing SSIS packages that will need to read Excel files, and you're on a 64-bit version of Windows, when you download the driver you must choose the 32-bit version NOT the 64-bit version!

The latest available version of the driver can be downloaded from here https://www.microsoft.com/en-us/download/details.aspx?id=54920