List files in a data lake using Azure Synapse serverless SQL (no REST API required)


Ah the magic of the filepath function…

Whilst it's possible to get a list of files from the data lake using the ADLS gen2 REST API, if you're running queries in Synapse serverless SQL you can stay where you are and just use some T-SQL to get a similar result.

Say in your data lake container you’ve got a folder structure something like ‘SomeRootFolder/Organisation/SourceSystem/TableName‘ and under each ‘TableName‘ folder you’ve got some parquet (or CSV maybe) files. We can use the filepath function in conjunction with wildcards to easily get the names of each folder in the path without having to resort to splitting the path and picking out individual parts using string functions.

Note that the parameterised form of this function only works with wildcards (the * characters): filepath(N) returns whatever the Nth wildcard in the BULK path matched, while filepath() with no parameter returns the full file path. The wildcard positions in the path correspond to the filepath parameter, so the first wildcard in the example code below (the one straight after 'SomeRootFolder') corresponds to filepath(1), and so on for the other wildcard positions. In other words, you hard code the base or root folder path up to the point where the folder names start to vary, and use wildcards from there. For a (made-up) file at 'SomeRootFolder/Contoso/CRM/Customers/part-0001.parquet', filepath(1) would return 'Contoso', filepath(2) 'CRM' and filepath(3) 'Customers'.

Note in the code example below I've just used some additional code to sort the results by various columns. This could be useful if you need to process each file in name order or something, but it's not required…

WITH [cteFilePaths] AS (
	SELECT DISTINCT
    [Path] = [result].[filepath]()
    ,[OrganisationName] = [result].[filepath](1)
    ,[SourceSystemName] = [result].[filepath](2)	
    ,[TableName] = [result].[filepath](3)

    FROM
    OPENROWSET(
        BULK 'https://storageaccount.dfs.core.windows.net/container/SomeRootFolder/*/*/*/*.parquet',
        FORMAT = 'PARQUET'
    ) AS [result]
), [cteSort] AS (
	SELECT
	[OrganisationName]
	,[SourceSystemName]
	,[TableName]
	,[Path]
	
	,[SortOrder] = ROW_NUMBER() OVER (
		PARTITION BY 
		[OrganisationName]
		,[SourceSystemName]
		
		ORDER BY 
		[TableName]		
	)
	
	FROM
	[cteFilePaths]
)

SELECT
[OrganisationName]
,[SourceSystemName]
,[TableName]
,[Path]
,[SortOrder]

FROM
[cteSort]
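
If your files are CSV rather than parquet, the same trick should work with the CSV flavour of OPENROWSET. Here's a rough sketch only – the storage URL is a placeholder and the parser/header options are assumptions you'd adjust to match your own files:

SELECT DISTINCT
    [OrganisationName] = [result].[filepath](1)
    ,[SourceSystemName] = [result].[filepath](2)
    ,[TableName] = [result].[filepath](3)

FROM
OPENROWSET(
    BULK 'https://storageaccount.dfs.core.windows.net/container/SomeRootFolder/*/*/*/*.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS [result]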

Hope that comes in useful, well, until next time…

Using Azure Synapse dedicated SQL pool to get a list of columns from a delta table in the data lake


Ok, stick with me for this, I admit it's a little hacky/messy but as (at the time of writing) Synapse dedicated SQL pools don't support proper reading of delta tables from a data lake we have to do some fiddling. There are some posts knocking around from Microsoft with some SQL scripts that you can use as a workaround to load delta tables (see here). However, in my case I was just after a list of the fields, not the data, in a particular delta table, so I could then create a 'matching' table in the dedicated SQL pool database. So this is what I came up with, maybe it'll come in handy for someone…

Whilst there's a fair bit of documentation on delta and how to use it, there's not much I could find on how delta actually stores the metadata it uses to keep track of the changes in the underlying parquet files (short of digging through the GitHub code etc…). Although I must admit I didn't look super hard as it's reasonably straightforward, since it's pretty much a bunch of JSON files (check out the '_delta_log' folder for these). If anyone knows a link to somewhere that details this metadata structure 'officially' please leave a comment below. Anyway, if we dig into these files we can see some of this metadata, and luckily for us part of it is schema information about the fields. We can use this in a rough way to pull out a list of fields and data types. Just for reference, here is a link to the docs for delta.
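
Just to give a rough idea of the shapes involved (these are simplified, made-up examples rather than the contents of any real table), the '_last_checkpoint' file is a tiny JSON document with a 'version' property, and the 'schemaString' inside the checkpoint's 'metaData' entry is itself a JSON document with a 'fields' array. You can poke at both with the standard T-SQL JSON functions:

-- Simplified, made-up examples of the delta log JSON shapes (not from a real table)
DECLARE @lastCheckpointExample NVARCHAR(MAX) = N'{"version":10,"size":25}';
DECLARE @schemaStringExample NVARCHAR(MAX) = N'{"type":"struct","fields":[
    {"name":"Id","type":"integer","nullable":true,"metadata":{}},
    {"name":"CustomerName","type":"string","nullable":true,"metadata":{}}
]}';

-- The checkpoint version number
SELECT JSON_VALUE(@lastCheckpointExample, '$.version') AS [CheckpointVersion];

-- The field names and (delta, not SQL Server) data types
SELECT
[cols].[ColumnName]
,[cols].[DataType]

FROM
OPENJSON(@schemaStringExample) WITH (
    [Columns] NVARCHAR(MAX) '$.fields' AS JSON
) [f]

OUTER APPLY
OPENJSON([f].[Columns]) WITH (
    [ColumnName] NVARCHAR(128) '$.name'
    ,[DataType] NVARCHAR(128) '$.type'
) [cols]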


Small word of warning… I'm not guaranteeing this code will consistently work in the future; like I said at the start of this post, it's a little hacky and messy, but for now it appears to work ok. If the structure of the JSON files changes in the future though, this of course could break. So, consider that fair warning if you want to use this! I accept no blame for anything that goes wrong with the use of the code below…

-- Reads some of the delta table JSON files and gets a list of columns and data
-- types. Note that the types are not SQL server data types so if you want you could
-- add to this code something to translate them into SQL server data types ;-)
DECLARE @storageAccountName VARCHAR(32) = 'nameofstorageaccount'
    ,@containerName VARCHAR(32) = 'nameofcontainer'
    ,@folderPath VARCHAR(8000) = 'Some/Folder/Path/Here'
    ,@deltaTableName VARCHAR(128) = 'deltatablename'
    ,@destinationTableName VARCHAR(128) = NULL -- Set to NULL to just output the result or name of destination table

-- To build T-SQL for the copy, note that Synapse dedicated SQL pools
-- do not support blob data types for dynamic SQL (e.g. NVARCHAR(MAX))
-- so we can only use small strings of T-SQL!
DECLARE @sql NVARCHAR(4000)

-- Create the url pointing to the storage account and folder 
-- location for delta log folder
DECLARE @storageLocationUrl VARCHAR(8000) = CONCAT(
    'https://'
    ,@storageAccountName
    ,'.dfs.core.windows.net/'
    ,@containerName
    ,'/'
    ,@folderPath
    ,'/'
    ,@deltaTableName
    ,'/_delta_log'
)

-- Lets get some information from the last checkpoint file
IF OBJECT_ID('tempdb..#last_checkpoint') IS NOT NULL
DROP TABLE [#last_checkpoint]

CREATE TABLE [#last_checkpoint] (
    [FileContent] NVARCHAR(MAX) NOT NULL
) WITH (
    DISTRIBUTION = ROUND_ROBIN
    ,HEAP
)

SET @sql = ''
SET @sql += '
COPY INTO [#last_checkpoint] FROM ''' + @storageLocationUrl + '/_last_checkpoint'' 
WITH (
 FILE_TYPE = ''CSV''
 ,FIELDTERMINATOR = ''0x0b''
 ,FIELDQUOTE = ''0x0b''
 ,ROWTERMINATOR = ''0x0d''
)
';

PRINT @sql;
EXEC sp_executesql @stmt = @sql;

-- Now pick out the version from the file data
DECLARE @version INT = (
    SELECT
    JSON_VALUE([FileContent], '$.version')

    FROM
    [#last_checkpoint]
)

-- Now we can use this version number to build the filename for the checkpoint file. It
-- has leading zeros so we can use the FORMAT function to sort that out for us ;-)
DECLARE @checkpointFilename VARCHAR(500) = FORMAT(@version, '00000000000000000000') + '.checkpoint.parquet'	

-- Build the SQL command to load the data for the checkpoint file
SET @sql = ''
SET @sql += '
COPY INTO [#checkpoint] 
FROM ''' + @storageLocationUrl + '/' + @checkpointFilename + ''' 
WITH (
 FILE_TYPE = ''PARQUET''
 ,AUTO_CREATE_TABLE = ''ON''
)
';

-- Load the checkpoint data
IF OBJECT_ID('tempdb..#checkpoint') IS NOT NULL
DROP TABLE [#checkpoint]

PRINT @sql;
EXEC sp_executesql @stmt = @sql;	

-- Now we can get the schema string from the checkpoint 'metaData' field (JSON contents)
-- and this contains field information (at least for the specific version)
DECLARE @schemaString NVARCHAR(MAX) = ''

SELECT TOP (1)
@schemaString = JSON_VALUE([metaData], '$.schemaString')

FROM 
[#checkpoint]

WHERE
[metaData] IS NOT NULL

ORDER BY
[metaData]

-- Load into destination table if a name has been supplied. If no destination
-- table name has been supplied then just execute the T-SQL select statement
IF @destinationTableName IS NULL SET @sql = ''
ELSE SET @sql = 'INSERT INTO [' + @destinationTableName + '] '

-- Use some JSON functions to pick out the field names and types ;-)
SET @sql += '
SELECT
[cols].[ColumnName]
,[cols].[DataType]

FROM
OPENJSON(@fieldJSON) WITH (
    [Columns] NVARCHAR(MAX) ''$.fields'' AS JSON
) [f]

OUTER APPLY 
OPENJSON([f].[Columns]) WITH (
    [ColumnName] NVARCHAR(128) ''$.name''
    ,[DataType] NVARCHAR(128) ''$.type''
) [cols]	
';

DECLARE @parameterDefinition NVARCHAR(500) = '@fieldJSON NVARCHAR(MAX)';
EXEC sp_executesql @stmt = @sql
    ,@params = @parameterDefinition
    ,@fieldJSON = @schemaString;
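
As the comment at the top of the script hints, the types you get back are delta/Spark type names rather than SQL Server types. If you wanted to take it a step further, something like the rough sketch below could translate them. It only covers a handful of common types, the SQL Server sizes chosen are assumptions, and it's shown here against a made-up field list rather than wired into the dynamic SQL:

-- Rough, incomplete sketch of mapping delta/Spark type names to SQL Server types.
-- The types covered and the SQL Server sizes chosen are assumptions, adjust to taste.
DECLARE @exampleFields NVARCHAR(MAX) = N'[
    {"ColumnName":"Id","DataType":"long"},
    {"ColumnName":"CustomerName","DataType":"string"},
    {"ColumnName":"CreatedOn","DataType":"timestamp"}
]';

SELECT
[f].[ColumnName]
,[f].[DataType]
,[SqlServerDataType] = CASE [f].[DataType]
    WHEN 'string' THEN 'NVARCHAR(4000)'
    WHEN 'integer' THEN 'INT'
    WHEN 'long' THEN 'BIGINT'
    WHEN 'short' THEN 'SMALLINT'
    WHEN 'double' THEN 'FLOAT'
    WHEN 'float' THEN 'REAL'
    WHEN 'boolean' THEN 'BIT'
    WHEN 'date' THEN 'DATE'
    WHEN 'timestamp' THEN 'DATETIME2(6)'
    ELSE 'NVARCHAR(4000)' -- fallback for anything not covered (e.g. decimal(p,s), nested types)
END

FROM
OPENJSON(@exampleFields) WITH (
    [ColumnName] NVARCHAR(128) '$.ColumnName'
    ,[DataType] NVARCHAR(128) '$.DataType'
) [f]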

Have fun, I’ll be back with some more data platform code, tips and tricks soon…

You’re using Azure Synapse, so where does Microsoft Fabric fit in?


Ah the latest new fancy shiny thing! Last month Microsoft released Fabric to public preview and are currently promoting it hard with all sorts of articles, videos and posts. If you're already using Azure Synapse though, or thinking about using it, what's the story, what should you do, if anything? Well, that's up to you, here I'll just give you my opinions on this topic, just don't blame me either way!

First off I’m going to assume if you’re reading this then you’re aware of Fabric and Synapse, it’s hard to miss if you’re involved in the world of data and data warehousing. If you don’t know what Fabric is, why are you even reading this? 😉 There are tons of articles on the internet and YouTube which explain it all, so I’m not going to go over that here, have a search on the web and enjoy…

This article is targeted more at anyone currently using Azure Synapse or starting to use it, as there are a number of questions I’m sure you’ve got. These are just my questions, but hopefully I’m not alone! Some of them came straight to mind on first hearing of Fabric, others from playing around with the public preview for a while now.


I thought Microsoft were pushing Azure and all its data processing tools, so how come a large chunk of the data platform side of Azure is now being moved outside of it?

I've got to say, the release of Fabric and its approach kind of took me by surprise, and I'm sure I'm not the only one. However, when I say 'by surprise' I mean I was surprised that this wasn't in Azure, but just an evolution of the PowerBI service. Instead of adding this as a new version of Synapse in Azure, they've taken parts of Azure Synapse and built them on top of the PowerBI service. While I remember, the PowerBI service is turning into Microsoft Fabric on the 1st of July, so mark that in your diary.

I can't speak to the reasons behind this decision as I don't work for Microsoft, but I would imagine they're just trying to build on the success of PowerBI and appeal to those users who only use PowerBI and not Azure. I guess this way they're hoping they can tempt people away from their on-prem data warehouses and ETL/ELT processes. They've been kind of doing this with the data flows functionality which came before Fabric. I remember having a similar thought about those back then – confusing…

Not everyone might agree with this decision, I mean I for one never understood why they kept PowerBI sitting outside Azure anyway, I always felt it should have and could have been brought in. The current SaaS approach that they're taking with Fabric could have remained, I just think it could have been more integrated with Azure, considering that lots of organisations have already invested heavily in Azure and its existing data platform technologies like Synapse and data factory.


Is this why Azure Synapse has been floating around in a sort of half built/finished state for so long?

I'd be guessing that the simple answer to this question is yes… This one for me is a little frustrating, as I'm sure it is for other Azure Synapse developers; we've been asking for all these kinds of features like delta, source control etc… for ages and now quietly they've gone and more or less implemented all of them, but in a different place!?! What gives, Microsoft? What makes this even more frustrating is that it's not like Fabric is a totally new product or the next big shiny thing; Fabric is the next version of Synapse (gen3 if you will), and Microsoft have even referred to it as such. Even using Fabric is very similar to Synapse, except there's now less configuration required (as per the SaaS approach).

I'm sure the Microsoft developers were working on Fabric instead of Synapse features. It wouldn't make sense to spend the time implementing the same features in two places; even with a shared code base there'd still be a lot of work. They've also said that Synapse will not be getting any new feature updates from this point onwards, so that's something to keep in mind. Maybe some features already in the works or promised before Fabric was announced could still get released, but I don't know – don't quote me on that! They've also stated that Synapse is not being discontinued (yet), but I'd say the writing is on the wall and it will go the way of the dodo in the not too distant future. Remember all the talk of U-SQL back in the day, anyone?


Hang on, so there’s now three slightly different versions of data factory? Azure data factory, Azure Synapse pipelines and Fabric data factory? Eh?

This is starting to get messy, a lot of the reasons for using the Azure based data factory have now been stripped away, at least in a certain sense. You must admit it does seem a bit odd having three slightly different versions of the same product all over the place. Some of the differences, for example, are datasets and linked services, which have gone from Fabric, replaced with 'connections'. I actually think this is a good move as datasets did seem a bit pointless. However, in terms of the linked services, the Fabric 'connections' which have replaced them are not parameterised (maybe they'll implement that in future releases). So, if you've created a generic set of processing pipelines and parameterised linked services in Azure data factory to do your ETL, how does this translate to Fabric? Erm, it doesn't really… For now at least you'd need to create a different pipeline for each individual data source or source system you're importing from.


What’s the deal with Azure databricks, if all the data platform stuff is now in Fabric but that’s still left behind in Azure?

To be honest I don't know, although I suspect its similarity in terms of functionality to Azure Synapse, which already provides Apache Spark clusters and notebooks to do all sorts of processing, has something to do with it. As with most of the other parts of Azure Synapse, Microsoft have kind of copied and pasted the Synapse Spark experience almost verbatim (with some little improvements like cluster startup times) into Fabric and called it Synapse Data Engineering – bringing databricks in would mean doing this all over again.

There are of course differences between the Azure Synapse Spark experience and databricks, such as more focus on machine learning in some databricks scenarios, plus other bits and bats. Although you could argue that Microsoft are 'countering' this in Fabric with Synapse Data Science. I think the functionality overlap if they brought databricks in would be too confusing, so I can personally see why they left it in Azure. Saying that, we're now left in this strange situation where most of the data platform tools are outside Azure but some are still in it. I repeat, Microsoft, why wasn't Fabric part of Azure and PowerBI brought inside too? That would solve this and some of the other issues. I do think it's fragmented the Microsoft data platform landscape a bit…

One thing for sure is that setting up and configuring databricks can be a royal pain in the ass… sorting out source control and mounting data lakes, just for starters. Whilst in Fabric there is virtually no setup or configuration to worry about. Unless you really need some of the extras in databricks, I'm starting to see a potential future where I don't need to faff around with databricks anymore…


If I’ve just started working with Azure Synapse should I switch over to Fabric?

This is a big one. I'd say that you need to look into the differences between the two, how much work you've put into Synapse testing or prototyping already, plus a potential cost analysis. These are all a little bit up in the air at the moment as Fabric has only recently been released and it's still in public preview (so some bits don't work at all and others sometimes work and sometimes don't). It's still in a good enough state to do an initial PoC, although it is likely some features may change, and some will be added later. You could create something in your PoC that is then rendered pointless by a new feature or the removal of one! Just be aware!

With that in mind, let's have a look at a quick comparison of some key features. We'll have some descriptions, pros and cons, plus a rough score for each one out of 10. I won't go into super detail, but these would be my key points for evaluating:-

Setup and configuration
• Azure Synapse (7/10): Spin up the service and configure the firewall. Some people have had to enable a service provider before creating a workspace.
• Microsoft Fabric (8/10): SaaS, so nothing to set up or configure, it just exists. Some issues connecting to firewalled Azure resources.

Cost
• Azure Synapse (9/10): Dedicated SQL pools have a fixed cost depending on the DWU level chosen. Serverless SQL queries are charged per query. Spark is charged for the size of the pool and how long it's running. Can be cheap if you only use it for short periods. Dedicated SQL can pause and resume to save costs, but this isn't automatic pause and resume like it is with Azure SQL serverless databases, it's manual (pain).
• Microsoft Fabric (8/10): Still early days, but you purchase Fabric capacity, which, if you already have PowerBI capacity, is the same thing. Note that the Fabric capacities have a different SKU name but they do map to each other. Could do better, more clarity required.

Source control
• Azure Synapse (4/10): Some, but quite poor support… any views, procedures and functions in the lake databases and serverless SQL databases are not included. In fact nothing in the serverless SQL databases is under source control. Dedicated SQL pool databases can be kept under source control using Visual Studio database projects.
• Microsoft Fabric (10/10): All objects everywhere (eventually).

On-prem connectivity
• Azure Synapse (7/10): Yes, by installing the self hosted integration runtime service on a machine on-prem. Although you need a separate service – and thus a separate server or machine to install it on – for each Synapse workspace, as they can't share integration runtimes unlike normal Azure data factory.
• Microsoft Fabric (10/10): Much simpler, just uses the on-prem gateway service installed on a server or machine on-prem.

Data warehousing
• Azure Synapse (8/10): You can use the dedicated SQL pool (i.e. the evolution of the old parallel data warehouse) with a limited T-SQL instruction set. Depending on the DWU tier you can do some massive amounts of processing in a short time. You can do some 'logical' data warehousing using the serverless engine, however this cannot update, delete or truncate tables (i.e. files in the data lake). There is limited write functionality to write tables/files to the data lake in both dedicated and serverless, but there isn't much control. Also, the dedicated SQL pool has some issues writing parquet files with correct file extensions (we had to use the REST API to rename files after export!).
• Microsoft Fabric (9/10): Parts of the dedicated SQL pool and serverless engines have been combined into a new iteration of the old Synapse serverless SQL engine. So now we can create tables and update, insert, delete etc… using the underlying delta files. There is still a limited T-SQL instruction set (for now) similar to the dedicated SQL pool, but it's a kind of best of both worlds. One big plus point is that you don't need to write or export your data back to the data lake after processing as all the data is already in the data lake (OneLake).

Client tools
• Azure Synapse (6/10): You can use good old SSMS if you're working with dedicated SQL pool databases. It's a different story with serverless SQL though as it is quite limited, no DML allowed. It's similar for the lake databases too, these can only be built using the Synapse studio designer which is very limited (and is likely to stay that way now). As I mentioned in the source control section, you can add your own procs, views and functions to serverless or lake databases using SSMS or Azure data studio but these objects are not kept under source control.
• Microsoft Fabric (7/10): Currently quite similar to Azure Synapse. You can only use external client tools like SSMS or Azure data studio for DDL, so create or drop tables; you cannot insert, update, delete etc… for this you need to be in Fabric and use the GUI. All the SQL endpoints are just read only for now. The Fabric designers and editors are a bit more evolved than the Azure Synapse studio designer but there is still some work required to make it all feel like a proper IDE or editor. There is talk that Microsoft are working to enable external tools so developers can use their preferred tool of choice eventually, but there's nothing I could find that states what parts of Fabric this applies to (is it just some or all?).

Spark (called Synapse data engineering in Fabric)
• Azure Synapse (8/10): The only real pain here is the startup time of the clusters; it's like it was in data factory where it would take several minutes to spin up a cluster before you could do any work. They fixed this in data factory ages ago, but Synapse still has this issue.
• Microsoft Fabric (9/10): It's almost a copy and paste job for the Spark GUI in Fabric compared to Synapse, and some small improvements have been made, but the big plus here is the issue of cluster startup times, which has been reduced to mere seconds now.

Data factory
• Azure Synapse (7/10): The data factory (or pipelines) element of Azure Synapse is virtually identical to the normal data factory, although it seems to lag behind in terms of certain features. So no global parameters for instance, and the issue with the self hosted integration runtimes and sharing is a real pain. Also a lack of dark mode, which for a lot of developers nowadays is a must. I remember the days when I had three monitors all glaring at me and it felt like I was getting a sun tan 😉 So considering its similarity, but feature lag compared to normal data factory, this is frustrating.
• Microsoft Fabric (8/10): This one is interesting as data factory has largely stayed the same except for a few crucial differences. Microsoft have completely binned off the old data factory Spark powered data flows and replaced them with the 'PowerBI' data flows, calling them 'gen2' data flows. This could be a learning curve for developers who are used to the old data factory data flows. Linked services and datasets have gone, replaced with just 'connections'. For me this is good as datasets were almost pointless, especially if you were doing generic linked services and pipelines. However, currently the connections functionality in Fabric lacks any kind of dynamic functionality or parameterisation, and if this doesn't change it could be a bit of a blocker for some migrations. Two steps forward, one step back I guess.

PowerBI
• Azure Synapse (5/10): Synapse was sold to us as a more integrated data platform that was bringing together several related parts of a modern data platform, which included PowerBI. In reality though it's not much more than viewing PowerBI reports in an iframe and links to download PowerBI desktop. Kind of useful, but not full integration.
• Microsoft Fabric (10/10): PowerBI has had some major integration into the elements of Fabric. Just creating a warehouse or lakehouse for example now automatically creates a PowerBI dataset that incorporates all the tables you add to the warehouse or lakehouse. The real bonus here though is the new direct lake mode: PowerBI can now query the OneLake (the data lake where all the delta tables are) directly, so there isn't a need to import or refresh your data model (or use direct query). This means time saved, no in-memory storage required and very fast query performance. The nightmare of scheduling refreshes is almost at an end!

Totals
• Azure Synapse: 61
• Microsoft Fabric: 79

I'm sure I've probably missed some stuff, let me know if you've got any thoughts on this. However, even taking into account that Fabric is in preview, it's looking very tempting to try a PoC. I would definitely hold off developing and putting anything production ready in it though, as there could be changes before Fabric goes GA. My main concern with Fabric right now is potential cost, which I guess only a PoC would be able to determine.


I’ve got all this Azure Synapse infrastructure, should we migrate?

If you've got all your Azure Synapse processing set up and working now then I'd say there's not much incentive to migrate, at least not right now. Microsoft have promised some Synapse to Fabric migration tools and there's the dedicated SQL pool 'link' to Fabric warehouse feature which might be useful. You could take advantage of that if your organisation uses PowerBI so you don't need to schedule data model refreshes; however, the rest of the functionality Fabric has over Synapse, such as improved source control, won't give you any visible performance improvements. It also remains to be seen whether a high scale DWU dedicated SQL pool can outperform the speed of the data lake (OneLake) or not. It could be very close, or not close at all.

Hope that helps, and if Microsoft are reading, next time speak to me before doing anything 😉

Data Warehousing Snippets – a slightly different ISDATE() function

The T-SQL function ISDATE() is great for determining if a string is a valid date or not. I imagine you've used it many times if you've been writing T-SQL for a long time. However, there is something that can make it act quite unexpectedly… the language settings!

Eh? The language settings? What has that to do with dates???

Well, it has to do with the various formats that constitute a valid date in different parts of the world. For example, in the U.S.A. they often refer to dates by month first followed by day, like April 1st. In the U.K. we're the other way around, so you'd likely see 1st April. Taking this back to T-SQL, take the string '2019-25-01': under a day-first language setting this is a perfectly valid date (year, day, month), while '2019-01-25' is also a perfectly valid date under a month-first setting. The problem is that the ISDATE() function takes the current language into account when deciding. Try the T-SQL below and you'll see which dates are treated as valid. Note that we're using SET LANGUAGE to explicitly set the language (duh!) for the session:-

DECLARE @invalidDateInUK VARCHAR(10) = '2019-25-01';
DECLARE @validDateInUK VARCHAR(10) = '2019-12-01';  
  
SET LANGUAGE italian;  
SELECT 
ISDATE(@invalidDateInUK)
,ISDATE(@validDateInUK)

SET LANGUAGE english;  
SELECT 
ISDATE(@invalidDateInUK)
,ISDATE(@validDateInUK)

Hopefully now you can see what might happen if you don't know (or don't set) the language? Say for example you ran the code below, you would see different results from the ISDATE() function depending on your language setting.

DECLARE @invalidDateInUK VARCHAR(10) = '2019-25-12'; 

SELECT 
ISDATE(@invalidDateInUK)

This of course makes the use of ISDATE() potentially risky if you aren't explicit about the language settings. Is there any way we can improve or fix this? Well yes, there are several options. 'Just set the language explicitly', I hear you cry! Yes, in most cases this should suffice, and you should only write the least amount of code you can to solve the problem. Unfortunately this isn't possible in some scenarios, for example you cannot set the language in a view. We can get around this by using a function. So how can we wrap things up in our own function so that the language doesn't matter? Let's try this:-

CREATE FUNCTION [Chronological].[IsValidDate] (
	@year INT
	,@month INT
	,@day INT
)
RETURNS BIT AS
BEGIN
	-- this function checks a specific year, month and day to see
	-- if it's a valid date

	-- first get the year as a varchar
	DECLARE @yearStr VARCHAR(4) = CAST(@year AS VARCHAR(4))

	-- generate a 2 character leading zero month
	DECLARE @monthStr VARCHAR(2) = CASE
		WHEN @month < 10 THEN '0' + CAST(@month AS VARCHAR(1))
		ELSE CAST(@month AS VARCHAR(2))
	END

	-- generate a 2 character leading zero day
	DECLARE @dayStr VARCHAR(2) = CASE
		WHEN @day < 10 THEN '0' + CAST(@day AS VARCHAR(1))
		ELSE CAST(@day AS VARCHAR(2))
	END

	-- now we can use the standard ISDATE() function
	-- note we are using the YYYYMMDD format of the ISO 8601 
	-- standard here as this should prevent problems with 
	-- dates being interpreted differently depending on the
	-- culture or language settings
	RETURN ISDATE(@yearStr + @monthStr + @dayStr)
END

Pretty simple and straightforward, the important part really is the last bit. We can use one of the ISO 8601 formats with the original ISDATE() function to achieve our goal. The preceding code is really only there to generate the string in that specific format. If we explicitly ask for the year, month and day as simple integer parameters then we're sorted 😉
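
A quick (made-up) example of calling it – the first should return 1 since 2020 was a leap year, the second 0 since 2019 wasn't:

SELECT
[Chronological].[IsValidDate](2020, 2, 29) AS [ShouldBeValid]
,[Chronological].[IsValidDate](2019, 2, 29) AS [ShouldBeInvalid]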

Hope you like this one, it’s not something that you should use if there’s a simpler solution to the problem. However, if you find yourself stuck needing ISDATE() in a view then this, or something like it may be the solution.

Data Warehousing Snippets – Fiscal year function

So, I figured I’d start a little series and post various little (mainly) data warehousing related functions, procedures and other bits of (hopefully) useful T-SQL code. I’ll put these all in a complete Visual Studio SQL project and publish to our GitHub repository (eventually) if you just want to download the whole thing.

Note that for some/most of the series, I will use specific schemas to categorise each function or procedure. This should lead to a nice clean database with a logical and easy to follow structure.

Since our function today is about dates, I've created a schema called 'Chronological' to make it clear what this function – and any other related functions, procedures or tables etc. – is for. If you want to create this schema, just run the T-SQL script below on your database. Or if you prefer, alter the function code further down to use another schema, such as 'dbo'.

CREATE SCHEMA [Chronological]

So, without further ado, the first in the series (and today's snippet) will be about calculating the fiscal year:-

CREATE FUNCTION [Chronological].[FiscalYear] (
	@date DATE
	,@separator VARCHAR(1)
)
RETURNS VARCHAR(7)
AS
BEGIN
	-- first calculate the starting year, if the month is
	-- before April then the starting year is the previous year
	-- otherwise its just the current year
	DECLARE @startingFiscalYear INT = CASE 
		WHEN MONTH(@date) < 4 THEN YEAR(@date) - 1
		ELSE YEAR(@date)
	END	

	-- return our fiscal year, note that we just add 1 to 
	-- the starting year to get the ending year
	RETURN CAST(@startingFiscalYear AS VARCHAR(4)) + @separator + RIGHT(CAST(1 + @startingFiscalYear AS VARCHAR(4)), 2)
END

Note that I've used a DATE data type for the first parameter, although this function will work if you pass a DATETIME or SMALLDATETIME value. The second parameter is the separator you would like to use; it's likely this will be either a '-' or '/' as most fiscal years are displayed like 2019-20 or 2019/20.
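
For example, a couple of (made-up) calls – a date in May 2019 falls in fiscal year 2019-20, while 31st March 2020 still belongs to 2019/20:

SELECT
[Chronological].[FiscalYear]('2019-05-20', '-') AS [FiscalYearWithDash]   -- 2019-20
,[Chronological].[FiscalYear]('2020-03-31', '/') AS [FiscalYearWithSlash] -- 2019/20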

Anyway, that’s it for today. I’ll try and post regular entries in this series so that eventually you should have a decent amount of infrastructure that you could use for all sorts of projects.

T-SQL dynamic calendar in a view

A while back I had a problem crop up where I needed to create a calendar but could only do this in a view. At first I thought, sure no problem, just wheel out the old recursive CTE (common table expression) pattern with some dates like so:-

;WITH cteDates AS (
	SELECT
	CAST('1900-01-01' AS DATE) AS DateValue

	UNION ALL

	SELECT
	DATEADD(DAY, 1, DateValue)

	FROM
	cteDates

	WHERE
	DateValue <= GETDATE()
)

SELECT
DateValue

FROM
cteDates OPTION (MAXRECURSION 32767) -- limit is 32767!

Note that I’m starting at 1st January 1900 (using the ISO 8601 date format of ‘YYYY-MM-DD‘ – see more about T-SQL dates here) as this calendar needs to link to birth dates. Then we finish at today’s date of course.

There is one small problem however: the highest recursion limit you can set with the MAXRECURSION hint is 32767. If you run this query you'll soon see the error message:-

The statement terminated. The maximum recursion 32767 has been exhausted before statement completion.

The CTE solution gets us up to ‘1989-09-18’ (18th September 1989), which would be OK if we were still in the eighties. Of course we could use a much later start date and never hit the limit, but if you need to start at 1st January 1900 and go up to the year 2019 then it IS a problem.

How do we get around this?

In a stored procedure this is not a problem, a simple loop would do the trick. However, remember that we need to write our code as a view, so we're limited by a number of factors, mainly the fact that we can't use variables. Essentially we need another method to dynamically generate rows – enter the T-SQL VALUES clause (strictly speaking, the table value constructor), which should be familiar to all SQL developers.

Err… OK, how does that solve the problem??

First of all (if you didn't know already) we can use VALUES to generate rows of data. However, writing thousands of values out by hand is obviously not going to be dynamic or practical. We must break it down a little further… What we can do is break a date into parts, so:-

  • Year
  • Month
  • Day

So if we can generate years, months and days for all the dates we need that might just work. Something like this, but of course we’d still need over 100 values for all the years, in the example I’ve stopped at 1909 but I’m sure you get the idea:-

SELECT
YearValue

FROM (
	VALUES 
	(1900), (1901), (1902), (1903), (1904), (1905), (1906), (1907), (1908), (1909)
) v(YearValue)

So, we must break down years into something smaller and more manageable. Remember primary school (or elementary school for those folk outside of Britain) when you learned about units, tens, hundreds and thousands?

Oh please no, not maths from school…!

SELECT
Thousands.MilleniumValue
,Hundreds.CenturyValue
,Tens.DecadeValue
,Units.YearValue

,TheYear = Thousands.MilleniumValue 
	+ Hundreds.CenturyValue 
	+ Tens.DecadeValue 
	+ Units.YearValue

FROM (
	VALUES 
	(1000), (2000)
) Thousands(MilleniumValue)

CROSS JOIN (
	VALUES 
	(0), (100), (200), (300), (400), (500), (600), (700), (800), (900)
) Hundreds(CenturyValue)

CROSS JOIN (
	VALUES 
	(0), (10), (20), (30), (40), (50), (60), (70), (80), (90)
) Tens(DecadeValue)

CROSS JOIN (
	VALUES 
	(0), (1), (2), (3), (4), (5), (6), (7), (8), (9)
) Units(YearValue)

ORDER BY
TheYear DESC

Right, things are starting to get interesting. The above script will generate a year sequence using a combination of the units, tens, hundreds and thousands values. Note the CROSS JOINs, which join all values to all other values, so for each 'unit' we get all the values for each 'ten', 'hundred' and 'thousand' value and so on… You may have noticed that this will give us year values from 1000 to 2999, which is a little more than we want, but we can filter those unwanted years out later as you'll see. If we apply the same logic for months and days we end up with:-

SELECT
TheYear = Thousands.MilleniumValue 
	+ Hundreds.CenturyValue 
	+ Tens.DecadeValue 
	+ Units.YearValue

,MonthOfTheYear.MonthValue
,DayOfTheMonth.DayValue

FROM (
	VALUES 
	(1000), (2000)
) Thousands(MilleniumValue)

CROSS JOIN (
	VALUES 
	(0), (100), (200), (300), (400), (500), (600), (700), (800), (900)
) Hundreds(CenturyValue)

CROSS JOIN (
	VALUES 
	(0), (10), (20), (30), (40), (50), (60), (70), (80), (90)
) Tens(DecadeValue)

CROSS JOIN (
	VALUES 
	(0), (1), (2), (3), (4), (5), (6), (7), (8), (9)
) Units(YearValue)

-- 12 months in a year
CROSS JOIN (
	VALUES 
	(1), (2), (3), (4), (5), (6), (7), (8), (9), (10), (11), (12)
) MonthOfTheYear(MonthValue)

-- max of 31 days in any month
CROSS JOIN (
	VALUES
	(1), (2), (3), (4), (5), (6), (7), (8), (9), (10)
	,(11), (12), (13), (14), (15), (16), (17), (18), (19), (20)
	,(21), (22), (23), (24), (25), (26), (27), (28), (29), (30)
	,(31)
) DayOfTheMonth(DayValue)

ORDER BY
TheYear DESC

Now we can put it all together and get a reasonably fast and simple query to generate date values which we use as the basis for a simple calendar view:-

SELECT
DateValue
,DateNumber
,YearValue
,MonthValue
,DayValue

FROM (
	SELECT	
	CAST(CAST(rd.DateNumber AS VARCHAR(8)) AS DATE) AS DateValue
	,rd.DateNumber
	,rd.YearValue
	,rd.MonthValue
	,rd.DayValue

	FROM (
		SELECT
		-- convert our values into an integer e.g. 19000101
		-- we can use this to convert to a date later and/or a calendar key value
		((((Thousands.MilleniumValue + Hundreds.CenturyValue + Tens.DecadeValue + Units.YearValue) * 100) + MonthOfTheYear.MonthValue) * 100) + DayOfTheMonth.DayValue AS DateNumber

		-- add the units, tens, hundreds and thousands to get our year
		,Thousands.MilleniumValue + Hundreds.CenturyValue + Tens.DecadeValue + Units.YearValue AS YearValue
		
		-- month and day values
		,MonthOfTheYear.MonthValue
		,DayOfTheMonth.DayValue

		FROM (
			VALUES 
			(1000), (2000)
		) Thousands(MilleniumValue)

		CROSS JOIN (
			VALUES 
			(0), (100), (200), (300), (400), (500), (600), (700), (800), (900)
		) Hundreds(CenturyValue)

		CROSS JOIN (
			VALUES 
			(0), (10), (20), (30), (40), (50), (60), (70), (80), (90)
		) Tens(DecadeValue)

		CROSS JOIN (
			VALUES 
			(0), (1), (2), (3), (4), (5), (6), (7), (8), (9)
		) Units(YearValue)

		-- 12 months in a year
		CROSS JOIN (
			VALUES 
			(1), (2), (3), (4), (5), (6), (7), (8), (9), (10), (11), (12)
		) MonthOfTheYear(MonthValue)

		-- max of 31 days in any month
		CROSS JOIN (
			VALUES
			(1), (2), (3), (4), (5), (6), (7), (8), (9), (10)
			,(11), (12), (13), (14), (15), (16), (17), (18), (19), (20)
			,(21), (22), (23), (24), (25), (26), (27), (28), (29), (30)
			,(31)
		) DayOfTheMonth(DayValue)

		WHERE
		-- filter out date values we don't need
		((((Thousands.MilleniumValue + Hundreds.CenturyValue + Tens.DecadeValue + Units.YearValue) * 100) + MonthOfTheYear.MonthValue) * 100) + DayOfTheMonth.DayValue 
		BETWEEN 19000101 AND CAST(CONVERT(VARCHAR(8), GETDATE(), 112) AS INT)	
	) rd

	WHERE
	-- only get values where its a valid date!	
	ISDATE(CAST(rd.DateNumber AS VARCHAR(8))) = 1
) d

ORDER BY
d.DateValue DESC

There we go, the basis of a simple calendar but in a view, no table or stored procedure required. The only thing to do now is add more fields for different aspects of the date e.g. fiscal year, fiscal month of the year etc…

P.S. remember to remove the ORDER BY clause if you do make this into a view, either that or change the SELECT into a SELECT TOP (100) PERCENT WITH TIES

Quick T-SQL functions #2

Here's another little function for some string parsing which, whilst basic and limited, has its uses. I needed something to parse some very simple text that we were getting from somewhere as part of an import. We needed to get the string values between a 'start' and 'end' string. It was never going to change format so I didn't need to over-complicate it; the text we were getting was similar to this for example:-

entry:value1:end…entry:value2:end…entry:value3:end

and so on…

CREATE FUNCTION [dbo].[fnFindStringMatchesBetweenSimpleStartAndEndStrings] (
	@text NVARCHAR(MAX)
	,@startString NVARCHAR(1024)
	,@endString NVARCHAR(1024)
)
RETURNS @t TABLE (
	StringMatch NVARCHAR(MAX)
)
AS
BEGIN
	-- we'll use this to store our current 'start index' position			
	DECLARE @startIndex INT = CHARINDEX(@startString, @text)

	-- and this will of course be the end index position
	DECLARE @endIndex INT = CHARINDEX(@endString, @text, @startIndex + LEN(@startString)) 		
	
	-- now loop through and get the rest, until no more are found
	WHILE @startIndex > 0 AND @endIndex > 0
	BEGIN
		-- save our match			
		INSERT INTO @t (
			StringMatch
		) VALUES (
			SUBSTRING(@text, @startIndex + LEN(@startString), @endIndex - (@startIndex + LEN(@startString)))
		)		

		-- next positions
		SET @startIndex = CHARINDEX(@startString, @text, @endIndex)
		SET @endIndex = CHARINDEX(@endString, @text, @startIndex + LEN(@startString))
	END

	-- send back our matches (if any)
	RETURN
END

So to use the function to find the 'value1', 'value2', 'value3' strings in between the 'entry:' and ':end' strings you would use:-

SELECT
StringMatch

FROM
dbo.fnFindStringMatchesBetweenSimpleStartAndEndStrings(N'entry:value1:end...entry:value2:end...entry:value3:end', N'entry:', N':end')

The function just returns a simple table with a single column ‘StringMatch‘ with all the values:-

  • value1
  • value2
  • value3

As I mentioned earlier this function is simple and limited, so you couldn't use it for parsing text that could have nested 'start' and 'end' strings for example. That being said, I am a firm believer in writing the least amount of code to solve a specific purpose, and in this particular case extending the function to parse nested strings would have added a small performance hit for no reason – essentially I would be writing redundant code.

What would you do in this case, would you make it more advanced and write the redundant code or not?

T-SQL snippets #1

Here's another little snippet which comes in handy when you want to know the state of your SQL Server's memory. Using sys.dm_os_sys_memory it's super easy…

SELECT
(total_physical_memory_kb / 1024) / 1024 AS TotalPhysicalMemoryInGB
,(available_physical_memory_kb / 1024) / 1024 AS AvailablePhysicalMemoryInGB
,((total_physical_memory_kb / 1024) / 1024) - ((available_physical_memory_kb / 1024) / 1024) AS UsedPhysicalMemoryInGB
,(total_page_file_kb / 1024) / 1024 AS TotalPageFileInGB
,(available_page_file_kb / 1024) / 1024 AS AvailablePageFileInGB
,system_cache_kb / 1024 AS SystemCacheInMB
,kernel_paged_pool_kb / 1024 AS KernelPagedPoolInMB
,kernel_nonpaged_pool_kb / 1024 AS KernelNonpagedPoolInMB
,system_high_memory_signal_state AS SystemHighMemorySignalState
,system_low_memory_signal_state AS SystemLowMemorySignalState

FROM
master.sys.dm_os_sys_memory

Note the divide by 1024, this just takes the value which is in kilobytes and gives us a value in megabytes since 1024 kilobytes = 1 megabyte. If we divide this figure again by 1024 we’ll get the value in gigabytes since 1024 megabytes = 1 gigabyte.

Note that you'll find some places where values are given slightly differently and/or rounded down, for example… 1000 megabytes = 1 gigabyte. Of course, strictly speaking the binary (base 2) figure of 1024 is the one that applies to memory, and 1000 is just the decimal (base 10) approximation. However, it has become an adopted standard, so commonly you'll find things like hard disk manufacturers listing disk space or drive capacity using the 1000-based figures, whereas some operating systems give the 1024-based figure.
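
If you want to see the difference in actual bytes, here's a quick, purely illustrative query:

-- Purely illustrative: bytes in a 1024-based gigabyte vs a 1000-based one
SELECT
CAST(1024 AS BIGINT) * 1024 * 1024 AS BytesInOneBinaryGigabyte   -- 1,073,741,824
,CAST(1000 AS BIGINT) * 1000 * 1000 AS BytesInOneDecimalGigabyte -- 1,000,000,000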

Why has this happened? Who knows, but I suspect it's probably because it's simpler to remember and calculate with 1000 rather than 1024 🙂

T-SQL stored procedure dependencies in a database

I recently needed to get a list of the tables, views and other stored procedures that each stored procedure in a specific database references. Pretty straightforward, we just need to use some of the system tables/views. There are lots of examples on the web but here's my little take on the subject…

SELECT
ss.name AS StoredProcedureSchema
,so.name AS StoredProcedureName

,sed.referenced_database_name AS ReferencedDatabaseName
,sed.referenced_schema_name AS ReferencedSchemaName
,sed.referenced_entity_name AS ReferencedObjectName
,rso.type_desc AS ReferencedObjectType

FROM
sys.sql_expression_dependencies sed

-- join objects so we can get details like the name of 
-- the stored procedure
INNER JOIN
sys.objects so
ON
sed.referencing_id = so.object_id

-- and its schema
INNER JOIN
sys.schemas ss
ON
so.schema_id = ss.schema_id

-- get information about the objects that
-- the stored procedure is referencing
INNER JOIN
sys.objects rso
ON
sed.referenced_id = rso.object_id

-- and their schema as well
INNER JOIN
sys.schemas rss
ON
rso.schema_id = rss.schema_id

WHERE
so.type = 'P' -- just stored procedures

You'll find that the 'ReferencedDatabaseName' will be NULL if your stored procedures are only referencing objects in the same database.
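
If you ever need the reverse – i.e. what references a particular table or view – the built-in dynamic management function below does the job. A quick sketch, with a made-up table name you'd swap for one of your own:

-- What objects reference a given object? (the table name here is made up)
SELECT
referencing_schema_name
,referencing_entity_name

FROM
sys.dm_sql_referencing_entities('dbo.SomeTable', 'OBJECT')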

There's quite a lot of other information regarding dependencies that can be gleaned from SQL Server, so stay tuned for more little titbits like this soon… 🙂

Quick T-SQL functions #1

Whilst messing around with some T-SQL code for parsing various bits of text, I found I needed a function to convert a string in upper case (e.g. “THIS IS AN EXAMPLE”) to camel case (strictly speaking, Pascal case) – capitalised first letters and no spaces. So we'd end up with “ThisIsAnExample” as the resulting string output.

I thought I’d give the SQL Server 2017 string functions STRING_SPLIT and STRING_AGG a bit of a whirl since I’d not needed to use them for anything new yet. Turns out they helped make something I thought would be a bit of a faff into just a few lines 🙂

It's very basic but does the job and could easily be extended to perform more complex conversions with complicated strings.

CREATE FUNCTION [dbo].[fnConvertStringToCamelCase] (
	@string NVARCHAR(MAX)
	,@separator NCHAR(1)
)
RETURNS NVARCHAR(MAX)
AS
BEGIN
	DECLARE @result NVARCHAR(MAX) = (
		SELECT
		STRING_AGG(Word, '') -- glue rows back together into a single value

		FROM (
			SELECT
			UPPER(LEFT(value, 1)) + RIGHT(LOWER(value), LEN(value) - 1) AS Word

			FROM
			STRING_SPLIT(@string, @separator) -- split string into rows
		) s
	)

	RETURN @result
END
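
A quick (made-up) example call, which should give back 'ThisIsAnExample':

SELECT dbo.fnConvertStringToCamelCase(N'THIS IS AN EXAMPLE', N' ') AS CamelCased

One small caveat worth knowing: neither STRING_SPLIT nor STRING_AGG guarantees ordering on its own, so if the word order absolutely must be preserved for long or unusual strings you may want to add something to enforce it.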