Using Azure Synapse dedicated SQL pool to get a list of columns from a delta table in the data lake

Ok, stick with me for this, I admit it's a little hacky/messy, but as (at the time of writing) Synapse dedicated SQL pools don't support proper reading of delta tables from a data lake, we have to do some fiddling. There are some posts knocking around from Microsoft with SQL scripts that you can use as a workaround to load delta tables (see here). However, in my case I was just after a list of the fields in a particular delta table, not the data, so I could then create a 'matching' table in the dedicated SQL pool database. So this is what I came up with, maybe it'll come in handy for someone…

Whilst there's a fair bit of documentation on delta and how to use it, there's not much I could find on how delta actually stores the metadata it uses to keep track of the changes in the underlying parquet files (short of digging through the GitHub code etc…). Although I must admit I didn't look super hard, as it's reasonably straightforward: it's pretty much a bunch of JSON files (check out the '_delta_log' folder for these). If anyone knows a link to somewhere that details this metadata structure 'officially', please leave a comment below. Anyway, if we dig into these files we can see some of this metadata, and luckily for us part of that is schema information about the fields. We can use this in a rough way to pull out a list of fields and data types. Just for reference, here is a link to the docs for delta.
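
Just to give a rough idea of what's in there (this is an illustrative, trimmed-down example with made-up values rather than anything official, and in the real files each entry sits on a single line rather than being pretty-printed like this), a 'metaData' entry in one of those '_delta_log' JSON files looks something like the below, and it's the 'schemaString' value that we're after:

{
    "metaData": {
        "id": "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
        "format": { "provider": "parquet", "options": {} },
        "schemaString": "{\"type\":\"struct\",\"fields\":[{\"name\":\"Id\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"SomeText\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}",
        "partitionColumns": [],
        "configuration": {}
    }
}

The same 'schemaString' also turns up in the 'metaData' column of the checkpoint parquet files, and that's what the script further down actually reads.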

Small word of warning… I'm not guaranteeing this code will keep working in the future; like I said at the start of this post, it's a little hacky and messy, but for now it appears to work ok. If the structure of the JSON files changes in the future, this could of course break. So, consider that fair warning if you want to use this! I accept no blame for anything that goes wrong with the use of the code below…

-- Reads some of the delta table JSON files and gets a list of columns and data
-- types. Note that the types are not SQL server data types, so if you want you could
-- add to this code something to translate them (there's a rough sketch of this after the script) ;-)
DECLARE @storageAccountName VARCHAR(32) = 'nameofstorageaccount'
    ,@containerName VARCHAR(32) = 'nameofcontainer'
    ,@folderPath VARCHAR(8000) = 'Some/Folder/Path/Here'
    ,@deltaTableName VARCHAR(128) = 'deltatablename'
    ,@destinationTableName VARCHAR(128) = NULL -- Set to NULL to just output the result, or set to the name of a destination table to insert into

-- To build T-SQL for the copy, note that Synapse dedicated SQL pools
-- do not support blob data types for dynamic SQL (e.g. NVARCHAR(MAX))
-- so we can only use small strings of T-SQL!
DECLARE @sql NVARCHAR(4000)

-- Create the url pointing to the storage account and folder 
-- location for delta log folder
DECLARE @storageLocationUrl VARCHAR(8000) = CONCAT(
    'https://'
    ,@storageAccountName
    ,'.dfs.core.windows.net/'
    ,@containerName
    ,'/'
    ,@folderPath
    ,'/'
    ,@deltaTableName
    ,'/_delta_log'
)
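-- e.g. https://nameofstorageaccount.dfs.core.windows.net/nameofcontainer/Some/Folder/Path/Here/deltatablename/_delta_log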

-- Let's get some information from the last checkpoint file
IF OBJECT_ID('tempdb..#last_checkpoint') IS NOT NULL
DROP TABLE [#last_checkpoint]

CREATE TABLE [#last_checkpoint] (
    [FileContent] NVARCHAR(MAX) NOT NULL
) WITH (
    DISTRIBUTION = ROUND_ROBIN
    ,HEAP
)

SET @sql = ''
SET @sql += '
COPY INTO [#last_checkpoint] FROM ''' + @storageLocationUrl + '/_last_checkpoint'' 
WITH (
 FILE_TYPE = ''CSV''
 ,FIELDTERMINATOR = ''0x0b''
 ,FIELDQUOTE = ''0x0b''
 ,ROWTERMINATOR = ''0x0d''
)
';

PRINT @sql;
EXEC sp_executesql @stmt = @sql;

-- Now pick out the version from the file data
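-- (the _last_checkpoint file is just a single line of JSON, something like
-- {"version":10,"size":25}, so JSON_VALUE can pull the version out for us)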
DECLARE @version INT = (
    SELECT
    JSON_VALUE([FileContent], '$.version')

    FROM
    [#last_checkpoint]
)

-- Now we can use this version number to build the filename for the checkpoint file. It
-- has leading zeros so we can use the FORMAT function to sort that out for us ;-)
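-- e.g. version 10 becomes '00000000000000000010.checkpoint.parquet'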
DECLARE @checkpointFilename VARCHAR(500) = FORMAT(@version, '00000000000000000000') + '.checkpoint.parquet'	

-- Build the SQL command to load the data for the checkpoint file
SET @sql = ''
SET @sql += '
COPY INTO [#checkpoint] 
FROM ''' + @storageLocationUrl + '/' + @checkpointFilename + ''' 
WITH (
 FILE_TYPE = ''PARQUET''
 ,AUTO_CREATE_TABLE = ''ON''
)
';

-- Load the checkpoint data
IF OBJECT_ID('tempdb..#checkpoint') IS NOT NULL
DROP TABLE [#checkpoint]

PRINT @sql;
EXEC sp_executesql @stmt = @sql;	

-- Now we can get the schema string from the checkpoint 'metaData' field (JSON contents)
-- and this contains field information (at least for the specific version)
DECLARE @schemaString NVARCHAR(MAX) = ''

SELECT TOP (1)
@schemaString = JSON_VALUE([metaData], '$.schemaString')

FROM 
[#checkpoint]

WHERE
[metaData] IS NOT NULL

ORDER BY
[metaData]

-- Load into destination table if a name has been supplied. If no destination
-- table name has been supplied then just execute the T-SQL select statement
IF @destinationTableName IS NULL SET @sql = ''
ELSE SET @sql = 'INSERT INTO [' + @destinationTableName + '] '

-- Use some JSON functions to pick out the field names and types ;-)
SET @sql += '
SELECT
[cols].[ColumnName]
,[cols].[DataType]

FROM
OPENJSON(@fieldJSON) WITH (
    [Columns] NVARCHAR(MAX) ''$.fields'' AS JSON
) [f]

OUTER APPLY 
OPENJSON([f].[Columns]) WITH (
    [ColumnName] NVARCHAR(128) ''$.name''
    ,[DataType] NVARCHAR(128) ''$.type''
) [cols]	
';

DECLARE @parameterDefinition NVARCHAR(500) = '@fieldJSON NVARCHAR(MAX)';
EXEC sp_executesql @stmt = @sql
    ,@params = @parameterDefinition
    ,@fieldJSON = @schemaString;
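
As the comment at the top of the script says, the types that come back are delta/Spark type names ('integer', 'string', 'timestamp' and so on) rather than SQL server data types. If you do want to translate them, a rough starting point (this is only a sketch, it doesn't handle things like decimals or nested types) could be a simple CASE expression, something like this:

-- Rough sketch only: translates a few common delta/Spark type names into
-- SQL server type names, anything it doesn't recognise falls back to NVARCHAR(4000)
DECLARE @deltaType NVARCHAR(128) = 'integer'

SELECT CASE @deltaType
    WHEN 'integer' THEN 'INT'
    WHEN 'long' THEN 'BIGINT'
    WHEN 'short' THEN 'SMALLINT'
    WHEN 'byte' THEN 'TINYINT'
    WHEN 'float' THEN 'REAL'
    WHEN 'double' THEN 'FLOAT'
    WHEN 'boolean' THEN 'BIT'
    WHEN 'date' THEN 'DATE'
    WHEN 'timestamp' THEN 'DATETIME2'
    WHEN 'binary' THEN 'VARBINARY(8000)'
    WHEN 'string' THEN 'NVARCHAR(4000)'
    ELSE 'NVARCHAR(4000)'
END AS [SqlServerDataType]

You could wrap the same sort of CASE around the [DataType] column inside the dynamic select above, just keep an eye on that 4000 character limit on the dynamic SQL.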

Have fun, I’ll be back with some more data platform code, tips and tricks soon…

Azure data factory – rename a file in the data lake using the gen2 REST API

I recently needed to rename a file in a data lake, and ideally I wanted to do it using Azure data factory as it was part of a large import running in data factory. Searching the web there are quite a few posts about how to use the ADLS gen2 REST API to rename files in a data lake, however I struggled to find any examples of how to do this using Azure data factory and a web activity. I got stung on a number of things in data factory whilst trying to set this up, mostly related to authorisation and the odd missing slash! So I figured I'd do a quick post here to help anyone else encountering similar issues and hopefully save others some of the pain I experienced 😉

Before we go any further, for reference purposes we're using the ADLS gen2 REST API, more specifically the 'path create' method. See the link here to the Microsoft reference documentation: https://learn.microsoft.com/en-us/rest/api/storageservices/datalakestoragegen2/path/create
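
To put that in context before we get to data factory, a rename is really just a single PUT request against the NEW file location, with the OLD location passed in the 'x-ms-rename-source' header. Stripped right back (placeholder names here, and the bearer token is handled for us later by the managed identity) the raw request looks roughly like this:

PUT https://<storageaccount>.dfs.core.windows.net/<container>/<new/path/to/file> HTTP/1.1
x-ms-rename-source: /<container>/<old/path/to/file>
x-ms-version: 2021-10-04
Authorization: Bearer <token>

The rest of this post is really just about getting data factory's web activity to issue that request for us.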

Okay, now that's out of the way, let's look at the authorisation part. The easiest way I found to do this with data factory is to just use the managed identity of the data factory you're working with. To do this you need to go to the data lake in question and add the relevant data factory identity as a 'storage blob data contributor' so it can read/write/create the files. On my particular lake the data factory also has 'contributor' permissions as it's doing a lot more than just renaming files.

Now back to data factory. In a new pipeline I've added the following parameters for this example. Hopefully the names are self-explanatory! The only thing to note is that the file path parameters should not contain the name of the container/filesystem; they are just the paths underneath it. So you might have the following parameter values, for example:

  • StorageAccountName = mydatalake
  • ContainerName = somecontainername
  • OldFilePath = somefolder/subfolder/myfile.csv
  • NewFilePath = somefolder/subfolder/newfilename.csv

With the parameters in place, we add a 'web activity' onto the pipeline canvas. I've called mine 'Rename file in datalake' but of course you could name it whatever you like. Next let's change some of the web activity's settings:

  • URL = this is the full URL of the new file in the data lake, NOT the location of the current existing file!
  • Method = this is the web method for the request, we need to use 'PUT'
  • Body = this particular request does not require a body, however data factory complains about an empty body so we use a workaround. Click 'add dynamic content' under the Body value and set the expression to @toLower('') and this will produce an empty body and stop data factory complaining 🙂
  • Authentication = set this to use the 'System Assigned Managed Identity' as this will make the web request to the API as the data factory's own managed identity (make sure you've added permissions for the data factory's account on the data lake!) and we don't need to worry about getting or setting tokens and all that jazz
  • Resource = set this to 'https://storage.azure.com'
  • Headers = we can get away with just a single header here, the 'x-ms-rename-source' header. This is the location of the current existing file. In addition, and although not technically required for the call to work, we should probably add the API version header in case things change in the future. If you want to do this, add another header called 'x-ms-version' and set the value to the current version shown in the Microsoft documentation link given previously (at the time of writing the API version is '2021-10-04')

Now we'll take a look at the URL setting and the dynamic expression we're using to build the URL. Here I've used the 'concat' function to build the string value we need:
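
Something along these lines should do the trick (this is a sketch built from the pipeline parameters shown earlier rather than a copy of the exact expression, so adjust the names to suit your own pipeline):

@concat('https://', pipeline().parameters.StorageAccountName, '.dfs.core.windows.net/', pipeline().parameters.ContainerName, '/', pipeline().parameters.NewFilePath)

With the example parameter values above, that resolves to https://mydatalake.dfs.core.windows.net/somecontainername/somefolder/subfolder/newfilename.csv.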

Now, we need a slightly different expression for the 'x-ms-rename-source' header value. This is not a full URL, it's just the location of the existing file (but it includes the name of the container too). See below:
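
Again this is a sketch of the sort of expression you'd use (built from the same pipeline parameters), the important bits being the leading slash and the container name at the front:

@concat('/', pipeline().parameters.ContainerName, '/', pipeline().parameters.OldFilePath)

With the example parameter values above, that gives /somecontainername/somefolder/subfolder/myfile.csv. Miss that leading slash off and you'll be scratching your head like I was!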

That's it. As long as you've set the correct permissions for the data factory to access the data lake and got the file locations and URLs correct, you're good to go. Hope that helps someone, leave a comment if I've missed anything or if you found this useful. See you next time…