
Ok, stick with me for this. I admit it’s a little hacky/messy, but as (at the time of writing) Synapse dedicated SQL pools don’t support proper reading of delta tables from a data lake, we have to do some fiddling. There are some posts knocking around from Microsoft with SQL scripts you can use as a workaround to load delta tables (see here). However, in my case I was just after a list of the fields, not the data, in a particular delta table, so I could then create a ‘matching’ table in the dedicated SQL pool database. So this is what I came up with; maybe it will come in handy for someone…
Whilst there’s a fair bit of documentation on delta and how to use it, there’s not much I could find on how delta actually stores the metadata it uses to keep track of the changes in the underlying parquet files (short of digging through the GitHub code etc…). Although I must admit I didn’t look super hard, as it’s reasonably straightforward since it’s pretty much a bunch of JSON files (check out the ‘_delta_log’ folder for these). If anyone knows a link to somewhere that details this metadata structure ‘officially’, please leave a comment below. Anyway, if we dig into these files we can see some of this metadata, and luckily for us part of that is schema information about the fields. We can use this in a rough way to pull out a list of fields and data types. Just for reference, here is a link to the docs for delta.
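Just to give you a flavour of what I mean, here’s a heavily trimmed down example of the sort of line you’ll find in one of those ‘_delta_log’ JSON commit files (the column names here are made up purely for illustration, and I’ve cut most of the other properties out):
{"metaData":{"id":"...","schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"CustomerId\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"CustomerName\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":[], ... }}
Notice that the ‘schemaString’ value is itself a JSON document with a ‘fields’ array of names and types, and the same ‘metaData’ information also gets written into the periodic checkpoint parquet files. That ‘fields’ array is the bit the script below picks apart.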

Small word of warning… I’m not guaranteeing this code will consistently work in the future; like I said at the start of this post, it’s a little hacky and messy, but for now it appears to work ok. If the structure of the JSON files changes in the future, this could of course break. So, consider that fair warning if you want to use this! I accept no blame for anything that goes wrong with the use of the code below…
-- Reads some of the delta table JSON files and gets a list of columns and data
-- types. Note that the types are not SQL Server data types, so if you want you could
-- add to this code something to translate them into SQL Server data types ;-)
DECLARE @storageAccountName VARCHAR(32) = 'nameofstorageaccount'
,@containerName VARCHAR(32) = 'nameofcontainer'
,@folderPath VARCHAR(8000) = 'Some/Folder/Path/Here'
,@deltaTableName VARCHAR(128) = 'deltatablename'
,@destinationTableName VARCHAR(128) = NULL -- Set to NULL to just output the result, or supply the name of a destination table
-- To build T-SQL for the copy, note that Synapse dedicated SQL pools
-- do not support blob data types for dynamic SQL (e.g. NVARCHAR(MAX))
-- so we can only use small strings of T-SQL!
DECLARE @sql NVARCHAR(4000)
-- Create the url pointing to the storage account and folder
-- location for delta log folder
DECLARE @storageLocationUrl VARCHAR(8000) = CONCAT(
'https://'
,@storageAccountName
,'.dfs.core.windows.net/'
,@containerName
,'/'
,@folderPath
,'/'
,@deltaTableName
,'/_delta_log'
)
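-- e.g. with the example values above this ends up as something like:
-- https://nameofstorageaccount.dfs.core.windows.net/nameofcontainer/Some/Folder/Path/Here/deltatablename/_delta_log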
-- Let's get some information from the last checkpoint file
IF OBJECT_ID('tempdb..#last_checkpoint') IS NOT NULL
DROP TABLE [#last_checkpoint]
CREATE TABLE [#last_checkpoint] (
[FileContent] NVARCHAR(MAX) NOT NULL
) WITH (
DISTRIBUTION = ROUND_ROBIN
,HEAP
)
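-- The '_last_checkpoint' file is just a single line of JSON, so we (ab)use COPY with
-- CSV settings and field/row terminators (0x0b and 0x0d) that shouldn't appear in the
-- JSON itself, which lands the whole file in one NVARCHAR(MAX) value in our one column table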
SET @sql = ''
SET @sql += '
COPY INTO [#last_checkpoint] FROM ''' + @storageLocationUrl + '/_last_checkpoint''
WITH (
FILE_TYPE = ''CSV''
,FIELDTERMINATOR = ''0x0b''
,FIELDQUOTE = ''0x0b''
,ROWTERMINATOR = ''0x0d''
)
';
PRINT @sql;
EXEC sp_executesql @stmt = @sql;
-- Now pick out the version from the file data
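-- (the file contents look something like {"version":10,"size":25} so JSON_VALUE does the job)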
DECLARE @version INT = (
SELECT
JSON_VALUE([FileContent], '$.version')
FROM
[#last_checkpoint]
)
-- Now we can use this version number to build the filename for the checkpoint file. It
-- has leading zeros, so we can use the FORMAT function to sort that out for us ;-)
DECLARE @checkpointFilename VARCHAR(500) = FORMAT(@version, '00000000000000000000') + '.checkpoint.parquet'
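-- e.g. version 10 becomes '00000000000000000010.checkpoint.parquet'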
-- Build the SQL command to load the data for the checkpoint file
SET @sql = ''
SET @sql += '
COPY INTO [#checkpoint]
FROM ''' + @storageLocationUrl + '/' + @checkpointFilename + '''
WITH (
FILE_TYPE = ''PARQUET''
,AUTO_CREATE_TABLE = ''ON''
)
';
-- Load the checkpoint data
IF OBJECT_ID('tempdb..#checkpoint') IS NOT NULL
DROP TABLE [#checkpoint]
PRINT @sql;
EXEC sp_executesql @stmt = @sql;
-- Now we can get the schema string from the checkpoint 'metaData' field (JSON contents)
-- and this contains field information (at least for the specific version)
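-- (the checkpoint parquet has one row per action, and only one of those rows has the
-- 'metaData' column populated, hence the WHERE filter and TOP (1) below)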
DECLARE @schemaString NVARCHAR(MAX) = ''
SELECT TOP (1)
@schemaString = JSON_VALUE([metaData], '$.schemaString')
FROM
[#checkpoint]
WHERE
[metaData] IS NOT NULL
ORDER BY
[metaData]
-- Load into destination table if a name has been supplied. If no destination
-- table name has been supplied then just execute the T-SQL select statement
IF @destinationTableName IS NULL SET @sql = ''
ELSE SET @sql = 'INSERT INTO [' + @destinationTableName + '] '
-- Use some JSON functions to pick out the field names and types ;-)
SET @sql += '
SELECT
[cols].[ColumnName]
,[cols].[DataType]
FROM
OPENJSON(@fieldJSON) WITH (
[Columns] NVARCHAR(MAX) ''$.fields'' AS JSON
) [f]
OUTER APPLY
OPENJSON([f].[Columns]) WITH (
[ColumnName] NVARCHAR(128) ''$.name''
,[DataType] NVARCHAR(128) ''$.type''
) [cols]
';
DECLARE @parameterDefinition NVARCHAR(500) = '@fieldJSON NVARCHAR(MAX)';
EXEC sp_executesql @stmt = @sql
,@params = @parameterDefinition
,@fieldJSON = @schemaString;
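As the comment at the top of the script says, the types you get back are the delta/Spark ones (‘string’, ‘integer’, ‘timestamp’ and so on) rather than SQL Server data types. If you did want to translate them, one rough starting point (just a sketch, you’d want to flesh out the mappings, pick sensible lengths/precisions for your own data, and remember to double up the quotes if you paste it into the dynamic SQL above) would be a simple CASE expression added to the final SELECT, something like this…
-- Rough delta/Spark type to SQL Server type mapping (illustrative only, not exhaustive;
-- decimal(p,s), arrays, structs etc. would need extra handling)
,CASE [cols].[DataType]
    WHEN 'string' THEN 'NVARCHAR(4000)'
    WHEN 'integer' THEN 'INT'
    WHEN 'long' THEN 'BIGINT'
    WHEN 'short' THEN 'SMALLINT'
    WHEN 'byte' THEN 'TINYINT'
    WHEN 'double' THEN 'FLOAT'
    WHEN 'float' THEN 'REAL'
    WHEN 'boolean' THEN 'BIT'
    WHEN 'date' THEN 'DATE'
    WHEN 'timestamp' THEN 'DATETIME2'
    WHEN 'binary' THEN 'VARBINARY(8000)'
    ELSE 'NVARCHAR(4000)' -- fallback for anything not covered above
END AS [SqlDataType]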
Have fun, I’ll be back with some more data platform code, tips and tricks soon…