
Ah the magic of the filepath function…
Whilst its possible to get a list of files from the data lake using the ADLS gen2 REST API, if you’re running queries in Synapse serverless SQL you can stay where you are and just use some T-SQL to get a similar result.
Say in your data lake container you’ve got a folder structure something like ‘SomeRootFolder/Organisation/SourceSystem/TableName‘ and under each ‘TableName‘ folder you’ve got some parquet (or CSV maybe) files. We can use the filepath function in conjunction with wildcards to easily get the names of each folder in the path without having to resort to splitting the path and picking out individual parts using string functions.
Note this function only works with wildcards (the * characters), it doesn’t return all the folders in the path unless you use it without a parameter! The wildcard positions in the path correspond to the filepath parameter, so the first wildcard in the example code below (after the ‘SomeRootFolder‘) would correspond to filepath(1) and so on for the other positions of the other wildcards. So, you can hard code the main base or root folder path up to the point where you want to then get the different potential names of the folders, this is where you would then use the wildcards.
Note in the code example below I’ve just used some additional code to sort the results by various columns. This could be useful if you need to process each file in name order or something, but its not required…
WITH [cteFilePaths] AS (
SELECT DISTINCT
[Path] = [result].[filepath]()
,[OrganisationName] = [result].[filepath](1)
,[SourceSystemName] = [result].[filepath](2)
,[TableName] = [result].[filepath](3)
FROM
OPENROWSET(
BULK 'https://storageaccount.dfs.core.windows.net/container/SomeRootFolder/*/*/*.parquet',
FORMAT = 'PARQUET'
) AS [result]
), [cteSort] AS (
SELECT
[OrganisationName]
,[SourceSystemName]
,[TableName]
,[Path]
,[SortOrder] = ROW_NUMBER() OVER (
PARTITION BY
[OrganisationName]
,[SourceSystemName]
ORDER BY
[TableName]
)
FROM
[cteFilePaths]
)
SELECT
[OrganisationName]
,[SourceSystemName]
,[TableName]
,[Path]
,[SortOrder]
FROM
[cteSort]
Hope that comes in useful, well, until next time…