photo of a we have more signage
Photo by RDNE Stock project on Pexels.com

Here’s another little fun one, well, at least for me. Maybe not terribly useful, but interesting if nothing else…

When loading data from a file into a table using pyspark I personally like to add some columns at the start of the table just to record when they were loaded, updated (if its a history table for example) or maybe something else. I find this useful when querying the table from a warehouse. Now, of course its quite easy to add some columns to a data frame when you’re loading a file in, maybe something a bit like this:-

from pyspark.sql import functions as F

df = (
    spark.read.csv("some file.csv")
        .withColumn("DateTimeLoaded", F.current_timestamp())
        .withColumn("InputFileName", F.input_file_name())
)

This is fine and all, but the columns are at the end of the data frame on the right hand side, like this:-

SomeColumnOtherColWhatColumnDateTimeLoadedInputFileName
abc
axy
road construction
Photo by Connor Forsyth on Pexels.com

This triggers my OCD, I would like them at the start, but how? Simple, just take the data frame then select the columns you want, but add yours before you ‘select’ (using a select ‘*’) all the others… imagine it like what you’d do with a good old SQL select statement, so something like “SELECT MyColumn = 123, * FROM SomeTable”. We can use the same logic here…

def add_columns_to_beginning_of_data_frame(df):
    from pyspark.sql.types import TimestampType
    from pyspark.sql.functions import lit, current_timestamp

    return df.select(
        # Add a 'DateTimeLoaded' column at the start, with the value
        # of current timestamp value (i.e. the current date and time)
        lit(current_timestamp()).cast(TimestampType()).alias("DateTimeLoaded"),

        # Now add another date time column called 'DateTimeUpdated', but we only 
        # want it to be NULL for the moment
        lit(None).cast(TimestampType()).alias("DateTimeUpdated"),

        # Now, all the other existing columns come after ;-)
        "*")

Maybe its a bit wasteful and unrequired, and I should be happy with them at the end instead – I’m trying to fight my OCD on this, but it sometimes over powers me 😉

Maybe don’t use it on big file imports… Anyway, I said it would just be a little fun one this time, see you next time!

Leave a Reply

Discover more from Aventius

Subscribe now to keep reading and get access to the full archive.

Continue reading