
Here’s another little fun one, well, at least for me. Maybe not terribly useful, but interesting if nothing else…
When loading data from a file into a table using pyspark I personally like to add some columns at the start of the table just to record when they were loaded, updated (if its a history table for example) or maybe something else. I find this useful when querying the table from a warehouse. Now, of course its quite easy to add some columns to a data frame when you’re loading a file in, maybe something a bit like this:-
from pyspark.sql import functions as F
df = (
spark.read.csv("some file.csv")
.withColumn("DateTimeLoaded", F.current_timestamp())
.withColumn("InputFileName", F.input_file_name())
)
This is fine and all, but the columns are at the end of the data frame on the right hand side, like this:-
| SomeColumn | OtherCol | WhatColumn | DateTimeLoaded | InputFileName |
|---|---|---|---|---|
| a | b | c | ||
| a | x | y |

This triggers my OCD, I would like them at the start, but how? Simple, just take the data frame then select the columns you want, but add yours before you ‘select’ (using a select ‘*’) all the others… imagine it like what you’d do with a good old SQL select statement, so something like “SELECT MyColumn = 123, * FROM SomeTable”. We can use the same logic here…
def add_columns_to_beginning_of_data_frame(df):
from pyspark.sql.types import TimestampType
from pyspark.sql.functions import lit, current_timestamp
return df.select(
# Add a 'DateTimeLoaded' column at the start, with the value
# of current timestamp value (i.e. the current date and time)
lit(current_timestamp()).cast(TimestampType()).alias("DateTimeLoaded"),
# Now add another date time column called 'DateTimeUpdated', but we only
# want it to be NULL for the moment
lit(None).cast(TimestampType()).alias("DateTimeUpdated"),
# Now, all the other existing columns come after ;-)
"*")
Maybe its a bit wasteful and unrequired, and I should be happy with them at the end instead – I’m trying to fight my OCD on this, but it sometimes over powers me 😉
Maybe don’t use it on big file imports… Anyway, I said it would just be a little fun one this time, see you next time!