I have a use case where I collect information about every column in a DataFrame (e.g. counting the number of null values in each column):
```python
import pyspark.sql.functions as psf  # count_if requires Spark 3.5+


def fn_count_null_values_for_columns(p_df):
    # One aggregate per source column: the number of null values in that column
    return p_df.select(
        *(psf.count_if(psf.col(c).isNull()).alias(c) for c in p_df.columns)
    )
```

The function as I have written it gathers this information into a single row. However, I would like the function to add some additional information (e.g. a bool that is True if there are no null values in the column).
Thus I was wondering if it is possible to collect the information in columns instead of rows (without having to do an expensive .melt -> .pivot step every time).
That way I could easily calculate another statistic based on an already existing one, and each column could have its own data type.
EDIT: As requested, here is an example of the source data:
```python
cols = ['id', 'col_1', 'col_2', 'col_3', 'col_4']
dats = [['0000', None, None, None, 0],
        ['0001', 10, 2, "Wow", 0],
        ['0002', 20, None, "Fake", 0],
        ['0003', 30, 2, "Test", 0]]
df = spark.createDataFrame(dats, cols)
```

This gives you:
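```
+----+-----+-----+-----+-----+
|  id|col_1|col_2|col_3|col_4|
+----+-----+-----+-----+-----+
|0000| NULL| NULL| NULL|    0|
|0001|   10|    2|  Wow|    0|
|0002|   20| NULL| Fake|    0|
|0003|   30|    2| Test|    0|
+----+-----+-----+-----+-----+
```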
The function (fn_count_null_values_for_columns) creates the following DataFrame:
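```
+---+-----+-----+-----+-----+
| id|col_1|col_2|col_3|col_4|
+---+-----+-----+-----+-----+
|  0|    1|    2|    1|    0|
+---+-----+-----+-----+-----+
```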
Now I want to be able to gather the data directly into columns like this, allowing me to easily add further columns based on previous aggregates (the column names below are just an example):
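```
+-----------+----------+------------+
|column_name|null_count|has_no_nulls|
+-----------+----------+------------+
|         id|         0|        true|
|      col_1|         1|       false|
|      col_2|         2|       false|
|      col_3|         1|       false|
|      col_4|         0|        true|
+-----------+----------+------------+
```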
OR I would also be fine with adding a second row to the result (as long as I don't have to run the whole aggregation again and then union the two rows).
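For reference, the closest I have come so far is the sketch below: aggregate once, collect the single row to the driver, and rebuild the result column-wise (the names fn_null_stats_as_rows, null_count and has_no_nulls are just placeholders). It works, but I am not sure it is the idiomatic way:

```python
import pyspark.sql.functions as psf


def fn_null_stats_as_rows(p_df):
    # Single aggregation pass: one row holding the null count of every column
    counts = p_df.select(
        *(psf.count_if(psf.col(c).isNull()).alias(c) for c in p_df.columns)
    ).first()
    # Rebuild column-wise on the driver: one output row per source column;
    # further statistics can be derived from the counts already collected
    rows = [(c, counts[c], counts[c] == 0) for c in p_df.columns]
    return spark.createDataFrame(rows, ['column_name', 'null_count', 'has_no_nulls'])
```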

