Channel: Active questions tagged python - Stack Overflow

PySpark collect per-column count into column instead of row


I have a use case where I try to collect information about every column in a dataframe (e.g. counting the number of None values in each column):

import pyspark.sql.functions as psf

def fn_count_null_values_for_columns(p_df):
    # One count_if per column; the result is a single-row dataframe
    return p_df.select(
        *(psf.count_if(psf.col(c).isNull()).alias(c) for c in p_df.columns)
    )

The function as I have written it gathers said information into a single row. However, I want the function to add some additional information per column (e.g. a bool that is True if there are no null values in the column).

Thus I was wondering whether it is possible to collect the information into columns instead of rows (without having to do an expensive .melt -> .pivot step every time).

That way I could easily calculate another statistic based on an already existing one, and each column could have its own data type.
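To make the idea concrete, here is a minimal sketch in plain Python, assuming the single-row result has been collected to the driver (e.g. via `fn_count_null_values_for_columns(df).first().asDict()`); the `null_counts` dict below hand-copies the counts for the example data, and `stats` is a hypothetical name for the column-oriented result:

```python
# Assumed to be the collected single-row output of the counting function,
# hand-copied here for the example data below.
null_counts = {"id": 0, "col_1": 1, "col_2": 2, "col_3": 1, "col_4": 0}

# Pivot the row into one record per source column and derive a second
# statistic (has_no_nulls) directly from the first one.
stats = [
    {"column": c, "null_count": n, "has_no_nulls": n == 0}
    for c, n in null_counts.items()
]
```

Once the counts live in a column, each additional statistic is just another derived field with its own type.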

EDIT: As requested, here is an example of the source data:

cols = ['id', 'col_1', 'col_2', 'col_3', 'col_4']
dats = [['0000', None, None, None,   0],
        ['0001',   10,    2, "Wow",  0],
        ['0002',   20, None, "Fake", 0],
        ['0003',   30,    2, "Test", 0]]
df = spark.createDataFrame(dats, cols)

This gives you:

[image: Source Data]

The function (fn_count_null_values_for_columns) creates the following dataframe:

[image: Data as a single row]

Now I want to be able to gather the data directly into a column like this (allowing me to easily add further columns based on previous aggregates):

[image: desired result, one row per source column]

OR I would also be fine with adding a second row to the result (provided I don't have to run the whole aggregation function again and then union the two rows).
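For the two-row variant, one way to avoid re-running the aggregation is to compute the counts row once and derive the boolean row from it. A minimal sketch in plain Python, where the dict stands in for the collected counts row (in PySpark the second row would be built from `.first()` and unioned back):

```python
# Counts row, assumed to be computed and collected exactly once.
counts_row = {"id": 0, "col_1": 1, "col_2": 2, "col_3": 1, "col_4": 0}

# Derive the boolean row from the counts without touching the source data.
bools_row = {c: n == 0 for c, n in counts_row.items()}

# Two rows sharing the same columns, analogous to unioning two
# one-row dataframes.
result_rows = [counts_row, bools_row]
```

Note that in Spark the two rows would need a common column type (e.g. casting both to string), since a dataframe column cannot mix integers and booleans.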
