Channel: Active questions tagged python - Stack Overflow

Deriving correlation coefficient on a grouped Pyspark dataframe

I have been struggling to implement the following function.

I would like to first apply a groupBy operation on customer_name, and for each group calculate the Pearson correlation coefficient between price and units. The final dataframe should have two columns, customer_name and correlation. I would like to use the pyspark.ml.stat.Correlation library to calculate the correlation coefficient. Could you please help me figure out the code? Here is an example.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CustomFunctionExample").getOrCreate()

# Sample data (replace with your actual data)
data = [
    ("2021-01-06", "a1", "b1", 8.0, 8.0),
    ("2021-03-13", "a1", "b1", 1.0, 0.0),
    ("2021-06-20", "a1", "b5", 2.0, 0.0),
    ("2021-10-27", "a1", "b5", 8.0, 8.0),
    ("2021-01-06", "a1", "b2", 2.0, 2.0),
    ("2021-03-13", "a2", "b2", 9.0, 9.0),
    ("2021-06-06", "a2", "b4", 3.0, 3.0),
    ("2021-10-06", "a2", "b4", 8.0, 8.0),
]
schema = ["date", "customer_name", "upc", "price", "units"]
df = spark.createDataFrame(data, schema)

I am expecting a PySpark dataframe with customer_name and corr_coeff columns.
