I have been struggling to implement the following function.
I would like to first apply a groupBy on customer_name and then, for each group, calculate the Pearson correlation coefficient between price and units. The final DataFrame should have two columns: customer_name and corr_coeff. I would like to use pyspark.ml.stat.Correlation to calculate the coefficient. Could you please help me figure out the code? Here is an example.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CustomFunctionExample").getOrCreate()

# Sample data (replace with your actual data)
data = [
    ("2021-01-06", "a1", "b1", 8.0, 8.0),
    ("2021-03-13", "a1", "b1", 1.0, 0.0),
    ("2021-06-20", "a1", "b5", 2.0, 0.0),
    ("2021-10-27", "a1", "b5", 8.0, 8.0),
    ("2021-01-06", "a1", "b2", 2.0, 2.0),
    ("2021-03-13", "a2", "b2", 9.0, 9.0),
    ("2021-06-06", "a2", "b4", 3.0, 3.0),
    ("2021-10-06", "a2", "b4", 8.0, 8.0),
]
schema = ["date", "customer_name", "upc", "price", "units"]
df = spark.createDataFrame(data, schema)
```

I am expecting a PySpark DataFrame with customer_name and corr_coeff columns.
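To pin down the expected output, here is a plain-Python cross-check of the per-group Pearson coefficients for the sample data above (no Spark required). Note that, as far as I can tell, pyspark.ml.stat.Correlation.corr computes a correlation matrix over a single vector column of the whole DataFrame, so it does not plug into groupBy directly; a per-group alternative such as the built-in pyspark.sql.functions.corr aggregate may end up being simpler. This sketch only establishes the numbers the final DataFrame should contain:

```python
from collections import defaultdict
from math import sqrt

# Same sample rows as above: (date, customer_name, upc, price, units)
data = [
    ("2021-01-06", "a1", "b1", 8.0, 8.0),
    ("2021-03-13", "a1", "b1", 1.0, 0.0),
    ("2021-06-20", "a1", "b5", 2.0, 0.0),
    ("2021-10-27", "a1", "b5", 8.0, 8.0),
    ("2021-01-06", "a1", "b2", 2.0, 2.0),
    ("2021-03-13", "a2", "b2", 9.0, 9.0),
    ("2021-06-06", "a2", "b4", 3.0, 3.0),
    ("2021-10-06", "a2", "b4", 8.0, 8.0),
]

# Group the (price, units) pairs by customer_name
groups = defaultdict(list)
for _, customer, _, price, units in data:
    groups[customer].append((price, units))

def pearson(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

corr = {customer: pearson(pairs) for customer, pairs in groups.items()}
print(corr)  # a2 is exactly 1.0 (price == units); a1 ≈ 0.9849
```

So whatever grouped-correlation approach is used, the resulting DataFrame should contain rows roughly like ("a1", 0.9849) and ("a2", 1.0).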