I have the following DataFrame and, using PySpark, I'm trying to get the following answers:
1. Total Fare by Pick
2. Total Tip by Pick
3. Avg Drag by Pick
4. Avg Drag by Drop
| Pick | Drop | Fare | Tip | Drag |
|---|---|---|---|---|
| 1 | 1 | 4.00 | 4.00 | 1.00 |
| 1 | 2 | 5.00 | 10.00 | 8.00 |
| 1 | 2 | 5.00 | 15.00 | 12.00 |
| 3 | 2 | 11.00 | 12.00 | 17.00 |
| 3 | 5 | 41.00 | 25.00 | 13.00 |
| 4 | 6 | 50.00 | 70.00 | 2.00 |
My query so far looks like this:

```python
from pyspark.sql import functions as func

df.groupBy('Pick', 'Drop') \
    .agg(
        func.sum('Fare').alias('FarePick'),
        func.sum('Tip').alias('TipPick'),
        func.avg('Drag').alias('AvgDragPick'),
        func.avg('Drag').alias('AvgDragDrop')) \
    .orderBy('Pick').show()
```

However, I don't think this is correct. I'm a bit stuck on (4) in particular, because grouping by both `Pick` and `Drop` doesn't seem right. Can anyone suggest a correction?