I have a sample PySpark DataFrame that can be created like this:
```python
sample_df = spark.createDataFrame(
    [
        ('2020-01-01', '2021-01-01', 1),
        ('2020-02-01', '2021-02-01', 1),
        ('2021-01-15', '2022-01-15', 2),
        ('2022-01-15', '2023-01-15', 2),
        ('2022-02-01', '2023-02-01', 3),
        ('2022-03-01', '2023-03-01', 3),
        ('2023-03-01', '2024-03-01', 4),
    ],
    ['item_date', 'max_window', 'expected_grouping_index'],
)
```

After sorting by item_date, I want to assume the first item starts a grouping. Any following item whose item_date is less than or equal to the first item's max_window will be given the same grouping_index. (max_window is always the same fixed number of days added to item_date for the entire df, about 365 days in this example.)
If an item does not fall inside the current grouping, it starts a new grouping and is given another arbitrary grouping_index. All following items are then assessed against that item's max_window, and so on.
The grouping_index is just a means to an end; ultimately I only want to keep the first row in each group.
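
With the sample data above, the desired end result (after dropping everything but the first row of each group) would look like:

```
+----------+----------+-----------------------+
| item_date|max_window|expected_grouping_index|
+----------+----------+-----------------------+
|2020-01-01|2021-01-01|                      1|
|2021-01-15|2022-01-15|                      2|
|2022-02-01|2023-02-01|                      3|
|2023-03-01|2024-03-01|                      4|
+----------+----------+-----------------------+
```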
How can I achieve this without a UDF or converting to a pandas df?
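
For reference, the closest I've gotten is a sketch that folds over the sorted dates with the aggregate higher-order function (Spark 2.4+) to find the group-start dates, then joins back to keep only those rows. It assumes the whole column fits comfortably in a single collect_list and hard-codes the fixed 365-day window, so I'm not sure it's the idiomatic way:

```python
from pyspark.sql import functions as F

# Fold over the sorted dates: a date starts a new grouping only if it falls
# past the previous group-start's max_window (start + 365 days).
group_starts = (
    sample_df
    .agg(F.sort_array(F.collect_list(F.col('item_date').cast('date'))).alias('dates'))
    .select(
        F.expr("""
            aggregate(
                dates,
                cast(array() as array<date>),
                (kept, d) -> IF(
                    size(kept) = 0 OR d > date_add(element_at(kept, -1), 365),
                    concat(kept, array(d)),
                    kept
                )
            )
        """).alias('starts')
    )
    .select(F.explode('starts').alias('start_date'))
)

# Keep only the rows whose item_date begins a grouping.
result = (
    sample_df
    .join(group_starts, F.col('item_date').cast('date') == F.col('start_date'))
    .drop('start_date')
)
```

The obvious downside is that collect_list pulls every date into one task, so this won't scale to large data, and I'd prefer something window-function-based if that's even possible given the sequential dependency between groups.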