I'd like to ask around if anyone knows how to do rolling indexing in polars?I have personally tried a few solutions which did not work for me (I'll show them below):
What I'd like to do: Indexing the number of occurrences within the past X days by NameExample: Let's say I'd like to index occurrences within the past 2 days:
| Name | Date | Counter |
|---|---|---|
| John | 1 Jan 23 | 1 |
| John | 1 Jan 23 | 2 |
| John | 1 Jan 23 | 3 |
| John | 1 Jan 23 | 4 |
| John | 2 Jan 23 | 5 |
| John | 2 Jan 23 | 6 |
| John | 2 Jan 23 | 7 |
| John | 2 Jan 23 | 8 |
| John | 3 Jan 23 | 5 |
| John | 3 Jan 23 | 6 |
| New Guy | 1 Jan 23 | 1 |
In this case, the counter resets to "1" starting from the past X days (e.g. for 3 Jan 23, it starts "1" from 2 Jan 23), or if a new name is detected
What I've tried:
df.groupby_rolling(index_column='Date', period='2d', by='Name', check_sorted=False).agg((pl.col("Date").rank(method='ordinal')).alias("Counter"))The above does not work because it outputs:
| Name | Date | Counter |
|---|---|---|
| John | 1 Jan 23 | 1,2,3,4 |
| John | 1 Jan 23 | 1,2,3,4 |
| John | 1 Jan 23 | 1,2,3,4 |
| John | 1 Jan 23 | 1,2,3,4 |
| John | 2 Jan 23 | 1...8 |
| John | 2 Jan 23 | 1...8 |
| John | 2 Jan 23 | 1...8 |
| John | 2 Jan 23 | 1...8 |
| John | 3 Jan 23 | 1...6 |
| John | 3 Jan 23 | 1...6 |
| New Guy | 1 Jan 23 | 1 |
df.with_columns( Counter=pl.col("mask").rolling_sum(window_size='2d', by="Date") )Where I made a column "Mask" which is just a column of "1"s, and tried to sum them, but it outputs:
| Name | Date | Mask | Counter |
|---|---|---|---|
| John | 1 Jan 23 | 1 | 4 |
| John | 1 Jan 23 | 1 | 4 |
| John | 1 Jan 23 | 1 | 4 |
| John | 1 Jan 23 | 1 | 4 |
| John | 2 Jan 23 | 1 | 8 |
| John | 2 Jan 23 | 1 | 8 |
| John | 2 Jan 23 | 1 | 8 |
| John | 2 Jan 23 | 1 | 8 |
| John | 3 Jan 23 | 1 | 6 |
| John | 3 Jan 23 | 1 | 6 |
And it also cannot handle "New Guy" correctly because rolling_sum cannot do by="Name", "Date"
df.with_columns(Counter = pl.col("Date").rank(method='ordinal').over(["Name", "Date"]) )The above code works correctly, but can only be used for indexing within the same day (i.e. period="1d")
Additional Notes: I also did this in Excel, and also using a brute/raw method of using a "for"-loop. Both worked perfectly, however they struggled with huge amounts of data.
What I read:Some references to help in answers: (Most didn't work because they have fixed rolling window instead of a dynamic window by "Date")
How to implement rolling rank in Polars version 0.19
https://github.com/pola-rs/polars/issues/4808
How to do group_by_rolling grouped by day by hour in polars in Python?
How to groupby and rolling in polars?
https://docs.pola.rs/py-polars/html/reference/series/api/polars.Series.rank.html
https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.groupby_rolling.html