I have a dataframe with creation_timestamps and personal_id's. Each personal_id has one or more application_id's. An application_id can have a bunch of rows, but they all have the same creation_timestamp.
I want to create a column that indicates the days between two application_id's for a given personal_id. All the offer rows within the application_id should show the same days between the two application_id's.
Here's the code I've tried to modify in a bunch of ways:
# Calculate days between consecutive applications for each personal_iddf['days_between_applications'] = ( df.groupby(['personal_id'])['creation_timestamp'] .diff() .dt.days)
Here's a preview of the current result:
personal_id | application_id | creation_timestamp | days_between_applications | |
---|---|---|---|---|
1007 | c1de3552cfb0 | e18e5d77199f | 2019-05-16 06:53:57.817842 | NaN |
1008 | c1de3552cfb0 | e18e5d77199f | 2019-05-16 06:53:57.817842 | 0.0 |
1010 | c1de3552cfb0 | e18e5d77199f | 2019-05-16 06:53:57.817842 | 0.0 |
1006 | c1de3552cfb0 | e18e5d77199f | 2019-05-16 06:53:57.817842 | 0.0 |
7094153 | c1de3552cfb0 | 1f64dd61aee2 | 2023-08-07 11:01:45.588173 | 1533.0 |
7094147 | c1de3552cfb0 | 1f64dd61aee2 | 2023-08-07 11:01:45.588173 | 0.0 |
This is what I'm trying to achieve:
personal_id | application_id | creation_timestamp | days_between_applications | |
---|---|---|---|---|
1007 | c1de3552cfb0 | e18e5d77199f | 2019-05-16 06:53:57.817842 | NaN |
1008 | c1de3552cfb0 | e18e5d77199f | 2019-05-16 06:53:57.817842 | NaN |
1010 | c1de3552cfb0 | e18e5d77199f | 2019-05-16 06:53:57.817842 | NaN |
1006 | c1de3552cfb0 | e18e5d77199f | 2019-05-16 06:53:57.817842 | NaN |
7094153 | c1de3552cfb0 | 1f64dd61aee2 | 2023-08-07 11:01:45.588173 | 1533.0 |
7094147 | c1de3552cfb0 | 1f64dd61aee2 | 2023-08-07 11:01:45.588173 | 1533.0 |
I've tried to forward fill, sort and then forward fill, groupby personal_id and application_id etc, but nothing has worked.