I have the following setup:
I have sparse information about queries hitting my endpoint at certain points in time, stored in a CSV file. I parse this CSV file with the dates in the index column, using date_format='ISO8601'.
What I want to do is count the queries in fixed intervals and put the counts into a DataFrame that shows, from start date to end date, how many queries hit the endpoint in each interval.
The problem: using resample() I can aggregate and count the queries in the time intervals that contain data. But I can't find a way to make the result always stretch from the start date to the end date, with empty intervals filled with 0 by default.
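To make the target concrete, here is a hand-built sketch of the result I am after. It is not produced by resample(). It assumes month-end labels from April 2023 through March 2024 and the three sample queries shown in the CSV further down, all of which fall in March 2024. The names month_ends and expected are only for illustration.

```python
import pandas as pd

# Month-end labels covering the whole window (assumed: Apr 2023 .. Mar 2024, UTC)
month_ends = pd.date_range(
    start="2023-04-30", end="2024-03-31", freq=pd.offsets.MonthEnd(), tz="UTC"
)

# Every interval defaults to 0 queries
expected = pd.Series(0, index=month_ends, name="query")

# The three sample queries all fall in the last interval (March 2024)
expected.iloc[-1] = 3

print(expected)
```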
I tried a combination of reindexing and resampling:
csv:
    datetime,user,query
    2024-03-02T00:00:00Z,user1,query1
    2024-03-18T03:45:00Z,user1,query2
    2024-03-31T12:01:00Z,user1,query3
myscript.py:
    df = pd.read_csv(infile, sep=',', index_col='datetime', date_format='ISO8601', parse_dates=True)
    df_timerange = df[start_date:end_date]
    df_period = pd.date_range(start=start_date, end=end_date, freq='1M')
    df_sampled = df_timerange['query'].resample('1M').count().fillna(0)
    df_sampled = df_timerange.reindex(df_period)
However, this just produces a DataFrame whose index dates range from 2023-04-30T07:37:39.750Z to 2024-03-31T07:37:39.750Z at a frequency of 1 month, but the original data from the CSV (df_timerange) is somehow not represented: all values are NaN. I also wonder why the dates start at this odd time of day, 07:37:39.750. My guess is that the reindexing didn't hit the exact timestamps where df_timerange contains values, so they were just skipped? Or the timezone generated by pd.date_range() is not ISO8601 and this causes a mismatch. Again, I'm not experienced enough with pandas DataFrames to make sense of it.
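To check the first guess in isolation, here is a minimal sketch. It assumes the three CSV rows are rebuilt in memory instead of read from a file, and it uses pd.offsets.MonthEnd() as a stand-in for the '1M' alias. As far as I can tell, date_range() carries the time of day of start into every generated label, and reindex() only keeps rows whose labels match exactly, which would explain both the odd times and the NaN values.

```python
import pandas as pd

# In-memory stand-in for the CSV rows (assumption: same three rows as above)
idx = pd.to_datetime(
    ["2024-03-02T00:00:00Z", "2024-03-18T03:45:00Z", "2024-03-31T12:01:00Z"],
    utc=True,
)
df = pd.DataFrame(
    {"user": ["user1"] * 3, "query": ["query1", "query2", "query3"]},
    index=idx,
)

start = pd.Timestamp("2023-04-15T04:01:40Z")
end = pd.Timestamp("2024-04-15T00:00:00Z")

# Month-end labels; the time of day of `start` is kept on every label
period = pd.date_range(start=start, end=end, freq=pd.offsets.MonthEnd())
print(period[0])

# None of these labels equals an existing index value, so every row is NaN
reindexed = df.reindex(period)
print(reindexed["query"].isna().all())
```

If this is what happens, both timestamps in the comparison are timezone-aware UTC, so the mismatch would come from the label values themselves and not from the timezone.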
Minimal reproducible example:
Run this with Python 3.11:
    from datetime import datetime, timezone
    import pandas as pd

    start_date = datetime(2023, 4, 15, 4, 1, 40, tzinfo=timezone.utc)
    end_date = datetime(2024, 4, 15, 0, 0, 0, tzinfo=timezone.utc)

    df = pd.read_csv('test.csv', sep=',', index_col='datetime', date_format='ISO8601', parse_dates=True)
    df_timerange = df[start_date:end_date]
    df_period = pd.date_range(start=start_date, end=end_date, freq='1M')
    df_sampled = df_timerange['query'].resample('1M').count().fillna(0)
    df_sampled = df_timerange.reindex(df_period)
    print(df_sampled)