I am now using Pandas to handle some data. After I used group by
in pandas, the simplified DataFrame's format is [MMSI(Vessel_ID), BaseTime, Location, Speed, Course,...].
I use
for MMSI, group in grouped_df: print(MMSI) print(group)
to print the data.
For example, one group of data is:
MMSI BaseDateTime LAT LON SOG COG1507 538007509.0 2022-12-08T00:02:25 49.29104 -123.19135 0.0 9.6 1508 538007509.0 2022-12-08T00:05:25 49.29102 -123.19138 0.0 9.6
I want to add a column which is the time difference of two points.
Below is the Output I want
MMSI BaseDateTime LAT LON SOG COG Time-diff1507 538007509.0 2022-12-08T00:02:25 49.29104 -123.19135 0.0 9.6 3.0(hours)1508 538007509.0 2022-12-08T00:05:25 49.29102 -123.19138 0.0 9.6 Na
So I use the code below to try to get the result:
for MMSI, group in grouped_df: group = group.sort_values(by='BaseDateTime') group['new-time'] = group.shift(-1)['BaseDateTime'] group.dropna() for x in group.index: group.loc[x,'time-diff'] = get_timediff(group.loc[x,'new-time'],group.loc[x,'BaseDateTime']) # A function to calculate the time difference group['GROUP'] = group['time-diff'].fillna(np.inf).ge(2).cumsum() # When Time-diff >= 2hours split them into different group
I can use print to show group['GROUP'] and group['time-diff']. The result is not shown after I tried to visit grouped_df again. There's a warning showing that my group
in grouped_df
is just a copy of a slice from a DataFrame and it recommend me using .loc[row_indexer,col_indexer] = value
instead. But in this case I don't know how to use .loc
to visit the specific [row,col].
At the very beginning, I tried to use
grouped_df['new-time'] = grouped_df.shift(-1)['BaseDateTime'] grouped_df.dropna()
But it shows
'DataFrameGroupBy' object does not support item assignment
Now my solution is create an empty_df and then concatenate the groups in grouped
_df
step by step like this:
df['time-diff'] = pd.Series(dtype='float64')df['GROUP'] = pd.Series(dtype='int')grouped_df = df.groupby('MMSI')for MMSI, group in grouped_df: # ... as the same as the code above group = group.sort_values(by='BaseDateTime') group['new-time'] = group.shift(-1)['BaseDateTime'] group.dropna() for x in group.index: group.loc[x,'time-diff'] = get_timediff(group.loc[x,'new-time'],group.loc[x,'BaseDateTime']) # A function to calculate the time difference group['GROUP'] = group['time-diff'].fillna(np.inf).ge(2).cumsum() # ... as the same as the code above frame = [empty_df, group] empty_df = pd.concat(frames)
I am not satisfied with this solution but I didn't find a proper way to change the value in grouped_df
.
I'm now trying to use the solution from this question to handle the DataFrame before group by.
Can someone help me?