I want to derive derive_this_column
from the duplicate_type and lat_lng
columns.
The duplicate_type column is built by finding the duplicate values of lat_lng
and marking them as first_duplicate and last_duplicate
based on their order of occurrence.
However, the first row and the last row have the same lat_lng pair, so they are marked as first_duplicate and last_duplicate. Within that range there are two more first_duplicate/last_duplicate pairs, which cause a data discrepancy. So whenever a first_duplicate/last_duplicate pair falls inside a first_duplicate/last_duplicate range whose two ends have the same lat_lng value, I want to mark the inner pair as null.
I tried the code below, but it sets everything to null.
    import pandas as pd

    RK_df['first_duplicate'] = RK_df['duplicate_type'].eq('first_duplicate') & RK_df['duplicate_type'].notna()
    RK_df['last_duplicate'] = RK_df['duplicate_type'].eq('last_duplicate') & RK_df['duplicate_type'].notna()

    # Find pairs of first and last duplicates with the same lat_lng
    duplicates_within_range = RK_df.groupby('lat_lng')['first_duplicate', 'last_duplicate'].transform('sum')

    # Mark duplicates_within_range as null if both first_duplicate and last_duplicate are present
    RK_df['derive_this_column'] = RK_df.apply(
        lambda row: 'null'
        if row['first_duplicate'] and row['last_duplicate']
        and duplicates_within_range.loc[row.name, 'first_duplicate'] > 1
        else row['duplicate_type'],
        axis=1
    )

    # Drop the intermediate columns used for calculation
    RK_df.drop(['first_duplicate', 'last_duplicate'], axis=1, inplace=True)
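
To make the intended rule concrete, here is a minimal sketch of one possible reading of it, not a definitive fix. It assumes that each marked lat_lng value has exactly one first_duplicate row and one last_duplicate row, that row order defines the range, and that a pair is "inside" another when both of its rows lie strictly between the outer pair's rows. The sample frame and the helper name null_nested_pairs are made up for illustration, since the real RK_df is not shown. The sketch also writes a missing value rather than the string 'null'.

```python
import pandas as pd

# Hypothetical sample data (the real RK_df is not shown in the question).
# A spans rows 0-5 and encloses the B and C pairs; D is a separate pair.
RK_df = pd.DataFrame({
    "lat_lng": ["A", "B", "B", "C", "C", "A", "D", "D"],
    "duplicate_type": [
        "first_duplicate", "first_duplicate", "last_duplicate",
        "first_duplicate", "last_duplicate", "last_duplicate",
        "first_duplicate", "last_duplicate",
    ],
})


def null_nested_pairs(df: pd.DataFrame) -> pd.Series:
    """Return duplicate_type with pairs nested inside another pair set to missing."""
    # Work on positional row numbers so any original index is supported.
    work = df.reset_index(drop=True)
    marked = work[work["duplicate_type"].isin(["first_duplicate", "last_duplicate"])]

    # One (start, end) row span per lat_lng value.
    positions = marked.index.to_series()
    spans = positions.groupby(marked["lat_lng"]).agg(start="min", end="max")

    # A pair is nested if some other pair starts before it and ends after it.
    # This comparison is quadratic in the number of distinct marked lat_lng values.
    nested_keys = [
        inner.Index
        for inner in spans.itertuples()
        if any(
            outer.start < inner.start and inner.end < outer.end
            for outer in spans.itertuples()
        )
    ]

    # Keep duplicate_type except on rows that belong to a nested pair.
    result = work["duplicate_type"].where(~work["lat_lng"].isin(nested_keys))
    result.index = df.index
    return result


RK_df["derive_this_column"] = null_nested_pairs(RK_df)
print(RK_df)
```

In this sample, the B and C pairs sit inside the A pair, so their rows come out as missing, while the A and D pairs keep their first_duplicate and last_duplicate labels. If the real data can contain several first/last pairs for the same lat_lng value, the span computation would need to be adapted.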