Consider a dataset with a date column where records are generated every day, as shown below.
DF_A:

ID  name  qty  date
1   abc   20   17/01/2022
1   abc   10   18/01/2022
2   def   10   24/01/2022
2   def   40   25/01/2022
2   def   67   26/01/2022

DF_B:

ID  name  price_dt    price
1   abc   18/01/2022  23.56
1   abc   17/01/2022  10.56
1   abc   16/01/2022  44.33
1   abc   15/01/2022  56.11
2   def   25/01/2022  2.98
2   def   26/01/2022  4.92
2   def   27/01/2022  4.88
2   def   24/01/2022  3.33
2   def   23/01/2022  8.47
2   def   22/01/2022  3.89
I'm joining DF_A with DF_B, and for each DF_A row I only need the most recent price_dt record that is earlier than the date column. This can be done by joining the two DataFrames, sorting by price_dt descending, and dropping duplicates, but the DataFrames are so large that a full join is not feasible. So I'm looking for a way to reduce the rows in DF_B before joining.
Code that I tried:
# Assumes `date` and `price_dt` are already parsed as datetimes
DF_C = pd.merge(DF_A, DF_B, on='ID', how='left')
# This actually gives 40 rows, which is not an optimal way for a larger dataset
DF_C = DF_C[DF_C['price_dt'] < DF_C['date']]  # keep only prices strictly before the order date
Expected_DF = DF_C.sort_values(by=['price_dt'], ascending=False)
Expected_DF = Expected_DF.drop_duplicates(subset=['ID', 'name', 'date'], keep='first')
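One partial reduction that fits the "shrink DF_B before joining" idea: any DF_B row whose price_dt is not earlier than that ID's latest date in DF_A can never be selected, so it can be dropped up front. This is a sketch under the assumption that both date columns are parsed as datetimes (the dd/mm/yyyy strings are parsed with dayfirst=True); the frames below are small stand-ins for the real data.

```python
import pandas as pd

# Stand-in frames mirroring a subset of the question's data
DF_A = pd.DataFrame({'ID': [1, 1, 2],
                     'date': pd.to_datetime(['17/01/2022', '18/01/2022',
                                             '26/01/2022'], dayfirst=True)})
DF_B = pd.DataFrame({'ID': [1, 1, 2, 2],
                     'price_dt': pd.to_datetime(['18/01/2022', '16/01/2022',
                                                 '27/01/2022', '25/01/2022'],
                                                dayfirst=True),
                     'price': [23.56, 44.33, 4.88, 2.98]})

# Latest order date per ID; any price_dt at or after it is unreachable
latest = DF_A.groupby('ID')['date'].max().rename('latest_date')
trimmed = DF_B.join(latest, on='ID')
trimmed = trimmed.loc[trimmed['price_dt'] < trimmed['latest_date']]
trimmed = trimmed.drop(columns='latest_date')
```

This only trims rows after each ID's last order date, so the join that follows is smaller but can still be large.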
Expected_DF:
ID  name  qty  date        price_dt    price
1   abc   20   17/01/2022  16/01/2022  44.33
1   abc   10   18/01/2022  17/01/2022  10.56
2   def   10   24/01/2022  23/01/2022  8.47
2   def   40   25/01/2022  24/01/2022  3.33
2   def   67   26/01/2022  25/01/2022  2.98
I'm looking for a feasible method that reduces memory usage instead of fetching all the matching records from DF_B.
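For reference, this "latest earlier match per key" pattern is what pd.merge_asof with direction='backward' does: it selects one DF_B row per DF_A row without materializing the full 40-row cross product. A minimal sketch, assuming the date columns are parsed as datetimes (dayfirst, since the sample uses dd/mm/yyyy) and the frames are built from the sample data:

```python
import pandas as pd

DF_A = pd.DataFrame({
    'ID':   [1, 1, 2, 2, 2],
    'name': ['abc', 'abc', 'def', 'def', 'def'],
    'qty':  [20, 10, 10, 40, 67],
    'date': pd.to_datetime(
        ['17/01/2022', '18/01/2022', '24/01/2022', '25/01/2022', '26/01/2022'],
        dayfirst=True),
})
DF_B = pd.DataFrame({
    'ID':   [1] * 4 + [2] * 6,
    'name': ['abc'] * 4 + ['def'] * 6,
    'price_dt': pd.to_datetime(
        ['18/01/2022', '17/01/2022', '16/01/2022', '15/01/2022',
         '25/01/2022', '26/01/2022', '27/01/2022', '24/01/2022',
         '23/01/2022', '22/01/2022'], dayfirst=True),
    'price': [23.56, 10.56, 44.33, 56.11, 2.98, 4.92, 4.88, 3.33, 8.47, 3.89],
})

# Both sides must be sorted on their time keys; `by='ID'` matches within
# each ID, and allow_exact_matches=False makes the match strictly earlier.
result = pd.merge_asof(
    DF_A.sort_values('date'),
    DF_B.sort_values('price_dt')[['ID', 'price_dt', 'price']],  # drop 'name' to avoid suffixing
    left_on='date', right_on='price_dt',
    by='ID',
    direction='backward',
    allow_exact_matches=False,
)
# `result` matches the Expected_DF above: one row per DF_A row, carrying
# the most recent price_dt strictly before `date`.
```

merge_asof runs a single ordered pass per key rather than an equi-join followed by deduplication, which is why it avoids the intermediate blow-up the question describes.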