I want to optimize code that regroups my pandas DataFrame (`dk`) by joins:
```python
dk = pd.DataFrame({'Point': {0: 15, 1: 16, 2: 16, 3: 17, 4: 17, 5: 18, 6: 18, 7: 19, 8: 20},
                   'join':  {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3, 8: 4}})
```

If two groups with different `join` values share the same point, both groups should get the same `join`, and so on across the whole DataFrame. I did it with this simple code:
```python
dk['new'] = dk['join']
for i in dk.index:
    for j in range(i + 1, dk.shape[0]):
        if dk['Point'][i] == dk['Point'][j]:
            dk['new'][j] = dk['join'][i]
            dk.loc[(dk['join'] == dk['join'][j]), 'new'] = dk['new'][i]
```

The result I want:
```python
df = {'Point': {0: 15, 1: 16, 2: 16, 3: 17, 4: 17, 5: 18, 6: 18, 7: 19, 8: 20},
      'join':  {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3, 8: 4},
      'new':   {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 4}}
```

But I need this to work on big data with more than 450k rows, and the nested loop above is quadratic in the number of rows. Do you have any idea how to optimize it, or are there other modules suited to this problem? Thanks in advance.
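For context, merging every group that shares a point is equivalent to finding connected components in a bipartite graph whose nodes are the distinct `Point` values and the distinct `join` labels, with an edge for each row. One possible sketch of that approach (assuming SciPy is available; the relabeling to the smallest original `join` per component is my own choice to match the desired output) might look like this:

```python
import pandas as pd
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

dk = pd.DataFrame({'Point': [15, 16, 16, 17, 17, 18, 18, 19, 20],
                   'join':  [0, 0, 1, 1, 2, 2, 3, 3, 4]})

# Encode points and joins as contiguous integer ids.
p_codes, p_uniques = pd.factorize(dk['Point'])
j_codes, j_uniques = pd.factorize(dk['join'])
n_p, n_j = len(p_uniques), len(j_uniques)

# One graph node per point (0..n_p-1) and per join (n_p..n_p+n_j-1);
# each row of the frame contributes one point-join edge.
graph = coo_matrix(([1] * len(dk), (p_codes, j_codes + n_p)),
                   shape=(n_p + n_j, n_p + n_j))

# Undirected connected components merge all joins linked through shared points.
n_comp, labels = connected_components(graph, directed=False)

# Component id of each row's join node, relabeled to the smallest
# original 'join' inside that component.
comp = labels[j_codes + n_p]
dk['new'] = dk.groupby(comp)['join'].transform('min')
```

This runs in roughly linear time in the number of rows, so 450k rows should be no problem; a union-find (disjoint-set) structure over the `join` labels would be another way to get the same result without SciPy.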