How can I make the script below more efficient? This is a follow-up to my previous post, "Python nested loop issue".
It currently takes the best part of two hours to process input tables consisting of about 15,000 and 1,500 rows. Manually processing my data in Excel takes me an order of magnitude less time - not ideal!
I understand that `iterrows` is a bad approach to the problem, and that vectorisation is the way forward, but I am a bit dumbfounded as to how it would work with regard to the second for loop.
The following script extract takes two dataframes, `qinsy_file_2` and `segy_vlookup` (ignore the naming on that one). For every row in `qinsy_file_2`, it iterates through `segy_vlookup` to calculate the distance between the coordinates in each file. If this distance is less than or equal to a pre-given value (here named `buffer`), the `qinsy_file_2` row is copied to a new dataframe, `out_df`, and the inner loop stops; otherwise the loop moves on to the next `segy_vlookup` row.
    # Loop through Qinsy file
    for index_qinsy, row_qinsy in qinsy_file_2.iterrows():
        # Loop through SEGY navigation
        for index_segy, row_segy in segy_vlookup.iterrows():
            # Calculate distance between points
            if ((((segy_vlookup["CDP_X"][index_segy] - qinsy_file_2["CMP Easting"][index_qinsy])**2)
                 + ((segy_vlookup["CDP_Y"][index_segy] - qinsy_file_2["CMP Northing"][index_qinsy])**2))**0.5) <= buffer:
                # Append rows less than the distance modifier to the new dataframe
                out_df = pd.concat([out_df, row_qinsy])
                break
            else:
                pass
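To make the goal concrete, here is a minimal sketch of what I imagine a vectorised version of the nested loop could look like, using NumPy broadcasting to build the full distance matrix at once. The toy data and the `buffer` value of 6.0 are made up for illustration, and I am assuming the coordinate columns are plain numeric values. I have not checked it against my real tables.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real tables (values are made up for illustration).
qinsy_file_2 = pd.DataFrame({
    "CMP Easting": [0.0, 100.0, 10.0],
    "CMP Northing": [0.0, 100.0, 0.0],
})
segy_vlookup = pd.DataFrame({
    "CDP_X": [3.0, 50.0],
    "CDP_Y": [4.0, 50.0],
})
buffer = 6.0  # assumed distance threshold, same units as the coordinates

# Coordinates as (n, 2) and (m, 2) float arrays.
qinsy_xy = qinsy_file_2[["CMP Easting", "CMP Northing"]].to_numpy(dtype=float)
segy_xy = segy_vlookup[["CDP_X", "CDP_Y"]].to_numpy(dtype=float)

# Broadcasting (n, 1) against (1, m) gives (n, m) matrices of differences.
dx = qinsy_xy[:, None, 0] - segy_xy[None, :, 0]
dy = qinsy_xy[:, None, 1] - segy_xy[None, :, 1]
dist = np.hypot(dx, dy)  # Euclidean distance for every Qinsy/SEGY pair

# Keep a Qinsy row if at least one SEGY point lies within the buffer,
# which mirrors the "append once, then break" logic of the loop.
within = (dist <= buffer).any(axis=1)
out_df = qinsy_file_2[within]
```

With roughly 15,000 by 1,500 rows, each intermediate matrix would hold about 22.5 million floats, so the Qinsy rows might need to be processed in chunks if memory is tight. A spatial index such as `scipy.spatial.cKDTree` may be another option, though I have not tried it.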
So far I have read through the following:
- How to iterate over rows in a Pandas DataFrame? (and others of a similar name)
- Looking for faster way to iterate over pandas dataframe
- What is the most efficient way to loop through dataframes with pandas?
- https://www.learndatasci.com/solutions/how-iterate-over-rows-pandas/
- https://towardsdatascience.com/you-dont-always-have-to-loop-through-rows-in-pandas-22a970b347ac