EDIT: hopefully clarified the problem and corrected the first dataframe to match the result dataframe
Example dataframe:
df = pd.DataFrame({'recipe':['meal 1','meal 2', 'meal 3', 'meal 4','meal 5'],'vegetable':['carrot','carrot','beets','carrot','artichoke'],'fruit':['banana','apple','banana','banana','banana'],'protein':['beef','chicken','beef','fish','fish'],'calories':[10, 50, 100, 150, 200]})
Assuming the DataFrame is sorted (here by calories, ascending), I'm trying to add a new column named 'master meal'. This column will contain the name of the first recipe that shares a significant overlap in ingredients with the current recipe. A significant overlap is defined as sharing at least two ingredients.
If a recipe has already been used as a 'master meal' or already has a 'master meal' assigned to it, it should not be considered for subsequent rows.
In this example, the result would be:
df = pd.DataFrame({'recipe':['meal 1','meal 2', 'meal 3', 'meal 4','meal 5'],'vegetable':['carrot','carrot','beets','carrot','artichoke'],'fruit':['banana','apple','banana','banana', 'banana'],'protein':['beef','chicken','beef','fish', 'fish'],'calories':[10, 50, 100, 150, 200],'master meal': ['meal 1',None,'meal 1','meal 1', None]})
(i.e. 'meal 5' won't get its master meal set to 'meal 4', because 'meal 4' has already been tagged with 'meal 1')
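To make the rule concrete, here is a minimal sketch of it as a single greedy pass over the sorted rows. It rests on a few assumptions read off the expected result above: "sharing an ingredient" means having the same value in the same column, a recipe is tagged with its own name only when at least one later recipe matches it, and recipes with no match stay None. The helper name `assign_master_meals` and its parameters `ingredient_cols` and `min_shared` are made up for illustration.

```python
import numpy as np
import pandas as pd


def assign_master_meals(df, ingredient_cols, min_shared=2):
    """Hypothetical helper: greedy pass over the rows in their current order."""
    values = df[ingredient_cols].to_numpy()
    names = df["recipe"].to_numpy()
    master = np.full(len(df), None, dtype=object)
    free = np.ones(len(df), dtype=bool)  # rows that are not tagged yet

    for i in range(len(df)):
        if not free[i]:
            continue  # already a master or already has one
        # Count same-column matches between row i and every row at once.
        shared = (values == values[i]).sum(axis=1)
        followers = free & (shared >= min_shared)
        followers[: i + 1] = False  # only later rows can follow row i
        if followers.any():
            master[i] = names[i]
            master[followers] = names[i]
            free[i] = False
            free[followers] = False

    return df.assign(**{"master meal": master})


df = pd.DataFrame({
    "recipe": ["meal 1", "meal 2", "meal 3", "meal 4", "meal 5"],
    "vegetable": ["carrot", "carrot", "beets", "carrot", "artichoke"],
    "fruit": ["banana", "apple", "banana", "banana", "banana"],
    "protein": ["beef", "chicken", "beef", "fish", "fish"],
    "calories": [10, 50, 100, 150, 200],
})
result = assign_master_meals(df, ["vegetable", "fruit", "protein"])
print(result[["recipe", "master meal"]])
```

The outer loop is still a Python loop, but already-tagged rows are skipped and each comparison against the rest of the frame is one NumPy operation, so the worst case remains quadratic in the number of rows.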
I was able to build something with apply() where I compared each row to the rest of the DataFrame, but as you can imagine, it didn't work too well when applied to a bigger dataset. I scratched my head all day trying to find a vectorized approach, without success.
Maybe you have a better idea? I don't know how to avoid looping through the DataFrame or, if looping is unavoidable, how to do it efficiently.