I have several 'condition' columns in a dataset. These columns are all eligible to receive the same coded input. This is only to allow multiple conditions to be associated with a single record - which column the code winds up in carries no meaning.
In the sample below there are only 5 unique values across the 3 condition columns, although if you consider each column separately, there are 3 unique values in each. So when I apply one-hot encoding to these variables together I get 9 new columns, but I only want 5 (one for each unique value in the collective set of columns).
Here is a sample of the original data:
| cond1 | cond2 | cond3 | target ||-------|-------|-------|--------|| I219 | E119 | I48 | 1 || I500 | | | 0 || I48 | I500 | F171 | 1 || I219 | E119 | I500 | 0 || I219 | I48 | | 0 |
Here's what I tried:
import pandas as pddf = pd.read_csv('micro.csv', dtype='object')df['cond1'] = pd.Categorical(df['cond1'])df['cond2'] = pd.Categorical(df['cond2'])df['cond3'] = pd.Categorical(df['cond3'])dummies = pd.get_dummies(df[['cond1', 'cond2', 'cond3']], prefix = 'cond')dummies
Which gives me:
| cond_I219 | cond_I48 | cond_I500 | cond_E119 | cond_I48 | cond_I500 | cond_F171 | cond_I48 | cond_I500 ||-----------|----------|-----------|-----------|----------|-----------|-----------|----------|-----------|| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 || 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 || 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 || 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 || 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
So I have multiple coded columns for any code that appears in more than one column (I48 and I500).. I would like only a single column for each so I can check for correlations between individual codes and my target variable.
Is there a way to do this? This is the result I'm after:
| cond_I219 | cond_I48 | cond_I500 | cond_E119 | cond_F171 ||-----------|----------|-----------|-----------|-----------|| 1 | 1 | 0 | 1 | 0 || 0 | 0 | 1 | 0 | 0 || 0 | 1 | 1 | 0 | 1 || 1 | 0 | 1 | 1 | 0 || 1 | 1 | 0 | 0 | 0 |