Channel: Active questions tagged python - Stack Overflow

Pandas one-hot encoding with multiple like columns

I have several 'condition' columns in a dataset. These columns are all eligible to receive the same coded input. This is only to allow multiple conditions to be associated with a single record - which column the code winds up in carries no meaning.

In the sample below there are only 5 unique values across the 3 condition columns, although if you consider each column separately, there are 3 unique values in each. So when I apply one-hot encoding to these variables together I get 9 new columns, but I only want 5 (one for each unique value in the collective set of columns).

Here is a sample of the original data:

| cond1 | cond2 | cond3 | target |
|-------|-------|-------|--------|
| I219  | E119  | I48   | 1      |
| I500  |       |       | 0      |
| I48   | I500  | F171  | 1      |
| I219  | E119  | I500  | 0      |
| I219  | I48   |       | 0      |
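To make the 9-versus-5 count concrete, here is a minimal sketch that rebuilds the sample inline (instead of reading `micro.csv`) and counts unique codes per column and across the pooled columns. The names `cond_cols`, `per_column`, and `pooled` are only illustrative, and blank cells are assumed to be missing values.

```python
import pandas as pd

# Sample data rebuilt inline from the table above; blank cells become None.
df = pd.DataFrame({
    "cond1": ["I219", "I500", "I48", "I219", "I219"],
    "cond2": ["E119", None, "I500", "E119", "I48"],
    "cond3": ["I48", None, "F171", "I500", None],
    "target": [1, 0, 1, 0, 0],
})
cond_cols = ["cond1", "cond2", "cond3"]  # illustrative name

# Unique codes in each column separately (nunique ignores missing values).
per_column = df[cond_cols].nunique()

# Unique codes across the three columns pooled into one long Series.
pooled = df[cond_cols].stack().nunique()

print(per_column.sum(), pooled)
```

Encoding each column separately yields one dummy per (column, code) pair, which is where the extra columns come from.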

Here's what I tried:

    import pandas as pd

    df = pd.read_csv('micro.csv', dtype='object')
    df['cond1'] = pd.Categorical(df['cond1'])
    df['cond2'] = pd.Categorical(df['cond2'])
    df['cond3'] = pd.Categorical(df['cond3'])
    dummies = pd.get_dummies(df[['cond1', 'cond2', 'cond3']], prefix='cond')
    dummies

Which gives me:

| cond_I219 | cond_I48 | cond_I500 | cond_E119 | cond_I48 | cond_I500 | cond_F171 | cond_I48 | cond_I500 |
|-----------|----------|-----------|-----------|----------|-----------|-----------|----------|-----------|
| 1         | 0        | 0         | 1         | 0        | 0         | 0         | 1        | 0         |
| 0         | 0        | 1         | 0         | 0        | 0         | 0         | 0        | 0         |
| 0         | 1        | 0         | 0         | 0        | 1         | 1         | 0        | 0         |
| 1         | 0        | 0         | 1         | 0        | 0         | 0         | 0        | 1         |
| 1         | 0        | 0         | 0         | 1        | 0         | 0         | 0        | 0         |

So I get duplicate dummy columns for any code that appears in more than one condition column (I48 and I500). I would like only a single column per code so that I can check for correlations between individual codes and my target variable.

Is there a way to do this? This is the result I'm after:

| cond_I219 | cond_I48 | cond_I500 | cond_E119 | cond_F171 |
|-----------|----------|-----------|-----------|-----------|
| 1         | 1        | 0         | 1         | 0         |
| 0         | 0        | 1         | 0         | 0         |
| 0         | 1        | 1         | 0         | 1         |
| 1         | 0        | 1         | 1         | 0         |
| 1         | 1        | 0         | 0         | 0         |
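One possible way to get this shape, offered as a sketch and not as the only approach: stack the condition columns into a single long Series, one-hot encode that Series once, then collapse back to one row per record by taking the maximum within each original row. The sample is rebuilt inline here, and `cond_cols`, `long_codes`, and `correlations` are hypothetical names.

```python
import pandas as pd

# Sample data rebuilt inline; blank cells are assumed to be missing values.
df = pd.DataFrame({
    "cond1": ["I219", "I500", "I48", "I219", "I219"],
    "cond2": ["E119", None, "I500", "E119", "I48"],
    "cond3": ["I48", None, "F171", "I500", None],
    "target": [1, 0, 1, 0, 0],
})
cond_cols = ["cond1", "cond2", "cond3"]

# Long format: one entry per (row, column) pair that actually holds a code.
long_codes = df[cond_cols].stack().dropna()

# Encode the pooled codes once, then take the max per original row so a
# code counts as present no matter which condition column it was in.
dummies = (
    pd.get_dummies(long_codes, prefix="cond", dtype=int)
    .groupby(level=0)
    .max()
    .reindex(df.index, fill_value=0)  # keep rows that have no codes at all
)

# Correlation of each code indicator with the target column.
correlations = dummies.corrwith(df["target"])
print(dummies)
```

The dummy columns come out sorted by code, so select them explicitly if a particular column order matters.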
