One-hot encoding labels

Another little snippet that I sometimes forget 🙂

Suppose you have a dataframe with a column that holds string data, and you need to transform it into something a machine learning algorithm can use. One good way is one-hot encoding, which takes the distinct values in the column and creates a new column for each one, filled with 1s and 0s that mark which value each row had.
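To make the idea concrete before bringing in an encoder, here is a minimal sketch that builds the 1/0 columns by hand with plain pandas. The `color` column and its values are made up for illustration and have nothing to do with the data used in the rest of this post.

```python
import pandas as pd

# Made-up column, purely for illustration
color = pd.Series(["red", "green", "red", "blue"], name="color")

# One new column per distinct value: 1.0 where the row matches, 0.0 elsewhere
one_hot = pd.DataFrame(
    {f"color_{value}": (color == value).astype(float) for value in sorted(color.unique())}
)
print(one_hot)
```

Every row ends up with exactly one 1.0, in the column for the value it originally held.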

Have a look at this dataframe snippet:

           prod  rev
0  Solaris 11.3    2
1  Solaris 11.3    1
2  Solaris 11.4    4
3  Solaris 11.4    5
4  Solaris 11.3    1

One-hot encoding can be used to transform that:

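import pandas as pd
from sklearn.preprocessing import OneHotEncoder
OH_encoder = OneHotEncoder()  # assumed setup: sparse output by default, hence toarray() below
OH = pd.DataFrame(OH_encoder.fit_transform(mydf[['prod']]).toarray())
display(OH.head())
pd.concat([mydf, OH], axis=1)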
           prod  rev    0    1    2    3
0  Solaris 11.3    2  0.0  1.0  0.0  0.0
1  Solaris 11.3    1  0.0  1.0  0.0  0.0
2  Solaris 11.4    4  0.0  0.0  1.0  0.0
3  Solaris 11.4    5  0.0  0.0  1.0  0.0
4  Solaris 11.3    1  0.0  1.0  0.0  0.0
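One thing to watch in the concat step: `pd.concat` with `axis=1` lines rows up by index label, not by position. A frame built from the encoder output gets a fresh 0..n-1 index, so if the original frame has been filtered or re-indexed the rows will not match up. A minimal sketch, assuming scikit-learn's `OneHotEncoder` with its default sparse output and a made-up frame with a non-default index, that passes the original index through:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical frame whose index is not 0..n-1 (e.g. after filtering rows)
mydf = pd.DataFrame(
    {"prod": ["Solaris 11.3", "Solaris 11.4", "Solaris 11.3"], "rev": [2, 4, 1]},
    index=[10, 20, 30],
)

OH_encoder = OneHotEncoder()
# The encoder returns a sparse matrix by default; toarray() makes it dense
encoded = OH_encoder.fit_transform(mydf[["prod"]]).toarray()

# Reusing mydf.index keeps each encoded row attached to its source row
OH = pd.DataFrame(encoded, index=mydf.index)
combined = pd.concat([mydf, OH], axis=1)
print(combined)
```

Without `index=mydf.index`, the concat here would produce six rows padded with NaN instead of three complete ones.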

And that’s fine for machine learning. Your models and pipelines will just handle it once it is added to the original dataframe. But what if you actually want to poke around in the dataframe and use the data yourself? Then it would be really useful for each column label to reflect what the data in it actually is. No problem: just get the feature names from the encoder and rename the columns on the one-hot dataframe before concatenating it to the original frame:

# get_feature_names_out needs scikit-learn 1.0+; older releases call it get_feature_names
column_name = OH_encoder.get_feature_names_out(['prod'])
OH.columns = column_name
df3 = pd.concat([mydf, OH], axis=1)
           prod  rev  prod_Solaris 11.1  prod_Solaris 11.3  prod_Solaris 11.4
0  Solaris 11.3    2                0.0                1.0                0.0
1  Solaris 11.3    1                0.0                1.0                0.0
2  Solaris 11.4    4                0.0                0.0                1.0
3  Solaris 11.4    5                0.0                0.0                1.0
