7. Mastering Machine Learning: Converting Categories to Binary Brilliance with Python - Part 1
Hey there! Today, let's dive into a cool concept in machine learning, one that helps our machines understand the world a little better!
So, imagine you've got this awesome dataset filled with different categories: colors (red, blue, green), types of animals (cat, dog, bird), or, as in our dataset, countries (USA, Spain, Germany)!
But hold up: our machines prefer speaking in numbers, not categories. Strange, right?
That's where the magic of "encoding categorical data" comes in! It's like teaching our machines a new language, translating categories into numbers so they can understand the data and work their magic.
We do this because numeric codes give the machine learning model a language it understands, so it can make better predictions or decisions based on the data. Without encoding, most models simply can't process text categories at all, or they produce inaccurate results. So, encoding categorical data is basically translating categories into a language that the model can easily use.
Let's break it down with a simple example:
Imagine you're teaching a robot about countries: USA, Spain, and Germany.
You decide to give each country a number:
USA: 1
Spain: 2
Germany: 3
So, when you want to tell the robot about a country, you just tell it the number. For example, if you say "2," the robot knows it's Spain.
But the robot might think the numbers mean something about the countries, like 2 being "bigger" than 1. We don't want that confusion, right? It's our responsibility to build our machine learning model without such artificial ordering, and that's exactly why we take care of categorical data using the concept called "encoding categorical data".
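To make the problem concrete, here's a minimal sketch of that naive integer mapping (the dictionary and list below are made up for illustration; this is the approach we want to avoid, not part of our pipeline):

```python
# A naive integer mapping: each country gets an arbitrary number
country_codes = {"USA": 1, "Spain": 2, "Germany": 3}

countries = ["USA", "Germany", "Spain", "USA"]
encoded = [country_codes[c] for c in countries]
print(encoded)  # [1, 3, 2, 1]

# The danger: a model may treat these as ordered quantities,
# e.g. reading "Germany (3) > USA (1)", which is meaningless here.
```

One-hot encoding, which we cover next, avoids this because no column's value is "bigger" than another's.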
There are different ways to turn categories into numbers, and today we're focusing on one called "One-Hot Encoding".
Now, what's One-Hot Encoding?
One-hot encoding is a method used in machine learning to convert categorical data into a numerical format. It involves creating a binary (0 or 1) dummy variable for each category in the dataset.
Let's explain one-hot encoding using our example of countries.
Let's apply this to our dataset, where we have a list of countries: USA, Spain, and Germany.
With one-hot encoding, you create a separate column for each country. Then, you put a 1 in the column corresponding to that row's country and 0s in all the other columns.
Here's how it works:
USA: In the USA column, you put a 1 because it's the USA. In the Spain and Germany columns, you put 0s.
Spain: In the Spain column, you put a 1 because it's Spain. In the USA and Germany columns, you put 0s.
Germany: In the Germany column, you put a 1 because it's Germany. In the USA and Spain columns, you put 0s.
This way, each country gets its own column, and the presence of a 1 indicates which country it is, while the 0s show it's not that country.
So, after one-hot encoding, your data might look like this:

| Country | Germany | Spain | USA |
| --- | --- | --- | --- |
| USA | 0 | 0 | 1 |
| Spain | 0 | 1 | 0 |
| Germany | 1 | 0 | 0 |
In simple terms, one-hot encoding creates a special code for each country, making it easy for computers to understand and work with categorical data, like countries, in machine learning.
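If you'd like to see that table appear for yourself before we wire things into scikit-learn, here's a quick sketch using pandas' `get_dummies` function on made-up toy data (not our actual SampleData.csv; in our pipeline we'll use scikit-learn's `OneHotEncoder` instead, since it slots into a `ColumnTransformer`):

```python
import pandas as pd

# Toy data for illustration only (not the article's SampleData.csv)
df = pd.DataFrame({"Country": ["USA", "Spain", "Germany"]})

# get_dummies creates one binary column per category;
# columns come out sorted alphabetically: Germany, Spain, USA
one_hot = pd.get_dummies(df["Country"]).astype(int)
print(one_hot)
#    Germany  Spain  USA
# 0        0      0    1
# 1        0      1    0
# 2        1      0    0
```

Notice the output matches the table above: each row has exactly one 1, in the column of its own country.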
Before we move forward, let's revisit the step where we dealt with missing data using the SimpleImputer class; after that code ran, x_features holds the imputed feature matrix that we'll now encode.
Okay, now, remember how we talked about turning country names into special codes in binary format?
Our goal is to give each country a unique code, like in the table below. But don't worry if the actual codes you get are different; the key is making sure each country has its own special code!
| Country | Germany | Spain | USA |
| --- | --- | --- | --- |
| USA | 0.0 | 0.0 | 1.0 |
| Spain | 0.0 | 1.0 | 0.0 |
| Germany | 1.0 | 0.0 | 0.0 |
Let's tackle our favorite part of coding, the fun stuff we love to do!
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
column_transformer = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
x_features = np.array(column_transformer.fit_transform(x_features))
Now let's break down each line of the code:
`from sklearn.compose import ColumnTransformer` - This line imports the `ColumnTransformer` class from the `sklearn.compose` module. The `ColumnTransformer` class is used for applying different transformations to different columns in a dataset.
`from sklearn.preprocessing import OneHotEncoder` - This line imports the `OneHotEncoder` class from the `sklearn.preprocessing` module. The `OneHotEncoder` class is used to encode categorical features as one-hot numeric arrays.
`column_transformer = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')` - This line creates a `ColumnTransformer` object named `column_transformer`. The `transformers` parameter specifies a list of tuples, where each tuple contains:
- A name for the transformer (in this case, `'encoder'`).
- The transformer itself (in this case, an instance of `OneHotEncoder()`).
- A list of column indices to apply the transformer to (in this case, `[0]`, the index of the country column in our dataset).
Tuple: A tuple in Python is a fixed-size, immutable collection of elements enclosed within parentheses `()` and separated by commas. Unlike lists, tuples cannot be modified after creation, and they are often used to group related data together.
Example: (1, 2, 'hello', True)
And the last parameter passed is `remainder`, which specifies what to do with the columns not listed in `transformers`. Here, `'passthrough'` means that those columns will be passed through without any transformation.
`x_features = np.array(column_transformer.fit_transform(x_features))` - This line applies the `ColumnTransformer` to the data `x_features`. The `fit_transform()` method fits the transformer to the data and then transforms it. The transformed data is converted to a NumPy array and stored back in the variable `x_features`.
Overall, this code imports the necessary modules, creates a `ColumnTransformer` object to apply one-hot encoding to the first column of the input data (`x_features`), and then transforms the data accordingly.
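Putting those pieces together, here's a self-contained sketch of the same transformation on a made-up feature matrix shaped like our dataset (a Country, Age, Salary layout; the values are invented for illustration and are not the real SampleData.csv contents):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up feature matrix mimicking our dataset: [Country, Age, Salary]
x_features = np.array([
    ["USA", 44.0, 72000.0],
    ["Spain", 27.0, 48000.0],
    ["Germany", 30.0, 54000.0],
], dtype=object)

# One-hot encode column 0 (Country); pass Age and Salary through untouched
column_transformer = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
)
x_features = np.array(column_transformer.fit_transform(x_features))
print(x_features)
# Each row now starts with the one-hot code (Germany, Spain, USA columns),
# followed by the original Age and Salary values, e.g. the USA row
# begins with 0.0, 0.0, 1.0.
```

Note that the encoded columns always come first in the output, with the `passthrough` columns appended after them.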
Now, let's run this block of code in our current session. Let's give it a go!
Yay! Did you see that? The categorical values got swapped out for those cool binary numbers, just like we wanted! How awesome is that?
Here's the complete code we've been working on, all wrapped up and ready to go! The .ipynb file is just a quick download away on my GitHub page. Simply swing by here: GitHub - PythonicCloudAI/ML_W_PYTHON/Data_Preprocessing.ipynb.
# **Step 1: Import the necessary libraries**
import pandas as pd
import scipy
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import seaborn as sns
import matplotlib.pyplot as plt
# **Step 2: Data Importing**
df = pd.read_csv('SampleData.csv')
# **Step 3: Segregation of dataset into Matrix of Features & Dependent Variable Vector**
x_features = df.iloc[:, :-1].values
y_dependent = df.iloc[:, -1].values
print(x_features)
print(y_dependent)
# **Step 4 : Handling missing Data**
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(x_features[:,1:3])
x_features[:,1:3] = imputer.transform(x_features[:,1:3])
print(x_features)
# **Step 5 : Encoding Categorical Data**
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
column_transformer = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
x_features = np.array(column_transformer.fit_transform(x_features))
print(x_features)
That's a wrap for this chapter, folks! Hope you've enjoyed it and are eager for the next one, where we'll continue with more on encoding. Stay tuned for more fun and learning!