Alright, in our journey through machine learning, we've just covered importing the necessary libraries and loading our dataset. Before diving deeper into the coding part, let's introduce some important machine learning terminology: "Matrix of Features" and "Dependent Variable Vector".
Here's a breakdown:
Matrix of Features: This is essentially all the columns in your dataset that you feed into your machine learning model as input, i.e., every column except the target.
Dependent Variable Vector: This is the target your model is aiming for, like the bullseye in archery! Typically, it's the last column in your dataset, representing what your machine learning model is going to predict.
So, if we inspect our dataset using these terms:
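To make those two terms concrete, here's a tiny, made-up stand-in for a dataset like ours. The column names mirror what we'll discuss later in this chapter, but the values below are invented purely for illustration; the real data lives in SampleData.csv:

import pandas as pd

# A toy stand-in for SampleData.csv -- values are illustrative only.
df_example = pd.DataFrame({
    'Country':   ['France', 'Spain', 'USA'],
    'Age':       [44.0, None, 38.0],        # Spain's age is missing
    'Salary':    [72000.0, 48000.0, None],  # USA's salary is missing
    'Purchased': ['No', 'Yes', 'No'],
})

# Matrix of Features        -> Country, Age, Salary (every column except the last)
# Dependent Variable Vector -> Purchased (the last column, our prediction target)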
Now that we've got a grip on those definitions, let's split our dataset into the Matrix of Features and the Dependent Variable Vector using Python! Here's how:
x_features = df.iloc[:, :-1].values   # every row, every column except the last -> Matrix of Features
y_dependent = df.iloc[:, -1].values   # every row, only the last column -> Dependent Variable Vector
Here's a step-by-step breakdown:
1. Starting with the data (df): In the last chapter, we loaded our data from a CSV file named 'SampleData.csv' using pd.read_csv('SampleData.csv'), which created a DataFrame called df. This DataFrame holds all the information we need for our machine learning task.
2. Selecting the features (x_features): To prepare our data for machine learning, we separate our features (the characteristics used for prediction) from our target variable. df.iloc[:, :-1] selects all rows and every column except the last one from our DataFrame; these are our features. Calling .values converts this selection into a plain NumPy array, which we store in x_features.
3. Identifying the target (y_dependent): Finally, we isolate our target variable, the value we're trying to predict. df.iloc[:, -1] selects all rows but only the last column from our DataFrame. We convert this selection into a NumPy array with .values and store it in y_dependent.
In summary, these lines of code extract the features into x_features and isolate the target variable into y_dependent. This setup prepares our data for training a machine learning model in a clear and organized manner.
Next, let's print the variables we created above to confirm we're getting the result we expect:
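A minimal check, assuming you're running the same notebook (any gaps in the Age and Salary columns will show up as nan in the printed feature array):

print(x_features)   # the Matrix of Features: every column except the last
print(y_dependent)  # the Dependent Variable Vector: the last column only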
Now that we've successfully separated our data into two distinct variables, x_features (our Matrix of Features) and y_dependent (our Dependent Variable Vector), it's time to tackle the issue of missing data.
If you take a peek at our dataset, you might notice some cells standing out like little empty spots in a colorful painting, highlighted in a special color. These are our missing values, and they need a bit of attention!
But fear not! With the right techniques, we'll fill in these gaps and ensure our dataset is squeaky clean and ready for analysis. Let's roll up our sleeves and dive into handling missing data like pros!
Having missing data in your dataset can cause problems because it means you don't have complete information for some of your samples. This can lead to inaccurate or biased results when you try to analyze or make predictions with your data.
Now, just for a moment, imagine you have a dataset like the one above, containing information about individuals, including their country, age, and salary, and you want to predict whether they will make a purchase or not. Here's how missing data could affect your analysis:
Incomplete Information: If some individuals have missing data for their age (like the Spain row) or salary (like the USA row), you won't have a complete picture of their characteristics. This can make it difficult to accurately predict whether they will make a purchase, because you're missing important information that could influence their decision.
Bias in Analysis: Let's say that individuals with missing salary data (like the USA row) tend to be younger. If you only analyze the data for those with complete salary information, you might end up with a biased view of the relationship between salary and purchase behavior. This could lead to misleading conclusions and inaccurate predictions.
Reduced Predictive Power: Missing data can also reduce the effectiveness of your predictive model. If a significant portion of your dataset has missing values, your model may not have enough information to learn meaningful patterns and make accurate predictions. As a result, the predictive power of your model could be compromised.
In summary, having missing data in your dataset can lead to incomplete information, biased analysis, and reduced predictive power. It's important to handle missing data appropriately, either by imputing missing values or using techniques that can accommodate missingness, to ensure the reliability and accuracy of your analysis and predictions. So let's handle them.
Alright, when it comes to handling those pesky missing values in our dataset, we've got options!
If our dataset is large and a few missing records won't make a dent, we could just wave goodbye to them and delete them altogether. But what if our dataset is precious and we can't afford to lose any data? That's where our trusty technique comes in: replacing missing values with the average of all the other values in that column!
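For completeness, here's a minimal sketch of that first option, deleting incomplete rows, using the df we loaded earlier. We won't use this in our pipeline, since our dataset is small:

# Option 1 (not used here): drop every row containing at least one missing value.
df_complete = df.dropna()
print(df_complete.shape)  # fewer rows than df if any values were missing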
To work this magic, we're going to enlist the help of a handy tool called SimpleImputer, which lives in scikit-learn's sklearn.impute module. Here's how we do it for the 'Age' and 'Salary' columns in our dataset:
from sklearn.impute import SimpleImputer
import numpy as np  # needed for np.nan below (already imported in chapter 1)

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')  # treat NaN as missing, fill with the column mean
imputer.fit(x_features[:, 1:3])  # learn the means of the Age and Salary columns (columns 1 and 2)
x_features[:, 1:3] = imputer.transform(x_features[:, 1:3])  # replace the NaNs with those means
Now let's break down the code:
1. Importing the tool: We import the SimpleImputer class from the sklearn.impute module. This class helps us handle missing data.
2. Creating an imputer object: We create an instance of the SimpleImputer class and store it in a variable named imputer. This object will fill in the missing values in our data.
3. Configuring the imputer: We specify two parameters:
- missing_values=np.nan: tells the imputer which value to treat as missing. In this case it's np.nan, the standard marker for missing numerical data.
- strategy='mean': tells the imputer how to fill in the missing values. Here, each missing value is replaced with the mean (average) of the non-missing values in the same column.
4. Fitting the imputer: The fit method of the imputer object computes the mean of the non-missing values for each column in the selected subset of features (x_features[:, 1:3]).
5. Transforming the data: The transform method of the imputer object replaces the missing values in that same subset (x_features[:, 1:3]) with the means computed during the fitting step.
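A quick side note on the strategy parameter: 'mean' is just one of the strategies SimpleImputer supports; 'median', 'most_frequent', and 'constant' are the others. A sketch of the median variant, applied to a copy so our pipeline above stays untouched (in practice you'd use it instead of the mean imputer, not after it):

# 'median' is more robust to outliers than 'mean'.
median_imputer = SimpleImputer(missing_values=np.nan, strategy='median')
x_alt = x_features.copy()
x_alt[:, 1:3] = median_imputer.fit_transform(x_alt[:, 1:3])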
All set! When we peek at our original variable x_features now, it's like looking into a crystal-clear pond: no more missing values, thanks to our cool code!
We've nailed it! Our Matrix of Features is now spotless, no more missing values! It's like tidying up a cluttered room and making it squeaky clean!
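If you'd rather verify this programmatically than by eye, a quick sanity check might look like this (assuming numpy is imported as np, as in our full script below):

# Count any NaNs left in the Age and Salary columns; we expect 0 after imputation.
print(np.isnan(x_features[:, 1:3].astype(float)).sum())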
Here's the complete code we've been working on, all wrapped up and ready to go! The .ipynb file is just a quick download away on my GitHub page. Simply swing by here: GitHub - PythonicCloudAI/ML_W_PYTHON/Data_Preprocessing.ipynb.
# **Step 1: Import the necessary libraries**
import pandas as pd
import scipy
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import seaborn as sns
import matplotlib.pyplot as plt
# **Step 2: Data Importing**
df = pd.read_csv('SampleData.csv')
# **Step 3 : Segregation of dataset into Matrix of Features & Dependent Variable Vector**
x_features = df.iloc[:, :-1].values
y_dependent = df.iloc[:, -1].values
print(x_features)
print(y_dependent)
# **Step 4 : Handling missing Data**
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(x_features[:, 1:3])
x_features[:, 1:3] = imputer.transform(x_features[:, 1:3])
print(x_features)
That's a wrap for this chapter, folks! Hope you've enjoyed it and are eager for the next one, where we'll explore encoding. Stay tuned for more fun and learning!