Day 3: Basics of Data Preprocessing & Introduction to Pandas Library

Ever heard of correcting someone’s weakness before it leads to a huge blunder? It helps a lot in real life, doesn’t it? Going into any circumstance well prepared, with our weaknesses taken care of in advance, is an important habit.

Today, we’ll explore the critical process of data preprocessing and cleaning, which ensures that data is ready for analysis or modeling. This step is vital because even the best algorithms can’t fix bad data. Data preprocessing follows the same analogy as the paragraph above: by taking care of the dataset’s weaknesses (such as potential outliers) in advance, we ensure better accuracy as we move forward.


Core Definition

Data preprocessing involves transforming raw data into a format suitable for analysis. Raw data typically has missing values, outliers, and even inconsistent scales (such as human_age = 1000 years). Outliers are data points that look abnormal and lie well beyond the range of the rest of the data; they are often the result of experimental errors and need to be removed before moving forward.

Removing all the inconsistencies mentioned above ensures that the dataset is clean and reliable. This, in turn, helps algorithms converge more efficiently, ensuring the highest possible accuracy.

The following are the steps which are performed during data pre-processing:


  • Data Cleaning, which handles missing values and outliers by removing or imputing them.

  • Data Transformation, where we normalize or scale the data to a consistent range.

  • Data Reduction, where we keep only the data that is “relevant”.

  • Data Integration, where we combine data from multiple sources into one consistent dataset.

We will have a look at how each of these is done in practice as today’s lesson progresses. For this purpose, we are going to use the “Pandas” Python library. Let me introduce Pandas (certainly not the bamboo-eating kind) to you.


Gentle Introduction to Pandas:

For a detailed tutorial on Pandas, I recommend the content from w3Schools.

Pandas is a Python library used whenever we “work” with data. The name comes from “Panel Data” and “Python Data Analysis”; the library was created by Wes McKinney in 2008.

Whenever we talk of data analysis, the first thing that comes to mind is the Pandas library. As we already discussed, cleaning problematic data out of the final dataset is a task of the utmost priority.
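Here is a minimal sketch of what working with Pandas looks like. The DataFrame below is a made-up toy example (the names and values are purely illustrative, with one Age deliberately left missing):

import pandas as pd
import numpy as np

# A tiny hypothetical dataset with one missing Age value
df = pd.DataFrame({
    'Name': ['Asha', 'Ben', 'Chen'],
    'Age': [25, np.nan, 32],
    'Income': [40000, 52000, 61000],
})

print(df.head())        # Peek at the first few rows
print(df.isna().sum())  # Count missing values per column

df.isna().sum() is a handy first diagnostic: it tells you, per column, how much cleaning lies ahead.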

💡
Pandas can also delete rows that are not relevant or that contain wrong values, such as empty or NULL entries. This is called cleaning the data.

Now, why might missing values occur in the first place? Of course, no one is 100% perfect while curating data, not even the big players. This is where the Pandas library comes in: it provides several methods we can use for our preprocessing operations.


Handling Missing Data

Potential causes of missing values include:

  • Human error while entering data

  • IoT devices malfunctioning while collecting data

  • Data getting corrupted during transfer from one system to another

Now that we have seen the problems, let’s focus on solutions and key strategies. The best way is to replace the missing data with relevant values that sit in and around the other values present in the column.

When we focus on problems, we get problems. When we focus on solutions, we get solutions.

The first strategy is to remove rows or columns with too many missing entries.

import pandas as pd
df = pd.read_csv('data.csv')
df.dropna(inplace=True)  # Drops rows with any missing value
💡
If you want to change the original DataFrame, use the inplace=True argument. With inplace=True, dropna() will NOT return a new DataFrame; instead, it removes all rows containing NULL values from the original DataFrame directly, without allocating a new copy or requiring any further input from the user.
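dropna() can also work column-wise, which covers the “columns with too many missing entries” case. A quick sketch (assuming df is the DataFrame loaded above):

df = df.dropna(axis=1)    # Drop every column that contains any missing value
df = df.dropna(thresh=3)  # Keep only rows having at least 3 non-null values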

Suppose you want to replace missing values in the “Age” column with the “Mean” (the average of all ages) of its values, then:

df['Age'] = df['Age'].fillna(df['Age'].mean())   # Replace with Mean

This line replaces every empty cell with the mean of all “Age” values. (Note: calling fillna(..., inplace=True) on a single column is deprecated in recent versions of Pandas, so we assign the result back to the column instead.) You may also replace the empty cells with the “Median” (the middle value of the column) or the “Mode” (the most frequently occurring value in the column), depending on your use case.

df['Income'] = df['Income'].fillna(df['Income'].median())   # Replace with Median
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])  # Replace with Mode

Using the mean, median, or mode to replace missing values is the ideal approach in most scenarios. Instead of putting in random numbers, these measures of central tendency stay close to the other values already present in the column.
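To see why the choice matters, here is a small sketch with made-up incomes. A single extreme salary drags the mean far from the typical value, while the median stays representative, which is why the median is often preferred for skewed columns:

import pandas as pd

incomes = pd.Series([30000, 32000, 35000, 1000000])
print(incomes.mean())    # 274250.0 — pulled up by the extreme value
print(incomes.median())  # 33500.0 — still close to a typical income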


Outlier Detection and Handling

Outliers are extreme values that deviate significantly from the rest of the data. They can cause big problems whenever they appear in a dataset: they skew statistical calculations and results, and they tend to mislead machine learning models.

We are going to use the Interquartile Range (IQR) for outlier detection. The IQR is the spread of the middle 50% of the data: IQR = Q3 - Q1, where Q1 and Q3 are the 25th and 75th percentiles. Any value below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR is flagged as an outlier. With that understood, have a look at a practical example:

Q1 = df['Salary'].quantile(0.25)   # 25th percentile
Q3 = df['Salary'].quantile(0.75)   # 75th percentile
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['Salary'] < lower_bound) | (df['Salary'] > upper_bound)]  # Rows outside the bounds
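As a quick sanity check (assuming the snippet above has already run on your df), you can inspect the bounds and how many rows were flagged:

print(f"IQR bounds: [{lower_bound}, {upper_bound}]")
print(f"Flagged {len(outliers)} of {len(df)} rows as outliers")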

Other than IQR, another approach we can use is the Z-Score.

A z-score measures how many standard deviations a data point lies from the mean: z = (x - mean) / standard deviation. A z-score of 0 indicates that the data point’s value equals the mean. A positive z-score indicates that the data point is above average, while a negative z-score indicates that it is below average.


Outliers typically have a Z-Score > 3 or < -3.

from scipy.stats import zscore

df['z_score'] = zscore(df['Salary'])    # Standardize the Salary column
outliers = df[df['z_score'].abs() > 3]  # Rows more than 3 standard deviations from the mean
df = df.drop(columns=['z_score'])       # Drop the helper column once done

The most recommended way to handle outliers is to replace them with the nearest boundary values, taking the help of the lower_bound and upper_bound computed above. There are two common options.

The first option is to identify and drop the outlier rows entirely, using the following code snippet:

df = df[(df['Salary'] >= lower_bound) & (df['Salary'] <= upper_bound)]

Alternatively, replace the extreme values with the nearest boundary (sometimes called clipping or winsorizing):

import numpy as np

df['Salary'] = np.clip(df['Salary'], lower_bound, upper_bound)
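As a quick check (a sketch, assuming the df from above), df['Salary'].between(lower_bound, upper_bound).all() should now return True, since every value has been clipped into the allowed range.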

Scaling and Normalization

Scaling is something we tend to care about far too little, yet features with different scales can distort model performance. For example, Age ranges from 0–100, while Income ranges from 10,000–100,000, so Income would dominate any distance-based calculation.

To overcome this issue, we use techniques called “Normalization” (Min-Max Scaling) and “Standardization” (Z-Score Scaling).

Normalization scales values between 0 and 1.

Standardization scales values to have a mean of 0 and standard deviation of 1.

For normalization and standardization we would normally use the “sklearn” library, but for the time being we are sticking with Pandas.
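Here is a minimal sketch of both techniques in plain Pandas, assuming df has the numeric “Age” and “Income” columns used earlier:

# Normalization (Min-Max): squeeze values into the [0, 1] range
df['Age_norm'] = (df['Age'] - df['Age'].min()) / (df['Age'].max() - df['Age'].min())

# Standardization (Z-Score): rescale to mean 0 and standard deviation 1
df['Income_std'] = (df['Income'] - df['Income'].mean()) / df['Income'].std()

You can verify the second one quickly: df['Income_std'].mean() should be approximately 0 and df['Income_std'].std() approximately 1.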


Conclusive Points

  • Data preprocessing ensures data quality and consistency for analysis.

  • Handle missing values using techniques like mean, median, or mode imputation.

  • Outliers can distort results; handle them with capping or removal.

  • Scaling and normalization bring features to comparable ranges.


Well, as far as today’s topic is concerned, things are spicing up. We have had our first introduction to the Pandas library and learnt the basics of data preprocessing. Do practice the above code with your own random examples. Until then, “Happy Learning”.