Machine learning algorithms work with data: an ML algorithm without data is like a BMW without gasoline. It is very important that you feed the algorithm the right data, in a suitable format and scale and with suitable features.
In this post I walk through the data preprocessing steps that prepare raw data for machine learning algorithms in Python. These steps make the data suitable for the learning and prediction stages.
Before the first step, I review the packages that we need in our practice. A package is a collection of modules, and every module contains functions, classes and other items related to one subject. Here we need the pandas, matplotlib, seaborn and sklearn packages. Before we dig into the preprocessing steps, I will briefly describe these packages:
Pandas: pandas is a library of functions to read, modify and write datasets, and to get useful information such as the minimum, maximum, mean and median of a dataset's columns.
Matplotlib, Seaborn: matplotlib is a library of classes and functions for the graphical representation of data and of mathematical functions. Diagrams play a key role in data preprocessing, because we use them to understand the dataset better and we make decisions (such as adding or removing a feature) based on them. Seaborn is a Python visualization package based on matplotlib. It provides easy-to-use functions for drawing informative statistical diagrams.
Sklearn: sklearn (scikit-learn) is a comprehensive machine learning package for Python that not only includes classification, regression and clustering algorithms, but also has very helpful modules for preprocessing and feature extraction. It is built on and compatible with NumPy, SciPy and matplotlib. After finishing the dataset preprocessing, we will use one of the ML algorithms from this package to see how much our preprocessing affects the final result.
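To make the package descriptions above a little more concrete, here is a minimal sketch of pandas computing column statistics and sklearn rescaling the same columns. The toy data and the column names (age, income) are made up for illustration, and StandardScaler is just one of the preprocessing tools sklearn offers, not necessarily the one used later in this post.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "age": [22, 35, 58, 41],
    "income": [1800.0, 3200.0, 5400.0, 4100.0],
})

# pandas: per-column summary statistics
col_min = df.min()
col_max = df.max()
col_mean = df.mean()
col_median = df.median()
print(col_mean)

# sklearn: rescale every column to zero mean and unit variance
scaler = StandardScaler()
scaled = scaler.fit_transform(df)  # returns a NumPy array, not a DataFrame
print(scaled.mean(axis=0))         # close to 0 for each column
```

Note that fit_transform() hands back a plain NumPy array, so the column names are lost unless you wrap the result in a new data frame.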
Step 1: Reading the dataset and retrieving its information
Thanks to the pandas library, it is very easy to read the whole dataset at once with a single line of Python code, by passing the local path of the CSV file to the read_csv() function. However, there are many other parameters one can pass to the function to change its behavior. For example, we can specify which columns we need to read (the usecols parameter) so that we don't have to edit the data frame later.
import os
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import sklearn as sk

df_train = pd.read_csv("task1/data.csv")  # read the given dataset
df_train.head()  # shows the top 5 records of the dataset
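The column selection mentioned above could look like the following sketch. Since the real data.csv is not shown in this post, the CSV content and the column names here are hypothetical; an in-memory string stands in for the file so the example runs on its own.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """id,age,income,city
1,22,1800,Berlin
2,35,3200,Paris
3,58,5400,Rome
"""

# usecols keeps only the listed columns while the file is being parsed.
df_subset = pd.read_csv(io.StringIO(csv_text), usecols=["age", "income"])
print(df_subset.columns.tolist())
print(df_subset.shape)
```

With a real file you would pass its path instead of the StringIO object. Dropping unneeded columns at read time also saves memory on large datasets.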
The pandas describe() method is useful because it shows statistical information about a data frame, such as the minimum, maximum, percentiles, mean and standard deviation of each numeric column. So we call it on our data frame to get an overview of it.
print(df_train.describe())  # show statistical details of the dataset
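Because describe() returns a data frame itself, individual statistics can be picked out of it. The sketch below uses a small made-up data frame (the age and city columns are assumptions, not columns of the dataset used in this post) to show this, and also how include="all" brings non-numeric columns into the summary.

```python
import pandas as pd

# Hypothetical toy data with one numeric and one text column.
df = pd.DataFrame({
    "age": [22, 35, 58, 41],
    "city": ["Berlin", "Paris", "Rome", "Oslo"],
})

summary = df.describe()  # only numeric columns by default
print(summary.index.tolist())  # names of the statistics, one per row

# Individual values are looked up by statistic name and column name.
age_mean = summary.loc["mean", "age"]
age_median = summary.loc["50%", "age"]  # the 50th percentile is the median
print(age_mean, age_median)

# include="all" also summarises non-numeric columns (unique, top, freq).
summary_all = df.describe(include="all")
print(summary_all)
```

In the default summary the text column is silently skipped, which is easy to miss on a wide dataset.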
The post will be updated soon…