{"id":165,"date":"2020-06-28T12:11:15","date_gmt":"2020-06-28T07:41:15","guid":{"rendered":"http:\/\/m-shaeri.ir\/blog\/?p=165"},"modified":"2023-02-22T17:02:51","modified_gmt":"2023-02-22T13:32:51","slug":"machine-learning-data-preprocessing","status":"publish","type":"post","link":"https:\/\/mshaeri.com\/blog\/machine-learning-data-preprocessing\/","title":{"rendered":"Machine Learning Data Preprocessing"},"content":{"rendered":"\n<p>Machine learning algorithms run on data; an ML algorithm without data is like a BMW without gasoline. It is very important that you feed the algorithm the right data, in a suitable format and scale and with suitable features.<\/p>\n\n\n\n<p>In this post I walk through the data preprocessing steps that prepare raw data for machine learning algorithms in Python. These steps make the data suitable for the learning and prediction stages.<\/p>\n\n\n\n<p>Before the first step, I review the packages we need. A package is a collection of modules, and every module contains functions, classes and other items related to one subject. Here we need the pandas, matplotlib, seaborn and sklearn packages, which I describe briefly below:<\/p>\n\n\n\n<p><strong>Pandas:<\/strong> Pandas is a library for reading, modifying and writing datasets, and for getting useful information such as the minimum, maximum, mean and median of a dataset&#8217;s columns.<\/p>\n\n\n\n<p style=\"text-align:left\"><strong>Matplotlib, Seaborn:<\/strong> Matplotlib is a library of classes and functions for the graphical representation of data and of mathematical functions. Diagrams play a key role in data preprocessing, because we use them to understand the dataset better and to decide, for example, whether to add or remove a feature. Seaborn is a Python visualization package based on matplotlib. 
It provides easy-to-use functions for drawing informative statistical diagrams.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"840\" height=\"280\" src=\"http:\/\/m-shaeri.ir\/blog\/wp-content\/uploads\/2020\/02\/Seaborn.jpg\" alt=\"\" class=\"wp-image-173\" srcset=\"https:\/\/mshaeri.com\/blog\/wp-content\/uploads\/2020\/02\/Seaborn.jpg 840w, https:\/\/mshaeri.com\/blog\/wp-content\/uploads\/2020\/02\/Seaborn-300x100.jpg 300w, https:\/\/mshaeri.com\/blog\/wp-content\/uploads\/2020\/02\/Seaborn-768x256.jpg 768w\" sizes=\"(max-width: 840px) 100vw, 840px\" \/><figcaption>Statistical plots drawn with the Seaborn library<\/figcaption><\/figure>\n\n\n\n<p><strong>Sklearn:<\/strong> Sklearn (scikit-learn) is a comprehensive machine learning package for Python that not only includes classification, regression and clustering algorithms, but also has very helpful modules for preprocessing and feature extraction. It&#8217;s built on and compatible with NumPy, SciPy and Matplotlib. After finishing the dataset preprocessing, we will use one of the ML algorithms from this package to see how much our preprocessing affects the final result.<\/p>\n\n\n\n<p class=\"has-large-font-size\"><strong>Step 1<\/strong>: Reading the dataset and retrieving its information<\/p>\n\n\n\n<p>Thanks to the pandas library, it is very easy to read the whole dataset at once with a single line of Python code, by passing the local path of the CSV file to the <strong><code>read_csv()<\/code><\/strong> function. There are many other parameters one can pass to the function to change its behavior. For example, with the <code>usecols<\/code> parameter we can specify which columns to read, so that we don\u2019t have to edit the data frame later. 
<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\" style=\"background-color:#eeeeee;border:1px #dddddd solid;\"><font color=\"green\">import<\/font> os\n<font color=\"green\">import<\/font> matplotlib.pyplot <font color=\"green\">as<\/font> plt\n<font color=\"green\">import<\/font> pandas <font color=\"green\">as<\/font> pd\n<font color=\"green\">import<\/font> numpy <font color=\"green\">as<\/font> np\n<font color=\"green\">import<\/font> seaborn <font color=\"green\">as<\/font> sns\n<font color=\"green\">import<\/font> sklearn <font color=\"green\">as<\/font> sk\n\n\ndf_train <font color=\"blue\">=<\/font> pd.read_csv(<font color=\"brown\">\"task1\/data.csv\"<\/font>)  <font color=\"gray\"># read the given dataset<\/font>\ndf_train.head() <font color=\"gray\"># show the first 5 records of the dataset<\/font>\n<\/pre>\n\n\n\n<figure class=\"wp-block-image is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/m-shaeri.ir\/blog\/wp-content\/uploads\/2020\/02\/output1.jpg\" alt=\"\" class=\"wp-image-193\" width=\"671\" height=\"149\" srcset=\"https:\/\/mshaeri.com\/blog\/wp-content\/uploads\/2020\/02\/output1.jpg 621w, https:\/\/mshaeri.com\/blog\/wp-content\/uploads\/2020\/02\/output1-300x67.jpg 300w\" sizes=\"(max-width: 671px) 100vw, 671px\" \/><figcaption>The head() function shows the first 5 records of the dataset<\/figcaption><\/figure>\n\n\n\n<p>Pandas&nbsp;<strong><code>describe()<\/code><\/strong>&nbsp;is a useful method that shows statistical information about a data frame, such as the minimum, maximum, percentiles, mean and standard deviation. 
So we call it on our data frame to get an overview of the data.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\" style=\"background-color:#eeeeee;border:1px #dddddd solid;\"><font color=\"green\">print<\/font>(df_train.describe())<font color=\"gray\"> # show statistical details of the dataset<\/font>\n<\/pre>\n\n\n\n<figure class=\"wp-block-image is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/m-shaeri.ir\/blog\/wp-content\/uploads\/2020\/02\/output2.jpg\" alt=\"\" class=\"wp-image-199\" width=\"663\" height=\"162\" srcset=\"https:\/\/mshaeri.com\/blog\/wp-content\/uploads\/2020\/02\/output2.jpg 609w, https:\/\/mshaeri.com\/blog\/wp-content\/uploads\/2020\/02\/output2-300x73.jpg 300w\" sizes=\"(max-width: 663px) 100vw, 663px\" \/><figcaption>Statistical information about the dataset from the describe() function<\/figcaption><\/figure>\n\n\n\n<p>The post will be updated soon&#8230;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Machine learning algorithms run on data; an ML algorithm without data is like a BMW without gasoline. 
It is very important that you feed &hellip; <\/p>\n","protected":false},"author":1,"featured_media":211,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[40,29,37,36,39,38,31],"_links":{"self":[{"href":"https:\/\/mshaeri.com\/blog\/wp-json\/wp\/v2\/posts\/165"}],"collection":[{"href":"https:\/\/mshaeri.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mshaeri.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mshaeri.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mshaeri.com\/blog\/wp-json\/wp\/v2\/comments?post=165"}],"version-history":[{"count":41,"href":"https:\/\/mshaeri.com\/blog\/wp-json\/wp\/v2\/posts\/165\/revisions"}],"predecessor-version":[{"id":217,"href":"https:\/\/mshaeri.com\/blog\/wp-json\/wp\/v2\/posts\/165\/revisions\/217"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/mshaeri.com\/blog\/wp-json\/wp\/v2\/media\/211"}],"wp:attachment":[{"href":"https:\/\/mshaeri.com\/blog\/wp-json\/wp\/v2\/media?parent=165"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mshaeri.com\/blog\/wp-json\/wp\/v2\/categories?post=165"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mshaeri.com\/blog\/wp-json\/wp\/v2\/tags?post=165"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}