{"id":76,"date":"2016-04-01T20:37:54","date_gmt":"2016-04-01T16:07:54","guid":{"rendered":"http:\/\/m-shaeri.ir\/blog\/?p=76"},"modified":"2022-01-08T13:17:22","modified_gmt":"2022-01-08T09:47:22","slug":"gradient-descent-algorithm-for-linear-regression","status":"publish","type":"post","link":"https:\/\/mshaeri.com\/blog\/gradient-descent-algorithm-for-linear-regression\/","title":{"rendered":"Gradient Descent Algorithm for Linear Regression"},"content":{"rendered":"<p style=\"text-align: justify;\">Last week I started to implement a <strong>graphical Gradient Descent algorithm <\/strong>for linear regression (supervised machine learning). Gradient Descent is a general algorithm: it is used not only in linear regression but also in non-linear regression and throughout machine learning.<\/p>\n<p style=\"text-align: center;\"><a title=\"Download Gradient Descent Algorithm for two variable linear regression\" href=\"http:\/\/m-shaeri.ir\/prjct\/GD4LR.zip\">Download The Program<\/a><\/p>\n<p style=\"text-align: justify;\">During the implementation I learnt more about the details of this approach. I firmly believe that every algorithm has an obvious gap between its theoretical analysis and its implementation, depending on the programming language we implement it in, the data structure we store the algorithm\u2019s data in, and even the hardware we run it on. Sometimes an algorithm that performs worse in theory performs better in practice because of the specifics of the problem at hand.<\/p>\n<p style=\"text-align: justify;\">First, I think this algorithm is highly vulnerable to noisy data. 
For example, if we have training data with 9 records as below:<\/p>\n<div align=\"center\">\n<table border=\"1\" width=\"150\" cellspacing=\"0\" cellpadding=\"0\" align=\"center\">\n<tbody>\n<tr bgcolor=\"gray\">\n<td style=\"text-align: center;\" valign=\"top\"><b>x<\/b><\/td>\n<td style=\"text-align: center;\" valign=\"top\"><b>y<\/b><\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: center;\" valign=\"top\"><b>1<\/b><\/td>\n<td style=\"text-align: center;\" valign=\"top\"><b>2<\/b><\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: center;\" valign=\"top\"><b>2<\/b><\/td>\n<td style=\"text-align: center;\" valign=\"top\"><b>2<\/b><\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: center;\" valign=\"top\"><b>2.5<\/b><\/td>\n<td style=\"text-align: center;\" valign=\"top\"><b>3<\/b><\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: center;\" valign=\"top\"><b>3<\/b><\/td>\n<td style=\"text-align: center;\" valign=\"top\"><b>6<\/b><\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: center;\" valign=\"top\"><b>3.5<\/b><\/td>\n<td style=\"text-align: center;\" valign=\"top\"><b>6.5<\/b><\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: center;\" valign=\"top\"><b>4<\/b><\/td>\n<td style=\"text-align: center;\" valign=\"top\"><b>8<\/b><\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: center;\" valign=\"top\"><b>4.5<\/b><\/td>\n<td style=\"text-align: center;\" valign=\"top\"><b>8.4<\/b><\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: center;\" valign=\"top\"><b>5<\/b><\/td>\n<td style=\"text-align: center;\" valign=\"top\"><b>9<\/b><\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: center;\" valign=\"top\"><b>5.5<\/b><\/td>\n<td style=\"text-align: center;\" valign=\"top\"><b>10<\/b><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p style=\"text-align: justify;\">In order to predict y for any given x, we should produce a Hypothesis function based on these records. 
The hypothesis is a function that predicts y for a given value of x, based on the given training set; it is a linear formula with two parameters, <b>theta0<\/b> and <b>theta1<\/b>:<\/p>\n<p style=\"text-align: center;\" align=\"center\"><b>h(x) = theta0<\/b> + <b>theta1 . x<\/b><\/p>\n<p style=\"text-align: justify;\">In other words, Gradient Descent\u2019s task is to find the theta0 and theta1 that minimize the cost function and so make the predictions as accurate as possible. The cost function is:<\/p>\n<p style=\"text-align: center;\"><a href=\"http:\/\/m-shaeri.ir\/blog\/wp-content\/uploads\/2016\/04\/cost.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"   aligncenter wp-image-132 size-full\" src=\"http:\/\/m-shaeri.ir\/blog\/wp-content\/uploads\/2016\/04\/cost.jpg\" alt=\"\" width=\"340\" height=\"40\" srcset=\"https:\/\/mshaeri.com\/blog\/wp-content\/uploads\/2016\/04\/cost.jpg 340w, https:\/\/mshaeri.com\/blog\/wp-content\/uploads\/2016\/04\/cost-300x35.jpg 300w\" sizes=\"(max-width: 340px) 100vw, 340px\" \/><\/a><\/p>\n<p style=\"text-align: justify;\">For this data set, <b>theta0<\/b> and <b>theta1<\/b> are -0.89 and 2.02 respectively, so the hypothesis function is:<\/p>\n<p style=\"text-align: center;\" align=\"center\">h(x) = -0.89 + 2.02x<\/p>\n<p style=\"text-align: justify;\">The red line in the figure below shows the h(x) function produced by the Gradient Descent algorithm:<\/p>\n<figure id=\"attachment_587\" aria-describedby=\"caption-attachment-587\" style=\"width: 785px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/m-shaeri.ir\/blog\/wp-content\/uploads\/2016\/04\/LinearRegressionOK.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-587\" src=\"https:\/\/m-shaeri.ir\/blog\/wp-content\/uploads\/2016\/04\/LinearRegressionOK.jpg\" alt=\"LinearRegression\" width=\"785\" height=\"854\" srcset=\"https:\/\/mshaeri.com\/blog\/wp-content\/uploads\/2016\/04\/LinearRegressionOK.jpg 785w, 
https:\/\/mshaeri.com\/blog\/wp-content\/uploads\/2016\/04\/LinearRegressionOK-276x300.jpg 276w, https:\/\/mshaeri.com\/blog\/wp-content\/uploads\/2016\/04\/LinearRegressionOK-768x836.jpg 768w\" sizes=\"(max-width: 785px) 100vw, 785px\" \/><\/a><figcaption id=\"caption-attachment-587\" class=\"wp-caption-text\">Linear regression captures the trend of the data<\/figcaption><\/figure>\n<p style=\"text-align: justify;\">But if we add (x=4, y=19) and (x=3, y=18) as noise to the previous set, the algorithm returns <b>theta0<\/b>=1.22 and <b>theta1<\/b>=2.06, so the new hypothesis is:<\/p>\n<p style=\"text-align: justify;\" align=\"center\"><b>h(x) = 1.22 + 2.06x<\/b><\/p>\n<p style=\"text-align: justify;\">After adding these two outliers to our data, the fitted line looks like the figure below. It\u2019s clear from the figure that the red line, which shows the new h(x) function, no longer represents the actual trend of the data, so its predictions can\u2019t be relied on:<\/p>\n<figure id=\"attachment_588\" aria-describedby=\"caption-attachment-588\" style=\"width: 780px\" class=\"wp-caption alignnone\"><a href=\"https:\/\/m-shaeri.ir\/blog\/wp-content\/uploads\/2016\/04\/LinearRegressionNOK.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-588\" src=\"https:\/\/m-shaeri.ir\/blog\/wp-content\/uploads\/2016\/04\/LinearRegressionNOK.jpg\" alt=\"LinearRegression\" width=\"780\" height=\"856\" srcset=\"https:\/\/mshaeri.com\/blog\/wp-content\/uploads\/2016\/04\/LinearRegressionNOK.jpg 780w, https:\/\/mshaeri.com\/blog\/wp-content\/uploads\/2016\/04\/LinearRegressionNOK-273x300.jpg 273w, https:\/\/mshaeri.com\/blog\/wp-content\/uploads\/2016\/04\/LinearRegressionNOK-768x843.jpg 768w\" sizes=\"(max-width: 780px) 100vw, 780px\" \/><\/a><figcaption id=\"caption-attachment-588\" class=\"wp-caption-text\">Noise affects the prediction<\/figcaption><\/figure>\n<p style=\"text-align: justify;\">In addition, the performance of the algorithm strongly 
depends on the learning rate (alpha). If the learning rate is too small, the algorithm may take a very long time to converge. If it is too large, gradient descent can overshoot the minimum on every step and fail to converge, or even diverge.<\/p>\n<p style=\"text-align: center;\"><a title=\"Download Gradient Descent Algorithm for two variable linear regression\" href=\"http:\/\/m-shaeri.ir\/prjct\/GD4LR.zip\">Download The Program<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Last week I started to implement graphical Gradient Descent algorithm for linear regression (supervised machine learning).\u00a0Gradient Descent is a general algorithm and is used not &hellip; <\/p>\n","protected":false},"author":1,"featured_media":204,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1,28,35],"tags":[11,30,29,31],"_links":{"self":[{"href":"https:\/\/mshaeri.com\/blog\/wp-json\/wp\/v2\/posts\/76"}],"collection":[{"href":"https:\/\/mshaeri.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mshaeri.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mshaeri.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mshaeri.com\/blog\/wp-json\/wp\/v2\/comments?post=76"}],"version-history":[{"count":35,"href":"https:\/\/mshaeri.com\/blog\/wp-json\/wp\/v2\/posts\/76\/revisions"}],"predecessor-version":[{"id":590,"href":"https:\/\/mshaeri.com\/blog\/wp-json\/wp\/v2\/posts\/76\/revisions\/590"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/mshaeri.com\/blog\/wp-json\/wp\/v2\/media\/204"}],"wp:attachment":[{"href":"https:\/\/mshaeri.com\/blog\/wp-json\/wp\/v2\/media?parent=76"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mshaeri.com\/blog\/wp-json\/wp\/v2\/categories?post=76"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mshaeri.com\/blog\/wp-json\/
wp\/v2\/tags?post=76"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}