Last week I started to implement **graphical Gradient Descent algorithm **for linear regression (supervised machine learning). Gradient Descent is a general algorithm and is used not only in linear regression, It’s actually used in non-linear regression and all over the place in machine learning.

During implementation I learnt more details of this approach. I firmly believe that there is an obvious gap between theoretical insights and implementation in every algorithm regarding the programming language that we use in implementing process, the data structure that we store the algorithm’s data and even the hardware that we are implementing the algorithm on. Sometimes an algorithm with worse performance in theory performs better in practice due to the situations in current problem.

At first, I think this algorithm is highly vulnerable in environments with noisy data. For example, if we have training data with 9 records as below:

x |
y |

1 |
2 |

2 |
2 |

2.5 |
3 |

3 |
6 |

3.5 |
6.5 |

4 |
8 |

4.5 |
8.4 |

5 |
9 |

5.5 |
10 |

In order to predict y for any given x, we should produce a Hypothesis function based on these records. Hypothesis is a function that predicts y for a given value of x, based on given training set; It’s a linear formula consists of two parameters **theta0** and **tetha1:**

**h(x)= theta0** + **tetha1 . x**

In other word, the Gradient Descent’s task is to find best tetha0 and thetha1 that minimize the cost function and boost the prediction accuracy as much as possible. The cost function is :

For this data set, the **theta0** and **tetha1** are -0.89 and 2.02 respectively, so the hypothesis function is:

h(x)=-0.89+2.02x

The red line in the figure below shows the h(x) function produced by Gradient Descent Algorithm:

But, if we add (x=4,y=19) and (x=3,y=18) as noises to the pervious set , the algorithm will return **theta0**=1.22 and **tetha1**=2.06. So the new hypothesis will be :

**h(x)=1.22+2.06x**

After adding two out layer points to our data the predicted line should appear like the figure below .It’s clear from the figure that the red line which shows new h(x) function doesn’t represent the actual data inclination, and prediction can’t be reliable :

In addition, performance of the algorithm is highly depends on the Alpha (learning rate). In fact, if the learning rate is too small the algorithm may take a very long time to converge. And if we choose very large learning rate, gradient descent will not converge at all.