Here is the preface of the article: This note is based on Andrew Ng's machine learning course, so it is in English.
My English is limited and I cannot render the video content very well; some of the word order and phrasing will read strangely. It may be revised later, or it may not.
Model representation

This is an example of a supervised learning algorithm, because we're given the "right answer" for each of our examples. Moreover, this is an example of a regression problem, where the term regression refers to the fact that we are predicting a real-valued output, namely the price.
More formally, in supervised learning we have a data set, and this data set is called a training set. So for the housing prices example, we have a training set of different housing prices, and our job is to learn from this data how to predict the prices of houses.

Let's define some notation that we'll be using. I'm going to use $m$ to denote the number of training examples. So in this data set, if I have, say, 47 rows in this table, then I have 47 training examples and $m = 47$. I'll use $x$ to denote the input variables, often also called the features, and $y$ to denote the output variable, or target variable, which I'm going to predict; that's the second column. Continuing with the notation, I'm going to use $(x, y)$ to denote a single training example, so a single row in this table corresponds to a single training example. To refer to a specific training example, I'll use the notation $(x^{(i)}, y^{(i)})$, where the superscript $(i)$ is simply an index into the training set. Here's how this supervised learning algorithm works. We take the training set, like our training set of housing prices, and feed it to our learning algorithm. It is the job of the learning algorithm to output a function, which by convention is usually denoted $h$, where $h$ stands for hypothesis (假设). The job of the hypothesis is to take as input the size of a house, that is, the value of $x$, and try to output the estimated value of $y$ for the corresponding house. So $h$ is a function that maps from $x$ to $y$.
When designing a learning algorithm, the next thing we need to decide is how to represent this hypothesis $h$. As our initial choice, we're going to represent $h$ as follows:

$$h_\theta(x) = \theta_0 + \theta_1 x$$

As a shorthand, sometimes instead of writing $h_\theta(x)$ we just write $h(x)$. Plotting this in a picture, all it means is that we are going to predict that $y$ is a linear function of $x$. So that's the data set, and what this function is doing is predicting that $y$ is some straight-line function of $x$.
And why a linear function? Sometimes we'll want to fit more complicated, perhaps non-linear functions as well. But since the linear case is the simple building block, we will start with this example of fitting linear functions first, and build on it to eventually get more complex models and more complex learning algorithms.
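As a minimal sketch (with made-up parameter values, not numbers from the course), the linear hypothesis $h_\theta(x) = \theta_0 + \theta_1 x$ is just:

```python
def h(x, theta0, theta1):
    """Linear hypothesis: predict y (price) from input feature x (size)."""
    return theta0 + theta1 * x

# Hypothetical parameters: base price 50, plus 0.1 per unit of size
print(h(1000, 50.0, 0.1))  # → 150.0
```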
Let us also give this particular model a name. This model is called linear regression. This example, specifically, is linear regression with one variable, the variable being $x$: we're predicting all the prices as a function of the single variable $x$. Another name for this model is univariate (单变量) linear regression.
Cost function
Mathematical definition
In this section we'll define the concept of a cost function. This will let us figure out how to fit the best possible straight line to our data.
In linear regression we have a training set like that shown here.

To introduce a little bit more terminology, these $\theta_0$ and $\theta_1$ are the parameters of the model. What we're going to do is talk about how to go about choosing the values of these two parameters. With different choices of the parameters $\theta_0$ and $\theta_1$, we get different hypothesis functions.



In linear regression we have a training set like maybe the one plotted here. What we want to do is come up with values for the parameters $\theta_0$ and $\theta_1$, so that the straight line we get out of this somehow fits the data well, like maybe the line below.
So how do we come up with values of $\theta_0$ and $\theta_1$ that correspond to a good fit to the data? The idea is that we're going to choose our parameters $\theta_0$ and $\theta_1$ so that $h_\theta(x)$, the value we predict on input $x$, is at least close to the value $y$ for the examples in our training set. In our training set we're given a number of examples where we know the size of the house $x$ and the actual price $y$ it sold for. So let's try to choose values for the parameters so that, at least on the training set, given an $x$ in the training set we make reasonably accurate predictions for the $y$ values.

Let's formalize this. In linear regression, what we're going to do is solve a minimization problem: minimize over $\theta_0, \theta_1$, making the difference between $h_\theta(x)$ and $y$ small. One thing I'm going to do is try to minimize the square difference between the output of the hypothesis and the actual price of the house. Let's fill in some details. What I really want is to sum over my training set, from $i = 1$ to $m$, the square difference between the prediction of my hypothesis on the size of house number $i$ and the actual price that house number $i$ will sell for:

$$\min_{\theta_0, \theta_1} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

And to make some of the math a little bit easier, I'm actually going to look at $\frac{1}{m}$ times that, so we're trying to minimize the average error, which we'll write as $\frac{1}{2m}$ times the sum. Putting the constant one half in front just makes some of the math a little easier: minimizing one half of a function gives the same values of the parameters $\theta_0, \theta_1$ as minimizing the function itself.

Just to recap, we're posing this problem as: find the values of $\theta_0$ and $\theta_1$ so that $\frac{1}{2m}$ times the sum of squared errors between the predictions on the training set and the actual prices of the houses is minimized. This is going to be the overall objective function for linear regression. To rewrite this a little more cleanly, what we usually do by convention is define a cost function, which is exactly the formula above:

$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

And what we want to do is minimize $J(\theta_0, \theta_1)$ over $\theta_0$ and $\theta_1$. This is our cost function. It is also called the squared error function, or sometimes the squared error cost function. Why do we take the squares of the errors? It turns out that the squared error cost function is a reasonable choice and will work well for most regression problems. There are other cost functions that would work pretty well, but the squared error cost function is probably the most commonly used one for regression problems.
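As a minimal sketch (toy data, not the lecture's housing data), the squared error cost $J(\theta_0, \theta_1)$ can be computed directly from its definition:

```python
def compute_cost(xs, ys, theta0, theta1):
    """Squared error cost: J = (1/2m) * sum over i of (h(x_i) - y_i)^2."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        pred = theta0 + theta1 * x  # hypothesis h_theta(x)
        total += (pred - y) ** 2
    return total / (2 * m)

# Toy training set where y = x exactly, so theta0 = 0, theta1 = 1 gives zero cost
print(compute_cost([1, 2, 3], [1, 2, 3], 0.0, 1.0))  # → 0.0
```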
Intuition 1
Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$
Parameters: $\theta_0, \theta_1$
Cost Function: $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
Goal: $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$
Simplified:
Hypothesis: $h_\theta(x) = \theta_1 x$
Parameters: $\theta_1$
Cost Function: $J(\theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
Goal: $\min_{\theta_1} J(\theta_1)$
In order to better visualize the cost function $J$, we are going to work with a simplified hypothesis function like the one shown on the right. So we're going to use the simplified hypothesis, which is just $\theta_1$ times $x$. If you want, you can think of this as setting the parameter $\theta_0 = 0$, so we have only the one parameter $\theta_1$.
Using this simplified definition of the hypothesis and cost function, let's try to understand the cost function concept better.
It turns out there are two key functions we want to understand. The first is the hypothesis function, and the second is the cost function. Notice that for a fixed $\theta_1$, the hypothesis $h_\theta(x)$ is a function of $x$: the hypothesis is a function of the size of the house $x$. In contrast, the cost function $J(\theta_1)$ is a function of the parameter $\theta_1$, which controls the slope of the straight line. Let's plot these functions and try to understand them both better.

$h_\theta(x)$ (for fixed $\theta_1$, this is a function of $x$)

$J(\theta_1)$ (a function of the parameter $\theta_1$)
For different values of $\theta_1$, we can compute the corresponding values of $J(\theta_1)$, and each computed value gives us one point like that on the plot. By computing a whole range of values, you can slowly trace out what this function $J(\theta_1)$ looks like.
To recap, each value of $\theta_1$ corresponds to a different hypothesis, that is, to a different straight-line fit on the left. For each value of $\theta_1$ we can then derive a different value of $J(\theta_1)$, and we can use these to trace out the plot on the right. Now, remember that the optimization objective for our learning algorithm is to choose the value of $\theta_1$ that minimizes $J(\theta_1)$.
Well, looking at this curve, the value of $\theta_1$ that minimizes $J(\theta_1)$ is the one at the bottom of the bowl. And lo and behold, that is indeed the best possible straight-line fit through the data: with that setting of $\theta_1$, and just for this particular training set, we actually end up fitting it perfectly. That's why minimizing $J(\theta_1)$ corresponds to finding a straight line that fits the data well.
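This trace-out procedure can be sketched in a few lines of Python (toy data where $y = x$ exactly, so the minimizer should be $\theta_1 = 1$; the grid of candidate values is a made-up choice):

```python
def cost(theta1, xs, ys):
    """Simplified cost J(theta1) for the hypothesis h(x) = theta1 * x."""
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs, ys = [1, 2, 3], [1, 2, 3]            # toy training set
grid = [i / 10 for i in range(-10, 31)]  # candidate theta1 values, -1.0 .. 3.0
best = min(grid, key=lambda t: cost(t, xs, ys))
print(best, cost(best, xs, ys))  # → 1.0 0.0
```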
Intuition 2
In this section we will delve deeper and get even better intuition about what the cost function is doing.
This section assumes that you're familiar with contour plots.
Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$
Parameters: $\theta_0, \theta_1$
Cost Function: $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
Goal: $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$
Here's our problem formulation as usual, with the hypothesis, parameters, cost function, and our optimization objective.
Unlike before, we're going to keep both of the parameters $\theta_0$ and $\theta_1$ as we generate our visualizations of the cost function.
So, same as last time, we want to understand the hypothesis $h_\theta(x)$ and the cost function $J(\theta_0, \theta_1)$.

So here's the training set of housing prices, and let's make some hypothesis like that.

This is not a particularly good hypothesis. But if I pick particular values of $\theta_0$ and $\theta_1$, I end up with the hypothesis down here, which corresponds to that straight line. Now, given these values of $\theta_0$ and $\theta_1$, we want to plot the corresponding cost function.
When we have two parameters, it turns out the cost function also has a similar sort of bowl shape. In fact, depending on your training set, you might get a cost function that looks something like this.

So this is a 3D surface plot, where the axes are $\theta_0$ and $\theta_1$. As you vary the two parameters $\theta_0$ and $\theta_1$, you get different values of the cost function $J(\theta_0, \theta_1)$, and the height of the surface above a particular point $(\theta_0, \theta_1)$ indicates the value of $J(\theta_0, \theta_1)$. You can see it has this bowl-like shape; this bowl-shaped surface is what the cost function looks like.
For the purposes of illustration in the rest of this section, we're going to use contour plots to show these surfaces.

Here's an example of a contour figure, shown on the right, where the axes are $\theta_0$ and $\theta_1$. Each of these ovals, each of these ellipses, shows a set of points that take on the same value of $J(\theta_0, \theta_1)$.
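A sketch of what such a figure encodes (toy data and a hypothetical grid, not the lecture's numbers): evaluate $J$ over a grid of $(\theta_0, \theta_1)$ pairs; points on the same ellipse share the same cost, and the grid minimum sits at the center of the innermost ellipse.

```python
def J(theta0, theta1, xs, ys):
    """Squared error cost for the two-parameter hypothesis."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs, ys = [1.0, 2.0, 3.0], [2.0, 3.0, 4.0]  # toy data: y = x + 1
# Evaluate J over a coarse grid of (theta0, theta1) pairs
values = {(t0 / 4, t1 / 4): J(t0 / 4, t1 / 4, xs, ys)
          for t0 in range(-8, 9) for t1 in range(-8, 9)}
center = min(values, key=values.get)
print(center)  # → (1.0, 1.0), the parameters that fit y = x + 1 exactly
```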
Let’s look at some examples.
If we set $\theta_0$ and $\theta_1$ to certain values, we get the line on the left. Now, this line is really not such a good fit to the data, and so you find that its cost is a value out here (the red point on the right), pretty far from the minimum. It's a pretty high cost, because this line is just not that good a fit to the data.

Now here's a different hypothesis that's still not a great fit for the data, but maybe slightly better. So here's another point $(\theta_0, \theta_1)$, and this pair of parameter values corresponds to the hypothesis on the left, a flat line. This hypothesis again has some cost, and that cost is plotted as the height of the function at that point.
With these figures, I hope that gives you a better understanding of how values of the cost function correspond to different hypotheses, and how better hypotheses correspond to points that are closer to the minimum of this cost function $J$.
Now, of course, what we really want is an efficient algorithm, an efficient piece of software, for automatically finding the values of $\theta_0$ and $\theta_1$ that minimize the cost function $J$. In fact, we'll see later, when we look at more complicated examples, that we'll have higher-dimensional problems with more parameters. In those examples the figures cannot really be plotted and become much harder to visualize. So what we want is software to find the values that minimize this function.
Next we start to talk about an algorithm for automatically finding the values of $\theta_0$ and $\theta_1$ that minimize the cost function $J$.
Gradient descent
We've previously defined the cost function $J$. In this section I want to tell you about an algorithm called gradient descent (梯度下降) for minimizing it.
It turns out gradient descent is a more general algorithm, used not only in linear regression; it's actually used all over the place in machine learning. In later sections, we'll use gradient descent to minimize other functions as well, not just the cost function $J$ for linear regression.
So, in this section, I'm going to talk about gradient descent for minimizing some arbitrary (任意) function $J$. In later sections, we'll take this algorithm and apply it specifically to the cost function we defined for linear regression.
Here's the problem setup. We assume we have some function $J(\theta_0, \theta_1)$; maybe it's the cost function from linear regression, maybe it's some other function we want to minimize, and we want to come up with an algorithm for minimizing it as a function of $\theta_0$ and $\theta_1$. Just as I said, gradient descent actually applies to more general functions: imagine you have a function $J(\theta_0, \theta_1, \ldots, \theta_n)$ and you want to minimize over $\theta_0$ up to $\theta_n$. Gradient descent is an algorithm for solving this more general problem, but for the sake of brevity (简洁), we're going to pretend we have only two parameters.
Here's the idea for gradient descent. We're going to start off with some initial guesses for $\theta_0$ and $\theta_1$. It doesn't really matter what they are, but a common choice is to set $\theta_0 = 0$ and $\theta_1 = 0$, just initializing them to 0. Then, in gradient descent, we'll keep changing $\theta_0$ and $\theta_1$ a little bit to try to reduce $J(\theta_0, \theta_1)$, until hopefully we wind up at a minimum, or maybe a local minimum.
So, let's see in pictures what gradient descent does.

Let's say I'm trying to minimize this function. Notice the axes: $\theta_0$ and $\theta_1$ are the horizontal axes, and $J$ is the vertical axis, so the height of the surface shows $J$, and we want to minimize this function. We're going to start off with $\theta_0$ and $\theta_1$ at some point: imagine picking some values for them, which corresponds to starting at some point on the surface of this function. I did initialize them to 0, but sometimes you initialize to other values as well.
Now I want you to imagine that this figure shows a hill. Imagine this is the landscape of some grassy park with two hills like so, and imagine you are physically standing at that point on the hill, on this red hill in your park.

In gradient descent, what we're going to do is spin 360 degrees around, look all around us, and ask, "If I were to take a little baby step in some direction, and I want to go downhill as quickly as possible, what direction should I take that little baby step in?" It turns out that if you're standing at that point on the hill and look all around, you'll find that the best direction to take a little step downhill is roughly that direction.

Okay, and now you're at this new point on your hill. You again look all around and ask, "What direction should I step in to take a little baby step downhill?" You take a step in that direction, and then you keep going.

You know, from this new point, you look around, decide what direction will take you downhill most quickly, take another step, another step, and so on, until you converge to this local minimum down here.

Gradient descent has an interesting property. The first time we ran gradient descent, we started at this point over here. Now imagine we had initialized gradient descent just a couple of steps to the right, at that point on the upper right. If you repeat the process from there, looking all around, taking a little step in the direction of steepest descent, then looking around again and taking another step, and so on...

...then, having started just a couple of steps to the right, gradient descent will take you to this second local optimum on the right.
So if you had started at the first point, you would have wound up at the first local optimum; but starting from just a slightly different location, you would have wound up at a very different local optimum. This is a property of gradient descent that we'll say a little more about later.
So, that’s the intuition in pictures. Let’s look at the math.

This is the definition of the gradient descent algorithm. We're going to repeatedly do the following until convergence: update the parameter $\theta_j$ by taking $\theta_j$ and subtracting from it $\alpha$ times a derivative term:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \quad \text{(for } j = 0 \text{ and } j = 1\text{)}$$

So, let's see. There are a lot of details in this equation.
First, the notation $:=$. We use $:=$ to denote assignment (赋值); it's the assignment operator. In contrast, if I use the equals sign and write $a = b$, then this is a truth assertion (真假判定): I'm asserting that the value of $a$ equals the value of $b$. So that's the first part of the definition.
Next, $\alpha$ here is a number called the learning rate. What $\alpha$ does is basically control how big a step we take downhill with gradient descent. If $\alpha$ is very large, that corresponds to a very aggressive gradient descent procedure where we take huge steps downhill; if $\alpha$ is very small, then we take little, little baby steps downhill. I'll come back and say more later about how to set $\alpha$.
Finally, there is the term $\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$. That's a derivative term (导数项); we won't talk about it right now, but we will derive it and say exactly what it is later.
Now, there's one more subtlety about gradient descent: we're going to update both $\theta_0$ and $\theta_1$, so this update takes place for $j = 0$ and $j = 1$.
The subtlety of how you implement gradient descent is that, for this update equation, you want to simultaneously update $\theta_0$ and $\theta_1$. What I mean is that we're going to update $\theta_0 := \theta_0 - (\text{something})$ and $\theta_1 := \theta_1 - (\text{something})$, and the way to implement this is to compute both right-hand sides first, for $\theta_0$ and $\theta_1$, and then update $\theta_0$ and $\theta_1$ at the same time.
Let me say what I mean by that. The correct implementation of gradient descent uses simultaneous updates:

$$\text{temp0} := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$$
$$\text{temp1} := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$$
$$\theta_0 := \text{temp0}$$
$$\theta_1 := \text{temp1}$$

So we basically compute the right-hand sides, store them in temp0 and temp1, and then update $\theta_0$ and $\theta_1$ simultaneously. That is the correct implementation.
In contrast, an implementation that updates $\theta_0$ before computing the right-hand side for $\theta_1$ is incorrect, because it does not do a simultaneous update.
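In code, the simultaneous update looks like this (a sketch with a hypothetical gradient function, not yet the linear regression gradient):

```python
def gd_step(theta0, theta1, alpha, grad):
    """One gradient descent step with a simultaneous update.

    grad(theta0, theta1) returns the pair of partial derivatives of J.
    """
    g0, g1 = grad(theta0, theta1)  # both gradients use the OLD parameter values
    temp0 = theta0 - alpha * g0
    temp1 = theta1 - alpha * g1
    return temp0, temp1            # both parameters updated together

# Hypothetical example: J(t0, t1) = t0^2 + t1^2, so the gradient is (2*t0, 2*t1)
t0, t1 = gd_step(1.0, 1.0, 0.1, lambda a, b: (2 * a, 2 * b))
```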

Gradient descent intuition
Let’s delve deeper, and in this section, get better intuition about what the algorithm is doing and why the steps of the gradient descent algorithm might make sense.

What I want to do is give you better intuition about what each of these two terms ($\alpha$ and the derivative term) is doing, and why, when put together, the entire update makes sense.
In order to convey these intuitions, what I want to do is use a slightly simpler example where we minimize a function of just one parameter. So we have a cost function $J(\theta_1)$ of a single parameter $\theta_1$, just so we can draw 1D plots, which are a little bit simpler to look at. Let's try to understand what gradient descent would do on this function.

So let's say here's my function $J(\theta_1)$, and let's say I've initialized gradient descent with $\theta_1$ at this location.

So imagine that we start off at that point on my function. What gradient descent will do is update $\theta_1 := \theta_1 - \alpha \frac{d}{d\theta_1} J(\theta_1)$, so we're going to compute this derivative.

What the derivative at this point does is basically say: take the tangent to the function at that point, like that straight red line just touching the function, and look at the slope of that red line. That's what the derivative is: the slope of the line that is tangent to the function at that point. And the slope of the line is, of course, just the height divided by the horizontal distance.
Now this line has a positive slope, so it has a positive derivative. And so my update to $\theta_1$ is $\theta_1 := \theta_1 - \alpha \times (\text{some positive number})$. The learning rate $\alpha$ is always a positive number, so $\theta_1$ gets updated as $\theta_1$ minus something, and I end up moving to the left, decreasing $\theta_1$. We can see this is the right thing to do, because moving in this direction gets me closer to the minimum over there.
So far, gradient descent seems to be doing the right thing. When we instead initialize the parameter on the left of the minimum, the derivative there is negative, so the update increases $\theta_1$, which is again the thing we want: it moves us closer to the minimum.

So this hopefully explains the intuition behind what the derivative term is doing. Let's next take a look at the learning rate $\alpha$ and try to figure out what that's doing. Here's my gradient descent update rule again, and let's look at what can happen if $\alpha$ is either too small or too large.

If $\alpha$ is too small, gradient descent takes tiny baby steps and can be very slow to reach the minimum. If $\alpha$ is too large, gradient descent can overshoot the minimum, fail to converge, or even diverge. It's just like the picture below.
Now I have another question for you: what if your parameter $\theta_1$ is already at a local minimum? What do you think one step of gradient descent will do?

So let's suppose you initialize $\theta_1$ at a local minimum. It turns out that at a local minimum the derivative is equal to zero, so the gradient descent update gives $\theta_1 := \theta_1 - \alpha \cdot 0 = \theta_1$. What this means is that if you're already at a local optimum, the update leaves $\theta_1$ unchanged. So if your parameter is at a local minimum, one step of gradient descent does absolutely nothing; it doesn't change the parameter, which is what you want, since it keeps your solution at the local optimum. This also explains why gradient descent can converge to a local minimum even with the learning rate $\alpha$ fixed.

As we approach a local minimum, gradient descent automatically takes smaller steps, because the derivative term gets smaller. So there's no need to decrease $\alpha$ over time.
So that's the gradient descent algorithm. You can use it to try to minimize any cost function $J$, not just the cost function we defined for linear regression.
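The shrinking-step behavior can be seen in a tiny sketch (a made-up one-parameter cost $J(\theta) = (\theta - 2)^2$, not the housing data):

```python
# Gradient descent on J(theta) = (theta - 2)^2, whose derivative is 2*(theta - 2)
theta, alpha = 0.0, 0.1
steps = []
for _ in range(50):
    grad = 2 * (theta - 2.0)
    step = alpha * grad
    steps.append(abs(step))
    theta = theta - step

# theta approaches the minimum at 2, and the steps shrink with no change to alpha
print(round(theta, 4))  # → 2.0
```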
Gradient descent for linear regression
In this section we’re going to put together gradient descent with our cost function, and that will give us an algorithm for linear regression for fitting a straight line to our data.

What we're going to do is apply gradient descent to minimize the squared error cost function.
In order to apply gradient descent, in order to write this piece of code, the key term we need is the derivative term. So we need to figure out what the partial derivative $\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$ is, plugging in the definition of the cost function $J$. It turns out we need to work out two cases, for $j = 0$ and for $j = 1$, and I'm just going to write out the answers:

$$\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$$

$$\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$$

Having worked out these derivatives, which are really just the slope of the cost function $J$, we can now plug them back into our gradient descent algorithm.
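Putting the update rule and these derivatives together gives the whole algorithm. Here is a minimal sketch (toy data generated from $y = 2x + 1$; the learning rate and iteration count are made-up choices, not values from the course):

```python
def gradient_descent(xs, ys, alpha=0.01, iters=10000):
    """Batch gradient descent for univariate linear regression."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iters):
        # Errors computed with the CURRENT parameters (simultaneous update)
        errors = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

xs, ys = [1, 2, 3, 4], [3, 5, 7, 9]  # toy data from y = 2x + 1
t0, t1 = gradient_descent(xs, ys)
print(round(t0, 3), round(t1, 3))  # → 1.0 2.0
```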
So let's see how gradient descent works on this problem. One of the issues we saw with gradient descent is that it can be susceptible to local optima: when I first explained gradient descent, we saw how, depending on where you initialize, you can end up at different local optima.


But it turns out that the cost function for linear regression is always going to be a bowl-shaped function like this. The technical term is that this is a convex function (凸函数), and so this function doesn't have any local optima except for the one global optimum. When you run gradient descent on this type of cost function, which you get whenever you're using linear regression, it will always converge to the global optimum, because there are no other local optima. So now, let's see this algorithm in action.

As usual, here are plots of the hypothesis function $h_\theta(x)$ and of the cost function $J(\theta_0, \theta_1)$. Usually you would initialize your parameters at 0, but for illustration, in this run of gradient descent we have initialized $(\theta_0, \theta_1) = (900, -0.1)$, which corresponds to $h_\theta(x) = 900 - 0.1x$, and to this point out here on the cost function.

Now, if we take one step of gradient descent, we end up going from this point out here a little bit down and to the left, to that second point over there. And you notice that my line changed a little bit.

As I take another step of gradient descent, my line on the left changes, and I have also moved to a new point on my cost function.
As I take further steps of gradient descent, the cost keeps going down, so my parameters are following this trajectory. And if you look on the left, this corresponds to hypotheses that seem to be getting to be better and better fits to the data...







...until eventually I wind up at the global minimum, and this global minimum corresponds to the hypothesis that gives me a good fit to the data. So that's gradient descent: we've just run it and gotten a good fit to my data set of housing prices, and you can now use it to predict.
Finally, just to give this algorithm another name: the algorithm we just went over is sometimes called batch gradient descent. It turns out that in machine learning we're not always great at naming algorithms, but the term batch gradient descent refers to the fact that, in every step of gradient descent, we look at all of the training examples. When computing the derivatives, we compute these sums, so in every separate step of gradient descent we end up computing a sum over all $m$ training examples. The term batch refers to the fact that we're looking at the entire batch of training examples.
Afterword: the cover picture is reproduced from the pixiv artist Valya.