by Suraj Rampure (suraj.rampure@berkeley.edu)
Introduction
Suppose we're given some set of points $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, and want to fit the model
$$\hat{y} = \theta_1 x + \theta_2$$
that minimizes $L_2$ loss.
This is a problem you've likely seen multiple times.
- In previous statistics courses, like Data 8, you've derived expressions for $\theta_1, \theta_2$ (usually called $a, b$) in terms of $r$, known as the "correlation coefficient"
- In this course, you've learned the tools to formulate this as a scalar optimization problem (take the derivative of the loss, set it to 0, and solve for the parameters)
- You've also learned the rules to formulate this as a matrix calculus optimization problem, and can use the normal equation to find a solution
Here, we will:
- Provide a warmup to the idea of optimizing scalar loss functions by finding the $\theta$ that minimizes the $L_2$ loss of $\hat{y} = \theta x$
- Derive the solutions for $\theta_1, \theta_2$ by taking partial derivatives and solving
- Show the connection between the solutions for $\theta_1, \theta_2$ and the formulas for linear regression given in traditional statistics courses
- Create a feature matrix $\phi$ and weights vector $\theta$ and show that the solution $\theta^* = (\phi^T\phi)^{-1}\phi^T y$ yields the same $\theta_1, \theta_2$ as the derivation in (2)
Note: In lecture, this was referred to as the "slope-intercept model."
1. Warmup
Suppose we're given some set of points $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, and want to fit the model
$$\hat{y} = \theta x$$
where $x, \theta, \hat{y}$ are all scalars.
Our total loss is then
$$L(\theta) = \sum_{i=1}^n (y_i - \theta x_i)^2 = (y_1 - \theta x_1)^2 + (y_2 - \theta x_2)^2 + \ldots + (y_n - \theta x_n)^2$$
Taking the partial derivative with respect to $\theta$ and setting it equal to 0:
$$\frac{\partial L}{\partial \theta} = \sum_{i=1}^n 2(y_i - \theta x_i)(-x_i) = 2\sum_{i=1}^n (\theta x_i^2 - x_i y_i) = 0$$
$$\theta \sum_{i=1}^n x_i^2 = \sum_{i=1}^n x_i y_i \implies \theta = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}$$
This result should be familiar.
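As a quick numerical sanity check (a minimal sketch, not part of the derivation itself, assuming NumPy and some made-up data), we can verify that this closed form agrees with a least squares solver applied to the single-column design matrix:

```python
import numpy as np

# Hypothetical data: roughly y = 3x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3 * x + rng.normal(0, 1, size=50)

# Closed-form solution from the derivation: theta = sum(x*y) / sum(x^2).
theta_closed_form = np.sum(x * y) / np.sum(x ** 2)

# Least squares on the single-column design matrix [x] should agree.
theta_lstsq, *_ = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)

print(theta_closed_form, theta_lstsq[0])  # the two values should match
```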
Now, we can move onto the problem at hand.
2. 2D Linear Regression, as an Optimization Problem
Suppose we're given some set of points $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, and want to fit the model
$$\hat{y} = \theta_1 x + \theta_2$$
Using $L_2$ loss, we have:
$$L(\theta_1, \theta_2) = \sum_{i=1}^n (y_i - \theta_1 x_i - \theta_2)^2$$
Up until now, we've only dealt with single-parameter models (i.e. just one $\theta$), where we only needed to take a single derivative and optimize over one variable. Since we now have two $\theta$s, we will need to take partial derivatives with respect to each. We will then end up with a system of two equations and two unknowns, allowing us to solve for $\theta_1$ and $\theta_2$.
(Note: We will simplify notation by using $\mu_x = \frac{1}{n}\sum_{i=1}^n x_i$ and $\mu_y = \frac{1}{n}\sum_{i=1}^n y_i$.)
Taking the partial derivative with respect to $\theta_1$ and setting it equal to 0:
$$\frac{\partial L}{\partial \theta_1} = \sum_{i=1}^n 2(y_i - \theta_1 x_i - \theta_2)(-x_i) = 2\sum_{i=1}^n (\theta_1 x_i^2 + \theta_2 x_i - x_i y_i) = 0$$
$$\implies \theta_1 \sum_{i=1}^n x_i^2 + n\mu_x\theta_2 = \sum_{i=1}^n x_i y_i \tag{1}$$
Taking the partial derivative with respect to $\theta_2$ and setting it equal to 0:
$$\frac{\partial L}{\partial \theta_2} = \sum_{i=1}^n 2(y_i - \theta_1 x_i - \theta_2)(-1) = 2\sum_{i=1}^n (\theta_1 x_i + \theta_2 - y_i) = 0$$
$$\theta_1 \sum_{i=1}^n x_i + n\theta_2 = \sum_{i=1}^n y_i$$
$$n\mu_x\theta_1 + n\theta_2 = n\mu_y \implies \mu_x\theta_1 + \theta_2 = \mu_y \tag{2}$$
where we used the facts that $\sum_{i=1}^n x_i = n\mu_x$ and $\sum_{i=1}^n \theta_2 = n\theta_2$.
Now, we have a system of two equations in the two unknowns $\theta_1, \theta_2$. To solve, we can isolate $\theta_2$ in equation (2):
$$\theta_2 = \mu_y - \theta_1\mu_x$$
and substitute this into equation (1):
$$\theta_1 \sum_{i=1}^n x_i^2 + n\mu_x(\mu_y - \theta_1\mu_x) = \sum_{i=1}^n x_i y_i$$
$$\theta_1 \left(\sum_{i=1}^n x_i^2 - n\mu_x^2\right) + n\mu_x\mu_y = \sum_{i=1}^n x_i y_i$$
$$\theta_1 = \frac{\sum_{i=1}^n x_i y_i - n\mu_x\mu_y}{\sum_{i=1}^n x_i^2 - n\mu_x^2}$$
This gives us our final solution:
$$\theta_1 = \frac{\sum_{i=1}^n x_i y_i - n\mu_x\mu_y}{\sum_{i=1}^n x_i^2 - n\mu_x^2}, \qquad \theta_2 = \mu_y - \theta_1\mu_x$$
Technically, to be complete, we'd need to check the second-order conditions (that $\frac{\partial^2 L}{\partial \theta_1^2}$ and $\frac{\partial^2 L}{\partial \theta_2^2}$ are positive, and more precisely that the Hessian of $L$ is positive semidefinite, so that our critical point really is a minimum), but we can take that leap of faith for now.
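To convince ourselves the closed forms are right, here is a small numerical check (a sketch only, assuming NumPy and synthetic data) that compares them against np.polyfit, which fits a degree-1 polynomial by least squares:

```python
import numpy as np

# Hypothetical data: roughly y = 2x + 5 plus noise.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 2 * x + 5 + rng.normal(0, 1, size=100)

n = len(x)
mu_x, mu_y = x.mean(), y.mean()

# Closed-form solutions from the derivation above.
theta_1 = (np.sum(x * y) - n * mu_x * mu_y) / (np.sum(x ** 2) - n * mu_x ** 2)
theta_2 = mu_y - theta_1 * mu_x

# np.polyfit(x, y, 1) returns [slope, intercept] for the least squares line.
slope, intercept = np.polyfit(x, y, 1)

print(theta_1, slope)      # should match
print(theta_2, intercept)  # should match
```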
3. Connection to Previous Statistics Courses
The solutions for $\theta_1$ and $\theta_2$ we just derived should look very similar to the formulas for linear regression you might have learned in previous courses, such as Data 8.
In such courses, we define r, the correlation coefficient, as:
$$r = \frac{1}{n}\sum_{i=1}^n \left(\frac{x_i - \mu_x}{\sigma_x}\right)\left(\frac{y_i - \mu_y}{\sigma_y}\right)$$
where μ and σ represent the empirical mean and standard deviation, respectively.
From there, the parameters $a, b$ are defined as
$$a = r\frac{\sigma_y}{\sigma_x}$$
$$b = \mu_y - a\mu_x$$
It is easy to see that this definition of $b$ matches the $\theta_2$ we solved for in the previous section (we had $b = \mu_y - a\mu_x$ and $\theta_2 = \mu_y - \theta_1\mu_x$). Let's show that $a$ and $\theta_1$ are also the same:
$$\begin{aligned}
r\frac{\sigma_y}{\sigma_x} &= \frac{1}{n}\frac{\sigma_y}{\sigma_x}\sum_{i=1}^n \left(\frac{x_i - \mu_x}{\sigma_x}\right)\left(\frac{y_i - \mu_y}{\sigma_y}\right) \\
&= \frac{1}{n\sigma_x^2}\sum_{i=1}^n (x_i - \mu_x)(y_i - \mu_y) \\
&= \frac{1}{n\sigma_x^2}\left(\sum_{i=1}^n x_i y_i - \mu_y\sum_{i=1}^n x_i - \mu_x\sum_{i=1}^n y_i + \sum_{i=1}^n \mu_x\mu_y\right) \\
&= \frac{\sum_{i=1}^n x_i y_i - n\mu_x\mu_y - n\mu_x\mu_y + n\mu_x\mu_y}{n \cdot \frac{1}{n}\sum_{i=1}^n (x_i - \mu_x)^2} \\
&= \frac{\sum_{i=1}^n x_i y_i - n\mu_x\mu_y}{\sum_{i=1}^n x_i^2 - 2\mu_x\sum_{i=1}^n x_i + \sum_{i=1}^n \mu_x^2} \\
&= \frac{\sum_{i=1}^n x_i y_i - n\mu_x\mu_y}{\sum_{i=1}^n x_i^2 - n\mu_x^2} \\
&= \theta_1
\end{aligned}$$
as required.
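Numerically, this equivalence is easy to spot-check as well. The sketch below (again assuming NumPy and made-up data; note that np.std defaults to the divide-by-$n$ standard deviation used in the definition above) computes $r\frac{\sigma_y}{\sigma_x}$ and compares it to $\theta_1$:

```python
import numpy as np

# Hypothetical data, in the same spirit as before.
rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=100)
y = 2 * x + 5 + rng.normal(0, 1, size=100)

n = len(x)
mu_x, mu_y = x.mean(), y.mean()
sigma_x, sigma_y = x.std(), y.std()  # population (1/n) standard deviations

# r as defined above: the average product of the standardized coordinates.
r = np.mean(((x - mu_x) / sigma_x) * ((y - mu_y) / sigma_y))

a = r * sigma_y / sigma_x  # the Data 8 slope
theta_1 = (np.sum(x * y) - n * mu_x * mu_y) / (np.sum(x ** 2) - n * mu_x ** 2)

print(a, theta_1)  # should agree up to floating point error
```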
4. Matrix Formulation
Now, instead of dealing with purely scalar values, we will introduce vectors and matrices.
Given our feature matrix $\phi$ and values vector $y$, we want to find a vector $\theta$ such that $\phi\theta$ best approximates $y$; specifically, we want the $\theta$ that minimizes $\|y - \phi\theta\|_2^2$. This solution is given by $\theta^* = (\phi^T\phi)^{-1}\phi^T y$ (assuming $\phi^T\phi$ is invertible).
The matrix formulation is more general than the ones we've seen previously, in that we can select features that are more complicated than linear - for example, we could find parameters to minimize estimation error for $ax + bx^2 + ce^x - \tan(x^2)$ if we wanted to. However, we're going to keep the problem the same, and try to model using $\hat{y} = \theta_1 x + \theta_2$.
(Important note: Now, $\theta$ is a vector, but $\theta_1, \theta_2$ are still scalars!)
Our matrices and vectors are defined as follows:
$$\phi = \begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_n & 1 \end{bmatrix}, \qquad \theta = \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$
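In code, constructing this feature matrix is just a matter of putting the $x$ values next to a column of ones (a minimal NumPy sketch with hypothetical data; the column of ones corresponds to the intercept $\theta_2$):

```python
import numpy as np

# Hypothetical data.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])

# The feature matrix phi: first column is x, second column is all ones.
phi = np.column_stack([x, np.ones_like(x)])
print(phi)
```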
Now, let's compute $(\phi^T\phi)^{-1}\phi^T y$ to find the matrix least squares solution for $\theta_1$ and $\theta_2$. (Be warned: this will be relatively algebra heavy. Feel free to skim over the results.)
$$\phi^T\phi = \begin{bmatrix} x_1 & x_2 & \ldots & x_n \\ 1 & 1 & \ldots & 1 \end{bmatrix} \begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_n & 1 \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^n x_i^2 & \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i & n \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^n x_i^2 & n\mu_x \\ n\mu_x & n \end{bmatrix}$$
Recall, the inverse of a $2 \times 2$ matrix $\begin{bmatrix} a & b \\ c & d \end{bmatrix}$ is given by $\frac{1}{ad - bc}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$.
Then:
$$\implies (\phi^T\phi)^{-1} = \frac{1}{n\sum_{i=1}^n x_i^2 - n^2\mu_x^2}\begin{bmatrix} n & -n\mu_x \\ -n\mu_x & \sum_{i=1}^n x_i^2 \end{bmatrix}$$
$$\implies (\phi^T\phi)^{-1}\phi^T y = \frac{1}{n\sum_{i=1}^n x_i^2 - n^2\mu_x^2}\begin{bmatrix} n & -n\mu_x \\ -n\mu_x & \sum_{i=1}^n x_i^2 \end{bmatrix}\begin{bmatrix} x_1 & x_2 & \ldots & x_n \\ 1 & 1 & \ldots & 1 \end{bmatrix}\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$
$$= \frac{1}{n\sum_{i=1}^n x_i^2 - n^2\mu_x^2}\begin{bmatrix} n & -n\mu_x \\ -n\mu_x & \sum_{i=1}^n x_i^2 \end{bmatrix}\begin{bmatrix} \sum_{i=1}^n x_i y_i \\ n\mu_y \end{bmatrix}$$
$$= \begin{bmatrix} \dfrac{n\sum_{i=1}^n x_i y_i - n^2\mu_x\mu_y}{n\sum_{i=1}^n x_i^2 - n^2\mu_x^2} \\[2ex] \dfrac{-n\mu_x\sum_{i=1}^n x_i y_i + n\mu_y\sum_{i=1}^n x_i^2}{n\sum_{i=1}^n x_i^2 - n^2\mu_x^2} \end{bmatrix}$$
$$\implies \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} = \begin{bmatrix} \dfrac{\sum_{i=1}^n x_i y_i - n\mu_x\mu_y}{\sum_{i=1}^n x_i^2 - n\mu_x^2} \\[2ex] \dfrac{-\mu_x\sum_{i=1}^n x_i y_i + \mu_y\sum_{i=1}^n x_i^2}{\sum_{i=1}^n x_i^2 - n\mu_x^2} \end{bmatrix}$$
Here, we see that $\theta_1 = \frac{\sum_{i=1}^n x_i y_i - n\mu_x\mu_y}{\sum_{i=1}^n x_i^2 - n\mu_x^2}$, as we saw earlier.
Also, we have
$$\begin{aligned}
\mu_y - \theta_1\mu_x &= \mu_y - \frac{\sum_{i=1}^n x_i y_i - n\mu_x\mu_y}{\sum_{i=1}^n x_i^2 - n\mu_x^2}\mu_x \\
&= \frac{\mu_y\sum_{i=1}^n x_i^2 - n\mu_x^2\mu_y - \mu_x\sum_{i=1}^n x_i y_i + n\mu_x^2\mu_y}{\sum_{i=1}^n x_i^2 - n\mu_x^2} \\
&= \frac{-\mu_x\sum_{i=1}^n x_i y_i + \mu_y\sum_{i=1}^n x_i^2}{\sum_{i=1}^n x_i^2 - n\mu_x^2}
\end{aligned}$$
which is exactly the second entry of the solution vector above, i.e. $\theta_2 = \mu_y - \theta_1\mu_x$, as we saw earlier. We've now shown that the normal equation solution $\theta^* = (\phi^T\phi)^{-1}\phi^T y$ gives the same values for $\theta_1, \theta_2$ as the scalar optimization method did.
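To close the loop numerically, here is one more sketch (assuming NumPy and synthetic data, and solving the linear system with np.linalg.solve rather than forming the inverse explicitly, which is the usual numerical practice) that computes the normal equation solution and compares it to the scalar formulas from Section 2:

```python
import numpy as np

# Hypothetical data: roughly y = 2x + 5 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2 * x + 5 + rng.normal(0, 1, size=200)

n = len(x)
mu_x, mu_y = x.mean(), y.mean()

# Scalar solutions from Section 2.
theta_1 = (np.sum(x * y) - n * mu_x * mu_y) / (np.sum(x ** 2) - n * mu_x ** 2)
theta_2 = mu_y - theta_1 * mu_x

# Matrix solution: solve (phi^T phi) theta = phi^T y for theta.
phi = np.column_stack([x, np.ones_like(x)])
theta_matrix = np.linalg.solve(phi.T @ phi, phi.T @ y)

print(theta_matrix)      # [theta_1, theta_2]
print(theta_1, theta_2)  # should match the entries above
```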
Conclusion
This entire time, we've been looking at the same problem: finding the optimal parameters that minimize the $L_2$ loss of
$$\hat{y} = \theta_1 x + \theta_2$$
We looked at three solutions to the same problem, and showed that they're all equivalent. We should expect this to be the case, but it's nice to see these connections laid out explicitly.