by Suraj Rampure (suraj.rampure@berkeley.edu)
Introduction
Suppose we're given some set of points $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, and want to fit the model
$$\hat{y} = \theta_1 x + \theta_2$$
that minimizes $L_2$ loss.
This is a problem you've likely seen multiple times.
- In previous statistics courses, like Data 8, you've derived expressions for $\theta_1, \theta_2$ (usually called $a, b$) in terms of $r$, known as the "correlation coefficient"
- In this course, you've learned the tools to formulate this as a scalar optimization problem (take the derivative of the loss, set it to 0, and solve for the parameters)
- You've also learned the rules to formulate this as a matrix calculus optimization problem, and can use the normal equation to find a solution
Here, we will:
- Provide a warmup to the idea of optimizing scalar loss functions by finding the $\theta$ that minimizes the $L_2$ loss of $\hat{y} = \theta x$
- Derive the solutions for $\theta_1, \theta_2$ by taking partial derivatives and solving
- Show the connection between the solutions for $\theta_1, \theta_2$ and the formulas for linear regression given in traditional statistics courses
- Create a feature matrix $\phi$ and weights vector $\theta$ and show that the solution $\theta^* = (\phi^T\phi)^{-1}\phi^T y$ yields the same $\theta_1, \theta_2$ as the derivation in (2)
Note: In lecture, this was referred to as the "slope-intercept model."
1. Warmup
Suppose we're given some set of points $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, and want to fit the model
$$\hat{y} = \theta x$$
where $x, \theta, \hat{y}$ are all scalars.
Our total loss is then
$$L(\theta) = \sum_{i=1}^n (y_i - \theta x_i)^2 = (y_1 - \theta x_1)^2 + (y_2 - \theta x_2)^2 + \ldots + (y_n - \theta x_n)^2$$
Taking the partial derivative with respect to $\theta$ and setting it equal to 0:
$$\frac{\partial L}{\partial \theta} = \sum_{i=1}^n 2(y_i - \theta x_i)(-x_i) = 2\sum_{i=1}^n (\theta x_i^2 - x_i y_i) = 0$$
$$\theta \sum_{i=1}^n x_i^2 = \sum_{i=1}^n x_i y_i \implies \theta = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}$$
This result should be familiar.
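As a quick numerical sanity check (a minimal sketch, not part of the derivation itself, assuming NumPy and some made-up data), we can verify that this closed form agrees with a least squares solver applied to the single-column design matrix:

```python
import numpy as np

# Hypothetical data: roughly y = 3x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3 * x + rng.normal(0, 1, size=50)

# Closed-form solution from the derivation: theta = sum(x*y) / sum(x^2).
theta_closed_form = np.sum(x * y) / np.sum(x ** 2)

# Least squares on the single-column design matrix [x] should agree.
theta_lstsq, *_ = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)

print(theta_closed_form, theta_lstsq[0])  # the two values should match
```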
Now, we can move onto the problem at hand.
2. 2D Linear Regression, as an Optimization Problem
Suppose we're given some set of points $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, and want to fit the model
$$\hat{y} = \theta_1 x + \theta_2$$
Using $L_2$ loss, we have:
$$L(\theta_1, \theta_2) = \sum_{i=1}^n (y_i - \theta_1 x_i - \theta_2)^2$$
Up until now, we've only dealt with single-parameter models (i.e. just one $\theta$), where we only needed to take a single derivative and optimize over one variable. Since we now have two $\theta$s, we will need to take partial derivatives with respect to each. We will then end up with a system of two equations and two unknowns, allowing us to solve for $\theta_1$ and $\theta_2$.
(Note: We will simplify notation by using $\mu_x = \frac{1}{n}\sum_{i=1}^n x_i$ and $\mu_y = \frac{1}{n}\sum_{i=1}^n y_i$.)
Taking the partial derivative with respect to $\theta_1$ and setting it equal to 0:
$$\frac{\partial L}{\partial \theta_1} = \sum_{i=1}^n 2(y_i - \theta_1 x_i - \theta_2)(-x_i) = 2\sum_{i=1}^n (\theta_1 x_i^2 + \theta_2 x_i - x_i y_i) = 0$$
$$\implies \theta_1 \sum_{i=1}^n x_i^2 + n\mu_x\theta_2 = \sum_{i=1}^n x_i y_i \tag{1}$$
Taking the partial derivative with respect to $\theta_2$ and setting it equal to 0:
$$\frac{\partial L}{\partial \theta_2} = \sum_{i=1}^n 2(y_i - \theta_1 x_i - \theta_2)(-1) = 2\sum_{i=1}^n (\theta_1 x_i + \theta_2 - y_i) = 0$$
$$\theta_1 \sum_{i=1}^n x_i + n\theta_2 = \sum_{i=1}^n y_i$$
$$n\mu_x\theta_1 + n\theta_2 = n\mu_y \implies \mu_x\theta_1 + \theta_2 = \mu_y \tag{2}$$
where we used the facts that $\sum_{i=1}^n x_i = n\mu_x$ and $\sum_{i=1}^n \theta_2 = n\theta_2$.
Now, we have a system of two equations in the two unknowns $\theta_1, \theta_2$. To solve, we can isolate $\theta_2$ in equation (2):
$$\theta_2 = \mu_y - \theta_1\mu_x$$
and substitute this into equation (1):
$$\theta_1 \sum_{i=1}^n x_i^2 + n\mu_x(\mu_y - \theta_1\mu_x) = \sum_{i=1}^n x_i y_i$$
$$\theta_1 \left(\sum_{i=1}^n x_i^2 - n\mu_x^2\right) + n\mu_x\mu_y = \sum_{i=1}^n x_i y_i$$
$$\theta_1 = \frac{\sum_{i=1}^n x_i y_i - n\mu_x\mu_y}{\sum_{i=1}^n x_i^2 - n\mu_x^2}$$
This gives us our final solution:
$$\theta_1 = \frac{\sum_{i=1}^n x_i y_i - n\mu_x\mu_y}{\sum_{i=1}^n x_i^2 - n\mu_x^2}, \qquad \theta_2 = \mu_y - \theta_1\mu_x$$
Technically, to be complete, we'd need to check the second-order conditions (that $\frac{\partial^2 L}{\partial \theta_1^2}$ and $\frac{\partial^2 L}{\partial \theta_2^2}$ are positive, and more precisely that the Hessian of $L$ is positive semidefinite, so that our critical point really is a minimum), but we can take that leap of faith for now.
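To convince ourselves the closed forms are right, here is a small numerical check (a sketch only, assuming NumPy and synthetic data) that compares them against np.polyfit, which fits a degree-1 polynomial by least squares:

```python
import numpy as np

# Hypothetical data: roughly y = 2x + 5 plus noise.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 2 * x + 5 + rng.normal(0, 1, size=100)

n = len(x)
mu_x, mu_y = x.mean(), y.mean()

# Closed-form solutions from the derivation above.
theta_1 = (np.sum(x * y) - n * mu_x * mu_y) / (np.sum(x ** 2) - n * mu_x ** 2)
theta_2 = mu_y - theta_1 * mu_x

# np.polyfit(x, y, 1) returns [slope, intercept] for the least squares line.
slope, intercept = np.polyfit(x, y, 1)

print(theta_1, slope)      # should match
print(theta_2, intercept)  # should match
```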
3. Connection to Previous Statistics Courses
The solutions for $\theta_1$ and $\theta_2$ we just derived should look very similar to the formulas for linear regression you might have learned in previous courses, such as Data 8.
In such courses, we define r, the correlation coefficient, as:
$$r = \frac{1}{n}\sum_{i=1}^n \left(\frac{x_i - \mu_x}{\sigma_x}\right)\left(\frac{y_i - \mu_y}{\sigma_y}\right)$$
where μ and σ represent the empirical mean and standard deviation, respectively.
From there, the parameters $a, b$ are defined as
$$a = r\frac{\sigma_y}{\sigma_x}$$
$$b = \mu_y - a\mu_x$$
It is easy to see that this definition of $b$ matches the $\theta_2$ we solved for in the previous section (we had $b = \mu_y - a\mu_x$ and $\theta_2 = \mu_y - \theta_1\mu_x$). Let's show that $a$ and $\theta_1$ are also the same:
$$\begin{aligned}
r\frac{\sigma_y}{\sigma_x} &= \frac{1}{n}\frac{\sigma_y}{\sigma_x}\sum_{i=1}^n \left(\frac{x_i - \mu_x}{\sigma_x}\right)\left(\frac{y_i - \mu_y}{\sigma_y}\right) \\
&= \frac{1}{n\sigma_x^2}\sum_{i=1}^n (x_i - \mu_x)(y_i - \mu_y) \\
&= \frac{1}{n\sigma_x^2}\left(\sum_{i=1}^n x_i y_i - \mu_y\sum_{i=1}^n x_i - \mu_x\sum_{i=1}^n y_i + \sum_{i=1}^n \mu_x\mu_y\right) \\
&= \frac{\sum_{i=1}^n x_i y_i - n\mu_x\mu_y - n\mu_x\mu_y + n\mu_x\mu_y}{n \cdot \frac{1}{n}\sum_{i=1}^n (x_i - \mu_x)^2} \\
&= \frac{\sum_{i=1}^n x_i y_i - n\mu_x\mu_y}{\sum_{i=1}^n x_i^2 - 2\mu_x\sum_{i=1}^n x_i + \sum_{i=1}^n \mu_x^2} \\
&= \frac{\sum_{i=1}^n x_i y_i - n\mu_x\mu_y}{\sum_{i=1}^n x_i^2 - n\mu_x^2} \\
&= \theta_1
\end{aligned}$$
as required.
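Numerically, this equivalence is easy to spot-check as well. The sketch below (again assuming NumPy and made-up data; note that np.std defaults to the divide-by-$n$ standard deviation used in the definition above) computes $r\frac{\sigma_y}{\sigma_x}$ and compares it to $\theta_1$:

```python
import numpy as np

# Hypothetical data, in the same spirit as before.
rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=100)
y = 2 * x + 5 + rng.normal(0, 1, size=100)

n = len(x)
mu_x, mu_y = x.mean(), y.mean()
sigma_x, sigma_y = x.std(), y.std()  # population (1/n) standard deviations

# r as defined above: the average product of the standardized coordinates.
r = np.mean(((x - mu_x) / sigma_x) * ((y - mu_y) / sigma_y))

a = r * sigma_y / sigma_x  # the Data 8 slope
theta_1 = (np.sum(x * y) - n * mu_x * mu_y) / (np.sum(x ** 2) - n * mu_x ** 2)

print(a, theta_1)  # should agree up to floating point error
```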
4. Matrix Formulation
Now, instead of dealing with purely scalar values, we will introduce vectors and matrices.
Given our feature matrix $\phi$ and values vector $y$, we want to find a vector $\theta$ such that $\phi\theta$ best approximates $y$; specifically, we want the $\theta$ that minimizes $\|y - \phi\theta\|_2^2$. This solution is given by $\theta^* = (\phi^T\phi)^{-1}\phi^T y$ (assuming $\phi^T\phi$ is invertible).
The matrix formulation is more general than the ones we've seen previously, in that we can select features that are more complicated than linear - for example, we could find parameters to minimize estimation error for $ax + bx^2 + ce^x - \tan(x^2)$ if we wanted to. However, we're going to keep the problem the same, and try to model using $\hat{y} = \theta_1 x + \theta_2$.
(Important note: Now, $\theta$ is a vector, but $\theta_1, \theta_2$ are still scalars!)
Our matrices and vectors are defined as follows:
$$\phi = \begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_n & 1 \end{bmatrix}, \qquad \theta = \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$
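In code, constructing this feature matrix is just a matter of putting the $x$ values next to a column of ones (a minimal NumPy sketch with hypothetical data; the column of ones corresponds to the intercept $\theta_2$):

```python
import numpy as np

# Hypothetical data.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])

# The feature matrix phi: first column is x, second column is all ones.
phi = np.column_stack([x, np.ones_like(x)])
print(phi)
```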
Now, let's compute $(\phi^T\phi)^{-1}\phi^T y$ to find the matrix least squares solution for $\theta_1$ and $\theta_2$. (Be warned: this will be relatively algebra heavy. Feel free to skim over the results.)
$$\phi^T\phi = \begin{bmatrix} x_1 & x_2 & \ldots & x_n \\ 1 & 1 & \ldots & 1 \end{bmatrix} \begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_n & 1 \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^n x_i^2 & \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i & n \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^n x_i^2 & n\mu_x \\ n\mu_x & n \end{bmatrix}$$
Recall, the inverse of a $2 \times 2$ matrix $\begin{bmatrix} a & b \\ c & d \end{bmatrix}$ is given by $\frac{1}{ad - bc}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$.
Then:
$$\implies (\phi^T\phi)^{-1} = \frac{1}{n\sum_{i=1}^n x_i^2 - n^2\mu_x^2}\begin{bmatrix} n & -n\mu_x \\ -n\mu_x & \sum_{i=1}^n x_i^2 \end{bmatrix}$$
$$\implies (\phi^T\phi)^{-1}\phi^T y = \frac{1}{n\sum_{i=1}^n x_i^2 - n^2\mu_x^2}\begin{bmatrix} n & -n\mu_x \\ -n\mu_x & \sum_{i=1}^n x_i^2 \end{bmatrix}\begin{bmatrix} x_1 & x_2 & \ldots & x_n \\ 1 & 1 & \ldots & 1 \end{bmatrix}\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$
$$= \frac{1}{n\sum_{i=1}^n x_i^2 - n^2\mu_x^2}\begin{bmatrix} n & -n\mu_x \\ -n\mu_x & \sum_{i=1}^n x_i^2 \end{bmatrix}\begin{bmatrix} \sum_{i=1}^n x_i y_i \\ n\mu_y \end{bmatrix}$$
$$= \begin{bmatrix} \dfrac{n\sum_{i=1}^n x_i y_i - n^2\mu_x\mu_y}{n\sum_{i=1}^n x_i^2 - n^2\mu_x^2} \\[2ex] \dfrac{-n\mu_x\sum_{i=1}^n x_i y_i + n\mu_y\sum_{i=1}^n x_i^2}{n\sum_{i=1}^n x_i^2 - n^2\mu_x^2} \end{bmatrix}$$
$$\implies \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} = \begin{bmatrix} \dfrac{\sum_{i=1}^n x_i y_i - n\mu_x\mu_y}{\sum_{i=1}^n x_i^2 - n\mu_x^2} \\[2ex] \dfrac{-\mu_x\sum_{i=1}^n x_i y_i + \mu_y\sum_{i=1}^n x_i^2}{\sum_{i=1}^n x_i^2 - n\mu_x^2} \end{bmatrix}$$
Here, we see that $\theta_1 = \frac{\sum_{i=1}^n x_i y_i - n\mu_x\mu_y}{\sum_{i=1}^n x_i^2 - n\mu_x^2}$, as we saw earlier.
Also, we have
$$\begin{aligned}
\mu_y - \theta_1\mu_x &= \mu_y - \frac{\sum_{i=1}^n x_i y_i - n\mu_x\mu_y}{\sum_{i=1}^n x_i^2 - n\mu_x^2}\mu_x \\
&= \frac{\mu_y\sum_{i=1}^n x_i^2 - n\mu_x^2\mu_y - \mu_x\sum_{i=1}^n x_i y_i + n\mu_x^2\mu_y}{\sum_{i=1}^n x_i^2 - n\mu_x^2} \\
&= \frac{-\mu_x\sum_{i=1}^n x_i y_i + \mu_y\sum_{i=1}^n x_i^2}{\sum_{i=1}^n x_i^2 - n\mu_x^2}
\end{aligned}$$
which is exactly the second entry of the solution vector above, i.e. $\theta_2 = \mu_y - \theta_1\mu_x$, as we saw earlier. We've now shown that the normal equation solution $\theta^* = (\phi^T\phi)^{-1}\phi^T y$ gives the same values for $\theta_1, \theta_2$ as the scalar optimization method did.
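To close the loop numerically, here is one more sketch (assuming NumPy and synthetic data, and solving the linear system with np.linalg.solve rather than forming the inverse explicitly, which is the usual numerical practice) that computes the normal equation solution and compares it to the scalar formulas from Section 2:

```python
import numpy as np

# Hypothetical data: roughly y = 2x + 5 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2 * x + 5 + rng.normal(0, 1, size=200)

n = len(x)
mu_x, mu_y = x.mean(), y.mean()

# Scalar solutions from Section 2.
theta_1 = (np.sum(x * y) - n * mu_x * mu_y) / (np.sum(x ** 2) - n * mu_x ** 2)
theta_2 = mu_y - theta_1 * mu_x

# Matrix solution: solve (phi^T phi) theta = phi^T y for theta.
phi = np.column_stack([x, np.ones_like(x)])
theta_matrix = np.linalg.solve(phi.T @ phi, phi.T @ y)

print(theta_matrix)      # [theta_1, theta_2]
print(theta_1, theta_2)  # should match the entries above
```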
Conclusion
This entire time, we've been looking at the same problem: finding the optimal parameters that minimize the $L_2$ loss of
$$\hat{y} = \theta_1 x + \theta_2$$
We looked at three solutions to the same problem, and showed that they're all equivalent. We should expect this to be the case, but it's nice to see these connections laid out explicitly.