Theory of Linear Regression

Given a dataset $\mathcal{D} = \{(\mathbf{x}, y) \mid \mathbf{x} \in \mathbb{R}^d, y \in \mathbb{R} \}$, we can build a *linear* model to *predict* the label $y$ given only the features $\mathbf{x}$.

Formally, a linear model can be written as $\hat{y} = f(\mathbf{x}) = \mathbf{x}^T\beta$, where $\beta$ is a parameter vector that we can tune and $\hat{y}$ is our prediction.

To evaluate how good our prediction is, an intuitive and simple metric is the squared error $|y-\hat{y}|^2$.
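As a minimal NumPy sketch (the values of `x`, `beta`, and `y` below are made up purely for illustration), the prediction and its squared error are:

```python
import numpy as np

# Hypothetical numbers, just to make the definitions above concrete.
x = np.array([1.0, 2.0, 3.0])       # features, d = 3
beta = np.array([0.5, -1.0, 2.0])   # parameter vector
y = 4.0                             # observed label

y_hat = x @ beta                    # linear prediction x^T beta
squared_error = (y - y_hat) ** 2    # squared error |y - y_hat|^2
print(y_hat, squared_error)         # 4.5 0.25
```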

If the dataset has size $|\mathcal{D}| = 1$, we can directly find an optimal parameter vector that achieves zero error by solving a single linear equation:

$$
|y-\hat{y}|^2 = 0 \Rightarrow y = \hat{y} \Rightarrow \mathbf{x}^T\beta = y
$$

which can be written as:

$$
x_1\beta_1 + x_2\beta_2 + \cdots + x_d\beta_d = y
$$

where $d$ is the dimension of the vector $\mathbf{x}$.

With a non-trivial vector $\mathbf{x}$ (i.e., $\exists\, x_i \neq 0$), we can always find a $\beta$ that satisfies the equation above.
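For example, one concrete choice that always works is $\beta = \frac{y}{\|\mathbf{x}\|^2}\,\mathbf{x}$ (the minimum-norm solution); a quick NumPy check with made-up numbers:

```python
import numpy as np

# A single (hypothetical) data point.
x = np.array([1.0, 2.0, 3.0])
y = 7.0

# One of the infinitely many exact solutions when d > 1:
# the minimum-norm choice beta = (y / ||x||^2) * x.
beta = (y / (x @ x)) * x

print(x @ beta)             # 7.0 -- reproduces y exactly
print((y - x @ beta) ** 2)  # 0.0 -- zero squared error
```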

In practice, the dataset $\mathcal{D}$ can be arbitrarily large, and in most cases we can no longer find an optimal $\beta$ by solving equations exactly: the number of distinct equations grows with the size of the dataset and can exceed the number of unknown variables, in which case typically no exact solution exists at all. However, we can still optimize something we care about statistically!

The sum of squared errors (equivalently, the mean squared error up to a factor of $1/|\mathcal{D}|$, which does not change the minimizer) evaluates the statistical quality of the linear model and can be written as:

$$
\mathcal{L}_{\beta} = \sum_{(\mathbf{x}, y)\in\mathcal{D}}{|y-f(\mathbf{x})|^2} = ||X\beta - \mathbf{y}||^2_2
$$

where $\mathcal{L}_\beta$ is the notation for the **loss** (which intuitively quantifies the distance between prediction and reality; the squared error is a simple yet effective choice), the subscript $\beta$ denotes our tunable parameters, $X \in \mathbb{R}^{|\mathcal{D}| \times d}$ is the matrix whose $i$-th row is $\mathbf{x}_i^T$, $\mathbf{y} = [y_1, \cdots, y_{|\mathcal{D}|}]^T$ stacks the labels, and $||\cdot||_2$ is the L2 norm.
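With these conventions (rows of $X$ are the feature vectors), the loss is a one-liner; the small dataset below is hypothetical:

```python
import numpy as np

# Hypothetical dataset: |D| = 4 samples, d = 2 features.
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.0],
              [4.0, 2.5]])            # each row is one x_i^T
y = np.array([3.0, 2.0, 4.0, 6.0])    # labels
beta = np.array([1.0, 1.0])           # some candidate parameters

residual = X @ beta - y
loss = residual @ residual            # ||X beta - y||_2^2, sum of squared errors
print(loss)                           # 0.5
```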

Is it possible to find a parameter vector $\beta$ that makes the loss $\mathcal{L}$ small? The answer is yes. What's more, we can find a closed-form solution for $\beta$ that minimizes the loss $\mathcal{L}$.

$$
\min_{\beta}||X\beta-\mathbf{y}||^2_2 \Rightarrow \hat{\beta} = (X^TX)^{-1}X^T\mathbf{y}
$$
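Before the proof, here is a sketch of computing this closed-form solution in NumPy on hypothetical data; in practice one solves the normal equations (or calls `np.linalg.lstsq`) rather than forming the explicit inverse, which is numerically safer:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                      # 100 hypothetical samples, 3 features
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + 0.1 * rng.normal(size=100)     # noisy labels

# Closed-form estimate via the normal equations (X^T X) beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The built-in least-squares solver gives the same answer.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)                                    # close to true_beta
print(np.allclose(beta_hat, beta_lstsq))           # True
```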

**Proof:**

Rewrite the objective,

$$
\mathcal{L} = ||X\beta-\mathbf{y}||^2_2 = (X\beta-\mathbf{y})^T(X\beta-\mathbf{y}) = (\beta^TX^T-\mathbf{y}^T)(X\beta-\mathbf{y}) = \beta^TX^TX\beta - 2\mathbf{y}^TX\beta + \mathbf{y}^T\mathbf{y}.
$$

Take the gradient with respect to $\beta$,

$$
\nabla_{\beta}\mathcal{L}(\beta) = 2X^TX\beta - 2X^T\mathbf{y}
$$
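This gradient formula can be sanity-checked numerically against central finite differences (hypothetical data again):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))       # hypothetical data
y = rng.normal(size=50)
beta = rng.normal(size=4)

def loss(b):
    r = X @ b - y
    return r @ r                   # ||X b - y||_2^2

# Analytic gradient: 2 X^T X beta - 2 X^T y.
grad = 2 * X.T @ X @ beta - 2 * X.T @ y

# Central finite-difference approximation of the same gradient.
eps = 1e-6
num_grad = np.array([
    (loss(beta + eps * e_i) - loss(beta - eps * e_i)) / (2 * eps)
    for e_i in np.eye(4)
])
print(np.allclose(grad, num_grad, atol=1e-4))  # True
```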

Set the gradient to zero (first-order optimality condition):

$$
2X^TX\beta - 2X^T\mathbf{y} = 0 \Rightarrow X^TX\beta = X^T\mathbf{y}
$$

Assuming $X^TX$ is invertible (which holds whenever $X$ has full column rank), we obtain

$$
\hat{\beta} = (X^TX)^{-1}X^T\mathbf{y}
$$
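As a final sanity check (on hypothetical data), $X^TX$ is invertible exactly when $X$ has full column rank, and the resulting $\hat{\beta}$ indeed makes the gradient vanish:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))                    # hypothetical tall data matrix
y = X @ rng.normal(size=5) + rng.normal(size=200)

# X^T X is invertible iff X has full column rank.
print(np.linalg.matrix_rank(X) == X.shape[1])    # True

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The first-order condition holds at beta_hat: the gradient is (numerically) zero.
grad = 2 * X.T @ X @ beta_hat - 2 * X.T @ y
print(np.allclose(grad, 0.0, atol=1e-8))         # True
```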
