Least Squares Estimates
1 Introduction to Linear Models
Let \(y_i\) be the \(i\)th response among \(n\) measured values. It is often of interest to model the response with a linear model: \[y_i = \beta_0 + \beta_1 x_{i1} +\cdots + \beta_{p-1} x_{i(p-1)} + \epsilon_i, \mbox{ } i=1,\cdots,n\]
The linear model consists of a systematic part and a random part. Write the \(n\) equations jointly, \[\begin{eqnarray*} \begin{pmatrix} y_{1} \\ y_{2} \\ \vdots\\ y_n \end{pmatrix} & = & \begin{pmatrix} \beta_0 + \beta_1 x_{11} + \cdots+ \beta_{p-1} x_{1(p-1)} + \epsilon_1 \\ \beta_0 + \beta_1 x_{21} + \cdots+ \beta_{p-1} x_{2(p-1)} + \epsilon_2 \\ \vdots\\ \beta_0 + \beta_1 x_{n1} + \cdots+ \beta_{p-1} x_{n(p-1)} + \epsilon_n \\ \end{pmatrix} \\ &=& \begin{pmatrix} 1 & x_{11} & \cdots & x_{1(p-1)} \\ 1 & x_{21} & \cdots & x_{2(p-1)} \\ \vdots & & & \vdots\\ 1 & x_{n1} & \cdots & x_{n(p-1)} \end{pmatrix} \begin{pmatrix} \beta_0\\ \beta_1\\ \vdots\\ \beta_{p-1} \end{pmatrix} + \begin{pmatrix} \epsilon_1\\ \epsilon_2\\ \vdots\\ \epsilon_n \end{pmatrix} \end{eqnarray*}\]
Using matrix notation, we have \[\begin{eqnarray*} {Y} &=& {X}{\beta} + {\epsilon} \end{eqnarray*}\]
In linear models, the basic assumptions we use are: \[\begin{eqnarray*} E[ Y ] &=& X\beta \\ Var[Y] &=& \Sigma\\ \end{eqnarray*}\] where the variance-covariance matrix of \(Y\) is defined as \[\begin{eqnarray*} \Sigma &=& \begin{pmatrix} var[y_1] & cov[y_1, y_2] & \cdots & cov[y_1, y_n] \\ cov[y_1,y_2] & var[y_2] & \cdots & cov[y_2, y_n]\\ \vdots & \cdots & \cdots & \vdots\\ cov[y_1, y_n]& cov[y_2,y_n] & \cdots & var[y_n] \end{pmatrix}\\ &=& \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1n}\\ \sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2n}\\ \vdots & \cdots & \cdots & \vdots\\ \sigma_{1n} & \sigma_{2n} & \cdots & \sigma_{n}^2 \end{pmatrix} \end{eqnarray*}\]
The mean assumption is equivalent to \[E[ {\epsilon}]={0}=(0,0,\cdots,0)^T\] The variance-covariance assumption is equivalent to \[cov(y_i, y_j)=cov(\epsilon_i, \epsilon_j)=\sigma_{ij} \]
In practice, for a response variable \(Y\) (or a transformed version), we first identify explanatory variables to be used in the linear model, and estimate parameters \({\beta}\), then try to make statistical inferences based upon the sampling distributions of the estimates.
2 LSE
2.1 The Least Squares Estimate (LSE) / OLSE
One approach to estimate \({\beta}\) is to minimize \[Q={\epsilon}^T{\epsilon}=\sum_{i=1}^n(y_i-E[y_i])^2 = (Y-X\beta)^T(Y-X\beta)\]
To obtain the least squares estimate, we first expand \(Q\):
\[\begin{eqnarray*} Q&=&({Y}^T -{\beta}^T {X}^T)(Y-{X}{\beta})\\ &=&{Y}^T {Y} - {\beta}^T {X}^T {Y} - Y^T X \beta + \beta^T X^T X \beta\\ &=&{Y}^T{Y} - 2{\beta}^TX^T{Y} + {\beta}^T {X}^T {X} {\beta} \end{eqnarray*}\] To minimize the sum of squares with respect to \(\beta\), we take partial derivatives \[\frac{\partial Q}{\partial {\beta}} = \begin{pmatrix} \frac{\partial Q}{\partial {\beta}_0}\\ \frac{\partial Q}{\partial {\beta}_1}\\ \vdots\\ \frac{\partial Q}{\partial {\beta}_{p-1}} \end{pmatrix} \]
Therefore, \[\begin{eqnarray*} \frac{\partial Q }{\partial \beta} &=& -2{X}^T{Y} + 2{X}^T{X}{\beta} \equiv 0\\ &\Rightarrow & {X}^T{X}\hat{{\beta}} = {X}^T {Y} \\ \end{eqnarray*}\] We say \(\hat{{\beta}}\) is a least squares estimate of \({\beta}\) for \({Y}={X}{\beta}+{\epsilon}\) if \[{X}^T{X}\hat{{\beta}}={X}^T{Y}\]
Definition. The column space of a matrix \(A\) is the vector space generated by the columns of \(A\). If \(A=(a_{ij})_{n\times p}=(a_1,\cdots, a_p)\), then the column space of \(A\), denoted by \(C(A)\) is given by \[C(A)=\{y:y=\sum_{i=1}^pc_ia_i\}\] for scalars \(c_i,i=1,\cdots,p\). Alternatively, \(y\in C(A)\) iff \(\exists\) \(c\in R^p\) s.t. \(y=Ac\).
2.2 Existence of LSE
The normal equation \(X^TX\beta=X^TY\) always has a solution for \(\beta\).
Lemma. \(C(X^T)=C(X^TX)\). This can be proved using the following two facts. - \(C(X^TX)\subset C(X^T)\) because each column of \(X^TX\) is \(X^T\) times the corresponding column of \(X\). - \(C(X^T)\) and \(C(X^TX)\) have the same dimension because \(X\) and \(X^TX\) have the same rank.
Proof. It is obvious that \(X^TY\in C(X^T)=C(X^TX)\); therefore, \(\exists\) \(c\in R^p\) s.t. \(X^TY=X^TXc\). This proves the existence of LSE.
3 Compute LSE
3.1 When \(X\) is Full Rank
Proposition: When \(rank(X)=p\), i.e., all the columns are linearly independent, the normal equation has a unique solution and the ordinary least squares estimate is \[\hat{\beta}=(X^TX)^{-1}X^TY\]
The proof is trivial.
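As a quick numerical illustration, here is a minimal R sketch that computes the OLSE by solving the normal equations directly and compares it with the coefficients returned by R's lm() function; the simulated data and variable names are purely illustrative.

```r
set.seed(1)
n  <- 20
x1 <- rnorm(n); x2 <- rnorm(n)
X  <- cbind(1, x1, x2)                      # full-rank design with an intercept column
Y  <- drop(X %*% c(2, 1, -0.5) + rnorm(n))  # simulated responses

# OLSE by solving the normal equations (X^T X) beta-hat = X^T Y
beta_hat <- solve(t(X) %*% X, t(X) %*% Y)

# lm() solves the same least squares problem (internally via a QR decomposition)
fit <- lm(Y ~ x1 + x2)
cbind(normal_equations = drop(beta_hat), lm = coef(fit))
```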
3.2 When \(X\) is Less Than Full Rank
When \(Rank(X)=r<p\), we can use the following methods to find an LSE
- Choose a linearly independent subset of \(r\) columns of \(X\). WLOG, assume \(X=(X_1, X_2)\), where \(Rank(X_1)=r\). A LSE of \(\beta\) is \[\hat{\beta}= \begin{pmatrix} (X_1^TX_1)^{-1}X_1^TY\\ 0 \end{pmatrix} \]
Proof: (assigned as homework) Hint: Show that there exists an \(r\times (p-r)\) matrix \(C\) such that \(X_2=X_1C\).
(This method is commonly used by regression packages, and the columns selected will be package-dependent).
- Consider an identifiability constraint \(H\beta=0\). Define \[ W=\begin{pmatrix}X\\H\end{pmatrix}, \quad Z=\begin{pmatrix}Y\\0\end{pmatrix} \] We say \(H\beta=0\) is an identifiability constraint iff
(a) \(C(H^T)\) and \(C(X^T)\) are disjoint (i.e., \(H\) does not contribute any information that \(X\) already provides), and
(b) \(Rank(W)=p\).
One LSE is \[\hat{\beta}=(W^TW)^{-1}W^TZ=(W^TW)^{-1}X^TY\] Note that (a) and (b) \(\Rightarrow Rank(H)=p-r\).
The proof is not required in STAT200C.
Example. The one-way ANOVA \(Y_{ij}=\mu+\alpha_i+\epsilon_{ij}\).
One way to get a unique solution to \(\beta\) is to require \(\sum \alpha_i=0\)
- Using any generalized inverse matrix \((X^TX)^-\), where a generalized inverse of \(B\) is any matrix \(B^-\) s.t. \(BB^-B=B\). An LSE is then \[\hat{\beta}=(X^TX)^-X^TY\]
The proof follows from the following Lemma.
3.2.1 Lemma
Lemma 1 of Generalized Inverse. \(X(X^TX)^-X^TX=X\)
Proof.
Let \((X^TX)^-\) be a generalized inverse of \(X^TX\). By the definition of generalized inverse, we have \(X^TX(X^TX)^-X^TX=X^TX\). Taking transposes on both sides, we have \(X^TX[(X^TX)^-]^TX^TX=X^TX\), indicating that \([(X^TX)^-]^T\) is also a generalized inverse of \(X^TX\).
Let \(D=X(X^TX)^-X^TX-X\). Using the definition of generalized inverse and the above result, it is easy to verify that \(D^TD=0\), which implies that \(D=0\).
Similarly, one can show that \(X^TX(X^TX)^-X^T=X^T\). Based on this result, it is very easy to verify that \[\hat{\beta}=(X^TX)^-X^TY\] satisfies the normal equation. In other words, it is an LSE.
Note, \[\widetilde{\beta}=\hat{\beta} + (I -(X^TX)^-(X^TX))z\] where z is an arbitrary vector, is also a solution to the normal equation. (a homework problem).
We can conclude from the methods of computing LSE that:
An LSE always exists.
The least squares estimate is unique iff \(rank(X)=p\).
3.3 Example. One-Way ANOVA
The one-way ANOVA \[Y_{ij}=\mu+\alpha_{i}+\epsilon_{ij}\] How to find \(\beta=(\mu, \alpha_1, \alpha_2)^T\)? \[X=\begin{pmatrix} 1&1&0\\ 1&1&0\\ 1&0&1\\ 1&0&1 \end{pmatrix} \] Notice that column 1 of \(X\) is a linear combination of the other two columns. Thus the design matrix is less than full rank.
We now illustrate how the three methods can be used to obtain \(\hat{\beta}\) (and hence \(X\hat{\beta}\)).
Find a subset of the columns of \(X\). E.g., keep the first two columns we have \(\hat\beta=( \bar Y_{2.},\bar Y_{1.}- \bar Y_{2.}, 0)^T\). E.g., keep the last two columns we have \(\hat\beta=( 0, \bar Y_{1.}, \bar Y_{2.})^T\).
Add a constraint, i.e., add \((0,1,1)\) to the design matrix, one can show that \(\hat\beta=( \bar Y_{..},\bar Y_{1.}- \bar Y_{..}, \bar Y_{2.}- \bar Y_{..})^T\)
Calculate a g-inverse of \(X^TX\) using R. (library(MASS), ginv()). Using the g-inv from R, the LSE is \(\hat\beta=( \frac{2}{3}\bar Y_{..},\bar Y_{1.}- \frac{2}{3}\bar Y_{..}, \bar Y_{2.}-\frac{2}{3}\bar Y_{..})^T\).
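Below is a minimal R sketch of these computations for the example above; the four response values are invented for illustration. MASS::ginv() returns the Moore-Penrose generalized inverse, which produces the particular LSE quoted in the third bullet, while the column-subset method produces a different LSE; both give the same fitted values.

```r
library(MASS)   # for ginv()

X <- matrix(c(1, 1, 0,
              1, 1, 0,
              1, 0, 1,
              1, 0, 1), nrow = 4, byrow = TRUE)
Y <- c(3, 5, 6, 10)                         # hypothetical responses

# Method 3: LSE based on a generalized inverse of X^T X
beta_ginv <- ginv(t(X) %*% X) %*% t(X) %*% Y

# Method 1: keep the first two (linearly independent) columns, set the rest to 0
X1       <- X[, 1:2]
beta_sub <- c(solve(t(X1) %*% X1, t(X1) %*% Y), 0)

# Different solutions of the normal equations, but identical fitted values
cbind(fit_ginv = drop(X %*% beta_ginv), fit_subset = drop(X %*% beta_sub))
```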
4 Fitted Values and Residuals
4.1 Fitted values:
\(\hat{\theta}=X\hat{\beta}=\hat{Y}=(\hat{Y}_1,\cdots,\hat{Y}_n)^T\)
4.2 The uniqueness of the fitted values
When \(X\) is less than full rank, there are infinitely many estimates that satisfy the normal equation. Does this mean that we are going to have different fitted values? It has been shown (Urquhart 1969) that all the solutions to the normal equation (i.e., all LSEs of \(\beta\)) can be written in the form \(\hat{\beta}=(X^TX)^-X^TY\) for some generalized inverse \((X^TX)^-\).
4.2.1 Theorem. The fitted values, i.e., \(\hat Y=X\hat\beta\) are unique.
Proof. First, based on Urquhart (1969), all the LSEs of \(\beta\) to the normal equation can be written into \(\hat{\beta}=(X^TX)^-X^TY\). The proof follows immediately from the Lemma below.
4.2.2 Lemma
Invariance of \(P_X\).
\(P_X=X(X^TX)^-X^T\) is invariant to the choice of \((X^TX)^-\).
Proof. Let \(G_1\) and \(G_2\) be two generalized inverses of \(X^T X\). We proceed in three steps.
First, \(G_1^T\) and \(G_2^T\) are also generalized inverses of \(X^T X\). This can be proved using the definition of generalized inverse of \(X^TX\).
Second, by Lemma 1 of Generalized Inverse, we have \[\begin{align*} X G_1 X^T X & =X G_1^T X^T X = X G_2 X^T X = X G_2^T X^T X=X,\\ X^TXG_1 X^T & = X^TXG_1^T X^T = X^TXG_2 X^T = X^TXG_2^T X^T = X^T \end{align*}\]
Third. Let \(P_1 = X G_1 X^T\) and \(P_2 = X G_2 X^T\). Denote their difference by \(D\). Then \[\begin{align*} D^TD &=\underline {X G_1^T X^TX} G_1 X^T - \underline{X G_1^T X^TX} G_2 X^T - \underline{X G_2^T X^TX} G_1 X^T + \underline{X G_2^T X^TX} G_2 X^T \\ &=X G_1 X^T - X G_2 X^T - X G_1 X^T + X G_2 X^T\\ &=0 \end{align*}\]
The second to the last step is true because each of the underlined parts can be substituted by \(X\).
4.2.3 Proposition
\(P_X=X(X^TX)^-X^T\) is symmetric and idempotent, i.e., \(P_X\) is a projection matrix.
Proof.
The idempotent part follows from Lemma 1 of Generalized Inverse.
The symmetry follows from the invariance property of \(P_X\).
Remark. Lemma 1 of Generalized Inverse also implies that \(P_XX=X\).
4.3 Residuals
Definition. Residuals:
\[e=\begin{pmatrix} y_1 - \hat y_1\\ \cdots\\ y_n - \hat y_n \end{pmatrix} =Y-\hat{Y}=Y-X\hat{\beta}=Y-P_XY=(I-P_X)Y\]
Since \(P_X\) is a projection matrix, so is \(I-P_X\). Therefore, the residuals are the projection of \(Y\) onto the orthogonal complement of \(C(X)\).
4.3.1 RSS
Definition. Residual sum of squares (RSS): \[RSS=e^Te=\sum_{i=1}^n (y_i-\hat y_i)^2=(Y-\hat{Y})^T(Y-\hat{Y})\]
\(RSS\) can be written into the following form:
\[RSS=(Y-X\hat{\beta})^T(Y-X\hat{\beta})=Y^T(I-P_X)Y.\] The last step is true because \(I - P_X\) is a projection matrix.
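The following short R sketch (reusing the rank-deficient design from the one-way ANOVA example, with invented responses) verifies numerically that the two expressions for the RSS agree and that the residuals are obtained by projecting \(Y\) onto the orthogonal complement of \(C(X)\).

```r
library(MASS)   # for ginv()

X <- matrix(c(1, 1, 0,  1, 1, 0,  1, 0, 1,  1, 0, 1), nrow = 4, byrow = TRUE)
Y <- c(3, 5, 6, 10)                      # hypothetical responses

P_X <- X %*% ginv(t(X) %*% X) %*% t(X)   # projection matrix onto C(X)
e   <- drop((diag(4) - P_X) %*% Y)       # residuals e = (I - P_X) Y

c(RSS_quadratic_form = drop(t(Y) %*% (diag(4) - P_X) %*% Y),
  RSS_sum_of_squares = sum(e^2))
```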
5 Estimable Functions
5.1 Expectation and Covariance
Proposition: Let \(Z\) be a random matrix, and let \(A\), \(B\), and \(C\) be known matrices of conformable dimensions. Then \[E[AZB+C]=AE[Z]B+C\]
Proof. Let \(W=AZB+C\), then \[\begin{eqnarray*} w_{ij} &= \sum_{k}\sum_{l}a_{ik}z_{kl}b_{lj}+c_{ij}\\ &\Rightarrow E[w_{ij}]=\sum_{k}\sum_{l}a_{ik}E[z_{kl}]b_{lj}+c_{ij} \end{eqnarray*}\]
Proposition: For random vectors \(Y\) and \(Z\), known matrices \(A\) and \(B\), and constants \(c\) and \(d\)
- \(E[AY+BZ]=AE[Y]+BE[Z]\)
- \(Cov(AY+c, BZ+d) = ACov(Y,Z)B^T\)
Proof.
The first part: apply the previous proposition (linearity of expectation) elementwise.
The second part:
Let \(U=AY+c\), \(W=BZ+d\). Then \[\begin{eqnarray*} E[U] &=& AE[Y]+c\\ U-E[U] &=& A(Y-E[Y])\\ E[W] &=& BE[Z]+d\\ W-E[W] &=& B(Z-E[Z])\\ Cov(AY+c,BZ+d) &=& E[(U-E[U])(W-E[W])^T]=A\,E[(Y-E[Y])(Z-E[Z])^T]\,B^T=ACov(Y,Z)B^T \end{eqnarray*}\]
5.2 Estimable Functions
Suppose \(X\) is less than full rank; then there are infinitely many LSEs of \(\beta\). Since \(\hat{\beta}\) (the LSE) is not unique, \(\beta\) is not estimable. In addition, is any of the LSEs unbiased?
Suppose one of them is unbiased, i.e., \(E[\hat{\beta}]=\beta\) for all \(\beta\), i.e., \((X^TX)^{-}X^TX\beta=\beta\) for all \(\beta\). Then \((X^TX)^-X^TX=I_p\), which contradicts the fact that \(X\) is less than full rank. This implies that none of the infinitely many solutions is unbiased.
Definition. \(a^T\beta\) is said to be an estimable function if \(\exists\) \(c\) s.t. \(E[c^TY]=a^T\beta\) for any \(\beta\).
Note: \(X\beta\) is always estimable (why? \(E[Y]=X\beta\)). But \(\beta\) might not be estimable.
Theorem. \(a^T\beta\) is estimable iff any one of the following conditions holds:
\(a\in C(X^T)\). [Note: Seber A.2.5 implies that \(C(X^T)=C(X^TX)\)]
\(a^T(X^TX)^-X^TX=a^T\).
\(E[a^T\hat{\beta}]=a^T\beta\) for all \(\beta\), where \(\hat{\beta}\) is an LSE.
Proof.
[Proof for (1)] \(a^T\beta\) is estimable iff \(\exists\) \(c\) s.t. \(E[c^TY]=a^T\beta\) for all \(\beta\) iff \(c^TX\beta=a^T\beta\) for all \(\beta\) iff \(a=X^Tc\).
On the other hand, \(a\in C(X^T)\Rightarrow a=X^Tc\Rightarrow a^T\beta=c^TX\beta\Rightarrow E[c^TY]=a^T\beta\). By the definition, \(a^T\beta\) is estimable.
[Proof of (2)] \(E[a^T(X^TX)^{-}X^TY]=a^T(X^TX)^-X^TX\beta=a^T\beta\Rightarrow a^T\beta\) is estimable.
If \(a^T\beta\) is estimable, by (1), \(a=X^TXd\Rightarrow X^TX(X^TX)^-X^TXd=X^TXd=a\).
[Proof of (3)] If estimable, by (2), \(E[a^T\hat\beta]=a^T(X^TX)^-X^TX\beta=a^T\beta\); if \(E[a^T\hat\beta]=a^T\beta\), then \(a^T(X^TX)^-X^TX\beta=a^T\beta\) for all \(\beta\), which implies (2).
A proof for \(C(X^T)=C(X^TX)\). Clearly \(C(X^TX)\subset C(X^T)\). Because \(X^TX\) and \(X\) have the same rank, their column spaces have the same dimension. This implies that the spaces must be equal. As a consequence, there exists some \(M\) so that \(X^T=X^TXM\).
Theorem. Let \(\hat{\beta}\) be the OLSE of \(\beta\) from the regression model \[Y=X\beta+\epsilon\] with \(E[\epsilon]=0\). Consider an estimable function \(a^T\beta\). Then
\(a^T\hat{\beta}\) is unique, i.e., \(a^T\hat{\beta}\) is invariant to the choice of \((X^TX)^-\).
\(E[a^T\hat{\beta}]=a^T\beta\). When \(X\) is full rank, \(\beta\) is estimable and \(E[\hat{\beta}]=\beta\).
\(Var[a^T\hat{\beta}]=\sigma^2a^T(X^TX)^-a\) if we further assume that \(Var[\epsilon]=\sigma^2I_n\). When \(X\) is full rank, \(Var[\hat{\beta}]=\sigma^2(X^TX)^{-1}\).
Proof: Based on the assumptions, the OLSE of \(\beta\) is \(\hat{\beta}=(X^TX)^-X^TY\). Note that when \(X\) is full rank, \((X^TX)^-=(X^TX)^{-1}\). We have \[a^T\hat{\beta}=a^T(X^TX)^-X^TY\]
Uniqueness of \(a^T\hat{\beta}\). Because \(a^T\beta\) is estimable, \(a\in C(X^T)\). Thus \(\exists c\) s.t. \(a=X^Tc\). Therefore, \[a^T\hat{\beta}=c^TX(X^TX)^-X^TY=c^TP_XY.\] It is invariant to the choice of the g-inverse because \(P_X\) is invariant to the choice of the g-inverse.
Mean: Because \(a^T\beta\) is estimable, \(a\in C(X^T)\), i.e., \(\exists\) \(c\) s.t. \(a=X^Tc\). By Lemma 1 of Generalized Inverse, \(X(X^TX)^-X^TX=X\). Thus \(E[a^T\hat{\beta}]=c^TX(X^TX)^-X^TX\beta = c^TX\beta=a^T\beta\).
Variance: Note that \(a^T\hat\beta=a^T(X^TX)^-X^TY=c^TX(X^TX)^-X^TY=c^TP_XY\). Thus, \[Var[a^T\hat{\beta}]=c^TP_X(\sigma^2I_n)P_X^Tc=\sigma^2c^TP_Xc=\sigma^2c^TX(X^TX)^-X^Tc=\sigma^2a^T(X^TX)^-a\]
Note: the LSE of estimable functions are unbiased. But this is not true for functions that are not estimable, as we showed at the beginning of the section.
Summary about estimable function
The expected value of any observation is estimable.
Any linear combination of estimable functions is estimable.
The function \(a^T\beta\) is estimable iff
- \(a\in C(X^T)\)
- \(a^T=a^T(X^TX)^-X^TX\)
- \(a^T\hat{\beta}\) is unbiased for \(a^T\beta\).
\(X\beta\) is estimable, and any linear function of \(X\beta\) is estimable.
\(X^TX\beta\) is estimable, and any linear function of \(X^TX\beta\) is estimable.
There are exactly \(r=rank(X)\) linearly independent estimable functions of \(\beta\).
5.2.1 Example: One-way ANOVA
\[Y_{ij}=\mu+\alpha_{i}+\epsilon_{ij}\] Use matrix/vector notation, \[Y=(Y_{11},Y_{12},Y_{21}, Y_{22})^T,\] \[X=\begin{pmatrix} 1&1&0\\ 1&1&0\\ 1&0&1\\ 1&0&1 \end{pmatrix}, \] \[\beta=(\mu, \alpha_1, \alpha_2)^T\] Notice that column 1 of \(X\) is a linear combination of the other two columns. Thus the design matrix is less than full rank, and the inverse of \(X^TX\) doesn’t exist. \[X^TX = \begin{pmatrix} 4&2&2\\ 2&2&0\\ 2&0&2 \end{pmatrix} \] Consider two generalized inverses of \(X^TX\): \[(X^TX)_1^- = \begin{pmatrix} 0&0&0\\ 0&1/2&0\\ 0&0&1/2 \end{pmatrix}\]
\[(X^TX)_2^- = \begin{pmatrix} 1/2&-1/2&0\\ -1/2&1&0\\ 0&0&0 \end{pmatrix}\]
Note that \[X^TY=(y_{++}, y_{1+}, y_{2+})^T,\] we have \[\hat{\beta}_1=(X^TX)_1^-X^TY=(0, \bar{y}_{1+}, \bar{y}_{2+})^T,\] and \[\hat{\beta}_2=(X^TX)_2^-X^TY=(\bar{y}_{2+}, \bar{y}_{1+}-\bar{y}_{2+}, 0)^T\]
Claim: \(\mu, \alpha_1, \alpha_2, \alpha_1+\alpha_2\) are not estimable. \(\mu+\alpha_1\) and \(\alpha_1-\alpha_2\) are estimable. (This can be proved by checking whether the corresponding vectors \(a\) belong to \(C(X^T)\).)
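These estimability claims can also be checked numerically with condition (2) of the theorem, \(a^T(X^TX)^-X^TX=a^T\); here is a small R sketch for the same design matrix (the test vectors correspond to \(\mu\), \(\mu+\alpha_1\), and \(\alpha_1-\alpha_2\)).

```r
library(MASS)   # for ginv()

X   <- matrix(c(1, 1, 0,  1, 1, 0,  1, 0, 1,  1, 0, 1), nrow = 4, byrow = TRUE)
XtX <- t(X) %*% X

# a^T beta is estimable iff a^T (X^T X)^- X^T X = a^T
is_estimable <- function(a, tol = 1e-8) {
  max(abs(t(a) %*% ginv(XtX) %*% XtX - t(a))) < tol
}

is_estimable(c(1, 0, 0))    # mu                  -> FALSE
is_estimable(c(1, 1, 0))    # mu + alpha_1        -> TRUE
is_estimable(c(0, 1, -1))   # alpha_1 - alpha_2   -> TRUE
```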
5.2.2 Example
Consider the model (additive two-way model) \[Y_{ij}=\alpha_i+\beta_j+\epsilon_{ij}\] where \(i=1,\cdots,a \mbox{ ; } j=1,\cdots,b\) and \(\epsilon_{ij}\overset{iid}\sim N(0,\sigma^2)\). Derive the necessary and sufficient condition for \(\sum c_i \alpha_i + \sum d_j \beta_j\) to be estimable.
6 Optimality of LSE
6.1 Gauss-Markov Theorem
Gauss-Markov Theorem: Let \(\hat{\beta}\) be an OLSE from linear regression model \(Y=X\beta+\epsilon\) with \(E[\epsilon]=0\) and \(Var[\epsilon]=\sigma^2I_n\). Consider an estimable function \(a^T\beta\). Then \(a^T\hat{\beta}\) is the unique estimate with minimum variance among the class of unbiased linear estimators of \(a^T\beta\), i.e., the LSE of \(a^T\beta\) is the best linear unbiased estimator (BLUE).
Proof.
Method 1.
Because \(a^T\beta\) is estimable, \(a^T\hat{\beta}\) is unique and unbiased, i.e., \(E[a^T\hat{\beta}]=a^T\beta\). \[ Var[a^T\hat{\beta}] =\sigma^2 a^T(X^TX)^- a \]
Let \(b^TY\) be any unbiased estimate of \(a^T\beta\), we have \[E(b^TY)=b^TX\beta=a^T\beta,\] which implies that \(b^TX=a^T\).
\[\begin{eqnarray*} Var[b^TY] - Var[a^T\hat{\beta}] &=& \sigma^2[b^Tb - a^T(X^TX)^-a]\\ &=& \sigma^2 [b^Tb-b^TX(X^TX)^-X^Tb]\\ &=& \sigma^2 b^T(I-P_X)b \\ &=& \sigma^2 b^T(I-P_X)^2b\ge 0 \end{eqnarray*}\] Equality holds iff \((I-P_X)b=0\), i.e., \(b=P_Xb\), in which case \(b^TY=b^TP_XY=b^TX\hat{\beta}=a^T\hat{\beta}\).
Method 2
\[\begin{eqnarray*} Var[b^TY] &=& Var[ b^TY-a^T\hat{\beta} + a^T\hat{\beta}]\\ &=& Var[b^TY-b^TX\hat\beta + b^TX\hat\beta]\\ &=& Var[b^T(I-P_X)Y+b^TP_XY]\\ &=& Var[b^T(I-P_X)Y]+Var[b^TP_XY]+0\\ &\ge&Var[a^T\hat\beta] \end{eqnarray*}\] The cross-covariance term is zero because \((I-P_X)P_X=0\). Equality holds iff \(Var[b^T(I-P_X)Y]=0\), i.e., \((I-P_X)b=0\), in which case \(b^TY=a^T\hat{\beta}\).
Note: the proof doesn’t assume a specific distribution on the error part.
Proposition: Let \(\theta=X\beta\), then \(\theta\) is estimable, and \(\hat{\theta}=X\hat{\beta}\) is the BLUE of \(\theta\). For any matrix \(A_{p\times n}\), \(A\theta\) is estimable and \(A\hat{\theta}\) is the BLUE of \(A\theta\).
The OLSE \(a^T\hat{\beta}\) is unbiased for an estimable function \(a^T\beta\), no matter whether the errors are independent or not. QUESTION: is the OLSE optimal when \(Var[\epsilon]=\Sigma\) for any positive definite covariance matrix? No.
6.2 Generalized LSE (GLSE)
Proposition. Let \(\hat{\beta}_G\) be the GLSE from linear regression model \(Y=X\beta+\epsilon\) with \(E[\epsilon]=0\) and \(Var[\epsilon]=\Sigma\) with \(rank(\Sigma)=n\). Then \(a^T\hat{\beta}_G\) is the unique estimate with minimum variance among all linear unbiased estimators of estimable \(a^T\beta\). Hence the GLSE is the BLUE for this problem.
Proof.
(Method 1 outline: first show that \(a\in C(W^T)\), so that a function estimable under the original model is also estimable under the transformed model; second, show that \(E[a^T\hat\beta_G]=a^T\beta\) and that \(a^T\hat\beta_G\) is a linear function of \(Z\).)
First notice that \(a^T\hat{\beta}_G\) is a linear function of \(Y\).
Rewrite the model as \(Z=W\beta+\epsilon^*\), where \(Z=\Sigma^{-1/2}Y\), \(W=\Sigma^{-1/2}X\), \(E[\epsilon^*]=0\), and \(Var[\epsilon^*]=I_n\).
It can be shown that \(a^T\beta\) is also estimable under the new model. If this is true, \(a^T\hat{\beta}_G\) is the BLUE for \(a^T\beta\).
\(a^T\beta\) is estimable \(\Leftrightarrow\) \(a\in C(X^T)\).
Claim: \(C(X^T)=C(X^T\Sigma^{-1/2})=C(W^T)\).
Based on the claim, \(a\in C(W^T)\), which implies that \(a^T\beta\) is estimable for \(\beta\) under \(Z=W\beta+\epsilon^*\). Because \(\hat{\beta}_G\) is the OLSE of \(\beta\) under \(Z=W\beta+\epsilon^*\), \(a^T\hat{\beta}_G\) is the BLUE of \(a^T\beta\) under \(Z=W\beta+\epsilon^*\), which is equivalent to the original model.
Proof of the claim.
For any \(y\in C(X^T\Sigma^{-1/2})\), \(\exists c\in R^n\) s.t. \(y=X^T\Sigma^{-1/2}c=X^T(\Sigma^{-1/2}c)\), which implies that \(y\in C(X^T)\). This proves that \(C(X^T\Sigma^{-1/2}) \subseteq C(X^T)\).
For any \(y \in C(X^T)\), \(\exists c\in R^n\) s.t. \(y=X^Tc=X^T\Sigma^{-1/2}\Sigma^{1/2}c=(X^T\Sigma^{-1/2})(\Sigma^{1/2}c)\), which implies that \(C(X^T) \subseteq C(X^T\Sigma^{-1/2})\).
Therefore, \(C(X^T\Sigma^{-1/2}) = C(X^T)\).
Since \(\Sigma\) is positive definite, its square root exists and is nonsingular. Let \(Z=\Sigma ^{-1/2}Y\) and \(W=\Sigma^{-1/2}X\). Then \(Z=\Sigma^{-1/2}Y=\Sigma^{-1/2}X\beta+\Sigma^{-1/2}\epsilon\) and in class we showed that \(\hat{\beta}_{G}=(X^{T}\Sigma^{-1}X)^{-}X^{T}\Sigma^{-1}Y=(W^{T}W)^{-}W^{T}Z\). Since \(a^{T}\beta\) is estimable, there exists a \(c\) such that \(a=X^Tc=X^T\Sigma^{-1/2}\Sigma^{1/2}c=W^T\Sigma^{1/2}c\in C(W^T)\), which implies that \(a^T\beta\) is also estimable under the new model.
Since \(a\in C(W^T)\), write \(a=W^Td\) with \(d=\Sigma^{1/2}c\). Then \[\begin{eqnarray*} a^{T}\hat{\beta}_{G} & = & a^{T}(W^{T}W)^{-}W^{T}Z\\ & = & d^{T}W(W^{T}W)^{-}W^{T}Z\\ & = & d^{T}P_{W}Z\end{eqnarray*}\] Since \(P_{W}\) is invariant to the choice of \((W^{T}W)^{-}\), \(a^{T}\hat{\beta}_{G}\) is also invariant to the choice of generalized inverse.
\[\begin{eqnarray*} E[a^{T}\hat{\beta}_{G}] & = & E[a^{T}(W^{T}W)^{-}W^{T}Z]=a^{T}(W^{T}W)^{-}W^{T}W\beta=a^{T}\beta\end{eqnarray*}\] since if \(a^{T}\beta\) is estimable then \(a^{T}(W^{T}W)^{-}W^{T}W=a^{T}\) (condition (2) of the theorem on estimable functions).
Any linear estimator \(b^TY\) of \(a^T\beta\) based on \(Y\) can be written as a linear estimator based on \(Z\), because \(b^TY=(\Sigma^{1/2}b)^TZ\). Clearly \(\hat\beta_G\) is the OLSE for \(Z=W\beta+\epsilon^*\) with \(\epsilon^*\sim (0,I)\). Thus \(a^T\hat\beta_G\) is the BLUE.
Summary. Consider the linear model \(Y=X\beta+\epsilon\) with \(E[\epsilon]=0\) and \(Var[\epsilon]=\Sigma\), where \(\Sigma\) is a positive definite symmetric matrix. Since \(\Sigma\) is positive definite and symmetric, we can find a non-singular symmetric matrix \(\Sigma^{1/2}\) s.t. \(\Sigma=\Sigma^{1/2}\Sigma^{1/2}\). Consider \(Z=\Sigma^{-1/2}Y\); we have \[\begin{eqnarray*} E[Z] &=& \Sigma^{-1/2}E[Y]=\Sigma^{-1/2}X\beta\\ Var[Z] &=& \Sigma^{-1/2}\Sigma \Sigma^{-1/2} = I_n \end{eqnarray*}\] Consider the design matrix \(W=\Sigma^{-1/2}X\) and the response vector \(Z=\Sigma^{-1/2}Y\). The OLSE for \(\beta\) would be \[\hat{\beta}_G=(W^TW)^-W^TZ\] In the original notation, we have the generalized least squares estimate: \[\hat{\beta}_G=(X^T\Sigma^{-1}X)^-X^T\Sigma^{-1}Y\]
6.3 Examples
Example 1. Consider \(K\) independent random samples with sizes \(n_1, \dots, n_K\). Only the sample means \(\bar{y}_1, \dots, \bar{y}_K\) are available. We wish to estimate the common population mean \(\mu\) efficiently.
Solution:
Linear Model: \[ \mathbf{y} = \mathbf{1}_K \mu + \boldsymbol{\epsilon}, \quad \mathbb{E}[\boldsymbol{\epsilon}] = \mathbf{0}, \quad \text{Var}(\boldsymbol{\epsilon}) = \sigma^2 \mathbf{V}, \] where \(\mathbf{V} = \text{diag}(1/n_1, \dots, 1/n_K)\).
Although the unweighted average of the sample means, \(\bar y = K^{-1}\sum_{k} \bar y_k\), is unbiased for \(\mu\), it does not achieve the minimal variance (unless the \(n_k\) are all equal).
The GLSE solution: \[\hat{\mu}_{\text{GLS}} =(\mathbf{1}_K^T \mathbf V^{-1}\mathbf{1}_K)^{-1}\mathbf{1}_K^T \mathbf V^{-1} \mathbf y= \frac{\sum_{k=1}^K n_k \bar{y}_k}{\sum_{k=1}^K n_k}\]
Note that \(\mathbf V^{-1/2} \mathbf y \sim (\mathbf V^{-1/2}\mathbf 1_K \mu,\ \sigma^2 \mathbf I_K)\), so the GLSE is simply the OLSE computed from the transformed response.
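A small R sketch of this example, with invented group sizes and sample means; it confirms that the GLSE coincides with the pooled (sample-size weighted) mean, while the unweighted average of the sample means is a different, less efficient, unbiased estimate.

```r
n_k  <- c(5, 10, 20)                   # hypothetical group sizes
ybar <- c(2.0, 2.6, 3.1)               # hypothetical sample means

V    <- diag(1 / n_k)                  # Var(ybar) = sigma^2 V
ones <- rep(1, length(n_k))

mu_gls <- drop(solve(t(ones) %*% solve(V) %*% ones,
                     t(ones) %*% solve(V) %*% ybar))

c(GLSE = mu_gls,
  pooled_mean = sum(n_k * ybar) / sum(n_k),   # same as the GLSE
  unweighted  = mean(ybar))                   # OLSE ignoring V: unbiased, not BLUE
```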
Example 2. Assume \(E(Y_i)=x_i^T \beta,\quad Var(Y_i)=(x_i^T \beta)^2, cov(Y_i, Y_j)=0 \text{ if } i\not=j.\)
Note that \[Var(Y)= \mathbf{V} = \begin{pmatrix} (\mathbf{x}_1^T \boldsymbol{\beta})^2 & 0 & \cdots & 0 \\ 0 & (\mathbf{x}_2^T \boldsymbol{\beta})^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & (\mathbf{x}_n^T \boldsymbol{\beta})^2 \end{pmatrix}.\]
Different from Example 1, the \(V\) matrix here depends on \(\beta\). As a result, we don’t have a closed-form GLSE in this problem. But we can use an Iteratively Reweighted Least Squares (IRLS) algorithm:
Algorithm: IRLS for Variance Structure \(Var(Y_i) = (x_i^T \beta)^2\).
\(\mathbf 0\): Initialization: \(\beta^{(0)}\)
\(\mathbf 1\): \(\Sigma(\hat\beta^{(t)} ) = Diag((x_1^T \hat\beta^{(t)})^2, \cdots, (x_n^T \hat\beta^{(t)})^2)\)
\(\mathbf 2\): \(\hat\beta^{(t+1)} = (X^T \Sigma^{-1}(\hat\beta^{(t)}) X)^{-1}X^T \Sigma^{-1}(\hat\beta^{(t)})Y\).
Repeat 1 and 2 until convergence.
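Below is a minimal R sketch of this IRLS iteration on simulated data; the data-generating values, starting value, and convergence tolerance are all illustrative, and here the iteration is initialized at the OLS fit.

```r
set.seed(2)
n  <- 200
X  <- cbind(1, runif(n, 1, 3))
mu <- drop(X %*% c(1, 2))
Y  <- mu * (1 + 0.1 * rnorm(n))            # sd of Y_i proportional to its mean

beta <- solve(t(X) %*% X, t(X) %*% Y)       # step 0: initialize (here, the OLSE)
for (iter in 1:100) {
  w <- 1 / drop(X %*% beta)^2               # step 1: weights 1 / (x_i^T beta)^2
  beta_new <- solve(t(X * w) %*% X, t(X * w) %*% Y)   # step 2: weighted LS
  if (max(abs(beta_new - beta)) < 1e-10) break
  beta <- beta_new
}
drop(beta)
```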
7 Estimation with Linear Restrictions
7.1 Design Matrix of Full Rank
Let \(Y=X\beta+\epsilon\), where \(X\) is full rank. Suppose we want to minimize \[(Y-X\beta)^T(Y-X\beta)\] subject to the linear restriction \(A\beta=c\), where \(A\) is a known \(q\times p\) matrix of rank \(q\) and \(c\) is a known \(q\times 1\) vector.
We use Lagrange multipliers to find the least squares estimate subject to the linear restriction. \[\begin{eqnarray*} f(\beta,\lambda) &=& (Y-X\beta)^T(Y-X\beta) + \lambda^T(A\beta-c) \end{eqnarray*}\] Taking the derivative with respect to \(\beta\), we have \[-2X^TY+2X^TX\beta+A^T\lambda=0\] Let \(\hat{\beta}_H\) denote the estimate subject to the linear restriction; then \[\hat{\beta}_H=(X^TX)^{-1}X^TY-\frac{1}{2}(X^TX)^{-1}A^T\hat{\lambda}_H =\hat{\beta}-\frac{1}{2}(X^TX)^{-1}A^T\hat{\lambda}_H\] The linear restriction \[c=A\hat{\beta}_H=A\hat{\beta}-\frac{1}{2}A(X^TX)^{-1}A^T\hat{\lambda}_H\] gives \[-\frac{1}{2}\hat{\lambda}_H=[A(X^TX)^{-1}A^T]^{-1}(c-A\hat{\beta})\] Substituting into the formula for \(\hat{\beta}_H\), we have \[\hat{\beta}_H=\hat{\beta}+(X^TX)^{-1}A^T[A(X^TX)^{-1}A^T]^{-1}(c-A\hat{\beta})\]
Note, when \(X\) does not have full rank, not all linear restrictions have a solution.
Proposition
The estimate given above does minimize \(\epsilon^T\epsilon\) subject to \(A\beta=c\).
Proof.
First, \[\begin{eqnarray*} \epsilon^T\epsilon &=& \|Y-X\beta \|^2\\ &=& \|Y-\hat{Y}+\hat{Y}-X\beta \|^2\\ &=& \|Y-\hat{Y} \|^2 + \|\hat{Y}-X\beta \|^2 + 2(Y-\hat{Y})^T(\hat{Y}-X\beta)\\ &=& \|Y-\hat{Y} \|^2 + \|\hat{Y}-X\beta \|^2 + 2Y^T(I-P_X)^TX(\hat{\beta}-\beta)\\ &=& \|Y-\hat{Y} \|^2 + \|X(\hat{\beta}-\beta) \|^2 \end{eqnarray*}\]
Second, \[\begin{eqnarray*} \|X(\hat{\beta}-\beta) \|^2 &=& (\hat{\beta}-\beta)^TX^TX(\hat{\beta}-\beta)\\ &=& (\hat{\beta}-\hat{\beta}_H + \hat{\beta}_H-\beta)^TX^TX(\hat{\beta}-\hat{\beta}_H + \hat{\beta}_H - \beta)\\ &=& \|X(\hat{\beta}-\hat{\beta}_H)\|^2 + \|X(\hat{\beta}_H-\beta)\|^2 \end{eqnarray*}\] The last step is true because \[\begin{eqnarray*} 2(\hat{\beta}-\hat{\beta}_H)^TX^TX(\hat{\beta}_H-\beta) &=& \hat\lambda_H^TA(X^TX)^{-1}X^TX(\hat\beta_H-\beta)\\ &=& \hat{\lambda}_H^TA(\hat{\beta}_H-\beta)\\ &=& \hat{\lambda}_H^T(c-c)=0 \end{eqnarray*}\]
Thus, \[\begin{eqnarray*} \epsilon^T \epsilon &=& \|Y-\hat{Y} \|^2 + \|X(\hat{\beta}-\hat{\beta}_H)\|^2 + \|X(\hat{\beta}_H-\beta)\|^2 \end{eqnarray*}\]
It is minimized when \(\beta=\hat{\beta}_H\).
Here is another proof
Proof.
\[\begin{eqnarray*} ||Y-X\beta||^2&=& ||Y-X\hat\beta+X\hat\beta-X\hat\beta_H+X\hat\beta_H-X\beta||^2\\ &=& ||Y-X\hat\beta||^2+||X\hat\beta-X\hat\beta_H||^2+||X\hat\beta_H-X\beta||^2+2[\cdots]\\ &=& ||Y-X\hat\beta||^2+||X\hat\beta-X\hat\beta_H||^2+||X\hat\beta_H-X\beta||^2 \end{eqnarray*}\] The last step is true because \[\begin{eqnarray*} Y-X\hat\beta &=& (I-P_X)Y\\ X\hat\beta-X\hat\beta_H &=& X(\hat\beta-\hat\beta_H)=\frac{1}{2}X(X^TX)^{-1}A^T\hat\lambda_H\\ X\hat\beta_H-X\beta&=&X(\hat\beta_H-\beta) \end{eqnarray*}\] Thus, \[\begin{eqnarray*} (Y-X\hat\beta)^T(X\hat\beta-X\hat\beta_H) &=& Y^T(I-P_X)X(\hat\beta-\hat\beta_H)=0\\ (Y-X\hat\beta)^T(X\hat\beta_H-X\beta)&=& Y^T(I-P_X)X(\hat\beta_H-\beta)=0\\ (X\hat\beta-X\hat\beta_H)^T(X\hat\beta_H-X\beta) &=&(\hat\beta-\hat\beta_H)^TX^TX(\hat\beta_H-\beta)\\ &=&\frac{1}{2}\hat\lambda_H^TA(X^TX)^{-1}X^TX(\hat\beta_H-\beta)\\ &=&\frac{1}{2}\hat\lambda_H^T(A\hat\beta_H-A\beta)=\frac{1}{2}\hat\lambda_H^T(c-c)=0 \end{eqnarray*}\]
An interesting observation (Exercise 3.g.4 on page 62): \[||Y-\hat Y_H||^2-||Y-\hat Y||^2=\sigma^2 \hat\gamma_H^T (Var[\hat\gamma_H])^{-1}\hat\gamma_H\]
Example.
Consider the linear model \[Y_{ij}=\mu_i+\epsilon_{ij}; i=1,2, j=1,2\] where \(\epsilon \sim (0, I\sigma^2)\). Consider the linear restriction \(\mu_1=\mu_2\), or \(\mu_1-\mu_2=0\). We can rewrite the linear restriction as \(A\beta=(1 \mbox{ } -1)\beta=0\). \[\begin{eqnarray*} A&=& (1 \mbox{ } -1)\\ \hat{\beta} &=& (\bar Y_{1.}, \bar Y_{2.})^T\\ \hat{\beta}_H &=& \hat\beta+ \begin{pmatrix} 1/2 & 0\\ 0 & 1/2\end{pmatrix} \begin{pmatrix}1 \\ -1\end{pmatrix} \left[(1, -1)\begin{pmatrix} 1/2 & 0\\ 0 & 1/2\end{pmatrix}\begin{pmatrix}1 \\ -1\end{pmatrix} \right]^{-1} (0-(1,-1)\begin{pmatrix}\hat\beta_1 \\ \hat\beta_2\end{pmatrix})\\ &=& \hat\beta - \begin{pmatrix}1/2 \\ -1/2\end{pmatrix}(\hat\beta_1-\hat\beta_2)= \begin{pmatrix} \frac{\bar Y_{1.}+\bar Y_{2.}}{2} \\ \frac{\bar Y_{1.}+\bar Y_{2.}}{2} \end{pmatrix} \end{eqnarray*}\]
Example. Consider the model \[Y_i=\mu_i+\epsilon_i \mbox{, i=1,2,3,4}\] where \(\epsilon\sim(0,I_4\sigma^2)\). We are interested in the restriction \(\mu_1+\mu_2+\mu_3+\mu_4=0\).
Clearly \(X=I_4\), \(\hat\beta=\hat\theta=(y_1,y_2,y_3,y_4)^T\), \(A=(1,1,1,1)\), and \(c=0\). Thus the restricted estimate is \[\hat\beta_H=\hat\beta+(I_4)^{-1}A^T(AA^T)^{-1}(0-A\hat\beta)=\hat\beta-\frac{1}{4}A^T\sum_{i=1}^4 y_i= \begin{pmatrix}y_1-\bar y\\ y_2-\bar y\\ y_3-\bar y\\ y_4-\bar y\end{pmatrix}\]
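A short R sketch of the restricted LSE formula for the full-rank case, applied to this last example with invented responses; restricted_lse() is a hypothetical helper written here for illustration, not a built-in function.

```r
# Restricted LSE: beta_H = beta_hat + (X'X)^{-1} A' [A (X'X)^{-1} A']^{-1} (c - A beta_hat)
restricted_lse <- function(X, Y, A, c_vec) {
  XtX_inv  <- solve(t(X) %*% X)
  beta_hat <- XtX_inv %*% t(X) %*% Y
  drop(beta_hat + XtX_inv %*% t(A) %*%
         solve(A %*% XtX_inv %*% t(A), c_vec - A %*% beta_hat))
}

Y <- c(1.2, 0.5, -0.3, 2.0)             # hypothetical y_1, ..., y_4
X <- diag(4)
A <- matrix(1, nrow = 1, ncol = 4)      # restriction: mu_1 + mu_2 + mu_3 + mu_4 = 0

cbind(restricted = restricted_lse(X, Y, A, 0), y_minus_ybar = Y - mean(Y))
```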
7.2 Design Matrix of Less Than Full Rank
Take a subset of \(X\), \(X_1\) and a subset of \(Z\), \(Z_1\).
8 Adding/Deleting Covariates/Cases
The technique introduced here is closely related to the sweep algorithm for fitting the linear model.
8.1 Adding Covariates
Suppose we have fit the model \(Y=X\beta+\epsilon\) and obtained the OLSE \(\hat\beta\). Now we have a new set of variables \(Z\) and we want to add them to the model. In other words, we want to obtain the OLSE of \(\beta\) and \(\gamma\) under the new model: \[Y=X\beta+Z\gamma+\epsilon= (X | Z) \begin{pmatrix} \beta\\ \gamma \end{pmatrix} + \epsilon=W\delta+\epsilon\] Here we assume that (1) the columns of \(X\) and the columns of \(Z\) are linearly independent; (2) both \(X_{n\times p}\) and \(Z_{n\times t}\) are full rank (i.e., \(rank([X\ Z])=p+t\)); (3) \(\epsilon\sim (0,\sigma^2I)\). One can refit the model and estimate \(\beta\) and \(\gamma\) using the new model. However, this is not computationally efficient. A more efficient way is to use what we have learned from the old model to obtain the OLSE for the new model.
Lemma. Let \(R=I-P_X\). Suppose that \(rank(X)=p\), \(rank(Z)=t\), and the columns of \(X\) and the columns of \(Z\) are jointly linearly independent (so that \(rank([X\ Z])=p+t\)). Then (1) \(Z^TRZ\) is p.d.; (2) \(rank(Z^TRZ)=rank(Z)=t\).
Another way to state the result: let \(X\) be an \(n\times p\) matrix, let \(Z\) be an \(n\times t\) matrix, and let \(R=I-P_X\). Then \[rank([X\vdots Z])=p+t \Rightarrow rank(Z^T R Z)=t.\]
Before proving the lemma, let’s review a property regarding \(p.d.\) matrices. We learned that
If \(A_{n\times p}\) has rank \(p\), then \(A^TA>0\). Proof: \(x^TA^TAx=y^Ty\ge0\) where \(y=Ax\), with equality iff \(Ax=0\) iff \(x=0\); the last equivalence holds because \(rank(A)=p\), so the columns of \(A\) are linearly independent. Here is the proof of the lemma:
Proof.
Suppose the columns of \(RZ\) are not linearly independent. Then \(\exists a\not=0\) s.t. \(RZa=0\), i.e., \(Za=P_XZa=Xb\) for some \(b\). Because the columns of \(X\) and the columns of \(Z\) are jointly linearly independent, we must have \(a=b=0\). The contradiction implies that the columns of \(RZ\) are linearly independent and \(rank(RZ)=t\). In addition, \(Z^TRZ=(RZ)^TRZ\). Therefore, \((RZ)^TRZ=Z^TRZ\) is p.d. and \(rank(Z^TRZ)=rank(RZ)=t\).
An algorithm to update the fit of \(Y\) on \(X\) to the fit of \(Y\) on \(X\) and \(Z\):
Regress \(Y\) on \(X\) to obtain \(\hat\beta\). During this step, we can compute \(R=I-P_X\)
Compute \(RZ\). Note that \(RZ\) can also be obtained by regressing each column of \(Z\) on \(X\) and taking the residuals, although it is more efficient to multiply \(R\) by \(Z\).
Regress \(Y\) on \(RZ\) to obtain \(\hat\gamma_G\).
Regress \(Y-Z\hat\gamma_G\) on \(X\) to obtain \(\hat\beta_G\).
Why does the algorithm work? See the following theorem.
Theorem.
Let \(R=I-P_X\), \(L=(X^TX)^{-1}X^TZ\), \(M=(Z^TRZ)^{-1}\), and let \(\hat \delta_G\) be the OLSE of \(\delta\). Then
\(\hat\gamma_G=(Z^TRZ)^{-1}Z^TRY=[(RZ)^T(RZ)]^{-1}(RZ)^T RY=MZ^TRY\)
\(\hat\beta_G=(X^TX)^{-1}X^T(Y-Z\hat\gamma_G)=\hat\beta - L\hat\gamma_G\)
\(RSS_{new}=(Y-\hat Y_G)^T(Y-\hat Y_G)=(Y-Z\hat\gamma_G)^T R (Y-Z\hat\gamma_G)=Y^TRY-\hat\gamma_G^TZ^TRY\)
\(var(\hat\delta_G)=\sigma^2 \begin{pmatrix}(X^TX)^{-1}+LML^T & -LM\\ -ML^T & M \end{pmatrix}= \sigma^2 (\tilde X^T\tilde X)^{-1}\) where \(\tilde X = [X\vdots Z]\).
Proof.
First, orthogonalize \(Z\) by writing \(Z=P_XZ+RZ\). Thus the new model is \[\begin{eqnarray*} Y &=& X\beta+P_XZ\gamma+RZ\gamma+\epsilon\\ &=& X\alpha + RZ\gamma + \epsilon\\ &=& (X|RZ)\begin{pmatrix}\alpha \\ \gamma \end{pmatrix} + \epsilon \\ &=& V \lambda + \epsilon \end{eqnarray*}\] where \(\alpha=\beta + (X^TX)^{-1}X^TZ\gamma=\beta+L\gamma\). Note that \(C(X)\perp C(RZ)\) (this is true because for all \(a\) and \(b\) we have \(a^TX^TRZb=a^T(X^T-X^T)Zb=0\)); thus the columns of \(X\) and \(RZ\) are linearly independent and \(V\) is full rank. The OLSE of \(\lambda\) is \[\begin{eqnarray*} \hat\lambda &=& (V^TV)^{-1}V^TY =\begin{pmatrix}X^TX & X^TRZ\\ Z^TRX & Z^TRZ\end{pmatrix}^{-1} \begin{pmatrix}X^T \\ Z^TR\end{pmatrix}Y\\ &=& \begin{pmatrix}X^TX & 0 \\ 0 & Z^TRZ\end{pmatrix}^{-1}\begin{pmatrix}X^T \\ Z^TR\end{pmatrix}Y\\ &=& \begin{pmatrix} (X^TX)^{-1}X^TY\\ (Z^TRZ)^{-1}Z^TRY\end{pmatrix}=\begin{pmatrix}\hat \alpha_G \\ \hat \gamma_G\end{pmatrix} \end{eqnarray*}\] Thus \(\hat\gamma_G=(Z^TRZ)^{-1}Z^TRY\). Because \(R\) is a projection matrix, we can rewrite \(\hat\gamma_G\) as \[\hat\gamma_G=(Z^TRZ)^{-1}Z^TRY=[(RZ)^T(RZ)]^{-1}(RZ)^T RY\] Note that if \(Z\) were the only set of covariates in the model, we would solve \(Z^TZ\gamma=Z^TY\). With \(X\) already in the model, we replace \(Z\) with \(RZ\) and \(Y\) with \(RY\).
Since \(\hat \alpha_G=\hat\beta_G+L\hat\gamma_G\), we have \[\hat\beta_G=(X^TX)^{-1}X^TY-L\hat\gamma_G\] This relates the OLSE under the old model to the OLSE under the new model.
Here we relate the RSS of the old model to that of the new model. Note that \[\begin{eqnarray*} Y-\hat Y_{new} &=& Y-X\hat\beta_G-Z\hat\gamma_G=Y-X\hat\beta + XL\hat\gamma_G - Z\hat\gamma_G\\ &=& (I-P_X)Y+(X(X^TX)^{-1}X^T-I)Z\hat\gamma_G\\ &=& RY-RZ\hat\gamma_G \end{eqnarray*}\] Thus \[RSS_{new} = (RY-RZ\hat\gamma_G)^T(RY-RZ\hat\gamma_G)= (Y-Z\hat\gamma_G)^TR(Y-Z\hat\gamma_G)\] Also note that \(RZ\hat\gamma_G=RZ(Z^TRZ)^{-1}Z^TRY=P_{RZ}Y\). Hence, \[\begin{eqnarray*} RSS_{new}&=&Y^TRY-2Y^T(RZ\hat\gamma_G)+\hat\gamma_G^TZ^TRZ\hat\gamma_G\\ &=& Y^TRY-2Y^T(RZ\hat\gamma_G)+(RZ\hat\gamma_G)^T(RZ\hat\gamma_G)=Y^TRY-2Y^TP_{RZ}Y+Y^TP_{RZ}Y\\ &=& RSS_{old} - Y^TP_{RZ}Y \end{eqnarray*}\]
This result indicates that adding new variables will decrease \(RSS\).
Here is a more straightforward way to prove this: Note that \[\begin{eqnarray*} P_V&=&V(V^TV)^{-1}V^T= [X\vdots RZ]\begin{pmatrix} (X^TX)^{-1} & 0\\ 0 & (Z^TRZ)^{-1} \end{pmatrix} \begin{pmatrix} X^T\\ Z^T R \end{pmatrix}\\ &=& P_X + P_{RZ} \end{eqnarray*}\] Therefore, \[RSS_{new}=Y^T(I-P_V)Y = Y^TY - Y^T P_X Y - Y^T P_{RZ}Y=RSS - Y^T P_{RZ}Y \le RSS.\]
- Note that \(\hat\gamma_G=(Z^TRZ)^{-1}Z^TRY=MZ^TRY\). Thus \[\begin{eqnarray*} Var(\hat\gamma_G) &=& Var(MZ^TRY)=MZ^TRZM\sigma^2=\sigma^2M\\ Cov(\hat\beta, \hat\gamma_G) &=& cov((X^TX)^{-1}X^TY, MZ^TRY)\\ &=&\sigma^2(X^TX)^{-1}X^TRZM\\ &=& 0 \quad (\mbox{since } X^TR=0)\\ Cov(\hat\beta_G,\hat\gamma_G)&=& Cov(\hat\beta-L\hat\gamma_G,\hat\gamma_G)\\ &=& cov(\hat\beta,\hat\gamma_G)-L Var(\hat\gamma_G)\\ &=& -\sigma^2LM\\ Var(\hat\beta_G) &=& Var(\hat\beta)-2cov[\hat\beta,L\hat\gamma_G]+Var(L\hat\gamma_G)\\ &=& (X^TX)^{-1}\sigma^2 - 0+LVar(\hat\gamma_G)L^T\\ &=& \sigma^2[(X^TX)^{-1}+LML^T] \end{eqnarray*}\]
8.1.1 Add one covariate at a time
When \(Z\) is a single column \(z\) (\(n\times 1\)), \(L\) (denoted by \(l\) now) is a \(p\times 1\) vector, while \(M\) (denoted by \(m\) now), \(z^T R z\), and \(\hat\gamma_G\) are scalars. We have \[\hat\beta_{new}= \begin{pmatrix}\hat\beta_{old}-lz^TRYm\\z^TRYm \end{pmatrix}\] where \(l=(X^TX)^{-1}X^Tz\) and \(m=(z^TRz)^{-1}\).
Example 3.7.2 of Seber (page 57). Let the columns of \(X\) be denoted by \(x_{(j)},j=0,1,\cdots,p-1\), so that \[E[Y]=x_{(0)}\beta_0+x_{(1)}\beta_1+x_{(2)}\beta_2+\cdots+x_{(p-1)}\beta_{p-1}\] Suppose now that we want to introduce the explanatory variable \(x_{(p)}\) into the model. Thus \(Z^TRZ=x_{(p)}^TRx_{(p)}\) is a scalar. Hence, \[\begin{eqnarray*} \hat\beta_{p,G} &=& \hat\gamma_G=(Z^TRZ)^{-1}Z^TRY=\frac{x_{(p)}^TRY}{x_{(p)}^TRx_{(p)}}\\ (\hat\beta_{0,G},\cdots,\hat\beta_{p-1,G})^T &=& \hat\beta-(X^TX)^{-1}X^Tx_{(p)}\hat\beta_{p,G}\\ Y^TR_GY &=& Y^TRY-\hat\beta_{p,G}x_{(p)}^TRY \end{eqnarray*}\]
This is a homework assignment
We can also use \(A.9.1\) to prove the results. \(A.9.1\) states that if all inverses exist, then \[\begin{pmatrix}A_{11} & A_{12} \\ A_{21} & A_{22}\end{pmatrix}^{-1} =\begin{pmatrix}A^{11} & A^{12} \\ A^{21} & A^{22} \end{pmatrix} =\begin{pmatrix}A_{11}^{-1} + B_{12}B_{22}^{-1}B_{21} & -B_{12}B_{22}^{-1}\\ -B_{22}^{-1}B_{21} & B_{22}^{-1} \end{pmatrix}\] where \(B_{22}=A_{22}-A_{21}A_{11}^{-1}A_{12}, B_{12}=A_{11}^{-1}A_{12}, B_{21}=A_{21}A_{11}^{-1}\). Here \(B_{22}\) is known as the Schur complement of \(A_{11}\). We refer to this result as the inverse of a partitioned matrix.
Now let \(A_{11}=(X^TX), A_{12}=X^TZ, A_{21}=Z^TX, A_{22}=Z^TZ\) and \(L=(X^TX)^{-1}X^TZ, M=(Z^TRZ)^{-1}\), we can show that \[A^{11}=(X^TX)^{-1}+LML^T, A^{12}=-LM, A^{21}=-ML^T, A^{22}=M\] Because \[\begin{pmatrix}\hat\beta_G\\ \hat\gamma_G\end{pmatrix} =\begin{pmatrix}A^{11} & A^{12} \\ A^{21} & A^{22} \end{pmatrix} \begin{pmatrix}X^TY \\ Z^TY \end{pmatrix}\] We have \(\hat\beta_G=\hat\beta-L\hat\gamma_G\).
Algebraic explanation of the inverse of a partitioned matrix.
Let \(M\) be a block matrix partitioned as \[ M = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}\] where \(A_{11}\) is invertible. Define \[ B_{22} = A_{22} - A_{21}A_{11}^{-1}A_{12}, \quad B_{12} = A_{11}^{-1}A_{12}, \quad B_{21} = A_{21}A_{11}^{-1}. \]
The matrix can be decomposed as (a block LU/LDU decomposition) \[ M = LDU \] where:
- \(L\) is lower block triangular (Gaussian elimination matrix)
- \(D\) is block diagonal containing pivots
- \(U\) is upper block triangular
The inverse is \(M^{-1} = U^{-1}D^{-1}L^{-1}\) with: \[\begin{align*} U^{-1} &= \begin{pmatrix} I & -B_{12} \\ 0 & I \end{pmatrix}, \\ D^{-1} &= \begin{pmatrix} A_{11}^{-1} & 0 \\ 0 & B_{22}^{-1} \end{pmatrix}, \\ L^{-1} &= \begin{pmatrix} I & 0 \\ -B_{21} & I \end{pmatrix}. \end{align*}\]
Finally, \[M^{-1} = \boxed{ \begin{pmatrix} A_{11}^{-1} + B_{12}B_{22}^{-1}B_{21} & -B_{12}B_{22}^{-1} \\ -B_{22}^{-1}B_{21} & B_{22}^{-1} \end{pmatrix} }\]
This method is known as the ``sweep operator''. Most software uses the QR decomposition to solve least squares problems; SAS has an option to use the sweep operator.
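Here is a minimal R sketch (with simulated data) of the four-step updating algorithm above, compared against a direct refit of the augmented model; it is meant only to illustrate the algebra, not as an efficient implementation.

```r
set.seed(3)
n <- 30
X <- cbind(1, rnorm(n))                    # covariates already in the model
Z <- cbind(rnorm(n), rnorm(n))             # covariates to be added
Y <- drop(X %*% c(1, 2) + Z %*% c(0.5, -1) + rnorm(n))

beta_old  <- solve(t(X) %*% X, t(X) %*% Y)            # step 1: fit Y on X
R         <- diag(n) - X %*% solve(t(X) %*% X, t(X))  # R = I - P_X
RZ        <- R %*% Z                                  # step 2
gamma_hat <- solve(t(RZ) %*% RZ, t(RZ) %*% Y)         # step 3: regress Y on RZ
beta_new  <- solve(t(X) %*% X, t(X) %*% (Y - Z %*% gamma_hat))  # step 4

# Direct refit of the augmented model agrees
W <- cbind(X, Z)
delta_direct <- solve(t(W) %*% W, t(W) %*% Y)
cbind(update = c(beta_new, gamma_hat), direct = drop(delta_direct))
```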
8.1.2 Removing Covariates
This is a homework problem
Assume that \(X_1\) and \(X_2\) are full rank and the columns between the two matrices are linearly independent. Suppose we have obtained the OLSE of \(\beta_1\) and \(\beta_2\), denoted by \(\hat\beta_1\) and \(\hat\beta_2\) respectively, in the following full model: \[Y=X_1\beta_1+X_2\beta_2+\epsilon\] Now consider removing \(X_2\) from the model and fit the following new model: \[Y=X_1\beta_1+\epsilon\] and denote the corresponding OLSE as \(\tilde\beta_1\). Then \(\tilde{\beta}_1= \hat\beta_1+(X_1^TX_1)^{-1}X_1^TX_2\hat\beta_2\)
Question: When does adding or removing covariates not affect the OLSE for the covariates already in the model? Think about orthogonal covariates and balanced analysis of variance. This happens when the columns of \(X\) and the columns of \(Z\) are orthogonal.
8.2 Adding Subjects
Suppose we have fit a linear model using \(n\) subjects with the design matrix \(X\). Now data from \(k\) new subjects are available and we denote the design matrix for them by \(X_1\) and the responses by \(Y_1\). How can we update our model efficiently? Let \(\hat\beta_{old}\) denote the OLSE based on the first \(n\) subjects, i.e., \(\hat\beta_{old}=(X^TX)^{-1}X^TY\) and let \(\hat\beta_{new}\) denote the OLSE using all available data. Then we have \(\hat\beta_{new}=(X^TX+X_1^TX_1)^{-1}(X^TY+X_1^TY_1)\). Recall the Sherman-Morrison-Woodbury formula (\(A.9.3\)), \[(A+UBV)^{-1}=A^{-1}-A^{-1}UB(B+BVA^{-1}UB)^{-1}BVA^{-1}\] Here \(A=X^TX\), \(U=X_1^T\), \(B=I_k\), and \(V=X_1\). We have \[(X^TX+X_1^TX_1)^{-1}=(X^TX)^{-1}-(X^TX)^{-1} X_1^T(I+X_1(X^TX)^{-1}X_1^T)^{-1}X_1 (X^TX)^{-1}\] Thus, \[\begin{eqnarray*} \hat\beta_{new} &=& [(X^TX)^{-1}-(X^TX)^{-1} X_1^T(I+X_1(X^TX)^{-1}X_1^T)^{-1}X_1 (X^TX)^{-1}](X^TY+X_1^TY_1)\\ &=& \hat\beta_{old} + (X^TX)^{-1}X_1^T(I+X_1(X^TX)^{-1}X_1^T)^{-1}[-X_1(X^TX)^{-1}X^TY-X_1(X^TX)^{-1}X_1^TY_1+\\ &&(I+X_1(X^TX)^{-1}X_1^T)Y_1]\\ &=& \hat\beta_{old}+(X^TX)^{-1}X_1^T(I+X_1(X^TX)^{-1}X_1^T)^{-1}[Y_1-X_1\hat\beta_{old}] \end{eqnarray*}\]
Now we add a single row to \(X\); let \(x_1^T\) denote the row and \(y_1\) the corresponding response. We have \[\hat\beta_{new} = \hat\beta_{old}+\frac{(X^TX)^{-1}x_1}{1+x_1^T(X^TX)^{-1}x_1}[y_1-x_1^T\hat\beta_{old}]\] Note, this can also be proved using \(A.9.4\) by letting \(C=(X^TX)^{-1}x_1\) and \(D=(1+x_1^T(X^TX)^{-1}x_1)^{-1}=(1+x_1^TC)^{-1}\).
By \(A.9.4\), \((X^TX+x_1x_1^T)^{-1}=(X^TX)^{-1}-CDC^T\). Thus, \[\begin{eqnarray*} \hat\beta_{new} &=& (X^TX+x_1x_1^T)^{-1}(X^TY+x_1y_1)\\ &=&\hat\beta_{old}-CDC^T(X^TY+x_1y_1)+Cy_1\\ &=&\hat\beta_{old}+CD[D^{-1}y_1-C^TX^TY-C^Tx_1y_1]\\ &=&\hat\beta_{old}+CD[y_1-C^TX^TY]\\ &=&\hat\beta_{old}+CD[y_1-x_1^T\hat\beta_{old}]\\ &=&\hat\beta_{old}+\frac{(X^TX)^{-1}x_1}{1+x_1^T(X^TX)^{-1}x_1}[y_1-x_1^T\hat\beta_{old}] \end{eqnarray*}\]
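The single-row updating formula can be checked numerically; the following R sketch (simulated data, illustrative names) compares the update with a full refit that includes the extra case.

```r
set.seed(4)
n <- 25
X <- cbind(1, rnorm(n), rnorm(n))
Y <- drop(X %*% c(1, -1, 0.5) + rnorm(n))

XtX_inv  <- solve(t(X) %*% X)
beta_old <- drop(XtX_inv %*% t(X) %*% Y)

x1 <- c(1, rnorm(2))                              # new row of the design matrix
y1 <- sum(x1 * c(1, -1, 0.5)) + rnorm(1)          # new response

beta_new <- beta_old +
  drop(XtX_inv %*% x1) / drop(1 + t(x1) %*% XtX_inv %*% x1) *
  drop(y1 - sum(x1 * beta_old))

X_full <- rbind(X, x1); Y_full <- c(Y, y1)
cbind(update = beta_new,
      refit  = drop(solve(t(X_full) %*% X_full, t(X_full) %*% Y_full)))
```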
8.2.1 Removing subjects
We single out the situation when we want to remove the \(i\)th subject. Using \(A.9.4\), i.e., \[(A-uv^T)^{-1} = A^{-1} + \frac{A^{-1}uv^TA^{-1}}{1-v^TA^{-1}u}\] Let \(X_{(i)}\) denote the design matrix with the \(i\)th observation removed. Note that \(X^TX=\sum x_ix_i^T\). We have \[\begin{eqnarray*} (X_{(i)}^TX_{(i)})^{-1} &=& (X^TX-x_ix_i^T)^{-1}\\ &=& (X^TX)^{-1}+ \frac{(X^TX)^{-1}x_ix_i^T(X^TX)^{-1}}{1-x_i^T(X^TX)^{-1}x_i}\\ &=& (X^TX)^{-1}+ \frac{(X^TX)^{-1}x_ix_i^T(X^TX)^{-1}}{1-h_{ii}} \end{eqnarray*}\] The last equation is true because we define \(h_{ii}\) as the \((i,i)\) element of \(X(X^TX)^{-1}X^T\). Hence, \[\begin{eqnarray*} \hat\beta_{(i)} &=& [X_{(i)}^TX_{(i)}]^{-1}(X^TY-x_iy_i)\\ &=& [(X^TX)^{-1}+ \frac{(X^TX)^{-1}x_ix_i^T(X^TX)^{-1}}{1-h_{ii}}](X^TY-x_iy_i)\\ &=& \hat\beta-\frac{(X^TX)^{-1}x_i}{1-h_{ii}}[y_i(1-h_{ii})-x_i^T\hat\beta+h_{ii}y_i]\\ &=& \hat\beta-\frac{(X^TX)^{-1}x_ie_i}{1-h_{ii}} \end{eqnarray*}\] Note, this result is very useful in diagnostics, as it shows the influence of a single data point on the OLSE. Since \(\sum h_{ii}=trace(H)=p\), the average of the \(h_{ii}\) is \(p/n\). If \(h_{ii}\) is close to 1, the \(i\)th observation has a large influence on the parameter estimates. The difference \(\hat\beta-\hat\beta_{(i)}\) (called \(DFBETA\)) and its standardized version are often used to detect outliers. We will discuss this later.
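Finally, here is a small R sketch of the leave-one-out formula (simulated data; the case index is arbitrary), compared with an explicit refit without the \(i\)th observation.

```r
set.seed(5)
n <- 25
X <- cbind(1, rnorm(n))
Y <- drop(X %*% c(2, 1) + rnorm(n))

XtX_inv <- solve(t(X) %*% X)
beta    <- drop(XtX_inv %*% t(X) %*% Y)
h       <- diag(X %*% XtX_inv %*% t(X))          # leverages h_ii
e       <- Y - drop(X %*% beta)                   # residuals

i <- 7                                            # drop the 7th case (arbitrary choice)
beta_i <- beta - drop(XtX_inv %*% X[i, ]) * e[i] / (1 - h[i])

cbind(formula = beta_i,
      refit   = drop(solve(t(X[-i, ]) %*% X[-i, ], t(X[-i, ]) %*% Y[-i])))
```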