Linear Algebra Review

From ME 510 & AMATH 584


A subspace of a vector space is a subset that is itself a vector space under the same operations.

QR Decomposition

\[ \displaylines{A = \begin{bmatrix} Q_1 & Q_2 \end{bmatrix} \begin{bmatrix} R_1 \\ 0 \end{bmatrix} = Q_1R_1 } \] where \(A \in \mathbb{R}^{n \times k}\) \((n \ge k)\) has full column rank \(r = \mathrm{rank}(A) = k\), \(\begin{bmatrix} Q_1 & Q_2 \end{bmatrix} \in \mathbb{R}^{n \times n}\) is orthogonal, \(Q_1 \in \mathbb{R}^{n \times r}\), \(Q_2 \in \mathbb{R}^{n \times (n-r)}\), and \(R_1 \in \mathbb{R}^{r \times r}\)

  • \(Q_1 R_1\) is called the reduced QR factorization

  • \(Q_1^\mathrm{T}Q_1 = I_r\), and \(R_1\) is upper triangular & invertible
  • usually computed using a variation on the Gram-Schmidt procedure that is less sensitive to numerical (rounding) errors
  • columns of \(Q_1\) form an orthonormal basis for \(\mathcal{R}(A)\), \(\mathcal{R}(Q_1) = \mathcal{R}(A)\)
  • columns of \(Q_2\) form an orthonormal basis for \(\mathcal{N}(A^\mathrm{T})\), \(\mathcal{R}(Q_2) = \mathcal{N}(A^\mathrm{T})\)
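
A minimal NumPy sketch of the full and reduced factorizations (the random test matrix and the use of `np.linalg.qr` are illustrative choices, not part of the notes):

```python
import numpy as np

n, k = 6, 3
rng = np.random.default_rng(0)
A = rng.standard_normal((n, k))          # full column rank with probability 1

# Full QR: Q is n x n orthogonal, R is n x k with zeros below row k.
Q, R = np.linalg.qr(A, mode="complete")
Q1, Q2 = Q[:, :k], Q[:, k:]              # R(Q1) = R(A), R(Q2) = N(A^T)
R1 = R[:k, :]                            # upper triangular, invertible

assert np.allclose(Q1 @ R1, A)           # reduced factorization A = Q1 R1
assert np.allclose(Q1.T @ Q1, np.eye(k)) # orthonormal columns
assert np.allclose(A.T @ Q2, 0)          # columns of Q2 lie in N(A^T)
```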

Least-Squares

Consider \(y = Ax\) where \(A \in \mathbb{R}^{m \times n}\), \(m>n\). To solve this equation approximately, we first define the residual \(r = Ax-y\) and try to find \(x = x_{\mathrm{ls}}\) that minimizes \(\|r\|\) \[ \displaylines{\|r\|^2 = x^\mathrm{T}A^\mathrm{T}Ax - 2y^\mathrm{T}Ax + y^\mathrm{T}y } \] set the gradient to zero \[ \displaylines{\nabla_x \|r\|^2 = 2A^\mathrm{T}Ax - 2A^\mathrm{T}y = 0 } \] assuming \(A\) has full column rank (so \(A^\mathrm{T}A\) is invertible), we get \[ \displaylines{x_\mathrm{ls} = (A^\mathrm{T}A)^{-1}A^\mathrm{T}y } \]

  • \(A^\dagger = (A^\mathrm{T}A)^{-1} A^\mathrm{T}\) is a left inverse of \(A\)

  • \(Ax_\mathrm{ls}\) is the point in \(\mathcal{R}(A)\) that is closest to \(y\), i.e. it is the projection of \(y\) onto \(\mathcal{R}(A)\): \(Ax_\mathrm{ls} = \mathcal{P}_{\mathcal{R}(A)}(y) = A(A^\mathrm{T}A)^{-1}A^\mathrm{T}y\)

  • \(A(A^\mathrm{T}A)^{-1}A^\mathrm{T}\) gives projection onto \(\mathcal{R}(A)\)
  • optimal residual \(r = Ax_\mathrm{ls} - y\) is orthogonal to \(\mathcal{R}(A)\): \(\langle r, Az \rangle = 0\) for all \(z \in \mathbb{R}^n\)
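
As a quick check of the formulas above, a NumPy sketch (random data for illustration; in practice one would call a library solver rather than invert \(A^\mathrm{T}A\) explicitly):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 8, 3
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)

x_ls = np.linalg.solve(A.T @ A, A.T @ y)   # x_ls = (A^T A)^{-1} A^T y
P = A @ np.linalg.inv(A.T @ A) @ A.T       # projection onto R(A)
r = A @ x_ls - y                           # optimal residual

assert np.allclose(A @ x_ls, P @ y)        # A x_ls is the projection of y onto R(A)
assert np.allclose(A.T @ r, 0)             # residual orthogonal to R(A)
assert np.allclose(x_ls, np.linalg.lstsq(A, y, rcond=None)[0])
```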

Least-Squares via \(QR\) Factorization

\[ \displaylines{\|Ax - y\|^2 = \left\|\begin{bmatrix} Q_1 & Q_2 \end{bmatrix} \begin{bmatrix} R_1 \\ 0 \end{bmatrix} x - y \right\|^2 } \] since multiplication by an orthogonal matrix does not change the norm, \[ \displaylines{\begin{aligned} \|Ax - y\|^2 &= \left\|\begin{bmatrix} Q_1 & Q_2 \end{bmatrix}^\mathrm{T} \begin{bmatrix} Q_1 & Q_2 \end{bmatrix} \begin{bmatrix} R_1 \\ 0 \end{bmatrix} x - \begin{bmatrix} Q_1 & Q_2 \end{bmatrix}^\mathrm{T} y \right\|^2 \\ &= \left\|\begin{bmatrix} R_1 x - Q_1^\mathrm{T} y \\ -Q_2^\mathrm{T} y \end{bmatrix} \right\|^2 \\ &=\|R_1 x - Q_1^\mathrm{T} y\|^2 + \|Q_2^\mathrm{T} y\|^2 \end{aligned} } \] this is minimized by choosing \(x_\mathrm{ls} = R_1^{-1} Q_1^\mathrm{T} y\)

  • residual with optimal \(x\) is \(Ax_\mathrm{ls} - y = -Q_2 Q_2^\mathrm{T} y\)
  • \(Q_1 Q_1^\mathrm{T}\) gives projection onto \(\mathcal{R}(A)\)
  • \(Q_2 Q_2^\mathrm{T}\) gives projection onto \(\mathcal{R}(A)^\perp\)
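
A NumPy sketch of the QR route (same caveats as before; `np.linalg.solve` does not exploit the triangular structure of \(R_1\), which is fine for a sketch):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 8, 3
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)

Q, R = np.linalg.qr(A, mode="complete")
Q1, Q2, R1 = Q[:, :n], Q[:, n:], R[:n, :]

x_ls = np.linalg.solve(R1, Q1.T @ y)                  # solve R1 x = Q1^T y
assert np.allclose(A @ x_ls - y, -Q2 @ Q2.T @ y)      # optimal residual
assert np.allclose(Q1 @ Q1.T + Q2 @ Q2.T, np.eye(m))  # complementary projections
```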

Gauss-Newton Method for Nonlinear Least-Squares (NLLS) Problem

Find \(x \in \mathbb{R}^n\) that minimizes \[ \displaylines{\|r(x)\|^2 = \sum_{i = 1}^m r_i(x)^2 } \] where \(r:\mathbb{R}^n \to \mathbb{R}^m\)

Linearize \(r\) near the current iterate \(x^{(k)}\) \[ \displaylines{r(x) \approx r(x^{(k)}) + Jr(x^{(k)})(x-x^{(k)}) } \] where \(Jr\) is the Jacobian matrix of \(r\)

At \(k\)th iteration, we approximate NLLS problem by linear LS problem: \[ \displaylines{\|r(x)\|^2 \approx \left\|A^{(k)}x - b^{(k)}\right\|^2 } \] where \(A^{(k)} = Jr(x^{(k)})\), \(b^{(k)} = Jr(x^{(k)}) x^{(k)} - r(x^{(k)})\)

The next iterate is the solution of this linear LS problem \[ \displaylines{x^{(k+1)} = \left( A^{(k)\mathrm{T}} A^{(k)} \right)^{-1} A^{(k)\mathrm{T}} b^{(k)} } \]
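
A minimal Gauss-Newton sketch; the exponential-fit example and its Jacobian are hypothetical, chosen only to exercise the update above:

```python
import numpy as np

def gauss_newton(r, Jr, x0, iters=20):
    """Iterate x <- argmin ||Jr(x) z - (Jr(x) x - r(x))|| over z."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        J = Jr(x)
        b = J @ x - r(x)
        # Solve the linearized least-squares problem min ||J z - b||.
        x, *_ = np.linalg.lstsq(J, b, rcond=None)
    return x

# Hypothetical toy problem: fit y = exp(c1 * t) + c2 to synthetic data.
t = np.linspace(0.0, 1.0, 20)
y = np.exp(0.7 * t) + 0.3
r = lambda c: np.exp(c[0] * t) + c[1] - y
Jr = lambda c: np.column_stack([t * np.exp(c[0] * t), np.ones_like(t)])
print(gauss_newton(r, Jr, x0=[0.0, 0.0]))   # should approach [0.7, 0.3]
```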

Underdetermined Linear Equations

Consider \[ \displaylines{y = Ax } \] where \(A \in \mathbb{R}^{m \times n}\), \((m < n)\). Assuming \(\mathrm{rank}(A) = m\), the solution set has the form \[ \displaylines{\{ x \mid Ax = y \} = \{ x_p + z \mid z\in \mathcal{N}(A) \} } \] where \(x_p\) is any solution that satisfies \(A x_p = y\)

  • solution has \(\dim \mathcal{N}(A) = n - m\) degrees of freedom

  • the least-norm solution \(x_\text{ln} = A^\mathrm{T} (A A^\mathrm{T})^{-1} y\) solves \[ \displaylines{\begin{aligned} \min \quad &\|x\| \\ s.t. \quad &Ax = y \end{aligned} } \]

  • \(A^\dagger = A^\mathrm{T} (A A^\mathrm{T})^{-1}\) is a right inverse of \(A\)

  • \(I - A^\mathrm{T} (A A^\mathrm{T})^{-1} A\) gives projection onto \(\mathcal{N}(A)\)

  • applying the QR factorization \(A^\mathrm{T} = QR\) gives the same result \(x_\text{ln} = QR^{-\mathrm{T}} y\)

  • derivation via Lagrange multipliers: \(L(x, \lambda) = x^\mathrm{T} x + \lambda^\mathrm{T} (A x - y)\) \[ \displaylines{\nabla_x L = 2 x + A^\mathrm{T} \lambda = 0 \, , \quad \nabla_\lambda L = A x - y = 0 } \] solving gives \(\lambda = -2(AA^\mathrm{T})^{-1} y\) and recovers \(x_\text{ln} = A^\mathrm{T} (A A^\mathrm{T})^{-1} y\)
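
A NumPy sketch of the least-norm solution, computed both from the formula and via QR of \(A^\mathrm{T}\) (random full-rank data assumed):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 3, 6
A = rng.standard_normal((m, n))      # fat matrix, rank m with probability 1
y = rng.standard_normal(m)

x_ln = A.T @ np.linalg.solve(A @ A.T, y)   # x_ln = A^T (A A^T)^{-1} y

Q, R = np.linalg.qr(A.T)                   # reduced QR of A^T: Q is n x m, R is m x m
x_qr = Q @ np.linalg.solve(R.T, y)         # x_ln = Q R^{-T} y

assert np.allclose(A @ x_ln, y)            # it is a solution
assert np.allclose(x_ln, x_qr)             # both routes agree
```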

General Norm Minimization With Equality Constraints

\[ \displaylines{\begin{aligned} \min \quad & \| A x - b \| \\ s.t. \quad & C x = d \end{aligned} } \] Lagrangian is \[ \displaylines{\begin{aligned} L(x, \lambda) &= \dfrac{1}{2} \|Ax - b\|^2 + \lambda^\mathrm{T} (Cx -d) \\ &= \dfrac{1}{2} x^\mathrm{T} A^\mathrm{T} Ax - b^\mathrm{T} Ax +\dfrac{1}{2} b^\mathrm{T} b + \lambda^\mathrm{T} Cx - \lambda^\mathrm{T} d \end{aligned} } \] optimality conditions are \[ \displaylines{\nabla_x L = A^\mathrm{T} Ax - A^\mathrm{T} b + C^\mathrm{T} \lambda = 0\, , \quad \nabla_\lambda L = Cx - d = 0 } \] write in block matrix form as \[ \displaylines{\begin{bmatrix} A^\mathrm{T} A & C^\mathrm{T} \\ C & 0 \end{bmatrix} \begin{bmatrix} x \\ \lambda \end{bmatrix} = \begin{bmatrix} A^\mathrm{T} b \\ d \end{bmatrix} } \] if the block matrix is invertible, we can solve for \(x\) and \(\lambda\).

If \(A^\mathrm{T} A\) is invertible, we can derive a more explicit formula: solving the first optimality condition for \(x\) and substituting into \(Cx = d\) gives \[ \displaylines{x = (A^\mathrm{T} A)^{-1} \left(A^\mathrm{T} b - C^\mathrm{T} \bigl(C (A^\mathrm{T} A)^{-1} C^\mathrm{T}\bigr)^{-1} \bigl(C (A^\mathrm{T} A)^{-1} A^\mathrm{T} b - d\bigr)\right) } \]
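
A NumPy sketch that forms and solves the KKT block system above (random data; assumes the block matrix is invertible):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, p = 10, 5, 2
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
C = rng.standard_normal((p, n))
d = rng.standard_normal(p)

# Assemble and solve the KKT system [[A^T A, C^T], [C, 0]] [x; lambda] = [A^T b; d].
KKT = np.block([[A.T @ A, C.T],
                [C, np.zeros((p, p))]])
rhs = np.concatenate([A.T @ b, d])
sol = np.linalg.solve(KKT, rhs)
x, lam = sol[:n], sol[n:]

assert np.allclose(C @ x, d)                          # constraint satisfied
assert np.allclose(A.T @ (A @ x - b) + C.T @ lam, 0)  # stationarity condition
```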

Householder Triangularization

The other principal method for computing \(QR\) factorizations (besides Gram-Schmidt orthogonalization).

Gram-Schmidt (triangular orthogonalization): \[ \displaylines{A R_1 R_2 \cdots R_n = \hat{Q} } \] Householder (orthogonal triangularization): \[ \displaylines{Q_n \cdots Q_2 Q_1 A = R } \] In general, \(Q_k\) operates on rows \(k,\dots,m\). At the beginning of step \(k\), there is a block of zeros in the first \(k-1\) columns of these rows. The application of \(Q_k\) forms linear combinations of these rows. After \(n\) steps, all the entries below the diagonal have been eliminated and \(Q_n \cdots Q_2 Q_1 A = R\) is upper-triangular.

Each \(Q_k\) is chosen to be a unitary matrix of the form \[ \displaylines{Q_k = \begin{bmatrix} I & 0 \\ 0 & F \end{bmatrix} } \] where \(I\) is the \((k-1) \times (k-1)\) identity and \(F\) is an \((m-k+1) \times (m-k+1)\) unitary matrix. Multiplication by \(F\) must introduce zeros below the diagonal in the \(k\)th column. The Householder algorithm chooses \(F\) to be a particular matrix called a Householder reflector. \[ \displaylines{Fx = \begin{bmatrix} \|x\| \\ 0 \\ \vdots \\ 0 \end{bmatrix} =\|x\| e_1 } \] \(F\) reflects the space across the hyperplane \(H\) orthogonal to \(v = \|x\| e_1 -x\). Note that \[ \displaylines{I - \dfrac{vv^*}{v^*v} } \] projects onto the space \(H\). To reflect \(y\) across \(H\), we apply \[ \displaylines{F = I - 2\dfrac{vv^*}{v^*v} } \] For numerical stability, in practice we use \(v = \mathrm{sign}(x_1) \|x\| e_1 + x\), i.e. we always reflect \(x\) to whichever of \(\pm \|x\| e_1\) is farther from \(x\) itself (taking \(\mathrm{sign}(x_1) = 1\) if \(x_1 = 0\)).
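
A minimal Python sketch of Householder triangularization following the steps above (real arithmetic, no pivoting, full column rank assumed):

```python
import numpy as np

def householder_qr(A):
    """Return Q, R with Q orthogonal, R upper triangular, Q @ R = A."""
    A = np.array(A, dtype=float)
    m, n = A.shape
    Q = np.eye(m)
    for k in range(n):
        x = A[k:, k].copy()
        sign = 1.0 if x[0] >= 0 else -1.0      # take sign(0) = +1 for stability
        v = x.copy()
        v[0] += sign * np.linalg.norm(x)       # v = sign(x1) ||x|| e1 + x
        v /= np.linalg.norm(v)
        # Apply F = I - 2 v v^T to the trailing block and accumulate Q.
        A[k:, k:] -= 2.0 * np.outer(v, v @ A[k:, k:])
        Q[:, k:] -= 2.0 * np.outer(Q[:, k:] @ v, v)
    return Q, A                                # A has been overwritten with R

B = np.random.default_rng(5).standard_normal((5, 3))
Q, R = householder_qr(B)
assert np.allclose(Q @ R, B)
assert np.allclose(Q.T @ Q, np.eye(5))
assert np.allclose(np.tril(R, -1), 0)          # zeros below the diagonal
```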

Conditioning and Condition Numbers

A well-conditioned problem is one with the property that all small perturbations of \(x\) lead to only small changes in \(f(x)\).

  • Absolute Condition Number: \[ \displaylines{\hat{\kappa} = \lim_{\delta \to 0} \sup_{\|\delta x\| \le \delta} \dfrac{\|\delta f\|}{\|\delta x\|} } \]

If \(f\) is differentiable, then \(\hat{\kappa} = \|J(x)\|\), where \(J(x)\) is the Jacobian matrix.

  • Relative Condition Number: \[ \displaylines{\kappa = \lim_{\delta \to 0} \sup_{\|\delta x\| \le \delta} \left( \dfrac{\|\delta f\|}{\|f(x)\|} \bigg/ \dfrac{\|\delta x\|}{\|x\|} \right) } \] If \(f\) is differentiable, then \[ \displaylines{\kappa = \dfrac{\|J(x)\|}{\|f(x)\| / \|x\|} } \]

Relative condition numbers are more important in numerical analysis because the floating point arithmetic used by computers introduces relative errors rather than absolute ones.

  • Condition Number of a matrix \[ \displaylines{\begin{aligned} \kappa &= \sup_{\delta x} \left( \dfrac{\|A(x + \delta x) - Ax\|}{\|Ax\|} \bigg/ \dfrac{\|\delta x\|}{\|x\|} \right) \\ &=\sup_{\delta x} \dfrac{\|A \delta x\|}{\|\delta x\|} \bigg/ \dfrac{\|Ax\|}{\|x\|} \\ &= \|A\| \dfrac{\|x\|}{\|Ax\|} \\ &\le \|A\| \|A^{-1}\| \end{aligned} } \]

If \(\|\cdot\| = \|\cdot\|_2\), then \(\|A\| = \sigma_1\) and \(\|A^{-1}\| = 1/\sigma_m\). Thus \[ \displaylines{\kappa(A) = \|A\|\|A^{-1}\| = \dfrac{\sigma_1}{\sigma_m} } \] The ratio can be interpreted as the eccentricity of the hyperellipse \(\{Ax : \|x\|_2 = 1\}\), the image of the unit sphere under \(A\).

When \(A\) is a rectangular matrix of full rank, we can use the pseudoinverse: \(\kappa(A) = \|A\| \|A^\dagger\|\)
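
A quick NumPy check that the 2-norm condition number equals the singular-value ratio (random square matrix for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((5, 5))

s = np.linalg.svd(A, compute_uv=False)   # singular values in descending order
kappa = s[0] / s[-1]                     # sigma_1 / sigma_m
assert np.allclose(kappa, np.linalg.cond(A, 2))
assert np.allclose(kappa, np.linalg.norm(A, 2) * np.linalg.norm(np.linalg.inv(A), 2))
```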

Cholesky Factorization

A Hermitian matrix is one that satisfies \(A = A^*\). For a Hermitian positive definite \(A\), \[ \displaylines{\begin{aligned} A &= \begin{bmatrix} a_{11} & w^* \\ w & K \end{bmatrix} \\ &= \begin{bmatrix} \alpha & 0 \\ w/\alpha & I \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & K-ww^*/a_{11} \end{bmatrix} \begin{bmatrix} \alpha & w^*/\alpha \\ 0 & I \end{bmatrix} \\ &= R_1^* A_1 R_1 \end{aligned} } \] where \(\alpha = \sqrt{a_{11}}\). Repeating this process on the lower-right block \(K - ww^*/a_{11}\) eventually gives \[ \displaylines{A = \underbrace{R_1^* R_2^* \cdots R_m^*}_{R^*}\ \underbrace{R_m\cdots R_2 R_1}_{R} } \] where \(R\) is upper-triangular.

Every Hermitian positive definite matrix \(A \in \mathbb{C}^{m \times m}\) has a unique Cholesky factorization.
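
A minimal Python sketch of the recursion above for a real symmetric positive definite matrix (the 3x3 test matrix is illustrative):

```python
import numpy as np

def cholesky_upper(A):
    """Return upper-triangular R with A = R^T R, following the outer-product recursion."""
    A = np.array(A, dtype=float)
    m = A.shape[0]
    R = np.zeros_like(A)
    for k in range(m):
        alpha = np.sqrt(A[k, k])              # alpha = sqrt(a_11) of the current block
        R[k, k] = alpha
        R[k, k+1:] = A[k, k+1:] / alpha       # w^* / alpha
        # Update the trailing block: K - w w^* / a_11
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k+1:, k]) / A[k, k]
    return R

M = np.array([[4.0, 2.0, 0.0],
              [2.0, 5.0, 1.0],
              [0.0, 1.0, 3.0]])
R = cholesky_upper(M)
assert np.allclose(R.T @ R, M)
assert np.allclose(R, np.linalg.cholesky(M).T)  # NumPy returns the lower factor L = R^T
```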