Orthogonality and Least-Squares

Orthogonal vectors and matrices

Least squares solutions and the QR decomposition

It is extremely common in applications to want the closest thing to a solution x of an overdetermined system

Ax = b

where A is m \times n with m > n, and b is a given m-dimensional vector. Usually an actual solution does not exist, and so we search for an x which minimizes the 2-norm of the residual r = Ax - b. The use of the 2-norm is what gives this the name "least-squares" solution.

Geometrically, the least-squares solution x is the vector for which Ax is the projection of b onto the range of A. Because of this, it is useful to understand some of the linear algebra of projection matrices when studying the process of finding least-squares solutions.

The simplest way of tackling the least-squares problem is to simply multiply both sides of the equation by the transpose of A:

A^T Ax = A^T b

These are called the "normal equations". The right-hand side is now n-dimensional and there will be a unique solution for x as long as the rank of A is n:

x = (A^T A)^{-1} (A^T b)

As usual in numerical linear algebra we do not want to actually invert A^T A, but rather solve for x using some variant of the PLU decomposition. For symmetric positive-definite matrices such as A^T A (it is positive-definite when A has rank n) there is an even more efficient factorization algorithm, the Cholesky factorization.
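As a concrete illustration, here is a minimal sketch (assuming NumPy and SciPy are available) that solves a small made-up least-squares problem through the normal equations, using a Cholesky factorization of A^T A, and compares the result with a standard least-squares routine.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

# A made-up overdetermined system: four equations, two unknowns.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
b = np.array([1.0, 2.0, 2.0, 4.0])

# Form the normal equations A^T A x = A^T b and solve them with a Cholesky
# factorization of the symmetric positive-definite matrix A^T A.
c, low = cho_factor(A.T @ A)
x = cho_solve((c, low), A.T @ b)

print(x)
print(np.linalg.lstsq(A, b, rcond=None)[0])   # the built-in solver agrees
```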

Sometimes this formulation of least-squares is sufficient to find the answers we need, but it suffers from a fixable problem that can be important: the condition number of A^T A is the square of the condition number of A, so the normal equations can lose accuracy when A is poorly conditioned.

Simple example of least squares

Suppose we want the least-squares solution to the system

\left ( \begin{array}{c} 3 \\ 1 \end{array} \right ) x = \left ( \begin{array}{c} 2 \\ 2 \end{array} \right )

The matrix A^T A is here just the number 10, and A^T b = 8, so x = (A^T A)^{-1} (A^T b) = 10^{-1} 8 = 4/5. This is shown in the figure below.

simple example
The least-squares solution x = 4/5 to the system 3x = 2, x = 2.
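We can reproduce this tiny example numerically; the following sketch assumes NumPy.

```python
import numpy as np

A = np.array([[3.0], [1.0]])    # the single column (3, 1)^T
b = np.array([2.0, 2.0])

x = (A.T @ b).item() / (A.T @ A).item()       # (A^T A)^{-1} A^T b = 8/10
print(x)                                      # 0.8
print(np.linalg.lstsq(A, b, rcond=None)[0])   # [0.8]
```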

Projections and orthogonality

A matrix P is a projection if P^2 = P. If in addition P^T = P then P is an orthogonal projection.

The definition of a projection implies that the matrix must be square.

The word orthogonal is used in several different ways in linear algebra, although all of them relate to the idea of things being perpendicular, i.e. meeting at an angle of \pi/2. For vectors in any dimension, the angle \theta between two vectors u and v is usually defined by the formula

\cos(\theta) = \frac{u^T v}{|u| |v|}

Notice that with the principal branch of the inverse cosine the angle \theta will be in the interval [0,\pi].

The orthogonal projection onto the span of a vector u can be written as

P = \frac{u u^T}{u^T u}

If the range of an orthogonal projection is spanned by a basis a_1, a_2, \ldots, a_k, and we put the basis vectors a_i into a matrix A as its columns, then we can write the projection as

P = A (A^T A)^{-1} A^T
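Both defining properties of an orthogonal projection are easy to check numerically. The sketch below (assuming NumPy; the vector and matrix are made up) builds the two projections just described and verifies P^2 = P and P^T = P.

```python
import numpy as np

# Orthogonal projection onto the span of a single (made-up) vector u.
u = np.array([[2.0], [1.0], [2.0]])
P_u = (u @ u.T) / (u.T @ u)
print(np.allclose(P_u @ P_u, P_u))   # True: P^2 = P
print(np.allclose(P_u.T, P_u))       # True: P^T = P

# Orthogonal projection onto the column space of a (made-up) matrix A.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])
P_A = A @ np.linalg.inv(A.T @ A) @ A.T
print(np.allclose(P_A @ P_A, P_A))   # True
print(np.allclose(P_A.T, P_A))       # True
```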

Non-orthogonal projection example

Let's compute the projection matrix P whose range is the span of (2,1), and whose nullspace is the span of (1,-1). Another way to say this is that P projects onto (2,1) along (1,-1).

P acts on (2,1) as the identity, and P sends (1,-1) to the zero vector. Writing these conditions as one matrix equation gives us:

P \left ( \begin{array}{cc} 2 & 1 \\ 1 & -1 \end{array} \right ) = \left ( \begin{array}{cc} 2 & 0 \\ 1 & 0 \end{array} \right )

So one way to compute P is:

P = \left ( \begin{array}{cc} 2 & 0 \\ 1 & 0 \end{array} \right ) \left ( \begin{array}{cc} 2 & 1 \\ 1 & -1 \end{array} \right )^{-1} = \left ( \begin{array}{cc} 2 & 0 \\ 1 & 0 \end{array} \right ) \left(\begin{array}{cc} \frac{1}{3} & \frac{1}{3} \\ \frac{1}{3} & -\frac{2}{3} \end{array}\right) = \left(\begin{array}{cc} \frac{2}{3} & \frac{2}{3} \\ \frac{1}{3} & \frac{1}{3} \end{array}\right)
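We can confirm this numerically with a short sketch (assuming NumPy): P acts as the identity on (2,1), sends (1,-1) to zero, and satisfies P^2 = P but not P^T = P.

```python
import numpy as np

P = np.array([[2/3, 2/3],
              [1/3, 1/3]])
print(P @ np.array([2.0, 1.0]))     # (2, 1): the identity on the range
print(P @ np.array([1.0, -1.0]))    # (0, 0): the nullspace direction
print(np.allclose(P @ P, P))        # True: P is a projection
print(np.allclose(P.T, P))          # False: but not an orthogonal one
```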

QR decompositions

The QR decomposition is an extremely useful matrix decomposition used for a variety of purposes in numerical linear algebra. For the moment we will focus on its use in least-squares problems.

If A is an m by n matrix, then a QR decomposition A = QR consists of an orthogonal (or unitary in the complex case) matrix Q and an upper-triangular matrix R. A matrix Q is defined to be orthogonal if Q^T Q = I; a complex matrix Q is defined to be unitary if Q^* Q = I, where Q^* is the conjugate transpose of Q.

With a QR decomposition in hand we can reframe the least-squares problem:

Ax = QR x = b

by multiplying both sides by Q^T, obtaining

R x = Q^T b

and this is easy to solve with back substitution since R is upper-triangular. This avoids forming the possibly high condition number matrix A^T A and thus is more reliably accurate than using the normal equations.
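Here is a sketch of this approach (assuming NumPy and SciPy; the data are made up). Note that np.linalg.qr returns the reduced QR factorization, which is what the least-squares computation needs.

```python
import numpy as np
from scipy.linalg import solve_triangular

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
b = np.array([1.0, 2.0, 2.0, 4.0])

Q, R = np.linalg.qr(A)                # reduced QR: Q is 4 x 2, R is 2 x 2
x = solve_triangular(R, Q.T @ b)      # back substitution on R x = Q^T b
print(x)
print(np.linalg.lstsq(A, b, rcond=None)[0])   # the two solutions agree
```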

Gram-Schmidt Orthogonalization

The Gram-Schmidt algorithm progressively orthogonalizes a sequence of vectors A_1, A_2, \ldots, A_n (which for the QR decomposition we view as the columns of a matrix A): from each A_i it subtracts the projections onto the previously orthogonalized vectors Q_1, \ldots, Q_{i-1}, and then normalizes what remains. The first few steps of this process are:

r_{1,1} = |A_1|, \ \ \ Q_1 = A_1/r_{1,1}
r_{1,2} = Q_1^T A_2, \ \ \ r_{2,2} = |A_2 - r_{1,2}Q_1|,
Q_2 = (A_2 - r_{1,2}Q_1)/r_{2,2}
r_{1,3} = Q_1^T A_3, \ \ \ r_{2,3} = Q_2^T A_3, \ \ \ r_{3,3} = |A_3 - r_{1,3}Q_1 - r_{2,3} Q_2|,
Q_3 = (A_3 - r_{1,3}Q_1 - r_{2,3} Q_2)/r_{3,3}
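A minimal NumPy sketch of this procedure (the function name is our own, and the test matrix is made up):

```python
import numpy as np

def gram_schmidt_qr(A):
    """Classical Gram-Schmidt QR of a matrix with linearly independent columns."""
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        v = A[:, j].copy()
        for i in range(j):
            R[i, j] = Q[:, i] @ A[:, j]   # component of A_j along Q_i
            v -= R[i, j] * Q[:, i]        # subtract that component
        R[j, j] = np.linalg.norm(v)       # length of what remains
        Q[:, j] = v / R[j, j]             # normalize to get Q_j
    return Q, R

A = np.array([[2.0, 1.0],
              [2.0, 0.0],
              [1.0, 2.0]])
Q, R = gram_schmidt_qr(A)
print(np.allclose(Q @ R, A))              # True
print(np.allclose(Q.T @ Q, np.eye(2)))    # True: orthonormal columns
```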

Gram-Schmidt example

As an example we will compute the QR decomposition of

A = \left(\begin{array}{ccc} 1 & 1 & 1 \\ 1 & 1 & 2 \\ 0 & 2 & 1 \end{array}\right) = \left(\begin{array}{c|c|c} & & \\ A_1 & A_2& A_3 \\ & & \end{array}\right)

using the Gram-Schmidt algorithm. We first compute the length of the first column (A_1) to find r_{1,1}:

r_{1,1} = |A_1| = \sqrt{1^2 + 1^2 + 0^2} = \sqrt{2}

and then

Q_1 = A_1/r_{1,1} = \left(\begin{array}{c} \frac{\sqrt{2}}{2} \\ \frac{\sqrt{2}}{2} \\ 0 \end{array}\right)

Next, we subtract from A_2 its component in the Q_1 direction; that component is r_{1,2}, and the length of the remaining vector is r_{2,2}:

r_{1,2} = Q_1^T A_2 = \frac{\sqrt{2}}{2} + \frac{\sqrt{2}}{2} + 0 = \sqrt{2}
r_{2,2} = |A_2 - r_{1,2}Q_1| = \sqrt{0^2 + 0^2 + 2^2} = 2

Using these we can compute Q_2:

Q_2 = \frac{A_2 - r_{1,2}Q_1}{r_{2,2}} = \left ( \begin{array}{c} 0 \\ 0 \\ 1 \end{array}\right )

Finally, we compute the components of A_3 in the Q_1 and Q_2 directions (r_{1,3} and r_{2,3} respectively), and then subtract those off:

r_{1,3} = Q_1^T A_3 = \frac{3 \sqrt{2}}{2}
r_{2,3} = Q_2^T A_3 = 1
r_{3,3} = | A_3 - r_{1,3} Q_1 - r_{2,3} Q_2 | = \left | \left(\begin{array}{c} -\frac{1}{2} \\ \frac{1}{2} \\ 0 \end{array} \right) \right | = \frac{\sqrt{2}}{2}
Q_3 = \frac{A_3 - r_{1,3}Q_1 - r_{2,3} Q_2}{r_{3,3}} = \left(\begin{array}{c} -\frac{\sqrt{2}}{2} \\ \frac{\sqrt{2}}{2} \\ 0 \end{array} \right)

so the decomposition is

A = \left(\begin{array}{ccc} 1 & 1 & 1 \\ 1 & 1 & 2 \\ 0 & 2 & 1 \end{array}\right) = \left(\begin{array}{ccc} \frac{\sqrt{2}}{2} & 0 & -\frac{\sqrt{2}}{2} \\ \frac{\sqrt{2}}{2} & 0 & \frac{\sqrt{2}}{2} \\ 0 & 1 & 0 \end{array}\right) \left(\begin{array}{ccc} \sqrt{2} & \sqrt{2} & \frac{3 \sqrt{2}}{2} \\ 0 & 2 & 1 \\ 0 & 0 & \frac{\sqrt{2}}{2} \end{array}\right) = Q R
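We can check this hand computation numerically (a sketch assuming NumPy). Note that np.linalg.qr may return Q and R with some columns and rows negated, since the factorization is only unique up to such sign choices.

```python
import numpy as np

s2 = np.sqrt(2)
Q = np.array([[s2/2, 0.0, -s2/2],
              [s2/2, 0.0,  s2/2],
              [0.0,  1.0,  0.0 ]])
R = np.array([[s2,  s2,  3*s2/2],
              [0.0, 2.0, 1.0   ],
              [0.0, 0.0, s2/2  ]])
A = np.array([[1.0, 1.0, 1.0],
              [1.0, 1.0, 2.0],
              [0.0, 2.0, 1.0]])

print(np.allclose(Q @ R, A))                 # True: the hand computation reproduces A
print(np.allclose(Q.T @ Q, np.eye(3)))       # True: Q has orthonormal columns

Q2, R2 = np.linalg.qr(A)
print(np.allclose(np.abs(R2), np.abs(R)))    # True: agrees up to signs
```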

Householder reflections

A Householder reflection is constructed from a vector v as: I - 2 \frac{v v^T}{v^T v}

We use reflections to triangularize a matrix by reflecting a column vector a onto the vector -sgn(a_1)|a| e_1, where e_1 is the first unit coordinate vector (i.e. e_1^T = (1,0,\ldots,0)). The vector v in the reflection formula is the difference between a and this target: v = a + sgn(a_1)|a| e_1. The sign function sgn(x) is 1 if x \ge 0 and -1 otherwise. This choice of sign ensures that v is not too small a vector, which would cause a loss of precision through cancellation.
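Here is a small sketch (assuming NumPy; the helper name is our own) that builds this reflector for a given column and checks that it sends the column to a multiple of e_1.

```python
import numpy as np

def householder_reflection(a):
    """Reflection H with H @ a = -sgn(a_1) |a| e_1 (taking sgn(0) = 1)."""
    v = a.astype(float)                       # v = a + sgn(a_1) |a| e_1
    v[0] += np.linalg.norm(a) if a[0] >= 0 else -np.linalg.norm(a)
    return np.eye(len(a)) - 2.0 * np.outer(v, v) / (v @ v)

a = np.array([1.0, 2.0, 2.0])                 # first column of the example below
H = householder_reflection(a)
print(H @ a)                                  # approximately (-3, 0, 0)
print(np.allclose(H @ H, np.eye(3)))          # True: H is its own inverse
```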

Householder reflection
Example Householder reflection with a projection shown in grey.

The figure illustrates an example in which the column vector A_1 = (1,1)^T (in blue) is reflected onto (-\sqrt{2},0)^T (in red). The difference between these is v (in green). The grey vector points along the line of reflection; the orthogonal projection matrix I - \frac{v v^T}{v^T v} would project A_1 onto the grey vector.

Householder
Alston Scott Householder.

Example QR with Householder

Here we compute a quick example of the QR decomposition using a Householder reflection. To triangularize the matrix:

A = \left ( \begin{array}{cc} 1 & 2 \\ 2 & 1 \\ 2 & 1 \end{array} \right )

we will reflect the first column A_1 = (1,2,2)^T onto -|A_1|e_1 = (-3,0,0)^T. The difference between A_1 and this target is v = (4,2,2)^T, so the reflection matrix is

Q_1 = I - 2 \frac{v \ v^T}{v^T v} = \left(\begin{array}{rrr} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{array}\right) - \frac{2}{24} \left(\begin{array}{rrr} 16 & 8 & 8 \\ 8 & 4 & 4 \\ 8 & 4 & 4 \end{array}\right)
= \left(\begin{array}{rrr} -\frac{1}{3} & -\frac{2}{3} & -\frac{2}{3} \\ -\frac{2}{3} & \frac{2}{3} & -\frac{1}{3} \\ -\frac{2}{3} & -\frac{1}{3} & \frac{2}{3} \end{array}\right)

Now we have

Q_1 A = \left(\begin{array}{rr} -3 & -2 \\ 0 & -1 \\ 0 & -1 \end{array}\right)

Now we need a second reflection to send the lower part of the second column of Q_1 A, (-1,-1)^T, to (\sqrt{2},0)^T. For this two-dimensional reflection, v = (-1,-1)^T - (\sqrt{2},0)^T, and we have

Q_2 = \left(\begin{array}{rrr} 1 & 0 & 0 \\ 0 & -1/\sqrt{2} & -1/\sqrt{2} \\ 0 & -1/\sqrt{2} & 1/\sqrt{2} \end{array}\right)

Since Q_1 and Q_2 are symmetric and orthogonal, Q_1 = Q_1^T = Q_1^{-1} and Q_2 = Q_2^T = Q_2^{-1}, so

A = Q_1 Q_2 R = QR = \left(\begin{array}{rrr} -\frac{1}{3} & \frac{2}{3} \, \sqrt{2} & 0 \\ -\frac{2}{3} & -\frac{1}{6} \, \sqrt{2} & -\frac{1}{2} \, \sqrt{2} \\ -\frac{2}{3} & -\frac{1}{6} \, \sqrt{2} & \frac{1}{2} \, \sqrt{2} \end{array}\right) \left(\begin{array}{rr} -3 & -2 \\ 0 & \sqrt{2} \\ 0 & 0 \end{array}\right)
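A quick numerical check of this factorization (a sketch assuming NumPy):

```python
import numpy as np

s2 = np.sqrt(2)
Q = np.array([[-1/3,  2*s2/3,  0.0  ],
              [-2/3, -s2/6,   -s2/2 ],
              [-2/3, -s2/6,    s2/2 ]])
R = np.array([[-3.0, -2.0],
              [ 0.0,  s2 ],
              [ 0.0,  0.0]])
A = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [2.0, 1.0]])

print(np.allclose(Q @ R, A))               # True
print(np.allclose(Q.T @ Q, np.eye(3)))     # True: Q is orthogonal
```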

The Singular Value Decomposition

Every matrix A has a Singular Value Decomposition:

A = U \Sigma V^T

where U and V are orthogonal (unitary in the complex case), and the only nonzero entries of \Sigma are on its diagonal, with \Sigma_{i,i} \ge \Sigma_{i+1,i+1} \ge 0. These diagonal values of \Sigma are called the singular values of A. The singular values have a very nice geometric interpretation: the image of a unit sphere (each point thought of as a vector) under multiplication by a matrix A is an ellipsoid (in the appropriate dimension) whose semi-axis lengths are given by the singular values. This is shown below for the particular example of

A = \left(\begin{array}{rr} -3 & 3 \\ 0 & 2 \end{array} \right)

for which the singular values are

\sigma_1 = \sqrt{11 + \sqrt{85}} \approx 4.5,
\sigma_2 = \sqrt{11 - \sqrt{85}} \approx 1.3

In the image, the shorter bright green and purple vectors are the columns of V, the right-singular vectors. They are sent by A (multiplying from the left) to the darker vectors \sigma_1 u_1 and \sigma_2 u_2. Note that v_1 and v_2 are orthogonal, and so are u_1 and u_2.

SVD example
SVD example showing singular vectors.

The SVD is not cheap to compute, but with it in hand we can compute almost anything else we might want regarding the matrix A.

A simple way to compute the SVD (not the best way for larger matrices) is to form the matrix A^T A = V \Sigma^T \Sigma V^T, which is symmetric. Its eigenvectors are the v_i (the columns of V), and its eigenvalues are the squares of the singular values.
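The following sketch (assuming NumPy) carries this out for the 2 by 2 example above and compares the result with the library SVD routine.

```python
import numpy as np

A = np.array([[-3.0, 3.0],
              [ 0.0, 2.0]])

# Eigen-decomposition of the symmetric matrix A^T A.
evals, V = np.linalg.eigh(A.T @ A)         # eigenvalues in ascending order
order = np.argsort(evals)[::-1]            # reorder to descending, as in the SVD
evals, V = evals[order], V[:, order]

sigma = np.sqrt(evals)                     # singular values = sqrt of eigenvalues
U = (A @ V) / sigma                        # u_i = A v_i / sigma_i (all sigma_i > 0 here)

print(sigma)                               # approximately [4.497, 1.334]
print(np.linalg.svd(A, compute_uv=False))  # the library routine agrees
print(np.allclose((U * sigma) @ V.T, A))   # True: U diag(sigma) V^T reconstructs A
```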

Low-rank approximations

A major application of the SVD is using it to create optimal low-rank approximations to a matrix. If we want a rank r approximation to A, we simply truncate the number of singular vectors used, i.e.

A_r = \sum_{i=1}^{r} u_i \sigma_i v_i^*

where the u_i are the left singular vectors, the \sigma_i are the singular values, and the v_i are the right singular vectors.

These approximations to A are optimal in the sense that the difference between them and A itself is as small as possible in the 2-norm, with the difference being:

| A - A_r |_2 = \sigma_{r+1}
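A short sketch (assuming NumPy, with a made-up matrix) of building the rank-r truncation and checking that the 2-norm error equals \sigma_{r+1}:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))             # a made-up 6 x 4 matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)

r = 2
A_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]    # rank-r truncation of the SVD

# The 2-norm error of the rank-r truncation is the next singular value.
print(np.linalg.norm(A - A_r, 2), s[r])        # the two numbers agree
```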

The Pseudo-inverse

The SVD gives a nice way to compute and express the pseudo-inverse of a matrix A:

A^{\dagger} = \sum \sigma^{-1}_k v_k u_k^T

where we replace \sigma^{-1}_k by 0 if \sigma_k=0.

This gives another way to think about least-squares solutions: x = A^{\dagger} b is a least-squares solution (in fact, the one of minimum norm when the least-squares solution is not unique).
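A sketch (assuming NumPy; the data are made up) showing that applying the pseudo-inverse to b recovers the least-squares solution, and that the pseudo-inverse can be assembled directly from the SVD:

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

# Pseudo-inverse solution versus the standard least-squares solver.
x_pinv = np.linalg.pinv(A) @ b
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.allclose(x_pinv, x_lstsq))              # True

# Assembling A^+ from the (reduced) SVD, inverting only nonzero singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
s_inv = np.divide(1.0, s, out=np.zeros_like(s), where=s > 1e-12)
A_pinv = Vt.T @ np.diag(s_inv) @ U.T
print(np.allclose(A_pinv, np.linalg.pinv(A)))    # True
```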

2-norm Condition number

The 2-norm condition number of a matrix A can be computed from the SVD as \sigma_1/\sigma_n, even when A is not square.
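For example (a sketch assuming NumPy, with a made-up matrix):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])                  # a made-up non-square matrix
s = np.linalg.svd(A, compute_uv=False)
print(s[0] / s[-1])                         # sigma_1 / sigma_n
print(np.linalg.cond(A))                    # NumPy's 2-norm condition number agrees
```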

SVD Example

Let's look at a simple example of the SVD for the matrix:

A = \left(\begin{array}{rr} 0 & -1 \\ 1 & 0 \end{array}\right)

We can compute the singular values and the singular vectors v_i from the matrix

A^T A = \left(\begin{array}{rr} 1 & 0 \\ 0 & 1 \end{array}\right)

For this matrix, any nonzero vector is an eigenvector with eigenvalue 1. So the singular values are both 1, and we can pick any two orthonormal vectors to be v_1 and v_2. If we pick them to be the coordinate vectors v_1 = e_1 = (1,0)^T and v_2 = e_2 = (0,1)^T, then the SVD is

A = U \Sigma V^T = \left(\begin{array}{rr} 0 & -1 \\ 1 & 0 \end{array}\right) \left(\begin{array}{rr} 1 & 0 \\ 0 & 1 \end{array}\right) \left(\begin{array}{rr} 1 & 0 \\ 0 & 1 \end{array}\right)

The SVD is not very useful in this case; we include the example only to show what happens in a very simple setting. The action of the matrix A is a 90 degree rotation, so it does not stretch vectors at all (i.e. the singular values are 1). Note that the diagonalization of A gives the same information in a different way, since the eigenvalues are complex (i and -i) and so are the eigenvectors:

A = \left(\begin{array}{rr} \sqrt{2} & \sqrt{2} \\ -i \, \sqrt{2} & i \, \sqrt{2} \end{array}\right) \left(\begin{array}{rr} i & 0 \\ 0 & -i \end{array}\right) \left(\begin{array}{rr} \sqrt{2} & \sqrt{2} \\ -i \, \sqrt{2} & i \, \sqrt{2} \end{array}\right)^{-1}
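Both factorizations are easy to confirm numerically (a sketch assuming NumPy):

```python
import numpy as np

A = np.array([[0.0, -1.0],
              [1.0,  0.0]])

# The SVD written above: U = A, Sigma = I, V = I.
print(np.linalg.svd(A, compute_uv=False))          # [1., 1.]
print(np.allclose(A @ np.eye(2) @ np.eye(2), A))   # True

# The complex diagonalization A = M D M^{-1}.
M = np.array([[np.sqrt(2), np.sqrt(2)],
              [-1j * np.sqrt(2), 1j * np.sqrt(2)]])
D = np.diag([1j, -1j])
print(np.allclose(M @ D @ np.linalg.inv(M), A))    # True
```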

Another SVD example

Let's look at a more typical small example of the SVD, for the matrix:

A = \left(\begin{array}{rr} 1 & -1 \\ -2 & 2 \\ 3 & -3 \end{array}\right)

Again we use the technique of forming the matrix A^T A and computing its eigenvalues and eigenvectors.

A^T A = \left(\begin{array}{rrr} 1 & -2 & 3 \\ -1 & 2 & -3 \end{array}\right) \left(\begin{array}{rr} 1 & -1 \\ -2 & 2 \\ 3 & -3 \end{array}\right) = \left(\begin{array}{rr} 14 & -14 \\ -14 & 14 \end{array}\right)

The eigenvalues of A^T A are 28 and 0, so the singular values are \sqrt{28} and 0. We can choose unit length eigenvectors

v_1 = \left(\begin{array}{r} 1/\sqrt{2} \\ -1/\sqrt{2} \end{array}\right), \ \ \ v_2 = \left(\begin{array}{r} 1/\sqrt{2} \\ 1/\sqrt{2} \end{array}\right)

Because of the zero eigenvalue, we cannot define u_2 = A v_2 /\sigma_2. But we do have

u_1 = A v_1/\sigma_1 = \frac{1}{\sqrt{14}}\left(\begin{array}{r} 1 \\ -2 \\ 3 \end{array}\right)

which means the reduced SVD is

A = U \Sigma V^T = \left(\begin{array}{r} \frac{1}{\sqrt{14}} \\ \frac{-2}{\sqrt{14}} \\ \frac{3}{\sqrt{14}} \end{array}\right) \left(\begin{array}{r} \sqrt{28} \end{array}\right) \left(\begin{array}{rr} 1/\sqrt{2} & -1/\sqrt{2} \end{array}\right)

To form a full SVD we would need to extend u_1 to an orthonormal set \{u_1, u_2, u_3\}; there is no unique way to do this, but one choice is

u_2 = \left(\begin{array}{r} 2/\sqrt{5} \\ 1/\sqrt{5} \\ 0 \end{array}\right)

and then we must choose u_3 (up to a sign) to be

u_3 = \left(\begin{array}{r} 3/\sqrt{70} \\ -6/\sqrt{70} \\ -5/\sqrt{70} \end{array}\right)

So one full SVD of A is

A = \left(\begin{array}{rrr} \frac{1}{\sqrt{14}} & 2/\sqrt{5} & 3/\sqrt{70} \\ \frac{-2}{\sqrt{14}} & 1/\sqrt{5} & -6/\sqrt{70} \\ \frac{3}{\sqrt{14}} & 0 & -5/\sqrt{70} \end{array}\right ) \left(\begin{array}{rr} \sqrt{28} & 0 \\ 0 & 0 \\ 0 & 0 \end{array}\right ) \left(\begin{array}{rr} 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2} \\ \end{array} \right)
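Checking this factorization numerically (a sketch assuming NumPy):

```python
import numpy as np

A = np.array([[ 1.0, -1.0],
              [-2.0,  2.0],
              [ 3.0, -3.0]])

U = np.array([[ 1/np.sqrt(14), 2/np.sqrt(5),  3/np.sqrt(70)],
              [-2/np.sqrt(14), 1/np.sqrt(5), -6/np.sqrt(70)],
              [ 3/np.sqrt(14), 0.0,          -5/np.sqrt(70)]])
S = np.array([[np.sqrt(28), 0.0],
              [0.0,         0.0],
              [0.0,         0.0]])
Vt = np.array([[1/np.sqrt(2), -1/np.sqrt(2)],
               [1/np.sqrt(2),  1/np.sqrt(2)]])

print(np.allclose(U @ S @ Vt, A))             # True
print(np.allclose(U.T @ U, np.eye(3)))        # True: U is orthogonal
print(np.linalg.svd(A, compute_uv=False))     # approximately [5.2915, 0] = [sqrt(28), 0]
```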

Least-Squares Exercises