Orthogonality and Least-Squares

Orthogonal vectors and matrices

Least squares solutions and the QR decomposition

It is extremely common in applications to want the closest thing to a solution x of an overdetermined system

Ax = b

where A is m \times n with m > n, and b is a given m-dimensional vector. Usually an actual solution does not exist, and so we search for an x which minimizes the 2-norm of the residual r = Ax - b. The use of the 2-norm is what gives this the name "least-squares" solution.

Geometrically, the least-squares solution x is the vector for which Ax is the projection of b onto the range of A. Because of this, it is useful to understand some of the linear algebra of projection matrices when studying the process of finding least-squares solutions.

The simplest way of tackling the least-squares problem is to simply multiply both sides of the equation by the transpose of A:

A^T Ax = A^T b

These are called the "normal equations". The right-hand side is now n-dimensional and there will be a unique solution for x as long as the rank of A is n:

x = (A^T A)^{-1} (A^T b)

As usual in numerical linear algebra we do not want to actually invert A^T A, but rather solve for x using some variant of the PLU decomposition. For symmetric positive-definite matrices such as A^T A (it is positive-definite when A has rank n) there is an even more efficient factorization algorithm, the Cholesky factorization.
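As a concrete illustration, here is a minimal sketch (assuming NumPy and SciPy are available) that solves a small made-up least-squares problem through the normal equations, using a Cholesky factorization of A^T A, and compares the result with a standard least-squares routine.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

# A made-up overdetermined system: four equations, two unknowns.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
b = np.array([1.0, 2.0, 2.0, 4.0])

# Form the normal equations A^T A x = A^T b and solve them with a Cholesky
# factorization of the symmetric positive-definite matrix A^T A.
c, low = cho_factor(A.T @ A)
x = cho_solve((c, low), A.T @ b)

print(x)
print(np.linalg.lstsq(A, b, rcond=None)[0])   # the built-in solver agrees
```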

Sometimes this formulation of least-squares is sufficient to find the answers we need, but it suffers from a fixable problem that can be important: the condition number of A^T A is the square of the condition number of A, so the normal equations can lose accuracy when A is poorly conditioned.

Simple example of least squares

Suppose we want the least-squares solution to the system

\left ( \begin{array}{c} 3 \\ 1 \end{array} \right ) x = \left ( \begin{array}{c} 2 \\ 2 \end{array} \right )

The matrix A^T A is here just the number 10, and A^T b = 8, so x = (A^T A)^{-1} (A^T b) = 10^{-1} 8 = 4/5. This is shown in the figure below.

simple example
The least-squares solution x = 4/5 to the system 3x = 2, x = 2.
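We can reproduce this tiny example numerically; the following sketch assumes NumPy.

```python
import numpy as np

A = np.array([[3.0], [1.0]])    # the single column (3, 1)^T
b = np.array([2.0, 2.0])

x = (A.T @ b).item() / (A.T @ A).item()       # (A^T A)^{-1} A^T b = 8/10
print(x)                                      # 0.8
print(np.linalg.lstsq(A, b, rcond=None)[0])   # [0.8]
```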

Projections and orthogonality

A matrix P is a projection if P^2 = P. If in addition P^T = P then P is an orthogonal projection.

The definition of a projection implies that the matrix must be square.

The word orthogonal is used in several different ways in linear algebra, although all of them relate to the idea of things being perpendicular, i.e. meeting at an angle of \pi/2. For vectors in any dimension, the angle \theta between two vectors u and v is usually defined by the formula

\cos(\theta) = \frac{u^T v}{|u| |v|}

Notice that with the principal branch of the inverse cosine the angle \theta will be in the interval [0,\pi].

The orthogonal projection onto the span of a vector u can be written as

P = \frac{u u^T}{u^T u}

If the range of an orthogonal projection is spanned by a basis a_1, a_2, \ldots, a_k, and we put the basis vectors a_i into a matrix A as its columns, then we can write the projection as

P = A (A^T A)^{-1} A^T
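Both defining properties of an orthogonal projection are easy to check numerically. The sketch below (assuming NumPy; the vector and matrix are made up) builds the two projections just described and verifies P^2 = P and P^T = P.

```python
import numpy as np

# Orthogonal projection onto the span of a single (made-up) vector u.
u = np.array([[2.0], [1.0], [2.0]])
P_u = (u @ u.T) / (u.T @ u)
print(np.allclose(P_u @ P_u, P_u))   # True: P^2 = P
print(np.allclose(P_u.T, P_u))       # True: P^T = P

# Orthogonal projection onto the column space of a (made-up) matrix A.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])
P_A = A @ np.linalg.inv(A.T @ A) @ A.T
print(np.allclose(P_A @ P_A, P_A))   # True
print(np.allclose(P_A.T, P_A))       # True
```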

Non-orthogonal projection example

Let's compute the projection matrix P whose range is the span of (2,1), and whose nullspace is the span of (1,-1). Another way to say this is that P projects onto (2,1) along (1,-1).

P acts on (2,1) as the identity, and P sends (1,-1) to the zero vector. Writing these conditions as one matrix equation gives us:

P \left ( \begin{array}{cc} 2 & 1 \\ 1 & -1 \end{array} \right ) = \left ( \begin{array}{cc} 2 & 0 \\ 1 & 0 \end{array} \right )

So one way to compute P is:

P = \left ( \begin{array}{cc} 2 & 0 \\ 1 & 0 \end{array} \right ) \left ( \begin{array}{cc} 2 & 1 \\ 1 & -1 \end{array} \right )^{-1} = \left ( \begin{array}{cc} 2 & 0 \\ 1 & 0 \end{array} \right ) \left(\begin{array}{cc} \frac{1}{3} & \frac{1}{3} \\ \frac{1}{3} & -\frac{2}{3} \end{array}\right) = \left(\begin{array}{cc} \frac{2}{3} & \frac{2}{3} \\ \frac{1}{3} & \frac{1}{3} \end{array}\right)
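We can confirm this numerically with a short sketch (assuming NumPy): P acts as the identity on (2,1), sends (1,-1) to zero, and satisfies P^2 = P but not P^T = P.

```python
import numpy as np

P = np.array([[2/3, 2/3],
              [1/3, 1/3]])
print(P @ np.array([2.0, 1.0]))     # (2, 1): the identity on the range
print(P @ np.array([1.0, -1.0]))    # (0, 0): the nullspace direction
print(np.allclose(P @ P, P))        # True: P is a projection
print(np.allclose(P.T, P))          # False: but not an orthogonal one
```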

QR decompositions

The QR decomposition is an extremely useful matrix decomposition used for a variety of purposes in numerical linear algebra. For the moment we will focus on its use in least-squares problems.

If A is an m by n matrix, then a QR decomposition A = QR consists of an orthogonal (or unitary in the complex case) matrix Q and an upper-triangular matrix R. A matrix Q is defined to be orthogonal if Q^T Q = I; a complex matrix Q is defined to be unitary if Q^* Q = I, where Q^* is the conjugate transpose of Q.

With a QR decomposition in hand we can reframe the least-squares problem:

Ax = QR x = b

by multiplying both sides by Q^T, obtaining

R x = Q^T b

and this is easy to solve with back substitution since R is upper-triangular. This avoids forming the possibly high condition number matrix A^T A and thus is more reliably accurate than using the normal equations.
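Here is a sketch of this approach (assuming NumPy and SciPy; the data are made up). Note that np.linalg.qr returns the reduced QR factorization, which is what the least-squares computation needs.

```python
import numpy as np
from scipy.linalg import solve_triangular

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
b = np.array([1.0, 2.0, 2.0, 4.0])

Q, R = np.linalg.qr(A)                # reduced QR: Q is 4 x 2, R is 2 x 2
x = solve_triangular(R, Q.T @ b)      # back substitution on R x = Q^T b
print(x)
print(np.linalg.lstsq(A, b, rcond=None)[0])   # the two solutions agree
```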

Gram-Schmidt Orthogonalization

The Gram-Schmidt algorithm progressively orthogonalizes a sequence of vectors A_1, A_2, \ldots, A_n (which for the QR decomposition we view as the columns of a matrix A): from each A_i it subtracts the projections onto the previously orthogonalized vectors Q_1, \ldots, Q_{i-1}, and then normalizes what remains. The first few steps of this process are:

r_{1,1} = |A_1|, \ \ \ Q_1 = A_1/r_{1,1}
r_{1,2} = Q_1^T A_2, \ \ \ r_{2,2} = |A_2 - r_{1,2}Q_1|,
Q_2 = (A_2 - r_{1,2}Q_1)/r_{2,2}
r_{1,3} = Q_1^T A_3, \ \ \ r_{2,3} = Q_2^T A_3, \ \ \ r_{3,3} = |A_3 - r_{1,3}Q_1 - r_{2,3} Q_2|,
Q_3 = (A_3 - r_{1,3}Q_1 - r_{2,3} Q_2)/r_{3,3}
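A minimal NumPy sketch of this procedure (the function name is our own, and the test matrix is made up):

```python
import numpy as np

def gram_schmidt_qr(A):
    """Classical Gram-Schmidt QR of a matrix with linearly independent columns."""
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        v = A[:, j].copy()
        for i in range(j):
            R[i, j] = Q[:, i] @ A[:, j]   # component of A_j along Q_i
            v -= R[i, j] * Q[:, i]        # subtract that component
        R[j, j] = np.linalg.norm(v)       # length of what remains
        Q[:, j] = v / R[j, j]             # normalize to get Q_j
    return Q, R

A = np.array([[2.0, 1.0],
              [2.0, 0.0],
              [1.0, 2.0]])
Q, R = gram_schmidt_qr(A)
print(np.allclose(Q @ R, A))              # True
print(np.allclose(Q.T @ Q, np.eye(2)))    # True: orthonormal columns
```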

Gram-Schmidt example

As an example we will compute the QR decomposition of

A = \left(\begin{array}{ccc} 1 & 1 & 1 \\ 1 & 1 & 2 \\ 0 & 2 & 1 \end{array}\right) = \left(\begin{array}{c|c|c} & & \\ A_1 & A_2& A_3 \\ & & \end{array}\right)

using the Gram-Schmidt algorithm. We first compute the length of the first column (A_1) to find r_{1,1}:

r_{1,1} = |A_1| = \sqrt{1^2 + 1^2 + 0^2} = \sqrt{2}

and then

Q_1 = A_1/r_{1,1} = \left(\begin{array}{c} \frac{\sqrt{2}}{2} \\ \frac{\sqrt{2}}{2} \\ 0 \end{array}\right)

Next, we subtract from A_2 its component in the Q_1 direction; that component is r_{1,2}, and the length of the remaining vector is r_{2,2}:

r_{1,2} = Q_1^T A_2 = \frac{\sqrt{2}}{2} + \frac{\sqrt{2}}{2} + 0 = \sqrt{2}
r_{2,2} = |A_2 - r_{1,2}Q_1| = \sqrt{0^2 + 0^2 + 2^2} = 2

Using these we can compute Q_2:

Q_2 = \frac{A_2 - r_{1,2}Q_1}{r_{2,2}} = \left ( \begin{array}{c} 0 \\ 0 \\ 1 \end{array}\right )

Finally, we compute the components of A_3 in the Q_1 and Q_2 directions (r_{1,3} and r_{2,3} respectively), and then subtract those off:

r_{1,3} = Q_1^T A_3 = \frac{3 \sqrt{2}}{2}
r_{2,3} = Q_2^T A_3 = 1
r_{3,3} = | A_3 - r_{1,3} Q_1 - r_{2,3} Q_2 | = \left | \left(\begin{array}{c} -\frac{1}{2} \\ \frac{1}{2} \\ 0 \end{array} \right) \right | = \frac{\sqrt{2}}{2}
Q_3 = \frac{A_3 - r_{1,3}Q_1 - r_{2,3} Q_2}{r_{3,3}} = \left(\begin{array}{c} -\frac{\sqrt{2}}{2} \\ \frac{\sqrt{2}}{2} \\ 0 \end{array} \right)

so the decomposition is

A = \left(\begin{array}{ccc} 1 & 1 & 1 \\ 1 & 1 & 2 \\ 0 & 2 & 1 \end{array}\right) = \left(\begin{array}{ccc} \frac{\sqrt{2}}{2} & 0 & -\frac{\sqrt{2}}{2} \\ \frac{\sqrt{2}}{2} & 0 & \frac{\sqrt{2}}{2} \\ 0 & 1 & 0 \end{array}\right) \left(\begin{array}{ccc} \sqrt{2} & \sqrt{2} & \frac{3 \sqrt{2}}{2} \\ 0 & 2 & 1 \\ 0 & 0 & \frac{\sqrt{2}}{2} \end{array}\right) = Q R
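We can check this hand computation numerically (a sketch assuming NumPy). Note that np.linalg.qr may return Q and R with some columns and rows negated, since the factorization is only unique up to such sign choices.

```python
import numpy as np

s2 = np.sqrt(2)
Q = np.array([[s2/2, 0.0, -s2/2],
              [s2/2, 0.0,  s2/2],
              [0.0,  1.0,  0.0 ]])
R = np.array([[s2,  s2,  3*s2/2],
              [0.0, 2.0, 1.0   ],
              [0.0, 0.0, s2/2  ]])
A = np.array([[1.0, 1.0, 1.0],
              [1.0, 1.0, 2.0],
              [0.0, 2.0, 1.0]])

print(np.allclose(Q @ R, A))                 # True: the hand computation reproduces A
print(np.allclose(Q.T @ Q, np.eye(3)))       # True: Q has orthonormal columns

Q2, R2 = np.linalg.qr(A)
print(np.allclose(np.abs(R2), np.abs(R)))    # True: agrees up to signs
```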

Householder reflections

A Householder reflection is constructed from a vector v as: I - 2 \frac{v v^T}{v^T v}

We use reflections to triangularize a matrix by reflecting a column vector a onto the vector -sgn(a_1)|a| e_1, where e_1 is the first unit coordinate vector (i.e. e_1^T = (1,0,\ldots,0)). The vector v in the reflection formula is the difference between a and this target: v = a + sgn(a_1)|a| e_1. The sign function sgn(x) is 1 if x \ge 0 and -1 otherwise. This choice of sign ensures that v is not too small a vector, which would cause a loss of precision through cancellation.
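Here is a small sketch (assuming NumPy; the helper name is our own) that builds this reflector for a given column and checks that it sends the column to a multiple of e_1.

```python
import numpy as np

def householder_reflection(a):
    """Reflection H with H @ a = -sgn(a_1) |a| e_1 (taking sgn(0) = 1)."""
    v = a.astype(float)                       # v = a + sgn(a_1) |a| e_1
    v[0] += np.linalg.norm(a) if a[0] >= 0 else -np.linalg.norm(a)
    return np.eye(len(a)) - 2.0 * np.outer(v, v) / (v @ v)

a = np.array([1.0, 2.0, 2.0])                 # first column of the example below
H = householder_reflection(a)
print(H @ a)                                  # approximately (-3, 0, 0)
print(np.allclose(H @ H, np.eye(3)))          # True: H is its own inverse
```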

Householder reflection
Example Householder reflection with a projection shown in grey.

The figure illustrates an example in which the column vector A_1 = (1,1)^T (in blue) is reflected onto (-\sqrt{2},0)^T (in red). The difference between these is v (in green). The grey vector points along the line of reflection; the orthogonal projection matrix I - \frac{v v^T}{v^T v} would project A_1 onto the grey vector.

Householder
Alston Scott Householder.

Example QR with Householder

Here we compute a quick example of the QR decomposition using a Householder reflection. To triangularize the matrix:

A = \left ( \begin{array}{cc} 1 & 2 \\ 2 & 1 \\ 2 & 1 \end{array} \right )

we will reflect the first column A_1 = (1,2,2)^T onto -|A_1|e_1 = (-3,0,0)^T. The difference between A_1 and this target is v = (4,2,2)^T, so the reflection matrix is

Q_1 = I - 2 \frac{v \ v^T}{v^T v} = \left(\begin{array}{rrr} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{array}\right) - \frac{2}{24} \left(\begin{array}{rrr} 16 & 8 & 8 \\ 8 & 4 & 4 \\ 8 & 4 & 4 \end{array}\right)
= \left(\begin{array}{rrr} -\frac{1}{3} & -\frac{2}{3} & -\frac{2}{3} \\ -\frac{2}{3} & \frac{2}{3} & -\frac{1}{3} \\ -\frac{2}{3} & -\frac{1}{3} & \frac{2}{3} \end{array}\right)

Now we have

Q_1 A = \left(\begin{array}{rr} -3 & -2 \\ 0 & -1 \\ 0 & -1 \end{array}\right)

Now we need a second reflection to send the lower part of the second column of Q_1 A, (-1,-1)^T, to (\sqrt{2},0)^T. For this two-dimensional reflection, v = (-1,-1)^T - (\sqrt{2},0)^T, and we have

Q_2 = \left(\begin{array}{rrr} 1 & 0 & 0 \\ 0 & -1/\sqrt{2} & -1/\sqrt{2} \\ 0 & -1/\sqrt{2} & 1/\sqrt{2} \end{array}\right)

Since Q_1 and Q_2 are symmetric and orthogonal, Q_1 = Q_1^T = Q_1^{-1} and Q_2 = Q_2^T = Q_2^{-1}, so

A = Q_1 Q_2 R = QR = \left(\begin{array}{rrr} -\frac{1}{3} & \frac{2}{3} \, \sqrt{2} & 0 \\ -\frac{2}{3} & -\frac{1}{6} \, \sqrt{2} & -\frac{1}{2} \, \sqrt{2} \\ -\frac{2}{3} & -\frac{1}{6} \, \sqrt{2} & \frac{1}{2} \, \sqrt{2} \end{array}\right) \left(\begin{array}{rr} -3 & -2 \\ 0 & \sqrt{2} \\ 0 & 0 \end{array}\right)
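A quick numerical check of this factorization (a sketch assuming NumPy):

```python
import numpy as np

s2 = np.sqrt(2)
Q = np.array([[-1/3,  2*s2/3,  0.0  ],
              [-2/3, -s2/6,   -s2/2 ],
              [-2/3, -s2/6,    s2/2 ]])
R = np.array([[-3.0, -2.0],
              [ 0.0,  s2 ],
              [ 0.0,  0.0]])
A = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [2.0, 1.0]])

print(np.allclose(Q @ R, A))               # True
print(np.allclose(Q.T @ Q, np.eye(3)))     # True: Q is orthogonal
```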

The Singular Value Decomposition

Every matrix A has a Singular Value Decomposition:

A = U \Sigma V^T

where U and V are orthogonal (unitary in the complex case), and the only nonzero entries of \Sigma are on its diagonal, with \Sigma_{i,i} \ge \Sigma_{i+1,i+1} \ge 0. These diagonal values of \Sigma are called the singular values of A. The singular values have a very nice geometric interpretation: the image of a unit sphere (each point thought of as a vector) under multiplication by a matrix A is an ellipsoid (in the appropriate dimension) whose semi-axis lengths are given by the singular values. This is shown below for the particular example of

A = \left(\begin{array}{rr} -3 & 3 \\ 0 & 2 \end{array} \right)

for which the singular values are

\sigma_1 = \sqrt{11 + \sqrt{85}} \approx 4.5,
\sigma_2 = \sqrt{11 - \sqrt{85}} \approx 1.3

In the image, the shorter bright green and purple vectors are the columns of V, the right-singular vectors. They are sent by A (multiplying from the left) to the darker vectors \sigma_1 u_1 and \sigma_2 u_2. Note that v_1 and v_2 are orthogonal, and so are u_1 and u_2.

SVD example
SVD example showing singular vectors.

The SVD is not cheap to compute, but with it in hand we can compute almost anything else we might want regarding the matrix A.

A simple way to compute the SVD (not the best way for larger matrices) is to form the matrix A^T A = V \Sigma^T \Sigma V^T, which is symmetric. Its eigenvectors are the v_i (the columns of V), and its eigenvalues are the squares of the singular values.
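The following sketch (assuming NumPy) carries this out for the 2 by 2 example above and compares the result with the library SVD routine.

```python
import numpy as np

A = np.array([[-3.0, 3.0],
              [ 0.0, 2.0]])

# Eigen-decomposition of the symmetric matrix A^T A.
evals, V = np.linalg.eigh(A.T @ A)         # eigenvalues in ascending order
order = np.argsort(evals)[::-1]            # reorder to descending, as in the SVD
evals, V = evals[order], V[:, order]

sigma = np.sqrt(evals)                     # singular values = sqrt of eigenvalues
U = (A @ V) / sigma                        # u_i = A v_i / sigma_i (all sigma_i > 0 here)

print(sigma)                               # approximately [4.497, 1.334]
print(np.linalg.svd(A, compute_uv=False))  # the library routine agrees
print(np.allclose((U * sigma) @ V.T, A))   # True: U diag(sigma) V^T reconstructs A
```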

Low-rank approximations

A major application of the SVD is using it to create optimal low-rank approximations to a matrix. If we want a rank r approximation to A, we simply truncate the number of singular vectors used, i.e.

A_r = \sum_{i=1}^{r} u_i \sigma_i v_i^*

where the u_i are the left singular vectors, the \sigma_i are the singular values, and the v_i are the right singular vectors.

These approximations to A are optimal in the sense that the difference between them and A itself is as small as possible in the 2-norm, with the difference being:

| A - A_r |_2 = \sigma_{r+1}
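A short sketch (assuming NumPy, with a made-up matrix) of building the rank-r truncation and checking that the 2-norm error equals \sigma_{r+1}:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))             # a made-up 6 x 4 matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)

r = 2
A_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]    # rank-r truncation of the SVD

# The 2-norm error of the rank-r truncation is the next singular value.
print(np.linalg.norm(A - A_r, 2), s[r])        # the two numbers agree
```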

The Pseudo-inverse

The SVD gives a nice way to compute and express the pseudo-inverse of a matrix A:

A^{\dagger} = \sum \sigma^{-1}_k v_k u_k^T

where we replace \sigma^{-1}_k by 0 if \sigma_k=0.

This gives another way to think about least-squares solutions: x = A^{\dagger} b is a least-squares solution (in fact, the one of minimum norm when the least-squares solution is not unique).
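A sketch (assuming NumPy; the data are made up) showing that applying the pseudo-inverse to b recovers the least-squares solution, and that the pseudo-inverse can be assembled directly from the SVD:

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

# Pseudo-inverse solution versus the standard least-squares solver.
x_pinv = np.linalg.pinv(A) @ b
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.allclose(x_pinv, x_lstsq))              # True

# Assembling A^+ from the (reduced) SVD, inverting only nonzero singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
s_inv = np.divide(1.0, s, out=np.zeros_like(s), where=s > 1e-12)
A_pinv = Vt.T @ np.diag(s_inv) @ U.T
print(np.allclose(A_pinv, np.linalg.pinv(A)))    # True
```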

2-norm Condition number

The 2-norm condition number of a matrix A can be computed from the SVD as \sigma_1/\sigma_n, even when A is not square.
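For example (a sketch assuming NumPy, with a made-up matrix):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])                  # a made-up non-square matrix
s = np.linalg.svd(A, compute_uv=False)
print(s[0] / s[-1])                         # sigma_1 / sigma_n
print(np.linalg.cond(A))                    # NumPy's 2-norm condition number agrees
```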

SVD Example

Let's look at a simple example of the SVD for the matrix:

A = \left(\begin{array}{rr} 0 & -1 \\ 1 & 0 \end{array}\right)

We can compute the singular values and the singular vectors v_i from the matrix

A^T A = \left(\begin{array}{rr} 1 & 0 \\ 0 & 1 \end{array}\right)

For this matrix, any nonzero vector is an eigenvector with eigenvalue 1. So the singular values are both 1, and we can pick any two orthonormal vectors to be v_1 and v_2. If we pick them to be the coordinate vectors v_1 = e_1 = (1,0)^T and v_2 = e_2 = (0,1)^T, then the SVD is

A = U \Sigma V^T = \left(\begin{array}{rr} 0 & -1 \\ 1 & 0 \end{array}\right) \left(\begin{array}{rr} 1 & 0 \\ 0 & 1 \end{array}\right) \left(\begin{array}{rr} 1 & 0 \\ 0 & 1 \end{array}\right)

The SVD is not very useful in this case; we include the example only to show what happens in a very simple setting. The action of the matrix A is a 90 degree rotation, so it does not stretch vectors at all (i.e. the singular values are 1). Note that the diagonalization of A gives the same information in a different way, since the eigenvalues are complex (i and -i) and so are the eigenvectors:

A = \left(\begin{array}{rr} \sqrt{2} & \sqrt{2} \\ -i \, \sqrt{2} & i \, \sqrt{2} \end{array}\right) \left(\begin{array}{rr} i & 0 \\ 0 & -i \end{array}\right) \left(\begin{array}{rr} \sqrt{2} & \sqrt{2} \\ -i \, \sqrt{2} & i \, \sqrt{2} \end{array}\right)^{-1}
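Both factorizations are easy to confirm numerically (a sketch assuming NumPy):

```python
import numpy as np

A = np.array([[0.0, -1.0],
              [1.0,  0.0]])

# The SVD written above: U = A, Sigma = I, V = I.
print(np.linalg.svd(A, compute_uv=False))          # [1., 1.]
print(np.allclose(A @ np.eye(2) @ np.eye(2), A))   # True

# The complex diagonalization A = M D M^{-1}.
M = np.array([[np.sqrt(2), np.sqrt(2)],
              [-1j * np.sqrt(2), 1j * np.sqrt(2)]])
D = np.diag([1j, -1j])
print(np.allclose(M @ D @ np.linalg.inv(M), A))    # True
```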

Another SVD example

Let's look at a more typical small example of the SVD, for the matrix:

A = \left(\begin{array}{rr} 1 & -1 \\ -2 & 2 \\ 3 & -3 \end{array}\right)

Again we use the technique of forming the matrix A^T A and computing its eigenvalues and eigenvectors.

A^T A = \left(\begin{array}{rrr} 1 & -2 & 3 \\ -1 & 2 & -3 \end{array}\right) \left(\begin{array}{rr} 1 & -1 \\ -2 & 2 \\ 3 & -3 \end{array}\right) = \left(\begin{array}{rr} 14 & -14 \\ -14 & 14 \end{array}\right)

The eigenvalues of A^T A are 28 and 0, so the singular values are \sqrt{28} and 0. We can choose unit length eigenvectors

v_1 = \left(\begin{array}{r} 1/\sqrt{2} \\ -1/\sqrt{2} \end{array}\right), \ \ \ v_2 = \left(\begin{array}{r} 1/\sqrt{2} \\ 1/\sqrt{2} \end{array}\right)

Because of the zero eigenvalue, we cannot define u_2 = A v_2 /\sigma_2. But we do have

u_1 = A v_1/\sigma_1 = \frac{1}{\sqrt{14}}\left(\begin{array}{r} 1 \\ -2 \\ 3 \end{array}\right)

which means the reduced SVD is

A = U \Sigma V^T = \left(\begin{array}{r} \frac{1}{\sqrt{14}} \\ \frac{-2}{\sqrt{14}} \\ \frac{3}{\sqrt{14}} \end{array}\right) \left(\begin{array}{r} \sqrt{28} \end{array}\right) \left(\begin{array}{rr} 1/\sqrt{2} & -1/\sqrt{2} \end{array}\right)

To form a full SVD we would need to extend u_1 to an orthonormal set \{u_1, u_2, u_3\}; there is no unique way to do this, but one choice is

u_2 = \left(\begin{array}{r} 2/\sqrt{5} \\ 1/\sqrt{5} \\ 0 \end{array}\right)

and then we must choose u_3 (up to a sign) to be

u_3 = \left(\begin{array}{r} 3/\sqrt{70} \\ -6/\sqrt{70} \\ -5/\sqrt{70} \end{array}\right)

So one full SVD of A is

A = \left(\begin{array}{rrr} \frac{1}{\sqrt{14}} & 2/\sqrt{5} & 3/\sqrt{70} \\ \frac{-2}{\sqrt{14}} & 1/\sqrt{5} & -6/\sqrt{70} \\ \frac{3}{\sqrt{14}} & 0 & -5/\sqrt{70} \end{array}\right ) \left(\begin{array}{rr} \sqrt{28} & 0 \\ 0 & 0 \\ 0 & 0 \end{array}\right ) \left(\begin{array}{rr} 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2} \\ \end{array} \right)
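Checking this factorization numerically (a sketch assuming NumPy):

```python
import numpy as np

A = np.array([[ 1.0, -1.0],
              [-2.0,  2.0],
              [ 3.0, -3.0]])

U = np.array([[ 1/np.sqrt(14), 2/np.sqrt(5),  3/np.sqrt(70)],
              [-2/np.sqrt(14), 1/np.sqrt(5), -6/np.sqrt(70)],
              [ 3/np.sqrt(14), 0.0,          -5/np.sqrt(70)]])
S = np.array([[np.sqrt(28), 0.0],
              [0.0,         0.0],
              [0.0,         0.0]])
Vt = np.array([[1/np.sqrt(2), -1/np.sqrt(2)],
               [1/np.sqrt(2),  1/np.sqrt(2)]])

print(np.allclose(U @ S @ Vt, A))             # True
print(np.allclose(U.T @ U, np.eye(3)))        # True: U is orthogonal
print(np.linalg.svd(A, compute_uv=False))     # approximately [5.2915, 0] = [sqrt(28), 0]
```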

Least-Squares Exercises