Orthogonality and Least-Squares
Orthogonal vectors and matrices
Least squares solutions and the QR decomposition
It is extremely common in many applications to want the closest thing to a solution x of an overdetermined system

A x = b,

where A is m \times n with m > n, and b is an m-dimensional given vector. Usually an exact solution does not exist, and so we search for an x which minimizes the 2-norm of the residual r = Ax - b. The use of the 2-norm is what gives this the name "least-squares" solution.
Geometrically we can think of the least-squares solution x as the vector for which Ax is the orthogonal projection of b onto the range of A. Because of this, it is useful to understand some of the linear algebra of projection matrices when studying the process of finding least-squares solutions.
The simplest way of tackling the least-squares problem is to multiply both sides of the equation by the transpose of A:

A^T A x = A^T b.

These are called the "normal equations". The right-hand side is now n-dimensional and there is a unique solution for x as long as the rank of A is n:

x = (A^T A)^{-1} A^T b.
As usual in numerical linear algebra we do not want to actually invert A^T A, but rather solve for x using some variant of the PLU decomposition. For symmetric positive-definite matrices such as A^T A (when A has rank n) there is a more efficient factorization algorithm (the Cholesky factorization).
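As a minimal sketch of this approach (assuming NumPy and SciPy are available; the matrix A and vector b below are made up for illustration), we can form the normal equations and solve them with a Cholesky factorization:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

# Illustrative overdetermined system (m > n); this A and b are made up.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])

# Form the normal equations A^T A x = A^T b and solve them using a
# Cholesky factorization of the symmetric positive-definite matrix A^T A.
c, low = cho_factor(A.T @ A)
x = cho_solve((c, low), A.T @ b)

print(x)
print(np.linalg.lstsq(A, b, rcond=None)[0])  # cross-check
```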
Sometimes this formulation of least-squares is sufficient to find the answers we need, but it has a drawback that can be important: the condition number of A^T A is the square of the condition number of A, which can degrade the accuracy of the computed solution. The methods described below avoid this problem.
Simple example of least squares
Suppose we want the least-squares solution to the system
The matrix A^T A is here just the number 10, and A^T b = 8, so x = (A^T A)^{-1} (A^T b) = 8/10 = 4/5. This is shown in the figure below.
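A quick numerical check of this arithmetic (the one-column system below is hypothetical, chosen so that A^T A = 10 and A^T b = 8, since the original system is not reproduced here):

```python
import numpy as np

# Hypothetical one-column system with A^T A = 10 and A^T b = 8,
# matching the numbers above (the original A and b are not shown here).
A = np.array([[1.0],
              [3.0]])
b = np.array([2.0, 2.0])

x = np.linalg.solve(A.T @ A, A.T @ b)
print(x)  # [0.8], i.e. 4/5
```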
Projections and orthogonality
A matrix P is a projection if P^2 = P. If in addition P^T = P then P is an orthogonal projection.
The definition of a projection implies that the matrix must be square.
The word orthogonal is used in several very different ways in linear algebra, although all of them relate to the idea of various things being perpendicular - i.e. with an angle of \pi/2. For vectors in any dimension, the angle \theta between vectors u and v is usually defined by the formula

\theta = \cos^{-1} \left( \frac{u^T v}{|u| |v|} \right).
Notice that with the principal branch of the inverse cosine the angle \theta will be in the interval [0,\pi].
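As a quick numerical illustration of this formula (an arbitrary pair of vectors, assuming NumPy):

```python
import numpy as np

# Illustrative vectors; the angle between them should be pi/4.
u = np.array([1.0, 0.0])
v = np.array([1.0, 1.0])

theta = np.arccos(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
print(theta, np.pi / 4)
```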
The orthogonal projection onto the span of a vector u can be written as

P = \frac{u u^T}{u^T u}.
If the range of an orthogonal projection is spanned by a basis a_1, a_2, \ldots, a_m, and we put the basis vectors a_i into a matrix A as its columns, then we can write the projection as

P = A (A^T A)^{-1} A^T.
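A minimal sketch checking these properties numerically (the matrix A below is just an illustrative choice, assuming NumPy):

```python
import numpy as np

# An illustrative matrix whose columns span the subspace we project onto.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])

# Orthogonal projection onto the range (column space) of A.
P = A @ np.linalg.inv(A.T @ A) @ A.T

# P should satisfy P^2 = P and P^T = P.
print(np.allclose(P @ P, P), np.allclose(P.T, P))
```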
Non-orthogonal projection example
Let's compute the projection matrix P whose range is the span of (2,1) and whose nullspace is the span of (1,-1). Another way to say this is that P projects onto (2,1) along (1,-1).
P acts on (2,1) as the identity, and P sends (1,-1) to the zero vector. Writing these conditions as one matrix equation gives us:

P \left(\begin{array}{rr} 2 & 1 \\ 1 & -1 \end{array}\right) = \left(\begin{array}{rr} 2 & 0 \\ 1 & 0 \end{array}\right).
So one way to compute P is:

P = \left(\begin{array}{rr} 2 & 0 \\ 1 & 0 \end{array}\right) \left(\begin{array}{rr} 2 & 1 \\ 1 & -1 \end{array}\right)^{-1} = \frac{1}{3} \left(\begin{array}{rr} 2 & 2 \\ 1 & 1 \end{array}\right).
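The same computation as a short code sketch (assuming NumPy):

```python
import numpy as np

# Columns of B are the vectors (2,1) and (1,-1); columns of C are their
# required images (2,1) and (0,0) under P.
B = np.array([[2.0, 1.0],
              [1.0, -1.0]])
C = np.array([[2.0, 0.0],
              [1.0, 0.0]])

P = C @ np.linalg.inv(B)
print(P)                      # [[2/3, 2/3], [1/3, 1/3]]
print(np.allclose(P @ P, P))  # True: P is a projection
print(np.allclose(P.T, P))    # False: it is not an orthogonal projection
```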
QR decompositions
The QR decomposition is an extremely useful matrix decomposition used for a variety of purposes in numerical linear algebra. For the moment we will focus on its use in least-squares problems.
If A is an m by n matrix, then a QR decomposition A = QR consists of an orthogonal (or unitary in the complex case) matrix Q and an upper-triangular matrix R. A matrix Q is defined to be orthogonal if Q^T Q = I; a complex matrix Q is defined to be unitary if Q^* Q = I, where Q^* is the complex conjugate transpose of Q.
With a QR decomposition in hand we can reframe the least-squares problem

Q R x = b

by multiplying both sides by Q^T, obtaining

R x = Q^T b,
and this is easy to solve with back substitution since R is upper-triangular. This avoids forming the possibly ill-conditioned matrix A^T A, and thus is more reliably accurate than using the normal equations.
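A minimal sketch of the QR approach to least squares (assuming NumPy and SciPy; the same illustrative A and b as in the earlier sketch):

```python
import numpy as np
from scipy.linalg import solve_triangular

# The same illustrative A and b as before.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])

Q, R = np.linalg.qr(A)            # reduced QR: Q is 3x2, R is 2x2
x = solve_triangular(R, Q.T @ b)  # back substitution on R x = Q^T b
print(x)
print(np.linalg.lstsq(A, b, rcond=None)[0])  # agrees
```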
Gram-Schmidt Orthogonalization
The Gram-Schmidt algorithm progressively orthogonalizes a sequence of vectors A_1, A_2, \ldots, A_n (which in the QR algorithm we view as the columns of a matrix A) by subtracting from each A_i its projections onto the previously orthogonalized vectors Q_j and then normalizing the length of what remains. The first few steps of this process are:

r_{1,1} = |A_1|, \qquad Q_1 = A_1 / r_{1,1},

r_{1,2} = Q_1^T A_2, \qquad r_{2,2} = |A_2 - r_{1,2} Q_1|, \qquad Q_2 = (A_2 - r_{1,2} Q_1)/r_{2,2}.
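A minimal sketch of this algorithm in code (classical Gram-Schmidt; the test matrix is chosen arbitrarily for illustration, assuming NumPy):

```python
import numpy as np

def gram_schmidt_qr(A):
    """Reduced QR of A by classical Gram-Schmidt, following the steps above.

    (In floating point the 'modified' variant, which subtracts each projection
    from the working vector before computing the next coefficient, behaves
    better, but the classical version matches the description in the text.)
    """
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for i in range(n):
        v = A[:, i].astype(float)
        for j in range(i):
            R[j, i] = Q[:, j] @ A[:, i]   # component of A_i in the Q_j direction
            v = v - R[j, i] * Q[:, j]     # subtract it off
        R[i, i] = np.linalg.norm(v)       # length of what remains
        Q[:, i] = v / R[i, i]
    return Q, R

# Illustrative test matrix.
A = np.array([[1.0, 2.0],
              [1.0, 0.0],
              [2.0, 1.0]])
Q, R = gram_schmidt_qr(A)
print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(2)))
```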
Gram-Schmidt example
As an example we will compute the QR decomposition of
using the Gram-Schmidt algorithm. We first compute the length of the first column (A_1) to find r_{1,1}:
and then
Next, we subtract from A_2 its component in the Q_1 direction. The length of the remaining vector is r_{2,2}, and the Q_1 component is r_{1,2}:
Using these we can compute Q_2:
Finally, we compute the components of A_3 in the Q_1 and Q_2 directions (r_{1,3} and r_{2,3} respectively), and then subtract those off:
so the decomposition is
Householder reflections
A Householder reflection is constructed from a vector v as: I - 2 \frac{v v^T}{v^T v}
We use reflections to triangularize a matrix by reflecting a column vector a onto the vector -sgn(a_1)|a| e_1, where e_1 is the first unit coordinate vector (i.e. e_1^T = (1,0,\ldots,0)). The vector v in the reflection formula is the difference between a and the target: v = a + sgn(a_1)|a| e_1. The sign function sgn(x) is 1 if x is positive and -1 if x is negative. This convention ensures that v is not too small a vector, which could lead to a loss of precision.
The figure illustrates an example in which the column vector A_1 = (1,1)^T (in blue) is reflected onto (-\sqrt{2},0)^T (in red). The difference between these is v (in green). The grey vector points along the line of reflection; the orthogonal projection matrix I - \frac{v v^T}{v^T v} would project A_1 onto the grey vector.
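A minimal sketch of this construction (assuming NumPy), checked on the vector (1,1)^T from the figure:

```python
import numpy as np

def householder_reflection(a):
    """Reflection I - 2 v v^T / (v^T v) with v = a + sgn(a_1)|a| e_1,
    which sends a to -sgn(a_1)|a| e_1."""
    a = np.asarray(a, dtype=float)
    sign = 1.0 if a[0] >= 0 else -1.0
    v = a.copy()
    v[0] += sign * np.linalg.norm(a)
    return np.eye(len(a)) - 2.0 * np.outer(v, v) / (v @ v)

H = householder_reflection([1.0, 1.0])
print(H @ np.array([1.0, 1.0]))   # approximately (-sqrt(2), 0)
```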
Example QR with Householder
Here we compute a quick example of the QR decomposition using a Householder reflection. To triangularize the matrix:
we will reflect the first column A_1 = (1,2,2)^T onto -|A_1|e_1 = (-3,0,0)^T. The difference between A_1 and this target is v = (4,2,2)^T, so the reflection matrix is

Q_1 = I - 2 \frac{v v^T}{v^T v} = \frac{1}{3} \left(\begin{array}{rrr} -1 & -2 & -2 \\ -2 & 2 & -1 \\ -2 & -1 & 2 \end{array}\right).
Now we have
Now we need a second reflection to send the lower part of the (updated) second column, (-1,-1)^T, to (\sqrt{2},0)^T. For this 2D reflection, v = (-1,-1)^T - (\sqrt{2},0)^T = (-1-\sqrt{2}, -1)^T, and we have
Since Q_1 and Q_2 are symmetric and orthogonal, Q_1 = Q_1^T = Q_1^{-1} and Q_2 = Q_2^T = Q_2^{-1}, so from Q_2 Q_1 A = R we get A = Q_1 Q_2 R, i.e. Q = Q_1 Q_2.
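A sketch of the whole triangularization process, applied to a hypothetical matrix whose first column is (1,2,2)^T as in the example (the second column is made up):

```python
import numpy as np

def householder_qr(A):
    """Full QR decomposition built from successive Householder reflections."""
    m, n = A.shape
    R = A.astype(float)
    Q = np.eye(m)
    for k in range(min(m, n)):
        a = R[k:, k]
        sign = 1.0 if a[0] >= 0 else -1.0
        v = a.copy()
        v[0] += sign * np.linalg.norm(a)      # v = a + sgn(a_1)|a| e_1
        if v @ v == 0:                        # column already zeroed out
            continue
        Qk = np.eye(m)
        Qk[k:, k:] -= 2.0 * np.outer(v, v) / (v @ v)
        R = Qk @ R                            # zero out column k below the diagonal
        Q = Q @ Qk                            # Q = Q_1 Q_2 ... (each Q_k is its own inverse)
    return Q, R

# Hypothetical matrix: first column (1,2,2)^T as in the example above,
# second column made up for illustration.
A = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [2.0, 1.0]])
Q, R = householder_qr(A)
print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(3)))
print(np.round(R, 6))   # upper-triangular
```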
The Singular Value Decomposition
Every matrix A has a Singular Value Decomposition:

A = U \Sigma V^T

(or A = U \Sigma V^* in the complex case), where U and V are unitary matrices, and the only nonzero entries of \Sigma are on its diagonal with \Sigma_{i,i} \ge \Sigma_{i+1,i+1} \ge 0. These diagonal values of \Sigma are called the singular values of A. The singular values have a very nice geometric interpretation: the image of a unit sphere (each point thought of as a vector) under multiplication by a matrix A is an ellipsoid (in the appropriate dimension) with semi-axes lengths given by the singular values. This is shown below for the particular example of
for which the singular values are
In the image, the shorter bright green and purple vectors are the columns of V, the right-singular vectors. They are sent by A (multiplying from the left) to the darker vectors \sigma_1 u_1 and \sigma_2 u_2. Note that v_1 and v_2 are orthogonal, and so are u_1 and u_2.
The SVD is not cheap to compute, but with it in hand we can compute almost anything else we might want regarding the matrix A.
A simple way to compute the SVD (not the best way for larger matrices) is to form the matrix A^T A = V \Sigma^T \Sigma V^T, which is symmetric. Its eigenvectors are the v_i (the columns of V), and its eigenvalues are the squares of the singular values.
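A minimal sketch of this approach (the matrix below is an arbitrary illustration, assuming NumPy):

```python
import numpy as np

# Illustrative matrix; this is fine for small examples but is not how
# library SVD routines work for larger matrices.
A = np.array([[2.0, 1.0],
              [1.0, 2.0],
              [0.0, 1.0]])

# Eigen-decompose the symmetric matrix A^T A = V Sigma^T Sigma V^T.
evals, V = np.linalg.eigh(A.T @ A)       # eigenvalues in ascending order
order = np.argsort(evals)[::-1]          # reorder so sigma_1 >= sigma_2 >= ...
evals, V = evals[order], V[:, order]

sigma = np.sqrt(np.maximum(evals, 0.0))  # singular values
U = (A @ V) / sigma                      # u_i = A v_i / sigma_i (all sigma_i > 0 here)

print(sigma)
print(np.linalg.svd(A, compute_uv=False))  # should match
```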
Low-rank approximations
A major application of the SVD is using it to create optimal low-rank approximations to a matrix. If we want a rank r approximation to A, we simply truncate the number of singular vectors used, i.e.

A_r = \sum_{i=1}^{r} \sigma_i u_i v_i^T,
where the u_i are the left singular vectors, the \sigma_i are the singular values, and the v_i are the right singular vectors.
These approximations to A are optimal in the sense that the difference between them and A itself is as small as possible in the 2-norm, with the difference being:

\|A - A_r\|_2 = \sigma_{r+1}.
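A minimal sketch of a truncated-SVD approximation, checking the 2-norm of the error against the first discarded singular value (the test matrix is random, for illustration):

```python
import numpy as np

def low_rank_approx(A, r):
    """Best rank-r approximation of A in the 2-norm, from a truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

# Illustrative random matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
A1 = low_rank_approx(A, 1)

s = np.linalg.svd(A, compute_uv=False)
print(np.linalg.norm(A - A1, 2))   # equals sigma_2 ...
print(s[1])                        # ... the first discarded singular value
```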
The Pseudo-inverse
The SVD gives a nice way to compute and express the pseudo-inverse of a matrix A:

A^{+} = V \Sigma^{+} U^T,

where \Sigma^{+} is diagonal with entries \sigma_k^{-1}, and we replace \sigma_k^{-1} by 0 if \sigma_k = 0.
This gives another way to think about least-squares solutions: just use the pseudo-inverse, x = A^{+} b.
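A minimal sketch building the pseudo-inverse from the SVD and applying it to the earlier illustrative least-squares problem:

```python
import numpy as np

def pinv_from_svd(A, tol=1e-12):
    """Pseudo-inverse A^+ = V Sigma^+ U^T, replacing 1/sigma_k by 0 when
    sigma_k is (numerically) zero."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_inv = np.array([1.0 / x if x > tol else 0.0 for x in s])
    return (Vt.T * s_inv) @ U.T

# The same illustrative least-squares problem as before.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])

print(np.allclose(pinv_from_svd(A), np.linalg.pinv(A)))
print(pinv_from_svd(A) @ b)                   # least-squares solution A^+ b
print(np.linalg.lstsq(A, b, rcond=None)[0])   # agrees
```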
2-norm Condition number
The 2-norm condition number of a matrix A can be computed from the SVD as \sigma_1/\sigma_n, even when A is not square.
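For example (with the same illustrative non-square matrix as above):

```python
import numpy as np

# Illustrative non-square matrix.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])

s = np.linalg.svd(A, compute_uv=False)
print(s[0] / s[-1])           # sigma_1 / sigma_n
print(np.linalg.cond(A, 2))   # the same ratio, computed by NumPy
```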
SVD Example
Let's look at a simple example of the SVD for the matrix:
We can compute the singular values and the singular vectors v_i from the matrix A^T A, which in this case is the 2 \times 2 identity matrix.
For this matrix, any nonzero vector is an eigenvector with eigenvalue 1. So the singular values are both 1, and we can pick any two orthonormal vectors to be v_1 and v_2. If we pick them to be the coordinate vectors v_1 = e_1 = (1,0)^T and v_2 = e_2 = (0,1)^T, then the SVD is A = U \Sigma V^T with U = A, \Sigma = I, and V = I.
The SVD is not very useful in this case; we include the example only to show what happens in a very simple setting. The action of the matrix A is a 90 degree rotation, so it does not stretch vectors at all (i.e. the singular values are 1). Note that the diagonalization of A gives the same information in a different way, since the eigenvalues are complex (i and -i) and so are the eigenvectors.
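A quick numerical check (the rotation matrix below is a hypothetical stand-in, since the example matrix is not reproduced here; any 2 \times 2 rotation behaves the same way):

```python
import numpy as np

# Hypothetical 90-degree rotation matrix.
A = np.array([[0.0, -1.0],
              [1.0,  0.0]])

print(np.linalg.svd(A, compute_uv=False))  # both singular values equal 1
print(np.linalg.eigvals(A))                # eigenvalues are +1j and -1j
```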
Another SVD example
Let's look at a more typical small example of the SVD, for the matrix:
Again we use the technique of forming the matrix A^T A and computing its eigenvalues and eigenvectors.
The eigenvalues of A^T A are 28 and 0, so the singular values are \sqrt{28} and 0. We can choose unit length eigenvectors
Because of the zero eigenvalue, we cannot define u_2 = A v_2 /\sigma_2. But we do have
which means the reduced SVD is
To form a full SVD we would need to construct an orthonormal set \{u_1, u_2, u_3\}; there is not a unique way to do this, but one choice is to take
and then we must choose u_3 (up to a sign) to be
So one full SVD of A is
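As a numerical check of the reduced versus full SVD (the matrix below is a hypothetical rank-1 matrix consistent with the numbers quoted above, i.e. A^T A has eigenvalues 28 and 0; the example's actual matrix may differ):

```python
import numpy as np

# Hypothetical rank-1 matrix with A^T A eigenvalues 28 and 0.
A = np.array([[1.0, 1.0],
              [2.0, 2.0],
              [3.0, 3.0]])

Ur, sr, Vtr = np.linalg.svd(A, full_matrices=False)  # reduced: U is 3x2
Uf, sf, Vtf = np.linalg.svd(A, full_matrices=True)   # full: U is 3x3

print(sr)                               # approximately [sqrt(28), 0]
print(Ur.shape, Uf.shape)               # (3, 2) (3, 3)
print(np.allclose((Ur * sr) @ Vtr, A))  # reduced SVD reproduces A
```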
Least-Squares Exercises
- Find the least-squares solution \bar{x} = \left ( \begin{array}{c} \bar{x}_1 \\ \bar{x}_2 \end{array} \right ) to the system of equations

  x_1 + x_2 = 5
  x_1 = 2
  -x_1 + x_2 = 0

- Find the 3 by 3 projection matrix P that projects along \left ( \begin{array}{c} x \\ y \\ z \end{array} \right ) = \left ( \begin{array}{c} 1 \\ 1 \\ 1 \end{array} \right ) onto the x-y plane (i.e. the span of \{ \left ( \begin{array}{c} 1 \\ 0 \\ 0 \end{array} \right ), \left ( \begin{array}{c} 0 \\ 1 \\ 0 \end{array} \right ) \}).

- Compute the reduced QR decomposition of the matrix A = \left(\begin{array}{rr} -1 & 1 \\ 2 & 1 \\ 2 & 4 \end{array}\right) using the Gram-Schmidt algorithm.

- Compute the full QR decomposition of the matrix A = \left(\begin{array}{rr} 3 & 0 \\ 4 & 5 \\ 0 & 4 \end{array}\right) using Householder triangularization.

- Compute the reduced SVD decomposition of the following matrices:

  - A = \left(\begin{array}{rr} 1 & 1 \\ 1 & 1 \end{array}\right)

  - B = \left(\begin{array}{rr} 1 & 0 \\ 1 & 1 \\ 0 & 1 \end{array}\right)