Matrix Norms and Linear Systems
Vector norms
A vector norm is a generalization of length. The first formalization of length was what we now call the Euclidean norm. In two dimensions the Euclidean norm of a vector x = (x_1, x_2) is |x|_2 = \sqrt{x_1^2 + x_2^2}. Usually if someone uses the term 'length' with no other explanation, they mean the Euclidean norm. Abstracting the key properties of length into the generalization of vector norms gives us some additional freedom which can be useful in proofs and applications.
The defining properties of a vector norm are:
- ||x|| \ge 0, with ||x|| = 0 only if x = 0.
- ||x + y|| \le ||x|| + ||y||.
- ||c x|| = |c| \, ||x|| for any scalar c.
The second property is called the triangle inequality, and requiring it for norms preserves a lot of our intuition about the idea of lengths.
The standard vector norms (p-norms)
For a vector x = (x_1, x_2, \ldots, x_n), the most commonly used norms are in the family defined by:

||x||_p = \left( \sum_{i=1}^n |x_i|^p \right)^{1/p},

where p \in [1, \infty]. The most important particular norms in this family are p=1, p=2 (the Euclidean norm), and p=\infty. The p=\infty norm is defined by the limit as p \rightarrow \infty, and is also called the max norm because in the limit we have

||x||_{\infty} = \max_i |x_i|.
So for example the vector x = (1,2,-3) has a 1-norm of |x|_1 = 1 + 2 + 3 = 6, a 2-norm of |x|_2 = \sqrt{1 + 4 + 9} = \sqrt{14}, and an \infty-norm of |x|_{\infty} = 3.
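As a quick numerical check of these values (assuming NumPy is available):

```python
import numpy as np

x = np.array([1.0, 2.0, -3.0])

print(np.linalg.norm(x, 1))       # 1-norm: 6.0
print(np.linalg.norm(x, 2))       # 2-norm: sqrt(14), about 3.742
print(np.linalg.norm(x, np.inf))  # infinity-norm: 3.0
```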
Induced vector-matrix norms
For matrices, we usually use the norms induced from vector norms, defined by

||A|| = \max_{x \ne 0} \frac{||A x||}{||x||}.
These matrix norms gauge the 'size' of a matrix by the maximum amount that it stretches a vector when acting on it from the left.
One of the desirable properties that any induced matrix norm has is the sub-multiplicative property:

||A B|| \le ||A|| \, ||B||.
Example:
Let us compute the induced 1,2,\infty-norms for
The induced 1-norm is the maximum of the sum of absolute values of the entries in each column. For this matrix, the second column has a sum of 2 + 2 = 4, so |A|_1 = 4.
The induced \infty-norm is also easy to calculate: it is just the maximum of the sum of absolute values of each row's entries. For this matrix, the first row has the largest sum, so |A|_{\infty} = 1 + 2 = 3.
The induced 2-norm is more difficult to compute. One relatively simple way to compute it is through the formula |A|_2 = \max \sqrt{\lambda_i} where the \lambda_i are the eigenvalues of A^T A (or A A^T). As we will see in the section on the Singular Value Decomposition (SVD), the matrix A^T A is symmetric and always diagonalizable with an orthogonal basis of eigenvectors, so there are no major problems in computing its eigenvalues.
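These calculations are easy to check numerically; the sketch below uses an assumed 2 by 2 matrix for illustration:

```python
import numpy as np

# Assumed example matrix for illustration.
A = np.array([[1.0, 2.0],
              [0.0, 2.0]])

print(np.linalg.norm(A, 1))        # induced 1-norm: largest absolute column sum
print(np.linalg.norm(A, np.inf))   # induced infinity-norm: largest absolute row sum

# Induced 2-norm: square root of the largest eigenvalue of A^T A.
eigenvalues = np.linalg.eigvalsh(A.T @ A)
print(np.sqrt(eigenvalues.max()))
print(np.linalg.norm(A, 2))        # agrees with the eigenvalue formula
```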
The Frobenius Norm
While we usually want to use induced matrix norms, there are some situations where other norms are convenient. The most important of these is the Frobenius norm, which simply treats A as a vector:

||A||_F = \sqrt{\sum_{i,j} |A_{i,j}|^2},

although it can also be expressed in the more obviously invariant forms:

||A||_F = \sqrt{\mathrm{tr}(A^T A)} = \sqrt{\sum_i \sigma_i^2},
where tr denotes the trace of the matrix and the \sigma_i are the singular values.
The Frobenius norm is sub-multiplicative ( |A B|_F \le |A|_F |B|_F ).
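A quick numerical check that the entrywise, trace, and singular value expressions all agree (with an assumed matrix):

```python
import numpy as np

A = np.array([[2.0, -1.0],
              [1.0,  3.0]])

entrywise = np.sqrt((A**2).sum())                # treat A as a vector of entries
trace_form = np.sqrt(np.trace(A.T @ A))          # sqrt(tr(A^T A))
singular_form = np.sqrt((np.linalg.svd(A, compute_uv=False)**2).sum())

print(entrywise, trace_form, singular_form)      # all three agree
print(np.linalg.norm(A, 'fro'))                  # built-in Frobenius norm
```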
Equivalence of norms
Any two norms are bounded relative to each other, in that there are always constants C_1, C_2 (which depend on the norms) such that for norms a and b

C_1 ||x||_b \le ||x||_a \le C_2 ||x||_b \quad \text{for all } x.

For example, for some of the common p-norms on \mathbb{R}^n we have

||x||_2 \le ||x||_1 \le \sqrt{n} \, ||x||_2

and

||x||_{\infty} \le ||x||_2 \le \sqrt{n} \, ||x||_{\infty}.
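A numerical spot-check of these bounds on a random vector:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
x = rng.standard_normal(n)

one, two, inf = (np.linalg.norm(x, p) for p in (1, 2, np.inf))

print(two <= one <= np.sqrt(n) * two)   # ||x||_2 <= ||x||_1 <= sqrt(n) ||x||_2
print(inf <= two <= np.sqrt(n) * inf)   # ||x||_inf <= ||x||_2 <= sqrt(n) ||x||_inf
```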
Condition numbers
The (relative) condition number for computing f(x) from x is the ratio of the relative forward error (the change in f(x)) to the relative backward error (the change in x).
For linear problems, such as solving for x in Ax=b, this ratio of errors is bounded by \kappa(A) = |A| |A^{-1}|, which is often called the condition number of A. It does depend on the norm used; usually it is assumed to be the induced Euclidean norm (induced 2-norm).
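As a quick check (using an assumed 2 by 2 matrix), \kappa(A) in the induced 2-norm can be computed directly and compared with NumPy's built-in routine:

```python
import numpy as np

# Assumed example matrix for illustration.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

kappa = np.linalg.norm(A, 2) * np.linalg.norm(np.linalg.inv(A), 2)
print(kappa)                  # ||A||_2 * ||A^{-1}||_2
print(np.linalg.cond(A, 2))   # agrees, up to rounding
```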
Example: condition number for Hilbert matrices
The Hilbert matrices are an infamous example of a family of matrices with rapidly increasing condition number. The entries of an n by n Hilbert matrix are defined as H_{i,j} = \frac{1}{i+j-1} (if we start indexing the rows and columns at 1, so i,j \in \{ 1, \ldots, n\}).
For example, for n=4 the Hilbert matrix is

H_4 = \left(\begin{array}{cccc} 1 & \frac{1}{2} & \frac{1}{3} & \frac{1}{4} \\ \frac{1}{2} & \frac{1}{3} & \frac{1}{4} & \frac{1}{5} \\ \frac{1}{3} & \frac{1}{4} & \frac{1}{5} & \frac{1}{6} \\ \frac{1}{4} & \frac{1}{5} & \frac{1}{6} & \frac{1}{7} \end{array}\right),
which already has a 1-norm condition number of about 28,375. For the 20 by 20 Hilbert matrix the condition number is about 3.4 \cdot 10^{19}, and it is easy to lose all meaningful precision if using standard 64-bit floating point numbers with such a matrix.
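SciPy provides the Hilbert matrices directly, so these condition numbers are easy to reproduce (for n = 20 the computed value is itself polluted by rounding, so the exact digits will vary):

```python
import numpy as np
from scipy.linalg import hilbert

print(np.linalg.cond(hilbert(4), 1))    # 1-norm condition number, about 2.8e4

# For n = 20 the result is astronomically large, and the computation of the
# condition number itself suffers from the rounding problems it measures.
print(np.linalg.cond(hilbert(20), 1))
```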
Direct Linear Methods
The most elementary problem in linear algebra is to solve a linear system

A x = b,

where A is a square (n by n) matrix, x is an unknown vector, and b is a given vector. Solution techniques for relatively small (3 by 3 or so) systems have been known for thousands of years, appearing for example in the Chinese text Nine Chapters on the Mathematical Art. Newton described a more systematic method in the late 17th century, which is more or less what many people now call Gaussian elimination.
The method you are most likely to have learned in an introductory linear algebra course is often called Gauss-Jordan elimination. In this method, the augmented coefficient matrix (A|b) is converted to its reduced row-echelon form (rref) using the three fundamental row operations: (1) rescaling a row by a nonzero constant, (2) interchanging (swapping) two rows, and (3) adding a multiple of one row to another row.
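As a sketch of this procedure, using SymPy's exact rational arithmetic on an assumed small system:

```python
from sympy import Matrix

# Assumed system for illustration; its exact solution is x = (5, 3, -2).
A = Matrix([[1, 1, 1],
            [0, 2, 5],
            [2, 5, -1]])
b = Matrix([6, -4, 27])

augmented = A.row_join(b)             # the augmented coefficient matrix (A|b)
rref_matrix, pivots = augmented.rref()
print(rref_matrix)                    # last column holds the solution x
```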
Gaussian Elimination and the LU factorization
Often in applications the matrix A is fixed but the vector b is changed. In this case it is more efficient to compute a factorization of A which can be used to efficiently solve the system for any b. The factorization is usually written as PA = LU, although it could also be written as A = P^{-1} L U. Here P is a permutation matrix (so PA has the same rows as A, but possibly in a different order), L is a lower-triangular matrix, and U is upper-triangular.
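SciPy computes this factorization directly; note that scipy.linalg.lu returns factors with A = P L U, so its permutation matrix is the inverse (transpose) of the P in the PA = LU convention used here. A sketch with an assumed matrix:

```python
import numpy as np
from scipy.linalg import lu, lu_factor, lu_solve

# Assumed example matrix.
A = np.array([[2.0, 1.0, 1.0],
              [4.0, 3.0, 3.0],
              [8.0, 7.0, 9.0]])

P, L, U = lu(A)                       # A = P @ L @ U
print(np.allclose(P.T @ A, L @ U))    # so P.T plays the role of P in PA = LU

# For repeated solves with the same A, reuse the factorization.
factors = lu_factor(A)
for b in (np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])):
    x = lu_solve(factors, b)
    print(np.allclose(A @ x, b))      # True for each right-hand side
```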
Pivoting
The basic LU factorization cannot succeed in some cases because of the presence of zero entries. For example, any matrix whose (1,1) entry is zero immediately causes a problem, since we cannot eliminate the (2,1) entry using a multiple of the zero (1,1) entry.
More subtly, using multiples of small entries for elimination can lead to a serious loss of precision.
Example: loss of precision without pivoting
Consider a system Ax=b where the (1,1) entry of A is smaller than the (2,1) entry by a factor of about 10^{20}, and b = (1,0). If we eliminate the (2,1) entry by adding -10^{20} times row 1 to row 2, the factors L and U acquire entries of size about 10^{20}, and the rounding this forces destroys the accuracy of the computed solution.
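The sketch below reproduces this effect numerically, using an assumed matrix whose (1,1) entry is 10^{-20}:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

# Assumed matrix with a tiny (1,1) pivot; the exact solution is close to (-1, 1).
A = np.array([[1e-20, 1.0],
              [1.0,   1.0]])
b = np.array([1.0, 0.0])

# Elimination with no row interchange: the multiplier is about 1e20.
m = A[1, 0] / A[0, 0]
U = A.copy()
U[1, :] -= m * U[0, :]            # U[1, 1] = 1 - 1e20 rounds to -1e20
c = b.copy()
c[1] -= m * c[0]

x2 = c[1] / U[1, 1]
x1 = (c[0] - U[0, 1] * x2) / U[0, 0]
print(x1, x2)                     # the first component is completely wrong

# With partial pivoting (as SciPy does) the answer is accurate.
print(lu_solve(lu_factor(A), b))
```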
Example PA=LU decomposition
Let's look at a particular example of a PA=LU decomposition for a 3 by 3 matrix A:
With partial pivoting, we want to move the entry A_{2,1} = 4 to the upper left corner. We can do this with a permutation matrix
so now we have
Next we can add -1/2 times row 1 to row 2, and -1/4 times row 1 to row 3, to obtain two zeros in the first column. At that point we have:
Since the bottom row has a larger pivot value than the second row, we swap rows 2 and 3. This means our permutation matrix is now
and
Note that in the lower-triangular matrix we only swap the entries below the diagonal that have already been computed (the multipliers); the unit diagonal stays in place.
Finally, we need to add 1/2 of row 2 to row 3. This gives us
Example of pathological error in Gaussian elimination. (Trefethen and Bau.)
Some matrices cause problems even with partial pivoting. A worst-case example is the n by n matrix with ones on the diagonal, ones in the last column, and minus ones below the diagonal:

A = \left(\begin{array}{rrrrr} 1 & & & & 1 \\ -1 & 1 & & & 1 \\ -1 & -1 & 1 & & 1 \\ \vdots & & & \ddots & \vdots \\ -1 & -1 & \cdots & -1 & 1 \end{array}\right),

which has an LU decomposition (no row interchanges are needed) in which the last column of U doubles in each row, ending with 2^{n-1}. The exponential growth of the entries in the right-most column causes numerical precision issues.
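A sketch (assuming SciPy is available) that builds this matrix and shows the growth in the last column of U:

```python
import numpy as np
from scipy.linalg import lu

def worst_case(n):
    """Ones on the diagonal and in the last column, minus ones below the diagonal."""
    A = np.tril(-np.ones((n, n)), -1) + np.eye(n)
    A[:, -1] = 1.0
    return A

for n in (4, 8, 60):
    _, _, U = lu(worst_case(n))
    print(n, U[-1, -1])   # grows like 2^(n-1); for n = 60 this is about 5.8e17
```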
Partial pivoting algorithm in pseudocode
U = A,  L = I,  P = I
for k = 1 to m-1:
    select i >= k to maximize |u_{ik}|
    interchange u_{k,k:m} with u_{i,k:m},  l_{k,1:k-1} with l_{i,1:k-1},  and row p_k with row p_i
    for j = k+1 to m:
        l_{jk} = u_{jk} / u_{kk}
        u_{j,k:m} = u_{j,k:m} - l_{jk} u_{k,k:m}
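A direct NumPy translation of this pseudocode (a sketch, not an optimized implementation):

```python
import numpy as np

def lu_partial_pivoting(A):
    """Return P, L, U with P @ A = L @ U, using partial pivoting."""
    m = A.shape[0]
    U = A.astype(float).copy()
    L = np.eye(m)
    P = np.eye(m)
    for k in range(m - 1):
        # Select the row at or below k with the largest candidate pivot.
        i = k + np.argmax(np.abs(U[k:, k]))
        # Swap rows k and i: columns k: of U, columns :k of L, all of P.
        U[[k, i], k:] = U[[i, k], k:]
        L[[k, i], :k] = L[[i, k], :k]
        P[[k, i], :] = P[[i, k], :]
        for j in range(k + 1, m):
            L[j, k] = U[j, k] / U[k, k]
            U[j, k:] -= L[j, k] * U[k, k:]
    return P, L, U

A = np.random.default_rng(1).standard_normal((5, 5))
P, L, U = lu_partial_pivoting(A)
print(np.allclose(P @ A, L @ U))   # True
```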
Operation count: the factorization requires about \frac{2}{3}n^3 floating point operations for an n by n matrix.
Indirect (iterative) linear methods
For large, sparse systems, and for related sequences of linear problems, indirect (iterative) methods can be vastly superior to direct methods such as the PA=LU decomposition.
The simplest iterative method is the Jacobi method. The idea of this method is to split the matrix into its diagonal and off-diagonal parts: A = D + R. Then the equation Ax = b can be put into the iterative form

x_{k+1} = D^{-1} (b - R x_k).

This converges if the iteration matrix -D^{-1}R = I - D^{-1}A has spectral radius less than 1.
Example of Jacobi Method
Let's see how the Jacobi method works on a small example, where

A = \left(\begin{array}{rr} 2 & 1 \\ 1 & 3 \end{array}\right)

and b=(1,1)^T.
The exact value of x solving Ax=b is (\frac{2}{5},\frac{1}{5})^T. Suppose we start with a guess of x_0 = (1,1)^T. At each iteration step we will need

D^{-1} = \left(\begin{array}{rr} \frac{1}{2} & 0 \\ 0 & \frac{1}{3} \end{array}\right)

and

R = \left(\begin{array}{rr} 0 & 1 \\ 1 & 0 \end{array}\right).

The iteration step is

x_{k+1} = D^{-1} (b - R x_k) = \left(\begin{array}{c} \frac{1}{2}\left(1 - (x_k)_2\right) \\ \frac{1}{3}\left(1 - (x_k)_1\right) \end{array}\right).

The first few iterations yield x_1 = (0,0)^T, x_2 = (1/2, 1/3)^T, and after ten steps

x_{10} \approx (0.4001, 0.2001)^T,
which has a relative error of about 0.0003. For this example, the Jacobi method is ridiculously inefficient, but this is the opposite extreme from the large sparse matrix case for which it can be acceptable.
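A minimal sketch of the Jacobi iteration for this example, reproducing the numbers above:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 1.0])
exact = np.array([0.4, 0.2])

D_inv = np.diag(1.0 / np.diag(A))   # inverse of the diagonal part
R = A - np.diag(np.diag(A))         # off-diagonal part

x = np.array([1.0, 1.0])            # initial guess x_0
for _ in range(10):
    x = D_inv @ (b - R @ x)

print(x)                                                  # about (0.4001, 0.2001)
print(np.linalg.norm(x - exact) / np.linalg.norm(exact))  # about 3e-4
```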
In the Gauss-Seidel method, the matrix is split into a lower-triangular and an upper-triangular part: A = (D + L) + U; the lower-triangular part D + L contains the diagonal of A, while U has all zeros on its diagonal. Then we use the iterative scheme:

(D + L) x_{k+1} = b - U x_k,
which can be efficiently solved by forward substitution without explicitly inverting the matrix (D+L).
For the previous example, with A = \left(\begin{array}{rr} 2 & 1 \\ 1 & 3 \end{array}\right), the Gauss-Seidel method has a relative error of about 10^{-8} after ten iterations - much better than the Jacobi method.
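The corresponding Gauss-Seidel sketch, using a triangular solve (forward substitution) for the lower-triangular system:

```python
import numpy as np
from scipy.linalg import solve_triangular

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 1.0])
exact = np.array([0.4, 0.2])

DL = np.tril(A)      # D + L: lower triangle including the diagonal
U = np.triu(A, 1)    # strictly upper triangle

x = np.array([1.0, 1.0])
for _ in range(10):
    # Solve (D + L) x_{k+1} = b - U x_k by forward substitution.
    x = solve_triangular(DL, b - U @ x, lower=True)

# The error is orders of magnitude smaller than for Jacobi after ten steps.
print(np.linalg.norm(x - exact) / np.linalg.norm(exact))
```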
Successive Over-Relaxation
For some problems convergence can be improved by moving a little further in the direction of the updated solution. The simplest version of this type of method is to use the Gauss-Seidel method with a weight w applied to the old and new estimates, i.e.

\tilde{x}_{k+1} = (D + L)^{-1} (b - U x_k),

x_{k+1} = (1 - w) x_k + w \, \tilde{x}_{k+1}.

These two steps can be combined into one update for a more efficient implementation.
When w > 1 these methods are called Successive Over-Relaxation (SOR) methods.
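A minimal sketch of this weighted update, re-using the 2 by 2 system from the Jacobi example with a few illustrative values of w:

```python
import numpy as np
from scipy.linalg import solve_triangular

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 1.0])
exact = np.array([0.4, 0.2])
DL, U = np.tril(A), np.triu(A, 1)

def weighted_gauss_seidel(w, steps=10):
    x = np.array([1.0, 1.0])
    for _ in range(steps):
        x_gs = solve_triangular(DL, b - U @ x, lower=True)  # Gauss-Seidel update
        x = (1 - w) * x + w * x_gs                          # blend old and new with weight w
    return x

# The error after a fixed number of steps depends noticeably on w.
for w in (1.0, 1.1, 1.2):
    print(w, np.linalg.norm(weighted_gauss_seidel(w) - exact))
```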
Example of SOR
Let's look at what happens if we apply the Gauss-Seidel method with SOR to a problem Ax = b. The number of iterations needed for convergence depends on the value of w; for this particular problem the optimal weight is around w = 1.12.
Optimal weight for SOR
In general there is no formula for the weight used in SOR, but there are a variety of special cases that have been studied. One approach is to perform some number of iterations with w=1, and then switch to a new w_o given at the kth iteration by
where p is a small positive integer.
Exercises
- Solve the equation for x using elimination on the augmented coefficient matrix:
  A x = \left(\begin{array}{rrr} 1 & 0 & 1 \\ 1 & 2 & -1 \\ -2 & -2 & 2 \end{array}\right) x = \left ( \begin{array}{r} -3 \\ -1 \\ 4 \\ \end{array} \right )
- Compute the norm of the matrix
  A = \left ( \begin{array}{rr} 1 & -2 \\ 3 & 4 \end{array} \right )
  using
  - The induced 1-norm
  - The induced 2-norm
  - The induced \infty-norm
  - The Frobenius norm
- In general the p-norm of a vector x is defined as \left ( \sum_i |x_i|^p \right )^{1/p}, for 1 \le p \le \infty. Show that p=\frac{1}{2} does not define a norm through this formula by finding a counter-example to the triangle inequality.
- If we combine the 1-norm and \infty-norm by averaging, is the result f(x) = \frac{||x||_1 + ||x||_{\infty}}{2} a norm? Explain why or why not.
- Find P, L, and U for the PA = LU factorization (using partial pivoting) of
  A = \left(\begin{array}{rrr} 0 & 1 & 0 \\ 1 & 0 & 2 \\ -2 & 1 & 0 \end{array}\right).
- Find a permutation matrix P such that if
  B = \left(\begin{array}{rrrr} 1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8 \\ 9 & 10 & 11 & 12 \\ 13 & 14 & 15 & 16 \end{array}\right) \ \ \ \text{then} \ \ \ \ PBP^{-1} = \left(\begin{array}{rrrr} 1 & 3 & 2 & 4 \\ 9 & 11 & 10 & 12 \\ 5 & 7 & 6 & 8 \\ 13 & 15 & 14 & 16 \end{array}\right)
- Compute the relative condition number of
  - f(x) = \sqrt{x}.
  - f(x) = x^2.
  - f(x) = x_1 - x_2 (where x = \left ( \begin{array}{c} x_1 \\ x_2 \end{array} \right )), using the \infty-norm.
- Starting with the initial guess x = \left ( \begin{array}{c} 0 \\ 0 \end{array} \right ), use four iterates of the Jacobi method to approximate the solution of Ax=b if A = \left(\begin{array}{rr} 2.0 & 1.0 \\ -1.0 & 3.0 \end{array}\right) and b =\left ( \begin{array}{c} 3 \\ 2 \end{array} \right ). Is this converging to the correct answer?
- Sketch in the a-b plane the set of values for which the Jacobi method will converge for the matrix A = \left ( \begin{array}{cc}a & b \\ b & a \end{array} \right).
- Repeat the above exercise for the Gauss-Seidel method.