This repository holds materials for the second 6-unit "spoke" half of a new MIT course (Spring 2025) introducing numerical methods and numerical analysis to a broad audience. 18.S097/16.S092 covers large-scale linear algebra: what do you do when the matrices get so huge that you probably can't even store them as ordinary "dense" arrays of numbers, much less apply the familiar dense-matrix algorithms?
- Prerequisites: 18.03, 18.06, or equivalents, and some programming experience. You should have taken the first half-semester numerical "hub" 18.S190/16.S090, or alternatively any other introductory numerical-methods course (e.g. 18.330, 18.C25, 18.C20, 16.90, 2.086, 12.010, 6.7330J/16.920J/2.097J, or 6.S955).
Taking both the hub and any spoke will count as an 18.3xx class for math majors, similar to 18.330, and as 16.90 for course-16 majors.
Instructor: Prof. Steven G. Johnson.
Lectures: MWF10 in 2-142 (Mar 31 – May 12), slides and notes posted below. Lecture videos posted in Panopto Video on Canvas.
Grading (all assignments submitted electronically via Gradescope on Canvas):
- 50%: 4 weekly psets, due Fridays at midnight on April 11, 18, 25, and May 2.
- 50% final project: due May 12 (last day of class). The final project will be an 8–15 page paper reviewing, implementing, and testing some interesting numerical linear-algebra algorithm not covered in the course. A 1-page proposal will be due April 18. See final-project/proposal information.
Collaboration policy: Talk to anyone you want to and read anything you want to, with two caveats: First, make a solid effort to solve a problem on your own before discussing it with classmates or googling. Second, no matter whom you talk to or what you read, write up the solution on your own, without having their answer in front of you (this includes ChatGPT and similar). (You can use psetpartners.mit.edu to find problem-set partners.)
Teaching Assistant: Mo Chen
Office Hours: Wednesday 4pm in 2-345 (Prof. Johnson)
Resources: Piazza discussion forum, math learning center, TSR^2 study/resource room, pset partners.
Textbook: No required textbook, but suggestions for further reading will be posted after each lecture. The book Fundamentals of Numerical Computation (FNC) by Driscoll and Braun is freely available online, has examples in Julia, Python, and Matlab, and is a valuable resource. Another important textbook for the course is Numerical Linear Algebra by Trefethen and Bau. (Readable online with MIT certificates, and there is also a PDF posted online at uchicago, though this is a graduate-level textbook and hence is somewhat more sophisticated than the coverage in this course.)
This document is a brief summary of what was covered in each lecture, along with links and suggestions for further reading. It is not a good substitute for attending lecture, but may provide a useful study guide.
- Overview and syllabus (this web page).
- Handwritten notes
- Julia notebook with scaling examples
- Lecture video (MIT only): panopto video on canvas
Reviewed the fact that traditional "dense" linear-algebra algorithms (LU, QR, diagonalization, SVD, and other factorizations), which assume little or no special structure of the matrix, typically require Θ(m³) arithmetic operations and Θ(m²) storage for an m×m matrix.
However, for huge matrices (m in the millions or more), Θ(m²) storage and Θ(m³) computation quickly become impractical.
The trick is that huge matrices typically have some special structure that you can exploit, and the most common such structure is sparsity: the matrix entries are mostly zero. Ideally, an m×m sparse matrix has only Θ(m) nonzero entries, so that storing it, and multiplying it by a vector, costs only Θ(m).
Showed how a sparse matrix, in fact a symmetric tridiagonal matrix, arises from discretizing a simple PDE on a grid with finite differences: Poisson's equation −u″(x) = f(x) with u(0) = u(L) = 0 boundary conditions, whose center-difference discretization gives a tridiagonal matrix with 2's on the diagonal and −1's just above and below, scaled by 1/Δx² for grid spacing Δx.
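For concreteness, here is a minimal Julia sketch (not the course notebook; the names and grid setup are my own, assuming m interior points with spacing Δx = 1/(m+1) and Dirichlet boundaries) of assembling and solving this tridiagonal system as a sparse matrix:

```julia
using SparseArrays, LinearAlgebra

# second-order center-difference discretization of -u'' = f on m interior
# grid points with u(0) = u(1) = 0:  A is tridiagonal with 2/Δx² on the
# diagonal and -1/Δx² just above/below, stored as a sparse matrix.
m = 100
Δx = 1 / (m + 1)
A = spdiagm(-1 => fill(-1.0, m - 1), 0 => fill(2.0, m), 1 => fill(-1.0, m - 1)) / Δx^2

f = ones(m)   # right-hand side f(x) = 1
u = A \ f     # Julia dispatches to a sparse solve
```

Only the ≈3m nonzero entries are stored, so both the storage and a matrix–vector multiply scale as Θ(m).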
How can we generalize this to other sparsity patterns and other types of large-scale matrix structures?
Further reading: FNC book section 8.1: sparsity and structure. The example of discretizing a 1d Poisson equation with finite differences, resulting in a tridiagonal matrix, is discussed in many sources, e.g. these UTexas course notes.
- Sparse matrices and data structures
- sparse-matrix slides from 18.335 (Fall 2006)
- Julia notebook on dense and sparse matrices from 18.06 (Fall 2022)
- sparse factorization and nested-dissection examples
- pset 1: due Friday, April 11
Sparse-direct solvers: For many problems, there is an intermediate between the dense Θ(m³) solvers of LAPACK and iterative algorithms: for a sparse matrix A, we can sometimes perform an LU or Cholesky factorization while maintaining sparsity, storing and computing only nonzero entries for vast savings in storage and work. One key observation is that the fill-in only depends on the pattern of the matrix, which can be interpreted as a graph: m vertices, and edges for the nonzero entries of A (an adjacency matrix of the graph), and sparse-direct algorithms are closely related to graph-theory problems. How efficient the sparse-direct methods are depends on how easy it is to partition the graph by chopping it into pieces, and this is easier for matrices that come from low-dimensional meshes (e.g. discretized low-dimensional PDEs). 1d meshes are best (giving banded matrices with linear complexity), 2d meshes are still pretty good, and 3d meshes start to become challenging. See the scalings in the handout, which are derived in the Davis book below.
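A small illustration (a sketch using Julia's built-in SuiteSparse wrappers; the matrix is just the 1d Poisson example again, so the factors stay banded): factorize a sparse matrix and count nonzeros to see the fill-in.

```julia
using SparseArrays, LinearAlgebra

m = 1000
Δx = 1 / (m + 1)
A = spdiagm(-1 => fill(-1.0, m - 1), 0 => fill(2.0, m), 1 => fill(-1.0, m - 1)) / Δx^2

F = lu(A)                       # sparse LU (UMFPACK) with a fill-reducing ordering
@show nnz(A)                    # ≈ 3m nonzeros in A itself
@show nnz(F.L) + nnz(F.U)       # nonzeros in the sparse factors (fill-in)

C = cholesky(Symmetric(A))      # sparse Cholesky (CHOLMOD); A is SPD here
b = rand(m)
x = C \ b                       # the factorization is re-usable: each solve is cheap
```

For this banded (1d-mesh) matrix the factors remain banded and the cost is linear in m; for matrices from 2d or 3d meshes the fill-in grows, as in the scalings from the handout.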
Further reading: The book Direct Methods for Sparse Linear Systems by Davis is a useful reference.
Iterative methods: the big picture is to solve Ax=b (or Ax=λx, or similar) using only a fast "black box" routine that multiplies A by an arbitrary vector, producing a sequence of gradually improving approximate solutions.
- Pro: can be fast whenever multiplying by A is fast (e.g. if A is sparse, low-rank, Toeplitz, etc.). Can scale to huge problems.
- Con: hard to design an iteration that converges quickly; how well the methods work is often problem-dependent, and it often requires problem-dependent tuning and experimentation (e.g. preconditioners).
The simplest iterative method is the power method for eigenvalues: repeatedly multiplying a random initial vector by A (and normalizing), which converges to an eigenvector corresponding to the largest-magnitude eigenvalue (if there is a single such eigenvalue).
Analyzed the convergence of the power method: if the eigenvalues are sorted by magnitude |λ₁| > |λ₂| ≥ ⋯, the error in the eigenvector decreases proportionally to |λ₂/λ₁|ⁿ after n iterations, which is slow if the two largest-magnitude eigenvalues are close.
Given an estimate x of an eigenvector, a good estimate of the corresponding eigenvalue is the Rayleigh quotient R(x) = xᵀAx / xᵀx (for real A; more generally, replace the transpose by the conjugate transpose).
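A minimal sketch of the power method with a Rayleigh-quotient eigenvalue estimate (my own illustration, assuming a real-symmetric A; the lecture notebook may differ):

```julia
using LinearAlgebra

# power iteration: estimate the largest-|λ| eigenpair of A
function powermethod(A, x = randn(size(A, 1)); iters = 200)
    for _ in 1:iters
        x = A * x
        x /= norm(x)                  # normalize to avoid overflow/underflow
    end
    λ = dot(x, A * x) / dot(x, x)     # Rayleigh-quotient estimate of λ
    return λ, x
end

A = Symmetric(randn(50, 50))          # random real-symmetric test matrix
λ, x = powermethod(A)
λs = eigvals(A)
@show λ - λs[argmax(abs.(λs))]        # compare to the true largest-|λ| eigenvalue
```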
To find other eigenvectors and eigenvalues, one possibility is an algorithm called deflation. It exploits the fact that for real-symmetric A the eigenvectors are orthogonal, so if we project each iterate to be orthogonal to the previously found eigenvector(s), the power method converges to the largest-|λ| eigenvector orthogonal to those, i.e. the next-largest |λ|.
Deflation is a terrible scheme if you want the smallest-magnitude eigenvalue, however, since you'd have to compute all the other eigenvalues/vectors first. Instead, to find the smallest |λ| one can simply apply the power method to A⁻¹ ("inverse iteration"), whose largest-|λ| eigenvalue is the reciprocal of the smallest-|λ| eigenvalue of A. Rather than computing A⁻¹ explicitly, at each step one solves the linear system Ax₊ = x for the next iterate (e.g. re-using a single factorization of A).
Further reading: FNC book section 8.2: power iteration and section 8.3: inverse iteration. Trefethen & Bau, lecture 27. See this notebook on the power method from 18.06.
Proved that, for a real-symmetric matrix A=Aᵀ, the Rayleigh quotient R(x)=xᵀAx/xᵀx is bounded above and below by the largest and smallest eigenvalues of A (the "min–max theorem"). This theorem is useful for lots of things in linear algebra. Here, it helps us understand why the Rayleigh quotient is so accurate: the power method converges to a maximum-|λ| eigenvalue, which is either the smallest (most negative) or the largest (most positive) λ of a real-symmetric A, and hence that λ is an extremum (minimum or maximum) of the Rayleigh quotient where its gradient is zero. In fact, you can show that ∇R=0 for any eigenvector (not necessarily min/max λ). This means, if we Taylor expand R(x+δx) around an eigenvector x where R(x)=λ, you get R(x+δx)=λ+O(‖δx‖^2): the error in the eigenvalue λ goes as the square of the error in the eigenvector (for real-symmetric A).
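In equations, the gradient calculation sketched above (for real-symmetric A) is:

```latex
R(x) = \frac{x^T A x}{x^T x}, \qquad
\nabla R(x) = \frac{2}{x^T x}\bigl(A x - R(x)\, x\bigr),
```

which vanishes exactly when Ax = R(x)x, i.e. at any eigenvector; Taylor expanding around an eigenvector x with R(x) = λ then gives R(x + δx) = λ + O(‖δx‖²).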
Last time, we considered inverse iteration. A more general idea is shifted inverse iteration: at each step, solve (A − μI)x₊ = x (and normalize), which converges to the eigenvector whose eigenvalue λ is closest to the shift μ, since 1/(λ−μ) is then the largest-magnitude eigenvalue of (A − μI)⁻¹. The closer μ is to λ, the faster the convergence.
But where would you get a good guess for λ? A simple answer is to use the Rayleigh quotient R(x), where x comes from previous steps of the power iteration. Even if the power iteration is converging slowly, once you have even a rough approximation for λ you can use it as a shift. This leads to the algorithm of Rayleigh-quotient iteration: at each step, compute the shift μ = R(x) from the current iterate x and then take one step of shifted inverse iteration, solving (A − μI)x₊ = x. Since the shift changes every step, each step requires a new factorization, but the payoff is extremely rapid convergence once the shift is close to an eigenvalue.
The only problem with Rayleigh-quotient iteration is the need for a good initial guess — if you have a bad initial guess, it can be quite unpredictable which eigenvalue it converges to! But any time you can obtain a rough idea of where the desired eigenvalue is, you can zoom into the exact value extremely quickly.
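A minimal Julia sketch of Rayleigh-quotient iteration (my own illustration, assuming a real-symmetric A; note that the shifted matrix must be re-factorized at every step, and there is no safeguard here for an exactly singular shift):

```julia
using LinearAlgebra

# Rayleigh-quotient iteration: shifted inverse iteration with the shift
# updated to the Rayleigh quotient at each step.
function rqi(A, x; iters = 10)
    x = x / norm(x)
    μ = dot(x, A * x)             # initial Rayleigh-quotient shift
    for _ in 1:iters
        x = (A - μ * I) \ x       # one step of shifted inverse iteration
        x /= norm(x)
        μ = dot(x, A * x)         # updated eigenvalue estimate
    end
    return μ, x
end

A = Symmetric(randn(100, 100))
μ, x = rqi(A, randn(100))
@show norm(A * x - μ * x)         # residual of the computed eigenpair
```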
Further reading: FNC book section 8.3: inverse iteration; however, beware that the book currently shows a less accurate (for real-symmetric/Hermitian A) method to estimate eigenvalues (issue fnc#16). Trefethen & Bau, lecture 27 covers these algorithms in much more depth. These slides by Per Persson (2006) are a useful summary.
- pset 1 solutions
- pset 2, due April 18
Introduced Krylov subspaces, and the idea of Krylov subspace methods: ideally, we want to find the "best" solution in the whole subspace 𝒦ₙ spanned by {x₀,Ax₀,...,Aⁿ⁻¹x₀}, which is the only subspace you can get starting from x₀ if you are only allowed linear operations and matrix–vector products.
The Arnoldi algorithm is a Krylov algorithm for eigenproblems. It basically has two components:
- Find an orthonormal basis Qₙ for 𝒦ₙ. Essentially, we will do this by a form of Gram–Schmidt, to be determined.
- Given the basis, give the "best" estimate in 𝒦ₙ for one or more eigenvectors and eigenvalues.
Discussed what it means to find the "best" solution in the Krylov subspace 𝒦ₙ. Discussed the general principle of Rayleigh–Ritz methods for approximately solving the eigenproblem in a subspace: finding the Ritz vectors/values (= eigenvector/value approximations) with a residual perpendicular to the subspace (a special case of a Galerkin method).
For Hermitian matrices A, I showed that the max/min Ritz values are the maximum/minimum of the Rayleigh quotient in the subspace, via the min–max theorem. In this sense, at least for Hermitian matrices, the Ritz vectors are optimal in the sense of maximizing (or minimizing) the Rayleigh quotient in the Krylov space. Another sense in which they are optimal for Hermitian A is that the Ritz vectors/values minimize ‖AV-VD‖₂ over all possible orthonormal bases V of the Krylov space and all possible diagonal matrices D (see the Bai notes below for a proof). (Later, we will discuss an "optimal polynomial" interpretation of Arnoldi+Rayleigh–Ritz from Trefethen lecture 34.)
Further reading: FNC book, section 8.4 on Krylov subspaces and Arnoldi. Trefethen lecture 33 on Arnoldi. This 2009 course on numerical linear algebra by Zhaojun Bai has useful notes on Krylov methods, including a discussion of the Rayleigh–Ritz procedure.
How do we construct the orthonormal basis Qₙ of the Krylov space? The naive approach, computing the vectors x₀, Ax₀, A²x₀, … and then orthogonalizing via Gram–Schmidt, is numerically disastrous, because the power iterates Aᵏx₀ become more and more nearly parallel (they all converge to the dominant eigenvector). Instead, the Arnoldi process applies Gram–Schmidt as it goes: at each step it multiplies the most recent basis vector qₙ by A and orthogonalizes Aqₙ against q₁,…,qₙ to obtain qₙ₊₁, which spans the same Krylov space but stays well conditioned.
Moreover, showed that the dot products taken during this Gram–Schmidt process are exactly the entries of our Rayleigh–Ritz matrix Hₙ = Qₙ*AQₙ, which is "upper Hessenberg" (zero below the first subdiagonal). That is, Arnoldi gives us the Ritz matrix essentially for free as a byproduct of the orthogonalization, in the form AQₙ = Qₙ₊₁H̃ₙ, where H̃ₙ is the (n+1)×n upper-Hessenberg matrix whose first n rows are Hₙ.
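A simplified Arnoldi sketch in Julia (along the lines of the notebook above, but written from scratch here; it returns Qₙ₊₁ and the (n+1)×n upper-Hessenberg matrix H̃ₙ, with no handling of "breakdown" when the Krylov space stops growing):

```julia
using LinearAlgebra

# n steps of Arnoldi: Q has n+1 orthonormal columns spanning the Krylov space
# of A and b, and H is (n+1)×n upper-Hessenberg with A*Q[:,1:n] ≈ Q*H.
function arnoldi(A, b, n)
    m = length(b)
    Q = zeros(m, n + 1)
    H = zeros(n + 1, n)
    Q[:, 1] = b / norm(b)
    for j in 1:n
        v = A * Q[:, j]
        for i in 1:j                       # modified Gram–Schmidt
            H[i, j] = dot(Q[:, i], v)
            v -= H[i, j] * Q[:, i]
        end
        H[j+1, j] = norm(v)
        Q[:, j+1] = v / H[j+1, j]
    end
    return Q, H
end

A = randn(200, 200)
Q, H = arnoldi(A, randn(200), 30)
ritzvals = eigvals(H[1:30, 1:30])          # Ritz values: estimates of extremal eigenvalues of A
```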
Closed by showing some experimental results with a very simple implementation of the Arnoldi algorithm (see notebook above). Arnoldi indeed converges much faster than power iterations, and can give multiple eigenvalues at once. Like the power method, convergence is slower if the desired eigenvalues are clustered closely with undesired ones. Unlike the power method, it can converge not just to the largest |λ| but to any desired "edge" of the set of eigenvalues (the "spectrum"), e.g. the λ with the most positive or most negative real parts. Unlike the power method, the convergence of the Arnoldi algorithm is shift-invariant: it is the same for A as for A + σI with any shift σ.
Further reading: for Gram–Schmidt, see e.g. Strang Intro to Linear Algebra, chapter 4, and Strang 18.06 lecture 17. Modified Gram–Schmidt is analyzed in Trefethen lecture 8, and a detailed analysis with proofs can be found in e.g. this 2006 paper by Paige et al. [SIAM J. Matrix Anal. Appl. 28, pp. 264-284]. See also Per Persson's 18.335 slides on Gram–Schmidt. See also the links on Arnoldi from last lecture.
Showed that in the case where A is Hermitian, Hₙ is Hermitian as well, so Hₙ is tridiagonal and most of the dot products in the Arnoldi process are zero. Hence Arnoldi reduces to a three-term recurrence, and the Ritz matrix is tridiagonal. This is called the Lanczos algorithm.
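A sketch of the resulting Lanczos recurrence (my own illustration; no re-orthogonalization, so in floating point it will eventually exhibit exactly the loss of orthogonality discussed below):

```julia
using LinearAlgebra

# n steps of Lanczos for real-symmetric/Hermitian A: a three-term recurrence
# producing the tridiagonal Ritz matrix T = SymTridiagonal(α, β).
function lanczos(A, b, n)
    α = zeros(n)
    β = zeros(n - 1)
    q = b / norm(b)
    qold = zero(q)
    for j in 1:n
        v = A * q
        α[j] = dot(q, v)
        v -= α[j] * q                     # orthogonalize against qⱼ ...
        j > 1 && (v -= β[j-1] * qold)     # ... and qⱼ₋₁ (all that's needed in exact arithmetic)
        if j < n
            β[j] = norm(v)
            qold, q = q, v / β[j]
        end
    end
    return SymTridiagonal(α, β)
end

A = Symmetric(randn(300, 300))
T = lanczos(A, randn(300), 50)
@show maximum(eigvals(T)) - maximum(eigvals(A))   # extremal Ritz value vs. true extremal λ
```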
Showed some computational examples (notebook above) of Arnoldi convergence. Discussed how rounding problems cause a loss of orthogonality in Lanczos, leading to "ghost" eigenvalues where extremal eigenvalues re-appear. In Arnoldi, we explicitly store all n basis vectors and orthogonalize each new vector against all of them, so loss of orthogonality is not an issue, but the price is that both the storage and the work per step grow with n.
A solution to the loss of orthogonality in Lanczos and the growing computational effort in Arnoldi, along with the growing storage, is restarting schemes, where we go for n steps and then restart with the k "best" eigenvectors. If we restart with k=1 every step, then we essentially have the power method, so while restarting makes the convergence worse, the algorithm still converges, and it converges significantly faster than the power method for n>1.
Further reading: Trefethen, lecture 36. See the section on implicitly restarted Lanczos in Templates for the Solution of Algebraic Eigenvalue Problems. Restarting schemes for Arnoldi (and Lanczos) turn out to be rather subtle — you first need to understand why the most obvious idea (simply restarting with the best k Ritz vectors found so far) works poorly, and the "implicitly restarted" schemes that fix this take some care to get right.
- Handwritten notes
- GMRES for Ax=b
- project proposals due
- pset 2 solutions
- pset 3: due Saturday 4/26 at midnight
There are many other eigensolver algorithms besides Arnoldi; the choice of algorithm depends strongly on the properties of the matrix and the desired eigenvalues. For Hermitian/real-symmetric problems, a powerful algorithm is LOBPCG, a specialized algorithm for minimizing or maximizing the Rayleigh quotient. There is also a remarkable class of algorithms based on the residue theorem of complex analysis, which allow you to efficiently extract all eigenvalues within a specified region of the complex plane; a prominent version of this is FEAST.
Introduced the GMRES algorithm: compute the basis Qₙ for 𝒦ₙ as in Arnoldi, but then minimize the residual ‖Ax-b‖₂ for x∈𝒦ₙ using this basis. This yields a small (n+1)×n least-squares problem involving Hₙ.
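A compact, self-contained GMRES sketch (my own illustration, with x₀ = 0, a fixed number of steps, and no restarting or convergence checks):

```julia
using LinearAlgebra, SparseArrays

# n steps of (unrestarted) GMRES for Ax = b with x₀ = 0: build the Arnoldi basis
# Q and Hessenberg H̃ₙ, then minimize ‖Ax - b‖₂ = ‖H̃ₙ y - ‖b‖e₁‖₂ over y.
function gmres_basic(A, b, n)
    m = length(b)
    Q = zeros(m, n + 1)
    H = zeros(n + 1, n)
    Q[:, 1] = b / norm(b)
    for j in 1:n                          # Arnoldi via modified Gram–Schmidt
        v = A * Q[:, j]
        for i in 1:j
            H[i, j] = dot(Q[:, i], v)
            v -= H[i, j] * Q[:, i]
        end
        H[j+1, j] = norm(v)
        Q[:, j+1] = v / H[j+1, j]
    end
    y = H \ [norm(b); zeros(n)]           # small (n+1)×n least-squares problem
    return Q[:, 1:n] * y                  # x = Qₙ y
end

A = I + 0.1 * sprandn(400, 400, 0.01)     # test matrix with eigenvalues clustered near 1
b = randn(400)
x = gmres_basic(A, b, 40)
@show norm(A * x - b) / norm(b)           # relative residual after 40 steps
```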
Further reading: The book Templates for the Solution of Algebraic Eigenvalue Problems (2000) gives a number of methods for various types of eigenproblems, but active research in this area continues. For GMRES, see FNC section 8.5 and Trefethen, lecture 35. In class, I showed a plot from GMRES applied to deconvolution, from this tutorial blog post.
- Handwritten notes
- GMRES for Ax=b
Like Arnoldi/Lanczos, if GMRES does not converge quickly we must generally restart it, usually with a subspace of dimension 1; restarting GMRES repeatedly after k steps is called GMRES(k). Unfortunately, unlike Arnoldi for the largest |λ|, restarted GMRES is not guaranteed to converge. If it doesn't converge, or if it simply converges slowly, we must do something to speed up convergence: preconditioning (next time).
The solution to this problem is preconditioning: finding an (easy-to-compute) M such that MA (left preconditioning) or AM (right preconditioning) has clustered eigenvalues (solving MAx=Mb or AMy=b with x=My, respectively). Essentially, one can think of M as a crude approximation for A⁻¹ (or the inverse of a crude approximation of A that is easy to invert). Brief summary of some preconditioning ideas: multigrid, incomplete LU/Cholesky, Jacobi/block-Jacobi. (Since Jacobi preconditioners only have short-range interactions, they tend not to work well for matrices that come from finite-difference/finite-element discretizations on grids, but they can work well for diagonally dominant matrices that arise in spectral and integral-equation methods.)
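A tiny illustration of left preconditioning with a Jacobi (diagonal) preconditioner, re-using the `gmres_basic` sketch above; the test matrix here (diagonally dominant with a widely varying diagonal, the kind of case where Jacobi helps) is my own invention:

```julia
using LinearAlgebra, SparseArrays

# a diagonally dominant test matrix with a huge spread in its diagonal entries
m = 500
A = Diagonal(10 .^ range(0, 5, length = m)) + sprandn(m, m, 5 / m) / 4
b = randn(m)

M = Diagonal(1 ./ Vector(diag(A)))       # Jacobi preconditioner: crude approximation of A⁻¹

x_plain = gmres_basic(A, b, 30)          # GMRES on A x = b
x_prec  = gmres_basic(M * A, M * b, 30)  # GMRES on the left-preconditioned system M A x = M b
@show norm(A * x_plain - b) / norm(b)    # relative residual, unpreconditioned
@show norm(A * x_prec  - b) / norm(b)    # relative residual, Jacobi-preconditioned
```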
To get a more precise understanding of how GMRES (and other Krylov methods) converge, started transforming it to a problem of "polynomial fitting" — it turns out that the GMRES residual after n steps (starting from x₀ = 0) is rₙ = p(A)b, where p is the degree-≤n polynomial with p(0)=1 that minimizes ‖p(A)b‖₂. So GMRES converges quickly precisely when some low-degree polynomial with p(0)=1 can be made small on the spectrum of A.
One useful trick that we needed was based on a property of induced norms. Recall that the induced norm ‖B‖₂ = max over x≠0 of ‖Bx‖₂/‖x‖₂, which implies ‖Bx‖₂ ≤ ‖B‖₂‖x‖₂ and ‖BC‖₂ ≤ ‖B‖₂‖C‖₂. If A = XΛX⁻¹ is diagonalizable, then p(A) = X p(Λ) X⁻¹, so ‖p(A)‖₂ ≤ κ(X) maxₖ |p(λₖ)|, where κ(X) = ‖X‖₂‖X⁻¹‖₂ is the condition number of the eigenvector matrix.
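In displayed form, the standard bound this gives (for diagonalizable A = XΛX⁻¹, with x₀ = 0 so that the initial residual is b) is:

```latex
\frac{\lVert r_n \rVert_2}{\lVert b \rVert_2}
\;\le\; \min_{\substack{p \in \mathcal{P}_n \\ p(0) = 1}} \lVert p(A) \rVert_2
\;\le\; \kappa(X)\, \min_{\substack{p \in \mathcal{P}_n \\ p(0) = 1}} \max_k \lvert p(\lambda_k) \rvert ,
```

where rₙ is the GMRES residual after n steps, 𝒫ₙ denotes polynomials of degree ≤ n, and κ(X) = ‖X‖₂‖X⁻¹‖₂.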
Further reading (preconditioning): FNC section 8.8 on preconditioning and Trefethen, lecture 40. Templates for the Solution of Linear Systems, chapter on preconditioners, and e.g. Matrix Preconditioning Techniques and Applications by Ke Chen (Cambridge Univ. Press, 2005). For Hermitian A, we can also specialize the GMRES algorithm analogous to Lanczos, giving MINRES and SYMMLQ: Differences in the effects of rounding errors in Krylov solvers for symmetric indefinite linear systems by Sleijpen (2000); see also van der Vorst notes from Lecture 22 and the Templates book.
Further reading (GMRES and polynomials): Trefethen, lectures 34, 35. The convergence of GMRES for very non-normal matrices is a complicated subject; see e.g. this paper on GMRES for defective matrices or this paper surveying different convergence estimates. Regarding convergence problems with GMRES, see this 2002 presentation by Mark Embree on Restarted GMRES dynamics. Cullum (1996) reviews the relationship between GMRES and a similar algorithm called FOM that is more Galerkin-like (analogous to Rayleigh–Ritz rather than least-squares).
- Handwritten notes
- pset 3 due tomorrow, solutions coming soon
- pset 4: coming soon
Continued discussion of polynomial viewpoint on GMRES and Arnoldi convergence, from last lecture. Some key points:
- GMRES works best if the eigenvalues are mostly clustered together (like those of the identity matrix, which are all 1). Preconditioning tries to improve this.
- Because of the p(0)=1 constraint on the GMRES polynomial, convergence of GMRES for A is not the same as for a shifted matrix A + σI. In particular, as the matrix becomes more ill-conditioned, i.e. one eigenvalue gets close to zero relative to the biggest |λ|, GMRES convergence slows. Preconditioning (as well as other efforts in reformulating the origin of the matrix) tries to make the matrix well-conditioned.
- Arnoldi's analysis is similar (see Trefethen), but its polynomial has the n-th coefficient equal to 1, rather than the 0-th coefficient. This makes (unrestarted) Arnoldi shift-invariant: it converges equally well for A and A + σI.
- In Arnoldi, the Ritz values (eigenvalue estimates) are precisely the roots of the optimal polynomial. This means that Arnoldi works best if the desired eigenvalues are extremal (on the edges of the spectrum, e.g. the most positive or most negative real or imaginary parts, or biggest magnitudes) and are not clustered with many undesired eigenvalues. Shift-and-invert, i.e. applying Arnoldi to (A − σI)⁻¹, is a way of "exploding" clusters near a shift σ, and of transforming the interior of the spectrum near σ to the edges of the spectrum.
Further reading (GMRES, Arnoldi, and polynomials): Trefethen, lectures 34, 35, 40. There are also eigenvalue algorithms that can exploit preconditioning if supplied, e.g. the Jacobi–Davidson algorithm or the LOBPCG algorithm mentioned earlier. You can construct the Arnoldi polynomial explicitly from its roots, the Ritz values; the analogous construction of the GMRES polynomial uses "harmonic" Ritz values, as explained in e.g. Goossens and Roose (1999).
- From steepest descent to conjugate gradients.
- pset 4 due, solutions
- Conjugate gradient, continued.
- Other iterative algorithms: Overview
- Randomized linear algebra: the randomized SVD and low-rank approximation
- Differentiating linear algebra solutions: Adjoint methods
- final projects due