Commit ade5a2a: Added notes for Algorithms for Uncertainty Quantification.
Vuenc committed Aug 16, 2021

# About these notes
These are my lecture notes for the Algorithms for Uncertainty Quantification course held by Dr. Tobias Neckel in the summer term 2021. The notes, including all images, are based on the course slides [^1].

The notes are written in [Obsidian markdown](https://obsidian.md/) and are best viewed in Obsidian.

[^1]: Tobias Neckel - Algorithms for Uncertainty Quantification, lecture slides, summer term 2021

I left out the first two lectures because they only contained an overview of UQ and a review of probability theory.

# 03 - Standard Monte Carlo Sampling

### Sampling methods in UQ
Sampling methods in general: from a statistical population, select a subset to estimate characteristics of the whole population. One has to specify the population, the sampling method, and the sample size.

In UQ: Population = domain in $\mathbb{R}^n$.

### Monte Carlo sampling
Monte Carlo sampling: generate samples from a probability distribution via a pseudo-random number generator (*PRNG*) and perform deterministic computations using these samples.

##### Approximating $\pi$
Sample uniformly from a square with side length $a$; for each point, check whether it lies inside the inscribed circle. Since $A_{rec}/A_{circ} = \frac{a^2}{\pi(a/2)^2} = \frac{4}{\pi}$, we get $\pi = 4 \frac{A_{circ}}{A_{rec}} \approx 4 \cdot \frac{\#\text{points inside circle}}{\#\text{points total}}$.

Example: after 10000 samples, we get $\pi \approx 3.1508$.
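
A minimal numpy sketch of this estimator (function name, seed and sample count are my own illustrative choices):

```python
import numpy as np

# Monte Carlo estimate of pi: sample uniformly from the square [0, a]^2 and
# count how many points fall inside the inscribed circle of radius a/2.
def estimate_pi(n_samples, a=1.0, seed=0):
    rng = np.random.default_rng(seed)
    points = rng.uniform(0.0, a, size=(n_samples, 2))
    # squared distance to the center (a/2, a/2), compared to the squared radius
    inside = np.sum((points - a / 2) ** 2, axis=1) <= (a / 2) ** 2
    # the fraction of points inside approximates A_circ / A_rec = pi / 4
    return 4.0 * inside.mean()

print(estimate_pi(10_000))  # roughly 3.14
```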

##### Formalization of Monte Carlo sampling
Assume $f: [0, 1] \to \mathbb{R}$ is a continuous function and we want to evaluate $I = \int_0^1 f(x)\,dx$, but an antiderivative of $f$ is not available to us.

Notice that $I = E(f(x))$ with $x \sim \mathcal{U}(0, 1)$, so we can approximate $E(f(x))$ via sampling as

$$I \approx \hat{I}_f = \frac{1}{N} \sum_{i=1}^N f(U_i), \qquad U_i \sim \mathcal{U}(0, 1)$$

### Error of Monte Carlo integration
$\hat{I}_f$ is a random variable! The error, i.e. deviation from the real integral, can be expressed by the *root mean squared error* (*RMSE*)

$$RMSE = \sqrt{E((I - \hat{I}_f)^2)}$$

The RMSE is exactly the standard deviation of the estimator $\hat{I}_f$, and is related to the standard deviation of $f(U), U \sim \mathcal{U}(0,1)$ via

$$RMSE = \frac{\sigma_f}{\sqrt{N}} \approx \frac{\hat{\sigma}_f}{\sqrt{N}},
\quad \hat{\sigma}_f^2 = \frac{1}{N-1} \sum_{i=1}^N (f(U_i) - \hat{I}_f)^2$$
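
A small sketch of the estimator $\hat{I}_f$ together with the estimated RMSE $\hat{\sigma}_f / \sqrt{N}$ (the test function $f = \exp$ is an arbitrary example):

```python
import numpy as np

# Monte Carlo estimate of I = int_0^1 f(x) dx plus the estimated RMSE.
def mc_integrate(f, n_samples, seed=0):
    rng = np.random.default_rng(seed)
    u = rng.uniform(0.0, 1.0, size=n_samples)
    values = f(u)
    i_hat = values.mean()                        # I_hat_f
    sigma_hat = values.std(ddof=1)               # sample std dev of f(U)
    return i_hat, sigma_hat / np.sqrt(n_samples)

# exact value: e - 1 = 1.71828...
print(mc_integrate(np.exp, 10_000))
```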

### Monte Carlo sampling in UQ
Goal in UQ: feed input distribution through model to get output distribution, from output distribution compute quantity of interest.

MC sampling can help with this: sample from the input distribution, compute expectation/variance from the outputs via MC integration.

$$E(y) \approx \bar{Y} = \frac{1}{N} \sum_{i=1}^N y_i$$
$$\text{Var}(y) \approx \bar{S}^2 = \frac{1}{N-1} \sum_{i=1}^N (y_i - \bar{Y})^2$$

### Advantages and disadvantages of Monte Carlo methods

MC methods are easy, robust to the distribution/model used, "embarrassingly parallel" and have a convergence rate independent of the input dimension.

However, the convergence rate is $O(1/\sqrt{N})$, which is very slow.

### Example: Damped oscillator with MC

![[damped-linear-osc.png]]

Via MC: find the expected value of $y(t_0)$, $t_0 = 15\,s$.
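
The concrete model and parameters are only given on the slides (image above); the following is just a sketch that assumes a generic damped oscillator $y'' + c\,y' + k\,y = f \cos(\omega t)$ with placeholder values for $k$, $f$, $\omega$ and the initial conditions, and takes $c \sim \mathcal{U}(0.08, 0.12)$ from the quasi-MC section below:

```python
import numpy as np
from scipy.integrate import solve_ivp

# ASSUMED model: y'' + c y' + k y = f cos(omega t); k, f, omega and the
# initial conditions are placeholder values, not taken from the slides.
def oscillator_y_at(c, t0=15.0, k=2.0, f=0.5, omega=1.0, y0=0.5, dy0=0.0):
    def rhs(t, state):
        y, dy = state
        return [dy, f * np.cos(omega * t) - c * dy - k * y]
    sol = solve_ivp(rhs, (0.0, t0), [y0, dy0], dense_output=True)
    return sol.sol(t0)[0]  # y(t0)

# Monte Carlo estimate of E[y(t0)] with uncertain damping c ~ U(0.08, 0.12)
rng = np.random.default_rng(0)
cs = rng.uniform(0.08, 0.12, size=1_000)
samples = np.array([oscillator_y_at(c) for c in cs])
print(samples.mean(), samples.var(ddof=1))
```
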
# 04 - Advanced Monte Carlo Sampling

### Variance reduction techniques
To reduce RMSE, we need to reduce the variance of the estimator.

##### Control variates
Assume we have an auxiliary function $g$ with known $E(g)$. The random variables $I_f^N$ and $I_g^N$ are the MC estimators for $\int_0^1 f(x) \,dx$ and $\int_0^1 g(x) \,dx$, respectively.

Then the control variate

$$I_{CV} = I_f^N + \alpha (E(g) - I_g^N)$$

is an unbiased estimator for $E(f)$ since the second term has zero mean. The optimal $\alpha$ is

$$\alpha = \frac{\text{Cov}(I_f^N, I_g^N)}{\text{Var}(I_g^N)} = \frac{\sigma_f \rho_{fg}}{\sigma_g}$$

where $\rho_{fg} = \frac{\text{Cov}(f, g)}{\sigma_f \sigma_g}$ is the Pearson correlation coefficient. With this optimal choice for $\alpha$, the variance of $I_{CV}$ is

$$\text{Var}(I_{CV}) = (1 - \rho_{fg}^2) \frac{\text{Var}(f)}{N},$$

so if $f$ and $g$ are correlated (positively or negatively), this yields a variance reduction.
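
A minimal sketch: estimate $E(f(U)) = \int_0^1 e^x \, dx$ with the control variate $g(x) = x$, whose mean $E(g) = 1/2$ is known; the choices of $f$, $g$ and $N$ are illustrative, and $\alpha$ is estimated from the same samples:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
u = rng.uniform(0.0, 1.0, size=N)
f_vals, g_vals = np.exp(u), u   # f(x) = exp(x), control variate g(x) = x
E_g = 0.5                       # known mean of g under U(0, 1)

# estimate the optimal alpha = Cov(f, g) / Var(g) from the samples
alpha = np.cov(f_vals, g_vals)[0, 1] / np.var(g_vals, ddof=1)

I_f = f_vals.mean()                          # plain MC estimator
I_cv = I_f + alpha * (E_g - g_vals.mean())   # control variate estimator
print(I_f, I_cv)                             # exact value: e - 1 = 1.71828...
```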

##### Importance sampling
Setting again: estimate an expected value (now more generally with respect to some density $p$) by sampling according to $p$ and estimating $I_f^N = (\sum f(x_i))/N$.

Assume we cannot sample from $p$ directly, but can sample from another density $q$ s.t. $\{f\cdot p > 0\} \subseteq \{ q > 0\}$. Intuitively, if we find an auxiliary $q$ that matches $p$ closely, that is better than if $q$ is just e.g. the uniform density. Define $w = p / q$; then

$$\int_D f(x) p(x) \,dx = \int_D f(x) w(x) q(x) \,dx.$$

This motivates the *importance sampling* estimator:

$$I_{IS} = \frac{1}{N} \sum_i f(x_i) w(x_i), \qquad w(x_i) = p(x_i)/q(x_i)$$

The $w(x_i)$ are called *importance weights*. The variance is given as

$$\text{Var}(I_{IS}) = \frac{1}{N} \bigg( \int_D \frac{(f(x)p(x))^2}{q(x)}\,dx - E(f)^2 \bigg)$$
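
An illustrative sketch (the densities and the rare-event target are my own example, not from the slides): estimate $P(X > 3)$ for $X \sim \mathcal{N}(0,1)$, i.e. $f$ is an indicator function and $p$ the standard normal density, using the shifted proposal $q = \mathcal{N}(3, 1)$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N = 10_000

x = rng.normal(loc=3.0, scale=1.0, size=N)       # samples from q = N(3, 1)
f = (x > 3.0).astype(float)                      # indicator of the tail event
w = norm.pdf(x, loc=0.0, scale=1.0) / norm.pdf(x, loc=3.0, scale=1.0)  # p / q

print(np.mean(f * w))                      # IS estimate, exact: 1 - Phi(3) ~ 1.35e-3
print(np.mean(rng.normal(size=N) > 3.0))   # naive MC estimate for comparison
```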


### Alternative RNGs (Quasi Monte Carlo)
*Quasi Monte Carlo*: Instead of (pseudo-)random sequences, use quasi-random, low-discrepancy sequences (deterministic sequences that look somewhat like random sequences, but cover the domain more evenly). Examples:

- Fibonacci generators
- Latin hypercube sampling
- Sobol sequences
- Halton sequences

##### Koksma-Hlawka inequality / low-discrepancy sequences
We use low-discrepancy sequences. Define the *discrepancy* of a sequence up to term $N$ as

$$D_N = \sup_{A = [a, b] \subseteq [0, 1]} \left\lvert \frac{|A \cap \{x_1,\dots,x_N\}|}{N} - \text{vol}(A) \right\rvert$$

Let $V(f)$ be the *variation* of $f$. Then the *Koksma-Hlawka inequality* states that

$$|I - \hat{I}_f| \leq V(f) D_N \quad \text{(Koksma-Hlawka inequality)}$$

So the error can be reduced by reducing $D_N$, i.e. producing sequences that are "well-spaced" (even better than random?).

Now: the error goes from $O(1/\sqrt{N})$ (usual MC) to $O(\log(N)^d / N)$, where $d$ is the dimension.

![[halton-sobol.png]]

##### Halton sequences
Given: a prime $p$. The Halton sequence is defined iteratively as follows:

- Break $[0, 1]$ into $p$ subintervals; list the partition points.
- Break each subinterval into $p$ subintervals; list all new partition points, ordered as: first the new points that are first within their subinterval, then those that are second within their subinterval, etc.
- Repeat.

![[halton-sequence.png]]

For a multi-dimensional Halton sequence, take the pointwise Cartesian product of one-dimensional Halton sequences with pairwise different primes $p_i$.

The Halton sequence mimics a uniform distribution, but can of course be transformed by the usual inverse-CDF technique to mimic other distributions: if $U \sim \mathcal{U}(0, 1)$, then $F_S^{-1}(U) \sim S$.
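
A sketch of one common construction, the radical inverse of the index in base $p$ (it produces the same points as the subdivision description above):

```python
import numpy as np

def radical_inverse(n, p):
    """Mirror the base-p digits of n around the decimal point."""
    value, factor = 0.0, 1.0 / p
    while n > 0:
        n, digit = divmod(n, p)
        value += digit * factor
        factor /= p
    return value

def halton(n_points, primes=(2, 3)):
    """Multi-dimensional Halton points: one (distinct) base prime per dimension."""
    return np.array([[radical_inverse(n, p) for p in primes]
                     for n in range(1, n_points + 1)])

print(halton(5))
# to mimic e.g. a standard normal instead of U(0, 1), apply the inverse CDF:
# from scipy.stats import norm; norm.ppf(halton(5))
```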


### Damped oscillator
Quasi-MC slightly improves the variance of the estimate of $y(t_0)$, $t_0 = 15\,s$, for $c \sim \mathcal{U}(0.08, 0.12)$, but not by much.

### Outlook: Multifidelity/Multilevel Monte Carlo

Multifidelity MC: low-fidelity models (i.e. inexpensive to compute, but inaccurate) are used as control variates; they are obtained e.g. by interpolation, projection to lower dimensions, or simulation with simplified physics. A standard algorithm optimally distributes work among the available models for a given computational budget.

Multilevel MC: the special case where the low-fidelity models come from coarser grid discretizations.
# 05 - Aspects of Interpolation and Quadrature

## Polynomial interpolation

Interpolation: estimate unknown points of function of which some points are known. Here: polynomial interpolation.

Assume $f: [-1, 1] \to \mathbb{R}$ for simplicity, $N \geq 1$. A grid is a set $G$ of $N+1$ nodes $x_i$ in ascending order.

The unique polynomial $I_N^G$ of degree $N$ that satisfies
$$I_N^G(x_i) = f(x_i) ~\forall i$$
is called the *interpolating polynomial of $f$ on $G$*.

QUESTION: Slides say $f$ should be continuous; but that shouldn't be needed at all?

### Lagrange form of $I_N^G$
Define the $i$-th *Lagrange cardinal polynomial* $L_i^G$ as

$$L_i^G(x) = \prod_{j=0, j\neq i}^N \frac{x - x_j}{x_i - x_j}$$

with the property that $L_i^G(x_j) = \delta_{ij}$. Then the Lagrange form of the interpolant is

$$I_N^G f(x) = \sum_{i=0}^N f(x_i) L_i^G(x)$$

![[lagrange-polys.png]]
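
A small numpy sketch of the Lagrange form (grid, test function and evaluation points are arbitrary examples):

```python
import numpy as np

# Evaluate the interpolant I_N^G f at the points x, given the grid and f(x_i).
def lagrange_interpolant(grid, f_values, x):
    result = np.zeros_like(x, dtype=float)
    for i, (x_i, f_i) in enumerate(zip(grid, f_values)):
        # i-th Lagrange cardinal polynomial L_i^G evaluated at x
        L_i = np.ones_like(x, dtype=float)
        for j, x_j in enumerate(grid):
            if j != i:
                L_i *= (x - x_j) / (x_i - x_j)
        result += f_i * L_i
    return result

# example: interpolate f(x) = sin(pi x) on 5 equidistant nodes in [-1, 1]
grid = np.linspace(-1.0, 1.0, 5)
x = np.linspace(-1.0, 1.0, 9)
print(lagrange_interpolant(grid, np.sin(np.pi * grid), x))
```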

### Interpolation error
Assume that $f$ is "smooth enough" (has derivatives up to order $N+1$), and let $\bar{x} \in [x_0, x_N]$. Then there is some $\xi \in [x_0, x_N]$ such that

$$f(\bar{x}) - I_N^G f(\bar{x}) = \frac{f^{(N+1)}(\xi)}{(N+1)!} \prod_{i=0}^N (\bar{x} - x_i)$$

##### Maximum error bound and Lebesgue constant
Let $I_N^\ast$ be the *best approximation polynomial at $N+1$ nodes*, i.e. the polynomial of minimum supremum norm difference to $f$ that intersects $f$ at $N+1$ points.

The *Lebesgue constant* relative to grid $G$ is

$$\Lambda_N(G) = \max_x \sum_{i=0}^N |L_i^G(x)|$$

Then a bound for the maximum interpolation error of $I_N^G f$ is

$$||f - I_N^G f||_\infty \leq (1 + \Lambda_N(G)) ||f - I_N^\ast f||_\infty$$


$\Lambda_N(G)$ contains all the information about the effect of the grid choice. For any grid, $\Lambda_N(G) > \frac{2}{\pi} \ln(N+1) - C, C > 0$. For a uniform grid, we have approximately $\Lambda_N(G) \approx \frac{2^{N+1}}{e N \ln N}$.

QUESTION: what is C?

### Runge phenomenon
Interpolation on uniform grid can lead to Runge phenomenon: Very large approximation errors near the boundaries.

![[runge.png]]

Remedies:
- use non-uniform grid
- use other basis functions than $L_i^G$, e.g. splines with local support
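
A short sketch of the Runge phenomenon and the non-uniform-grid remedy, using $f(x) = 1/(1+25x^2)$ and Chebyshev nodes (the degree $N$ is an illustrative choice):

```python
import numpy as np
from scipy.interpolate import BarycentricInterpolator

f = lambda x: 1.0 / (1.0 + 25.0 * x**2)   # Runge's example function
N = 20
x_eval = np.linspace(-1.0, 1.0, 1001)

uniform_nodes = np.linspace(-1.0, 1.0, N + 1)
chebyshev_nodes = np.cos((2 * np.arange(N + 1) + 1) / (2 * (N + 1)) * np.pi)

for name, nodes in [("uniform", uniform_nodes), ("chebyshev", chebyshev_nodes)]:
    # degree-N polynomial interpolant through the N+1 nodes
    interpolant = BarycentricInterpolator(nodes, f(nodes))
    max_error = np.max(np.abs(interpolant(x_eval) - f(x_eval)))
    print(name, max_error)   # the uniform grid blows up near the boundaries
```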


## Quadrature
Integrate black-box function without known antiderivative. Idea: approximate based on function evaluations, using rules that could give *exact* integrals for polynomials. In this lecture: *Gaussian quadrature*.

Usually, weighted sum of function evaluations. Classical quadrature: e.g. *trapezoidal sum*

$$h \left(\frac{f(x_0) + f(x_N)}{2} + \sum_{i=1}^{N-1} f(x_i)\right), ~ h = x_{i+1}-x_i = \text{const}.$$

### Orthogonal polynomials
Assume we have a domain $\mathbb{D}$ and a weight function $w$, and define $L_w^2(\mathbb{D})$ as the set of functions that are square-integrable over $\mathbb{D}$ w.r.t. the weight function $w$, i.e. $\int_\mathbb{D} f(x)^2 w(x) \,dx < \infty$.

Define the inner product on $L^2_w(\mathbb{D})$ as $$\langle f, g \rangle_w = \int_{\mathbb{D}} f(x) g(x) w(x) \, dx.$$

Polynomials $p_i, p_j$ of degree $i, j$, respectively, are *orthogonal* if $\langle p_i, p_j \rangle = k \delta_{ij}$ and *orthonormal* if $k=1$.

##### Legendre polynomials
Used for a uniform weight $w(x) = c$.
$$p_0(x) = 1, p_1(x) = x, p_2(x) = \frac{1}{2}(3 x^2 - 1), p_3(x) = \frac{1}{2} (5x^3 - 3x)$$

Different $p_i, p_j$ are orthogonal, and $\langle p_i, p_i\rangle_{w(x)=1} = \frac{2}{2i+1}$ (with the uniform density $w(x) = 1/2$ instead, the value is $\frac{1}{2i+1}$).

##### Hermite polynomials
Used for Gaussian weight $w(x) = \frac{1}{\sqrt{2\pi}}\exp(-x^2/2)$.
$$p_0(x)=1, p_1(x) = x, p_2(x) = x^2-1, p_3(x) = x^3-3x$$
and $\langle p_i, p_i \rangle_w = i!$.

To keep in mind: there are different conventions regarding the normalization; some authors ("physicists") drop the normalization constant in $w(x)$, and then the polynomials have slightly different coefficients.

### Gaussian Quadrature
Let $(p_i)_i$ be a family of orthogonal polynomials wrt. $w$.
- Define a quadrature grid $G = (x_i)_i$ by the $N+1$ roots of $p_{N+1}$
- For each $i$, define a weighting factor $$w_i = \int_{\mathbb{D}} L_i^G(x) w(x) \,dx$$

The Gaussian quadrature is given by
$$\int_\mathbb{D} f(x) w(x) \,dx = \sum_{i=0}^N w_i f(x_i) + \epsilon$$

Gauss-Legendre nodes (roots of Legendre polynomials):
![[gauss-legendre-nodes.png]]

Gauss-Hermite nodes (roots of Hermite polynomials):
![[gauss-hermite-nodes.png]]
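
A minimal numpy sketch of both rules; note that numpy's `hermite_e` module uses the probabilists' weight $e^{-x^2/2}$ without the $1/\sqrt{2\pi}$ factor, so the weights are rescaled here to get expectations under $\mathcal{N}(0, 1)$:

```python
import numpy as np

# Gauss-Legendre: nodes/weights for the weight w(x) = 1 on [-1, 1];
# exact for polynomials up to degree 2*5 - 1 = 9.
nodes, weights = np.polynomial.legendre.leggauss(5)
print(np.sum(weights * np.exp(nodes)))   # int_{-1}^{1} e^x dx = e - 1/e = 2.3504...

# Gauss-Hermite (probabilists' convention): weight exp(-x^2 / 2).
nodes, weights = np.polynomial.hermite_e.hermegauss(5)
weights = weights / np.sqrt(2.0 * np.pi)  # rescale to the N(0, 1) density
print(np.sum(weights * nodes**2))         # E[X^2] = 1 for X ~ N(0, 1)
```
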
# 06 Polynomial Chaos Expansion
Notation: $\Omega, \omega, \rho$ for domain/inputs/densities.
Goal: a good approximation of the stochastic model that is cheap to evaluate, with faster convergence than Monte Carlo.

gPCE = generalized Polynomial Chaos Expansion.

### Spectral Expansions
* Fourier series: $s(t) = \sum \hat{s}_n \sin(\dots t)$.
* Polynomial chaos expansion: $f(t, \omega) \approx \sum_{n=0}^{N-1} \hat{f}_n(t) \, \phi_n(\omega)$
* choose $\phi_n$ as polynomials, compute the coefficients $\hat{f}_n$, choose $N$ and truncate after $N$ terms (i.e. indices go up to $N-1$).
* => compute statistical properties of $f(t, \omega)$

* function-space inner product: $\langle p, q \rangle_{\rho} = \int p(\omega) q(\omega) \rho(\omega) \, d\omega$

### Choosing polynomials
The choice depends on the density $\rho$, since we want orthonormality.
E.g. $U([0,1])$: Legendre polynomials; $N(0, 1)$: Hermite polynomials.

### Stieltjes' Three-Term Recursion
All orthonormal polynomials satisfy: (this can't be true, why would we need the base cases?)
$$ \phi_{-1}(\omega) = 0,\quad \phi_{0}(\omega) = 1$$ $$ \phi_{n+1}(\omega) = (A_n \omega + B_n) \phi_n(\omega) - C_n \phi_{n-1}(\omega) $$
where the $A_n, B_n, C_n$ are computed recursively and depend on $\rho$.

=> yields a *numerically stable* method to get orthonormal polynomials.
Other schemes exist, e.g. Gram-Schmidt orthogonalization.
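
As a sanity check, a small sketch of the recursion for the (unnormalized) probabilists' Hermite polynomials, where the coefficients are known to be $A_n = 1$, $B_n = 0$, $C_n = n$:

```python
import numpy as np

# He_{n+1}(x) = x * He_n(x) - n * He_{n-1}(x), with He_0 = 1, He_1 = x
def hermite_polynomials(max_degree):
    P = np.polynomial.Polynomial
    phis = [P([1.0]), P([0.0, 1.0])]   # He_0, He_1
    for n in range(1, max_degree):
        phis.append(P([0.0, 1.0]) * phis[n] - n * phis[n - 1])
    return phis[:max_degree + 1]

for phi in hermite_polynomials(3):
    print(phi)   # 1, x, x^2 - 1, x^3 - 3x (matching the Hermite list in the quadrature notes)
```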

### Selecting N and K
Choose $N$ and $K$: $N$ is the number of terms in the PCE, $K$ is the number of quadrature terms used.

##### Polynomial exactness P(K)
For a given quadrature method, $P(K)$ is the maximum polynomial degree that can still be integrated exactly. Examples:
- Gaussian quadrature: $P(K) = 2K-1$
- Clenshaw-Curtis: $P(K) = K$
- Leja: $P(K) = K$

##### Choosing N, K
The number of quadrature nodes $x_i$ in the stochastic space corresponds to the number of model evaluations, so $K$ should be chosen such that we can afford $K$ model evaluations. On the other hand, $N$ can be kept small because the PCE coefficients decay exponentially.

Rule of thumb: $N = P(K)/2$.

### Pseudo-Spectral Approach for Computing Coefficients
The $m$-th coefficient can be computed by taking the inner product with the $m$-th basis polynomial (using orthonormality of the $\phi_n$):
$$\hat{f}_m(t) = \sum_n \hat{f}_n(t) \langle \phi_n(\omega),\, \phi_m(\omega) \rangle_\rho = \langle \sum \hat{f}_n(t) \phi_n(\omega),\, \phi_m(\omega) \rangle_\rho = \langle f(t, \omega),\, \phi_m(\omega) \rangle_\rho$$

So we have a formula for the coefficients. Evaluating the integral is a quadrature problem. Gaussian quadrature is optimal here (in one-dimensional setting).

Using quadrature to obtain the PCE coefficients is known as the *pseudo-spectral approach*.

##### Algorithm for coefficient computation
Generate quadrature nodes $x_k$ and weights $w_k$, and evaluate $f(t, x_k)$. Then compute $\hat{f}_n(t) = \sum_{k=0}^{K-1} f(t, x_k)\, \phi_n(x_k)\, w_k$ for all $n = 0, \dots, N-1$.

```python
# expect that f computes w |-> f(t, w) for t fixed
def compute_pce_coefficients(f, N, K, rho):
    # orthonormal basis polynomials, e.g. via cp.orth_ttr(..., normed=True)
    phis = create_orthogonal_polynomials(N, rho)
    # quadrature nodes and weights, e.g. via cp.quad_gauss_legendre(...)
    xs, ws = create_quadrature_nodes_weights(K, rho)

    # one model evaluation per quadrature node (K evaluations in total)
    f_evaluations = [f(xk) for xk in xs]

    # pseudo-spectral projection: f_hat_n = sum_k f(t, x_k) * phi_n(x_k) * w_k
    pce_coefficients = []
    for n in range(N):
        fn_hat = sum(fk * phis[n](xk) * wk
                     for fk, xk, wk in zip(f_evaluations, xs, ws))
        pce_coefficients.append(fn_hat)

    return pce_coefficients
```

Other approaches include least squares, stochastic Galerkin, etc.


### Statistical Properties of approximation
* Expected value: $E(f(t, \omega)) = \dots = \hat{f}_0(t)$
* follows from the fact that $E(\phi_n(\omega)) = \langle \phi_0, \phi_n \rangle_\rho = \delta_{0n}$
* Variance: $Var(f(t, \omega)) = \sum_{n \geq 1} \hat{f}_n^2(t)$
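
A small helper sketch that reads these two statistics off the coefficients returned by `compute_pce_coefficients` above (assuming an orthonormal basis):

```python
import numpy as np

def pce_mean_variance(pce_coefficients):
    coeffs = np.asarray(pce_coefficients, dtype=float)
    mean = coeffs[0]                     # E[f] = f_hat_0
    variance = np.sum(coeffs[1:] ** 2)   # Var[f] = sum_{n >= 1} f_hat_n^2
    return mean, variance
```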

"Nice behavior, saves a lot of costs"


### PCE for damped linear oscillator
Assume $c \sim U(0.08, 0.12)$ and some target time $T$. Then PC-expand the RV $y(T, \omega)$, compute coefficients $\hat{y}_n(T) = \sum_{k=0}^{K-1} y(T, x_k) \phi_n(x_k) w_k$.

Result: a PCE with 5 quadrature nodes is enough to get a slightly better result than Monte Carlo with 100,000 samples.


### Multivariate case

$$f(t, \omega) \approx \sum_{\mathbf{n}} \hat{f}_{\mathbf{n}}(t) \phi_{\mathbf{n}}(\boldsymbol{\omega})$$ where $\mathbf{n}$ is a multi-index, usually restricted to $n_1 + \dots + n_d \leq N$, and $\phi_{\mathbf{n}}(\boldsymbol{\omega}) = \prod_{i=1}^d \phi_{n_i}(\omega_i)$ is a product of univariate basis polynomials.