Commit fd2f6d7 --- committed: Create answersweek47.tex (1 parent: 5952608)
1 file changed: +258 additions, 0 deletions
\documentclass[12pt]{article}
\usepackage[a4paper,margin=2.5cm]{geometry}
\usepackage{amsmath,amssymb}
\usepackage{enumitem}
\usepackage{hyperref}

\title{Test yourself questions}
\author{FYS-STK3155/4155}
\date{Last weekly exercise set}

\begin{document}
\maketitle
\section{Linear Regression}

\begin{enumerate}[leftmargin=1.2cm]

\item[\textbf{1.}]\textbf{(Multiple Choice)}
Which of the following is \emph{not} an assumption of ordinary least squares linear regression?
\begin{enumerate}[label=\alph*)]
\item Linearity between predictors and target
\item Normality of predictors/features
\end{enumerate}
Linearity: True. The relationship between the predictors and the outcome is assumed to be linear.
Normality of predictors: False. This option claims that each independent feature is normally distributed, but linear regression does not require the predictors themselves to be normally distributed.

\item[\textbf{2.}]\textbf{(True/False)}
The mean squared error cost function for linear regression is convex in the parameters, guaranteeing a unique global minimum.
Answer: True. The MSE cost in linear regression is a convex quadratic function of the parameters, so gradient-based optimization will find a global minimum (the minimizer is unique when the design matrix has full column rank). A short convexity check is given right after this list.

\end{enumerate}
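To back up the convexity claim in Question~2, here is a minimal check; the notation is introduced here only for illustration (design matrix $X\in\mathbb{R}^{n\times p}$, targets $\boldsymbol{y}$, parameters $\boldsymbol{\theta}$), none of it is fixed by the exercise text:
\[
C(\boldsymbol{\theta}) = \frac{1}{n}\,\lVert \boldsymbol{y} - X\boldsymbol{\theta}\rVert_2^{2},
\qquad
\nabla_{\boldsymbol{\theta}} C = -\frac{2}{n}\,X^{T}(\boldsymbol{y} - X\boldsymbol{\theta}),
\qquad
\nabla^{2}_{\boldsymbol{\theta}} C = \frac{2}{n}\,X^{T}X .
\]
The Hessian $\tfrac{2}{n}X^{T}X$ is positive semi-definite for any $X$, so $C$ is convex. Setting the gradient to zero gives the normal equations $X^{T}X\,\boldsymbol{\theta} = X^{T}\boldsymbol{y}$, whose solution is unique exactly when $X$ has full column rank.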

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Logistic Regression}

\begin{enumerate}[leftmargin=1.2cm,start=5]

\item[\textbf{5.}]\textbf{(Multiple Choice)}
Which statement about logistic regression is \emph{false}?
\begin{enumerate}[label=\alph*)]
\item Used for binary classification
\item Uses sigmoid to map linear scores to probabilities
\item Has an analytical closed-form solution
\item The log-loss is convex
\end{enumerate}
1. True. Logistic regression is used for binary classification problems (it outputs a probability for the positive class).

2. True. It uses the logistic (sigmoid) function to map linear combinations of features to probabilities.

3. False. Unlike linear regression, logistic regression has no analytical closed-form solution for its parameters; the coefficients must be found numerically, for example with gradient descent or Newton's method.

4. True. Its loss function (log loss) is convex, which guarantees a global optimum during training (no spurious local minima).

Summary: logistic regression is a probabilistic classifier for binary outcomes, learned via maximum likelihood. Unlike linear regression, it does not have a closed-form coefficient solver and must be fit with iterative methods. Its negative log-likelihood (cross-entropy) cost is convex, ensuring a single global minimum; see the gradient sketch right after this list.

\item[\textbf{6.}]\textbf{(True/False)}
Logistic regression produces a linear decision boundary in the input feature space.
Answer: True. Logistic regression (with no feature transformations) produces a linear decision boundary: the model is $\sigma(w^T x + b)$, and the decision boundary occurs at $w^T x + b = 0$, which is a hyperplane, i.e.\ a linear boundary.

\item[\textbf{7.}]\textbf{(Short Answer)}
Give two reasons why logistic regression is preferred over linear regression for binary classification.
Answer: First, logistic regression outputs probabilities (via the sigmoid function), which are naturally bounded between 0 and 1, whereas linear regression can produce arbitrary values not suited for classification. Second, using linear regression for classification (with a threshold) is problematic because it treats errors on 0/1 outcomes in a least-squares sense, which can lead to unstable or nonsensical thresholds and is not theoretically well-founded. Logistic regression instead optimizes a log-loss (cross-entropy) cost, which is well-suited for binary outcomes and often yields better-calibrated probability estimates. Additionally, logistic regression is less sensitive to class imbalance than linear regression with a threshold, and its loss function is convex, avoiding some of the issues linear regression would have on classification tasks.

\end{enumerate}
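To make the ``convex but no closed form'' point from Question~5 concrete, here is a brief sketch; the symbols $\boldsymbol{\theta}$ (parameters), $x_i$ (feature vectors), and $y_i\in\{0,1\}$ (labels) are assumed here purely for illustration. The negative log-likelihood and its gradient are
\[
C(\boldsymbol{\theta}) = -\sum_{i=1}^{n}\Big[y_i\log p_i + (1-y_i)\log(1-p_i)\Big],
\qquad
p_i = \sigma(\boldsymbol{\theta}^{T}x_i) = \frac{1}{1+e^{-\boldsymbol{\theta}^{T}x_i}},
\]
\[
\nabla_{\boldsymbol{\theta}} C = \sum_{i=1}^{n}\big(p_i - y_i\big)\,x_i .
\]
Because each $p_i$ depends nonlinearly on $\boldsymbol{\theta}$, the stationarity condition $\nabla_{\boldsymbol{\theta}}C = 0$ cannot be solved in closed form; one instead iterates, e.g.\ gradient descent $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta\,\nabla_{\boldsymbol{\theta}}C$ or Newton's method. Convexity of $C$ ensures these iterations are not trapped in spurious local minima.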

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Neural Networks (Feedforward)}

\begin{enumerate}[leftmargin=1.2cm,start=9]

\item[\textbf{9.}]\textbf{(Multiple Choice)}
Which statement is \emph{not} true for fully-connected neural networks?
\begin{enumerate}[label=\alph*)]
\item Without nonlinearities they reduce to a single linear model
\item Backpropagation applies the chain rule
\item A single hidden layer can approximate any continuous function
\item The loss surface is convex
\end{enumerate}
1. True. Without non-linear activation functions in the hidden layers, the network is equivalent to a single linear model, no matter how many layers are stacked.

2. True. Training deep neural networks relies on backpropagation, which uses the chain rule of calculus to compute gradients for all weights in the network.

3. True. With enough hidden units, a neural network with even a single hidden layer can approximate any continuous function on a closed interval (given mild conditions on the activation function).

4. False. The loss surface of a deep neural network is non-convex, so a local minimum of the training loss is not guaranteed to be a global minimum.

(Neural networks require non-linear activations to gain expressive power; otherwise multiple layers collapse to an equivalent single linear transformation. They are universal approximators in theory and are trained via backpropagation (chain rule for gradients). However, the loss surface for deep nets is non-convex, generally possessing many local minima and saddle points.)

\item[\textbf{10.}]\textbf{(True/False)}
Using sigmoid activations in deep networks can cause vanishing gradients.
Answer: True. Sigmoid/tanh activations squash outputs to $(0,1)$/$(-1,1)$, and their derivatives are small for large-magnitude inputs (the sigmoid derivative is at most $1/4$). In a deep network, gradients propagated backward can thus diminish exponentially, ``vanishing'' before reaching the early layers, which makes it difficult for those layers to learn.

\item[\textbf{11.}]\textbf{(Short Answer)}
Explain the vanishing gradient problem and mention one technique to mitigate it.
Answer: The vanishing gradient problem refers to the tendency of gradients to become extremely small in the early layers of a deep network during training. It occurs because gradients are the product of many partial derivatives from the chain rule. In deep networks (especially with sigmoid or $\tanh$ activations), these derivatives can be less than 1, causing the product to shrink exponentially as it is backpropagated through many layers. As a result, the early (lower) layers learn very slowly since their weights receive almost no update. A common technique to mitigate this is to use ReLU (Rectified Linear Unit) activations, or other activation functions that do not saturate, in place of sigmoids; the ReLU derivative is 0 or 1, which helps maintain larger gradients. Other strategies include careful weight initialization, batch normalization, residual connections, or using architectures like LSTMs (for RNNs) that are designed to preserve gradients.

\item[\textbf{12.}]\textbf{(Short Answer)}
Given layer sizes $n_0,n_1,\dots,n_L$, derive the total number of trainable parameters in a fully connected neural network.
Answer: In a fully-connected network, each layer $i$ (except the input layer) has a weight matrix connecting all $n_{i-1}$ neurons from the previous layer to the $n_i$ neurons of this layer, plus $n_i$ bias terms. Therefore, the number of parameters in layer $i$ is $n_{i-1}\cdot n_i$ (weights) $+\,n_i$ (biases). Summing over all layers from $1$ to $L$ gives the total number of parameters:
\[
\text{Total params} \;=\; \sum_{i=1}^{L} \Big(n_{i-1}\times n_i + n_i\Big)\,.
\]
For example, a network with architecture $[n_0, n_1, n_2]$ (one hidden layer of size $n_1$) has $n_0n_1 + n_1$ parameters in the first (input-to-hidden) layer and $n_1n_2 + n_2$ in the second (hidden-to-output) layer. A fully numerical example is worked out right after this list.

\end{enumerate}
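As a worked instance of the counting rule from Question~12 (the layer sizes below are chosen purely for illustration and do not come from the exercise set), take the architecture $[n_0,n_1,n_2]=[784,30,10]$:
\[
\underbrace{(784\cdot 30 + 30)}_{\text{input--hidden: }23550}
\;+\;
\underbrace{(30\cdot 10 + 10)}_{\text{hidden--output: }310}
\;=\; 23\,860 \text{ trainable parameters.}
\]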

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Convolutional Neural Networks}

\begin{enumerate}[leftmargin=1.2cm,start=13]

\item[\textbf{13.}]\textbf{(Multiple Choice)}
Which of the following is \emph{not} an advantage of convolutional networks?
\begin{enumerate}[label=\alph*)]
\item Local receptive fields
\item Weight sharing
\item More parameters than fully-connected layers
\item Pooling gives translation invariance
\end{enumerate}
1. True. CNNs use local receptive fields, meaning each neuron in a convolutional layer connects to only a small spatial region of the input.

2. True. CNNs employ weight sharing: the same filter (set of weights) is applied across different positions of the input, greatly reducing the number of parameters.

3. False. Thanks to local connectivity and weight sharing, convolutional layers typically have far fewer parameters than fully-connected layers applied to inputs of the same size; having more parameters would not be an advantage in any case.

4. True. Pooling layers in CNNs help achieve a degree of translation invariance by summarizing features over small neighborhoods.

Convolution and pooling confer two key benefits: far fewer parameters thanks to local connectivity and shared filters, and some robustness to translations, e.g.\ a feature's exact location is less critical after pooling. CNNs leverage these properties to generalize well to image data.

\item[\textbf{14.}]\textbf{(True/False)}
Zero-padding can preserve spatial dimensions when using $3\times3$ kernels with stride~1.
Answer: True. Using zero-padding in convolutional layers can preserve the spatial size of the input: with a $3\times 3$ kernel and stride 1, choosing a padding of $P=1$ on each side keeps an input image of size $W \times H$ the same size in the output feature map. (In general, the output width for a 1D convolution is $\frac{W - K + 2P}{S} + 1$. Setting $P = (K-1)/2$ for stride $S=1$ yields output width $W$. For a $3\times 3$ kernel, $(K-1)/2 = 1$, so padding by 1 keeps the output size equal to the input size.)

\item[\textbf{15.}]\textbf{(Short Answer)}
Derive the formula for the output width of a convolutional layer with input width $W$, filter size $K$, stride $S$, and padding $P$.
Answer: The output width $W_{\text{out}}$ is given by
\[
W_{\text{out}} = \frac{W - K + 2P}{S} + 1,
\]
assuming $(W - K + 2P)$ is divisible by $S$. An analogous formula holds for the output height using $H$, and it generalizes to stacked convolutional layers and to the spatial dimensions of intermediate feature maps. A worked numerical example is given right after this list.

\item[\textbf{16.}]\textbf{(Short Answer)}
A convolutional layer has $C_{\mathrm{in}}$ input channels, $C_{\mathrm{out}}$ filters, and kernel size $K_h \times K_w$.
Compute the number of trainable parameters (including biases).
Answer: Each filter has $C_{\text{in}} \times K_h \times K_w$ weights, and typically one bias term. With $C_{\text{out}}$ filters in the layer, the total parameter count is
\[
\big(K_h \cdot K_w \cdot C_{\text{in}} + 1\big) \times C_{\text{out}},
\]
which accounts for all filter weights plus one bias per filter. (For example, a convolutional layer with $32$ filters of size $3\times 3$ and $3$ input channels has $(3\cdot3\cdot3+1)\times 32 = 896$ parameters.)

\end{enumerate}
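As a worked application of the formula from Question~15 (the sizes below are illustrative only, not taken from the exercise set): for a $32\times 32$ input, a convolution with $K=5$, stride $S=1$, and no padding ($P=0$) gives
\[
W_{\text{out}} = \frac{32 - 5 + 0}{1} + 1 = 28,
\]
so the feature map is $28\times 28$. A subsequent $2\times 2$ max pooling with stride 2, handled by the same formula with $K=2$, $S=2$, $P=0$, gives $\frac{28-2}{2}+1 = 14$, i.e.\ a $14\times 14$ map.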

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Recurrent Neural Networks}

\begin{enumerate}[leftmargin=1.2cm,start=17]

\item[\textbf{17.}]\textbf{(Multiple Choice)}
Which statement about vanilla RNNs is \emph{false}?
\begin{enumerate}[label=\alph*)]
\item They maintain a hidden state
\item They use shared weights across time
\item They can process sequences of arbitrary length
\item They avoid vanishing gradients
\end{enumerate}
1. True. RNNs maintain a hidden state vector that is updated at each time step, allowing the network to retain information from previous inputs.

2. True. RNNs use the same weight matrices (shared weights) at every time step of the sequence, instead of having separate weights for each time step.

3. True. RNNs can, in principle, process input sequences of arbitrary length (they are not limited to a fixed input size per se).

4. False. Vanilla RNNs do not avoid the vanishing gradient problem, and as a consequence they struggle to learn long-term dependencies.

(Standard RNNs ``unfold'' a single recurrent layer across time steps and share parameters along the sequence. They maintain an internal memory (hidden state) to capture temporal dependencies and can handle sequences of varying length. However, vanilla RNNs do suffer from the vanishing gradient problem, which makes learning long-term dependencies challenging. That drawback led to the development of gated RNN variants.)

\item[\textbf{18.}]\textbf{(True/False)}
LSTMs mitigate vanishing gradients by introducing gating mechanisms.
Answer: True. Long Short-Term Memory (LSTM) networks were designed to overcome the vanishing gradient issue in RNNs by introducing gating mechanisms that control information flow (e.g.\ input, forget, and output gates). LSTMs have a cell state and gates that regulate when to store, forget, or output information. This architecture enables gradients to flow better over long time spans, mitigating vanishing gradients and enabling the network to learn long-term dependencies.

\item[\textbf{19.}]\textbf{(Short Answer)}
What is Backpropagation Through Time (BPTT), and why is it required for training RNNs?

Answer: BPTT is the adaptation of the standard backpropagation algorithm to unfolded recurrent neural networks. When an RNN is ``unrolled'' over $T$ time steps, it can be viewed as a deep network with $T$ layers (one per time step). Backpropagation Through Time entails propagating the error gradients backward through all these time-step connections (hence ``through time'') to compute weight updates. It is necessary because an RNN's output at time $t$ depends not only on the weights at that step but also on the states (and thus inputs) from previous time steps. BPTT allows the network to assign credit (or blame) to weights based on sequence-wide outcomes by accumulating gradients over the time steps. Without BPTT, the RNN would not learn temporal relationships properly, since we must consider the influence of earlier inputs on later outputs when adjusting the recurrent weights. A compact gradient expression is sketched right after this list.

\end{enumerate}
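A compact way to see both BPTT (Question~19) and the vanishing-gradient problem (Question~17) in a single expression; this is a sketch, and the update rule $h_t = \tanh(W h_{t-1} + U x_t)$ with per-step losses $L_t$ is assumed here purely for illustration:
\[
\frac{\partial L}{\partial W}
= \sum_{t=1}^{T}\sum_{k=1}^{t}
\frac{\partial L_t}{\partial h_t}
\left(\prod_{j=k+1}^{t}\frac{\partial h_j}{\partial h_{j-1}}\right)
\frac{\partial h_k}{\partial W}.
\]
BPTT is exactly the evaluation of this double sum by unrolling the network over the $T$ time steps. Each Jacobian factor $\partial h_j/\partial h_{j-1}$ involves the recurrent matrix $W$ multiplied by a diagonal matrix of activation derivatives $\tanh'(\cdot)$; when its norm is typically below 1, the product shrinks exponentially in $t-k$ (vanishing gradients), and when it is above 1, the product can blow up (exploding gradients).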

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\end{document}