Commit fd2f6d7 --- committed: Create answersweek47.tex (1 parent: 5952608)
1 file changed: +258 additions, 0 deletions
\documentclass[12pt]{article}
\usepackage[a4paper,margin=2.5cm]{geometry}
\usepackage{amsmath,amssymb}
\usepackage{enumitem}
\usepackage{hyperref}

\title{Test yourself questions}
\author{FYS-STK3155/4155}
\date{Last weekly exercise set}

\begin{document}
\maketitle
\section{Linear Regression}

\begin{enumerate}[leftmargin=1.2cm]

\item[\textbf{1.}]\textbf{(Multiple Choice)}
Which of the following is \emph{not} an assumption of ordinary least squares linear regression?
\begin{enumerate}[label=\alph*)]
\item Linearity between predictors and target
\item Normality of predictors/features
\end{enumerate}
Linearity: True. The relationship between the predictors and the outcome is assumed to be linear.
Normality of predictors: False. This option claims that each independent feature is normally distributed, but linear regression does not require the predictors themselves to be normally distributed.

\item[\textbf{2.}]\textbf{(True/False)}
The mean squared error cost function for linear regression is convex in the parameters, guaranteeing a unique global minimum.
Answer: True. The MSE cost in linear regression is a convex quadratic function of the parameters, so gradient-based optimization will find a global minimum (the minimizer is unique when the design matrix has full column rank). A short convexity check is given right after this list.

\end{enumerate}
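To back up the convexity claim in Question~2, here is a minimal check; the notation is introduced here only for illustration (design matrix $X\in\mathbb{R}^{n\times p}$, targets $\boldsymbol{y}$, parameters $\boldsymbol{\theta}$), none of it is fixed by the exercise text:
\[
C(\boldsymbol{\theta}) = \frac{1}{n}\,\lVert \boldsymbol{y} - X\boldsymbol{\theta}\rVert_2^{2},
\qquad
\nabla_{\boldsymbol{\theta}} C = -\frac{2}{n}\,X^{T}(\boldsymbol{y} - X\boldsymbol{\theta}),
\qquad
\nabla^{2}_{\boldsymbol{\theta}} C = \frac{2}{n}\,X^{T}X .
\]
The Hessian $\tfrac{2}{n}X^{T}X$ is positive semi-definite for any $X$, so $C$ is convex. Setting the gradient to zero gives the normal equations $X^{T}X\,\boldsymbol{\theta} = X^{T}\boldsymbol{y}$, whose solution is unique exactly when $X$ has full column rank.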

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Logistic Regression}

\begin{enumerate}[leftmargin=1.2cm,start=5]

\item[\textbf{5.}]\textbf{(Multiple Choice)}
Which statement about logistic regression is \emph{false}?
\begin{enumerate}[label=\alph*)]
\item Used for binary classification
\item Uses sigmoid to map linear scores to probabilities
\item Has an analytical closed-form solution
\item The log-loss is convex
\end{enumerate}
1. True. Logistic regression is used for binary classification problems (it outputs a probability for the positive class).

2. True. It uses the logistic (sigmoid) function to map linear combinations of features to probabilities.

3. False. Unlike linear regression, logistic regression has no analytical closed-form solution for its parameters; the coefficients must be found numerically, for example with gradient descent or Newton's method.

4. True. Its loss function (log loss) is convex, which guarantees a global optimum during training (no spurious local minima).

Summary: logistic regression is a probabilistic classifier for binary outcomes, learned via maximum likelihood. Unlike linear regression, it does not have a closed-form coefficient solver and must be fit with iterative methods. Its negative log-likelihood (cross-entropy) cost is convex, ensuring a single global minimum; see the gradient sketch right after this list.

\item[\textbf{6.}]\textbf{(True/False)}
Logistic regression produces a linear decision boundary in the input feature space.
Answer: True. Logistic regression (with no feature transformations) produces a linear decision boundary: the model is $\sigma(w^T x + b)$, and the decision boundary occurs at $w^T x + b = 0$, which is a hyperplane, i.e.\ a linear boundary.

\item[\textbf{7.}]\textbf{(Short Answer)}
Give two reasons why logistic regression is preferred over linear regression for binary classification.
Answer: First, logistic regression outputs probabilities (via the sigmoid function), which are naturally bounded between 0 and 1, whereas linear regression can produce arbitrary values not suited for classification. Second, using linear regression for classification (with a threshold) is problematic because it treats errors on 0/1 outcomes in a least-squares sense, which can lead to unstable or nonsensical thresholds and is not theoretically well-founded. Logistic regression instead optimizes a log-loss (cross-entropy) cost, which is well-suited for binary outcomes and often yields better-calibrated probability estimates. Additionally, logistic regression is less sensitive to class imbalance than linear regression with a threshold, and its loss function is convex, avoiding some of the issues linear regression would have on classification tasks.

\end{enumerate}
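To make the ``convex but no closed form'' point from Question~5 concrete, here is a brief sketch; the symbols $\boldsymbol{\theta}$ (parameters), $x_i$ (feature vectors), and $y_i\in\{0,1\}$ (labels) are assumed here purely for illustration. The negative log-likelihood and its gradient are
\[
C(\boldsymbol{\theta}) = -\sum_{i=1}^{n}\Big[y_i\log p_i + (1-y_i)\log(1-p_i)\Big],
\qquad
p_i = \sigma(\boldsymbol{\theta}^{T}x_i) = \frac{1}{1+e^{-\boldsymbol{\theta}^{T}x_i}},
\]
\[
\nabla_{\boldsymbol{\theta}} C = \sum_{i=1}^{n}\big(p_i - y_i\big)\,x_i .
\]
Because each $p_i$ depends nonlinearly on $\boldsymbol{\theta}$, the stationarity condition $\nabla_{\boldsymbol{\theta}}C = 0$ cannot be solved in closed form; one instead iterates, e.g.\ gradient descent $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta\,\nabla_{\boldsymbol{\theta}}C$ or Newton's method. Convexity of $C$ ensures these iterations are not trapped in spurious local minima.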

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Neural Networks (Feedforward)}

\begin{enumerate}[leftmargin=1.2cm,start=9]

\item[\textbf{9.}]\textbf{(Multiple Choice)}
Which statement is \emph{not} true for fully-connected neural networks?
\begin{enumerate}[label=\alph*)]
\item Without nonlinearities they reduce to a single linear model
\item Backpropagation applies the chain rule
\item A single hidden layer can approximate any continuous function
\item The loss surface is convex
\end{enumerate}
1. True. Without non-linear activation functions in the hidden layers, the network is equivalent to a single linear model, no matter how many layers are stacked.

2. True. Training deep neural networks relies on backpropagation, which uses the chain rule of calculus to compute gradients for all weights in the network.

3. True. With enough hidden units, a neural network with even a single hidden layer can approximate any continuous function on a closed interval (given mild conditions on the activation function).

4. False. The loss surface of a deep neural network is non-convex, so a local minimum of the training loss is not guaranteed to be a global minimum.

(Neural networks require non-linear activations to gain expressive power; otherwise multiple layers collapse to an equivalent single linear transformation. They are universal approximators in theory and are trained via backpropagation (chain rule for gradients). However, the loss surface for deep nets is non-convex, generally possessing many local minima and saddle points.)

\item[\textbf{10.}]\textbf{(True/False)}
Using sigmoid activations in deep networks can cause vanishing gradients.
Answer: True. Sigmoid/tanh activations squash outputs to $(0,1)$/$(-1,1)$, and their derivatives are small for large-magnitude inputs (the sigmoid derivative is at most $1/4$). In a deep network, gradients propagated backward can thus diminish exponentially, ``vanishing'' before reaching the early layers, which makes it difficult for those layers to learn.

\item[\textbf{11.}]\textbf{(Short Answer)}
Explain the vanishing gradient problem and mention one technique to mitigate it.
Answer: The vanishing gradient problem refers to the tendency of gradients to become extremely small in the early layers of a deep network during training. It occurs because gradients are the product of many partial derivatives from the chain rule. In deep networks (especially with sigmoid or $\tanh$ activations), these derivatives can be less than 1, causing the product to shrink exponentially as it is backpropagated through many layers. As a result, the early (lower) layers learn very slowly since their weights receive almost no update. A common technique to mitigate this is to use ReLU (Rectified Linear Unit) activations, or other activation functions that do not saturate, in place of sigmoids; the ReLU derivative is 0 or 1, which helps maintain larger gradients. Other strategies include careful weight initialization, batch normalization, residual connections, or using architectures like LSTMs (for RNNs) that are designed to preserve gradients.

\item[\textbf{12.}]\textbf{(Short Answer)}
Given layer sizes $n_0,n_1,\dots,n_L$, derive the total number of trainable parameters in a fully connected neural network.
Answer: In a fully-connected network, each layer $i$ (except the input layer) has a weight matrix connecting all $n_{i-1}$ neurons from the previous layer to the $n_i$ neurons of this layer, plus $n_i$ bias terms. Therefore, the number of parameters in layer $i$ is $n_{i-1}\cdot n_i$ (weights) $+\,n_i$ (biases). Summing over all layers from $1$ to $L$ gives the total number of parameters:
\[
\text{Total params} \;=\; \sum_{i=1}^{L} \Big(n_{i-1}\times n_i + n_i\Big)\,.
\]
For example, a network with architecture $[n_0, n_1, n_2]$ (one hidden layer of size $n_1$) has $n_0n_1 + n_1$ parameters in the first (input-to-hidden) layer and $n_1n_2 + n_2$ in the second (hidden-to-output) layer. A fully numerical example is worked out right after this list.

\end{enumerate}
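As a worked instance of the counting rule from Question~12 (the layer sizes below are chosen purely for illustration and do not come from the exercise set), take the architecture $[n_0,n_1,n_2]=[784,30,10]$:
\[
\underbrace{(784\cdot 30 + 30)}_{\text{input--hidden: }23550}
\;+\;
\underbrace{(30\cdot 10 + 10)}_{\text{hidden--output: }310}
\;=\; 23\,860 \text{ trainable parameters.}
\]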

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Convolutional Neural Networks}

\begin{enumerate}[leftmargin=1.2cm,start=13]

\item[\textbf{13.}]\textbf{(Multiple Choice)}
Which of the following is \emph{not} an advantage of convolutional networks?
\begin{enumerate}[label=\alph*)]
\item Local receptive fields
\item Weight sharing
\item More parameters than fully-connected layers
\item Pooling gives translation invariance
\end{enumerate}
1. True. CNNs use local receptive fields, meaning each neuron in a convolutional layer connects to only a small spatial region of the input.

2. True. CNNs employ weight sharing: the same filter (set of weights) is applied across different positions of the input, greatly reducing the number of parameters.

3. False. Thanks to local connectivity and weight sharing, convolutional layers typically have far fewer parameters than fully-connected layers applied to inputs of the same size; having more parameters would not be an advantage in any case.

4. True. Pooling layers in CNNs help achieve a degree of translation invariance by summarizing features over small neighborhoods.

Convolution and pooling confer two key benefits: far fewer parameters thanks to local connectivity and shared filters, and some robustness to translations, e.g.\ a feature's exact location is less critical after pooling. CNNs leverage these properties to generalize well to image data.

\item[\textbf{14.}]\textbf{(True/False)}
Zero-padding can preserve spatial dimensions when using $3\times3$ kernels with stride~1.
Answer: True. Using zero-padding in convolutional layers can preserve the spatial size of the input: with a $3\times 3$ kernel and stride 1, choosing a padding of $P=1$ on each side keeps an input image of size $W \times H$ the same size in the output feature map. (In general, the output width for a 1D convolution is $\frac{W - K + 2P}{S} + 1$. Setting $P = (K-1)/2$ for stride $S=1$ yields output width $W$. For a $3\times 3$ kernel, $(K-1)/2 = 1$, so padding by 1 keeps the output size equal to the input size.)

\item[\textbf{15.}]\textbf{(Short Answer)}
Derive the formula for the output width of a convolutional layer with input width $W$, filter size $K$, stride $S$, and padding $P$.
Answer: The output width $W_{\text{out}}$ is given by
\[
W_{\text{out}} = \frac{W - K + 2P}{S} + 1,
\]
assuming $(W - K + 2P)$ is divisible by $S$. An analogous formula holds for the output height using $H$, and it generalizes to stacked convolutional layers and to the spatial dimensions of intermediate feature maps. A worked numerical example is given right after this list.

\item[\textbf{16.}]\textbf{(Short Answer)}
A convolutional layer has $C_{\mathrm{in}}$ input channels, $C_{\mathrm{out}}$ filters, and kernel size $K_h \times K_w$.
Compute the number of trainable parameters (including biases).
Answer: Each filter has $C_{\text{in}} \times K_h \times K_w$ weights, and typically one bias term. With $C_{\text{out}}$ filters in the layer, the total parameter count is
\[
\big(K_h \cdot K_w \cdot C_{\text{in}} + 1\big) \times C_{\text{out}},
\]
which accounts for all filter weights plus one bias per filter. (For example, a convolutional layer with $32$ filters of size $3\times 3$ and $3$ input channels has $(3\cdot3\cdot3+1)\times 32 = 896$ parameters.)

\end{enumerate}
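As a worked application of the formula from Question~15 (the sizes below are illustrative only, not taken from the exercise set): for a $32\times 32$ input, a convolution with $K=5$, stride $S=1$, and no padding ($P=0$) gives
\[
W_{\text{out}} = \frac{32 - 5 + 0}{1} + 1 = 28,
\]
so the feature map is $28\times 28$. A subsequent $2\times 2$ max pooling with stride 2, handled by the same formula with $K=2$, $S=2$, $P=0$, gives $\frac{28-2}{2}+1 = 14$, i.e.\ a $14\times 14$ map.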

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Recurrent Neural Networks}

\begin{enumerate}[leftmargin=1.2cm,start=17]

\item[\textbf{17.}]\textbf{(Multiple Choice)}
Which statement about vanilla RNNs is \emph{false}?
\begin{enumerate}[label=\alph*)]
\item They maintain a hidden state
\item They use shared weights across time
\item They can process sequences of arbitrary length
\item They avoid vanishing gradients
\end{enumerate}
1. True. RNNs maintain a hidden state vector that is updated at each time step, allowing the network to retain information from previous inputs.

2. True. RNNs use the same weight matrices (shared weights) at every time step of the sequence, instead of having separate weights for each time step.

3. True. RNNs can, in principle, process input sequences of arbitrary length (they are not limited to a fixed input size per se).

4. False. Vanilla RNNs do not avoid the vanishing gradient problem, and as a consequence they struggle to learn long-term dependencies.

(Standard RNNs ``unfold'' a single recurrent layer across time steps and share parameters along the sequence. They maintain an internal memory (hidden state) to capture temporal dependencies and can handle sequences of varying length. However, vanilla RNNs do suffer from the vanishing gradient problem, which makes learning long-term dependencies challenging. That drawback led to the development of gated RNN variants.)

\item[\textbf{18.}]\textbf{(True/False)}
LSTMs mitigate vanishing gradients by introducing gating mechanisms.
Answer: True. Long Short-Term Memory (LSTM) networks were designed to overcome the vanishing gradient issue in RNNs by introducing gating mechanisms that control information flow (e.g.\ input, forget, and output gates). LSTMs have a cell state and gates that regulate when to store, forget, or output information. This architecture enables gradients to flow better over long time spans, mitigating vanishing gradients and enabling the network to learn long-term dependencies.

\item[\textbf{19.}]\textbf{(Short Answer)}
What is Backpropagation Through Time (BPTT), and why is it required for training RNNs?

Answer: BPTT is the adaptation of the standard backpropagation algorithm to unfolded recurrent neural networks. When an RNN is ``unrolled'' over $T$ time steps, it can be viewed as a deep network with $T$ layers (one per time step). Backpropagation Through Time entails propagating the error gradients backward through all these time-step connections (hence ``through time'') to compute weight updates. It is necessary because an RNN's output at time $t$ depends not only on the weights at that step but also on the states (and thus inputs) from previous time steps. BPTT allows the network to assign credit (or blame) to weights based on sequence-wide outcomes by accumulating gradients over the time steps. Without BPTT, the RNN would not learn temporal relationships properly, since we must consider the influence of earlier inputs on later outputs when adjusting the recurrent weights. A compact gradient expression is sketched right after this list.

\end{enumerate}
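A compact way to see both BPTT (Question~19) and the vanishing-gradient problem (Question~17) in a single expression; this is a sketch, and the update rule $h_t = \tanh(W h_{t-1} + U x_t)$ with per-step losses $L_t$ is assumed here purely for illustration:
\[
\frac{\partial L}{\partial W}
= \sum_{t=1}^{T}\sum_{k=1}^{t}
\frac{\partial L_t}{\partial h_t}
\left(\prod_{j=k+1}^{t}\frac{\partial h_j}{\partial h_{j-1}}\right)
\frac{\partial h_k}{\partial W}.
\]
BPTT is exactly the evaluation of this double sum by unrolling the network over the $T$ time steps. Each Jacobian factor $\partial h_j/\partial h_{j-1}$ involves the recurrent matrix $W$ multiplied by a diagonal matrix of activation derivatives $\tanh'(\cdot)$; when its norm is typically below 1, the product shrinks exponentially in $t-k$ (vanishing gradients), and when it is above 1, the product can blow up (exploding gradients).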

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\end{document}