CNN Interview Notes

CNNs use convolutional layers to automatically learn spatial hierarchies of features from images, making them particularly effective for image classification, object detection, and image segmentation tasks.

Key Features of CNNs:

  • Convolutional Layers: Utilize filters to detect features like edges, textures, and shapes in images.
  • Pooling Layers: Reduce the spatial dimensions of the input, maintaining essential features while minimizing computational complexity.
  • Fully Connected Layers: Combine the features learned by previous layers to make final predictions.
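
A minimal PyTorch sketch showing how these three layer types typically compose; the channel counts and layer sizes are illustrative assumptions, not values from these notes:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolutional layer: learns local filters
    nn.ReLU(),
    nn.MaxPool2d(2),                             # pooling layer: halves the spatial dimensions
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 10),                 # fully connected layer: combines features into class scores
)

logits = model(torch.randn(1, 3, 32, 32))        # 32x32 RGB input -> output of shape (1, 10)
```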

Vision Transformers (ViTs) apply the transformer architecture to image data. Instead of convolutions, ViTs treat images as sequences of patches and utilize self-attention mechanisms to learn relationships between these patches.

Key Features of Vision Transformers:

  • Patch Embedding: Images are divided into fixed-size patches, which are flattened and projected into a high-dimensional space.
  • Self-Attention Mechanism: Allows the model to weigh the importance of different patches based on their relationships, enabling it to capture global context effectively.
  • Positional Encoding: Adds information about the position of each patch to maintain the spatial arrangement of image data.
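
A rough sketch of the patch-embedding step in PyTorch, assuming 224×224 RGB inputs, 16×16 patches, and a 768-dimensional embedding (ViT-Base-like values used purely for illustration):

```python
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)                    # (batch, channels, H, W)
patch, dim = 16, 768                                 # assumed patch size and embedding dimension

# Split the image into 16x16 patches, flatten each patch, and project it into the embedding space.
patches = img.unfold(2, patch, patch).unfold(3, patch, patch)  # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 14 * 14, 3 * patch * patch)
tokens = nn.Linear(3 * patch * patch, dim)(patches)  # (1, 196, 768): a sequence of patch tokens

# Learned positional encodings keep track of where each patch came from.
pos = nn.Parameter(torch.zeros(1, 196, dim))
tokens = tokens + pos                                # ready for the transformer's self-attention
```

In practice this projection is often implemented as a single nn.Conv2d with kernel_size and stride equal to the patch size, which produces the same result.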

Object Detection

Faster R-CNN

Faster R-CNN is a single, unified model whose architecture comprises two modules:

  • RPN (Region Proposal Network): a convolutional network that proposes candidate regions and the type of object to consider in each region.
  • Fast R-CNN: a convolutional network that extracts features from the proposed regions and outputs the bounding boxes and class labels.

YOLO (You Only Look Once)

YOLO frames object detection as a single regression problem: one network looks at the full image once and directly predicts bounding boxes and class probabilities, which makes it much faster than two-stage detectors such as Faster R-CNN.

https://yolo-docs.readthedocs.io/en/latest/0_get_start/1_introduction.html

SSD (Single Shot Detector)

SSD also detects objects in a single forward pass, but it predicts boxes from feature maps at several scales using a set of default (anchor) boxes, which helps it handle objects of different sizes.

Are CNNs invariant to translation, rotation, and scaling?

A translation is a geometric transformation that shifts all points in a given direction by the same distance. Thanks to parameter sharing in the convolutional filters and to pooling, CNNs are approximately invariant (more precisely, equivariant) to translation: a feature learned in one part of the image can be detected anywhere in it. They are not inherently invariant to rotation or scaling, which is why data augmentation with rotated and rescaled images is commonly used.

Describe the process of backpropagation in CNNs. Explain how gradients are calculated and weights are updated through convolutional layers?

Backpropagation in a convolutional layer works as in a fully connected layer, except that the same filter weights are reused at every spatial position, so their gradients are accumulated over all positions. During the backward pass, the gradient of the loss with respect to each filter weight is obtained by correlating the layer's input with the upstream gradient of the output feature map, and the gradient with respect to the input is obtained by convolving the upstream gradient with the (flipped) filter. The weights are then updated by gradient descent, w ← w − η · ∂L/∂w.
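
A minimal NumPy sketch of both gradients for a single-channel input and one filter, with stride 1 and no padding (shapes are made up for illustration):

```python
import numpy as np

def conv2d_forward(x, w):
    """Valid 2D convolution (really cross-correlation, as in most DL libraries)."""
    H, W = x.shape
    F, _ = w.shape
    out = np.zeros((H - F + 1, W - F + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+F, j:j+F] * w)
    return out

def conv2d_backward(x, w, dout):
    """Gradients of the loss w.r.t. the filter weights and the input."""
    F, _ = w.shape
    dw = np.zeros_like(w)
    dx = np.zeros_like(x)
    for i in range(dout.shape[0]):
        for j in range(dout.shape[1]):
            dw += dout[i, j] * x[i:i+F, j:j+F]   # dL/dw: input patch scaled by the upstream gradient
            dx[i:i+F, j:j+F] += dout[i, j] * w   # dL/dx: filter scattered back, scaled by the upstream gradient
    return dw, dx

x = np.random.randn(5, 5)
w = np.random.randn(3, 3)
out = conv2d_forward(x, w)                       # 3x3 output
dw, dx = conv2d_backward(x, w, np.ones_like(out))
```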

What is the role of activation functions in CNNs? Compare common activation functions like ReLU, Sigmoid, and Tanh, focusing on their advantages.

  • ReLU is the most popular and most commonly used activation function in CNNs. It outputs the input if it is positive and zero otherwise. It is very cheap to compute, and because its gradient does not saturate for positive inputs it mitigates the vanishing-gradient problem and speeds up training.
  • Tanh squashes the input to the range (-1, 1). It is zero-centered, which helps optimization, but it saturates for inputs of large magnitude.
  • Sigmoid squashes the input to the range (0, 1), which is convenient for probabilities, but it saturates, is not zero-centered, and its gradient never exceeds 0.25, so deep stacks of sigmoid layers suffer from vanishing gradients. The snippet below makes the comparison concrete.
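
```python
import numpy as np

def relu(x):    return np.maximum(0.0, x)        # cheap; gradient is 0 or 1, never saturates for x > 0
def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))  # squashes to (0, 1)
def tanh(x):    return np.tanh(x)                # squashes to (-1, 1), zero-centered

x = np.array([-2.0, 0.0, 2.0])
s = sigmoid(x)
print(relu(x), s, tanh(x))
print(s * (1 - s))   # sigmoid gradient peaks at 0.25, one source of vanishing gradients
```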

Why do we use a Pooling Layer in a CNN?

Pooling downsamples feature maps while retaining important information. Pooling layers reduce the spatial dimensions of feature maps generated by the convolutional layers. This process helps in reducing the computational complexity of the network and prevents overfitting.

  • Max pooling selects the maximum element from the region of the feature map covered by the filter.
  • Average pooling computes the average of the elements in the region of the feature map covered by the filter.
  • Global pooling reduces each channel of the feature map to a single value, so an n_h × n_w × n_c feature map becomes 1 × 1 × n_c. This is equivalent to using a filter with the same dimensions n_h × n_w as the feature map itself. All three are sketched below.
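
A minimal NumPy sketch of these pooling operations (window size and stride are illustrative):

```python
import numpy as np

def pool2d(x, size=2, stride=2, op=np.max):
    """Max or average pooling over strided windows of a 2D feature map."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = op(x[i*stride:i*stride+size, j*stride:j*stride+size])
    return out

fmap = np.arange(16.0).reshape(4, 4)
print(pool2d(fmap))              # max pooling: 4x4 -> 2x2
print(pool2d(fmap, op=np.mean))  # average pooling: 4x4 -> 2x2
print(fmap.max())                # global max pooling: the whole channel -> one value
```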

https://medium.com/@abhishekjainindore24/pooling-and-their-types-in-cnn-4a4b8a7a4611

What is the size of the feature map for a given input size image, Filter Size, Stride, and Padding amount?

The spatial size of the output volume is a function of the input volume size (W), the receptive field size of the Conv Layer neurons (F), the stride with which they are applied (S), and the amount of zero padding used on the border (P):

output size = (W − F + 2P) / S + 1

For example, for a 7×7 input and a 3×3 filter with stride 1 and pad 0 we get a (7 − 3)/1 + 1 = 5×5 output; with stride 2 we get a 3×3 output.

An input image has been converted into a matrix of size 12 X 12 along with a filter of size 3 X 3 with a Stride of 1. Determine the size of the convoluted matrix.

Using the formula above with W = 12, F = 3, S = 1 and P = 0: (12 − 3 + 0)/1 + 1 = 10, so the convolved matrix is 10 × 10.
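
The same computation as a tiny helper function:

```python
def conv_output_size(W, F, S=1, P=0):
    """Output spatial size of a conv layer: (W - F + 2P) / S + 1."""
    return (W - F + 2 * P) // S + 1

print(conv_output_size(7, 3))        # 5  (stride 1, no padding)
print(conv_output_size(7, 3, S=2))   # 3
print(conv_output_size(12, 3))       # 10 (the 12x12 example above)
```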

Explain the terms “Valid Padding” and “Same Padding” in CNN.

Padding refers to adding a border of extra values (usually zeros) around a matrix. We pad so that the filter can also be applied at the borders of the input and so that we can control the output size of the convolution.

To sum up, 'valid' padding means no padding: the output of the convolutional layer shrinks, depending on the input size and kernel size. In contrast, 'same' padding adds just enough zeros that, with stride 1, the output has the same spatial size as the input; for an odd kernel size F this means P = (F − 1)/2.
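
A short PyTorch illustration (layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)
valid = nn.Conv2d(3, 8, kernel_size=3, padding=0)  # 'valid': no padding, output shrinks
same  = nn.Conv2d(3, 8, kernel_size=3, padding=1)  # 'same' for a 3x3 kernel: P = (F - 1)/2 = 1
print(valid(x).shape)  # torch.Size([1, 8, 30, 30])
print(same(x).shape)   # torch.Size([1, 8, 32, 32])
```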

What is Stride? What is the effect of high Stride on the feature map?

Stride is the number of pixels by which the filter shifts after each evaluation of the convolution between the input and the filter. With stride 1 the filter moves one pixel at a time; with stride 2 it moves two pixels. A higher stride produces a smaller feature map, since the output size (W − F + 2P)/S + 1 shrinks as S grows: computation gets cheaper, but the representation is more aggressively downsampled and can lose fine spatial detail, as the example below shows.
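
```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)
print(nn.Conv2d(1, 1, kernel_size=3, stride=1)(x).shape)  # torch.Size([1, 1, 30, 30])
print(nn.Conv2d(1, 1, kernel_size=3, stride=2)(x).shape)  # torch.Size([1, 1, 15, 15]) -- high stride shrinks the map
```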

Explain the role of the flattening layer in CNN.

The flattening layer usually sits towards the end of the CNN architecture, where it reshapes the stack of two-dimensional feature maps into a single long vector. The output of this layer is passed to the fully connected layer.

What is the role of the Fully Connected (FC) Layer in CNN?

Fully connected layers serve as the final stage in a CNN architecture. They connect every neuron from the previous layer to every neuron in the subsequent layer, transforming the output from the convolutional and pooling layers into a single, continuous vector.

This vector is then passed through an activation function, such as a softmax function, to generate the final output probabilities for each class.

In an image classification task with 10 classes, the fully connected layer will output a 10-dimensional vector, with each element representing the probability of the input belonging to a specific class.
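
A sketch of such a classifier head in PyTorch, assuming (hypothetically) that the last pooling stage emits 64 feature maps of size 4×4:

```python
import torch
import torch.nn as nn

head = nn.Sequential(
    nn.Flatten(),                # e.g. (N, 64, 4, 4) -> (N, 1024)
    nn.Linear(64 * 4 * 4, 10),   # fully connected layer: features -> 10 class scores
    nn.Softmax(dim=1),           # scores -> probabilities (in training, prefer CrossEntropyLoss on raw logits)
)
probs = head(torch.randn(8, 64, 4, 4))  # shape (8, 10); each row sums to 1
```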

What’s the difference between batch normalization and dropout layers in a CNN?

Batch Norm is a normalization technique applied between the layers of a neural network rather than to the raw data, and it is computed over mini-batches instead of the full data set.

  • Normalizing each layer's inputs to a similar range of values speeds up learning.
  • Batch Norm reduces the internal covariate shift of the network.
  • Batch Norm also has a regularization effect: because its statistics are computed over mini-batches rather than the entire data set, the mean and variance each batch sees are slightly noisy, and this noise acts as a regularizer.

Dropout is a regularization technique that randomly zeroes out a fraction of the neurons in a layer during training (it is disabled at inference time). Blocking information from random neurons prevents them from co-adapting, which reduces overfitting and forces the network to learn more robust features that are useful in making predictions. The sketch below shows both layers in a typical convolutional block.
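
A sketch of how the two layers commonly appear together (channel counts and dropout rate are illustrative):

```python
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),   # normalizes each channel over the mini-batch
    nn.ReLU(),
    nn.Dropout(p=0.25),   # randomly zeroes 25% of activations, during training only
)
# model.train() enables dropout and batch statistics; model.eval() uses running stats and no dropout.
```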

What is the main purpose of using zero padding in a CNN?

Zero padding is a technique used in Convolutional Neural Networks (CNNs) to preserve the original input size.

It involves adding zeros to the borders of the input feature map when it is being processed by the kernel of a CNN. The main purpose of using zero padding is to avoid losing information at the boundaries of the input feature map and to control the shrinkage of dimension after applying filters larger than 1x1.

When would you prefer a 1D convolution over 2D convolutions?

  • 2D convolutions are typically used when the input data is 2D, such as an image.
  • 1D convolutions are typically used when the input data is 1D, such as a time series or text.
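
A quick PyTorch illustration of the difference:

```python
import torch
import torch.nn as nn

seq = torch.randn(1, 8, 100)     # (batch, channels, time) -- e.g. a multivariate time series
img = torch.randn(1, 3, 32, 32)  # (batch, channels, height, width) -- an RGB image

conv1d = nn.Conv1d(8, 16, kernel_size=3)  # slides along the time axis only
conv2d = nn.Conv2d(3, 16, kernel_size=3)  # slides along both spatial axes
print(conv1d(seq).shape)  # torch.Size([1, 16, 98])
print(conv2d(img).shape)  # torch.Size([1, 16, 30, 30])
```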

Explain the significance of “Parameter Sharing” and “Sparsity of connections” in CNN.

Parameter sharing means the same filter (the same set of weights) is applied at every spatial position of the input, so a feature detector that is useful in one part of the image is reused everywhere. This reduces the number of parameters dramatically compared with a fully connected layer and is what gives CNNs their translation equivariance. Sparsity of connections means each output value depends only on a small local region of the input (its receptive field) rather than on every input pixel, which further reduces computation and the number of weights. The parameter counts below make the difference concrete.
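
A short comparison of parameter counts, assuming a hypothetical 32×32 RGB input:

```python
import torch.nn as nn

conv = nn.Conv2d(3, 16, kernel_size=3)       # shared weights applied at every spatial position
fc   = nn.Linear(3 * 32 * 32, 16 * 30 * 30)  # a dense layer mapping the same input/output sizes

n_conv = sum(p.numel() for p in conv.parameters())  # 16*3*3*3 + 16 = 448
n_fc   = sum(p.numel() for p in fc.parameters())    # ~44 million
print(n_conv, n_fc)
```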

Which CNNs have you used?

EfficientNet-B0

https://theaisummer.com/cnn-architectures/
