# Object Detection

The image classification models we have dealt with so far take an image and produce a categorical result, such as a class number. However, in many cases we do not just want to know that an object is in the picture, we also want to determine its location. This is exactly the point of **object detection**.

![Object Detection](images/Screen_Shot_2016-11-17_at_11.14.54_AM.png)

> Image from the [YOLO v2 web site](https://pjreddie.com/darknet/yolov2/)
## Naive Approach to Object Detection

A very naive approach to object detection would be the following: if we want to find a cat in a picture, we can break the picture down into a number of tiles and run image classification on each tile. Tiles that produce a sufficiently high activation can be considered to contain the object in question.

![Naive Object Detection](images/naive-detection.png)

> *Image from [Exercise Notebook](ObjectDetection-TF.ipynb)*
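
To make the idea concrete, here is a minimal sketch of the tiling approach. It assumes a hypothetical `classify` function that returns the probability that an image crop contains the target object; the tile size, stride, and threshold are illustrative:

```python
def naive_detect(image, classify, tile=64, stride=32, threshold=0.8):
    """Run a classifier over sliding tiles; keep tiles with high activation.

    image: NumPy array of shape (H, W, ...).
    classify: hypothetical function mapping a crop to an object probability.
    Returns a list of (x, y, w, h) boxes in pixel coordinates.
    """
    h, w = image.shape[:2]
    boxes = []
    for y in range(0, h - tile + 1, stride):
        for x in range(0, w - tile + 1, stride):
            crop = image[y:y + tile, x:x + tile]
            if classify(crop) > threshold:
                boxes.append((x, y, tile, tile))
    return boxes
```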

However, this approach is far from ideal, because it locates the object's bounding box only very imprecisely. For a more precise location, we need to run some sort of **regression** to predict the coordinates of bounding boxes, and for that we need specific datasets.

## Regression for Object Detection

To understand how a network can learn to predict bounding box coordinates, it helps to start with a simple example of detecting geometric shapes. [This blog post](https://towardsdatascience.com/object-detection-with-neural-networks-a4e2c46b4491) has a great gentle introduction to detecting shapes.

## Datasets for Object Detection

* [PASCAL VOC](http://host.robots.ox.ac.uk/pascal/VOC/) - 20 classes
* [COCO](http://cocodataset.org/#home) - Common Objects in Context, 80 classes, bounding boxes + segmentation masks

![COCO](images/coco-examples.jpg)

## Object Detection Metrics

### Intersection over Union

While for image classification it is easy to measure how well the algorithm performs, for object detection we need to measure both the correctness of the class and the precision of the bounding box location. For the latter, we use the so-called **Intersection over Union** (IoU), which measures how well two boxes (or two arbitrary areas) overlap.

![IoU](images/iou_equation.png)

> *Figure 2 from [this excellent blog post on IoU](https://pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/)*

The idea is simple: we divide the area of the intersection of two figures by the area of their union. For two identical areas IoU is 1, while for completely disjoint areas it is 0; in all other cases it varies between 0 and 1. We typically only consider those bounding boxes whose IoU with the ground truth exceeds a certain value.
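
As a quick sketch, IoU for two axis-aligned boxes given as `(x1, y1, x2, y2)` corner coordinates can be computed like this:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    # Corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```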

### Average Precision

Suppose we want to measure how well a given class of objects $C$ is recognized. To measure it, we use the **Average Precision** metric, which is calculated as follows.

Consider the Precision-Recall curve, which shows precision and recall as the detection threshold varies from 0 to 1. Depending on the threshold, we will get more or fewer objects detected in the image, and different values of precision and recall. The curve will look like this:

<img src="https://github.com/shwars/NeuroWorkshop/raw/master/images/ObjDetectionPrecisionRecall.png"/>

> *Image from [NeuroWorkshop](http://github.com/shwars/NeuroWorkshop)*

Average Precision for a given class $C$ is the area under this curve. More precisely, the Recall axis is typically divided into 11 points (0, 0.1, ..., 1), and Precision is averaged over all those points:

$$
AP = {1\over11}\sum_{i=0}^{10}\mbox{Precision}(\mbox{Recall}={i\over10})
$$
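
A minimal sketch of this 11-point computation, given matching precision and recall arrays from the PR curve (using the common interpolation convention of taking the maximum precision among points with recall at or above each sample level):

```python
import numpy as np

def average_precision_11pt(recall, precision):
    """11-point interpolated AP over recall levels 0.0, 0.1, ..., 1.0."""
    recall = np.asarray(recall)
    precision = np.asarray(precision)
    ap = 0.0
    for level in np.linspace(0.0, 1.0, 11):
        mask = recall >= level
        # Interpolated precision: best precision achievable at this recall
        ap += precision[mask].max() / 11 if mask.any() else 0.0
    return ap

# Toy PR curve: precision drops as recall grows
print(average_precision_11pt([0.1, 0.4, 0.7, 1.0], [1.0, 0.8, 0.6, 0.5]))  # 0.7
```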

### AP and IoU

We shall consider only those detections for which IoU is above a certain value. For example, in the PASCAL VOC dataset $\mbox{IoU Threshold} = 0.5$ is typically assumed, while in COCO, AP is measured for a range of $\mbox{IoU Threshold}$ values.

<img src="https://github.com/shwars/NeuroWorkshop/raw/master/images/ObjDetectionPrecisionRecallIoU.png"/>

> *Image from [NeuroWorkshop](http://github.com/shwars/NeuroWorkshop)*

### Mean Average Precision - mAP

The main metric for Object Detection is called **Mean Average Precision**, or **mAP**. It is the value of Average Precision, averaged across all object classes, and sometimes also over $\mbox{IoU Threshold}$. The process of calculating **mAP** is described in more detail [in this blog post](https://medium.com/@timothycarlen/understanding-the-map-evaluation-metric-for-object-detection-a07fe6962cf3), and also [here with code samples](https://gist.github.com/tarlen5/008809c3decf19313de216b9208f3734).
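
Given per-class AP values, mAP is then simply their mean (a sketch; the class names and values below are hypothetical):

```python
def mean_average_precision(ap_per_class):
    """mAP: Average Precision averaged across all object classes."""
    return sum(ap_per_class.values()) / len(ap_per_class)

print(mean_average_precision({"cat": 0.83, "dog": 0.76, "person": 0.91}))  # ≈ 0.833
```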

## Different Object Detection Approaches

There are two broad classes of object detection algorithms:

* **Region Proposal Networks** (R-CNN, Fast R-CNN, Faster R-CNN). The main idea is to generate **Regions of Interest** (ROIs) and run a CNN over them, looking for maximum activation. This is somewhat similar to the naive approach, except that the ROIs are generated in a more clever way. One of the major drawbacks of such methods is that they are slow, because they require many passes of the CNN classifier over the image.
* **One-pass** (YOLO, SSD, RetinaNet) methods. In these architectures, the network is designed to predict both classes and ROIs in a single pass.

### R-CNN: Region-Based CNN

[R-CNN](http://islab.ulsan.ac.kr/files/announcement/513/rcnn_pami.pdf) uses [Selective Search](http://www.huppelen.nl/publications/selectiveSearchDraft.pdf) to generate a hierarchical structure of ROI regions, which are then passed through a CNN feature extractor and SVM classifiers to determine the object class, and through linear regression to determine *bounding box* coordinates. [Official Paper](https://arxiv.org/abs/1311.2524)

![RCNN](images/rcnn1.png)

> *Image from van de Sande et al. ICCV'11*

![RCNN-1](images/rcnn2.png)

> *Images from [this blog](https://towardsdatascience.com/r-cnn-fast-r-cnn-faster-r-cnn-yolo-object-detection-algorithms-36d53571365e)*

### F-RCNN - Fast R-CNN

This approach is similar to R-CNN, but regions are defined after the convolutional layers have been applied.

![FRCNN](images/f-rcnn.png)

> Image from the [Official Paper](https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Girshick_Fast_R-CNN_ICCV_2015_paper.pdf), [arXiv](https://arxiv.org/pdf/1504.08083.pdf), 2015

### Faster R-CNN

The main idea of this approach is to use a neural network, the so-called *Region Proposal Network*, to predict ROIs. [Paper](https://arxiv.org/pdf/1506.01497.pdf), 2016

![FasterRCNN](images/faster-rcnn.png)

> Image from [the official paper](https://arxiv.org/pdf/1506.01497.pdf)

### R-FCN: Region-Based Fully Convolutional Network

This algorithm is even faster than Faster R-CNN. The main idea is the following:

1. We extract features using ResNet-101.
1. Features are processed by a **Position-Sensitive Score Map**. Each object of the $C$ classes is divided into $k\times k$ regions, and the network is trained to predict the parts of objects.
1. For each of the $k\times k$ parts, all score maps vote for the object class, and the class with the maximum vote is selected.

![](https://cdn-images-1.medium.com/max/840/1*JFtFIzpDhb3KsN1jran6yA.png)

> Image from the [official paper](https://arxiv.org/abs/1605.06409)

### YOLO - You Only Look Once

YOLO is a real-time one-pass algorithm. The main idea is the following:

* The image is divided into $S\times S$ cells
* For each cell, the **CNN** predicts $n$ possible objects, their *bounding box* coordinates, and a *confidence* = *probability* $\times$ IoU

![YOLO](images/yolo.png)

> Image from the [official paper](https://arxiv.org/abs/1506.02640)
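
To illustrate how such a one-pass output can be decoded, here is a sketch for a YOLO-v1-style output tensor of shape $(S, S, B\times 5 + C)$, where each cell predicts $B$ boxes `(x, y, w, h, confidence)` followed by $C$ class probabilities. The exact tensor layout differs between YOLO versions, so treat this as schematic:

```python
import numpy as np

def decode_yolo(output, S=7, B=2, C=20, score_threshold=0.2):
    """Decode a YOLO-v1-style output tensor into scored detections.

    output: array of shape (S, S, B*5 + C); per cell, B boxes of
    (x, y, w, h, confidence) followed by C class probabilities.
    The score of a box is confidence * class probability, where
    confidence itself is trained to approximate P(object) * IoU.
    """
    detections = []
    for row in range(S):
        for col in range(S):
            cell = output[row, col]
            class_probs = cell[B * 5:]
            best_class = int(np.argmax(class_probs))
            for b in range(B):
                x, y, w, h, conf = cell[b * 5:b * 5 + 5]
                score = conf * class_probs[best_class]
                if score > score_threshold:
                    detections.append((row, col, x, y, w, h, best_class, score))
    return detections

# Example with random values standing in for a trained network's output
print(len(decode_yolo(np.random.rand(7, 7, 30))))
```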

* [Good blog post](https://www.analyticsvidhya.com/blog/2018/12/practical-guide-object-detection-yolo-framewor-python/) describing YOLO
* [Official site](https://pjreddie.com/darknet/yolo/)
* YOLO: [Keras implementation](https://github.com/experiencor/keras-yolo2), [step-by-step notebook](https://github.com/experiencor/basic-yolo-keras/blob/master/Yolo%20Step-by-Step.ipynb)
* YOLO v2: [Keras implementation](https://github.com/experiencor/keras-yolo2), [step-by-step notebook](https://github.com/experiencor/keras-yolo2/blob/master/Yolo%20Step-by-Step.ipynb)

### Other Algorithms

* RetinaNet: [official paper](https://arxiv.org/abs/1708.02002)
  - [PyTorch Implementation in Torchvision](https://pytorch.org/vision/stable/_modules/torchvision/models/detection/retinanet.html)
  - [Keras Implementation](https://github.com/fizyr/keras-retinanet)
  - [Object Detection with RetinaNet](https://keras.io/examples/vision/retinanet/) in Keras Samples
* SSD (Single Shot Detector): [official paper](https://arxiv.org/abs/1512.02325)
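
For example, torchvision's built-in RetinaNet can be used out of the box for inference. A minimal sketch (the weight-loading argument varies between torchvision versions: older releases use `pretrained=True`, newer ones use `weights=...`):

```python
import torch
from torchvision.models.detection import retinanet_resnet50_fpn

# Load a RetinaNet pre-trained on COCO
model = retinanet_resnet50_fpn(pretrained=True)
model.eval()

# Detection models take a list of 3xHxW tensors with values in [0, 1]
image = torch.rand(3, 480, 640)  # stand-in for a real image tensor
with torch.no_grad():
    predictions = model([image])

# Each prediction is a dict with 'boxes', 'labels' and 'scores'
print(predictions[0]["boxes"].shape, predictions[0]["scores"][:5])
```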

## References

* [Object Detection](https://tjmachinelearning.com/lectures/1718/obj/) by Nikhil Sardana
* [A good comparison of object detection algorithms](https://lilianweng.github.io/lil-log/2018/12/27/object-detection-part-4.html)
* [Review of Deep Learning Algorithms for Object Detection](https://medium.com/comet-app/review-of-deep-learning-algorithms-for-object-detection-c1f3d437b852)
* [A Step-by-Step Introduction to the Basic Object Detection Algorithms](https://www.analyticsvidhya.com/blog/2018/10/a-step-by-step-introduction-to-the-basic-object-detection-algorithms-part-1/)
* [Implementation of Faster R-CNN in Python for Object Detection](https://www.analyticsvidhya.com/blog/2018/11/implementation-faster-r-cnn-python-object-detection/)

# Head Detection using Hollywood Heads Dataset

Lab Assignment from the [AI for Beginners Curriculum](https://github.com/microsoft/ai-for-beginners).

## Task

Counting the number of people in a video surveillance camera stream is an important task that allows us to estimate the number of visitors in shops, busy hours in a restaurant, etc. To solve this task, we need to be able to detect human heads from different angles. To train an object detection model to detect human heads, we can use the [Hollywood Heads Dataset](https://www.di.ens.fr/willow/research/headdetection/).

## The Dataset

The [Hollywood Heads Dataset](https://www.di.ens.fr/willow/research/headdetection/release/HollywoodHeads.zip) contains 369,846 human heads annotated in 224,740 movie frames from Hollywood movies. It is provided in [PASCAL VOC](https://host.robots.ox.ac.uk/pascal/VOC/) format, where for each image there is also an XML description file that looks like this:

```xml
<annotation>
  <folder>HollywoodHeads</folder>
  <filename>mov_021_149390.jpeg</filename>
  <source>
    <database>HollywoodHeads 2015 Database</database>
    <annotation>HollywoodHeads 2015</annotation>
    <image>WILLOW</image>
  </source>
  <size>
    <width>608</width>
    <height>320</height>
    <depth>3</depth>
  </size>
  <segmented>0</segmented>
  <object>
    <name>head</name>
    <bndbox>
      <xmin>201</xmin>
      <ymin>1</ymin>
      <xmax>480</xmax>
      <ymax>263</ymax>
    </bndbox>
    <difficult>0</difficult>
  </object>
  <object>
    <name>head</name>
    <bndbox>
      <xmin>3</xmin>
      <ymin>4</ymin>
      <xmax>241</xmax>
      <ymax>285</ymax>
    </bndbox>
    <difficult>0</difficult>
  </object>
</annotation>
```

In this dataset, there is only one class of objects, `head`, and for each head you get the coordinates of the bounding box. You can parse the XML using Python's standard libraries, or use [this library](https://pypi.org/project/pascal-voc/) to deal with the PASCAL VOC format directly.
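
For example, the annotation above can be parsed with the standard `xml.etree.ElementTree` module (a minimal sketch; the annotation file name below is hypothetical):

```python
import xml.etree.ElementTree as ET

def load_boxes(annotation_path):
    """Parse a PASCAL VOC XML file and return head bounding boxes."""
    root = ET.parse(annotation_path).getroot()
    boxes = []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        boxes.append(tuple(int(bndbox.find(tag).text)
                           for tag in ("xmin", "ymin", "xmax", "ymax")))
    return boxes

print(load_boxes("mov_021_149390.xml"))  # [(201, 1, 480, 263), (3, 4, 241, 285)]
```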

## Training Object Detection

You can train an object detection model in one of the following ways:

* Using [Azure Custom Vision](https://docs.microsoft.com/azure/cognitive-services/custom-vision-service/quickstarts/object-detection?tabs=visual-studio&WT.mc_id=academic-57639-dmitryso) and its Python API to programmatically train the model in the cloud. Custom Vision will not be able to use more than a few hundred images for training the model, so you may need to limit the dataset.
* Using the example from the [Keras tutorial](https://keras.io/examples/vision/retinanet/) to train a RetinaNet model.
* Using the built-in [torchvision.models.detection.RetinaNet](https://pytorch.org/vision/stable/_modules/torchvision/models/detection/retinanet.html) module in torchvision, as sketched below.
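
A rough sketch of the torchvision route, using a dummy one-image batch in place of a real `DataLoader` over the dataset (the class count follows the usual torchvision detection convention of background + object classes; check the docs for your torchvision version):

```python
import torch
from torchvision.models.detection import retinanet_resnet50_fpn

# Two classes: background + 'head' (assumed convention, see note above)
model = retinanet_resnet50_fpn(num_classes=2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
model.train()

# Dummy batch: boxes are (N, 4) in (xmin, ymin, xmax, ymax) order, labels (N,)
images = [torch.rand(3, 320, 608)]
targets = [{"boxes": torch.tensor([[201.0, 1.0, 480.0, 263.0]]),
            "labels": torch.ones(1, dtype=torch.int64)}]

losses = model(images, targets)  # in train mode, returns a dict of losses
loss = sum(losses.values())
optimizer.zero_grad()
loss.backward()
optimizer.step()
```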

## Takeaway

Object detection is a task that is frequently required in industry. While there are services that can be used to perform object detection (such as [Azure Custom Vision](https://docs.microsoft.com/azure/cognitive-services/custom-vision-service/quickstarts/object-detection?tabs=visual-studio&WT.mc_id=academic-57639-dmitryso)), it is important to understand how object detection works and to be able to train your own models.