Simple and Intuitive Explanation of YOLO

Sushant Gautam
6 min read · May 16, 2023


Explore the fundamentals of YOLO and mathematical loss functions with an annotated paper and summary.

YOLO (“You Only Look Once”) is an effective real-time object recognition algorithm, first described in the seminal 2015 paper by Joseph Redmon et al. This article covers the basic intuition of the YOLO paper, uncovers hidden mathematical details, and provides a step-by-step explanation of related concepts.

For the full annotated YOLO paper and the summary, check the GitHub repo.

Why do we need YOLO?

Problem: Prior object detectors like R-CNN repurpose a classifier to perform detection, running it over many region proposals. This makes them far too slow for real-time applications like self-driving cars.

Solution: The author presents YOLO, an approach that frames object detection as a regression problem for spatially separated bounding boxes and associated class probabilities. A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes in real-time, like at 45 FPS.

Comparison of YOLO and other traditional object detection algorithms. Reference.

In the figure above, you can see that YOLO is much faster in real time, with only slightly lower accuracy.

This unified model (YOLO) has several benefits over traditional methods of object detection, including:

  1. First, YOLO is extremely fast (45 frames per second). Because detection is framed as a regression problem, no complex pipeline is needed: at test time, the network is simply run once on a new image to predict detections.
  2. Second, YOLO reasons globally about the image when making predictions. Unlike sliding-window and region-proposal techniques, YOLO sees the entire image during training and test time, so it implicitly encodes contextual information about classes as well as their appearance. It also predicts all bounding boxes across all classes for an image simultaneously.
  3. Third, YOLO learns generalizable representations of objects. When trained on natural images and tested on artwork, YOLO outperforms top detection methods like DPM and R-CNN by a wide margin. The YOLO design also enables end-to-end training and real-time speeds while maintaining high average precision.

How does YOLO work?

First, the input image is divided into an S × S grid of equally sized cells. The following image shows how an input image is divided.

Step 1: Divide the input image into an S × S grid.

In the image above, all grid cells have equal dimensions. Each grid cell detects objects whose centers fall inside it: if an object's center lies within a certain cell, that cell is responsible for detecting the object.
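This cell-assignment rule can be sketched in a few lines (a minimal illustration; the function name and the S = 7 default are assumptions, not from the paper):

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return the (row, col) of the grid cell responsible for an
    object whose center is at pixel (cx, cy)."""
    col = int(cx / img_w * S)
    row = int(cy / img_h * S)
    # Clamp in case the center lies exactly on the right/bottom edge.
    return min(row, S - 1), min(col, S - 1)

# A center at (224, 224) in a 448x448 image falls in cell (3, 3).
print(responsible_cell(224, 224, 448, 448))
```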

Second, classification and localization are applied to each grid cell. YOLO predicts B bounding boxes per cell (B = 2 in the paper), together with their corresponding class probabilities. Every bounding box consists of the following attributes:

  • Bounding box center (bx, by)
  • Width (bw)
  • Height (bh)
  • Confidence score, reflecting how confident the model is that the box contains an object and how accurate the box is
  • Classes (c1, c2, …, c20), represented by the vector c. Each grid cell predicts a single set of class probabilities, shared by its B boxes.

YOLO label and prediction details.
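The per-cell prediction layout can be sketched as a small slicing helper (an illustrative layout assuming the paper's S × S × (B·5 + C) output, with the B boxes first and the C class probabilities last; the function name is an assumption):

```python
def split_cell_prediction(cell_vec, B=2, C=20):
    """Split one grid cell's 30-dim prediction vector into its parts:
    B boxes of (x, y, w, h, confidence), then C class probabilities."""
    boxes = []
    for b in range(B):
        x, y, w, h, conf = cell_vec[b * 5:(b + 1) * 5]
        boxes.append({"x": x, "y": y, "w": w, "h": h, "confidence": conf})
    class_probs = cell_vec[B * 5:B * 5 + C]
    return boxes, class_probs
```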

Each cell can predict multiple bounding boxes, so the same object is often detected by several cells with slightly different boxes, producing many duplicate predictions.

Third, YOLO uses intersection over union (IOU) and non-maximal suppression (NMS) to produce one bounding box for each object.

Three main steps in YOLO for object detection in an input image.

Intersection Over Union:

Intersection over union measures the overlap between the actual (ground-truth) bounding box and a predicted bounding box: the area of their intersection divided by the area of their union. An IOU is calculated for each predicted bounding box.

IOU equation.
An example of computing Intersection over Unions for various bounding boxes.

The above figure shows examples of good and bad intersection over union scores.

As can be seen, predicted bounding boxes that heavily overlap the ground-truth bounding box score higher than those with less overlap. This makes intersection over union an excellent metric for evaluating custom object detectors: simply put, a predicted box with a higher IOU matches the ground truth more closely than one with a lower IOU.
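The IOU computation itself is short enough to write out (a minimal sketch assuming axis-aligned boxes given as (x1, y1, x2, y2) corner coordinates):

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Overlap widths/heights are clamped at 0 for disjoint boxes.
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```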

Non Maximal Suppression:

Then, in non-maximal suppression, YOLO first selects the bounding box with the highest confidence score. It then suppresses all remaining boxes that have a high intersection over union with that box, since they most likely cover the same object.
This step is repeated with the highest-scoring remaining box until a final bounding box is obtained for each object.

Final Bounding box per object after applying Non-Maximal Suppression
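The greedy procedure above can be sketched as follows (an illustrative version; the 0.5 IOU threshold and function names are assumptions, and the `iou` helper uses (x1, y1, x2, y2) corner boxes):

```python
def iou(box_a, box_b):
    # Intersection over union of two (x1, y1, x2, y2) boxes.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximal suppression: keep the highest-scoring box,
    drop every remaining box that overlaps it too much, repeat.
    Returns the indices of the surviving boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[i], boxes[best]) < iou_threshold]
    return keep
```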

YOLO Architecture

Inspired by the GoogLeNet architecture, YOLO's network has a total of 24 convolutional layers, followed by two fully connected layers at the end.

YOLO Architecture. Reference.

The final output of the network is a 7 × 7 × 30 tensor of predictions.
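The shape follows directly from the grid and box counts used in the paper (S = 7, B = 2, and C = 20 PASCAL VOC classes):

```python
S, B, C = 7, 2, 20       # grid size, boxes per cell, number of classes
depth = B * 5 + C        # each box contributes x, y, w, h, confidence
print((S, S, depth))     # -> (7, 7, 30)
```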

Loss Function

YOLO Loss function explanation in detail.

To summarize, the YOLO loss function is primarily composed of three terms:

  • The first two terms penalize bounding-box coordinate errors. We set λcoord = 5 to give this loss high priority and ensure the network learns to produce accurate bounding boxes.
  • The next two terms penalize confidence errors, covering both cases: a box that contains an object and a box that does not (the no-object case is down-weighted with λnoobj = 0.5).
  • The last term ensures the network predicts the correct class of objects across the 20 categories. Like the other terms, it is a regression-style (sum-squared) error rather than a cross-entropy loss.
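Written out in full, the loss from the paper is (𝟙ᵢⱼ^obj is 1 when box j of cell i is responsible for an object, and 𝟙ᵢ^obj is 1 when any object appears in cell i):

```latex
\begin{aligned}
\mathcal{L} ={} & \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
  \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
  \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
       + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left(C_i - \hat{C}_i\right)^2
 + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj}
   \left(C_i - \hat{C}_i\right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj}
   \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
```

The square roots on the width and height make errors on small boxes count more than equally sized errors on large boxes.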

YOLO Limitations:

  • YOLO struggles to detect and segregate small objects in images that appear in groups, as each grid is constrained to detect only a single object. Small objects that naturally come in groups, such as a line of ants, or a group of birds, are therefore hard for YOLO to detect and localize.
  • YOLO also has lower accuracy when compared to much slower object detection algorithms like Fast R-CNN.
  • This model learns to predict bounding boxes from data, but it struggles to generalize to objects with new or unusual aspect ratios or configurations.

Here is a detailed video explanation of the paper by the author himself:

Thanks for reading! I hope it proves beneficial to you. Feel free to use the comments section to ask questions, point out mistakes, or suggest ways to make this post better. I will try to answer them.

If you found this tutorial useful, please give it a thumbs up and share it with others.


Sushant Gautam

Interest in Computer Vision, Deep Learning Research. LinkedIn: https://www.linkedin.com/in/susan-gautam/