
Focal Loss: An efficient way of handling class imbalance


Figure 1. Loss curves for different settings of the focusing parameter γ.

Recently I participated in a Kaggle competition: SIIM-ISIC Melanoma Classification. In this competition, one has to predict the probability that an image of a skin lesion contains melanoma, which makes it a binary image classification task. The evaluation metric is AUC (Area Under the ROC Curve). At first I worked on a model with cross-entropy as the loss function. Then, after some searching on the internet, I found the paper Focal Loss for Dense Object Detection, in which a team at Facebook AI Research (FAIR) introduced a new loss function: Focal Loss.

I got a good AUC score (0.92+) with this loss function, so I decided to write about it in some detail.


Object Detectors

Before moving on to the discussion of Focal Loss, let me give a short overview of the two types of object detectors, i.e. one-stage and two-stage detectors.

Two-Stage Detectors

This class of object detectors requires two stages to detect an object. The first stage scans the image and generates region proposals, and the second stage classifies those proposals and outputs bounding boxes and class labels. The accuracy is quite good, but they are slower than one-stage object detectors.

One-Stage Detectors

This class of object detectors requires only a single stage to detect an object. The image is divided into an n × n grid, where n can be any positive integer, and is passed through a convolutional neural network that detects objects and outputs the corresponding bounding boxes. Note that all grid cells are classified in a single forward pass of the network. These detectors are faster than two-stage object detectors but comparatively less accurate.

Focal Loss (an Extension of Cross-Entropy Loss):

Focal loss is essentially an extension of cross-entropy loss, designed specifically to deal with class imbalance. Cross-entropy (CE) loss is defined as

CE(p, y) = −log(p) if y = 1, and −log(1 − p) otherwise

Here y ∈ {−1, 1} is the ground-truth label and p is the estimated probability that the example belongs to the positive class (y = 1).

We can also define a variable p_t, which provides a convenient notation for the loss function:

p_t = p if y = 1, and 1 − p otherwise

So the cross-entropy loss can be rewritten as

CE(p, y) = CE(p_t) = −log(p_t)

This loss function cannot weight the relative importance of positive and negative examples, and hence a new version of it was introduced under the name balanced cross-entropy:

CE(p_t) = −α_t log(p_t)

Here a weighting factor α with range [0, 1] is introduced: it is α for the positive class and 1 − α for the negative class. Both cases are merged under the name α_t, defined analogously to p_t:

α_t = α if y = 1, and 1 − α otherwise

This α_t is what appears in the balanced CE loss above.
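To make the notation concrete, here is a minimal NumPy sketch of p_t, α_t, and the balanced CE loss just defined (the function name and the example α value are my own choices, not from the paper):

```python
import numpy as np

def balanced_cross_entropy(p, y, alpha=0.75):
    """Balanced CE(pt) = -alpha_t * log(pt); y in {-1, +1}, p = P(y = 1)."""
    pt = np.where(y == 1, p, 1 - p)              # pt as defined above
    alpha_t = np.where(y == 1, alpha, 1 - alpha)  # alpha_t as defined above
    return -alpha_t * np.log(pt)
```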

This loss function somewhat mitigates the class imbalance problem, but it is still unable to differentiate between easy and hard examples. To solve this issue, focal loss was defined.

Definition of Focal Loss:

Theoretical Definition: Focal loss can be considered a loss function that down-weights easily classified examples and gives much more importance to examples that are hard to classify.

Mathematical Definition: Focal loss is the original cross-entropy loss multiplied by a modulating factor.

The formula for focal loss is:

FL(p_t) = −(1 − p_t)^γ log(p_t)

Here (1 − p_t)^γ is the modulating factor, and γ ≥ 0 is known as the focusing parameter.

Two properties of focal loss follow from the above definition (a code sketch is given after the list):

  1. When an example is misclassified, p_t is small, so the modulating factor (1 − p_t)^γ is close to 1 and the loss is almost unaffected. On the other hand, when an example is correctly classified, p_t is close to 1, so the modulating factor is close to 0 and the loss for that example is strongly down-weighted.
  2. The focusing parameter (γ) smoothly adjusts the rate at which easily classified examples are down-weighted.
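As a concrete reference, here is a minimal NumPy sketch of the binary focal loss defined above (the function name and the optional α_t weighting are my own additions; this is an illustration, not the paper's reference implementation):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=None):
    """Binary focal loss FL(pt) = -(1 - pt)^gamma * log(pt).

    p: predicted probability of the positive class; y: labels in {-1, +1}.
    If alpha is given, the loss is additionally weighted by alpha_t.
    """
    pt = np.where(y == 1, p, 1 - p)               # pt as defined above
    loss = -((1 - pt) ** gamma) * np.log(pt)      # modulating factor times CE
    if alpha is not None:
        loss *= np.where(y == 1, alpha, 1 - alpha)  # optional alpha_t weighting
    return loss
```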

A comparison of FL with CE:

With γ = 2, an example classified with probability 0.9 incurs a loss 100x lower than with CE, and an example classified with probability 0.968 incurs a loss about 1000x lower.
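These ratios are easy to verify numerically with the formulas above (a quick back-of-the-envelope check):

```python
import math

for p in (0.9, 0.968):
    ce = -math.log(p)        # CE loss for a correct positive prediction
    fl = (1 - p) ** 2 * ce   # focal loss with gamma = 2
    print(f"p = {p}: CE / FL = {ce / fl:.0f}x")

# p = 0.9:   CE / FL = 100x
# p = 0.968: CE / FL = 977x  (roughly 1000x)
```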

The figure at the top shows FL for different values of γ. At γ = 0, FL reduces to CE loss. Notice that for γ = 0 (CE loss), even easily classified examples incur a loss of non-trivial magnitude; summed over many easy examples, these losses can overwhelm the rare class (the examples that are hard to classify).

Focal Loss (Alternate Form):

For this alternate form of focal loss (FL*), we first define a quantity x_t as:

x_t = yx

Here y ∈ {−1, 1} is the ground-truth label and x is the raw model output (logit). We can then write p_t as:

p_t = σ(x_t)

which is compatible with the definition of p_t used in the earlier form of focal loss.

We can now define focal loss in terms of x_t:

FL* = −log(σ(γx_t + β)) / γ

γ controls the steepness of the loss curve, and β controls the horizontal shift of the loss curve.
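Here is a minimal NumPy sketch of FL* as just defined (the function names are mine, and the sigmoid is written naively, without the numerical-stability tricks a production implementation would use):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def focal_loss_star(x, y, gamma=2.0, beta=1.0):
    """FL* = -log(sigmoid(gamma * xt + beta)) / gamma, with xt = y * x."""
    xt = y * x  # y in {-1, +1}, x = raw logit
    return -np.log(sigmoid(gamma * xt + beta)) / gamma
```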

Finally, let's end this discussion with a plot of the loss curves for CE, FL, and FL* (with two settings of β and γ):

FL* with β = 1 and γ = 2 seems to be a good choice.

Although focal loss was defined specifically for one-stage object detectors, it can also perform quite well on image classification tasks.
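For classification, a drop-in criterion is easy to write. The sketch below is one plausible PyTorch version operating on raw logits (the function name and defaults are my own; it relies on the identity p_t = exp(−CE) so the cross-entropy can be computed stably from logits):

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Focal loss for binary classification; targets are 0/1 floats."""
    # Stable per-example cross entropy: ce = -log(pt)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    pt = torch.exp(-ce)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - pt) ** gamma * ce).mean()
```

It can then replace nn.BCEWithLogitsLoss as the criterion in an ordinary training loop.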

References:

Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal Loss for Dense Object Detection. arXiv:1708.02002.
