
Understanding YOLOv5 Loss: A Comprehensive Analysis

A Step-by-Step Breakdown of the Source Code and Mathematical Formulation

Pablo García Mesa
Jun 10, 2024 · 25 min read

YOLOv5 🚀 has been one of the most widely used YOLO algorithms over the last few years, and it is still very popular today. YOLOv5 introduced improvements over the YOLOv4 architecture, enhancing its performance and making it one of the fastest and most accurate object detection models available. YOLOv5 is more than just a single model architecture; it is a comprehensive repository with many features for training and evaluating YOLOv5 models. It was created by Glenn Jocher, the founder of Ultralytics, in 2020, and it is still maintained by the Ultralytics team and subject to changes.

Recently, I have been trying to implement the YOLOv5 model from scratch, and the loss function has been one of the toughest parts to fully understand, as there is very little information explaining how it has actually been implemented in the repository. Moreover, there is no published paper or official mathematical formulation of the loss function (at least I have not found one). However, Ultralytics has a documentation website where some YOLOv5 concepts are explained.

While helpful as a review, the documentation was not sufficient to fully understand and implement the loss function from scratch. For this reason, I saw no other way but to analyze the source code step by step, trying to understand every detail of the implementation.

In this article, I would like to review and provide a thorough explanation of everything I have found out during my analysis of the YOLOv5 loss function. Despite the complexity of some aspects, I intend to make it easily digestible to save time for those of you who may be going through the same process or are just curious about the topic. I will discuss three main things:

  • Firstly, I will provide an in-depth conceptual explanation of the rationale behind the YOLOv5 loss implementation, drawn from my experience analyzing the source code and reviewing the documentation.
  • Secondly, I will examine each line of code step by step with a guided example to better understand the actual implementation and show how it is efficiently implemented using PyTorch in the official YOLOv5 Ultralytics repository.
  • Finally, I will provide a mathematical formulation for the reviewed implementation of the YOLOv5 loss function, which I believe can be valuable to readers.

Additionally, I have created a GitHub repository with the entire source code analysis, as well as a cleaned and fully documented implementation of the YOLOv5 and YOLOv3 loss functions, following the concepts explained in the first section.

⚠️This article assumes that you have a good understanding of PyTorch and the basics of the YOLO architecture, such as anchors, prediction layers, bounding box prediction formulas, etc.

1. Conceptual Explanation

Before diving into the code, let me first explain the concepts visually, so that later we can delve into the code and see how they are implemented.

Loss Components

The loss is made up of three different parts:

  • Class loss: It is the loss associated with the error in the classification task. It uses Binary Cross Entropy (BCE) to support multi-label classification.
  • Objectness loss: It is the loss associated with the error in detecting the presence of an object in a particular grid cell. It also uses BCE.
  • Bounding box loss: It is the loss associated with the bounding box prediction error. This is a regression task and, like YOLOv4, it uses the IoU loss (CIoU by default), which has been shown to perform better than MSE for this problem.

These losses are computed for each prediction layer and then summed up. Each loss component is weighted to control its contribution (tunable hyperparameters). Additionally, the objectness loss has an extra weight that varies for each prediction layer to ensure predictions at different scales contribute appropriately to the total loss. Below is the summarized loss formula for a single sample (P3, P4 and P5 refer to each of the three default prediction layers):

Summarized YOLOv5 loss formula for a single sample. Source: Image by the author.
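
For readers who cannot see the image, a rough plain-text sketch of that summary looks like this, using the per-layer balance weights (4.0, 1.0, 0.4) and the per-component weights λ_box, λ_obj, λ_cls that appear later in the code (my notation, not the official formulation):

L = λ_box (L_box^P3 + L_box^P4 + L_box^P5)
  + λ_obj (4.0 · L_obj^P3 + 1.0 · L_obj^P4 + 0.4 · L_obj^P5)
  + λ_cls (L_cls^P3 + L_cls^P4 + L_cls^P5)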

Bounding Box Prediction Formulas

YOLOv5 introduces new formulas for calculating the bounding box predictions that differ from previous YOLO versions:

YOLOv2-v3 bounding box prediction formulas. Source: Image by the author.
YOLOv5 bounding box prediction formulas. Source: Image by the author.

As explained in the Ultralytics documentation, these formulas address the issue of grid sensitivity in bx and by and impose a boundary to the bw and bh predictions to avoid previous problems such as runaway gradients, instabilities and NaN losses due to the unbounded exponential function.

The range for bx and by is now from -0.5 to 1.5, while bw and bh range from 0 to 4. Consequently, the maximum adjustment allowed for a predefined anchor box’s width (pw) or height (ph) to align with our ground truth (GT) box is 4 times its original size. This is crucial because if, for example, a selected anchor has pw=0.8 and the ground truth width bw=4.7, it becomes impossible for that cell to accurately predict the ground truth box. This limitation occurs because the maximum adjustment possible for pw is 4 times, resulting in 0.8 * 4 = 3.2, which is still considerably less than 4.7.
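
As a quick sanity check, here is a minimal PyTorch sketch of these formulas as they appear later in the loss code (the anchor values are illustrative):

import torch

# YOLOv5-style decoding for a single cell anchor. This mirrors the lines
# pxy.sigmoid() * 2 - 0.5 and (pwh.sigmoid() * 2) ** 2 * anchor reviewed in Section 2.
t = torch.randn(4)                     # raw network outputs (tx, ty, tw, th)
pw, ph = 0.8, 1.6                      # illustrative anchor width/height in grid units

bx = t[0].sigmoid() * 2 - 0.5          # in (-0.5, 1.5), added to the cell offset cx
by = t[1].sigmoid() * 2 - 0.5          # in (-0.5, 1.5), added to the cell offset cy
bw = (t[2].sigmoid() * 2) ** 2 * pw    # in (0, 4 * pw)
bh = (t[3].sigmoid() * 2) ** 2 * ph    # in (0, 4 * ph)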

Terminology explanation

The terminology used in the realm of YOLO can sometimes be a little confusing, so let’s first clarify some concepts that will be used in the following sections of the article:

  • Anchors: They refer to a predefined set of bounding boxes that serve as reference points for object detection models. These anchors are typically chosen to represent a diverse range of object shapes and sizes effectively. They are usually computed by running a k-means clustering algorithm on the bounding boxes from a custom dataset, extracting the best k anchors that minimize the difference between the predicted and actual bounding boxes. This helps the model generalize better to objects of varying scales and aspect ratios.
  • Cell anchor or cell prediction box: Each cell in a YOLO model can predict N bounding boxes. Typically, the number of boxes a model layer can predict is set to the number of selected anchor boxes for that layer. In this way, each cell prediction box is assigned to a different anchor box and is therefore responsible for predicting different object shapes. These cell prediction boxes will also be called cell anchors, referring to the fact that each prediction box uses a different anchor as the base prediction shape.
  • Targets or Ground Truth (GT): These refer to the object annotation information, such as the bounding box and object class.
  • Built-Targets: In this article, this term will refer to the processed targets after being output from the build_targets function. Each built-target contains information about the indices of the cell anchors that are tasked with predicting a target (object) and its associated information.

The diagram below further explains these concepts:

Diagram illustrating important YOLO terminology concepts: anchors, cell anchors, targets (ground truth), and built-targets. Source: Image by the author.

Assigning Targets to Cell Anchors

In previous YOLO versions, such as YOLOv3, the process of assigning targets (ground truth objects) to cell anchors followed a different approach than in YOLOv5.

  1. YOLOv3 approach

In YOLOv3, for each prediction layer, we determine, for each target, the grid cell that contains the center point of the ground truth object. Then, once the target has been assigned to that cell, we compare its bounding box with each of the predefined layer anchors and select the one that has the highest IoU with the ground truth box.

When the best-fitting anchor falls below a certain threshold (iou_t), it is discarded, and the target is not used in that specific layer, awaiting a better fit with the anchors of a different prediction layer. If there is a good match for that target with anchors from different layers, the same target is assigned to a cell anchor across different scales.

Lastly, imagine that more than one cell anchor is a very good fit for an object; let's say that two out of the three predefined anchors have an IoU > ignore_t (ignore_t can be equal to iou_t or different). In this case, since only one can be selected, the remaining anchors that were not selected as the best ones, but still have a very good overlap with the ground truth, are ignored in the objectness loss computation. This way, we avoid penalizing good anchor boxes that were not selected, which could cause training instabilities.

YOLOv3 summarized approach to assigning targets to anchors. Source: Image by the author.

Note that with this approach, each target can only be assigned to one cell (the one that contains the center point of the object at that specific prediction layer/scale) and only to the best-fitting anchor, provided the IoU is good enough. Additionally, the same target can be assigned at the same time to cell anchors of different prediction scales.

2. YOLOv5 Approach

In YOLOv5, as in YOLOv3, for each layer we start by determining, for each target, the grid cell that contains the center point of the ground truth object. From this point onwards, things start to differ.

First, we compare each target (ground truth) to each anchor, and we select all the anchors, not only the best one, that meet the following requirement rmax < anchor_t:

Anchor-Target fit evaluation formulas. Source: Image by the author.

But what does this mean? Well, this is just another way of evaluating whether an anchor box is a good fit for a ground truth box. By default, the parameter anchor_t is usually set to 4.

But why 4? If you remember the previous part about the new bounding box prediction formulas, the maximum change we can apply to the sides of the anchor box is 4 times their original size. Hence, if rmax, which represents the maximum ratio in either width or height, exceeds 4, we won’t be able to fit the ground truth object under any circumstances. The following diagrams illustrate this concept with two examples:

Anchors fit evaluation example: Anchors smaller than ground truth object. Source: Image by the author.
Anchors fit evaluation example: Anchors bigger than ground truth object. Source: Image by the author.

So what we are saying is, if there’s no way to perfectly fit the anchor box to the ground truth object, discard it, but select all the others that can be modified to fit the GT box.
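
To make this concrete, here is a small sketch of the rmax < anchor_t check, using the P3 anchors and the first target from the guided example of the source code section below (the exact implementation is reviewed there):

import torch

# Sketch of the anchor-fit test for one ground truth box against the P3 anchors.
anchor_t = 4.0
anchors = torch.tensor([[1.25, 1.625], [2.0, 3.75], [4.125, 2.875]])  # P3 anchors (grid units)
gt_wh = torch.tensor([6.1068, 22.7824])                               # GT width/height in grid units

r = gt_wh / anchors                       # (3, 2) width and height ratios
r_max = torch.max(r, 1 / r).max(1)[0]     # worst-case ratio per anchor
keep = r_max < anchor_t                   # which anchors could still be stretched to fit
print(r_max, keep)   # this target fails the test for all three P3 anchors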

Now that we understand how target-anchor pairs are created, let’s examine how cells are chosen.

With these new formulas, it’s important to note that predictions for each cell are no longer confined to that cell alone. Each cell can now predict x, y coordinates that extend beyond its boundaries. This is due to the added offsets, expanding the range from -0.5 to 1.5.

YOLOv3-v5 (x, y) prediction boundaries. Source: Image by the author.

For this reason, in YOLOv5, they have implemented a strategy in which they attempt to select more than one cell per target. They choose adjacent cells to the one containing the center of the object. Each main cell is divided into four sectors, and adjacent cells are selected based on the center point’s location.

If the center point is in the top-left corner, the top and left cells will also be selected, and so on. Therefore, if any of the anchors is a good fit, a minimum of one cell is selected for each ground truth object, with the possibility of adding one or two more cells depending on the cell and center point locations on the grid.
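
A small sketch of this selection rule, mirroring the gxy % 1 and gxi % 1 checks reviewed later in the source code section (the coordinates are illustrative):

import torch

# Which neighbouring cells are also selected, given the GT center in grid coordinates.
g = 0.5
grid_size = torch.tensor([40.0, 40.0])
gxy = torch.tensor([[22.3, 36.7]])        # GT center in grid coordinates
gxi = grid_size - gxy                     # same point measured from the opposite corner

j, k = ((gxy % 1 < g) & (gxy > 1)).T      # also select the left / upper neighbour?
l, m = ((gxi % 1 < g) & (gxi > 1)).T      # also select the right / lower neighbour?
print(j, k, l, m)  # here: left yes (0.3 < 0.5), up no, right no, down yes (0.3 < 0.5)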

YOLOv5 summarized approach to assigning targets to anchors. Source: Image by the author.

By implementing this change, the number of cell anchors considered to contain an object increases in each prediction layer. Now, more cells are tasked with predicting an object, rather than just one as in YOLOv3. As a result, this amplifies the number of positive samples for the model’s prediction, enhancing its sensitivity to such instances and refining its ability to distinguish objects from the background.

2. Source Code Explanation

We will follow a guided example so that everything is easier to understand. Suppose we have input a batch of 2 images of size 320x320 into the model. Our dataset has 20 classes, and the number of anchors per layer is 3. Our model uses the default three prediction layers of the YOLOv5 architecture, with strides [P3: 8, P4: 16, P5: 32].

The input variables for the loss function are p and targets:

  • p is a list of torch.Tensor objects, each one corresponding to a different prediction layer (small:P3, medium:P4 and large:P5 objects). Each tensor has a shape (batch_size, num_anchors, num_cells_y, num_cells_x, 5+num_classes).
  • targets is a torch.Tensor object of shape (num_targets, 6). Each element contains (img_id, class, x, y, w, h) for each ground truth. All coordinates and size values (x, y, w, h) are scaled in the range [0, 1]. So, if the bounding box of an object has values (280, 130, 30, 50), its scaled values would be (280/img_size_x, 130/img_size_y, 30/img_size_x, 50/img_size_y).

In this case, the prediction heads would output 3 tensors of shape:

  • P3: (2, 3, 320//8, 320//8, 5+20) = (2, 3, 40, 40, 25)
  • P4: (2, 3, 320//16, 320//16, 5+20) = (2, 3, 20, 20, 25)
  • P5: (2, 3, 320//32, 320//32, 5+20) = (2, 3, 10, 10, 25)

Let’s suppose that Image 1 has 3 objects and Image 2 has 2 objects. In total, we have 5 target objects (ground truths). Therefore, targets would have shape (5, 6).

Below is the code that defines the initial variables for our analysis:

import torch

device = 'cpu'
img_size = 320
num_classes = 20; num_layers = 3
anchor_t = 4.0

# Loss weights
balance = [4.0, 1.0, 0.4]
lambda_box = 0.05; lambda_obj = 0.7; lambda_cls = 0.3

anchors = torch.tensor([
    # P3 anchors
    [[ 1.25000,  1.62500], [ 2.00000,  3.75000], [ 4.12500,  2.87500]],
    # P4 anchors
    [[ 1.87500,  3.81250], [ 3.87500,  2.81250], [ 3.68750,  7.43750]],
    # P5 anchors
    [[ 3.62500,  2.81250], [ 4.87500,  6.18750], [11.65625, 10.18750]],
])
assert anchors.shape[0] == num_layers
num_anchors = anchors.shape[1]

targets = torch.tensor([
    [ 0.00000, 14.00000,  0.49535,  0.50528,  0.15267,  0.56956],
    [ 0.00000,  0.00000,  0.54872,  0.92491,  0.05361,  0.03183],
    [ 0.00000,  0.00000,  0.36780,  0.98716,  0.06031,  0.02567],
    [ 1.00000,  6.00000,  0.97072,  0.04398,  0.05856,  0.08796],
    [ 1.00000, 16.00000,  0.70696,  0.10348,  0.32971,  0.16793],
])
batch_size = len(targets[:, :1].unique())

strides = [8, 16, 32]
p = [
    torch.randn((batch_size, num_anchors, img_size // strides[i], img_size // strides[i], 5 + num_classes))
    for i in range(num_layers)
]

print("Targets Shape:", targets.shape)
print("Anchors Shape:", anchors.shape)
for i, pi in enumerate(p):
    print(f"Layer P{i+3} Shape:", pi.shape)

====================================================================

>>> Targets Shape: torch.Size([5, 6])
>>> Anchors Shape: torch.Size([3, 3, 2])
>>> Layer P3 Shape: torch.Size([2, 3, 40, 40, 25])
>>> Layer P4 Shape: torch.Size([2, 3, 20, 20, 25])
>>> Layer P5 Shape: torch.Size([2, 3, 10, 10, 25])

The file we are going to analyze is utils/loss.py. In case the repository changes after this article is published and the link breaks or the code changes, I will leave here the GitHub commit I used for the analysis, so you can open utils/loss.py at that commit and review the code we are about to examine.

The important code part is defined in the ComputeLoss class. Let’s review each method of this class:

__init__

class ComputeLoss:
    sort_obj_iou = False

    # Compute losses
    def __init__(self, model, autobalance=False):
        """Initializes ComputeLoss with model and autobalance option, autobalances losses if True."""
        device = next(model.parameters()).device  # get model device
        h = model.hyp  # hyperparameters

        # Define criteria
        BCEcls = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([h["cls_pw"]], device=device))
        BCEobj = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([h["obj_pw"]], device=device))

        # Class label smoothing https://arxiv.org/pdf/1902.04103.pdf eqn 3
        self.cp, self.cn = smooth_BCE(eps=h.get("label_smoothing", 0.0))  # positive, negative BCE targets

        # Focal loss
        g = h["fl_gamma"]  # focal loss gamma
        if g > 0:
            BCEcls, BCEobj = FocalLoss(BCEcls, g), FocalLoss(BCEobj, g)

        m = de_parallel(model).model[-1]  # Detect() module
        self.balance = {3: [4.0, 1.0, 0.4]}.get(m.nl, [4.0, 1.0, 0.25, 0.06, 0.02])  # P3-P7
        self.ssi = list(m.stride).index(16) if autobalance else 0  # stride 16 index
        self.BCEcls, self.BCEobj, self.gr, self.hyp, self.autobalance = BCEcls, BCEobj, 1.0, h, autobalance
        self.na = m.na  # number of anchors
        self.nc = m.nc  # number of classes
        self.nl = m.nl  # number of layers
        self.anchors = m.anchors
        self.device = device

This first part initializes the loss class function. The important things to pay attention to here are:

  • BCEcls and BCEobj are BCEWithLogitsLoss instances, with the possibility to specify a weight for the positive sample loss, which is useful in cases where the dataset is imbalanced.
  • Label smoothing is not used by default. Therefore, self.cp = 1 (class positive value) and self.cn = 0 (class negative value); see the smooth_BCE sketch after this list.
  • Focal loss is also not used by default, since g = h["fl_gamma"] = 0.
  • de_parallel() function removes DataParallel or DistributedDataParallel wrappers in case of multiple GPUs being used and returns a single GPU model. It is not important for understanding the loss function.
  • self.balance will be [4.0, 1.0, 0.4] when number of layers is 3, which is our case.
  • We can ignore self.ssi, because we are not going to use autobalance (which iteratively adjusts the balance weight of each layer's objectness loss). This feature can be useful for tuning the balance parameter on custom datasets, but it is not used by default.
  • self.na is the number of anchors in each layer and by default is 3, self.nc is the number of classes in the dataset and self.nl is the number of prediction layers, which by default is also 3.
  • self.anchors stores the predefined anchors of each prediction layer.
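
Regarding the label-smoothing bullet above: based on my reading of the repository, smooth_BCE is essentially the small helper sketched below (treat it as an approximation of the actual code), which explains why the defaults give self.cp = 1 and self.cn = 0.

def smooth_BCE(eps=0.1):
    """Return (positive, negative) BCE label values for label smoothing; eps=0 gives (1.0, 0.0)."""
    return 1.0 - 0.5 * eps, 0.5 * eps

cp, cn = smooth_BCE(eps=0.0)  # default: no smoothing -> cp = 1.0, cn = 0.0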

build_targets

The __call__ method performs the forward pass, calculating the losses for each prediction layer. Before explaining how the __call__ method computes the loss, let’s first describe the build_targets method. This method is invoked in the initial lines of the __call__ function and it is responsible for assigning targets to cell anchors and preparing them for loss computation according to the YOLOv5 formulation. Let’s go through this step by step:

def build_targets(self, p, targets):
    """Prepares model targets from input targets (image,class,x,y,w,h) for loss computation, returning class, box,
    indices, and anchors.
    """
    na, nt = self.na, targets.shape[0]  # number of anchors, targets
    tcls, tbox, indices, anch = [], [], [], []
    gain = torch.ones(7, device=self.device)  # normalized to gridspace gain
    ai = torch.arange(na, device=self.device).float().view(na, 1).repeat(1, nt)  # same as .repeat_interleave(nt)
    targets = torch.cat((targets.repeat(na, 1, 1), ai[..., None]), 2)  # append anchor indices

    g = 0.5  # bias
    off = (
        torch.tensor(
            [
                [0, 0],
                [1, 0],
                [0, 1],
                [-1, 0],
                [0, -1],  # j,k,l,m
                # [1, 1], [1, -1], [-1, 1], [-1, -1],  # jk,jm,lk,lm
            ],
            device=self.device,
        ).float()
        * g
    )  # offset

This is the first part of the function. Steps:

  1. Store the values of number of anchors and targets (na = 3, nt = 5)
  2. Initialize the output of the function (tcls, tbox, indices, anch) with empty lists.
  3. Initialize the gain tensor that will be used later for scaling the targets in each layer (shape=(7,)).
  4. Map each target to each anchor
ai = torch.arange(na, device=self.device).float().view(na, 1).repeat(1, nt)  # same as .repeat_interleave(nt)
targets = torch.cat((targets.repeat(na, 1, 1), ai[..., None]), 2) # append anchor indices

The purpose of the above 2 lines of code is to create a tensor that maps each target to each anchor. We have 3 anchors in each prediction layer, so we want to compare each target (GT) to each of the 3 anchors, resulting in 5*3=15 comparisons. To achieve this, we repeat the target tensor (Size([5,6])) 3 times along a new first dimension, creating a tensor of shape [3, 5, 6]. Then, we append the index of the anchor (ai) to each target array, resulting in a shape of [3, 5, 7], where each target contains (img_id, class, x, y, w, h, anchor_id).

Note: ai[..., None] is equivalent to ai.unsqueeze(-1), which adds a size-one dimension at the end. Size([3,5]) -> Size([3,5,1]).

ai
>>> tensor([[0., 0., 0., 0., 0.],
[1., 1., 1., 1., 1.],
[2., 2., 2., 2., 2.]]) # Anchor indices

targets
>>> tensor([
[[ 0.0000, 14.0000, 0.4954, 0.5053, 0.1527, 0.5696, 0.0000],
[ 0.0000, 0.0000, 0.5487, 0.9249, 0.0536, 0.0318, 0.0000],
[ 0.0000, 0.0000, 0.3678, 0.9872, 0.0603, 0.0257, 0.0000],
[ 1.0000, 6.0000, 0.9707, 0.0440, 0.0586, 0.0880, 0.0000],
[ 1.0000, 16.0000, 0.7070, 0.1035, 0.3297, 0.1679, 0.0000]],

[[ 0.0000, 14.0000, 0.4954, 0.5053, 0.1527, 0.5696, 1.0000],
[ 0.0000, 0.0000, 0.5487, 0.9249, 0.0536, 0.0318, 1.0000],
[ 0.0000, 0.0000, 0.3678, 0.9872, 0.0603, 0.0257, 1.0000],
[ 1.0000, 6.0000, 0.9707, 0.0440, 0.0586, 0.0880, 1.0000],
[ 1.0000, 16.0000, 0.7070, 0.1035, 0.3297, 0.1679, 1.0000]],

[[ 0.0000, 14.0000, 0.4954, 0.5053, 0.1527, 0.5696, 2.0000],
[ 0.0000, 0.0000, 0.5487, 0.9249, 0.0536, 0.0318, 2.0000],
[ 0.0000, 0.0000, 0.3678, 0.9872, 0.0603, 0.0257, 2.0000],
[ 1.0000, 6.0000, 0.9707, 0.0440, 0.0586, 0.0880, 2.0000],
[ 1.0000, 16.0000, 0.7070, 0.1035, 0.3297, 0.1679, 2.0000]]])

5. Define the offsets in each grid direction (j, k, l, m) (left, up, right, down)

# j,k,l,m
torch.tensor(
[[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1]],
device=self.device
).float() * g

These offsets will be subtracted from the built-targets' grid coordinates (gxy - offsets), so a 1 actually represents a -1 unit in that dimension. For example, [1, 0] (j) indicates subtracting 1 unit in the x-dimension, referring to the left adjacent cell.

The g term scales these adjustments to 0.5 units, which is sufficient because positive offsets [0.5, 0] and [0, 0.5] are always subtracted from values with a decimal part less than 0.5, while negative offsets [-0.5, 0] and [0, -0.5] are always subtracted from values with a decimal part greater than 0.5. This consistently changes the integer part of the grid coordinate value, and therefore the grid cell index.
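
A quick numeric illustration of this, with hypothetical grid coordinates:

# Subtracting the scaled offsets changes the integer part, i.e. the cell index.
gx = 22.3                  # fractional part < 0.5 -> left neighbour also selected (offset +0.5)
print(int(gx - 0.5))       # 21 -> the cell to the left of cell 22
gx = 22.7                  # fractional part > 0.5 -> right neighbour also selected (offset -0.5)
print(int(gx - (-0.5)))    # 23 -> the cell to the right of cell 22
print(int(gx - 0.0))       # 22 -> the main cell is always kept (offset [0, 0])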

Now, let’s explain the logic behind the target-anchors assignment:

for i in range(self.nl):
    anchors, shape = self.anchors[i], p[i].shape
    gain[2:6] = torch.tensor(shape)[[3, 2, 3, 2]]  # xyxy gain

    # Match targets to anchors
    t = targets * gain  # shape(3,n,7)
    if nt:
        # Matches
        r = t[..., 4:6] / anchors[:, None]  # wh ratio
        j = torch.max(r, 1 / r).max(2)[0] < self.hyp["anchor_t"]  # compare
        # j = wh_iou(anchors, t[:, 4:6]) > model.hyp['iou_t']  # iou(3,n)=wh_iou(anchors(3,2), gwh(n,2))
        t = t[j]  # filter

        # Offsets
        gxy = t[:, 2:4]  # grid xy
        gxi = gain[[2, 3]] - gxy  # inverse
        j, k = ((gxy % 1 < g) & (gxy > 1)).T
        l, m = ((gxi % 1 < g) & (gxi > 1)).T
        j = torch.stack((torch.ones_like(j), j, k, l, m))
        t = t.repeat((5, 1, 1))[j]
        offsets = (torch.zeros_like(gxy)[None] + off[:, None])[j]
    else:
        t = targets[0]
        offsets = 0

For each prediction layer output (let’s assume we are using the P3 output, i = 0) we get the anchors for that layer, determine the output shape and scale x, y, w, h with respect to the grid size of that layer.

anchors, shape = self.anchors[i], p[i].shape
gain[2:6] = torch.tensor(shape)[[3, 2, 3, 2]] # xyxy gain

# Match targets to anchors
t = targets * gain # shape(3,n,7)
i
>>> 0

anchors
>>> tensor([[1.2500, 1.6250],
[2.0000, 3.7500],
[4.1250, 2.8750]])

shape
>>> torch.Size([2, 3, 40, 40, 25])

gain
>>> tensor([1., 1., 40., 40., 40., 40., 1.])

t
>>> tensor([[
[ 0.0000, 14.0000, 19.8140, 20.2112, 6.1068, 22.7824, 0.0000],
[ 0.0000, 0.0000, 21.9488, 36.9964, 2.1444, 1.2732, 0.0000],
[ 0.0000, 0.0000, 14.7120, 39.4864, 2.4124, 1.0268, 0.0000],
[ 1.0000, 6.0000, 38.8288, 1.7592, 2.3424, 3.5184, 0.0000],
[ 1.0000, 16.0000, 28.2784, 4.1392, 13.1884, 6.7172, 0.0000]],

[[ 0.0000, 14.0000, 19.8140, 20.2112, 6.1068, 22.7824, 1.0000],
[ 0.0000, 0.0000, 21.9488, 36.9964, 2.1444, 1.2732, 1.0000],
[ 0.0000, 0.0000, 14.7120, 39.4864, 2.4124, 1.0268, 1.0000],
[ 1.0000, 6.0000, 38.8288, 1.7592, 2.3424, 3.5184, 1.0000],
[ 1.0000, 16.0000, 28.2784, 4.1392, 13.1884, 6.7172, 1.0000]],

[[ 0.0000, 14.0000, 19.8140, 20.2112, 6.1068, 22.7824, 2.0000],
[ 0.0000, 0.0000, 21.9488, 36.9964, 2.1444, 1.2732, 2.0000],
[ 0.0000, 0.0000, 14.7120, 39.4864, 2.4124, 1.0268, 2.0000],
[ 1.0000, 6.0000, 38.8288, 1.7592, 2.3424, 3.5184, 2.0000],
[ 1.0000, 16.0000, 28.2784, 4.1392, 13.1884, 6.7172, 2.0000]]])

We then check if the anchors meet the requirement rmax < anchor_t, which we reviewed previously. Here, r (Size([3,5,2])) contains the rw and rh target-anchor ratios. Using torch.max(r, 1 / r).max(2)[0], we obtain rmax, while j (Size([3,5])) represents a boolean mask indicating whether each target-anchor pair meets the requirement. Finally, as the last step, t is filtered to only contain those that meet this requirement, resulting in a change in its size to [num_pairs_selected, 7]:

# Matches
r = t[..., 4:6] / anchors[:, None] # wh ratio
j = torch.max(r, 1 / r).max(2)[0] < self.hyp["anchor_t"] # compare
t = t[j] # filter
j
>>> tensor([[False, True, True, True, False],
[False, True, True, True, False],
[False, True, True, True, True]])

t[j]
>>> tensor([
[ 0.0000, 0.0000, 21.9488, 36.9964, 2.1444, 1.2732, 0.0000],
[ 0.0000, 0.0000, 14.7120, 39.4864, 2.4124, 1.0268, 0.0000],
[ 1.0000, 6.0000, 38.8288, 1.7592, 2.3424, 3.5184, 0.0000],
[ 0.0000, 0.0000, 21.9488, 36.9964, 2.1444, 1.2732, 1.0000],
[ 0.0000, 0.0000, 14.7120, 39.4864, 2.4124, 1.0268, 1.0000],
[ 1.0000, 6.0000, 38.8288, 1.7592, 2.3424, 3.5184, 1.0000],
[ 0.0000, 0.0000, 21.9488, 36.9964, 2.1444, 1.2732, 2.0000],
[ 0.0000, 0.0000, 14.7120, 39.4864, 2.4124, 1.0268, 2.0000],
[ 1.0000, 6.0000, 38.8288, 1.7592, 2.3424, 3.5184, 2.0000],
[ 1.0000, 16.0000, 28.2784, 4.1392, 13.1884, 6.7172, 2.0000]])

t[j].shape
>>> torch.Size([10, 7])

Now that we have the target-anchor pairs that passed the filter, let’s assign them to the cell that contains their center point and also to the adjacent cells, as reviewed earlier, depending on the location of the center point within the cell.

# Offsets
gxy = t[:, 2:4] # grid xy
gxi = gain[[2, 3]] - gxy # inverse
j, k = ((gxy % 1 < g) & (gxy > 1)).T
l, m = ((gxi % 1 < g) & (gxi > 1)).T

To explain this part, which can be a little confusing at first, let’s clarify two things. First, the x % 1 operator is used to obtain the decimal part of a number x (e.g., 3.72 % 1 = 0.72). Second, the variables j, k, l, m correspond to each direction in the grid cell:

  • The variables j and l store, respectively, whether the center point lies on the left/right side of the cell's middle vertical line and whether there is a cell to the left/right of the current one (i.e., we are not at the grid's edge). If both conditions are true, the left/right adjacent cell will also be selected.
  • The variables k and m store, respectively, whether the center point lies above/below the cell's middle horizontal line and whether there is a cell above/below the current one (i.e., we are not at the grid's edge). If both conditions are true, the upper/lower adjacent cell will also be selected.

To compute the conditions for l and m, the strategy involves reversing the coordinates of the origin of the grid cell (gxi = gain[[2, 3]] - gxy) and applying the same conditions as in j and k, respectively. This process is illustrated in the figure below:

Example computation of conditions (j, k, l, m). Source: Image by the author.

Once all conditions are computed, a large boolean mask is created to select all main cells (where the center point lies) together with their selected adjacent cells (encoded in j, k, l, m).

j = torch.stack((torch.ones_like(j), j, k, l, m))
t = t.repeat((5, 1, 1))[j]
j
>>> tensor([
# Select main cell, where the object center point lies
[ True, True, True, True, True, True, True, True, True, True],
# Select the adjacent cell to the left of the main cell
[False, False, False, False, False, False, False, False, False, True],
# Select the adjacent cell above the main cell
[False, True, False, False, True, False, False, True, False, True],
# Select the adjacent cell to the right of the main cell
[ True, True, True, True, True, True, True, True, True, False],
# Select the adjacent cell below the main cell
[ True, False, True, True, False, True, True, False, True, False]])

t.shape
>>> torch.Size([30, 7])

So, the number of built-targets will range from the minimum of the number of first filtered target-anchor pairs (10) to three times that (30), due to the possibility of selecting up to two more cells per main cell.

However, the built-targets stored at this point in t are not accurate because the (x, y) coordinates still refer to the main cell. Therefore, the offsets are computed using the previously defined direction offsets (off) and are stored to be applied in the next and final step.

offsets = (torch.zeros_like(gxy)[None] + off[:, None])[j]

As the last step of this function, we prepare the final built-targets for loss computation:

    # Define
    bc, gxy, gwh, a = t.chunk(4, 1)  # (image, class), grid xy, grid wh, anchors
    a, (b, c) = a.long().view(-1), bc.long().T  # anchors, image, class
    gij = (gxy - offsets).long()
    gi, gj = gij.T  # grid indices

    # Append
    indices.append((b, a, gj.clamp_(0, shape[2] - 1), gi.clamp_(0, shape[3] - 1)))  # image, anchor, grid
    tbox.append(torch.cat((gxy - gij, gwh), 1))  # box
    anch.append(anchors[a])  # anchors
    tcls.append(c)  # class

return tcls, tbox, indices, anch

In this step, the corresponding grid cell indices for each built-target are computed using the previously calculated offsets (gij = (gxy - offsets).long()). This operation extracts the integer part (cell indices) of the modified (x, y) coordinates:

gxy[12:17]
>>> tensor([[14.7120, 39.4864],
[14.7120, 39.4864],
[28.2784, 4.1392],
[21.9488, 36.9964],
[14.7120, 39.4864]])

offsets[12:17]
>>> tensor([[ 0.0000, 0.5000],
[ 0.0000, 0.5000],
[ 0.0000, 0.5000],
[-0.5000, 0.0000],
[-0.5000, 0.0000]])

gxy[12:17] - offsets[12:17]
>>> tensor([[14.7120, 38.9864],
[14.7120, 38.9864],
[28.2784, 3.6392],
[22.4488, 36.9964],
[15.2120, 39.4864]])

gij[12:17]
>>> tensor([[14, 38],
[14, 38],
[28, 3],
[22, 36],
[15, 39]])

Finally, the function returns 4 outputs:

  • indices (list[tuple[Tensor]]): A list of 3 tuples, one for each layer, containing the sample (image) indices, cell anchor indices, and grid cell indices. They are used to extract the model predictions to which a ground truth object has been assigned.
  • tbox (list[Tensor]): A list of 3 tensors, one for each layer, containing target bounding boxes (x, y, w, h). The (x, y) values are normalized in the range of -0.5 to 1.5, while (w, h) values are adjusted with respect to the layer grid size, ranging from 0 to the number of grid cells along the corresponding axis.
  • anch (list[Tensor]): A list of 3 tensors, one for each layer, containing the anchor (w, h) values of each selected cell anchor.
  • tcls (list[Tensor]): A list of 3 tensors, one for each layer, containing the target class labels.

__call__

Once we’ve built the targets, the worst part is over.

def __call__(self, p, targets):  # predictions, targets
    """Performs forward pass, calculating class, box, and object loss for given predictions and targets."""
    lcls = torch.zeros(1, device=self.device)  # class loss
    lbox = torch.zeros(1, device=self.device)  # box loss
    lobj = torch.zeros(1, device=self.device)  # object loss
    tcls, tbox, indices, anchors = self.build_targets(p, targets)  # targets

    # Losses
    for i, pi in enumerate(p):  # layer index, layer predictions
        b, a, gj, gi = indices[i]  # image, anchor, gridy, gridx
        tobj = torch.zeros(pi.shape[:4], dtype=pi.dtype, device=self.device)  # target obj

        n = b.shape[0]  # number of targets
        if n:
            # pxy, pwh, _, pcls = pi[b, a, gj, gi].tensor_split((2, 4, 5), dim=1)  # faster, requires torch 1.8.0
            pxy, pwh, _, pcls = pi[b, a, gj, gi].split((2, 2, 1, self.nc), 1)  # target-subset of predictions

For each prediction layer, we extract the predictions that are responsible for detecting an object. These specific predictions, selected from the entire prediction tensor (pi) using indices calculated in build_targets, are used to compute the box loss, objectness loss, and class loss. The remaining predictions, which are not assigned to a ground truth, will only contribute to the computation of the objectness loss.

Bounding Box Regression Loss

# Regression
pxy = pxy.sigmoid() * 2 - 0.5
pwh = (pwh.sigmoid() * 2) ** 2 * anchors[i]
pbox = torch.cat((pxy, pwh), 1) # predicted box
iou = bbox_iou(pbox, tbox[i], CIoU=True).squeeze() # iou(prediction, target)
lbox += (1.0 - iou).mean() # iou loss

This part is straightforward: we apply the formulas to the bounding box predictions, calculate the CIoU (Complete Intersection over Union), and compute the loss as (1 - CIoU). The final box loss is averaged over the number of built-targets in that layer.

Define Objectness Targets

# Objectness
iou = iou.detach().clamp(0).type(tobj.dtype)
if self.sort_obj_iou:
    j = iou.argsort()
    b, a, gj, gi, iou = b[j], a[j], gj[j], gi[j], iou[j]
if self.gr < 1:
    iou = (1.0 - self.gr) + self.gr * iou
tobj[b, a, gj, gi] = iou  # iou ratio

In this part, since self.sort_obj_iou is false and self.gr is always 1.0, we can simplify it to:

# Objectness
iou = iou.detach().clamp(0).type(tobj.dtype)
tobj[b, a, gj, gi] = iou # iou ratio

This part prepares the target objectness score for computing the objectness loss in a later step. We set the target objectness score (tobj) for the predictions that should predict an object to be equal to the CIoU calculated in the previous step.

This could alternatively be set to 1.0, indicating that the model should predict there is an object there. However, by setting it to the CIoU value, the model learns to predict how well its bounding box prediction encloses the target object (tobj[b, a, gj, gi] = iou), instead of simply predicting the presence of an object regardless of the bounding box quality (tobj[b, a, gj, gi] = 1.0). This approach, as mentioned by Glenn Jocher in a GitHub Issue, helps sort out low-accuracy detections during Non-Maximum Suppression (NMS).

Classification Loss

# Classification
if self.nc > 1:  # cls loss (only if multiple classes)
    t = torch.full_like(pcls, self.cn, device=self.device)  # targets
    t[range(n), tcls[i]] = self.cp
    lcls += self.BCEcls(pcls, t)  # BCE

This part is straightforward as well. We apply the binary cross-entropy (BCE) loss to the class predictions. The variable t contains the target binary classes for each object, where 1.0 indicates the object belongs to that class and 0 indicates it does not. Remember, YOLOv5 is designed to predict multi-label objects, meaning an object can belong to multiple classes simultaneously (e.g., a dog and a husky). Similar to the bounding box loss, we average the class loss by summing all contributions and dividing by the number of built-targets and the number of classes. This is achieved using the default 'mean' reduction parameter of the BCEWithLogitsLoss function.

Objectness Loss

obji = self.BCEobj(pi[..., 4], tobj)
lobj += obji * self.balance[i] # obj loss

The last part is the objectness loss, which involves calculating the binary cross-entropy (BCE) loss between the predicted objectness values and the previously computed target objectness values (0 if no object should be detected and CIoU otherwise). Here, we also average the loss by leaving the BCE reduction parameter at its default 'mean'. Since we use all the predictions from that layer, the sum is divided by (batch_size * num_anchors * num_cells_x * num_cells_y). We also apply the corresponding layer objectness loss weight defined in the self.balance variable.

Final Output

Finally, after repeating these steps for each layer and aggregating the loss results, we apply the corresponding weights to each loss component and return the results.

lbox *= self.hyp["box"]
lobj *= self.hyp["obj"]
lcls *= self.hyp["cls"]
bs = tobj.shape[0] # batch size

return (lbox + lobj + lcls) * bs, torch.cat((lbox, lobj, lcls)).detach()

This function returns two outputs: the first one is the final aggregated loss, which is scaled by the batch size (bs), and the second one is a tensor with each loss component separated and detached from the PyTorch graph. In the train.py file (line 383), you can see that the former output will be used to backpropagate the gradients, while the latter one is solely for visualization in the progress bar during training and for computing the running mean losses. Therefore, it’s important to bear in mind that the actual loss being used is not the same as what you are visualizing, as the first one is scaled and dependent on the size of each input batch. This distinction can be important when training with dynamic input batch sizes.
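
For context, here is a rough sketch of how the two outputs are consumed during training, loosely following train.py (names are approximate and the loop is simplified):

compute_loss = ComputeLoss(model)              # built once, after the model
for imgs, targets in dataloader:               # simplified training loop
    pred = model(imgs)                         # list of 3 prediction tensors (P3, P4, P5)
    loss, loss_items = compute_loss(pred, targets)
    loss.backward()                            # only the first (batch-scaled) output drives the gradients
    optimizer.step()
    optimizer.zero_grad()
    # loss_items (lbox, lobj, lcls) is only logged and averaged for the progress bar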

3. Mathematical formulation

After this intensive analysis covering every aspect of the current YOLOv5 loss implementation, a good way to conclude would be to express it in a mathematical formulation. I believe having a mathematical formulation for this loss function, as implemented in the official source code we have just examined, can be valuable.

The following formula represents the loss function for an input batch of B samples/images:

  • The “gt” superscript indicates ground truth.
  • L is the number of prediction layers, X and Y are the number of cells along each axis, A is the number of anchors, C is the number of classes and B is the batch size.
  • Each model prediction (the last dimension of the output tensor in each layer) is in the form (x, y, w, h, obj, *classes). In the formulas, “b” represents the bounding box (x, y, w, h), “O” represents the confidence or objectness score (obj), and “y” represents the array of class predictions (*classes).
YOLOv5 loss equations. Source: Image by the author.
Ancillary equations 1: Binary Cross-Entropy, IoU and CIoU equations. Source: Image by the author.
Ancillary equations 2: CIoU expanded equations. Source: Image by the author.
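
For readers who cannot see the images, the same formulation can be sketched in LaTeX as follows (my own notation, reconstructed from the implementation reviewed in Section 2; the figures remain the reference):

\mathcal{L} = B \sum_{l=1}^{L} \left( \lambda_{box}\,\mathcal{L}_{box}^{(l)} + \lambda_{obj}\, w_{l}\,\mathcal{L}_{obj}^{(l)} + \lambda_{cls}\,\mathcal{L}_{cls}^{(l)} \right)

\mathcal{L}_{box}^{(l)} = \frac{1}{N_{l}} \sum_{n=1}^{N_{l}} \left( 1 - \mathrm{CIoU}\left(b_{n},\, b_{n}^{gt}\right) \right)

\mathcal{L}_{obj}^{(l)} = \frac{1}{B \cdot A \cdot Y \cdot X} \sum_{i,a,y,x} \mathrm{BCE}\left(O_{i,a,y,x},\, O_{i,a,y,x}^{gt}\right), \quad O^{gt} = \max(0, \mathrm{CIoU}) \text{ for cell anchors with a built-target, } 0 \text{ otherwise}

\mathcal{L}_{cls}^{(l)} = \frac{1}{N_{l} \cdot C} \sum_{n=1}^{N_{l}} \sum_{c=1}^{C} \mathrm{BCE}\left(y_{n,c},\, y_{n,c}^{gt}\right)

where N_l is the number of built-targets in layer l and w_l is the per-layer objectness balance weight (4.0, 1.0, 0.4 for P3, P4, P5).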

And this concludes our journey! I sincerely hope this article has been useful to you (or at least interesting) and relatively easy to follow, despite delving into some rather intricate subjects. If you have any questions or feedback, please don’t hesitate to share them in the comments below 😃.
