Deep learning has opened new horizons in computer science and left a lasting mark on artificial intelligence and computer vision. However, it was not until the arrival of deep learning frameworks like PyTorch and TensorFlow that deep learning became accessible to the average developer. In this article, we focus mainly on PyTorch and how we can define custom loss functions using its nn and autograd modules. But first, a word on PyTorch.

  1. Why PyTorch?
  2. Loss Functions
  3. Why we need Custom Loss Functions
  4. Implementing Binomial Deviance Loss
  5. Conclusion


Why PyTorch?

PyTorch has long been a popular deep learning framework, largely because of its ability to create robust dynamic computational graphs. What is a computational graph, you might ask? In layman's terms, a computational graph is a record of the operations a deep learning model performs. Additionally, a computational graph is necessary for keeping track of the variables involved in the calculations that give us output from the model. These variables are used for calculating gradients in the backpropagation step of training the model.
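As a rough illustration of what "dynamic" means here, the following sketch builds a small graph on the fly. The tensor names and values are ours and purely illustrative; the point is that each result records the operation that produced it in its grad_fn attribute.

```python
import torch

# A leaf tensor that autograd should track.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# The graph is built as the operations run, so ordinary Python
# control flow can change its shape from one call to the next.
y = x * 2
if y.sum() > 10:
    z = y.pow(2).sum()
else:
    z = y.sum()

# Each non-leaf tensor remembers the operation that created it.
print(type(y.grad_fn).__name__)
print(type(z.grad_fn).__name__)
```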


The great thing about PyTorch is that not only does it take care of the graph for us, but it also lets us create custom activation and loss functions that it automatically adds to the graph. Furthermore, PyTorch has automatic differentiation built in, freeing us from the hassle of calculating gradients by hand. This means that deep learning engineers need not define everything from scratch; they can define the functions and let PyTorch do all of the background calculations.
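A minimal sketch of that idea, assuming a made-up activation called scaled_tanh (it is not part of PyTorch): we only write the forward computation, and autograd supplies the gradient, which we compare against the derivative worked out by hand.

```python
import torch

def scaled_tanh(t, scale=1.5):
    # Hypothetical custom activation, built only from differentiable torch ops.
    return scale * torch.tanh(t)

w = torch.tensor([0.0, 1.0], requires_grad=True)
out = scaled_tanh(w).sum()

# autograd walks the recorded graph and fills w.grad with d(out)/dw.
out.backward()

# Analytic derivative for comparison: scale * (1 - tanh(w)^2).
expected = 1.5 * (1 - torch.tanh(w.detach()) ** 2)
print(torch.allclose(w.grad, expected))
```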

Loss Functions

The goal of training a neural network is to optimize the weights of its neurons so that the input data is mapped to the correct output values or labels. Loss functions are responsible for evaluating the cost (the difference between the model’s output and the ground truth) and pointing the model in the right direction, so that it corrects its weights for accurate output. Loss functions are therefore essential for training a deep learning model. There is also a vast number of loss functions, and each performs well in the scenario it was designed for.

The following are some of the most common loss functions and the scenarios they best fit:

Mean Squared Error

MSE, or mean squared error, is used in regression problems, where the output is a real-valued quantity rather than a discrete label. For example, it can be used for predicting the coordinates of a bounding box around a face in an image.
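A small sketch using PyTorch's built-in nn.MSELoss. The bounding-box numbers are invented for illustration; the hand-written line shows that the loss is simply the average of the squared differences.

```python
import torch
import torch.nn as nn

# Made-up bounding boxes as (x, y, width, height); values are illustrative only.
pred = torch.tensor([[10.0, 20.0, 50.0, 60.0]])
target = torch.tensor([[12.0, 18.0, 50.0, 64.0]])

mse = nn.MSELoss()  # default reduction="mean" averages over all elements
loss = mse(pred, target)

# Same quantity written out by hand.
manual = ((pred - target) ** 2).mean()
print(loss.item(), manual.item())
```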

Binary Cross-Entropy

Also called log loss, this loss function is used for binary classification, where the output is one of two possibilities. The most straightforward example that comes to mind is classifying whether an image shows a cat or not.
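A short sketch with nn.BCEWithLogitsLoss, assuming the model outputs raw scores (logits) rather than probabilities. The scores and labels are made up; the manual computation spells out the log-loss formula for comparison.

```python
import torch
import torch.nn as nn

# Raw model scores (logits) for three images and their labels: 1 = cat, 0 = not cat.
logits = torch.tensor([2.0, -1.0, 0.5])
labels = torch.tensor([1.0, 0.0, 1.0])

# BCEWithLogitsLoss applies the sigmoid internally, which is more
# numerically stable than a separate sigmoid followed by BCELoss.
loss = nn.BCEWithLogitsLoss()(logits, labels)

# Hand-written log loss for comparison.
p = torch.sigmoid(logits)
manual = -(labels * torch.log(p) + (1 - labels) * torch.log(1 - p)).mean()
print(torch.allclose(loss, manual))
```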

Multi-Class Cross-Entropy

This loss function works similarly to binary cross-entropy. It is used for choosing one out of many classes instead of making a binary choice. It comes in handy for tasks such as object classification.
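A sketch with nn.CrossEntropyLoss on invented logits for three hypothetical classes. Note that this loss takes class indices as targets and applies log-softmax itself, which the manual version makes explicit.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Logits for two samples over three hypothetical classes.
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
targets = torch.tensor([0, 2])  # class indices, not one-hot vectors

# CrossEntropyLoss expects raw logits; it applies log-softmax itself.
loss = nn.CrossEntropyLoss()(logits, targets)

# Equivalent: negative log-probability of the correct class, averaged.
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(2), targets].mean()
print(torch.allclose(loss, manual))
```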

Why we need Custom Loss Functions

Often when implementing a deep learning model from papers, you will encounter loss functions that are not part of the framework. Some of these loss functions are specialized down to the particular model they are being used with. Hence the frameworks can’t add all of them to their libraries, and it is up to the implementer to define them for the task.

Take, for instance, the contrastive loss function. It can be used for training a model for pattern matching and recognition, or for face matching and identification. Once again, due to the variability of its use cases, it can’t be reduced to a single fixed implementation.
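To give a feel for how such a loss might be written by hand, here is a sketch of one common formulation of contrastive loss. The margin value, the label convention (1 for a matching pair, 0 otherwise), and the random inputs are all assumptions made for this example; papers differ in these details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveLoss(nn.Module):
    """One common formulation; the margin and label convention are assumptions."""

    def __init__(self, margin=1.0):
        super().__init__()
        self.margin = margin

    def forward(self, a, b, same):
        # same[i] is 1.0 when a[i] and b[i] belong together, 0.0 otherwise.
        d = F.pairwise_distance(a, b)
        # Matching pairs are pulled together.
        pull = same * d.pow(2)
        # Non-matching pairs are pushed apart until they clear the margin.
        push = (1 - same) * torch.clamp(self.margin - d, min=0).pow(2)
        return (pull + push).mean()

a = torch.randn(4, 8, requires_grad=True)
b = torch.randn(4, 8)
same = torch.tensor([1.0, 0.0, 1.0, 0.0])
loss = ContrastiveLoss()(a, b, same)
loss.backward()
print(loss.item(), a.grad.shape)
```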

Contrastive loss is just one example among hundreds of other loss functions. In this article, we will be implementing the binomial deviance loss function as an example.

Implementing Binomial Deviance Loss

The binomial deviance function was used in the paper “Deep Metric Learning for Practical Person Re-Identification” and is defined as

$$L_{dev} = \sum_{i,j} W \circ \ln\left(e^{-\alpha (S - \beta) \circ M} + 1\right)$$

Here W and M are inputs to the loss function, S is the similarity matrix calculated at run time from X, which is the output of the model, and ◦ denotes element-wise multiplication. Alpha and beta are hyperparameters that need to be set only once. Our focus is on the implementation; therefore we won’t go into detail about where all those variables come from.
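To get a feel for the formula, it can help to evaluate a single (i, j) term with plain numbers. The sketch below uses alpha = 2 and beta = 0.5, and assumes for illustration that M is +1 for a matching pair and -1 for a non-matching one; the similarity value 0.9 is made up.

```python
import math

alpha, beta = 2.0, 0.5  # example hyperparameter values

def pair_term(s, m, w=1.0):
    # One (i, j) term of the sum: w * ln(exp(-alpha * (s - beta) * m) + 1)
    return w * math.log(math.exp(-alpha * (s - beta) * m) + 1.0)

# Illustrative values only: m = +1 marks a matching pair, m = -1 a non-matching one.
similar_match = pair_term(s=0.9, m=+1)
similar_nonmatch = pair_term(s=0.9, m=-1)
print(similar_match, similar_nonmatch)
```

Under these assumptions, a highly similar pair contributes little loss when it is a true match and considerably more when it is not, which is the behaviour the loss is meant to encourage.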

We begin by importing the necessary modules:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
  • torch is the base PyTorch library that we will be using.
  • nn provides the Module base class, which lets our loss function become part of the computational graph, as discussed before.
  • nn.functional gives us access to some of the helper functions that we might use.
  • autograd's Variable was historically used to add tensors to the computational graph so that the autograd module could calculate their gradients. Since PyTorch 0.4, Variable has been merged into Tensor, so this import is only needed for legacy code.

Now that we have our imports ready, we will create a class for our custom loss function and call it BinomialDevianceLoss. Our class will inherit from nn.Module, the base class that PyTorch uses for the building blocks of a computational graph.

class BinomialDevianceLoss(nn.Module):

nn.Module is a base class, and a subclass is expected to implement two functions:

  1. __init__(), which will serve as the constructor of our loss function
  2. forward(), which is where we will perform the calculations of the loss function

Optionally, if we wanted to take control of how the gradient is calculated, we could also write a custom backward() function by subclassing torch.autograd.Function. However, it is recommended that you let PyTorch’s autograd calculate the gradients automatically. That being said, let’s define the functions stated above.
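As an aside, a hand-written gradient is normally supplied by subclassing torch.autograd.Function rather than by adding a method to the loss module. The sketch below is our own illustration, not part of the loss built in this article: it implements ln(1 + e^x) with an explicit backward and checks nothing more than that the gradient is produced.

```python
import torch

class Softplus(torch.autograd.Function):
    """ln(1 + e^x) with a hand-written gradient; illustrative only."""

    @staticmethod
    def forward(ctx, x):
        # Save the input so backward can reuse it.
        ctx.save_for_backward(x)
        return torch.log1p(torch.exp(x))

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # d/dx ln(1 + e^x) = sigmoid(x)
        return grad_output * torch.sigmoid(x)

x = torch.tensor([-1.0, 0.0, 2.0], requires_grad=True)
Softplus.apply(x).sum().backward()
print(x.grad)
```

The rest of this article relies on autograd instead, so no such class is needed for the binomial deviance loss.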

def __init__(self, alpha=2, beta=0.5, **kwargs):
  super().__init__()
  # hyperparameters of the loss
  self.alpha = alpha
  self.beta = beta
  # cosine similarity along the last (feature) dimension
  self.sim = nn.CosineSimilarity(dim=-1)

The constructor is where we should define and set all the hyperparameters. We do that after calling the parent class’s constructor with super().__init__(). We can also give our hyperparameters in the constructor’s parameter list a default value, just like we have given here. Other objects that need to be initialized before any computation should also be instantiated here, such as self.sim, for which we use the module form nn.CosineSimilarity because it can be created without inputs and called later.

def forward(self, x, m, w):
  # compute the pairwise similarity matrix: x is (N, D), s is (N, N)
  s = self.sim(x.unsqueeze(1), x.unsqueeze(0))
  # m and w are fixed inputs; they only need to be on the same device as s
  m = m.to(s.device)
  w = w.to(s.device)

  # calculate loss using the function defined in the paper
  loss = torch.mul(w, torch.log(1 + torch.exp(torch.mul(-self.alpha*(s-self.beta), m)))).sum()
  return loss

The forward function is called whenever the loss object is applied to its inputs, and it is where we calculate the loss on the model’s output. Here we assume x holds one embedding per row, so broadcasting x against itself gives the similarity of every pair. One thing to keep in mind is that the inputs should not be wrapped in Variable: tensors are tracked by autograd directly, and wrapping the similarity matrix in a new Variable would create a fresh leaf tensor that is disconnected from x, so no gradient would reach the model. M and W are constants and need no gradient. After calculating the loss, we simply return it as a single scalar tensor.

This is all there is to implementing a loss function. The backward pass and the gradients will be handled automatically by the autograd module, and the optimizer will use those gradients to update the weights.
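Putting the pieces together, a self-contained sketch might look like the following. The toy embeddings, the way M and W are built from labels (+1 for same-identity pairs, -1 otherwise, equal weights), and the pairwise similarity via broadcasting are our assumptions for the sake of a runnable example, not details taken from the paper.

```python
import torch
import torch.nn as nn

class BinomialDevianceLoss(nn.Module):
    def __init__(self, alpha=2, beta=0.5):
        super().__init__()
        self.alpha = alpha
        self.beta = beta
        self.sim = nn.CosineSimilarity(dim=-1)

    def forward(self, x, m, w):
        # (N, 1, D) against (1, N, D) broadcasts to an (N, N) similarity matrix.
        s = self.sim(x.unsqueeze(1), x.unsqueeze(0))
        return (w * torch.log(1 + torch.exp(-self.alpha * (s - self.beta) * m))).sum()

# Toy batch: 4 embeddings from 2 identities (hypothetical data).
torch.manual_seed(0)
x = torch.randn(4, 16, requires_grad=True)
labels = torch.tensor([0, 0, 1, 1])

# Assumed construction: m is +1 for same-identity pairs and -1 otherwise,
# and w weights every pair equally.
same = labels.unsqueeze(0) == labels.unsqueeze(1)
m = same.float() * 2 - 1
w = torch.full((4, 4), 1.0 / 16)

criterion = BinomialDevianceLoss()
loss = criterion(x, m, w)
loss.backward()  # gradients flow back to x through s
print(loss.item(), x.grad.shape)
```

In a real training loop, x would be the output of the model and an optimizer step would follow loss.backward(). For large exponents, torch.log(1 + torch.exp(z)) can overflow; torch.nn.functional.softplus(z) computes the same quantity more stably.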


Conclusion

Loss functions are a breeze to implement with PyTorch’s autograd and nn modules. The library takes care of everything besides the forward pass for us. Even the gradients and backpropagation are automated. The nn module offers excellent flexibility for creating custom computational graphs without having to know the nitty-gritty details of the library or the math involved.