A guide to object detection with Faster-RCNN and PyTorch

After working with CNNs for 2D/3D image segmentation and writing a beginner’s guide about it, I decided to try another important field in Computer Vision (CV): object detection. There are several popular architectures like RetinaNet, YOLO and SSD, and even powerful libraries like detectron2 that make object detection incredibly easy. In this tutorial, however, I want to share my approach to creating a custom dataset and using it to train an object detector with PyTorch and the Faster-RCNN architecture. I will show you how images downloaded from the internet can be annotated with bounding boxes with the help of the multi-dimensional image viewer napari. The provided code is written specifically for Faster-RCNN models, but parts might work with other model architectures (e.g. YOLO), because the general principles apply to all common object detection models that are based on anchor/default boxes. Thanks to transfer learning, you will see that training an object detector sometimes requires very few images! You can find the code and a Jupyter notebook on my GitHub repo.

Image by author

For this tutorial, I am going to train a human head detector. I can imagine that this is a common task for phone camera applications: detecting human faces or heads within an image. If you want to train your own object detector, e.g. for raccoon detection, car detection or whatever comes to mind, you’re in the right place. So please go ahead. It might be useful for you.

For training and experiment management, I will use PyTorch Lightning and neptune. If you’re not familiar with these packages, do not worry: you’ll be able to implement your own training logic and choose your own experiment tracker. Here’s the table of contents:

  1. Getting images
  2. Annotating
  3. Dataset building
  4. Faster R-CNN in PyTorch
  5. Training
  6. Inference

Getting images

In order to train an object detector with a deep neural network like Faster-RCNN, we require a dataset. For this, I downloaded 20 images (selfies) from the internet. You can do this manually or use web scraping techniques. All images are RGB or RGBA files in .jpg or .png format.

Here is the full dataset:

Training, validation & test data. Image by author

Let’s assume you have downloaded your images into /heads/input. Before adding bounding boxes to our input images, we should first rename our files so that they all follow the same pattern. An example on how to rename them is shown below. This simply renames all files within a directory to something like 000.jpg, 001.png etc. You can find the function get_filenames_of_path() in the utils.py script.
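As a rough sketch of the renaming step (not the exact code from my repo), something like the following would do. The function name rename_sequentially is made up for this example, and the demo runs on a throwaway directory instead of /heads/input.

```python
import pathlib
import tempfile


def rename_sequentially(directory: pathlib.Path) -> list:
    """Rename all files in `directory` to 000.ext, 001.ext, ... (sorted by name)."""
    files = sorted(p for p in directory.iterdir() if p.is_file())

    # First pass: move every file to a temporary name so that a new name
    # can never collide with a file that has not been renamed yet.
    staged = []
    for idx, file in enumerate(files):
        tmp = file.with_name(f"__tmp_{idx:03d}{file.suffix}")
        file.rename(tmp)
        staged.append(tmp)

    # Second pass: assign the final zero-padded names, keeping the extension.
    renamed = []
    for idx, tmp in enumerate(staged):
        final = tmp.with_name(f"{idx:03d}{tmp.suffix.lower()}")
        tmp.rename(final)
        renamed.append(final)
    return renamed


# Usage on a throwaway directory (replace it with your own image folder).
with tempfile.TemporaryDirectory() as tmp_dir:
    root = pathlib.Path(tmp_dir)
    for name in ("selfie_b.png", "holiday.jpg", "group photo.jpg"):
        (root / name).touch()
    new_names = [p.name for p in rename_sequentially(root)]
    print(new_names)
```

The two passes are only there to be safe when a directory already contains files such as 001.png.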


Annotating

There are plenty of web tools that can be used to create bounding boxes for a custom dataset. These tools usually store the information in one or several specific files, e.g. .json or .xml files. But you could also save your annotations as Python dicts if you don’t want to learn another file format. PyTorch’s Faster-RCNN implementation requires the annotations (the target in network training) to be a dict with a boxes and a labels key anyway. The boxes and labels should be torch.tensors, where boxes are supposed to be in x1y1x2y2 format (or xyxy format, as stated in the docs) and labels are integer encoded, starting at 1 (as the background is assigned 0). The easiest way to save a dict as a file is with the pickle module. Fortunately, the torch package integrates some functionality of pickle, e.g. it allows us to save an object like a dict with torch.save() and load it with torch.load(). This means that we can store the annotations that we create in a pickled file. If this annotation file happens to have the same name as the image, mapping the image to its annotation file becomes really easy, and so does creating a dataset for neural network training. If you already have a labeled dataset at hand, you can skip this section.
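To make the target format concrete, here is a minimal sketch of such a dict being written and read back with torch. The box coordinates and the file name 000.pt are made up for illustration.

```python
import pathlib
import tempfile

import torch

# One image with two annotated heads; the coordinates are illustrative only.
target = {
    "boxes": torch.tensor([[14.0, 217.0, 277.0, 532.0],
                           [200.0, 81.0, 396.0, 288.0]]),  # shape (N, 4), xyxy
    "labels": torch.tensor([1, 1], dtype=torch.int64),      # 0 is the background
}

with tempfile.TemporaryDirectory() as tmp_dir:
    # Same stem as the image (000.jpg -> 000.pt) keeps the mapping trivial.
    annotation_path = pathlib.Path(tmp_dir) / "000.pt"
    torch.save(target, annotation_path)
    loaded = torch.load(annotation_path)

print(loaded["boxes"].shape, loaded["labels"])
```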

As I recently discovered napari, a multi-dimensional image viewer for Python, I decided to use it to generate the labels/annotations for my dataset. Please do not expect full-fledged, perfectly working code for creating bounding boxes. This is just me familiarizing myself with napari and using it for my needs. If you prefer an out-of-the-box solution, I recommend taking a look at myvision.ai.

Let’s take a look at how to generate annotation files for our heads dataset with napari. For this, I heavily made use of napari’s shapes layer and created a specific Annotator class that makes annotating much easier. You can run the code within a jupyter notebook or an IPython kernel. No need to run the magic command %gui qt, as this is automatically called before starting the qt application.

This will open the napari qt-application that shows one image at a time. You can navigate through your list of images by pressing ‘n’ to get the next or ‘b’ to get the previous image (custom key-bindings).

Image by author

Now, if you would like to add a label with bounding boxes for the currently shown image, just enter the following into your IPython console or Jupyter notebook session.

annotator.add_class(label='head', color='red')

You just need to specify the label you want and the color. Now you can start using napari’s functionality to draw bounding boxes.

Note: Don’t worry if you accidentally press ‘n’ or ‘b’ on your keyboard. The created bounding boxes are saved automatically. It also doesn’t matter if you delete the image layer, as the image is read from disk every time you display it (e.g. by pressing ‘n’ or ‘b’). However, if you delete the shape layer for a label, this information is lost for this image.

Image by author

We can create as many classes as we want. For each new class, a new shape layer is created, which means we can hide specific labels if the image is cluttered with bounding boxes. We can basically do whatever we want with the bounding boxes, e.g. change their color or width.

annotator.add_class(label='eye', color='blue')
Image by author

If you want to export the annotations for this image, you can write the following:


You could also specify a name for the annotation file. When no name is given, the image’s name is taken and the .pt extension appended. I recommend this approach as this makes resuming a labeling session with the Annotator possible. Here’s an example, where annotation_ids is a list of pathlib.Path objects, similar to image_files.

annotator = Annotator(image_ids=image_files,

If the annotation files are in the right format and have the same name as the image itself, these annotations will be used. You can for example start labeling a bigger dataset, export some of the annotations that you managed to create in a certain time and resume labeling at a later time.

Let’s continue to create bounding boxes for every image we have. For this tutorial, we’ll stick to our heads bounding boxes and delete the eye layer that I showed above. Once you’re satisfied with the result, you can export all annotations in one go with:


For this project we now have two directories, something like /heads/input and /heads/target. In /heads/input, we find all images that we downloaded and in the /heads/target directory the corresponding annotations with bounding boxes and labels that we just generated. 20 images and 20 annotation files (pickled python dicts) in total. Let’s quickly take a look at the annotation files.

This gives us the following:

dict_keys(['labels', 'boxes'])
array(['head', 'head', 'head', 'head', 'head', 'head'], dtype='<U4')
[array([ 14.32894795, 217.18092301, 277.02631195, 531.98354928]),
array([199.95394483, 81.49013583, 396.43420467, 287.74013235]),
array([386.66446799, 2.24671611, 588.57235932, 247.57565934]),
array([306.33552198, 251.91776453, 510.41446591, 521.12828631]),
array([525.61183407, 266.0296064 , 741.63156727, 554.77960153]),
array([723.17762021, 116.22697735, 925.08551155, 432.11512991])]

Looks about right. Both the labels and the boxes are stored as numpy.ndarrays, with the boxes in xyxy format. Now we can build a proper dataset for network training.

Dataset building

Here’s how to build your own dataset that you can use to feed the network with batches of data. The approach is similar to my previous tutorial: 2D/3D semantic segmentation with the UNet.

Let’s take a look at the dataset class ObjectDetectionDataSet:

Builds a dataset with images and their respective targets. A target is expected to be a pickled file of a dict and should contain at least a ‘boxes’ and a ‘labels’ key. inputs and targets are expected to be a list of pathlib.Path objects. In case your labels are strings, you can use mapping (a dict) to int-encode them. Returns a dict with the following keys: ‘x’, ‘x_name’, ‘y’, ‘y_name’

To better understand the arguments, here’s some more information:

  • transform: transformations to be applied to the data.
  • use_cache: Instead of reading the data from disk every time we access them, we can iterate over the dataset once in the initialization method and store the data in memory (using multiprocessing). This is quite useful for network training, where we train in epochs.
  • mapping: As our labels are strings, e.g. ‘head’, we should integer encode them accordingly.
  • convert_to_format: If your bounding boxes happen to be in a different format, e.g. xywh, you can convert them into xyxy format with convert_to_format ‘xyxy’.
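To illustrate what mapping and convert_to_format are meant to achieve, here is a small numpy sketch. It is not the code of ObjectDetectionDataSet: the helper names are hypothetical, and I assume that xywh means top-left corner plus width and height.

```python
import numpy as np

# Hypothetical mapping; 0 stays reserved for the background class.
mapping = {"head": 1, "eye": 2}


def encode_labels(labels: np.ndarray, mapping: dict) -> np.ndarray:
    """Turn an array of string labels into an int64 array."""
    return np.array([mapping[label] for label in labels], dtype=np.int64)


def xywh_to_xyxy(boxes: np.ndarray) -> np.ndarray:
    """Convert (x, y, width, height) boxes to (x1, y1, x2, y2)."""
    boxes = np.asarray(boxes, dtype=np.float64)
    converted = boxes.copy()
    converted[:, 2] = boxes[:, 0] + boxes[:, 2]  # x2 = x + width
    converted[:, 3] = boxes[:, 1] + boxes[:, 3]  # y2 = y + height
    return converted


labels = np.array(["head", "head", "eye"])
boxes_xywh = np.array([[10.0, 20.0, 100.0, 50.0],
                       [0.0, 0.0, 30.0, 40.0],
                       [5.0, 5.0, 10.0, 10.0]])

print(encode_labels(labels, mapping))
print(xywh_to_xyxy(boxes_xywh))
```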

Let’s use this class to build the dataset for our head detector.

As you can see in this example, I use the class ComposeDouble, which allows us to stack different transformations. Clip() identifies bounding boxes that extend beyond the actual image and clips them accordingly. To augment the dataset, one can use the albumentations package, for which I wrote the AlbumentationWrapper class. In order to use any numpy-based function on the data, one can use the FunctionWrapper class. This wrapper takes in a function and an arbitrary number of arguments and returns a functools.partial. I use Double to highlight that the data comes in input-target pairs (image + annotation), as opposed to Single. Whether the input or the target should be transformed can be specified with the boolean arguments input and target. By default, only the input is transformed. For more information, I encourage you to take a look at transformations.py. In this example, we linearly scale our image and bring it into the right dimensional order: [C, H, W].
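If you just want the idea behind such a transformation stack without reading transformations.py, the following numpy sketch mimics it. All function names here (normalize_01, clip_boxes, compose_double, on_input) are stand-ins I made up for this example; they are not the classes from my repo.

```python
from functools import partial

import numpy as np


def normalize_01(image: np.ndarray) -> np.ndarray:
    """Linearly scale an 8-bit image to the range [0, 1]."""
    return image.astype(np.float32) / 255.0


def clip_boxes(boxes: np.ndarray, height: int, width: int) -> np.ndarray:
    """Clip xyxy boxes so that they do not extend beyond the image."""
    clipped = boxes.astype(np.float32)
    clipped[:, [0, 2]] = clipped[:, [0, 2]].clip(0, width)
    clipped[:, [1, 3]] = clipped[:, [1, 3]].clip(0, height)
    return clipped


def compose_double(transforms, image, boxes):
    """Apply (image, boxes) -> (image, boxes) callables one after another."""
    for transform in transforms:
        image, boxes = transform(image, boxes)
    return image, boxes


def on_input(function):
    """Wrap a function so that it only changes the image, not the boxes."""
    return lambda image, boxes: (function(image), boxes)


transforms = [
    lambda image, boxes: (image, clip_boxes(boxes, *image.shape[:2])),
    on_input(normalize_01),
    on_input(partial(np.moveaxis, source=-1, destination=0)),  # [H, W, C] -> [C, H, W]
]

image = np.random.randint(0, 256, size=(710, 1024, 3), dtype=np.uint8)
boxes = np.array([[-5.0, 10.0, 1100.0, 700.0]])  # sticks out of the image
x, y = compose_double(transforms, image, boxes)
print(x.shape, x.dtype, y)
```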

We can now take a look at a sample from the dataset with:

sample = dataset[1]

We can see that the sample is a dict with the keys: ‘x’, ‘x_name’, ‘y’, ‘y_name’.

-> torch.Size([3, 710, 1024])

These transformations are, however, not the only ones. The Faster R-CNN implementation by PyTorch adds some more, which I will talk about in the next section. But first, let us again visualize our dataset. This time, we can pass the dataset as an argument with the DatasetViewer class instead of passing a list of image paths.

This will open a napari application that we can again navigate with the keyboard buttons ‘n’ and ‘b’. There is, however, only one shape layer, which contains the bounding boxes of every label. We can assign a color to each label by passing a dict to our DatasetViewer or by changing the color within the napari viewer instance. The label is shown in the top left corner of every bounding box. You probably can barely see it, as the text’s color is white by default. But you can change the size and color, either by accessing the napari viewer instance directly with datasetviewer.viewer or by opening a small GUI application with


This function takes in the shapes layer for which we would like to change the text properties. The GUI is shown on the bottom left of the viewer and was created with magicgui.

Image by author

The dataset is now ready for network training. In the next chapter we’ll talk about the Faster R-CNN implementation.

Faster R-CNN in PyTorch

In this tutorial I made use of PyTorch’s Faster R-CNN implementation. Taking a look at the provided functions in torchvision, we see that we can easily build a Faster R-CNN model with a pretrained backbone. I decided to go for a ResNet backbone (either with or without FPN). You can take a look at the functions I created in faster_RCNN.py in my github repo. To quickly assemble your model with a ResNet backbone, you can use the get_fasterRCNN_resnet() function and specify the details such as the backbone, anchor size, aspect ratios, min_size, max_size etc. I will talk about these parameters in detail in the following section.

torchvision.models.detection.faster_rcnn.FasterRCNN has some requirements for its input. Here’s an important section from its docstring:

The input to the model is expected to be a list of tensors, each of shape [C, H, W], one for each
image, and should be in 0-1 range. Different images can have different sizes.

As you have seen, our dataset outputs the data in a different format — a dict. For this reason, we need to write our own collate_fn when instantiating a dataloader. The dataset’s output should be transformed into a list of tensors and a list of dicts for the target. Here’s how this could be done:
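A minimal sketch of such a collate function is shown below. It assumes that every sample is a dict with the keys ‘x’, ‘x_name’, ‘y’ and ‘y_name’, as described above; the name collate_double and the fake samples are only used for this illustration.

```python
import torch
from torch.utils.data import DataLoader


def collate_double(batch):
    """Turn a list of dataset samples into the inputs Faster R-CNN expects.

    Assumes every sample is a dict with the keys 'x', 'x_name', 'y', 'y_name',
    where 'y' is itself a dict holding 'boxes' and 'labels'.
    """
    x = [sample["x"] for sample in batch]            # list of [C, H, W] tensors
    y = [sample["y"] for sample in batch]            # list of target dicts
    x_name = [sample["x_name"] for sample in batch]
    y_name = [sample["y_name"] for sample in batch]
    return x, y, x_name, y_name


# Two fake samples with different image sizes, which the default collation
# could not stack into a single tensor.
samples = [
    {"x": torch.rand(3, 710, 1024), "x_name": "000.jpg", "y_name": "000.pt",
     "y": {"boxes": torch.tensor([[14.0, 217.0, 277.0, 532.0]]),
           "labels": torch.tensor([1])}},
    {"x": torch.rand(3, 333, 500), "x_name": "001.png", "y_name": "001.pt",
     "y": {"boxes": torch.tensor([[20.0, 30.0, 200.0, 300.0]]),
           "labels": torch.tensor([1])}},
]

dataloader = DataLoader(samples, batch_size=2, collate_fn=collate_double)
x, y, x_name, y_name = next(iter(dataloader))
print([tuple(img.shape) for img in x], x_name)
```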

To get a batch from your dataset, you just need to call

batch = next(iter(dataloader))

Remember that I said that there are some additional transformations happening in PyTorch’s Faster R-CNN implementation? As a matter of fact, it uses a transformer that can be found here: torchvision.models.detection.transform.GeneralizedRCNNTransform

The important arguments for this transformer are the following:

min_size (int): minimum size of the image to be rescaled before feeding it to the backbone
max_size (int): maximum size of the image to be rescaled before feeding it to the backbone
image_mean (Tuple[float, float, float]): mean values used for input normalization.
They are generally the mean values of the dataset on which the backbone has been trained
image_std (Tuple[float, float, float]): std values used for input normalization.
They are generally the std values of the dataset on which the backbone has been trained on

To investigate the behavior of this transformer, we should take a look at what our data will look like after it is transformed with GeneralizedRCNNTransform:

As our backbone was trained on ImageNet, we will use the same normalization values for mean and std. For training, I would like to have my 20 images to be comparable in size, so I choose 1024 for both, min_size and max_size. Again, I will use the DatasetViewer to visualize the dataset:

Image by author

Notice how there appears to be a padded border at the bottom. This is automatically added by the transformer to provide adequate image sizes to the model without distorting them too much. To better see the impact of the transformation, we can gather some statistics from our dataset with and without the transformer:

stats and stats_transform are dicts with the following keys:

dict_keys(['image_height', 'image_width', 'image_mean', 'image_std', 'boxes_height', 'boxes_width', 'boxes_num', 'boxes_area'])

Here’s an example:

stats['image_height'].max() -> tensor(1200.)
stats_transform['image_height'].max()-> tensor(1024.)
stats['image_height'].min() -> tensor(333.)
stats_transform['image_height'].min() -> tensor(576.)

Alright, that’s basically all you need to know about the implementation. The next section covers what is probably the most important hyperparameter for training an object detector: anchor boxes.

Anchor boxes

If you have a hard time understanding anchor boxes, you should probably read more about them first. I think this is mandatory to really understand current object detection approaches. To better understand the relationship between anchor boxes, the input image and the feature map(s) that are returned by the feature extractor (backbone), I think it is best to visualize it. This also helps in choosing good anchor sizes and aspect ratios for your problem. For this reason I created the class AnchorViewer. In this example, I will use a simple ResNet backbone (e.g. ResNet18) without FPN that outputs a (512, 32, 32) feature map when given an image of size (3, 1024, 1024).

The AnchorViewer returns a napari application with three layers: the image, the shape and the points layer. The image layer displays the image that is taken from the dataset, the shape layer shows the first anchor boxes for a given feature map location and the points layer displays all available anchor positions mapped onto the image. You probably can imagine how cluttered the image would be if one would display anchor boxes of every position. Visualizing the anchor boxes and their positions within the image makes it easier to find adequate anchor boxes. In Faster-RCNN, these boxes are compared to the ground truth boxes. Boxes that have an IoU greater than a certain threshold are considered positive cases. In layman’s terms, that’s how the target for a given image is generated for network training.
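To make the IoU comparison tangible, here is a small pure-Python sketch. The ground truth box, the anchors and the threshold of 0.5 are made up for illustration and are not the values used inside torchvision.

```python
def box_iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


ground_truth = (200.0, 80.0, 400.0, 280.0)  # one annotated head, 200 x 200
anchors = [
    (172.0, 52.0, 428.0, 308.0),   # 256 x 256 anchor centred on the head
    (236.0, 116.0, 364.0, 244.0),  # 128 x 128 anchor with the same centre
    (600.0, 600.0, 856.0, 856.0),  # 256 x 256 anchor somewhere else
]
iou_threshold = 0.5  # illustrative value only

for anchor in anchors:
    iou = box_iou(anchor, ground_truth)
    status = "positive" if iou >= iou_threshold else "not positive"
    print(anchor, round(iou, 3), status)
```

Only the anchor whose size roughly matches the head passes the threshold, which is why anchor sizes should be chosen with the object sizes in mind.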

Image by author — anchor positions and anchor boxes with aspect ratios (1.0). Input size: (3, 1024, 1024). Feature map size: (512, 32, 32)

The image here is the first of the heads dataset, which is a (3, 1024, 1024) image. You can identify the feature_map_size for example by sending a dummy torch.tensor through the backbone model:

About the anchor_size and aspect_ratios parameters

These are expected to be tuples of tuples: integers for the anchor sizes and floats for the aspect ratios. In essence, the tuple (128, 256, 512) contains the different anchor sizes for the first feature map of the backbone’s output. In our example, the backbone only returns one feature map, which is why we should write it like this: ((128, 256, 512), ). The same applies to the aspect ratios.


Training

For training, we could use our own training loop logic. However, I think it is best to make use of higher-level APIs such as Lightning, Fast.ai or Skorch. But why? This is well explained in this article. You can probably imagine that if you want to integrate functionalities and features such as logging, metrics, early stopping, mixed precision training and many more into your training loop, you’ll end up doing exactly what others have done already. However, chances are that your code won’t be as good and stable as theirs (hello spaghetti code) and you’ll spend too much time on integrating and debugging these things rather than focusing on your deep learning project (hello me). And although learning a new API can take some time, it might help you a lot in the long run.

Here, I will use Lightning, because it gives you a lot of control over training without abstracting away too much. It’s a good fit for researchers. But there’s definitely room for improvement in my opinion. Although Lightning encourages you to integrate your model and the dataset into your Lightning module, I will disregard this advice and write a LightningModule like this:

Some things you might already have noticed but I want to highlight anyway, because that’s what I like about Lightning:

  • There’s no need to call model.train() or model.eval() in training, validation or test phase
  • There’s no need to say when to use your model with or without the computation of gradients.
  • There’s no need to call zero_grad() on the optimizer or perform an update step.
  • There’s no need to call loss.backward()
  • You don’t need to worry about moving your tensors from cpu to gpu or vice versa.

However, you could do all this manually and overwrite the existing behavior as stated on their website.
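For comparison, this is roughly what the manual version of those steps looks like in plain PyTorch. The tiny linear model and the random data are placeholders, not the detector from this tutorial.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()

x_train, y_train = torch.rand(16, 4), torch.rand(16, 1)
x_val, y_val = torch.rand(8, 4), torch.rand(8, 1)
# With a GPU, model and tensors would also have to be moved with .to(device).

for epoch in range(3):
    model.train()                  # switch to training mode
    optimizer.zero_grad()          # reset gradients from the previous step
    loss = loss_fn(model(x_train), y_train)
    loss.backward()                # backpropagation
    optimizer.step()               # parameter update

    model.eval()                   # switch to evaluation mode
    with torch.no_grad():          # no gradient tracking during validation
        val_loss = loss_fn(model(x_val), y_val)
    print(epoch, loss.item(), val_loss.item())
```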

In my implementation, the __init__ method only requires a few arguments:

  • The Faster-RCNN model
  • A learning rate
  • IoU threshold

While the first two are self-explanatory, the IoU threshold deserves some attention. This argument is an important value for the evaluation of the model, for which I use the code of this GitHub repo. It computes the metrics used by the Pascal VOC challenge. This aspect of object detection probably took me the longest to get a good grasp on, so I’d recommend reading a bit about object detection metrics. The threshold essentially determines when to count a prediction as a true positive (TP), based on the IoU.
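The role of the threshold can be sketched in a few lines of pure Python. This is a simplified matching in the spirit of Pascal VOC, not the code from the repo mentioned above, and the boxes and scores are made up.

```python
def box_iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_tp_fp_fn(predictions, ground_truths, iou_threshold):
    """predictions: list of (score, box); ground_truths: list of boxes."""
    matched = set()
    tp = fp = 0
    # Higher scoring predictions get the first chance to claim a ground truth.
    for _, box in sorted(predictions, key=lambda p: p[0], reverse=True):
        ious = [box_iou(box, gt) for gt in ground_truths]
        best = max(range(len(ious)), key=ious.__getitem__) if ious else None
        if best is not None and ious[best] >= iou_threshold and best not in matched:
            matched.add(best)  # first sufficiently overlapping hit: true positive
            tp += 1
        else:
            fp += 1            # poor overlap or duplicate detection: false positive
    fn = len(ground_truths) - len(matched)
    return tp, fp, fn


ground_truths = [(0, 0, 100, 100), (200, 200, 300, 300)]
predictions = [
    (0.9, (5, 5, 105, 105)),      # good hit on the first head
    (0.8, (0, 0, 90, 90)),        # duplicate of the first head
    (0.6, (250, 250, 350, 350)),  # poor overlap with the second head
]

for threshold in (0.5, 0.1):
    print(threshold, count_tp_fp_fn(predictions, ground_truths, threshold))
```

Lowering the threshold turns the poorly localized third prediction into a true positive, which is why the threshold has to be reported together with the metric.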

The loss function

Luckily, we do not need to worry about the loss function that was proposed in the Faster-RCNN paper. It is part of the Faster-RCNN module and the loss is automatically returned when the model is in train() mode. In eval() mode, the predictions, their labels and their scores are returned as dicts. Therefore, it is sufficient to write the loss calculation in the training loop like this:

loss_dict = model(x, y)
loss = sum(loss for loss in loss_dict.values())

The optimizer and learning rate scheduling

For training, I’d recommend following the guidelines in the literature and sticking to classic SGD. The learning rate and the parameters for the learning rate scheduler were chosen arbitrarily. It’s possible to move these parameters to the init method and treat them as important hyperparameters, but they work for this example and might work for others as well.
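As an illustration of SGD combined with a scheduler, here is a small sketch. The choice of ReduceLROnPlateau and all numeric values are assumptions made for this example, not necessarily what my Lightning module uses.

```python
import torch

model = torch.nn.Linear(4, 2)  # stand-in for the detector

# Hypothetical values, chosen only for illustration.
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.001, momentum=0.9, weight_decay=0.005
)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.75, patience=2
)

# Simulate a validation metric (e.g. mAP) that stops improving.
for val_map in [0.30, 0.35, 0.35, 0.35, 0.35]:
    optimizer.step()         # would follow loss.backward() in real training
    scheduler.step(val_map)  # the scheduler watches the metric, not the loss
    print(val_map, optimizer.param_groups[0]["lr"])
```

With mode="max", the learning rate is only reduced once the metric has failed to improve for more than patience consecutive checks.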

Logging and Neptune.ai

In the Lightning module, you may have noticed that I use logging commands to keep track of my running losses and metrics. The logging software I will use is neptune. But you are not bound to neptune: you could instead use a csv logger, tensorboard, MLflow or others without changing the code. neptune is just a personal preference and only the second logger I’ve used so far, right after my rather disappointing experience with tensorboard.

Training script

Now that we have put everything together and spent some time on building the dataset, talked about the model implementation, the training logic and evaluation, we’re ready to write our training script. Here it is:

For our 20 images, I made a manual split like this:

  • training dataset: 12 images
  • validation dataset: 4 images
  • test dataset: 4 images

Apart from that, there is not much interesting code here. I create the different datasets and assign them different stacks of transformations. I use a dict with parameters that I want to have logged by neptune, which you’ll see in the next section.

This is how I personally initialize my neptune logger, there are other and probably better ways to do so. The project might need to be created beforehand.

Next, we can initialize our Faster-RCNN model with:

and init our lightning module with:

Finally, we can create our trainer with callbacks:

Now we can start training:

You can watch the training progress on neptune. Once the training is finished, we can use the best model, selected on the validation dataset according to the metric we used (mAP from Pascal VOC), to predict the bounding boxes of our test dataset:

This is what training looks like in neptune:

Image by author

Our neptune logger will create plots based on the values that are given to the logger in the lightning module:

Image by author

We can also monitor memory usage during training:

Image by author

It’s also possible to log some additional information of the experiment. For example, all packages and versions of the conda environment that was used.

This will log html tables that can be accessed in the artifacts section:

Image by author

With lightning, checkpoints (e.g. based on a metric) are automatically saved to a directory that can be specified. After training is finished, I like to upload the model’s weights to neptune and link it to the experiment. Instead of uploading the checkpoint, I prefer to upload the model itself, for which I use:

With lightning still undergoing many changes with every release, I like the model to be separate from the lightning model. This is just personal choice.


Inference

We can now use our trained model to make some predictions on similar but unseen data. Let’s download some more images (e.g. selfies) from Google, which I will store in /heads/test:

Image by author

To load our model from neptune or from disk, we can write a simple script:

Again, I use a dictionary at the beginning of my script to allow some customization. As my dataset does not have a target, I use ComposeSingle, FunctionWrapperSingle, as well as ObjectDetectionDatasetSingle with collate_single. To visualize this dataset, I will use the DatasetViewerSingle class. Instead of downloading the model from neptune, I’ll just load the checkpoint and extract the model. I use the parameters saved in the experiment to initialize the correct model and load the weights.

For inference, I simply loop through the dataset and predict the bounding boxes of each image. The resulting dictionary with boxes, labels and scores for every image is saved in the specified path:

In order to visualize the results, I can create a dataset the same way I created the training dataset:

And here’s the result:

The results already look pretty good! However, there might be redundant, overlapping bounding boxes that we need to get rid of. The first thing that comes to mind is to set a threshold for the score. Every bounding box with a score that falls below a certain threshold is removed. We can play around with a score threshold by creating a small GUI within the napari viewer:

d = datasetviewer_prediction
Image by author

A score threshold of 0.727 allows us to see only the high-scoring bounding boxes. This gives us pretty good results and a working object detector. A slightly better solution, however, is using non-maximum suppression (NMS). A good description of NMS can be found in this article:

NMS greedily selects the bounding box with the highest score and suppresses ones that have a high overlap with it. The overlap is measured by comparing Intersection-over-Union (IoU) threshold to a predefined threshold, usually ranging from 0.3 to 0.5.
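Both filtering steps can also be sketched without the GUI. The following pure-Python version is a simplified stand-in for what the viewer does (torchvision ships its own, faster NMS); the boxes, scores and thresholds are invented for this example.

```python
def box_iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_by_score(boxes, scores, score_threshold):
    """Drop every box whose score falls below the threshold."""
    keep = [i for i, score in enumerate(scores) if score >= score_threshold]
    return [boxes[i] for i in keep], [scores[i] for i in keep]


def nms(boxes, scores, iou_threshold):
    """Greedy NMS; returns the indices of the kept boxes, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Suppress everything that overlaps too much with the box just kept.
        order = [i for i in order if box_iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(100, 100, 300, 300), (110, 110, 310, 310),
         (500, 120, 700, 330), (120, 90, 290, 280)]
scores = [0.95, 0.80, 0.90, 0.40]

boxes_f, scores_f = filter_by_score(boxes, scores, score_threshold=0.7)
kept = nms(boxes_f, scores_f, iou_threshold=0.2)
print(len(boxes_f), kept, [boxes_f[i] for i in kept])
```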

We can experiment with different IoU thresholds for NMS by creating another small GUI within the napari viewer (This will destroy the score slider though):

d = datasetviewer_prediction
Image by author

An IoU threshold of 0.2 seems to be a good fit for these test images.

Et voilà, we have trained an object detector that works remarkably well for detecting human heads!


This tutorial showed you that training an object detector can be as simple as annotating 20 images and running a Jupyter notebook. I hope this made it easier for you to start your own deep learning object detection project. If you have any questions, feel free to contact me on Github or LinkedIn.

Train your own object detector with Faster-RCNN & PyTorch was originally published in Towards Data Science on Medium.