.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "build/examples_detection/train_faster_rcnn_voc.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_build_examples_detection_train_faster_rcnn_voc.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_build_examples_detection_train_faster_rcnn_voc.py:

06. Train Faster-RCNN end-to-end on PASCAL VOC
================================================

This tutorial goes through the basic steps of training a Faster-RCNN [Ren15]_ object detection model
provided by GluonCV.

Specifically, we show how to build a state-of-the-art Faster-RCNN model by stacking GluonCV components.

It is highly recommended to read the original papers [Girshick14]_, [Girshick15]_, [Ren15]_
to learn more about the ideas behind Faster R-CNN.
The appendix of [He16]_ and the experiment details in [Lin17]_ may also be useful references.

.. hint::

    You can skip the rest of this tutorial and start training your Faster-RCNN model
    right away by downloading this script:

    :download:`Download train_faster_rcnn.py<../../../scripts/detection/faster_rcnn/train_faster_rcnn.py>`

    Example usage:

    Train a default resnet50_v1b model with Pascal VOC on GPU 0:

    .. code-block:: bash

        python train_faster_rcnn.py --gpus 0

    Train a resnet50_v1b model on GPUs 0, 1, 2 and 3:

    .. code-block:: bash

        python train_faster_rcnn.py --gpus 0,1,2,3 --network resnet50_v1b

    Check the supported arguments:

    .. code-block:: bash

        python train_faster_rcnn.py --help


.. hint::

    Since much of this tutorial is very similar to :doc:`./train_ssd_voc`, you can skip any part
    you already feel comfortable with.


.. GENERATED FROM PYTHON SOURCE LINES 49-55

Dataset
-------

Please first go through the :ref:`sphx_glr_build_examples_datasets_pascal_voc.py` tutorial to set up the Pascal
VOC dataset on your disk.
Then, we are ready to load training and validation images.

.. GENERATED FROM PYTHON SOURCE LINES 55-66

.. code-block:: default


    from gluoncv.data import VOCDetection

    # typically we use 2007+2012 trainval splits for training data
    train_dataset = VOCDetection(splits=[(2007, 'trainval'), (2012, 'trainval')])
    # and use 2007 test as validation data
    val_dataset = VOCDetection(splits=[(2007, 'test')])

    print('Training images:', len(train_dataset))
    print('Validation images:', len(val_dataset))


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Training images: 16551
    Validation images: 4952


.. GENERATED FROM PYTHON SOURCE LINES 67-70

Data transform
--------------
We can read an image-label pair from the training dataset:

.. GENERATED FROM PYTHON SOURCE LINES 70-76

.. code-block:: default

    train_image, train_label = train_dataset[6]
    bboxes = train_label[:, :4]
    cids = train_label[:, 4:5]
    print('image:', train_image.shape)
    print('bboxes:', bboxes.shape, 'class ids:', cids.shape)


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    image: (375, 500, 3)
    bboxes: (2, 4) class ids: (2, 1)


.. GENERATED FROM PYTHON SOURCE LINES 77-78

Plot the image, together with the bounding box labels:

.. GENERATED FROM PYTHON SOURCE LINES 78-84

.. code-block:: default

    from matplotlib import pyplot as plt
    from gluoncv.utils import viz

    ax = viz.plot_bbox(train_image.asnumpy(), bboxes, labels=cids, class_names=train_dataset.classes)
    plt.show()


.. image-sg:: /build/examples_detection/images/sphx_glr_train_faster_rcnn_voc_001.png
   :alt: train faster rcnn voc
   :srcset: /build/examples_detection/images/sphx_glr_train_faster_rcnn_voc_001.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 85-87

Validation images are quite similar to training images because they were
basically split randomly into different sets.

.. GENERATED FROM PYTHON SOURCE LINES 87-93

.. code-block:: default

    val_image, val_label = val_dataset[6]
    bboxes = val_label[:, :4]
    cids = val_label[:, 4:5]
    ax = viz.plot_bbox(val_image.asnumpy(), bboxes, labels=cids, class_names=train_dataset.classes)
    plt.show()


.. image-sg:: /build/examples_detection/images/sphx_glr_train_faster_rcnn_voc_002.png
   :alt: train faster rcnn voc
   :srcset: /build/examples_detection/images/sphx_glr_train_faster_rcnn_voc_002.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 94-95

For Faster-RCNN networks, the only data augmentation is horizontal flip.
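
A horizontal flip has to mirror the bounding boxes together with the pixels.
The following is a minimal sketch of that bookkeeping, not GluonCV's implementation:
``flip_boxes_horizontally`` is a hypothetical helper, and boxes are assumed to be
``(xmin, ymin, xmax, ymax)`` tuples in pixel coordinates.

.. code-block:: python

    def flip_boxes_horizontally(boxes, image_width):
        """Mirror (xmin, ymin, xmax, ymax) boxes around the vertical image axis."""
        flipped = []
        for xmin, ymin, xmax, ymax in boxes:
            # The old right edge becomes the new left edge, and vice versa.
            flipped.append((image_width - xmax, ymin, image_width - xmin, ymax))
        return flipped


    # Hypothetical boxes on a 500-pixel-wide image.
    boxes = [(10, 20, 110, 220), (300, 50, 480, 200)]
    print(flip_boxes_horizontally(boxes, image_width=500))

Flipping twice returns the original boxes, which is a quick sanity check for any such helper.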

.. GENERATED FROM PYTHON SOURCE LINES 95-99

.. code-block:: default

    from gluoncv.data.transforms import presets
    from gluoncv import utils
    from mxnet import nd


.. GENERATED FROM PYTHON SOURCE LINES 100-104

.. code-block:: default

    short, max_size = 600, 1000  # resize image to short side 600 px, but keep maximum length within 1000
    train_transform = presets.rcnn.FasterRCNNDefaultTrainTransform(short, max_size)
    val_transform = presets.rcnn.FasterRCNNDefaultValTransform(short, max_size)
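
The ``short`` and ``max_size`` arguments describe an aspect-preserving resize.
The sketch below approximates that rule in plain Python to show how the two limits interact.
``resized_shape`` is a hypothetical helper, and the library's exact rounding may differ.

.. code-block:: python

    def resized_shape(height, width, short=600, max_size=1000):
        """Return (new_height, new_width) after an aspect-preserving resize."""
        # First try to bring the shorter side to `short` pixels.
        scale = short / min(height, width)
        # If the longer side would then exceed `max_size`, fit the longer side instead.
        if scale * max(height, width) > max_size:
            scale = max_size / max(height, width)
        return round(height * scale), round(width * scale)


    print(resized_shape(375, 500))  # the 375 x 500 training image loaded earlier
    print(resized_shape(281, 500))  # a wider, hypothetical image that hits max_size

For the 375 x 500 training image the sketch gives 600 x 800; for the wider image the
longer side is capped at 1000, so the shorter side ends up below 600.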


.. GENERATED FROM PYTHON SOURCE LINES 105-107

.. code-block:: default

    utils.random.seed(233)  # fix seed in this tutorial


.. GENERATED FROM PYTHON SOURCE LINES 108-109

Apply the transform to the training image:

.. GENERATED FROM PYTHON SOURCE LINES 109-113

.. code-block:: default

    train_image2, train_label2 = train_transform(train_image, train_label)
    print('tensor shape:', train_image2.shape)
    print('box and id shape:', train_label2.shape)


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    tensor shape: (3, 600, 800)
    box and id shape: (2, 6)


.. GENERATED FROM PYTHON SOURCE LINES 114-116

The transformed image tensor looks distorted when plotted directly, because its values
have been normalized and no longer lie in the [0, 255] range.
Let's convert it back so we can see it clearly.

.. GENERATED FROM PYTHON SOURCE LINES 116-124

.. code-block:: default

    train_image2 = train_image2.transpose((1, 2, 0)) * nd.array((0.229, 0.224, 0.225)) + nd.array(
        (0.485, 0.456, 0.406))
    train_image2 = (train_image2 * 255).asnumpy().astype('uint8')
    ax = viz.plot_bbox(train_image2, train_label2[:, :4],
                       labels=train_label2[:, 4:5],
                       class_names=train_dataset.classes)
    plt.show()


.. image-sg:: /build/examples_detection/images/sphx_glr_train_faster_rcnn_voc_003.png
   :alt: train faster rcnn voc
   :srcset: /build/examples_detection/images/sphx_glr_train_faster_rcnn_voc_003.png
   :class: sphx-glr-single-img
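
The conversion above undoes a per-channel normalization. As a small illustration, the
round trip for a single RGB pixel can be sketched in plain Python, using the same mean and
standard deviation values as the code above. ``normalize_pixel`` and ``denormalize_pixel``
are hypothetical helpers written only for this example.

.. code-block:: python

    MEAN = (0.485, 0.456, 0.406)
    STD = (0.229, 0.224, 0.225)


    def normalize_pixel(rgb):
        """Map an 8-bit RGB pixel to normalized floating point values."""
        return tuple((value / 255 - m) / s for value, m, s in zip(rgb, MEAN, STD))


    def denormalize_pixel(normalized):
        """Invert normalize_pixel and return 8-bit RGB values again."""
        return tuple(round((value * s + m) * 255)
                     for value, m, s in zip(normalized, MEAN, STD))


    pixel = (255, 128, 0)
    normalized = normalize_pixel(pixel)
    print(normalized)                     # values outside [0, 255], some negative
    print(denormalize_pixel(normalized))  # back to the original pixel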


.. GENERATED FROM PYTHON SOURCE LINES 125-136

Data Loader
-----------
We will iterate through the entire dataset many times during training.
Keep in mind that raw images have to be transformed to tensors
(MXNet uses BCHW format) before they are fed into neural networks.

A DataLoader makes it convenient to apply different transforms and aggregate data into mini-batches.

Because Faster-RCNN handles raw images with various aspect ratios and shapes, we provide
:py:class:`gluoncv.data.batchify.Append`, which neither stacks nor pads images, but instead returns lists.
This way, each returned image tensor and label keeps its own shape, independent of the rest of the batch.

.. GENERATED FROM PYTHON SOURCE LINES 136-156

.. code-block:: default


    from gluoncv.data.batchify import Tuple, Append, FasterRCNNTrainBatchify
    from mxnet.gluon.data import DataLoader

    batch_size = 2  # for tutorial, we use smaller batch-size
    num_workers = 0  # you can make it larger(if your CPU has more cores) to accelerate data loading

    # behavior of batchify_fn: append images and labels to lists, without stacking or padding
    batchify_fn = Tuple(Append(), Append())
    train_loader = DataLoader(train_dataset.transform(train_transform), batch_size, shuffle=True,
                              batchify_fn=batchify_fn, last_batch='rollover', num_workers=num_workers)
    val_loader = DataLoader(val_dataset.transform(val_transform), batch_size, shuffle=False,
                            batchify_fn=batchify_fn, last_batch='keep', num_workers=num_workers)

    for ib, batch in enumerate(train_loader):
        if ib > 3:
            break
        print('data 0:', batch[0][0].shape, 'label 0:', batch[1][0].shape)
        print('data 1:', batch[0][1].shape, 'label 1:', batch[1][1].shape)


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    data 0: (1, 3, 600, 800) label 0: (1, 5, 6)
    data 1: (1, 3, 600, 901) label 1: (1, 9, 6)
    data 0: (1, 3, 600, 800) label 0: (1, 2, 6)
    data 1: (1, 3, 562, 1000) label 1: (1, 1, 6)
    data 0: (1, 3, 600, 904) label 0: (1, 1, 6)
    data 1: (1, 3, 600, 888) label 1: (1, 2, 6)
    data 0: (1, 3, 600, 901) label 0: (1, 1, 6)
    data 1: (1, 3, 600, 901) label 1: (1, 1, 6)
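
To make the "append instead of stack" behavior concrete, here is a minimal pure-Python
sketch of the idea. ``append_batchify`` is a hypothetical stand-in, not the GluonCV class;
as the shapes printed above show, the real ``Append`` additionally gives every element a
leading batch axis of size 1.

.. code-block:: python

    def append_batchify(samples):
        """Group per-sample fields into lists without stacking or padding."""
        # zip(*samples) turns [(img0, lab0), (img1, lab1)] into
        # (img0, img1) and (lab0, lab1); every element keeps its own shape.
        return tuple(list(field) for field in zip(*samples))


    # Two tiny hypothetical "images" of different widths, each with its own labels.
    image_a = [[0, 0, 0], [0, 0, 0]]               # 2 x 3
    image_b = [[1, 1, 1, 1], [1, 1, 1, 1]]         # 2 x 4, cannot be stacked with image_a
    label_a = [[0, 0, 2, 2, 14]]                   # one box
    label_b = [[0, 0, 1, 1, 7], [1, 0, 3, 2, 11]]  # two boxes

    images, labels = append_batchify([(image_a, label_a), (image_b, label_b)])
    print([len(img[0]) for img in images])  # widths stay different
    print([len(lab) for lab in labels])     # box counts stay different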


.. GENERATED FROM PYTHON SOURCE LINES 157-172

Faster-RCNN Network
-------------------
GluonCV's Faster-RCNN implementation is a composite Gluon HybridBlock :py:class:`gluoncv.model_zoo.FasterRCNN`.
In terms of structure, a Faster-RCNN network is composed of a base feature extraction
network, a Region Proposal Network (including its own anchor system and proposal generator),
region-aware pooling layers, class predictors and bounding box offset predictors.

`Gluon Model Zoo <../../model_zoo/index.html>`__ has a few built-in Faster-RCNN networks, with more on the way.
You can load your favorite one with one simple line of code:

.. hint::

   To avoid downloading a model in this tutorial, we set ``pretrained_base=False``.
   In practice we usually want to load pre-trained ImageNet models by setting
   ``pretrained_base=True``.

.. GENERATED FROM PYTHON SOURCE LINES 172-177

.. code-block:: default

    from gluoncv import model_zoo

    net = model_zoo.get_model('faster_rcnn_resnet50_v1b_voc', pretrained_base=False)
    print(net)


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    FasterRCNN(
      (features): HybridSequential(
        (0): Conv2D(None -> 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
        (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=64)
        (2): Activation(relu)
        (3): MaxPool2D(size=(3, 3), stride=(2, 2), padding=(1, 1), ceil_mode=False, global_pool=False, pool_type=max, layout=NCHW)
        (4): HybridSequential(
          (0): BottleneckV1b(
            (conv1): Conv2D(None -> 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=64)
            (relu1): Activation(relu)
            (conv2): Conv2D(None -> 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=64)
            (relu2): Activation(relu)
            (conv3): Conv2D(None -> 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=256)
            (relu3): Activation(relu)
            (downsample): HybridSequential(
              (0): Conv2D(None -> 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=256)
            )
          )
          (1): BottleneckV1b(
            (conv1): Conv2D(None -> 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=64)
            (relu1): Activation(relu)
            (conv2): Conv2D(None -> 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=64)
            (relu2): Activation(relu)
            (conv3): Conv2D(None -> 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=256)
            (relu3): Activation(relu)
          )
          (2): BottleneckV1b(
            (conv1): Conv2D(None -> 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=64)
            (relu1): Activation(relu)
            (conv2): Conv2D(None -> 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=64)
            (relu2): Activation(relu)
            (conv3): Conv2D(None -> 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=256)
            (relu3): Activation(relu)
          )
        )
        (5): HybridSequential(
          (0): BottleneckV1b(
            (conv1): Conv2D(None -> 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=128)
            (relu1): Activation(relu)
            (conv2): Conv2D(None -> 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
            (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=128)
            (relu2): Activation(relu)
            (conv3): Conv2D(None -> 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=512)
            (relu3): Activation(relu)
            (downsample): HybridSequential(
              (0): Conv2D(None -> 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
              (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=512)
            )
          )
          (1): BottleneckV1b(
            (conv1): Conv2D(None -> 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=128)
            (relu1): Activation(relu)
            (conv2): Conv2D(None -> 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=128)
            (relu2): Activation(relu)
            (conv3): Conv2D(None -> 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=512)
            (relu3): Activation(relu)
          )
          (2): BottleneckV1b(
            (conv1): Conv2D(None -> 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=128)
            (relu1): Activation(relu)
            (conv2): Conv2D(None -> 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=128)
            (relu2): Activation(relu)
            (conv3): Conv2D(None -> 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=512)
            (relu3): Activation(relu)
          )
          (3): BottleneckV1b(
            (conv1): Conv2D(None -> 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=128)
            (relu1): Activation(relu)
            (conv2): Conv2D(None -> 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=128)
            (relu2): Activation(relu)
            (conv3): Conv2D(None -> 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=512)
            (relu3): Activation(relu)
          )
        )
        (6): HybridSequential(
          (0): BottleneckV1b(
            (conv1): Conv2D(None -> 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=256)
            (relu1): Activation(relu)
            (conv2): Conv2D(None -> 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
            (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=256)
            (relu2): Activation(relu)
            (conv3): Conv2D(None -> 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=1024)
            (relu3): Activation(relu)
            (downsample): HybridSequential(
              (0): Conv2D(None -> 1024, kernel_size=(1, 1), stride=(2, 2), bias=False)
              (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=1024)
            )
          )
          (1): BottleneckV1b(
            (conv1): Conv2D(None -> 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=256)
            (relu1): Activation(relu)
            (conv2): Conv2D(None -> 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=256)
            (relu2): Activation(relu)
            (conv3): Conv2D(None -> 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=1024)
            (relu3): Activation(relu)
          )
          (2): BottleneckV1b(
            (conv1): Conv2D(None -> 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=256)
            (relu1): Activation(relu)
            (conv2): Conv2D(None -> 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=256)
            (relu2): Activation(relu)
            (conv3): Conv2D(None -> 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=1024)
            (relu3): Activation(relu)
          )
          (3): BottleneckV1b(
            (conv1): Conv2D(None -> 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=256)
            (relu1): Activation(relu)
            (conv2): Conv2D(None -> 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=256)
            (relu2): Activation(relu)
            (conv3): Conv2D(None -> 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=1024)
            (relu3): Activation(relu)
          )
          (4): BottleneckV1b(
            (conv1): Conv2D(None -> 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=256)
            (relu1): Activation(relu)
            (conv2): Conv2D(None -> 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=256)
            (relu2): Activation(relu)
            (conv3): Conv2D(None -> 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=1024)
            (relu3): Activation(relu)
          )
          (5): BottleneckV1b(
            (conv1): Conv2D(None -> 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=256)
            (relu1): Activation(relu)
            (conv2): Conv2D(None -> 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=256)
            (relu2): Activation(relu)
            (conv3): Conv2D(None -> 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=1024)
            (relu3): Activation(relu)
          )
        )
      )
      (top_features): HybridSequential(
        (0): HybridSequential(
          (0): BottleneckV1b(
            (conv1): Conv2D(None -> 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=512)
            (relu1): Activation(relu)
            (conv2): Conv2D(None -> 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
            (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=512)
            (relu2): Activation(relu)
            (conv3): Conv2D(None -> 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=2048)
            (relu3): Activation(relu)
            (downsample): HybridSequential(
              (0): Conv2D(None -> 2048, kernel_size=(1, 1), stride=(2, 2), bias=False)
              (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=2048)
            )
          )
          (1): BottleneckV1b(
            (conv1): Conv2D(None -> 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=512)
            (relu1): Activation(relu)
            (conv2): Conv2D(None -> 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=512)
            (relu2): Activation(relu)
            (conv3): Conv2D(None -> 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=2048)
            (relu3): Activation(relu)
          )
          (2): BottleneckV1b(
            (conv1): Conv2D(None -> 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=512)
            (relu1): Activation(relu)
            (conv2): Conv2D(None -> 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=512)
            (relu2): Activation(relu)
            (conv3): Conv2D(None -> 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=2048)
            (relu3): Activation(relu)
          )
        )
      )
      (class_predictor): Dense(None -> 21, linear)
      (box_predictor): Dense(None -> 80, linear)
      (cls_decoder): MultiPerClassDecoder(
  
      )
      (box_decoder): NormalizedBoxCenterDecoder(
        (corner_to_center): BBoxCornerToCenter(
    
        )
      )
      (rpn): RPN(
        (anchor_generator): RPNAnchorGenerator(
    
        )
        (conv1): HybridSequential(
          (0): Conv2D(None -> 1024, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (1): Activation(relu)
        )
        (score): Conv2D(None -> 15, kernel_size=(1, 1), stride=(1, 1))
        (loc): Conv2D(None -> 60, kernel_size=(1, 1), stride=(1, 1))
        (region_proposer): RPNProposal(
          (_box_to_center): BBoxCornerToCenter(
      
          )
          (_box_decoder): NormalizedBoxCenterDecoder(
            (corner_to_center): BBoxCornerToCenter(
        
            )
          )
          (_clipper): BBoxClipToImage(
      
          )
        )
      )
      (sampler): RCNNTargetSampler(
  
      )
    )


.. GENERATED FROM PYTHON SOURCE LINES 178-179

The Faster-RCNN network is callable with an image tensor:

.. GENERATED FROM PYTHON SOURCE LINES 179-185

.. code-block:: default

    import mxnet as mx

    x = mx.nd.zeros(shape=(1, 3, 600, 800))
    net.initialize()
    cids, scores, bboxes = net(x)


.. GENERATED FROM PYTHON SOURCE LINES 186-189

Faster-RCNN returns three values, where ``cids`` are the class labels,
``scores`` are confidence scores of each prediction,
and ``bboxes`` are absolute coordinates of corresponding bounding boxes.

.. GENERATED FROM PYTHON SOURCE LINES 191-192

The Faster-RCNN network behaves differently in training mode:

.. GENERATED FROM PYTHON SOURCE LINES 192-201

.. code-block:: default

    from mxnet import autograd

    with autograd.train_mode():
        # this time we need ground-truth to generate high quality roi proposals during training
        gt_box = mx.nd.zeros(shape=(1, 1, 4))
        gt_label = mx.nd.zeros(shape=(1, 1, 1))
        cls_pred, box_pred, roi, samples, matches, rpn_score, rpn_box, anchors, cls_targets, \
            box_targets, box_masks, _ = net(x, gt_box, gt_label)


.. GENERATED FROM PYTHON SOURCE LINES 202-208

In training mode, Faster-RCNN returns many intermediate values, which we need in order to train end-to-end:
``cls_pred`` holds the class predictions prior to softmax,
``box_pred`` holds the bounding box offsets with one-to-one correspondence to proposals,
``roi`` holds the proposal candidates, and ``samples`` and ``matches`` are the sampling/matching results of RPN anchors.
``rpn_score`` and ``rpn_box`` are the raw outputs from the RPN's convolutional layers,
and ``anchors`` are the absolute coordinates of the corresponding anchor boxes.
``cls_targets``, ``box_targets`` and ``box_masks`` are the RCNN training targets discussed later in this tutorial.
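
To illustrate what "bounding box offsets" means, the sketch below applies center-format
offsets to a reference box, following the general parameterization described in [Ren15]_.
It is an illustration under stated assumptions rather than GluonCV's decoder:
``decode_box`` is a hypothetical helper, and the ``stds`` scaling factors default to 1
here, which may not match the values used by the library.

.. code-block:: python

    import math


    def decode_box(reference, offsets, stds=(1.0, 1.0, 1.0, 1.0)):
        """Apply (dx, dy, dw, dh) offsets to an (xmin, ymin, xmax, ymax) box."""
        xmin, ymin, xmax, ymax = reference
        width, height = xmax - xmin, ymax - ymin
        center_x, center_y = xmin + 0.5 * width, ymin + 0.5 * height
        dx, dy, dw, dh = (o * s for o, s in zip(offsets, stds))
        # Centers move proportionally to the box size; sizes scale exponentially.
        new_cx = center_x + dx * width
        new_cy = center_y + dy * height
        new_w = width * math.exp(dw)
        new_h = height * math.exp(dh)
        return (new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
                new_cx + 0.5 * new_w, new_cy + 0.5 * new_h)


    reference = (10, 20, 110, 220)
    print(decode_box(reference, (0.0, 0.0, 0.0, 0.0)))  # zero offsets keep the box
    print(decode_box(reference, (0.1, 0.0, 0.0, 0.0)))  # shift right by 10% of the width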

.. GENERATED FROM PYTHON SOURCE LINES 211-214

Training losses
---------------
There are four losses involved in end-to-end Faster-RCNN training.

.. GENERATED FROM PYTHON SOURCE LINES 214-224

.. code-block:: default


    # the loss to penalize incorrect foreground/background prediction
    rpn_cls_loss = mx.gluon.loss.SigmoidBinaryCrossEntropyLoss(from_sigmoid=False)
    # the loss to penalize inaccurate anchor boxes
    rpn_box_loss = mx.gluon.loss.HuberLoss(rho=1 / 9.)  # == smoothl1
    # the loss to penalize incorrect classification prediction.
    rcnn_cls_loss = mx.gluon.loss.SoftmaxCrossEntropyLoss()
    # and finally the loss to penalize inaccurate proposals
    rcnn_box_loss = mx.gluon.loss.HuberLoss()  # == smoothl1
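
The ``# == smoothl1`` comments refer to the fact that a Huber loss with threshold ``rho``
coincides with a smooth L1 loss whose ``sigma`` satisfies ``rho = 1 / sigma**2``.
The following sketch checks this numerically for ``rho = 1 / 9`` and ``sigma = 3``,
assuming the usual piecewise definitions of both losses. ``huber`` and ``smooth_l1`` are
plain-Python helpers written for this example, not library calls.

.. code-block:: python

    def huber(residual, rho=1.0):
        """Elementwise Huber loss: quadratic within rho of zero, linear beyond."""
        magnitude = abs(residual)
        if magnitude < rho:
            return 0.5 / rho * magnitude ** 2
        return magnitude - 0.5 * rho


    def smooth_l1(residual, sigma=1.0):
        """Elementwise smooth L1 loss with transition point 1 / sigma**2."""
        magnitude = abs(residual)
        if magnitude < 1.0 / sigma ** 2:
            return 0.5 * (sigma * magnitude) ** 2
        return magnitude - 0.5 / sigma ** 2


    for residual in (0.05, 0.5, 2.0):
        print(residual, huber(residual, rho=1 / 9.), smooth_l1(residual, sigma=3.0))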


.. GENERATED FROM PYTHON SOURCE LINES 225-230

RPN training targets
--------------------
To speed up training, we let the CPU pre-compute the RPN training targets.
This is especially helpful when your CPU is powerful: you can use ``-j num_workers``
to take advantage of multiple CPU cores.
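
RPN targets label each anchor as object, background or ignored, based on its overlap
with the ground-truth boxes. The sketch below shows a simplified version of that rule.
The 0.7 and 0.3 thresholds follow the values reported in [Ren15]_ and are assumptions
about this particular implementation; ``iou`` and ``label_anchor`` are hypothetical helpers,
and the additional rule that the best anchor for each ground-truth box is always positive
is omitted for brevity.

.. code-block:: python

    def iou(box_a, box_b):
        """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
        inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
        inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
        intersection = inter_w * inter_h
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return intersection / (area_a + area_b - intersection)


    def label_anchor(anchor, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
        """Return 1 (object), 0 (background) or -1 (ignored) for one anchor."""
        best = max(iou(anchor, gt) for gt in gt_boxes)
        if best >= pos_thresh:
            return 1
        if best < neg_thresh:
            return 0
        return -1


    gt_boxes = [(0, 0, 100, 100)]
    for anchor in [(0, 0, 100, 100), (0, 0, 100, 50), (200, 200, 300, 300)]:
        print(anchor, label_anchor(anchor, gt_boxes))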

.. GENERATED FROM PYTHON SOURCE LINES 232-233

If we provide the network to the training transform function, it will also compute the RPN training targets:

.. GENERATED FROM PYTHON SOURCE LINES 233-241

.. code-block:: default

    train_transform = presets.rcnn.FasterRCNNDefaultTrainTransform(short, max_size, net)
    # Return images, labels, rpn_cls_targets, rpn_box_targets, rpn_box_masks loosely
    batchify_fn = FasterRCNNTrainBatchify(net)
    # For the next part, we only use batch size 1
    batch_size = 1
    train_loader = DataLoader(train_dataset.transform(train_transform), batch_size, shuffle=True,
                              batchify_fn=batchify_fn, last_batch='rollover', num_workers=num_workers)


.. GENERATED FROM PYTHON SOURCE LINES 242-244

This time the data loader also returns the RPN training targets for us.
What remains is a natural Gluon training loop, in which a ``Trainer`` updates the weights.

.. GENERATED FROM PYTHON SOURCE LINES 244-263

.. code-block:: default


    for ib, batch in enumerate(train_loader):
        if ib > 0:
            break
        with autograd.train_mode():
            for data, label, rpn_cls_targets, rpn_box_targets, rpn_box_masks in zip(*batch):
                label = label.expand_dims(0)
                gt_label = label[:, :, 4:5]
                gt_box = label[:, :, :4]
                print('data:', data.shape)
                # box and class labels
                print('box:', gt_box.shape)
                print('label:', gt_label.shape)
                # -1 marks ignored label
                print('rpn cls label:', rpn_cls_targets.shape)
                # mask out ignored box label
                print('rpn box label:', rpn_box_targets.shape)
                print('rpn box mask:', rpn_box_masks.shape)


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    data: (3, 600, 800)
    box: (1, 6, 4)
    label: (1, 6, 1)
    rpn cls label: (1, 28500)
    rpn box label: (1, 28500, 4)
    rpn box mask: (1, 28500, 4)
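
The number 28500 in the output above can be reproduced by hand for a 600 x 800 input.
The sketch below applies the standard convolution output-size formula to the four
stride-2 layers of ``features`` in the network printout, and assumes one objectness score
per anchor, so that the 15 channels of the RPN ``score`` convolution correspond to 15 anchors
per feature-map location. The helper names are hypothetical.

.. code-block:: python

    def conv_output_size(size, kernel, stride, padding):
        """Spatial output size of a convolution or pooling layer (floor mode)."""
        return (size + 2 * padding - kernel) // stride + 1


    def feature_map_size(size):
        """Apply the four stride-2 layers of the backbone shown in the printout."""
        size = conv_output_size(size, kernel=7, stride=2, padding=3)  # first 7x7 conv
        size = conv_output_size(size, kernel=3, stride=2, padding=1)  # 3x3 max pooling
        size = conv_output_size(size, kernel=3, stride=2, padding=1)  # stage (5)
        size = conv_output_size(size, kernel=3, stride=2, padding=1)  # stage (6)
        return size


    anchors_per_location = 15  # assumed from the RPN score convolution's 15 channels
    height, width = feature_map_size(600), feature_map_size(800)
    print(height, width, height * width * anchors_per_location)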


.. GENERATED FROM PYTHON SOURCE LINES 264-267

RCNN training targets
---------------------
RCNN targets are generated from the network's intermediate outputs by the target generator stored in the network.

.. GENERATED FROM PYTHON SOURCE LINES 267-290

.. code-block:: default


    for ib, batch in enumerate(train_loader):
        if ib > 0:
            break
        with autograd.train_mode():
            for data, label, rpn_cls_targets, rpn_box_targets, rpn_box_masks in zip(*batch):
                label = label.expand_dims(0)
                gt_label = label[:, :, 4:5]
                gt_box = label[:, :, :4]
                # network forward
                cls_pred, box_pred, roi, samples, matches, rpn_score, rpn_box, anchors, cls_targets, \
                    box_targets, box_masks, _ = net(data.expand_dims(0), gt_box, gt_label)

                print('data:', data.shape)
                # box and class labels
                print('box:', gt_box.shape)
                print('label:', gt_label.shape)
                # rcnn does not have ignored label
                print('rcnn cls label:', cls_targets.shape)
                # mask out ignored box label
                print('rcnn box label:', box_targets.shape)
                print('rcnn box mask:', box_masks.shape)


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    data: (3, 600, 800)
    box: (1, 2, 4)
    label: (1, 2, 1)
    rcnn cls label: (1, 128)
    rcnn box label: (1, 32, 20, 4)
    rcnn box mask: (1, 32, 20, 4)


.. GENERATED FROM PYTHON SOURCE LINES 291-294

Training loop
-------------
After defining the loss functions and generating the training targets, we can write the training loop.

.. GENERATED FROM PYTHON SOURCE LINES 294-327

.. code-block:: default


    for ib, batch in enumerate(train_loader):
        if ib > 0:
            break
        with autograd.record():
            for data, label, rpn_cls_targets, rpn_box_targets, rpn_box_masks in zip(*batch):
                label = label.expand_dims(0)
                gt_label = label[:, :, 4:5]
                gt_box = label[:, :, :4]
                # network forward
                cls_preds, box_preds, roi, samples, matches, rpn_score, rpn_box, anchors, cls_targets, \
                    box_targets, box_masks, _ = net(data.expand_dims(0), gt_box, gt_label)

                # losses of rpn
                rpn_score = rpn_score.squeeze(axis=-1)
                num_rpn_pos = (rpn_cls_targets >= 0).sum()
                rpn_loss1 = rpn_cls_loss(rpn_score, rpn_cls_targets,
                                         rpn_cls_targets >= 0) * rpn_cls_targets.size / num_rpn_pos
                rpn_loss2 = rpn_box_loss(rpn_box, rpn_box_targets,
                                         rpn_box_masks) * rpn_box.size / num_rpn_pos

                # losses of rcnn
                num_rcnn_pos = (cls_targets >= 0).sum()
                rcnn_loss1 = rcnn_cls_loss(cls_preds, cls_targets, cls_targets >= 0) \
                    * cls_targets.size / cls_targets.shape[0] / num_rcnn_pos
                rcnn_loss2 = rcnn_box_loss(box_preds, box_targets, box_masks) \
                    * box_preds.size / box_preds.shape[0] / num_rcnn_pos

            # some standard gluon training steps:
            # autograd.backward([rpn_loss1, rpn_loss2, rcnn_loss1, rcnn_loss2])
            # trainer.step(batch_size)
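
The ``* size / num_pos`` factors in the loop rescale each loss. The sketch below shows the
idea in plain Python, assuming the loss function averages over all elements, including
those masked out by the sample weights. Multiplying by the element count and dividing by
the number of non-ignored targets then yields the mean over the valid elements only.
``normalized_masked_loss`` is a hypothetical helper, not part of Gluon.

.. code-block:: python

    def normalized_masked_loss(per_element_loss, targets):
        """Average the loss over non-ignored elements only (targets >= 0)."""
        mask = [1.0 if target >= 0 else 0.0 for target in targets]
        num_valid = sum(mask)
        # Mean over *all* elements of the masked loss ...
        masked = [loss * m for loss, m in zip(per_element_loss, mask)]
        masked_mean = sum(masked) / len(masked)
        # ... rescaled so that ignored elements do not dilute the result.
        return masked_mean * len(targets) / num_valid


    losses = [0.2, 0.4, 9.0, 0.6]  # hypothetical per-element losses
    targets = [1, 0, -1, 1]        # -1 marks an ignored element
    print(normalized_masked_loss(losses, targets))

Here the ignored element's large loss does not contribute, and the result is the plain
average of the three remaining values.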


.. GENERATED FROM PYTHON SOURCE LINES 328-331

.. hint::

  Please check out the full :download:`training script <../../../scripts/detection/faster_rcnn/train_faster_rcnn.py>` for the complete implementation.

.. GENERATED FROM PYTHON SOURCE LINES 334-342

References
----------

.. [Girshick14] Ross Girshick and Jeff Donahue and Trevor Darrell and Jitendra Malik. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. CVPR 2014.
.. [Girshick15] Ross Girshick. Fast R-CNN. ICCV 2015.
.. [Ren15] Shaoqing Ren and Kaiming He and Ross Girshick and Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015.
.. [He16] Kaiming He and Xiangyu Zhang and Shaoqing Ren and Jian Sun. Deep Residual Learning for Image Recognition. CVPR 2016.
.. [Lin17] Tsung-Yi Lin and Piotr Dollár and Ross Girshick and Kaiming He and Bharath Hariharan and Serge Belongie. Feature Pyramid Networks for Object Detection. CVPR 2017.


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes  28.802 seconds)


.. _sphx_glr_download_build_examples_detection_train_faster_rcnn_voc.py:


.. only :: html

 .. container:: sphx-glr-footer
    :class: sphx-glr-footer-example


  .. container:: sphx-glr-download sphx-glr-download-python

     :download:`Download Python source code: train_faster_rcnn_voc.py <train_faster_rcnn_voc.py>`


  .. container:: sphx-glr-download sphx-glr-download-jupyter

     :download:`Download Jupyter notebook: train_faster_rcnn_voc.ipynb <train_faster_rcnn_voc.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_