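Faster R-CNN Source Code Walkthrough (simple-faster-rcnn-pytorch)

The source code used here is the simple-faster-rcnn-pytorch repository.

I considered several ways of walking through the Faster R-CNN source in detail, but the code is complex and long, with many functional modules, and covering them one by one would likely leave readers lost. So I introduce the modules in the order they are used by the two processes, prediction and training. This is my own understanding from reading the code (it also lets me get back up to speed quickly when I review it later). I hope it is useful to others as well; corrections are welcome.

Overall workflow diagram

1 Prediction

1.1 VGG16 network structure

Code location: ./model/faster_rcnn_vgg16.py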

def decom_vgg16():
    # the 30th layer of features is relu of conv5_3
    if opt.caffe_pretrain:
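        model = vgg16(pretrained=False)  # Caffe-pretrained weights: build the network without torchvision weights
        if not opt.load_path:
            model.load_state_dict(t.load(opt.caffe_pretrain_path))
    else:
        model = vgg16(not opt.load_path)  # otherwise use torchvision's pretrained weights, unless a checkpoint will be loaded

    features = list(model.features)[:30]    # feature extractor: keep the first 30 layers
    classifier = model.classifier       # classifier part

    classifier = list(classifier)
    del classifier[6]  # drop the final classification layer
    if not opt.use_drop:    # drop the Dropout layers if dropout is not used
        del classifier[5]
        del classifier[2]
    classifier = nn.Sequential(*classifier)     # wrap the remaining classifier layers

    # freeze top4 conv
    for layer in features[:10]:     # freeze the first ten layers (they are already pretrained)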
        for p in layer.parameters():
            p.requires_grad = False

    return nn.Sequential(*features), classifier

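The vgg16 above is loaded directly from torchvision.models.

The code splits vgg16 into two parts, the feature-extraction part (features) and the final classification part (classifier):

For the features part, the author keeps only the first 30 layers of the original VGG16 features (that is, the last max-pool layer is dropped, so the network has only four max-pool layers) and freezes the first ten layers;
For the classifier part, the final 1000-way layer is always removed, and the dropout layers are removed as well when opt.use_drop is false.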
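To make the "four max-pool layers, 16x downsampling" point concrete, the sketch below rebuilds the layer order of the VGG16 features block from its standard configuration list and counts what survives the `[:30]` slice. `VGG16_CFG` and `expand_layers` are names written out here for illustration; nothing is imported from torchvision.

```python
# Standard VGG16 layout: numbers are conv output channels, "M" is a 2x2 max-pool.
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]


def expand_layers(cfg):
    """Expand the config into the flat layer order seen in model.features."""
    layers = []
    for item in cfg:
        if item == "M":
            layers.append("MaxPool2d")
        else:
            layers.extend(["Conv2d", "ReLU"])  # every conv is followed by a ReLU
    return layers


all_layers = expand_layers(VGG16_CFG)    # 31 layers, indices 0..30
kept = all_layers[:30]                   # the slice used in decom_vgg16
n_pool = kept.count("MaxPool2d")         # pooling layers that survive the slice
feat_stride = 2 ** n_pool                # each 2x2 pool halves height and width

frozen = all_layers[:10]                 # layers whose parameters are frozen
n_frozen_conv = frozen.count("Conv2d")   # only conv layers carry parameters

print(len(all_layers), n_pool, feat_stride, n_frozen_conv)
```

The slice drops only index 30 (the last max-pool), which is why the RPN later uses a stride of 16, and the ten frozen layers contain four convolutions, matching the "freeze top4 conv" comment.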

The structure of the vgg16 that ships with torchvision.models is shown below.

First, look at the original structure of features and classifier.

Load features and classifier:

from torchvision.models import vgg16
model = vgg16(not None)  # `not None` is True, so this is vgg16(pretrained=True)
features = list(model.features)
classifier = list(model.classifier)

Inspect the structure of features:

for i in range(len(features)):
    print(i,features[i])
# 0 Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
# 1 ReLU(inplace=True)
# 2 Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
# 3 ReLU(inplace=True)
# 4 MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
# 5 Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
# 6 ReLU(inplace=True)
# 7 Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
# 8 ReLU(inplace=True)
# 9 MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
# 10 Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
# 11 ReLU(inplace=True)
# 12 Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
# 13 ReLU(inplace=True)
# 14 Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
# 15 ReLU(inplace=True)
# 16 MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
# 17 Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
# 18 ReLU(inplace=True)
# 19 Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
# 20 ReLU(inplace=True)
# 21 Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
# 22 ReLU(inplace=True)
# 23 MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
# 24 Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
# 25 ReLU(inplace=True)
# 26 Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
# 27 ReLU(inplace=True)
# 28 Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
# 29 ReLU(inplace=True)
# 30 MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)

Inspect the structure of classifier:

for i in range(len(classifier)):
    print(i, classifier[i])
# 0 Linear(in_features=25088, out_features=4096, bias=True)
# 1 ReLU(inplace=True)
# 2 Dropout(p=0.5, inplace=False)
# 3 Linear(in_features=4096, out_features=4096, bias=True)
# 4 ReLU(inplace=True)
# 5 Dropout(p=0.5, inplace=False)
# 6 Linear(in_features=4096, out_features=1000, bias=True)
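Given the classifier listing above, the deletions in decom_vgg16 can be replayed on a plain list of layer names. This is only a sketch: the strings stand in for the real modules, and `trim_classifier` is a hypothetical helper, not part of the repository.

```python
classifier = [
    "Linear(25088, 4096)",  # 0
    "ReLU",                 # 1
    "Dropout",              # 2
    "Linear(4096, 4096)",   # 3
    "ReLU",                 # 4
    "Dropout",              # 5
    "Linear(4096, 1000)",   # 6
]


def trim_classifier(layers, use_drop):
    """Mirror the del statements in decom_vgg16 on a copy of the list."""
    layers = list(layers)
    del layers[6]            # the 1000-way ImageNet layer is always removed
    if not use_drop:
        del layers[5]        # delete from the highest index down so that
        del layers[2]        # earlier deletions do not shift later indices
    return layers


without_dropout = trim_classifier(classifier, use_drop=False)
with_dropout = trim_classifier(classifier, use_drop=True)
print(without_dropout)
```

Either way the last remaining Linear layer outputs 4096 features, which is the input size of the two heads defined later in VGG16RoIHead.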

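1.2 RPN network structure

The most important part is the forward() method, walked through step by step below (assume the final convolutional feature map is 60*40*512; it has passed through four max-pool layers, so it is 2^4 = 16 times smaller than the input):

Pass the feature map's height and width (hh and ww), the sliding-window stride (16 by default: because the feature map is downsampled 16 times, moving one pixel on the feature map corresponds to moving 16 pixels on the original image) and anchor_base (the nine base boxes) to _enumerate_shifted_anchor() to obtain all anchor boxes. (With stride = 16 this yields 60*40*9 = 21600 anchors. Note: these anchors are shared by every image in the batch.)
Apply a 3*3 convolution to the feature map, keeping the number of channels unchanged; the output is h.
Feed h into self.loc (the network that regresses anchor positions; it is a 1*1 convolution that changes the channel count to n_anchor*4, which acts as a fully connected layer at each pixel), then rearrange the tensor shape (see the code comments below for the exact shapes). The output is rpn_locs, with shape = (n, hh*ww*n_anchor, 4).
Feed h into self.score (a 1*1 convolution that produces the anchor classification scores, with n_anchor*2 channels). The output is rpn_scores, with shape = (n, hh*ww*n_anchor, 2).
Apply a two-class softmax to these scores to obtain probabilities (the foreground probability, that is, the probability that the box contains an object to detect). The output is rpn_fg_scores, with shape = (n, hh*ww*n_anchor).
Then loop over each image in the batch and pass its predictions to proposal_layer() to obtain the proposed regions roi (see section 1.2.1), store the regions and record the index of the image they belong to. Finally, concatenate all regions and indices into rois and roi_indices.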
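The reshaping in the steps above only works because row i of rpn_locs describes anchor i. Here is a minimal sketch in plain Python (no torch) of the row order that permute(0, 2, 3, 1) followed by view(n, -1, 4) produces. The helper names are mine, and taking the 60*40 map as hh = 40, ww = 60 is an assumption.

```python
def rpn_row_sources(hh, ww, n_anchor):
    """For each row of the (hh*ww*n_anchor, 4) view, list where it came from.

    After permute(0, 2, 3, 1) the channel axis is last, so view(n, -1, 4)
    walks the data as: feature-map row y, then column x, then anchor a.
    Anchor a at pixel (y, x) owns channels a*4 .. a*4+3 of the conv output.
    """
    rows = []
    for y in range(hh):
        for x in range(ww):
            for a in range(n_anchor):
                rows.append((y, x, a, a * 4))
    return rows


def row_index(y, x, a, ww, n_anchor):
    """Closed form of the same ordering."""
    return (y * ww + x) * n_anchor + a


hh, ww, n_anchor = 40, 60, 9
rows = rpn_row_sources(hh, ww, n_anchor)
total_anchors = len(rows)
print(total_anchors)
print(rows[row_index(2, 5, 3, ww, n_anchor)])
```

The anchors from _enumerate_shifted_anchor() follow the same "position first, base anchor second" order, which is what lets loc, score and anchor be indexed together in the proposal layer.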

Code location: ./model/region_proposal_network.py

The complete RPN code follows (with comments reflecting my own understanding).

class RegionProposalNetwork(nn.Module):
    """Region Proposal Network introduced in Faster R-CNN.

    This is Region Proposal Network introduced in Faster R-CNN [#]_.            # the RPN part
    This takes features extracted from images and propose                       # input: image feature maps; output: proposed boxes
    class agnostic bounding boxes around "objects".

    .. [#] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. \
    Faster R-CNN: Towards Real-Time Object Detection with \
    Region Proposal Networks. NIPS 2015.

    Args:
        in_channels (int): The channel size of input.                           # number of input channels
        mid_channels (int): The channel size of the intermediate tensor.        # number of intermediate channels
        ratios (list of floats): This is ratios of width to height of           # aspect ratios of the anchor boxes
            the anchors.
        anchor_scales (list of numbers): This is areas of anchors.              # scale factors of the anchor boxes
            Those areas will be the product of the square of an element in
            :obj:`anchor_scales` and the original area of the reference
            window.
        feat_stride (int): Stride size after extracting features from an        # stride between neighbouring anchor positions
            image.
        initialW (callable): Initial weight value. If :obj:`None` then this     # initial weights (random initialization if None)
            function uses Gaussian distribution scaled by 0.1 to
            initialize weight.
            May also be a callable that takes an array and edits its values.
        proposal_creator_params (dict): Key valued parameters for               # parameters for generating proposals
            :class:`model.utils.creator_tools.ProposalCreator`.

    .. seealso::
        :class:`~model.utils.creator_tools.ProposalCreator`

    """

    def __init__(
            self, in_channels=512, mid_channels=512, ratios=[0.5, 1, 2],
            anchor_scales=[8, 16, 32], feat_stride=16,
            proposal_creator_params=dict(),
    ):
        super(RegionProposalNetwork, self).__init__()
        self.anchor_base = generate_anchor_base(
            anchor_scales=anchor_scales, ratios=ratios)
        self.feat_stride = feat_stride
        self.proposal_layer = ProposalCreator(self, **proposal_creator_params)
        n_anchor = self.anchor_base.shape[0]
        self.conv1 = nn.Conv2d(in_channels, mid_channels, 3, 1, 1)  # kernel_size=3, stride=1, padding=1: spatial size is unchanged
        self.score = nn.Conv2d(mid_channels, n_anchor * 2, 1, 1, 0) # classification head; a 1*1 conv, equivalent to a per-pixel fully connected layer
        self.loc = nn.Conv2d(mid_channels, n_anchor * 4, 1, 1, 0)   # localization head (also a 1*1 conv)
        normal_init(self.conv1, 0, 0.01)
        normal_init(self.score, 0, 0.01)
        normal_init(self.loc, 0, 0.01)

    def forward(self, x, img_size, scale=1.):
        """Forward Region Proposal Network.

        Here are notations.

        * :math:`N` is batch size.          # batch size
        * :math:`C` channel size of the input.          # number of channels
        * :math:`H` and :math:`W` are height and width of the input feature.    # height and width of the feature map
        * :math:`A` is number of anchors assigned to each pixel.    # number of anchor boxes per pixel

        Args:
            x (~torch.autograd.Variable): The Features extracted from images.
                Its shape is :math:`(N, C, H, W)`.      # the input feature map
            img_size (tuple of ints): A tuple :obj:`height, width`,
                which contains image size after scaling.    # image size (height, width)
            scale (float): The amount of scaling done to the input images after
                reading them from files.    

        Returns:
            (~torch.autograd.Variable, ~torch.autograd.Variable, array, array, array):

            This is a tuple of five following values.

            * **rpn_locs**: Predicted bounding box offsets and scales for \
                anchors. Its shape is :math:`(N, H W A, 4)`.
            * **rpn_scores**:  Predicted foreground scores for \
                anchors. Its shape is :math:`(N, H W A, 2)`.
            * **rois**: A bounding box array containing coordinates of \
                proposal boxes.  This is a concatenation of bounding box \
                arrays from multiple images in the batch. \
                Its shape is :math:`(R', 4)`. Given :math:`R_i` predicted \
                bounding boxes from the :math:`i` th image, \
                :math:`R' = \\sum _{i=1} ^ N R_i`.
            * **roi_indices**: An array containing indices of images to \
                which RoIs correspond to. Its shape is :math:`(R',)`.
            * **anchor**: Coordinates of enumerated shifted anchors. \
                Its shape is :math:`(H W A, 4)`.

        """
        n, _, hh, ww = x.shape      # batch size n, feature-map height hh and width ww
        anchor = _enumerate_shifted_anchor(  # enumerate all anchors
            np.array(self.anchor_base),     # base anchors for a single position
            self.feat_stride, hh, ww)       # sliding-window stride, feature-map height and width

        n_anchor = anchor.shape[0] // (hh * ww)     # number of base anchors per position
        h = F.relu(self.conv1(x))      # first conv: 3*3 kernel, channel count and spatial size unchanged

        rpn_locs = self.loc(h)      # localization output: n*(n_anchor*4)*hh*ww
        # UNNOTE: check whether need contiguous
        # A: Yes
        # permute() reorders the dimensions (like numpy's transpose), contiguous() makes the
        # memory layout contiguous (copying if needed), view() changes the shape.
        # Final shape: n*(hh*ww*n_anchor)*4
        rpn_locs = rpn_locs.permute(0, 2, 3, 1).contiguous().view(n, -1, 4)
        rpn_scores = self.score(h)  # classification scores: n*(n_anchor*2)*hh*ww
        rpn_scores = rpn_scores.permute(0, 2, 3, 1).contiguous()    # same reordering: n*hh*ww*(n_anchor*2)
        rpn_softmax_scores = F.softmax(rpn_scores.view(n, hh, ww, n_anchor, 2), dim=4)      # input shape: n*hh*ww*n_anchor*2
        rpn_fg_scores = rpn_softmax_scores[:, :, :, :, 1].contiguous()  # foreground probability
        rpn_fg_scores = rpn_fg_scores.view(n, -1)   # n*(hh*ww*n_anchor)
        rpn_scores = rpn_scores.view(n, -1, 2)  # n*(hh*ww*n_anchor)*2

        rois = list()       # two lists that collect the results
        roi_indices = list()
        for i in range(n):  # loop over each image
            roi = self.proposal_layer(
                rpn_locs[i].cpu().data.numpy(),
                rpn_fg_scores[i].cpu().data.numpy(),
                anchor, img_size,
                scale=scale)
            batch_index = i * np.ones((len(roi),), dtype=np.int32)  # record the image index
            rois.append(roi)    # collect the rois
            roi_indices.append(batch_index)

        rois = np.concatenate(rois, axis=0)     # concatenate all rois
        roi_indices = np.concatenate(roi_indices, axis=0)
        return rpn_locs, rpn_scores, rois, roi_indices, anchor


def _enumerate_shifted_anchor(anchor_base, feat_stride, height, width):
    # Enumerate all shifted anchors:
    #
    # add A anchors (1, A, 4) to
    # cell K shifts (K, 1, 4) to get
    # shift anchors (K, A, 4)
    # reshape to (K*A, 4) shifted anchors
    # return (K*A, 4)

    # !TODO: add support for torch.CudaTensor
    # xp = cuda.get_array_module(anchor_base)
    # it seems that it can't be boosed using GPU
    import numpy as xp      # numpy is imported under the alias xp, which I did not expect
    shift_y = xp.arange(0, height * feat_stride, feat_stride)       # y offsets: 0, 1 * feat_stride, ..., (height - 1) * feat_stride
    shift_x = xp.arange(0, width * feat_stride, feat_stride)        # x offsets: 0, 1 * feat_stride, ..., (width - 1) * feat_stride
    shift_x, shift_y = xp.meshgrid(shift_x, shift_y)                # grid-point coordinates; see this blog post: https://blog.csdn.net/lllxxq141592654/article/details/81532855
    shift = xp.stack((shift_y.ravel(), shift_x.ravel(),             # ravel() flattens like flatten(), but returns a view when possible instead of always copying
                      shift_y.ravel(), shift_x.ravel()), axis=1)    # stack the four vectors along axis=1 to get the offset of every position: [[y1,x1,y1,x1],[y2,x2,y2,x2],...]

    A = anchor_base.shape[0]    # 9
    K = shift.shape[0]     # number of positions
    # anchor_base: 1*A*4, shift: K*1*4; broadcasting the sum gives K*A*4,
    # i.e. every box as [y_min, x_min, y_max, x_max]
    anchor = anchor_base.reshape((1, A, 4)) + \
             shift.reshape((1, K, 4)).transpose((1, 0, 2))
    anchor = anchor.reshape((K * A, 4)).astype(np.float32)
    return anchor


def _enumerate_shifted_anchor_torch(anchor_base, feat_stride, height, width):
    # Enumerate all shifted anchors:
    #
    # add A anchors (1, A, 4) to
    # cell K shifts (K, 1, 4) to get
    # shift anchors (K, A, 4)
    # reshape to (K*A, 4) shifted anchors
    # return (K*A, 4)

    # !TODO: add support for torch.CudaTensor
    # xp = cuda.get_array_module(anchor_base)
    import torch as t
    anchor_base = t.as_tensor(anchor_base, dtype=t.float32)
    shift_y = t.arange(0, height * feat_stride, feat_stride)
    shift_x = t.arange(0, width * feat_stride, feat_stride)
    # torch equivalent of np.meshgrid(shift_x, shift_y): both grids have shape (height, width);
    # the indexing argument requires torch >= 1.10
    shift_y, shift_x = t.meshgrid(shift_y, shift_x, indexing='ij')
    shift = t.stack((shift_y.reshape(-1), shift_x.reshape(-1),
                     shift_y.reshape(-1), shift_x.reshape(-1)), dim=1)

    A = anchor_base.shape[0]
    K = shift.shape[0]
    anchor = anchor_base.reshape((1, A, 4)) + \
             shift.reshape((K, 1, 4)).float()
    anchor = anchor.reshape((K * A, 4))
    return anchor


def normal_init(m, mean, stddev, truncated=False):
    """
    weight initializer: truncated normal and random normal.
    """
    # x is a parameter
    if truncated:
        m.weight.data.normal_().fmod_(2).mul_(stddev).add_(mean)  # not a perfect approximation
    else:
        m.weight.data.normal_(mean, stddev)
        m.bias.data.zero_()
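To see what _enumerate_shifted_anchor() returns, here is a small sketch that builds the same list with explicit loops instead of NumPy broadcasting. generate_anchor_base is not shown in this section, so `make_anchor_base` is a hypothetical stand-in: the base size of 16 and the sqrt(ratio) sizing rule are assumptions, not necessarily what the repository does.

```python
import math


def make_anchor_base(base_size=16, ratios=(0.5, 1, 2), scales=(8, 16, 32)):
    """Hypothetical stand-in for generate_anchor_base: boxes centred on one cell."""
    cy = cx = base_size / 2.0
    boxes = []
    for ratio in ratios:
        for scale in scales:
            h = base_size * scale * math.sqrt(ratio)        # assumed sizing rule
            w = base_size * scale * math.sqrt(1.0 / ratio)
            boxes.append((cy - h / 2, cx - w / 2, cy + h / 2, cx + w / 2))
    return boxes


def enumerate_shifted_anchors(anchor_base, feat_stride, height, width):
    """Loop form of the (K, 1, 4) + (1, A, 4) broadcast: position first, base anchor second."""
    anchors = []
    for y in range(height):                  # row-major over feature-map cells
        for x in range(width):
            dy, dx = y * feat_stride, x * feat_stride
            for (y1, x1, y2, x2) in anchor_base:
                anchors.append((y1 + dy, x1 + dx, y2 + dy, x2 + dx))
    return anchors


anchor_base = make_anchor_base()
anchors = enumerate_shifted_anchors(anchor_base, feat_stride=16, height=2, width=3)
print(len(anchor_base), len(anchors))
print(anchors[3])    # ratio 1, scale 8 at cell (0, 0)
print(anchors[48])   # the same base anchor shifted to cell (1, 2)
```

Every base box is copied to every cell and shifted by a multiple of feat_stride, so a 40*60 feature map with 9 base boxes gives the 21600 anchors mentioned earlier.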

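1.2.1 Obtaining the proposal regions

Compute the boxes after bounding-box regression (see section 1.2.2) by calling loc2bbox(), which gives the regressed boxes roi (roughly 20000 of them).
Clip the roi boxes to the image (the parts that fall outside the image are cut off).
Filter out boxes that are too small.
Sort the RPN scores and keep the top n_pre_nms boxes (12000 when training, 6000 when testing).
Apply NMS to the remaining roi boxes (the default NMS threshold is 0.7).
Finally, keep the first n_post_nms of the remaining roi boxes (2000 when training, 300 when testing).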
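The last three steps (top-k by score, NMS, top-k again) can be sketched in plain Python. This uses a simple greedy NMS written for illustration; the repository calls an `nms` function on GPU tensors instead, and `select_proposals` and `iou` are hypothetical names.

```python
def iou(a, b):
    """Intersection over union of two (y_min, x_min, y_max, x_max) boxes."""
    ih = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iw = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ih * iw
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_proposals(boxes, scores, n_pre_nms, n_post_nms, nms_thresh=0.7):
    # 1. Sort by score, highest first, and keep the top n_pre_nms.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    order = order[:n_pre_nms]
    # 2. Greedy NMS: keep a box unless it overlaps an already kept box too much.
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= nms_thresh for j in keep):
            keep.append(i)
    # 3. Keep at most n_post_nms survivors.
    return keep[:n_post_nms]


boxes = [
    (0, 0, 100, 100),      # 0: highest score
    (5, 5, 105, 105),      # 1: near-duplicate of box 0
    (200, 200, 300, 300),  # 2: a separate object
    (0, 0, 50, 50),        # 3: low score
]
scores = [0.9, 0.8, 0.7, 0.1]
kept = select_proposals(boxes, scores, n_pre_nms=3, n_post_nms=2)
print(kept)
```

Box 1 is removed by NMS because its IoU with box 0 is about 0.82, and box 3 never reaches NMS because it falls outside the top 3 scores.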

The code follows, with comments reflecting my own understanding:

Code location: ./model/utils/creator_tool.py

class ProposalCreator:
    # unNOTE: I'll make it undifferential
    # unTODO: make sure it's ok
    # It's ok
    """Proposal regions are generated by calling this object.       # produces the proposal regions

    The :meth:`__call__` of this object outputs object detection proposals by
    applying estimated bounding box offsets
    to a set of anchors.

    This class takes parameters to control number of bounding boxes to
    pass to NMS and keep after NMS.
    If the parameters are negative, it uses all the bounding boxes supplied
    or keep all the bounding boxes returned by NMS.

    This class is used for Region Proposal Networks introduced in
    Faster R-CNN [#]_.

    .. [#] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. \
    Faster R-CNN: Towards Real-Time Object Detection with \
    Region Proposal Networks. NIPS 2015.

    Args:
        nms_thresh (float): Threshold value used when calling NMS.
        n_train_pre_nms (int): Number of top scored bounding boxes
            to keep before passing to NMS in train mode.
        n_train_post_nms (int): Number of top scored bounding boxes
            to keep after passing to NMS in train mode.
        n_test_pre_nms (int): Number of top scored bounding boxes
            to keep before passing to NMS in test mode.
        n_test_post_nms (int): Number of top scored bounding boxes
            to keep after passing to NMS in test mode.
        force_cpu_nms (bool): If this is :obj:`True`,
            always use NMS in CPU mode. If :obj:`False`,
            the NMS mode is selected based on the type of inputs.
        min_size (int): A parameter to determine the threshold on
            discarding bounding boxes based on their sizes.

    """

    def __init__(self,                          # the default values are used
                 parent_model,
                 nms_thresh=0.7,
                 n_train_pre_nms=12000,
                 n_train_post_nms=2000,
                 n_test_pre_nms=6000,
                 n_test_post_nms=300,
                 min_size=16
                 ):
        self.parent_model = parent_model
        self.nms_thresh = nms_thresh
        self.n_train_pre_nms = n_train_pre_nms
        self.n_train_post_nms = n_train_post_nms
        self.n_test_pre_nms = n_test_pre_nms
        self.n_test_post_nms = n_test_post_nms
        self.min_size = min_size

    def __call__(self, loc, score,
                 anchor, img_size, scale=1.):
        """input should  be ndarray
        Propose RoIs.

        Inputs :obj:`loc, score, anchor` refer to the same anchor when indexed
        by the same index.

        On notations, :math:`R` is the total number of anchors. This is equal       # R is the total number of anchors
        to product of the height and the width of an image and the number of
        anchor bases per pixel.

        Type of the output is same as the inputs.

        Args:
            loc (array): Predicted offsets and scaling to anchors.
                Its shape is :math:`(R, 4)`.
            score (array): Predicted foreground probability for anchors.
                Its shape is :math:`(R,)`.
            anchor (array): Coordinates of anchors. Its shape is
                :math:`(R, 4)`.
            img_size (tuple of ints): A tuple :obj:`height, width`,
                which contains image size after scaling.
            scale (float): The scaling factor used to scale an image after
                reading it from a file.

        Returns:
            array:
            An array of coordinates of proposal boxes.
            Its shape is :math:`(S, 4)`. :math:`S` is less than
            :obj:`self.n_test_post_nms` in test time and less than
            :obj:`self.n_train_post_nms` in train time. :math:`S` depends on
            the size of the predicted bounding boxes and the number of
            bounding boxes discarded by NMS.

        """
        # NOTE: when test, remember
        # faster_rcnn.eval()
        # to set self.training = False
        if self.parent_model.training:          # training mode
            n_pre_nms = self.n_train_pre_nms    # number kept before NMS, 12000 by default
            n_post_nms = self.n_train_post_nms  # number kept after NMS, 2000 by default
        else:                                   # test mode
            n_pre_nms = self.n_test_pre_nms     # number kept before NMS, 6000 by default
            n_post_nms = self.n_test_post_nms   # number kept after NMS, 300 by default

        # Convert anchors into proposal via bbox transformations.
        # roi = loc2bbox(anchor, loc)
        roi = loc2bbox(anchor, loc)     # decoded regions of interest

        # Clip predicted boxes to image.
        roi[:, slice(0, 4, 2)] = np.clip(       # clip(array, min, max) limits the values of array to [min, max]
            roi[:, slice(0, 4, 2)], 0, img_size[0])     # this keeps the predicted boxes inside the image
        roi[:, slice(1, 4, 2)] = np.clip(
            roi[:, slice(1, 4, 2)], 0, img_size[1])

        # Remove predicted boxes with either height or width < threshold.
        min_size = self.min_size * scale        # filter out boxes that are too small
        hs = roi[:, 2] - roi[:, 0]
        ws = roi[:, 3] - roi[:, 1]
        keep = np.where((hs >= min_size) & (ws >= min_size))[0]
        roi = roi[keep, :]
        score = score[keep]

        # Sort all (proposal, score) pairs by score from highest to lowest.
        # Take top pre_nms_topN (e.g. 6000).
        order = score.ravel().argsort()[::-1]   # sort by score, highest first
        if n_pre_nms > 0:
            order = order[:n_pre_nms]   # keep the first n_pre_nms
        roi = roi[order, :]
        score = score[order]

        # Apply nms (e.g. threshold = 0.7).
        # Take after_nms_topN (e.g. 300).

        # unNOTE: somthing is wrong here!
        # TODO: remove cuda.to_gpu
        keep = nms(
            torch.from_numpy(roi).cuda(),
            torch.from_numpy(score).cuda(),
            self.nms_thresh)
        if n_post_nms > 0:
            keep = keep[:n_post_nms]
        roi = roi[keep.cpu().numpy()]
        return roi
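The clipping lines in __call__ rely on slice(0, 4, 2) selecting columns 0 and 2 (the y coordinates) and slice(1, 4, 2) selecting columns 1 and 3 (the x coordinates). Below is a plain-Python sketch of the clip and min-size steps; `clip_and_filter` is a hypothetical helper, and unlike the original it does not carry the scores along.

```python
def clip_and_filter(rois, img_size, min_size):
    """Clip (y_min, x_min, y_max, x_max) boxes to the image, then drop small ones."""
    img_h, img_w = img_size
    kept = []
    for y1, x1, y2, x2 in rois:
        # Columns 0 and 2 are y coordinates: clip to [0, img_h].
        y1, y2 = min(max(y1, 0), img_h), min(max(y2, 0), img_h)
        # Columns 1 and 3 are x coordinates: clip to [0, img_w].
        x1, x2 = min(max(x1, 0), img_w), min(max(x2, 0), img_w)
        if (y2 - y1) >= min_size and (x2 - x1) >= min_size:
            kept.append((y1, x1, y2, x2))
    return kept


# Which column indices the two slices pick out of a 4-column row.
y_cols = list(range(4))[slice(0, 4, 2)]
x_cols = list(range(4))[slice(1, 4, 2)]

rois = [
    (-10, -20, 50, 60),     # sticks out of the top-left corner
    (100, 100, 110, 300),   # only 10 pixels tall
    (580, 780, 640, 900),   # sticks out of the bottom-right corner
]
result = clip_and_filter(rois, img_size=(600, 800), min_size=16)
print(y_cols, x_cols)
print(result)
```

In the original code min_size is multiplied by scale first, so the size threshold is applied in the coordinates of the resized image.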

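1.2.2 Computing the regressed box positions

This step applies the following regression formulas to obtain the regressed target box, where (p_y, p_x) and (p_h, p_w) are the center and size of the source box and (t_y, t_x, t_h, t_w) are the predicted offsets and scales:

g_y = p_h * t_y + p_y
g_x = p_w * t_x + p_x
g_h = p_h * exp(t_h)
g_w = p_w * exp(t_w)

The code follows, with my comments:

Code location: ./model/utils/bbox_tools.py

def loc2bbox(src_bbox, loc):         # given the source bbox and the parameters loc (offsets and scales), compute the target box G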
    """Decode bounding boxes from bounding box offsets and scales.

    Given bounding box offsets and scales computed by
    :meth:`bbox2loc`, this function decodes the representation to
    coordinates in 2D image coordinates.

    Given scales and offsets :math:`t_y, t_x, t_h, t_w` and a bounding
    box whose center is :math:`(y, x) = p_y, p_x` and size :math:`p_h, p_w`,
    the decoded bounding box's center :math:`\\hat{g}_y`, :math:`\\hat{g}_x`
    and size :math:`\\hat{g}_h`, :math:`\\hat{g}_w` are calculated
    by the following formulas.

    * :math:`\\hat{g}_y = p_h t_y + p_y`
    * :math:`\\hat{g}_x = p_w t_x + p_x`
    * :math:`\\hat{g}_h = p_h \\exp(t_h)`
    * :math:`\\hat{g}_w = p_w \\exp(t_w)`

    The decoding formulas are used in works such as R-CNN [#]_.

    The output is same type as the type of the inputs.

    .. [#] Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik. \
    Rich feature hierarchies for accurate object detection and semantic \
    segmentation. CVPR 2014.

    Args:
        src_bbox (array): A coordinates of bounding boxes.
            Its shape is :math:`(R, 4)`. These coordinates are
            :math:`p_{ymin}, p_{xmin}, p_{ymax}, p_{xmax}`.
        loc (array): An array with offsets and scales.
            The shapes of :obj:`src_bbox` and :obj:`loc` should be same.
            This contains values :math:`t_y, t_x, t_h, t_w`.

    Returns:
        array:
        Decoded bounding box coordinates. Its shape is :math:`(R, 4)`. \
        The second axis contains four values \
        :math:`\\hat{g}_{ymin}, \\hat{g}_{xmin},
        \\hat{g}_{ymax}, \\hat{g}_{xmax}`.

    """

    if src_bbox.shape[0] == 0:
        return xp.zeros((0, 4), dtype=loc.dtype)

    src_bbox = src_bbox.astype(src_bbox.dtype, copy=False)

    src_height = src_bbox[:, 2] - src_bbox[:, 0]       # height
    src_width = src_bbox[:, 3] - src_bbox[:, 1]        # width
    src_ctr_y = src_bbox[:, 0] + 0.5 * src_height      # center y
    src_ctr_x = src_bbox[:, 1] + 0.5 * src_width       # center x

    dy = loc[:, 0::4]       # extract the individual parameters
    dx = loc[:, 1::4]
    dh = loc[:, 2::4]
    dw = loc[:, 3::4]

    ctr_y = dy * src_height[:, xp.newaxis] + src_ctr_y[:, xp.newaxis]       # decoded center coordinates
    ctr_x = dx * src_width[:, xp.newaxis] + src_ctr_x[:, xp.newaxis]
    h = xp.exp(dh) * src_height[:, xp.newaxis]      # decoded height and width
    w = xp.exp(dw) * src_width[:, xp.newaxis]

    dst_bbox = xp.zeros(loc.shape, dtype=loc.dtype)
    dst_bbox[:, 0::4] = ctr_y - 0.5 * h     # convert back to y_min, x_min, y_max, x_max
    dst_bbox[:, 1::4] = ctr_x - 0.5 * w
    dst_bbox[:, 2::4] = ctr_y + 0.5 * h
    dst_bbox[:, 3::4] = ctr_x + 0.5 * w

    return dst_bbox
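A worked example makes the decoding easier to follow. The sketch below applies the same formulas as loc2bbox to a single box using only the math module; `decode_box` is a hypothetical scalar helper, not the repository's vectorized function.

```python
import math


def decode_box(anchor, loc):
    """Decode one (y_min, x_min, y_max, x_max) anchor with (t_y, t_x, t_h, t_w)."""
    y1, x1, y2, x2 = anchor
    t_y, t_x, t_h, t_w = loc
    h, w = y2 - y1, x2 - x1
    cy, cx = y1 + 0.5 * h, x1 + 0.5 * w
    new_cy = t_y * h + cy            # the center moves by a fraction of the anchor size
    new_cx = t_x * w + cx
    new_h = math.exp(t_h) * h        # the size is scaled in log space
    new_w = math.exp(t_w) * w
    return (new_cy - 0.5 * new_h, new_cx - 0.5 * new_w,
            new_cy + 0.5 * new_h, new_cx + 0.5 * new_w)


anchor = (0.0, 0.0, 16.0, 16.0)
unchanged = decode_box(anchor, (0.0, 0.0, 0.0, 0.0))
shifted = decode_box(anchor, (0.5, 0.0, math.log(2.0), 0.0))
print(unchanged)
print(shifted)
```

With all-zero offsets the anchor is returned unchanged. With t_y = 0.5 and t_h = log 2 the center moves down by half the anchor height (from y = 8 to y = 16) and the height doubles to 32, so the box becomes approximately (0, 0, 32, 16).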

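1.3 Classification network

The RPN has already produced rois; what remains is to regress the box positions and classify the contents of each roi, in the following steps:

Map the rois onto the feature map according to the scaling ratio and apply RoI pooling so that every roi yields a feature of the same size. The output is pool, with shape = (R', 512, 7, 7).
Flatten pool to shape = (R', 7*7*512) = (R', 25088).
Pass pool through classifier (the fully connected layers kept from VGG16) to get fc7, then feed fc7 to the fully connected layer for box regression. The output is roi_cls_locs, with shape = (R', n_class*4).
Feed fc7 to the fully connected layer for classification. The output is roi_scores, with shape = (R', n_class).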
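The RoI pooling step above can be illustrated with a simplified single-channel version in plain Python. This is a sketch only: it uses one common floor/ceil rule for the bin boundaries and is not the RoIPool implementation that the repository imports. `roi_max_pool` and the toy 4*4 feature map are made up for the example.

```python
import math


def roi_max_pool(feature, roi, out_size, spatial_scale):
    """Simplified single-channel RoI max pooling (illustrative only).

    feature: 2D list of numbers (H x W feature map).
    roi: (y_min, x_min, y_max, x_max) in input-image pixels.
    """
    height, width = len(feature), len(feature[0])
    # Project the RoI from image coordinates onto the feature map.
    y1, x1, y2, x2 = [int(round(c * spatial_scale)) for c in roi]
    roi_h = max(y2 - y1 + 1, 1)
    roi_w = max(x2 - x1 + 1, 1)

    pooled = []
    for ph in range(out_size):
        # Feature-map rows covered by output row ph, clamped to the map.
        h_start = min(max(y1 + math.floor(ph * roi_h / out_size), 0), height)
        h_end = min(max(y1 + math.ceil((ph + 1) * roi_h / out_size), 0), height)
        row = []
        for pw in range(out_size):
            w_start = min(max(x1 + math.floor(pw * roi_w / out_size), 0), width)
            w_end = min(max(x1 + math.ceil((pw + 1) * roi_w / out_size), 0), width)
            cells = [feature[y][x]
                     for y in range(h_start, h_end)
                     for x in range(w_start, w_end)]
            row.append(max(cells) if cells else 0)  # empty bins pool to 0
        pooled.append(row)
    return pooled


feature = [[y * 4 + x for x in range(4)] for y in range(4)]  # toy 4x4 map
whole = roi_max_pool(feature, roi=(0, 0, 48, 48), out_size=2, spatial_scale=1 / 16)
inner = roi_max_pool(feature, roi=(16, 16, 32, 32), out_size=2, spatial_scale=1 / 16)
flat_dim = 512 * 7 * 7   # length of one flattened RoI feature in the real network
print(whole, inner, flat_dim)
```

Both RoIs give a 2*2 output even though they cover different areas of the feature map. In the real head the same pooling runs on each of the 512 channels with a 7*7 output, which is where the flattened length of 25088 comes from.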

Here is the source for this part (with comments based on my understanding):

Code location: ./model/faster_rcnn_vgg16.py

class VGG16RoIHead(nn.Module):
    """Faster R-CNN Head for VGG-16 based implementation.
    This class is used as a head for Faster R-CNN.
    This outputs class-wise localizations and classification based on feature
    maps in the given RoIs.
    
    Args:
        n_class (int): The number of classes possibly including the background.
        roi_size (int): Height and width of the feature maps after RoI-pooling.
        spatial_scale (float): Scale of the roi is resized.
        classifier (nn.Module): Two layer Linear ported from vgg16

    """

    def __init__(self, n_class, roi_size, spatial_scale,
                 classifier):
        # n_class includes the background
        super(VGG16RoIHead, self).__init__()

        self.classifier = classifier
        self.cls_loc = nn.Linear(4096, n_class * 4)
        self.score = nn.Linear(4096, n_class)

        normal_init(self.cls_loc, 0, 0.001)     # initialize the weights
        normal_init(self.score, 0, 0.01)

        self.n_class = n_class
        self.roi_size = roi_size
        self.spatial_scale = spatial_scale
        self.roi = RoIPool( (self.roi_size, self.roi_size),self.spatial_scale)      # 7*7 output size; spatial scale

    def forward(self, x, rois, roi_indices):
        """Forward the chain.

        We assume that there are :math:`N` batches.

        Args:
            x (Variable): 4D image variable.
            rois (Tensor): A bounding box array containing coordinates of
                proposal boxes.  This is a concatenation of bounding box
                arrays from multiple images in the batch.
                Its shape is :math:`(R', 4)`. Given :math:`R_i` proposed
                RoIs from the :math:`i` th image,
                :math:`R' = \\sum _{i=1} ^ N R_i`.
            roi_indices (Tensor): An array containing indices of images to
                which bounding boxes correspond to. Its shape is :math:`(R',)`.

        """
        # in case roi_indices is  ndarray
        roi_indices = at.totensor(roi_indices).float()
        rois = at.totensor(rois).float()
        indices_and_rois = t.cat([roi_indices[:, None], rois], dim=1)       # shape = (R', 5)
        # NOTE: important: yx->xy
        xy_indices_and_rois = indices_and_rois[:, [0, 2, 1, 4, 3]]      # (index, x_min, y_min, x_max, y_max)
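        indices_and_rois =  xy_indices_and_rois.contiguous()    # make the tensor contiguous in memory

        pool = self.roi(x, indices_and_rois)    # RoI pooling: turn RoIs of different sizes into fixed-size outputs, shape = (R', 512, 7, 7)
        pool = pool.view(pool.size(0), -1)      # flatten, shape = (R', 7*7*512) = (R', 25088)
        fc7 = self.classifier(pool)     # fully connected layers from VGG16, shape = (R', 4096)
        roi_cls_locs = self.cls_loc(fc7)    # fully connected layer for box regression
        roi_scores = self.score(fc7)    # fully connected layer for class scores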
        return roi_cls_locs, roi_scores
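The "yx->xy" reordering in forward() is easy to misread, so here is a plain-Python sketch of what happens to each row before RoI pooling. `to_roipool_rows` is a hypothetical helper; the real code does the same thing with tensor indexing.

```python
def to_roipool_rows(rois, roi_indices):
    """Prepend the image index and swap (y, x) pairs to (x, y) pairs."""
    rows = []
    for index, (y1, x1, y2, x2) in zip(roi_indices, rois):
        row = [float(index), y1, x1, y2, x2]             # (index, y_min, x_min, y_max, x_max)
        rows.append([row[i] for i in (0, 2, 1, 4, 3)])   # (index, x_min, y_min, x_max, y_max)
    return rows


rois = [(10.0, 20.0, 110.0, 220.0), (5.0, 6.0, 7.0, 8.0)]
roi_indices = [0, 1]
rows = to_roipool_rows(rois, roi_indices)
print(rows)
```

The rest of the code stores boxes as (y_min, x_min, y_max, x_max), while the pooling layer is given rows of (index, x_min, y_min, x_max, y_max), which is why the column permutation [0, 2, 1, 4, 3] is needed.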

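1.4 Producing the final result

Code location: ./model/faster_rcnn.py. The focus here is the _suppress() function.

The main steps are:

Feed roi_scores into a softmax to obtain the confidence probabilities prob (this part of the code is in predict(); everything described below is in _suppress()).
Then, for the boxes of class l, cls_bbox_l, and their confidence probabilities prob_l, first keep those with prob_l above the threshold and then apply NMS to the boxes. This gives the final result; record the corresponding positions, labels and confidence probabilities.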
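The softmax and thresholding part of these steps can be sketched in plain Python. NMS is left out here to keep the example short, and `threshold_per_class` is a hypothetical helper rather than the repository's _suppress().

```python
import math


def softmax(scores):
    """Numerically stable softmax over one row of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def threshold_per_class(roi_scores, score_thresh):
    """Return (roi_index, label, prob) for each foreground class above the threshold."""
    probs = [softmax(row) for row in roi_scores]
    n_class = len(roi_scores[0])
    picked = []
    for l in range(1, n_class):                     # class 0 is the background
        for r, row in enumerate(probs):
            if row[l] > score_thresh:
                picked.append((r, l - 1, row[l]))   # labels are shifted to start at 0
    return picked


roi_scores = [
    [0.0, 4.0, 0.0],   # RoI 0: confident about class 1
    [5.0, 0.0, 0.0],   # RoI 1: background
    [0.0, 0.0, 3.0],   # RoI 2: confident about class 2
]
picked = threshold_per_class(roi_scores, score_thresh=0.7)
print([(r, label) for r, label, _ in picked])
```

The background RoI is never reported because the loop starts at class 1, and the reported labels are l - 1, so they run from 0 to n_class - 2 as the comment in _suppress() states.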

The code follows (with my own comments):

class FasterRCNN(nn.Module):
    """Base class for Faster R-CNN.

    This is a base class for Faster R-CNN links supporting object detection
    API [#]_. The following three stages constitute Faster R-CNN.

    1. **Feature extraction**: Images are taken and their \
        feature maps are calculated.
    2. **Region Proposal Networks**: Given the feature maps calculated in \
        the previous stage, produce set of RoIs around objects.
    3. **Localization and Classification Heads**: Using feature maps that \
        belong to the proposed RoIs, classify the categories of the objects \
        in the RoIs and improve localizations.

    Each stage is carried out by one of the callable
    :class:`torch.nn.Module` objects :obj:`feature`, :obj:`rpn` and :obj:`head`.

    There are two functions :meth:`predict` and :meth:`__call__` to conduct
    object detection.
    :meth:`predict` takes images and returns bounding boxes that are converted
    to image coordinates. This will be useful for a scenario when
    Faster R-CNN is treated as a black box function, for instance.
    :meth:`__call__` is provided for a scenario when intermediate outputs
    are needed, for instance, for training and debugging.

    Links that support object detection API have method :meth:`predict` with
    the same interface. Please refer to :meth:`predict` for
    further details.

    .. [#] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. \
    Faster R-CNN: Towards Real-Time Object Detection with \
    Region Proposal Networks. NIPS 2015.

    Args:
        extractor (nn.Module): A module that takes a BCHW image
            array and returns feature maps.
        rpn (nn.Module): A module that has the same interface as
            :class:`model.region_proposal_network.RegionProposalNetwork`.
            Please refer to the documentation found there.
        head (nn.Module): A module that takes
            a BCHW variable, RoIs and batch indices for RoIs. This returns class
            dependent localization parameters and class scores.
        loc_normalize_mean (tuple of four floats): Mean values of
            localization estimates.
        loc_normalize_std (tuple of four floats): Standard deviation
            of localization estimates.

    """

    def __init__(self, extractor, rpn, head,
                loc_normalize_mean = (0., 0., 0., 0.),
                loc_normalize_std = (0.1, 0.1, 0.2, 0.2)
    ):
        super(FasterRCNN, self).__init__()
        self.extractor = extractor
        self.rpn = rpn
        self.head = head

        # mean and std
        self.loc_normalize_mean = loc_normalize_mean
        self.loc_normalize_std = loc_normalize_std
        self.use_preset('evaluate')

    @property
    def n_class(self):
        # Total number of classes including the background.
        return self.head.n_class

    def forward(self, x, scale=1.):
        """Forward Faster R-CNN.

        Scaling parameter :obj:`scale` is used by RPN to determine the
        threshold to select small objects, which are going to be
        rejected irrespective of their confidence scores.

        Here are notations used.

        * :math:`N` is the number of batch size
        * :math:`R'` is the total number of RoIs produced across batches. \
            Given :math:`R_i` proposed RoIs from the :math:`i` th image, \
            :math:`R' = \\sum _{i=1} ^ N R_i`.
        * :math:`L` is the number of classes excluding the background.

        Classes are ordered by the background, the first class, ..., and
        the :math:`L` th class.

        Args:
            x (autograd.Variable): 4D image variable.
            scale (float): Amount of scaling applied to the raw image
                during preprocessing.

        Returns:
            Variable, Variable, array, array:
            Returns tuple of four values listed below.

            * **roi_cls_locs**: Offsets and scalings for the proposed RoIs. \
                Its shape is :math:`(R', (L + 1) \\times 4)`.
            * **roi_scores**: Class predictions for the proposed RoIs. \
                Its shape is :math:`(R', L + 1)`.
            * **rois**: RoIs proposed by RPN. Its shape is \
                :math:`(R', 4)`.
            * **roi_indices**: Batch indices of RoIs. Its shape is \
                :math:`(R',)`.

        """
        img_size = x.shape[2:]

        h = self.extractor(x)       # compute the feature map
        rpn_locs, rpn_scores, rois, roi_indices, anchor = \
            self.rpn(h, img_size, scale)    # obtain the proposals
        roi_cls_locs, roi_scores = self.head(
            h, rois, roi_indices)   # class-wise box offsets and class scores
        return roi_cls_locs, roi_scores, rois, roi_indices

    def use_preset(self, preset):
        """Use the given preset during prediction.

        This method changes values of :obj:`self.nms_thresh` and
        :obj:`self.score_thresh`. These values are a threshold value
        used for non maximum suppression and a threshold value
        to discard low confidence proposals in :meth:`predict`,
        respectively.

        If the attributes need to be changed to something
        other than the values provided in the presets, please modify
        them by directly accessing the public attributes.

        Args:
            preset ({'visualize', 'evaluate'): A string to determine the
                preset to use.

        """
        if preset == 'visualize':
            self.nms_thresh = 0.3
            self.score_thresh = 0.7
        elif preset == 'evaluate':
            self.nms_thresh = 0.3
            self.score_thresh = 0.05
        else:
            raise ValueError('preset must be visualize or evaluate')

    def _suppress(self, raw_cls_bbox, raw_prob):
        bbox = list()
        label = list()
        score = list()
        # skip cls_id = 0 because it is the background class
        for l in range(1, self.n_class):
            cls_bbox_l = raw_cls_bbox.reshape((-1, self.n_class, 4))[:, l, :]   # boxes predicted for class l
            prob_l = raw_prob[:, l]     # probabilities for class l
            mask = prob_l > self.score_thresh   # True where the probability exceeds the threshold
            cls_bbox_l = cls_bbox_l[mask]   # keep the boxes above the threshold
            prob_l = prob_l[mask]   # keep the matching probabilities
            keep = nms(cls_bbox_l, prob_l, self.nms_thresh)      # run NMS on the surviving boxes
            # import ipdb;ipdb.set_trace()
            # keep = cp.asnumpy(keep)
            bbox.append(cls_bbox_l[keep].cpu().numpy())     # store the boxes of class l
            # The labels are in [0, self.n_class - 2].
            label.append((l - 1) * np.ones((len(keep),)))   # store the matching labels
            score.append(prob_l[keep].cpu().numpy())        # store the matching confidences
        bbox = np.concatenate(bbox, axis=0).astype(np.float32)      # concatenate the per-class results into one array (same below)
        label = np.concatenate(label, axis=0).astype(np.int32)
        score = np.concatenate(score, axis=0).astype(np.float32)
        return bbox, label, score

    @nograd
    def predict(self, imgs,sizes=None,visualize=False):
        """Detect objects from images.

        This method predicts objects for each image.

        Args:
            imgs (iterable of numpy.ndarray): Arrays holding images.
                All images are in CHW and RGB format
                and the range of their value is :math:`[0, 255]`.

        Returns:
           tuple of lists:
           This method returns a tuple of three lists,
           :obj:`(bboxes, labels, scores)`.

           * **bboxes**: A list of float arrays of shape :math:`(R, 4)`, \
               where :math:`R` is the number of bounding boxes in an image. \
               Each bounding box is organized by \
               :math:`(y_{min}, x_{min}, y_{max}, x_{max})` \
               in the second axis.
           * **labels** : A list of integer arrays of shape :math:`(R,)`. \
               Each value indicates the class of the bounding box. \
               Values are in range :math:`[0, L - 1]`, where :math:`L` is the \
               number of the foreground classes.
           * **scores** : A list of float arrays of shape :math:`(R,)`. \
               Each value indicates how confident the prediction is.

        """
        self.eval()
        if visualize:
            self.use_preset('visualize')
            prepared_imgs = list()
            sizes = list()
            for img in imgs:
                size = img.shape[1:]    # original height and width of the image
                img = preprocess(at.tonumpy(img))   # preprocess the input image
                prepared_imgs.append(img)
                sizes.append(size)
        else:
            prepared_imgs = imgs
        bboxes = list()
        labels = list()
        scores = list()
        for img, size in zip(prepared_imgs, sizes):     # zip pairs the items up: [(prepared_imgs[0], sizes[0]), (prepared_imgs[1], sizes[1]), ...]
            img = at.totensor(img[None]).float()
            scale = img.shape[3] / size[1]
            roi_cls_loc, roi_scores, rois, _ = self(img, scale=scale)   # forward pass through the whole network
            # We are assuming that batch size is 1.
            roi_score = roi_scores.data
            roi_cls_loc = roi_cls_loc.data
            roi = at.totensor(rois) / scale

            # Convert predictions to bounding boxes in image coordinates.
            # Bounding boxes are scaled to the scale of the input images.
            mean = t.Tensor(self.loc_normalize_mean).cuda(). \
                repeat(self.n_class)[None]
            std = t.Tensor(self.loc_normalize_std).cuda(). \
                repeat(self.n_class)[None]

            roi_cls_loc = (roi_cls_loc * std + mean)    # undo the normalization: multiply by std, add mean
            roi_cls_loc = roi_cls_loc.view(-1, self.n_class, 4)     # shape = (R, n_class, 4)
            roi = roi.view(-1, 1, 4).expand_as(roi_cls_loc)     # broadcast to n_class copies: (R, 1, 4) -> (R, n_class, 4)
            cls_bbox = loc2bbox(at.tonumpy(roi).reshape((-1, 4)),
                                at.tonumpy(roi_cls_loc).reshape((-1, 4)))   # apply the box-regression formulas to get the final positions
            cls_bbox = at.totensor(cls_bbox)
            cls_bbox = cls_bbox.view(-1, self.n_class * 4)      # shape = (R, n_class*4)
            # clip bounding box
            cls_bbox[:, 0::2] = (cls_bbox[:, 0::2]).clamp(min=0, max=size[0])   # clip boxes to the image (out-of-range values become the min or max)
            cls_bbox[:, 1::2] = (cls_bbox[:, 1::2]).clamp(min=0, max=size[1])

            prob = (F.softmax(at.totensor(roi_score), dim=1))   # softmax over the class scores gives per-class probabilities

            bbox, label, score = self._suppress(cls_bbox, prob)     # final boxes with their labels and confidences
            bboxes.append(bbox)
            labels.append(label)
            scores.append(score)

        self.use_preset('evaluate')
        self.train()
        return bboxes, labels, scores

    def get_optimizer(self):
        """
        Return the optimizer. It can be overridden if you want to specify
        a special optimizer.
        """
        lr = opt.lr
        params = []
        for key, value in dict(self.named_parameters()).items():
            if value.requires_grad:
                if 'bias' in key:
                    params += [{'params': [value], 'lr': lr * 2, 'weight_decay': 0}]
                else:
                    params += [{'params': [value], 'lr': lr, 'weight_decay': opt.weight_decay}]
        if opt.use_adam:
            self.optimizer = t.optim.Adam(params)
        else:
            self.optimizer = t.optim.SGD(params, momentum=0.9)
        return self.optimizer

    def scale_lr(self, decay=0.1):
        for param_group in self.optimizer.param_groups:
            param_group['lr'] *= decay
        return self.optimizer
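
The nms call inside _suppress is not defined in the listing above. As a rough illustration of what such a routine does, here is a minimal greedy NMS sketch in NumPy. The name greedy_nms and the toy boxes are hypothetical, the boxes are assumed to be in (ymin, xmin, ymax, xmax) order as in the docstrings above, and the routine actually used by the repository operates on tensors and may differ in details.

```python
import numpy as np

def greedy_nms(boxes, scores, iou_thresh):
    """Return the indices of the kept boxes, highest score first.

    boxes:  (N, 4) array in (ymin, xmin, ymax, xmax) order.
    scores: (N,) array of confidences.
    """
    order = np.argsort(-scores)          # indices sorted by descending score
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))              # the best remaining box is always kept
        rest = order[1:]
        # Intersection of the best box with every remaining box.
        tl = np.maximum(boxes[i, :2], boxes[rest, :2])
        br = np.minimum(boxes[i, 2:], boxes[rest, 2:])
        wh = np.clip(br - tl, 0, None)
        inter = wh[:, 0] * wh[:, 1]
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_thresh]  # drop boxes that overlap it too much
    return keep

boxes = np.array([[0, 0, 10, 10],
                  [1, 1, 11, 11],
                  [20, 20, 30, 30]], dtype=np.float32)
scores = np.array([0.9, 0.8, 0.75], dtype=np.float32)

# Mirror the order used in _suppress: threshold the scores first, then run NMS.
mask = scores > 0.7
idx = greedy_nms(boxes[mask], scores[mask], iou_thresh=0.3)
kept_boxes = boxes[mask][idx]
```

In this toy case the second box overlaps the first with an IoU of about 0.68, so it is suppressed, while the third box does not overlap anything and survives.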

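2 Training

2.1 Training the RPN

2.1.1 Computing the offset parameters from target and source boxes

This is simply the inverse of the process in 1.2.2: inverting the regression formulas gives the offset parameters.

The code follows (with some comments reflecting my own understanding).

Code location: ./utils/bbox_tools.py

def bbox2loc(src_bbox, dst_bbox):         # given source and target boxes, compute the offsets; the inverse of loc2bbox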
    """Encodes the source and the destination bounding boxes to "loc".

    Given bounding boxes, this function computes offsets and scales
    to match the source bounding boxes to the target bounding boxes.
    Mathematically, given a bounding box whose center is
    :math:`(y, x) = p_y, p_x` and
    size :math:`p_h, p_w` and the target bounding box whose center is
    :math:`g_y, g_x` and size :math:`g_h, g_w`, the offsets and scales
    :math:`t_y, t_x, t_h, t_w` can be computed by the following formulas.

    * :math:`t_y = \\frac{(g_y - p_y)} {p_h}`
    * :math:`t_x = \\frac{(g_x - p_x)} {p_w}`
    * :math:`t_h = \\log(\\frac{g_h} {p_h})`
    * :math:`t_w = \\log(\\frac{g_w} {p_w})`

    The output is same type as the type of the inputs.
    The encoding formulas are used in works such as R-CNN [#]_.

    .. [#] Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik. \
    Rich feature hierarchies for accurate object detection and semantic \
    segmentation. CVPR 2014.

    Args:
        src_bbox (array): An image coordinate array whose shape is
            :math:`(R, 4)`. :math:`R` is the number of bounding boxes.
            These coordinates are
            :math:`p_{ymin}, p_{xmin}, p_{ymax}, p_{xmax}`.
        dst_bbox (array): An image coordinate array whose shape is
            :math:`(R, 4)`.
            These coordinates are
            :math:`g_{ymin}, g_{xmin}, g_{ymax}, g_{xmax}`.

    Returns:
        array:
        Bounding box offsets and scales from :obj:`src_bbox` \
        to :obj:`dst_bbox`. \
        This has shape :math:`(R, 4)`.
        The second axis contains four values :math:`t_y, t_x, t_h, t_w`.

    """

    height = src_bbox[:, 2] - src_bbox[:, 0]    # source box
    width = src_bbox[:, 3] - src_bbox[:, 1]
    ctr_y = src_bbox[:, 0] + 0.5 * height
    ctr_x = src_bbox[:, 1] + 0.5 * width

    base_height = dst_bbox[:, 2] - dst_bbox[:, 0]   # target box
    base_width = dst_bbox[:, 3] - dst_bbox[:, 1]
    base_ctr_y = dst_bbox[:, 0] + 0.5 * base_height
    base_ctr_x = dst_bbox[:, 1] + 0.5 * base_width

    eps = xp.finfo(height.dtype).eps    # machine epsilon, a tiny positive number (2.220446049250313e-16 for float64)
    height = xp.maximum(height, eps)    # raise heights below eps (e.g. 0) to eps to avoid division by zero
    width = xp.maximum(width, eps)      # same for widths

    dy = (base_ctr_y - ctr_y) / height
    dx = (base_ctr_x - ctr_x) / width
    dh = xp.log(base_height / height)
    dw = xp.log(base_width / width)

    loc = xp.vstack((dy, dx, dh, dw)).transpose()
    return loc
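
To make the "inverse process" concrete, the following sketch encodes one box pair with the formulas from the docstring above and then decodes it again. The names encode and decode and the toy coordinates are assumptions made for illustration, and the eps guard of bbox2loc is left out for brevity.

```python
import numpy as np

def encode(src, dst):
    """Offsets (t_y, t_x, t_h, t_w) that map src boxes onto dst boxes."""
    h = src[:, 2] - src[:, 0]
    w = src[:, 3] - src[:, 1]
    cy = src[:, 0] + 0.5 * h
    cx = src[:, 1] + 0.5 * w
    gh = dst[:, 2] - dst[:, 0]
    gw = dst[:, 3] - dst[:, 1]
    gcy = dst[:, 0] + 0.5 * gh
    gcx = dst[:, 1] + 0.5 * gw
    return np.stack([(gcy - cy) / h, (gcx - cx) / w,
                     np.log(gh / h), np.log(gw / w)], axis=1)

def decode(src, loc):
    """Inverse mapping: apply the offsets to src boxes to get the target boxes."""
    h = src[:, 2] - src[:, 0]
    w = src[:, 3] - src[:, 1]
    cy = src[:, 0] + 0.5 * h
    cx = src[:, 1] + 0.5 * w
    ncy = loc[:, 0] * h + cy
    ncx = loc[:, 1] * w + cx
    nh = np.exp(loc[:, 2]) * h
    nw = np.exp(loc[:, 3]) * w
    return np.stack([ncy - 0.5 * nh, ncx - 0.5 * nw,
                     ncy + 0.5 * nh, ncx + 0.5 * nw], axis=1)

# Boxes are (ymin, xmin, ymax, xmax).
anchor = np.array([[0., 0., 10., 10.]])   # center (5, 5), size 10 x 10
gt = np.array([[5., 5., 25., 15.]])       # center (15, 10), height 20, width 10

loc = encode(anchor, gt)          # t_y = 1.0, t_x = 0.5, t_h = log 2, t_w = 0
recovered = decode(anchor, loc)   # recovers gt up to floating-point error
```

The center shifts are expressed in units of the source box size and the size changes in log space, which is why decoding needs only the source box and the four offsets.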

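2.1.2 Obtaining the ground truth

Feeding the hand-annotated bounding boxes and all anchors into bbox2loc() yields the ground truth of the offset parameters. However, the vast majority of anchors are negative samples (far from any bounding box) and positives are rare, so only a subset of positive and negative samples is selected.

Compute the IoU between the bounding boxes and all anchors, then randomly draw up to 128 positive samples from those with IoU above 0.7 and label them 1
Randomly draw 128 negative samples from those with IoU below 0.3 and label them 0
This gives 256 samples sample_roi, together with their labels gt_roi_label (foreground or background)
Feeding the 256 samples sample_roi (source boxes) and the hand-annotated bounding boxes (target boxes) into bbox2loc() then yields the exact offsets gt_roi_loc

The code follows (with comments reflecting my own understanding). Note that the class listed here, ProposalTargetCreator, applies this sampling scheme to RoIs with its own defaults: 128 samples in total, at most 25% of them positive, and an IoU threshold of 0.5 separating positives from negatives. The 0.7 / 0.3 / 256 figures above describe the sampling of anchors.

Code location: ./utils/creator_tool.py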

class ProposalTargetCreator(object):    # samples 128 RoIs and assigns them ground truth (regression targets and class labels)
    """Assign ground truth bounding boxes to given RoIs.

    The :meth:`__call__` of this class generates training targets
    for each object proposal.
    This is used to train Faster RCNN [#]_.

    .. [#] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. \
    Faster R-CNN: Towards Real-Time Object Detection with \
    Region Proposal Networks. NIPS 2015.

    Args:
        n_sample (int): The number of sampled regions.
        pos_ratio (float): Fraction of regions that is labeled as a
            foreground.
        pos_iou_thresh (float): IoU threshold for a RoI to be considered as a
            foreground.
        neg_iou_thresh_hi (float): RoI is considered to be the background
            if IoU is in
            [:obj:`neg_iou_thresh_lo`, :obj:`neg_iou_thresh_hi`).
        neg_iou_thresh_lo (float): See above.

    """

    def __init__(self,
                 n_sample=128,
                 pos_ratio=0.25, pos_iou_thresh=0.5,
                 neg_iou_thresh_hi=0.5, neg_iou_thresh_lo=0.0
                 ):
        self.n_sample = n_sample
        self.pos_ratio = pos_ratio
        self.pos_iou_thresh = pos_iou_thresh
        self.neg_iou_thresh_hi = neg_iou_thresh_hi
        self.neg_iou_thresh_lo = neg_iou_thresh_lo  # NOTE:default 0.1 in py-faster-rcnn

    def __call__(self, roi, bbox, label,
                 loc_normalize_mean=(0., 0., 0., 0.),
                 loc_normalize_std=(0.1, 0.1, 0.2, 0.2)):
        """Assigns ground truth to sampled proposals.

        This function samples total of :obj:`self.n_sample` RoIs
        from the combination of :obj:`roi` and :obj:`bbox`.
        The RoIs are assigned with the ground truth class labels as well as
        bounding box offsets and scales to match the ground truth bounding
        boxes. As many as :obj:`pos_ratio * self.n_sample` RoIs are
        sampled as foregrounds.

        Offsets and scales of bounding boxes are calculated using
        :func:`model.utils.bbox_tools.bbox2loc`.
        Also, types of input arrays and output arrays are same.

        Here are notations.

        * :math:`S` is the total number of sampled RoIs, which equals \
            :obj:`self.n_sample`.
        * :math:`L` is number of object classes possibly including the \
            background.

        Args:
            roi (array): Region of Interests (RoIs) from which we sample.
                Its shape is :math:`(R, 4)`
            bbox (array): The coordinates of ground truth bounding boxes.
                Its shape is :math:`(R', 4)`.
            label (array): Ground truth bounding box labels. Its shape
                is :math:`(R',)`. Its range is :math:`[0, L - 1]`, where
                :math:`L` is the number of foreground classes.
            loc_normalize_mean (tuple of four floats): Mean values to normalize
                coordinates of bounding boxes.
            loc_normalize_std (tuple of four floats): Standard deviation of
                the coordinates of bounding boxes.

        Returns:
            (array, array, array):

            * **sample_roi**: Regions of interests that are sampled. \
                Its shape is :math:`(S, 4)`.
            * **gt_roi_loc**: Offsets and scales to match \
                the sampled RoIs to the ground truth bounding boxes. \
                Its shape is :math:`(S, 4)`.
            * **gt_roi_label**: Labels assigned to sampled RoIs. Its shape is \
                :math:`(S,)`. Its range is :math:`[0, L]`. The label with \
                value 0 is the background.

        """
        n_bbox, _ = bbox.shape      # number of ground-truth boxes in this image

        roi = np.concatenate((roi, bbox), axis=0)   # also treat the ground-truth boxes as RoIs

        pos_roi_per_image = np.round(self.n_sample * self.pos_ratio)    # target number of positives (rounded)
        iou = bbox_iou(roi, bbox)   # IoU matrix, shape (R + R', R') after the concatenation above
        gt_assignment = iou.argmax(axis=1)  # per RoI, index of the ground-truth box with the highest IoU
        max_iou = iou.max(axis=1)       # per RoI, that highest IoU value
        # Offset range of classes from [0, n_fg_class - 1] to [1, n_fg_class].
        # The label with value 0 is the background.
        gt_roi_label = label[gt_assignment] + 1     # label every RoI; 0 is reserved for background

        # Select foreground RoIs as those with >= pos_iou_thresh IoU.
        pos_index = np.where(max_iou >= self.pos_iou_thresh)[0]     # indices of positives: IoU at or above the threshold (0.5 by default)
        pos_roi_per_this_image = int(min(pos_roi_per_image, pos_index.size))    # smaller of pos_roi_per_image and len(pos_index)
        if pos_index.size > 0:
            pos_index = np.random.choice(
                pos_index, size=pos_roi_per_this_image, replace=False)      # randomly pick pos_roi_per_this_image of them

        # Select background RoIs as those within
        # [neg_iou_thresh_lo, neg_iou_thresh_hi).
        neg_index = np.where((max_iou < self.neg_iou_thresh_hi) &       # indices of negatives: IoU within [lo, hi)
                             (max_iou >= self.neg_iou_thresh_lo))[0]
        neg_roi_per_this_image = self.n_sample - pos_roi_per_this_image     # total samples minus positives
        neg_roi_per_this_image = int(min(neg_roi_per_this_image,        # capped by the number of negatives available
                                         neg_index.size))
        if neg_index.size > 0:
            neg_index = np.random.choice(
                neg_index, size=neg_roi_per_this_image, replace=False)      # randomly pick the negatives

        # The indices that we're selecting (both positive and negative).
        keep_index = np.append(pos_index, neg_index)        # concatenate the two index arrays
        gt_roi_label = gt_roi_label[keep_index]     # labels of the sampled RoIs
        gt_roi_label[pos_roi_per_this_image:] = 0  # negative labels --> 0
        sample_roi = roi[keep_index]        # the sampled RoIs

        # Compute offsets and scales to match sampled RoIs to the GTs.
        gt_roi_loc = bbox2loc(sample_roi, bbox[gt_assignment[keep_index]])      # inverse of the regression formulas
        gt_roi_loc = ((gt_roi_loc - np.array(loc_normalize_mean, np.float32)
                       ) / np.array(loc_normalize_std, np.float32))

        return sample_roi, gt_roi_loc, gt_roi_label
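
The index bookkeeping in __call__ is easier to follow on a tiny example. The sketch below replays the same steps on a hand-written IoU matrix. All numbers, the class ids and the small n_sample are made up for illustration, and the 0.5 / 0.0 thresholds are simply the defaults from __init__ above.

```python
import numpy as np

# Hypothetical IoU matrix: 5 candidate RoIs (rows) vs. 2 ground-truth boxes (columns).
iou = np.array([[0.80, 0.10],
                [0.20, 0.60],
                [0.30, 0.05],
                [0.00, 0.45],
                [0.55, 0.50]])
label = np.array([6, 14])                  # foreground class ids of the 2 boxes

gt_assignment = iou.argmax(axis=1)         # best-matching ground-truth box per RoI
max_iou = iou.max(axis=1)                  # IoU with that box
gt_roi_label = label[gt_assignment] + 1    # shift by one so that 0 can mean background

pos_index = np.where(max_iou >= 0.5)[0]                       # candidates for foreground
neg_index = np.where((max_iou < 0.5) & (max_iou >= 0.0))[0]   # candidates for background

rng = np.random.default_rng(0)
n_sample, pos_ratio = 4, 0.25
n_pos = int(min(np.round(n_sample * pos_ratio), pos_index.size))
n_neg = int(min(n_sample - n_pos, neg_index.size))
pos_pick = rng.choice(pos_index, size=n_pos, replace=False)
neg_pick = rng.choice(neg_index, size=n_neg, replace=False)

keep_index = np.append(pos_pick, neg_pick)   # positives first, then negatives
sampled_label = gt_roi_label[keep_index]
sampled_label[n_pos:] = 0                    # everything after the positives is background
```

Here three RoIs qualify as positives but only one is drawn (25% of 4), and only two negatives exist, so three samples come back instead of four. The same shortfall can occur in the real class when an image yields too few candidates.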

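2.1.3 Computing the loss

The part of the RPN that needs training is the convolutional layers that output the location-correction parameters and the class parameters (foreground vs. background).

For the location-correction parameters, we take the hand-annotated bounding boxes together with the generated anchor boxes and derive the ground truth of the correction parameters (the inverse of deriving an RoI from an anchor box and its correction parameters), then compute the Smooth L1 loss (bounding-box regression).

For the class parameters, we compute the softmax loss (classification probability).

With these two loss values, the final loss is given by the following loss function:

$L(\left\{ p_i \right\},\left\{ t_i \right\})=\cfrac{1}{N_{cls}}\underset{i}{\sum}L_{cls}(p_i,p_i^*)+\lambda \cfrac{1}{N_{reg}}\underset{i}{\sum}p_i^*L_{reg}(t_i,t_i^*)$

Code (location: ./trainer.py):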

def _smooth_l1_loss(x, t, in_weight, sigma):
    sigma2 = sigma ** 2
    diff = in_weight * (x - t)
    abs_diff = diff.abs()
    flag = (abs_diff.data < (1. / sigma2)).float()
    y = (flag * (sigma2 / 2.) * (diff ** 2) +
         (1 - flag) * (abs_diff - 0.5 / sigma2))
    return y.sum()


def _fast_rcnn_loc_loss(pred_loc, gt_loc, gt_label, sigma):
    in_weight = t.zeros(gt_loc.shape).cuda()
    # Localization loss is calculated only for positive rois.
    # NOTE:  unlike origin implementation,
    # we don't need inside_weight and outside_weight, they can calculate by gt_label
    in_weight[(gt_label > 0).view(-1, 1).expand_as(in_weight).cuda()] = 1   # weight 1 for the rows of positive samples
    loc_loss = _smooth_l1_loss(pred_loc, gt_loc, in_weight.detach(), sigma)
    # Normalize by total number of negative and positive rois.
    loc_loss /= ((gt_label >= 0).sum().float()) # ignore gt_label==-1 for rpn_loss
    return loc_loss
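
As a quick check of the piecewise behaviour, here is a NumPy restatement of the two functions above, run on three made-up samples. The name smooth_l1 and all values are assumptions for illustration. It follows the same logic as the PyTorch code, without the GPU calls.

```python
import numpy as np

def smooth_l1(x, t, in_weight, sigma):
    """Quadratic for |diff| < 1 / sigma^2, linear beyond that point."""
    sigma2 = sigma ** 2
    diff = in_weight * (x - t)
    abs_diff = np.abs(diff)
    quadratic = abs_diff < 1.0 / sigma2
    y = np.where(quadratic, 0.5 * sigma2 * diff ** 2, abs_diff - 0.5 / sigma2)
    return y.sum()

pred = np.array([[0.5, 0.0, 0.0, 0.0],    # positive sample, small error
                 [3.0, 0.0, 0.0, 0.0],    # positive sample, large error
                 [9.0, 9.0, 9.0, 9.0]])   # background sample, masked out
target = np.zeros((3, 4))
gt_label = np.array([2, 5, 0])

in_weight = np.zeros_like(target)
in_weight[gt_label > 0] = 1               # only foreground rows contribute
loss = smooth_l1(pred, target, in_weight, sigma=1.0)
loss /= (gt_label >= 0).sum()             # normalize by all non-ignored samples
```

With sigma = 1 the small error contributes 0.5 * 0.5^2 = 0.125, the large one 3 - 0.5 = 2.5, and the background row contributes nothing, so the sum is 2.625 and the normalized loss is 0.875.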

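2.2 Training the final prediction network

2.2.1 Obtaining the ground truth

Training the final prediction network is essentially the same as training the RPN, since it also learns location offsets and classification results. There are a few differences:

The bounding boxes (target boxes) and the rois (source boxes) are passed to bbox2loc() to obtain the regression targets gt_roi_loc for the prediction network (positive and negative samples are again selected first)
The label gt_roi_label is derived from the IoU values: one of the foreground classes, plus one extra label for background

The corresponding code is the same as in the RPN process above, and the loss is computed in the same way, so it is not repeated here.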