A Close Reading of the YOLOv3 Source Code (9): The YOLOLoss Class (Key Part)

The code mainly follows bubbliiiing's YOLOv3 code on GitHub: github.com/bubbliiiing…

Reading the source code

Training

The yolo_training.py file

import math

import numpy as np
import torch
import torch.nn as nn


class YOLOLoss(nn.Module):
    # Initialization.
    # anchors = [[10,13],[16,30],...]; num_classes: 20 for VOC
    # input_shape = [416,416]; cuda: whether to use the GPU
    def __init__(self, anchors, num_classes, input_shape, cuda, anchors_mask = [[6,7,8], [3,4,5], [0,1,2]]):
        super(YOLOLoss, self).__init__()
        #-----------------------------------------------------------#
        #   Anchors for the 13x13 feature layer: [116,90],[156,198],[373,326]
        #   Anchors for the 26x26 feature layer: [30,61],[62,45],[59,119]
        #   Anchors for the 52x52 feature layer: [10,13],[16,30],[33,23]
        #-----------------------------------------------------------#
        
        # [[10,13],[16,30],...]
        self.anchors        = anchors
        # VOC: 20
        self.num_classes    = num_classes
        # (x, y, w, h, conf) + 20 classes = 25
        self.bbox_attrs     = 5 + num_classes
        # [416,416]
        self.input_shape    = input_shape
        # Indices of the anchors used by each feature layer
        self.anchors_mask   = anchors_mask

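        # The localization loss is computed with GIoU (explained in detail below)
        self.giou           = True
        # Per-layer weights of the confidence loss, indexed by l
        # (13x13 -> 0.4, 26x26 -> 1.0, 52x52 -> 4)
        self.balance        = [0.4, 1.0, 4]
        # Weights of the box, objectness and class terms
        self.box_ratio      = 0.05
        self.obj_ratio      = 5 * (input_shape[0] * input_shape[1]) / (416 ** 2)
        self.cls_ratio      = 1 * (num_classes / 80)
        
        # IoU threshold used by get_ignore: a prediction whose best IoU with a
        # ground-truth box exceeds it is not counted as a negative sample
        self.ignore_threshold = 0.5
        # Whether to compute on the GPU
        self.cuda           = cuda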
        
    def forward(self, l, input, targets=None):
        # The loss is computed between the model's predictions and the labels. The model has
        # three output layers, so l says which feature map this call handles. Taking the
        # 13 * 13 feature map on VOC as an example, input is [bs, 75, 13, 13] and targets is a
        # list of bs label tensors, one per image; each row of a label tensor is
        # [normalized x, y, w, h, class index] for one ground-truth box.
        #----------------------------------------------------#
        #   l is the index of the effective feature layer being passed in
        #   input has shape  bs, 3*(5+num_classes), 13, 13
        #                    bs, 3*(5+num_classes), 26, 26
        #                    bs, 3*(5+num_classes), 52, 52
        #   targets holds the ground-truth boxes.
        #----------------------------------------------------#
        #--------------------------------#
        #   Number of images, height and width of the feature layer
        #   13 and 13
        #--------------------------------#
        
        # batch_size, e.g. 16
        bs      = input.size(0)
        
        # Height and width of the feature map (mainly used to compute the downsampling
        # factor and to map the ground-truth boxes onto the feature map)
        # in_h 13
        # in_w 13
        in_h    = input.size(2)
        in_w    = input.size(3)
        #-----------------------------------------------------------------------#
        #   Compute the stride:
        #   how many pixels of the original image one feature point covers.
        #   For a 13x13 feature layer, one feature point covers 32 pixels
        #   For a 26x26 feature layer, one feature point covers 16 pixels
        #   For a 52x52 feature layer, one feature point covers 8 pixels
        #   stride_h = stride_w = 32, 16, 8
        #   In the 13x13 example, stride_h and stride_w are both 32.
        #-----------------------------------------------------------------------#
        
        # Downsampling factor of the feature map: stride_h = stride_w = 32
        stride_h = self.input_shape[0] / in_h
        stride_w = self.input_shape[1] / in_w
        #-------------------------------------------------#
        #   scaled_anchors are now sized relative to the feature layer
        #-------------------------------------------------#
        
        
        # Divide all nine anchors by the stride so that they are in feature-map units
        scaled_anchors  = [(a_w / stride_w, a_h / stride_h) for a_w, a_h in self.anchors]
        #-----------------------------------------------#
        #   There are three inputs in total; their shapes are
        #   bs, 3*(5+num_classes), 13, 13 => batch_size, 3, 13, 13, 5 + num_classes
        #   batch_size, 3, 26, 26, 5 + num_classes
        #   batch_size, 3, 52, 52, 5 + num_classes
        #-----------------------------------------------#
        
        # Reshape input to [batch_size, 3, 13, 13, 25]
        # self.bbox_attrs: box attributes (4 coordinates + confidence + 20 classes)
        prediction = input.view(bs, len(self.anchors_mask[l]), self.bbox_attrs, in_h, in_w).permute(0, 1, 3, 4, 2).contiguous()
        
        #-----------------------------------------------#
        #   Adjustment parameters for the anchor center
        #-----------------------------------------------#
        
        # Split the prediction into the x, y offsets and the w, h adjustment parameters,
        # which are later decoded into actual boxes (this step is important to understand)
        # Apply sigmoid to the predicted center offsets x, y
        x = torch.sigmoid(prediction[..., 0])
        y = torch.sigmoid(prediction[..., 1])
        #-----------------------------------------------#
        #   Adjustment parameters for the anchor width and height
        #-----------------------------------------------#
        
        # Width and height adjustments (raw values, no activation)
        w = prediction[..., 2]
        h = prediction[..., 3]
        #-----------------------------------------------#
        #   Confidence: whether an object is present
        #-----------------------------------------------#
        
        # Objectness confidence
        conf = torch.sigmoid(prediction[..., 4])
        #-----------------------------------------------#
        #   Class confidences
        #-----------------------------------------------#
        
        # Per-class probability vector
        pred_cls = torch.sigmoid(prediction[..., 5:])

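        #-----------------------------------------------#
        #   Build the predictions the network should produce
        #-----------------------------------------------#
        
        # get_target returns y_true [16,3,13,13,25], the no-object mask noobj_mask
        # [16,3,13,13] and box_loss_scale [16,3,13,13]
        y_true, noobj_mask, box_loss_scale = self.get_target(l, targets, scaled_anchors, in_h, in_w)

        #---------------------------------------------------------------#
        #   Decode the predictions and measure their overlap with the ground truth.
        #   If the overlap is large, ignore the feature point: its prediction is
        #   already fairly accurate, so it is not suitable as a negative sample
        #----------------------------------------------------------------#
        
        # get_ignore checks the 3 * 13 * 13 = 507 predicted boxes. A box whose best IoU with
        # a ground-truth box exceeds the threshold is cleared from noobj_mask, so it is no
        # longer treated as a negative sample.
        # It also returns pred_boxes [16,3,13,13,4]: the decoded center, width and height
        # of every predicted box in feature-map units,
        # used in the loss computation below
        noobj_mask, pred_boxes = self.get_ignore(l, x, y, h, w, targets, scaled_anchors, in_h, in_w, noobj_mask)
        
        # If the GPU is enabled, move the data onto it
        if self.cuda:
            y_true          = y_true.cuda()
            noobj_mask      = noobj_mask.cuda()
            box_loss_scale  = box_loss_scale.cuda()
        #--------------------------------------------------------------------------#
        #   box_loss_scale is the product of the ground-truth width and height. Both are
        #   between 0 and 1, so the product is between 0 and 1 as well.
        #   2 - (width * height): the larger the box, the smaller its weight.
        #--------------------------------------------------------------------------#
        
        # This is the weighting mentioned in get_target: small targets get a higher weight
        # and large targets a lower one, so the network pays more attention to small targets
        box_loss_scale = 2 - box_loss_scale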
            
        loss        = 0
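        # Mask of the positions that really contain an object
        obj_mask    = y_true[..., 4] == 1
        # Count the positive positions; 0 means there is no object
        n           = torch.sum(obj_mask)
        if n != 0:
            # At least one object is present.
            # If self.giou is True, the localization loss is computed with GIoU; otherwise it
            # uses BCE (for x, y) and MSE (for w, h). This is why get_target builds y_true in two ways.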
            if self.giou:
                #---------------------------------------------------------------#
                #   Compute the GIoU between the predictions and the ground truth
                #----------------------------------------------------------------#
                giou        = self.box_giou(pred_boxes, y_true[..., :4])
                loss_loc    = torch.mean((1 - giou)[obj_mask])
            else:
                #-----------------------------------------------------------#
                #   Loss on the center offsets; BCELoss works somewhat better here.
                #   box_loss_scale is indexed with obj_mask so that its shape
                #   matches the masked predictions.
                #-----------------------------------------------------------#
                loss_x      = torch.mean(self.BCELoss(x[obj_mask], y_true[..., 0][obj_mask]) * box_loss_scale[obj_mask])
                loss_y      = torch.mean(self.BCELoss(y[obj_mask], y_true[..., 1][obj_mask]) * box_loss_scale[obj_mask])
                #-----------------------------------------------------------#
                #   Loss on the width and height adjustments
                #-----------------------------------------------------------#
                loss_w      = torch.mean(self.MSELoss(w[obj_mask], y_true[..., 2][obj_mask]) * box_loss_scale[obj_mask])
                loss_h      = torch.mean(self.MSELoss(h[obj_mask], y_true[..., 3][obj_mask]) * box_loss_scale[obj_mask])
                loss_loc    = (loss_x + loss_y + loss_h + loss_w) * 0.1
            
            # The class loss is computed directly with BCE
            loss_cls    = torch.mean(self.BCELoss(pred_cls[obj_mask], y_true[..., 5:][obj_mask]))
            # Multiply the box loss and the class loss by their weights and add them up
            loss        += loss_loc * self.box_ratio + loss_cls * self.cls_ratio
        
        # The confidence loss also uses BCE, over the positives and the negatives that were not ignored
        loss_conf   = torch.mean(self.BCELoss(conf, obj_mask.type_as(conf))[noobj_mask.bool() | obj_mask])
        # Finally, weight the confidence loss and add it to the loss computed above
        loss        += loss_conf * self.balance[l] * self.obj_ratio
        # if n != 0:
        #     print(loss_loc * self.box_ratio, loss_cls * self.cls_ratio, loss_conf * self.balance[l] * self.obj_ratio)
        # Return the loss
        return loss
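The stride and anchor scaling at the start of forward can be checked by hand. The following is a minimal sketch in plain Python (no PyTorch), assuming a 416x416 input and the nine default anchors listed in the comments above; scale_anchors is a hypothetical helper written for this illustration, not part of the repository.

```python
# Assumed values: 416x416 input and the nine default YOLOv3 anchors (in pixels).
input_shape = (416, 416)
anchors = [(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
           (59, 119), (116, 90), (156, 198), (373, 326)]
anchors_mask = [[6, 7, 8], [3, 4, 5], [0, 1, 2]]


def scale_anchors(in_h, in_w):
    """Return (stride_h, stride_w, anchors expressed in feature-map cells)."""
    stride_h = input_shape[0] / in_h
    stride_w = input_shape[1] / in_w
    scaled = [(a_w / stride_w, a_h / stride_h) for a_w, a_h in anchors]
    return stride_h, stride_w, scaled


for l, size in enumerate((13, 26, 52)):
    stride_h, stride_w, scaled = scale_anchors(size, size)
    # Only the three anchors selected by anchors_mask[l] are used on layer l.
    layer_anchors = [scaled[k] for k in anchors_mask[l]]
    print(size, stride_h, layer_anchors)
```

On the 13x13 layer the first anchor becomes (0.3125, 0.40625), which matches the first row of the anchor_shapes tensor shown in get_target below.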

A detailed look at the functions it calls

get_target in detail

def get_target(self, l, targets, anchors, in_h, in_w):
    # -----------------------------------------------------#
    #   Number of images in the batch
    # -----------------------------------------------------#

    # batch_size = 16
    bs = len(targets)
    # -----------------------------------------------------#
    #   Marks which anchors do not contain an object
    # -----------------------------------------------------#

    # This container records which positions contain no object (one of the three key tensors of the loss).
    # Its size is [batch_size,3,13,13], initialized to 1; no gradient is needed
    noobj_mask = torch.ones(bs, len(self.anchors_mask[l]), in_h, in_w, requires_grad=False)
    # -----------------------------------------------------#
    #   Make the network pay more attention to small targets
    # -----------------------------------------------------#

    
    # This container records the weight of each position that contains an object (one of the three key tensors).
    # It scales the box loss so that small targets count more; size [batch_size,3,13,13], no gradient needed
    box_loss_scale = torch.zeros(bs, len(self.anchors_mask[l]), in_h, in_w, requires_grad=False)
    # -----------------------------------------------------#
    #   batch_size, 3, 13, 13, 5 + num_classes
    # -----------------------------------------------------#

    # This container records the ground-truth data (one of the three key tensors of the loss).
    # y_true has size [batch_size,3,13,13,25]; no gradient is needed
    y_true = torch.zeros(bs, len(self.anchors_mask[l]), in_h, in_w, self.bbox_attrs, requires_grad=False)

    # Loop over the images
    for b in range(bs):
        # Skip images that contain no target
        if len(targets[b]) == 0:
            continue

        # targets is a list
        # targets[b] looks like tensor([[ 0.8798,  0.1526,  0.1490,  0.2524,  8.0000],[ 0.6178,  0.0998,  0.1106,  0.1562, 14.0000],[ 0.7007,  0.3149,  0.3149,  0.4231, 11.0000]]): size 3 * 5, columns are center x, y, w, h, class index
        # Build a container of the same size as the current target for the computation
        batch_target = torch.zeros_like(targets[b])
        # -------------------------------------------------------#
        #   Center of each positive sample on the feature layer
        # -------------------------------------------------------#

        # The result is tensor([[11.4375,  1.9844,  1.9375,  3.2812,  8.0000],[ 8.0312,  1.2969,  1.4375,  2.0312, 14.0000],[ 9.1094,  4.0937,  4.0938,  5.5000, 11.0000]])
        # When the data was built, x1,y1,x2,y2 were divided by 416 to normalize them; bottom-right minus top-left gives the width and height, and (bottom-right + top-left)/2 gives the center
        # The lines below map the normalized coordinates back onto the 13 * 13 feature map, still in the form [center x, y, w, h, class index]
        batch_target[:, [0, 2]] = targets[b][:, [0, 2]] * in_w
        batch_target[:, [1, 3]] = targets[b][:, [1, 3]] * in_h
        batch_target[:, 4] = targets[b][:, 4]
        batch_target = batch_target.cpu()

        # (Core of the gt_box / anchor_shapes computation.) What matters next is how well the
        # shape of each ground-truth box matches each of the 9 anchors, so the real center is
        # not needed: all boxes are given the same center, which is why the first two
        # columns of the tensors built below are 0
        
        # -------------------------------------------------------#
        #   Convert the ground-truth boxes to the form
        #   num_true_box, 4
        # -------------------------------------------------------#

        # Build a container with one row per ground-truth box and put the width and height in the last two columns
        # e.g. with 2 ground-truth boxes: tensor([[0.0000, 0.0000, 3.2812, 7.1875],[0.0000, 0.0000, 2.7500, 8.6875]])
        gt_box = torch.FloatTensor(torch.cat((torch.zeros((batch_target.size(0), 2)), batch_target[:, 2:4]), 1))
        # -------------------------------------------------------#
        #   Convert the anchors to the same form
        #   9,4
        #   9 anchors, 4
        # -------------------------------------------------------#

        # Do the same for the 9 anchors
        # tensor([[ 0.0000,  0.0000,  0.3125,  0.4062],[ 0.0000,  0.0000,  0.5000,  0.9375],[ 0.0000,  0.0000,  1.0312,  0.7188],[ 0.0000,  0.0000,  0.9375,  1.9062],[ 0.0000,  0.0000,  1.9375,  1.4062],[ 0.0000,  0.0000,  1.8438,  3.7188],[ 0.0000,  0.0000,  3.6250,  2.8125],[ 0.0000,  0.0000,  4.8750,  6.1875],[ 0.0000,  0.0000, 11.6562, 10.1875]])
        anchor_shapes = torch.FloatTensor(torch.cat((torch.zeros((len(anchors), 2)), torch.FloatTensor(anchors)), 1))
        # -------------------------------------------------------#
        #   Compute the IoU
        #   self.calculate_iou(gt_box, anchor_shapes) = [num_true_box, 9]: overlap of each ground-truth box with the 9 anchors
        #   best_ns:
        #   [index of the best-matching anchor for each ground-truth box] (argmax returns the index only, not the IoU value)
        # -------------------------------------------------------#

        # e.g. best_ns = [7, 7] means both ground-truth boxes match anchor 7 best
        best_ns = torch.argmax(self.calculate_iou(gt_box, anchor_shapes), dim=-1)

        for t, best_n in enumerate(best_ns):

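            # Check the feature layer first: if the best anchor does not belong to this
            # layer, skip the box (it is assigned to another feature layer)
            if best_n not in self.anchors_mask[l]:
                continue
            # ----------------------------------------#
            #   Which of this feature point's anchors it is
            # ----------------------------------------#

            # e.g. k = 1: anchors_mask is [[6, 7, 8], [3, 4, 5], [0, 1, 2]], so we first take this layer's [6,7,8]
            # and then look up the position of best_n (7) in it, which is 1
            k = self.anchors_mask[l].index(best_n)
            # ----------------------------------------#
            #   Which grid cell the ground-truth box belongs to
            # ----------------------------------------#

            # Find the grid cell that the object's center falls into; we want the cell's top-left corner
            # e.g. i = 11, j = 6 (values from a debugging run)
            i = torch.floor(batch_target[t, 0]).long()
            j = torch.floor(batch_target[t, 1]).long()
            # ----------------------------------------#
            #   Class of the ground-truth box
            # ----------------------------------------#

            # Take the ground-truth class
            c = batch_target[t, 4].long()

            # ----------------------------------------#
            #   noobj_mask marks the feature points without a target
            # ----------------------------------------#

            # This is the no-object matrix. A real object was found at cell (11,6) above, so the corresponding position is set to 0
            noobj_mask[b, k, j, i] = 0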
            # ----------------------------------------#
            #   Fill in y_true; what is stored depends on self.giou
            # ----------------------------------------#

            # self.giou = True 
            if not self.giou:
                # Without GIoU, y_true is built as follows
                # ----------------------------------------#
                #   tx, ty are the ground-truth values of the center adjustment parameters
                # ----------------------------------------#
                # The ground-truth coordinate on the 13 * 13 feature map minus the cell's top-left corner is the offset
                y_true[b, k, j, i, 0] = batch_target[t, 0] - i.float()
                y_true[b, k, j, i, 1] = batch_target[t, 1] - j.float()
                
                # Likewise, invert the formula that turns the w, h factors into a real width and height to get the factors
                y_true[b, k, j, i, 2] = math.log(batch_target[t, 2] / anchors[best_n][0])
                y_true[b, k, j, i, 3] = math.log(batch_target[t, 3] / anchors[best_n][1])
                # Confidence: the box contains an object, so it is 1
                y_true[b, k, j, i, 4] = 1
                # Set the entry of the ground-truth class to 1 in the class label
                y_true[b, k, j, i, c + 5] = 1
            else:
                # With GIoU, y_true is built as follows
                # ----------------------------------------#
                #   Store the box itself (center and size), not the adjustment parameters
                # ----------------------------------------#

                # Put the ground-truth data into the tensor:
                # the center x, y and the width and height, all in feature-map units
                y_true[b, k, j, i, 0] = batch_target[t, 0]
                y_true[b, k, j, i, 1] = batch_target[t, 1]
                y_true[b, k, j, i, 2] = batch_target[t, 2]
                y_true[b, k, j, i, 3] = batch_target[t, 3]
                y_true[b, k, j, i, 4] = 1
                y_true[b, k, j, i, c + 5] = 1
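            # ----------------------------------------#
            #   Relative area of the box (w * h)
            #   Large targets get a small loss weight, small targets a large one
            # ----------------------------------------#

            # Give this object's box a weight, stored at the same position as in y_true; the larger the box, the larger this value.
            # Later the loss uses 2 - box_loss_scale, so large targets get a small loss weight and small targets a large one
            box_loss_scale[b, k, j, i] = batch_target[t, 2] * batch_target[t, 3] / in_w / in_h

    # Finally return y_true, the no-object matrix and the per-box weight matrix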
    return y_true, noobj_mask, box_loss_scale
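To make the assignment step concrete, here is a minimal sketch in plain Python of how one ground-truth box is matched to an anchor, a layer and a grid cell. It assumes a 416x416 input and the default anchors; wh_iou and assign are hypothetical helpers for illustration and mirror the logic of get_target rather than reproduce it.

```python
import math

# Assumed values: the nine default anchors in pixels and the default anchors_mask.
anchors = [(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
           (59, 119), (116, 90), (156, 198), (373, 326)]
anchors_mask = [[6, 7, 8], [3, 4, 5], [0, 1, 2]]


def wh_iou(wh_a, wh_b):
    """IoU of two boxes that share the same center, so only w and h matter."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union


def assign(target, l, in_h, in_w, input_shape=(416, 416)):
    """target = (cx, cy, w, h), normalized to 0-1.

    Returns (k, j, i), the index used as y_true[b, k, j, i], or None when the
    best anchor belongs to a different feature layer.
    """
    stride_h, stride_w = input_shape[0] / in_h, input_shape[1] / in_w
    scaled = [(a_w / stride_w, a_h / stride_h) for a_w, a_h in anchors]
    cx, cy = target[0] * in_w, target[1] * in_h
    w, h = target[2] * in_w, target[3] * in_h
    ious = [wh_iou((w, h), a) for a in scaled]
    best_n = max(range(len(ious)), key=ious.__getitem__)
    if best_n not in anchors_mask[l]:
        return None
    k = anchors_mask[l].index(best_n)
    return k, math.floor(cy), math.floor(cx)


# Third box of the targets[b] example above.
print(assign((0.7007, 0.3149, 0.3149, 0.4231), 0, 13, 13))
```

For that box the best anchor is number 7, so it is a positive sample on the 13x13 layer only; calling assign with l = 1 or l = 2 returns None.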

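A look at the data in y_true: because we use GIoU, the first four entries are the box's actual center and size relative to the feature map (13 * 13), not offsets.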

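The call to calculate_iou in detail

# Fairly simple, so not much explanation. The one thing to note is that the 4 columns are, relative to the feature map, [unnormalized center x, unnormalized center y, unnormalized width w, unnormalized height h]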
def calculate_iou(self, _box_a, _box_b):
    # -----------------------------------------------------------#
    #   Top-left and bottom-right corners of the boxes in _box_a (ground truth)
    # -----------------------------------------------------------#

    b1_x1, b1_x2 = _box_a[:, 0] - _box_a[:, 2] / 2, _box_a[:, 0] + _box_a[:, 2] / 2
    b1_y1, b1_y2 = _box_a[:, 1] - _box_a[:, 3] / 2, _box_a[:, 1] + _box_a[:, 3] / 2
    # -----------------------------------------------------------#
    #   Top-left and bottom-right corners of the boxes in _box_b
    #   (anchors in get_target, predicted boxes in get_ignore)
    # -----------------------------------------------------------#
    b2_x1, b2_x2 = _box_b[:, 0] - _box_b[:, 2] / 2, _box_b[:, 0] + _box_b[:, 2] / 2
    b2_y1, b2_y2 = _box_b[:, 1] - _box_b[:, 3] / 2, _box_b[:, 1] + _box_b[:, 3] / 2

    # -----------------------------------------------------------#
    #   Store both sets of boxes in top-left / bottom-right form
    # -----------------------------------------------------------#
    box_a = torch.zeros_like(_box_a)
    box_b = torch.zeros_like(_box_b)
    box_a[:, 0], box_a[:, 1], box_a[:, 2], box_a[:, 3] = b1_x1, b1_y1, b1_x2, b1_y2
    box_b[:, 0], box_b[:, 1], box_b[:, 2], box_b[:, 3] = b2_x1, b2_y1, b2_x2, b2_y2

    # -----------------------------------------------------------#
    #   A is the number of ground-truth boxes, B the number of boxes in _box_b
    # -----------------------------------------------------------#
    A = box_a.size(0)
    B = box_b.size(0)

    # -----------------------------------------------------------#
    #   Area of the intersection
    # -----------------------------------------------------------#
    max_xy = torch.min(box_a[:, 2:].unsqueeze(1).expand(A, B, 2), box_b[:, 2:].unsqueeze(0).expand(A, B, 2))
    min_xy = torch.max(box_a[:, :2].unsqueeze(1).expand(A, B, 2), box_b[:, :2].unsqueeze(0).expand(A, B, 2))
    inter = torch.clamp((max_xy - min_xy), min=0)
    inter = inter[:, :, 0] * inter[:, :, 1]
    # -----------------------------------------------------------#
    #   Area of each box in both sets
    # -----------------------------------------------------------#
    area_a = ((box_a[:, 2] - box_a[:, 0]) * (box_a[:, 3] - box_a[:, 1])).unsqueeze(1).expand_as(inter)  # [A,B]
    area_b = ((box_b[:, 2] - box_b[:, 0]) * (box_b[:, 3] - box_b[:, 1])).unsqueeze(0).expand_as(inter)  # [A,B]
    # -----------------------------------------------------------#
    #   Compute the IoU
    # -----------------------------------------------------------#
    union = area_a + area_b - inter
    return inter / union  # [A,B]
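The same computation is easier to verify on scalars. Below is a minimal sketch in plain Python of the pairwise IoU for boxes in (center x, center y, w, h) form; iou_xywh and iou_matrix are hypothetical names for this illustration, and the nested list plays the role of the [A, B] tensor.

```python
def xywh_to_corners(box):
    """(cx, cy, w, h) -> (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


def iou_xywh(box_a, box_b):
    ax1, ay1, ax2, ay2 = xywh_to_corners(box_a)
    bx1, by1, bx2, by2 = xywh_to_corners(box_b)
    # Clamp at 0 so that disjoint boxes get an empty intersection.
    inter_w = max(min(ax2, bx2) - max(ax1, bx1), 0.0)
    inter_h = max(min(ay2, by2) - max(ay1, by1), 0.0)
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union


def iou_matrix(boxes_a, boxes_b):
    """A x B table: row = box from boxes_a, column = box from boxes_b."""
    return [[iou_xywh(a, b) for b in boxes_b] for a in boxes_a]


# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1).
print(iou_xywh((1, 1, 2, 2), (2, 2, 2, 2)))
```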

get_ignore in detail

# This method decides which predictions are ignored (left out of the negative samples)
def get_ignore(self, l, x, y, h, w, targets, scaled_anchors, in_h, in_w, noobj_mask):
    # -----------------------------------------------------#
    #   Number of images in the batch
    # -----------------------------------------------------#

    # batch_size
    bs = len(targets)

    FloatTensor = torch.cuda.FloatTensor if x.is_cuda else torch.FloatTensor
    LongTensor = torch.cuda.LongTensor if x.is_cuda else torch.LongTensor
    # -----------------------------------------------------#
    #   Build the grid: the top-left corner of every cell,
    #   which the anchor centers are offset from
    # -----------------------------------------------------#

    # Same construction as the grid used at prediction time: grid_x / grid_y hold the column / row index of each cell
    grid_x = torch.linspace(0, in_w - 1, in_w).repeat(in_h, 1).repeat(
        int(bs * len(self.anchors_mask[l])), 1, 1).view(x.shape).type(FloatTensor)
    grid_y = torch.linspace(0, in_h - 1, in_h).repeat(in_w, 1).t().repeat(
        int(bs * len(self.anchors_mask[l])), 1, 1).view(y.shape).type(FloatTensor)

    # Build the anchor widths and heights
    # Select this layer's anchors by index (e.g. [6,7,8])
    scaled_anchors_l = np.array(scaled_anchors)[self.anchors_mask[l]]
    # One tensor with the three anchor widths and one with the three anchor heights
    anchor_w = FloatTensor(scaled_anchors_l).index_select(1, LongTensor([0]))
    anchor_h = FloatTensor(scaled_anchors_l).index_select(1, LongTensor([1]))

    # Expand them to size [16,3 (anchors),13,13]
    anchor_w = anchor_w.repeat(bs, 1).repeat(1, 1, in_h * in_w).view(w.shape)
    anchor_h = anchor_h.repeat(bs, 1).repeat(1, 1, in_h * in_w).view(h.shape)
    # -------------------------------------------------------#
    #   Center, width and height of the adjusted predicted boxes
    # -------------------------------------------------------#

    # Decoded coordinates, width and height
    pred_boxes_x = torch.unsqueeze(x + grid_x, -1)
    pred_boxes_y = torch.unsqueeze(y + grid_y, -1)
    pred_boxes_w = torch.unsqueeze(torch.exp(w) * anchor_w, -1)
    pred_boxes_h = torch.unsqueeze(torch.exp(h) * anchor_h, -1)
    # Concatenate along the last dimension to store [x,y,w,h]: [16,3,13,13,4]
    pred_boxes = torch.cat([pred_boxes_x, pred_boxes_y, pred_boxes_w, pred_boxes_h], dim=-1)
    
    # Take one image at a time
    for b in range(bs):
        # -------------------------------------------------------#
        #   Convert the predictions to the form
        #   pred_boxes_for_ignore      num_anchors, 4
        # -------------------------------------------------------#

        # Iterate over batch_size, i.e. over every image
        # Here we get [507,4]: all boxes of one image flattened, with x,y,w,h in the last dimension
        pred_boxes_for_ignore = pred_boxes[b].view(-1, 4)
        # -------------------------------------------------------#
        #   Take the ground-truth boxes and convert them to feature-layer units
        #   gt_box      num_true_box, 4
        # -------------------------------------------------------#

        # If the image has ground-truth boxes
        if len(targets[b]) > 0:
            # Build a container with the same shape as targets[b]
            batch_target = torch.zeros_like(targets[b])
            # -------------------------------------------------------#
            #   Center of each positive sample on the feature layer
            # -------------------------------------------------------#

            # This gives the center, width and height (the class column is not needed and is dropped)
            batch_target[:, [0, 2]] = targets[b][:, [0, 2]] * in_w
            batch_target[:, [1, 3]] = targets[b][:, [1, 3]] * in_h
            batch_target = batch_target[:, :4]
            # -------------------------------------------------------#
            #   Compute the IoU
            #   anch_ious       num_true_box, num_anchors
            # -------------------------------------------------------#

            # The IoU matrix is e.g. [2,507]: the image has two ground-truth boxes and 507 predicted boxes, and every predicted box is compared with every ground-truth box
            anch_ious = self.calculate_iou(batch_target, pred_boxes_for_ignore)
            # -------------------------------------------------------#
            #   Best overlap of each predicted box with any ground-truth box
            #   anch_ious_max   num_anchors
            # -------------------------------------------------------#

            # One maximum per predicted box, size = [507]
            anch_ious_max, _ = torch.max(anch_ious, dim=0)

            # pred_boxes[b].size()[:3] = [3,13,13]
            anch_ious_max = anch_ious_max.view(pred_boxes[b].size()[:3])

            # If the IoU exceeds the threshold, the prediction already overlaps an object well,
            # so the corresponding position in the no-object matrix is set to 0 (it is not penalized as a negative)
            noobj_mask[b][anch_ious_max > self.ignore_threshold] = 0
    # Finally return the no-object matrix and the predicted boxes [16,3,13,13,4]
    return noobj_mask, pred_boxes
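The decoding step in get_ignore (sigmoid offset plus grid index for the center, exp times anchor for the size) can be sketched on a single box in plain Python. Note one difference from the listing: there x and y are already sigmoid outputs, while the hypothetical decode helper below takes the raw network outputs tx, ty, tw, th and applies the sigmoid itself.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def decode(tx, ty, tw, th, grid_x, grid_y, anchor_w, anchor_h):
    """Turn raw outputs for one anchor in cell (grid_x, grid_y) into a box.

    All results are in feature-map units, like pred_boxes.
    """
    bx = sigmoid(tx) + grid_x       # center x stays inside the cell
    by = sigmoid(ty) + grid_y       # center y stays inside the cell
    bw = math.exp(tw) * anchor_w    # width is a positive multiple of the anchor
    bh = math.exp(th) * anchor_h
    return bx, by, bw, bh


# All-zero outputs give the cell center and exactly the anchor's size.
print(decode(0.0, 0.0, 0.0, 0.0, 9, 4, 4.875, 6.1875))
```

Because the sigmoid is bounded by 0 and 1, a prediction can never move its center out of the cell it was assigned to; only the width and height are unbounded.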

box_giou in detail

def box_giou(self, b1, b2):
    """
    Inputs (shapes as passed from forward):
    ----------
    b1: tensor, shape=(batch, anchor_num, feat_h, feat_w, 4), xywh
    b2: tensor, shape=(batch, anchor_num, feat_h, feat_w, 4), xywh

    Returns:
    -------
    giou: tensor, shape=(batch, anchor_num, feat_h, feat_w)
    """
    # ----------------------------------------------------#
    #   Top-left and bottom-right corners of the predicted boxes
    # ----------------------------------------------------#

    # From x,y,w,h we get the top-left and bottom-right coordinates
    b1_xy = b1[..., :2]
    b1_wh = b1[..., 2:4]
    b1_wh_half = b1_wh / 2.
    b1_mins = b1_xy - b1_wh_half
    b1_maxes = b1_xy + b1_wh_half
    # ----------------------------------------------------#
    #   Top-left and bottom-right corners of the ground-truth boxes
    # ----------------------------------------------------#

    # From x,y,w,h we get the top-left and bottom-right coordinates
    b2_xy = b2[..., :2]
    b2_wh = b2[..., 2:4]
    b2_wh_half = b2_wh / 2.
    b2_mins = b2_xy - b2_wh_half
    b2_maxes = b2_xy + b2_wh_half

    # ----------------------------------------------------#
    #   IoU between every ground-truth box and predicted box
    # ----------------------------------------------------#

    # The larger of the two minimum corners is the intersection's minimum corner
    intersect_mins = torch.max(b1_mins, b2_mins)
    # The smaller of the two maximum corners is the intersection's maximum corner
    intersect_maxes = torch.min(b1_maxes, b2_maxes)
    # Subtract and compare with 0: keep the value where the boxes overlap, store 0 otherwise
    intersect_wh = torch.max(intersect_maxes - intersect_mins, torch.zeros_like(intersect_maxes))
    # Area of the intersection
    intersect_area = intersect_wh[..., 0] * intersect_wh[..., 1]

    # Area of each of the two boxes
    b1_area = b1_wh[..., 0] * b1_wh[..., 1]
    b2_area = b2_wh[..., 0] * b2_wh[..., 1]

    # Area of the union
    union_area = b1_area + b2_area - intersect_area
    # The IoU
    iou = intersect_area / union_area

    # ----------------------------------------------------#
    #   Top-left and bottom-right corners of the smallest box enclosing both boxes
    # ----------------------------------------------------#

    # The smaller minimum corner and the larger maximum corner bound the enclosing box
    enclose_mins = torch.min(b1_mins, b2_mins)
    enclose_maxes = torch.max(b1_maxes, b2_maxes)
    enclose_wh = torch.max(enclose_maxes - enclose_mins, torch.zeros_like(intersect_maxes))
    # ----------------------------------------------------#
    #   Area of the enclosing box
    # ----------------------------------------------------#
    enclose_area = enclose_wh[..., 0] * enclose_wh[..., 1]
    # GIoU = IoU - (enclosing area - union area) / enclosing area
    giou = iou - (enclose_area - union_area) / enclose_area

    return giou
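A scalar version makes it easy to see what GIoU adds over IoU. The sketch below is plain Python for two boxes in (center x, center y, w, h) form; giou_xywh is a hypothetical name, and the arithmetic follows the same steps as box_giou.

```python
def giou_xywh(b1, b2):
    b1_min = (b1[0] - b1[2] / 2, b1[1] - b1[3] / 2)
    b1_max = (b1[0] + b1[2] / 2, b1[1] + b1[3] / 2)
    b2_min = (b2[0] - b2[2] / 2, b2[1] - b2[3] / 2)
    b2_max = (b2[0] + b2[2] / 2, b2[1] + b2[3] / 2)

    # Intersection, clamped at 0 for disjoint boxes.
    inter_w = max(min(b1_max[0], b2_max[0]) - max(b1_min[0], b2_min[0]), 0.0)
    inter_h = max(min(b1_max[1], b2_max[1]) - max(b1_min[1], b2_min[1]), 0.0)
    inter = inter_w * inter_h
    union = b1[2] * b1[3] + b2[2] * b2[3] - inter
    iou = inter / union

    # Smallest axis-aligned box that encloses both boxes.
    enclose_w = max(b1_max[0], b2_max[0]) - min(b1_min[0], b2_min[0])
    enclose_h = max(b1_max[1], b2_max[1]) - min(b1_min[1], b2_min[1])
    enclose = enclose_w * enclose_h
    return iou - (enclose - union) / enclose


# Disjoint boxes: IoU is 0 for both pairs, but GIoU is lower for the farther pair.
print(giou_xywh((1, 1, 2, 2), (5, 1, 2, 2)), giou_xywh((1, 1, 2, 2), (9, 1, 2, 2)))
```

For disjoint boxes plain IoU is 0 no matter how far apart they are, whereas GIoU keeps decreasing with distance, so 1 - GIoU still distinguishes a near miss from a far miss.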

[Figure: illustration of the GIoU formula; the program is easier to follow alongside it.]
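In formula form, the quantities computed by box_giou and the localization loss used in forward can be written as follows, where A is the predicted box, B the ground-truth box and C the smallest axis-aligned box enclosing both (the symbols are our notation, not names from the code):

```latex
\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|}, \qquad
\mathrm{GIoU} = \mathrm{IoU} - \frac{|C| - |A \cup B|}{|C|}, \qquad
L_{\mathrm{loc}} = \frac{1}{n} \sum_{\text{positives}} \left( 1 - \mathrm{GIoU} \right)
```

Here |A ∩ B| is intersect_area, |A ∪ B| is union_area, |C| is enclose_area, and n is the number of positions where obj_mask is true.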

BCELoss in detail

# e.g. for the class loss: target.size = pred.size = [16,20]
def BCELoss(self, pred, target):
    epsilon = 1e-7
    pred    = self.clip_by_tensor(pred, epsilon, 1.0 - epsilon)
    output  = - target * torch.log(pred) - (1.0 - target) * torch.log(1.0 - pred)
    return output
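The role of epsilon is easiest to see on scalars. This is a minimal sketch in plain Python of the same elementwise binary cross-entropy; bce is a hypothetical scalar counterpart of BCELoss, with min/max standing in for clip_by_tensor.

```python
import math


def bce(pred, target, epsilon=1e-7):
    # Keep pred strictly inside (0, 1) so that neither log receives 0.
    pred = min(max(pred, epsilon), 1.0 - epsilon)
    return -target * math.log(pred) - (1.0 - target) * math.log(1.0 - pred)


print(bce(0.5, 1.0))   # an undecided prediction costs log(2)
print(bce(0.0, 1.0))   # a confidently wrong prediction is large but finite
```

Without the clipping, a prediction of exactly 0 with target 1 would require log(0); with it, the loss is capped at -log(epsilon).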

MSELoss in detail

# Simply the elementwise squared difference (no averaging here; forward applies torch.mean)
def MSELoss(self, pred, target):
    return torch.pow(pred - target, 2)

clip_by_tensor in detail

def clip_by_tensor(self, t, t_min, t_max):
    t = t.float()
    result = (t >= t_min).float() * t + (t < t_min).float() * t_min
    result = (result <= t_max).float() * result + (result > t_max).float() * t_max
    return result
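clip_by_tensor clamps with mask arithmetic: each comparison yields 0 or 1, which selects either the original value or the bound. A minimal scalar sketch in plain Python (clip_scalar is a hypothetical name) shows that, for finite inputs, this equals the usual min/max clamp.

```python
def clip_scalar(t, t_min, t_max):
    # float(True) == 1.0 and float(False) == 0.0, so exactly one term survives.
    result = float(t >= t_min) * t + float(t < t_min) * t_min
    result = float(result <= t_max) * result + float(result > t_max) * t_max
    return result


for v in (-1.0, 0.0, 0.3, 1.0, 2.0):
    print(v, clip_scalar(v, 0.1, 0.9), min(max(v, 0.1), 0.9))
```

The inputs here are sigmoid outputs, so they are always finite. In PyTorch, torch.clamp(t, t_min, t_max) is the built-in operation for this kind of clamping.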

Summary

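get_target is the function that builds the labels. It constructs three tensors. The first, noobj_mask, is a [bs,3,13,13] matrix that is 1 where there is no object and 0 where there is one. The second, box_loss_scale, is there to give small boxes a larger weight. It is filled with box_loss_scale[b, k, j, i] = batch_target[t, 2] * batch_target[t, 3] / in_w / in_h at every position that contains an object. Since batch_target was already multiplied by in_w and in_h, dividing by them again gives the box's normalized area (between 0 and 1), and this value is the same on the 13 * 13, 26 * 26 and 52 * 52 layers. Later we use 2 - box_loss_scale, so a small box gets a weight close to 2 and a large box a weight close to 1; this is how large objects get a small weight and small objects a large one, because we want to pay more attention to small objects. (Independently of this weight, the 13 * 13 layer detects large objects and the 52 * 52 layer small ones.) The third, y_true, stores the ground-truth values (x, y, w, h, conf, and a one-hot class vector of length 20) at the assigned positions of the matrix.

Two decisions in get_target are important. The first compares the width and height of the ground-truth box with those of the 9 anchors and picks the most similar one. If that anchor belongs to layer l, the data is built; otherwise the box does not match this layer and is skipped. The first two columns are set to 0 only so that the ground-truth box and the anchors share the same center. The second decision is whether GIoU is used. With GIoU, y_true stores the box's actual center and size relative to the feature map; without GIoU, the first four entries of y_true store the offsets x, y and the width and height factors w, h instead, which is worth keeping in mind.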
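The box weight can be checked with a few lines of plain Python. The sketch below repeats the arithmetic of get_target and forward for one box; box_weight is a hypothetical helper, and the box sizes are made-up normalized values.

```python
def box_weight(w_norm, h_norm, in_w, in_h):
    """Weight 2 - box_loss_scale for a box with normalized width and height."""
    # get_target first scales the box to feature-map units ...
    w_cells, h_cells = w_norm * in_w, h_norm * in_h
    # ... and then divides by the map size again, which gives the normalized area.
    box_loss_scale = w_cells * h_cells / in_w / in_h
    return 2 - box_loss_scale


for size in (13, 26, 52):
    print(size, box_weight(0.5, 0.5, size, size), box_weight(0.1, 0.1, size, size))
```

The weight depends only on the normalized area of the box: it is the same on all three layers, about 1.75 for a box covering a quarter of the image and about 1.99 for a box covering one percent of it.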

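get_ignore computes the IoU between every predicted box and the ground-truth boxes and compares it with the IoU threshold (ignore_threshold). A prediction whose best IoU is below the threshold stays a negative sample (no object). Where the best IoU is above the threshold, noobj_mask is set to 0, so that position is left out of the negative samples instead of being penalized. The function also returns pred_boxes [16,3,13,13,4]: the decoded center position, width and height of every predicted box relative to the feature map.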
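Together, get_target and get_ignore put every (anchor, cell) slot into one of three groups for the confidence loss. A minimal sketch in plain Python, with hypothetical function names and the default threshold of 0.5:

```python
def confidence_role(is_positive, max_iou_with_gt, ignore_threshold=0.5):
    """Classify one (anchor, cell) slot for the confidence loss."""
    if is_positive:
        return "positive"   # obj_mask is True, target confidence is 1
    if max_iou_with_gt > ignore_threshold:
        return "ignored"    # noobj_mask was cleared by get_ignore
    return "negative"       # noobj_mask is still 1, target confidence is 0


def counts_in_conf_loss(role):
    # forward selects noobj_mask.bool() | obj_mask, i.e. negatives and positives.
    return role in ("positive", "negative")


for is_pos, iou in ((True, 0.9), (False, 0.7), (False, 0.2)):
    role = confidence_role(is_pos, iou)
    print(is_pos, iou, role, counts_in_conf_loss(role))
```

An ignored slot contributes nothing to the confidence loss: it is neither pushed toward 1 nor toward 0.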

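Our loss actually has three parts. Where an object exists: box_loss, which can be computed in two ways, either with GIoU or with the original MSE and BCE (only the MSE/BCE variant is multiplied by the position weight box_loss_scale); cls_loss, which uses BCE on the class labels; and conf_loss, the confidence loss, also computed with BCE. Where no object exists, only conf_loss is computed. The total loss is the sum of each of these losses multiplied by its weight.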
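The weighting in __init__ and at the end of forward can be reproduced with plain numbers. The sketch below is plain Python; loss_weights and layer_loss are hypothetical helpers, and the three loss values passed in are made-up placeholders rather than results from a real run.

```python
def loss_weights(num_classes, input_shape):
    box_ratio = 0.05
    obj_ratio = 5 * (input_shape[0] * input_shape[1]) / (416 ** 2)
    cls_ratio = 1 * (num_classes / 80)
    return box_ratio, obj_ratio, cls_ratio


def layer_loss(l, loss_loc, loss_cls, loss_conf, has_objects,
               num_classes=20, input_shape=(416, 416)):
    """Combine the three terms for feature layer l as forward does."""
    balance = [0.4, 1.0, 4]
    box_ratio, obj_ratio, cls_ratio = loss_weights(num_classes, input_shape)
    loss = 0.0
    if has_objects:
        # Box and class terms exist only when the layer has positive samples.
        loss += loss_loc * box_ratio + loss_cls * cls_ratio
    loss += loss_conf * balance[l] * obj_ratio
    return loss


print(loss_weights(20, (416, 416)))
print(layer_loss(0, 1.0, 1.0, 1.0, True), layer_loss(2, 1.0, 1.0, 1.0, False))
```

With VOC (20 classes) at 416x416 the weights are 0.05 for the box term, 5.0 for the confidence term and 0.25 for the class term, so for equal raw values the confidence term dominates, most strongly on the 52x52 layer where balance is 4.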