Deep Learning | YOLOv5-Lite (A Lighter, Faster, Easier-to-Deploy YOLOv5)


QQ group: 993965802
Copyright belongs to GiantPandaCV. Please do not reproduce without permission.
Preface: This is part of my graduation project. Over the past while I ran a series of ablation experiments on yolov5 to make it lighter (smaller FLOPs, lower memory usage, fewer parameters), faster (adding channel shuffle and pruning the channels of the yolov5 head; at a 320 input size it reaches at least 10 FPS on a Raspberry Pi 4B), and easier to deploy (removing the Focus layer and its four slice operations, keeping the accuracy drop from model quantization within an acceptable range).
Update 2021-08-26 ---------------------------------------
We tried a raise-then-lower resolution training schedule (train in 640-1280-640: round one at 640 for 150 epochs, round two at 1280 for 50 epochs, round three at 640 for 100 epochs) and found the model still has some capacity left to learn; the mAP keeps improving:

# evaluate in 640×640:
Average Precision (AP) @[ IoU=0.50:0.95 | area=all | maxDets=100 ] = 0.271
Average Precision (AP) @[ IoU=0.50      | area=all | maxDets=100 ] = 0.457
Average Precision (AP) @[ IoU=0.75      | area=all | maxDets=100 ] = 0.274

# evaluate in 416×416:
Average Precision (AP) @[ IoU=0.50:0.95 | area=all | maxDets=100 ] = 0.244
Average Precision (AP) @[ IoU=0.50      | area=all | maxDets=100 ] = 0.413
Average Precision (AP) @[ IoU=0.75      | area=all | maxDets=100 ] = 0.246

# evaluate in 320×320:
Average Precision (AP) @[ IoU=0.50:0.95 | area=all | maxDets=100 ] = 0.208
Average Precision (AP) @[ IoU=0.50      | area=all | maxDets=100 ] = 0.362
Average Precision (AP) @[ IoU=0.75      | area=all | maxDets=100 ] = 0.206

1. Ablation Study Comparison
ID   Model           Input size  FLOPs   Params  Size (M)  mAP@0.5  mAP@0.5:0.95
001  yolo-faster     320×320     0.25G   0.35M   1.4       24.4     -
002  nanodet-m       320×320     0.72G   0.95M   1.8       -        20.6
003  yolo-faster-xl  320×320     0.72G   0.92M   3.5       34.3     -
004  yolov5-lite     320×320     1.43G   1.62M   3.3       36.2     20.8
005  yolov3-tiny     416×416     6.96G   6.06M   23.0      33.1     16.6
006  yolov4-tiny     416×416     5.62G   8.86M   33.7      40.2     21.7
007  nanodet-m       416×416     1.2G    0.95M   1.8       -        23.5
008  yolov5-lite     416×416     2.42G   1.62M   3.3       41.3     24.4
009  yolov5-lite     640×640     2.42G   1.62M   3.3       45.7     27.1
010  yolov5s         640×640     17.0G   7.3M    14.2      55.4     36.7
Note: the FLOPs computation script in the original yolov5 has a bug; please compute with the thop library instead:
import torch
import thop

# model: your loaded yolov5 / YOLOv5-Lite model in eval mode
input = torch.randn(1, 3, 416, 416)
flops, params = thop.profile(model, inputs=(input,))
# thop reports multiply-accumulates (MACs); ×2 approximates FLOPs.
# The 900000000 divisor follows the original script; use 1E9 for standard GFLOPs.
print('flops:', flops / 900000000 * 2)
print('params:', params)

2. Detection Results
PyTorch @640×640:

[image]

NCNN @FP16 (640×640):

[image]

NCNN @Int8 (640×640):

[image]

3. Related Work
The network structure of YOLOv5-Lite is actually very simple: the backbone mainly uses shuffle blocks with channel shuffle, and the head is still the yolov5 head, just a slimmed-down version of it.
shuffle block:
[image]

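Since channel shuffle is what the shuffle block is built around, here is a minimal sketch of that operation, assuming PyTorch (illustrative, not the repo's exact code; the two-group setting mirrors the two branches of a shuffle block):

import torch

def channel_shuffle(x, groups=2):
    # Rearrange channels across groups so information can flow
    # between the branches of a shuffle block.
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w)  # split channels into groups
    x = x.transpose(1, 2).contiguous()        # interleave the groups
    return x.view(b, c, h, w)                 # flatten back to (b, c, h, w)

# usage: shuffle an 8-channel feature map across 2 groups
y = channel_shuffle(torch.randn(1, 8, 4, 4), groups=2)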
yolov5 head:
[image]

yolov5 backbone:
In the backbone of the original U-version yolov5, the author uses four slice operations in the top feature-extraction layers to build the Focus layer:
[image]

The Focus layer takes every 4 adjacent pixels in a 2×2 square and produces a feature map with 4 times the channel count, similar to downsampling the input layer four ways and concatenating the results. Its main purpose is to reduce parameters and speed the model up without weakening its feature-extraction ability.
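A minimal sketch of how such a Focus layer behaves (illustrative only; a plain nn.Conv2d stands in for yolov5's Conv block, which is conv + BN + activation):

import torch
import torch.nn as nn

class Focus(nn.Module):
    # Four pixel-interleaved slices are concatenated channel-wise,
    # (c, h, w) -> (4c, h/2, w/2), then fused by a convolution.
    def __init__(self, c1, c2, k=1):
        super().__init__()
        self.conv = nn.Conv2d(c1 * 4, c2, k, 1, k // 2)

    def forward(self, x):
        return self.conv(torch.cat([x[..., ::2, ::2],     # even rows, even cols
                                    x[..., 1::2, ::2],    # odd rows, even cols
                                    x[..., ::2, 1::2],    # even rows, odd cols
                                    x[..., 1::2, 1::2]],  # odd rows, odd cols
                                   1))

x = torch.randn(16, 3, 640, 640)
print(Focus(3, 64, 3)(x).shape)  # torch.Size([16, 64, 320, 320])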
1.7.0+cu101 cuda _CudaDeviceProperties(name='Tesla T4', major=7, minor=5, total_memory=15079MB, multi_processor_count=40)
Params  FLOPS  forward (ms)  backward (ms)  input              output
7040    23.07  62.89         87.79          (16, 3, 640, 640)  (16, 64, 320, 320)
7040    23.07  15.52         48.69          (16, 3, 640, 640)  (16, 64, 320, 320)

1.7.0+cu101 cuda _CudaDeviceProperties(name='Tesla T4', major=7, minor=5, total_memory=15079MB, multi_processor_count=40)
Params  FLOPS  forward (ms)  backward (ms)  input              output
7040    23.07  11.61         79.72          (16, 3, 640, 640)  (16, 64, 320, 320)
7040    23.07  12.54         42.94          (16, 3, 640, 640)  (16, 64, 320, 320)

[image]

From the profiling results above, the Focus layer does speed the model up while cutting parameters.
But! That speedup comes with a precondition: it only materializes on a GPU. For cloud deployment, a GPU barely needs to worry about cache occupancy, and its fetch-and-compute style of processing makes the Focus layer work very well on GPU devices.
For edge chips, however, especially ones without GPU or NPU acceleration, frequent slice operations only clog the cache and add to the processing load. On top of that, converting the Focus layer during chip deployment is extremely unfriendly to beginners.
4. The Lightweight Design Philosophy
The design philosophy of shufflenetv2 offers a lot of useful guidance for resource-constrained chips. It proposes four guidelines for lightweight models:
(G1) Equal input/output channel widths minimize memory access cost (MAC; see the inequality after this list)
(G2) Excessive group convolution increases MAC
(G3) Network fragmentation (especially multi-branch designs) reduces parallelism
(G4) Element-wise operations (such as shortcut and Add) cannot be ignored
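To make (G1) concrete: for a 1×1 convolution over an h×w feature map with c1 input channels and c2 output channels, the shufflenetv2 paper derives, with B = hw·c1·c2 denoting the layer's FLOPs,

MAC = hw(c1 + c2) + c1·c2 ≥ 2·√(hw·B) + B/(hw)

where equality holds exactly when c1 = c2: under a fixed compute budget B, balanced channel widths minimize memory access.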
YOLOv5-Lite design principles:
(G1) Remove the Focus layer and avoid repeated slice operations
(G2) Avoid overusing C3 layers, particularly high-channel-count C3 layers
The C3 layer is the yolov5 author's improved version of the CSPBottleneck: simpler, faster, and lighter, achieving better results at a nearly similar cost. But the C3 layer uses multi-branch separated convolutions, and testing shows that frequent use of C3 layers, and of C3 layers with high channel counts, occupies a lot of cache and slows execution down.
(Why are high-channel C3 layers unfriendly to the CPU? Mainly because of shufflenetv2's G1 guideline: the higher the channel count, the larger the jump between the hidden channels and c1/c2. A not-quite-apt analogy: picture jumping up one stair step versus ten. The ten-step jump gets you there in one go, but you need a run-up, adjustment, and wind-up, so it may end up costing more time.)
import torch
import torch.nn as nn
# Conv and Bottleneck are yolov5 building blocks from models/common.py

class C3(nn.Module):
    # CSP Bottleneck with 3 convolutions
    def __init__(self, c1, c2, n=1, shortcut=True, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super(C3, self).__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c1, c_, 1, 1)
        self.cv3 = Conv(2 * c_, c2, 1)
        self.m = nn.Sequential(*[Bottleneck(c_, c_, shortcut, g, e=1.0) for _ in range(n)])
        # self.m = nn.Sequential(*[CrossConv(c_, c_, 3, 1, g, 1.0, shortcut) for _ in range(n)])

    def forward(self, x):
        # two parallel 1×1 branches; bottleneck stack on one, then concat and fuse
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))

(G3) Prune the channels of the yolov5 head; the pruning rules follow G1
(G4) Remove the 1024-channel conv and the 5×5 pooling at the tail of the shufflenetv2 backbone
These modules were designed for ImageNet leaderboard runs. In real business scenarios that don't have anywhere near that many classes, they can safely be removed: accuracy barely drops, while speed improves substantially, which the ablation experiments also confirmed; see the sketch below.
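Purely as an illustration (torchvision's shufflenet_v2_x1_0 stands in here for the actual YOLOv5-Lite backbone), trimming the ImageNet-oriented tail might look like this:

import torch
import torch.nn as nn
from torchvision.models import shufflenet_v2_x1_0

m = shufflenet_v2_x1_0()
# Keep only the feature-extraction stages; drop conv5 (the 1024-channel conv)
# and the global-pooling + fc classifier that exist for ImageNet classification.
backbone = nn.Sequential(m.conv1, m.maxpool, m.stage2, m.stage3, m.stage4)

x = torch.randn(1, 3, 320, 320)
print(backbone(x).shape)  # stage4 feature map, ready to feed a detection head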
5. What can it be used for?
(G1) Training
Sounds like stating the obvious... and it kind of is. YOLOv5-Lite's ablation experiments are based on the fifth release of yolov5 (i.e., the latest version), so you can carry over every feature of the fifth release without modification, for example:
Export heat maps:

[image]

Export the confusion matrix for data analysis:

[image]

Export PR curves:

[image]

(G2) No further modification is needed after exporting to onnx (as far as deployment is concerned)
(G3) DNN or onnxruntime calls no longer need extra stitching for the Focus layer (I was stuck here for a long time when playing with yolov5 before; it could be invoked, but accuracy dropped quite a bit). For instance:
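A minimal onnxruntime call sketch (the file name v5lite-s.onnx and the 320×320 input size are placeholder assumptions, not the repo's exact export settings):

import numpy as np
import onnxruntime as ort

# Load the exported model; no Focus-slice preprocessing is required any more.
session = ort.InferenceSession('v5lite-s.onnx', providers=['CPUExecutionProvider'])

# A plain NCHW float tensor is enough (stand-in for a resized, normalized image).
img = np.random.rand(1, 3, 320, 320).astype(np.float32)
outputs = session.run(None, {session.get_inputs()[0].name: img})
print([o.shape for o in outputs])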
(G4) ncnn int8 quantization keeps accuracy intact (covered in the next article)
(G5) Even a Raspberry Pi with 0.1 TOPS of compute can run yolov5 in real time
Running yolov5 on a Raspberry Pi used to be unthinkable: detecting a single frame took around 1000 ms, and even at 160×120 input it still took about 200 ms. It simply couldn't keep up.
But YOLOv5-Lite now makes it possible. My graduation project detects in spaces like elevator cars and stairwell corners, where an effective detection range of 3 m is enough. With the resolution set to 160×120, YOLOv5-Lite reaches up to 18 FPS, and stays around a stable 15 FPS with post-processing included.
Excluding the first three warm-up runs, with the device temperature stable above 45°C and ncnn as the forward-inference framework, here are two benchmark records for comparison:
# Run 4
pi@raspberrypi:~/Downloads/ncnn/build/benchmark $ ./benchncnn 8 4 0
loop_count = 8
num_threads = 4
powersave = 0
gpu_device = -1
cooling_down = 1
YOLOv5-Lite          min =  90.86  max =  93.53  avg =  91.56
YOLOv5-Lite-int8     min =  83.15  max =  84.17  avg =  83.65
YOLOv5-Lite-416      min = 154.51  max = 155.59  avg = 155.09
yolov4-tiny          min = 298.94  max = 302.47  avg = 300.69
nanodet_m            min =  86.19  max = 142.79  avg =  99.61
squeezenet           min =  59.89  max =  60.75  avg =  60.41
squeezenet_int8      min =  50.26  max =  51.31  avg =  50.75
mobilenet            min =  73.52  max =  74.75  avg =  74.05
mobilenet_int8       min =  40.48  max =  40.73  avg =  40.63
mobilenet_v2         min =  72.87  max =  73.95  avg =  73.31
mobilenet_v3         min =  57.90  max =  58.74  avg =  58.34
shufflenet           min =  40.67  max =  41.53  avg =  41.15
shufflenet_v2        min =  30.52  max =  31.29  avg =  30.88
mnasnet              min =  62.37  max =  62.76  avg =  62.56
proxylessnasnet      min =  62.83  max =  64.70  avg =  63.90
efficientnet_b0      min =  94.83  max =  95.86  avg =  95.35
efficientnetv2_b0    min = 103.83  max = 105.30  avg = 104.74
regnety_400m         min =  76.88  max =  78.28  avg =  77.46
blazeface            min =  13.99  max =  21.03  avg =  15.37
googlenet            min = 144.73  max = 145.86  avg = 145.19
googlenet_int8       min = 123.08  max = 124.83  avg = 123.96
resnet18             min = 181.74  max = 183.07  avg = 182.37
resnet18_int8        min = 103.28  max = 105.02  avg = 104.17
alexnet              min = 162.79  max = 164.04  avg = 163.29
vgg16                min = 867.76  max = 911.79  avg = 889.88
vgg16_int8           min = 466.74  max = 469.51  avg = 468.15
resnet50             min = 333.28  max = 338.97  avg = 335.71
resnet50_int8        min = 239.71  max = 243.73  avg = 242.54
squeezenet_ssd       min = 179.55  max = 181.33  avg = 180.74
squeezenet_ssd_int8  min = 131.71  max = 133.34  avg = 132.54
mobilenet_ssd        min = 151.74  max = 152.67  avg = 152.32
mobilenet_ssd_int8   min =  85.51  max =  86.19  avg =  85.77
mobilenet_yolo       min = 327.67  max = 332.85  avg = 330.36
mobilenetv2_yolov3   min = 221.17  max = 224.84  avg = 222.60

# Run 8
pi@raspberrypi:~/Downloads/ncnn/build/benchmark $ ./benchncnn 8 4 0
loop_count = 8
num_threads = 4
powersave = 0
gpu_device = -1
cooling_down = 1
nanodet_m            min =  81.15  max =  81.71  avg =  81.33
nanodet_m-416        min = 143.89  max = 145.06  avg = 144.67
YOLOv5-Lite          min =  84.30  max =  86.34  avg =  85.79
YOLOv5-Lite-int8     min =  80.98  max =  82.80  avg =  81.25
YOLOv5-Lite-416      min = 142.75  max = 146.10  avg = 144.34
yolov4-tiny          min = 276.09  max = 289.83  avg = 285.99
squeezenet           min =  59.37  max =  61.19  avg =  60.35
squeezenet_int8      min =  49.30  max =  49.66  avg =  49.43
mobilenet            min =  72.40  max =  74.13  avg =  73.37
mobilenet_int8       min =  39.92  max =  40.23  avg =  40.07
mobilenet_v2         min =  71.57  max =  73.07  avg =  72.29
mobilenet_v3         min =  54.75  max =  56.00  avg =  55.40
shufflenet           min =  40.07  max =  41.13  avg =  40.58
shufflenet_v2        min =  29.39  max =  30.25  avg =  29.86
mnasnet              min =  59.54  max =  60.18  avg =  59.96
proxylessnasnet      min =  61.06  max =  62.63  avg =  61.75
efficientnet_b0      min =  91.86  max =  95.01  avg =  92.84
efficientnetv2_b0    min = 101.03  max = 102.61  avg = 101.71
regnety_400m         min =  76.75  max =  78.58  avg =  77.60
blazeface            min =  13.18  max =  14.67  avg =  13.79
googlenet            min = 136.56  max = 138.05  avg = 137.14
googlenet_int8       min = 118.30  max = 120.17  avg = 119.23
resnet18             min = 164.78  max = 166.80  avg = 165.70
resnet18_int8        min =  98.58  max =  99.23  avg =  98.96
alexnet              min = 155.06  max = 156.28  avg = 155.56
vgg16                min = 817.64  max = 832.21  avg = 827.37
vgg16_int8           min = 457.04  max = 465.19  avg = 460.64
resnet50             min = 318.57  max = 323.19  avg = 320.06
resnet50_int8        min = 237.46  max = 238.73  avg = 238.06
squeezenet_ssd       min = 171.61  max = 173.21  avg = 172.10
squeezenet_ssd_int8  min = 128.01  max = 129.58  avg = 128.84
mobilenet_ssd        min = 145.60  max = 149.44  avg = 147.39
mobilenet_ssd_int8   min =  82.86  max =  83.59  avg =  83.22
mobilenet_yolo       min = 311.95  max = 374.33  avg = 330.15
mobilenetv2_yolov3   min = 211.89  max = 286.28  avg = 228.01

[image]

(G6) Comparing YOLOv5-Lite with yolov5s
[image]

Note: one hundred images were randomly sampled for inference, and the average time per image was rounded.
Closing remarks:
I had previously run yolov3-tiny, yolov4-tiny, nanodet, efficientnet-lite, and other lightweight networks on my own dataset, but none of them met expectations, whereas yolov5 actually exceeded what I had hoped for. That said, yolov5 really doesn't fall within the lightweight design philosophy, which is where the idea of modifying it came from: hoping that its powerful data augmentation and positive/negative anchor mechanisms would still deliver satisfying results. All in all, YOLOv5-Lite trains on the yolov5 platform and works very well on few-sample datasets.
With no overly complex interleaved parallel structures, keeping the network model as simple as possible, YOLOv5-Lite is designed purely for industrial deployment and better suits Arm-architecture processors; running it on a GPU gives terrible value for money.
[image]

Part of the motivation for this optimization is sentimental, since much of my earlier work was built on yolov5, and part of it is that it genuinely works for my own dataset (more precisely, for someone as starved of dataset resources as me, yolov5's various mechanisms really are robust on few-sample datasets).
Project address:
https://github.com/ppogg/YOLOv5-Lite
The project will continue to be updated and iterated on; stars and forks are welcome!
One last aside: I've been following yolov5's development all along. The U-version author has been updating much more frequently lately, so the sixth release of YOLOv5 is probably coming soon~