Mastering Image Classification with pytorch by Example

Background
I went from getting started with TensorFlow, to settling comfortably into Keras, to stepping out of that comfort zone and choosing pytorch. The root cause was the Tianchi Xuelang AI manufacturing competition: with essentially the same network architecture and parameters, and similar data preprocessing, the gap in scores between frameworks was too large for me to accept, so I switched to pytorch. Keras is now only for some of my NLP projects (I have, after all, accumulated a few "heirloom models").
Note: this project uses the traffic-sign dataset as its running example; you can download it here: traffic-sign. Full code: pytorch-image-classification.
Update: second revision, 2018-10-22, version 0.1.1.
Changes:

  1. Data augmentation switched from pytorch's built-in transforms to custom code, which makes later multi-channel model changes easier and lets us use opencv's powerful library for preprocessing (pytorch's built-in loading uses PIL).
  2. Printed output now goes through a logger and updates dynamically.
  3. The best model is saved based on an evaluation run every half epoch.
Environment: pytorch 0.4.0
0. Structure of the Image Classification Framework
After working through the theory of machine learning, deep learning, convolutional neural networks, and structuring machine learning projects, actually completing a real project is often where people get stuck; only when you can apply what you learned flexibly can you claim to have learned it. I took all of those courses in the first semester of my first year of grad school, but it wasn't until I started an internship in the second semester that I could gradually complete projects on my own and even enter some data competitions.
In my pytorch work I split a project into seven parts: data loading, model definition, metric definition, the training loop, the validation loop, the testing loop, and parameter definition.
The files are organized as follows:
==============================================================
  • checkpoints/
    • bestmodels/
  • dataset/
    • aug.py
    • dataloader.py
  • logs/
  • models/
    • pretrained_models/
    • model.py
  • submit/
  • config.py
  • main.py
  • utils.py
==============================================================
  • checkpoints/: models saved during training (bestmodels/ keeps the model that performs best on the validation set);
  • models/: custom model definitions; if you don't want pytorch's built-in architectures, add your own here (remember to add an __init__.py file);
  • submit/: prediction output, i.e. the result file a competition asks you to submit, usually in csv format;
  • logs/: training logs (.txt files);
  • dataset/: contains aug.py and dataloader.py, which implement data augmentation and data loading respectively;
  • config.py: parameter definitions; a parameter class holds everything you need to set or tune in advance, e.g. data paths, learning rate, number of training epochs;
  • model.py: model loading. Strictly optional, but I like a separate file because it makes fine-tuning easier;
  • utils.py: common evaluation metrics such as mAP, Accuracy, and loss;
  • main.py: the main file, containing the training, testing, and validation loops.
1. Parameter Definition: config.py
There are many ways to define parameters: some people set them directly in the main file; some use the argparse module; others keep a json file. To me none of these feels concise enough, so I prefer to create a standalone config.py containing a Python class whose class attributes are the parameters, as follows:
```python
class DefaultConfigs(object):
    # 1. string parameters
    train_data = "../data/train/"
    test_data = ""
    val_data = "../data/val/"
    model_name = "resnet50"
    weights = "./checkpoints/"
    best_models = weights + "best_model/"
    submit = "./submit/"
    logs = "./logs/"
    gpus = "1"

    # 2. numeric parameters
    epochs = 40
    batch_size = 4
    img_height = 224
    img_weight = 224
    num_classes = 62
    seed = 888
    lr = 1e-3
    lr_decay = 1e-4
    weight_decay = 1e-4

config = DefaultConfigs()
```
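Every other module then pulls the settings in with a single import; this is exactly how dataloader.py and main.py below use it:

```python
from config import config

print(config.model_name, config.lr)             # resnet50 0.001
save_dir = config.weights + config.model_name   # "./checkpoints/resnet50"
```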

2. Data Loading: dataloader.py
pytorch can read data in two ways. The first requires images of different classes to be separated into per-class folders, as in the traffic-sign dataset:
  • train/
    • 00000/
      • 01153_00000.png
      • 01153_00001.png
    • 00001/
      • 00025_00000.png
      • 00025_00001.png
```python
train_data = torchvision.datasets.ImageFolder(
    "/data2/dockspace_zcj/traffic-sign/train/",  # image root directory
    transform=None                               # data augmentation to apply
)
data_loader = torch.utils.data.DataLoader(train_data,
                                          batch_size=20,
                                          shuffle=True)
# During training you only need to iterate over data_loader;
# see main.py for details.
```
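One caveat with the snippet above: with `transform=None` the default collate function cannot batch PIL images, so in practice you pass at least a resize plus ToTensor transform. A minimal usage sketch under that assumption:

```python
import torchvision.transforms as T

# hypothetical minimal transform so that default collation works
train_data.transform = T.Compose([T.Resize((224, 224)), T.ToTensor()])

for imgs, labels in data_loader:
    # imgs: (20, 3, 224, 224) float tensor; labels: long tensor of folder indices
    break
```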

aug.py is too long to reproduce here; see github for the full file. All the common augmentation methods are implemented there, and you can add your own following the existing examples.
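To give a feel for the interface that the dataset code below expects from aug.py, here is a minimal sketch of a few of those transforms. The class names (`Compose`, `Resize`, `RandomHflip`, `Normalize`) match what dataloader.py imports, but the bodies are my own reconstruction, not the repository code:

```python
import random
import cv2
import numpy as np

class Compose(object):
    """Chain opencv-based transforms; each takes and returns a numpy image."""
    def __init__(self, transforms):
        self.transforms = transforms
    def __call__(self, img):
        for t in self.transforms:
            img = t(img)
        return img

class Resize(object):
    def __init__(self, size):
        self.size = size  # (width, height), as cv2.resize expects
    def __call__(self, img):
        return cv2.resize(img, self.size)

class RandomHflip(object):
    def __call__(self, img):
        # flip left-right with probability 0.5
        return cv2.flip(img, 1) if random.random() < 0.5 else img

class Normalize(object):
    def __init__(self, mean, std):
        self.mean = np.array(mean, dtype=np.float32)
        self.std = np.array(std, dtype=np.float32)
    def __call__(self, img):
        # scale to [0, 1], normalize per channel, and move channels first,
        # so __getitem__ can hand the array straight to torch.from_numpy
        img = img.astype(np.float32) / 255.0
        img = (img - self.mean) / self.std
        return img.transpose(2, 0, 1)
```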
With those custom transforms in hand, this project takes the second route: subclass torch.utils.data.Dataset to build a new data-loading python class, and apply the augmentation inside __getitem__(self, index). The code is as follows:
```python
from torch.utils.data import Dataset
from torchvision import transforms as T
from config import config
from PIL import Image
from dataset.aug import *
from itertools import chain
from glob import glob
from tqdm import tqdm
import random
import numpy as np
import pandas as pd
import os
import cv2
import torch

#1. set random seed
random.seed(config.seed)
np.random.seed(config.seed)
torch.manual_seed(config.seed)
torch.cuda.manual_seed_all(config.seed)

#2. define dataset
class ChaojieDataset(Dataset):
    def __init__(self, label_list, transforms=None, train=True, test=False):
        self.test = test
        self.train = train
        imgs = []
        if self.test:
            # test set: filenames only, no labels
            for index, row in label_list.iterrows():
                imgs.append(row["filename"])
            self.imgs = imgs
        else:
            for index, row in label_list.iterrows():
                imgs.append((row["filename"], row["label"]))
            self.imgs = imgs
        if transforms is None:
            if self.test or not train:
                # validation / test: deterministic resize + normalize only
                self.transforms = Compose([
                    Resize((config.img_weight, config.img_height)),
                    Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
                ])
            else:
                # training: add the random augmentations from aug.py
                self.transforms = Compose([
                    Resize((config.img_weight, config.img_height)),
                    FixRandomRotate(bound='Random'),
                    RandomHflip(),
                    RandomVflip(),
                    Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
                ])
        else:
            self.transforms = transforms

    def __getitem__(self, index):
        if self.test:
            filename = self.imgs[index]
            img = cv2.imread(filename)
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # opencv loads BGR
            img = self.transforms(img)
            return torch.from_numpy(img).float(), filename
        else:
            filename, label = self.imgs[index]
            img = cv2.imread(filename)
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
            img = self.transforms(img)
            return torch.from_numpy(img).float(), label

    def __len__(self):
        return len(self.imgs)

def collate_fn(batch):
    # stack images into one batch tensor, keep labels as a plain list
    imgs = []
    label = []
    for sample in batch:
        imgs.append(sample[0])
        label.append(sample[1])
    return torch.stack(imgs, 0), label

def get_files(root, mode):
    # for test
    if mode == "test":
        files = []
        for img in os.listdir(root):
            files.append(root + img)
        files = pd.DataFrame({"filename": files})
        return files
    elif mode == "train" or mode == "val":
        all_data_path, labels = [], []
        image_folders = list(map(lambda x: root + x, os.listdir(root)))
        all_images = list(chain.from_iterable(
            list(map(lambda x: glob(x + "/*.png"), image_folders))))
        print("loading train dataset")
        for file in tqdm(all_images):
            all_data_path.append(file)
            labels.append(int(file.split("/")[-2]))  # class id = folder name
        all_files = pd.DataFrame({"filename": all_data_path, "label": labels})
        return all_files
    else:
        print("check the mode please!")
```

Note: get_files(root, mode) builds the file list as a pandas DataFrame, which makes it easy to randomly split off a validation set when the dataset doesn't provide one, in particular to split it in a class-balanced way, as the sketch below shows.
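For instance, when a competition only ships a training folder, a stratified hold-out split of the DataFrame mirrors the commented-out train_test_split line in main.py below:

```python
from sklearn.model_selection import train_test_split
from dataset.dataloader import get_files
from config import config

origin_files = get_files(config.train_data, "train")
# hold out 10% for validation while preserving per-class proportions
train_data_list, val_data_list = train_test_split(
    origin_files, test_size=0.1, stratify=origin_files["label"])
```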
3. Model Definition: model.py
This file exists partly because putting that much code in the main file makes it bloated, and partly because a separate file makes it easier to modify the model for fine-tuning. Taking resnet101 as an example:
```python
import torchvision
import torch.nn.functional as F
from torch import nn
from config import config

def get_net():
    #return MyModel(torchvision.models.resnet101(pretrained = True))
    model = torchvision.models.resnet101(pretrained=True)
    model.avgpool = nn.AdaptiveAvgPool2d(1)
    model.fc = nn.Linear(2048, config.num_classes)
    return model
```
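As an example of the kind of fine-tuning tweak this file makes easy, the sketch below (my own variant, not part of the repository) freezes the pretrained backbone and trains only the new classifier head; the optimizer in main.py would then need to receive only the trainable parameters:

```python
def get_net_frozen():
    # hypothetical variant: freeze the pretrained resnet101 backbone and
    # train only the freshly initialized fully-connected head
    model = torchvision.models.resnet101(pretrained=True)
    for param in model.parameters():
        param.requires_grad = False
    model.avgpool = nn.AdaptiveAvgPool2d(1)
    model.fc = nn.Linear(2048, config.num_classes)  # new layer trains as usual

    # pass only trainable params to the optimizer, e.g.:
    # optim.SGD(filter(lambda p: p.requires_grad, model.parameters()), lr=config.lr)
    return model
```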

4. Evaluation Metrics: utils.py
pytorch, unlike keras, doesn't ship with evaluation metrics already packaged up; in keras you just pass metrics=[acc] to fit, and other metrics are added just as easily. You can use the torchnet module instead, but when I tried it the metric I wanted was missing, so I defined my own; the result is much the same.
```python
class AverageMeter(object):
    """Computes and stores the average and current value"""
    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

def accuracy(y_pred, y_actual, topk=(1,)):
    """Computes the precision@k for the specified values of k"""
    maxk = max(topk)
    batch_size = y_actual.size(0)

    _, pred = y_pred.topk(maxk, 1, True, True)
    pred = pred.t()
    correct = pred.eq(y_actual.view(1, -1).expand_as(pred))

    res = []
    for k in topk:
        correct_k = correct[:k].view(-1).float().sum(0)
        res.append(correct_k.mul_(100.0 / batch_size))
    return res
```
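main.py below also imports a few helpers from utils.py that aren't shown above (`Logger`, `save_checkpoint`, `get_learning_rate`, `time_to_str`). Here are minimal sketches, inferred from how main.py calls them rather than copied from the repository; the exact checkpoint paths may differ from the real code:

```python
import os
import sys
import shutil
import torch
from config import config

class Logger(object):
    """Tee-style logger: writes to stdout and to a log file."""
    def __init__(self):
        self.file = None
    def open(self, file, mode='w'):
        self.file = open(file, mode)
    def write(self, msg):
        sys.stdout.write(msg)
        self.file.write(msg)
        self.file.flush()

def save_checkpoint(state, is_best, fold):
    # save the latest checkpoint; copy it to best_models/ when it is the best so far
    filename = config.weights + config.model_name + os.sep + str(fold) + os.sep + "checkpoint.pth.tar"
    torch.save(state, filename)
    if is_best:
        shutil.copyfile(filename, config.best_models + str(fold) + os.sep + "model_best.pth.tar")

def get_learning_rate(optimizer):
    # current lr of the first parameter group
    return optimizer.param_groups[0]["lr"]

def time_to_str(t, mode='min'):
    # format elapsed seconds as "H hr MM min", matching the training log
    t = int(t)
    return '%d hr %02d min' % (t // 3600, (t % 3600) // 60)
```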

5. Main File: main.py
The reason we define our own training, validation, and testing functions is simply that pytorch doesn't provide them ready-made, so we set them up ourselves. The details are commented in the code below; feel free to contact me if anything is unclear.
```python
# -*- coding: utf-8 -*-
# @Time: 2018/7/31 09:41
# @Author: Spytensor
# @File: main.py
# @Email: zhuchaojie@buaa.edu.cn
#====================================================
# model training / validation / prediction
#====================================================
import os
import random
import time
import json
import torch
import torchvision
import numpy as np
import pandas as pd
import warnings
from datetime import datetime
from torch import nn, optim
from config import config
from collections import OrderedDict
from torch.autograd import Variable
from torch.utils.data import DataLoader
from dataset.dataloader import *
from sklearn.model_selection import train_test_split, StratifiedKFold
from timeit import default_timer as timer
from models.model import *
from utils import *

#1. set random.seed and cudnn performance
random.seed(config.seed)
np.random.seed(config.seed)
torch.manual_seed(config.seed)
torch.cuda.manual_seed_all(config.seed)
os.environ["CUDA_VISIBLE_DEVICES"] = config.gpus
torch.backends.cudnn.benchmark = True
warnings.filterwarnings('ignore')

#2. evaluate func
def evaluate(val_loader, model, criterion):
    #2.1 define meters
    losses = AverageMeter()
    top1 = AverageMeter()
    top2 = AverageMeter()
    #2.2 switch to evaluate mode and confirm model has been transferred to cuda
    model.cuda()
    model.eval()
    with torch.no_grad():
        for i, (input, target) in enumerate(val_loader):
            input = Variable(input).cuda()
            target = Variable(torch.from_numpy(np.array(target)).long()).cuda()
            #2.2.1 compute output
            output = model(input)
            loss = criterion(output, target)
            #2.2.2 measure accuracy and record loss
            precision1, precision2 = accuracy(output, target, topk=(1, 2))
            losses.update(loss.item(), input.size(0))
            top1.update(precision1[0], input.size(0))
            top2.update(precision2[0], input.size(0))
    return [losses.avg, top1.avg, top2.avg]

#3. test model on public dataset and save the probability matrix
def test(test_loader, model, folds):
    #3.1 confirm the model converted to cuda
    csv_map = OrderedDict({"filename": [], "probability": []})
    model.cuda()
    model.eval()
    for i, (input, filepath) in enumerate(tqdm(test_loader)):
        #3.2 change everything to cuda and get only basename
        filepath = [os.path.basename(x) for x in filepath]
        with torch.no_grad():
            image_var = Variable(input).cuda()
            #3.3 output
            #print(filepath)
            #print(input, input.shape)
            y_pred = model(image_var)
            #print(y_pred.shape)  # debug
            smax = nn.Softmax(1)
            smax_out = smax(y_pred)
        #3.4 save probability to csv files
        csv_map["filename"].extend(filepath)
        for output in smax_out:
            prob = "; ".join([str(i) for i in output.data.tolist()])
            csv_map["probability"].append(prob)
    result = pd.DataFrame(csv_map)
    result["probability"] = result["probability"].map(lambda x: [float(i) for i in x.split("; ")])
    result.to_csv("./submit/{}_submission.csv".format(config.model_name + "_" + str(folds)),
                  index=False, header=None)

#4. main function
def main():
    fold = 0
    #4.1 mkdirs
    if not os.path.exists(config.submit):
        os.mkdir(config.submit)
    if not os.path.exists(config.weights):
        os.mkdir(config.weights)
    if not os.path.exists(config.best_models):
        os.mkdir(config.best_models)
    if not os.path.exists(config.logs):
        os.mkdir(config.logs)
    if not os.path.exists(config.weights + config.model_name + os.sep + str(fold) + os.sep):
        os.makedirs(config.weights + config.model_name + os.sep + str(fold) + os.sep)
    if not os.path.exists(config.best_models + config.model_name + os.sep + str(fold) + os.sep):
        os.makedirs(config.best_models + config.model_name + os.sep + str(fold) + os.sep)
    #4.2 get model and optimizer
    model = get_net()
    model = torch.nn.DataParallel(model)
    model.cuda()
    optimizer = optim.SGD(model.parameters(), lr=config.lr, momentum=0.9, weight_decay=config.weight_decay)
    #optimizer = optim.Adam(model.parameters(), lr=config.lr, amsgrad=True, weight_decay=config.weight_decay)
    criterion = nn.CrossEntropyLoss().cuda()
    log = Logger()
    log.open(config.logs + "log_train.txt", mode="a")
    log.write("\n------------------------------------ [START %s] %s\n\n" %
              (datetime.now().strftime('%Y-%m-%d %H:%M:%S'), '-' * 40))
    #4.3 some parameters for K-fold and restart model
    start_epoch = 0
    best_precision1 = 0
    resume = False
    #4.4 restart the training process
    if resume:
        checkpoint = torch.load(config.best_models + str(fold) + "/model_best.pth.tar")
        start_epoch = checkpoint["epoch"]
        fold = checkpoint["fold"]
        best_precision1 = checkpoint["best_precision1"]
        model.load_state_dict(checkpoint["state_dict"])
        optimizer.load_state_dict(checkpoint["optimizer"])
    #4.5 get files and split for K-fold dataset
    #4.5.1 read files
    train_data_list = get_files(config.train_data, "train")
    val_data_list = get_files(config.val_data, "val")
    #test_files = get_files(config.test_data, "test")
    """
    # if no validation set is provided, split one off here
    #4.5.2 split
    split_fold = StratifiedKFold(n_splits=3)
    folds_indexes = split_fold.split(X=origin_files["filename"], y=origin_files["label"])
    folds_indexes = np.array(list(folds_indexes))
    fold_index = folds_indexes[fold]
    #4.5.3 using fold index to split for train data and val data
    train_data_list = pd.concat([origin_files["filename"][fold_index[0]],
                                 origin_files["label"][fold_index[0]]], axis=1)
    val_data_list = pd.concat([origin_files["filename"][fold_index[1]],
                               origin_files["label"][fold_index[1]]], axis=1)
    """
    #train_data_list, val_data_list = train_test_split(origin_files, test_size=0.1, stratify=origin_files["label"])
    #4.5.4 load dataset
    train_dataloader = DataLoader(ChaojieDataset(train_data_list), batch_size=config.batch_size,
                                  shuffle=True, collate_fn=collate_fn, pin_memory=True)
    val_dataloader = DataLoader(ChaojieDataset(val_data_list, train=False), batch_size=config.batch_size * 2,
                                shuffle=True, collate_fn=collate_fn, pin_memory=False)
    #test_dataloader = DataLoader(ChaojieDataset(test_files, test=True), batch_size=1, shuffle=False, pin_memory=False)
    #scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, "max", verbose=1, patience=3)
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)
    #4.5.5.1 define metrics
    train_losses = AverageMeter()
    train_top1 = AverageMeter()
    train_top2 = AverageMeter()
    valid_loss = [np.inf, 0, 0]
    model.train()
    #logs
    log.write('** start training here! **\n')
    log.write('                          |------------ VALID -------------|----------- TRAIN -------------|\n')
    log.write('lr         iter   epoch  | loss   top-1   top-2  | loss   top-1   top-2  | time\n')
    log.write('----------------------------------------------------------------------------------------------------\n')
    #4.5.5 train
    start = timer()
    for epoch in range(start_epoch, config.epochs):
        scheduler.step(epoch)
        #4.5.5.2 train
        for iter, (input, target) in enumerate(train_dataloader):
            lr = get_learning_rate(optimizer)
            # evaluate every half epoch
            if iter == len(train_dataloader) // 2:
                valid_loss = evaluate(val_dataloader, model, criterion)
                is_best = valid_loss[1] > best_precision1
                best_precision1 = max(valid_loss[1], best_precision1)
                save_checkpoint({
                    "epoch": epoch + 1,
                    "model_name": config.model_name,
                    "state_dict": model.state_dict(),
                    "best_precision1": best_precision1,
                    "optimizer": optimizer.state_dict(),
                    "fold": fold,
                    "valid_loss": valid_loss,
                }, is_best, fold)
                #adjust learning rate
                #scheduler.step(valid_loss[1])
                print("\r", end="", flush=True)
                log.write('%0.8f %5.1f %6.1f | %0.3f %0.3f %0.3f | %0.3f %0.3f %0.3f | %s' % (
                    lr, iter / len(train_dataloader) + epoch, epoch,
                    valid_loss[0], valid_loss[1], valid_loss[2],
                    train_losses.avg, train_top1.avg, train_top2.avg,
                    time_to_str((timer() - start), 'min')))
                log.write('\n')
                time.sleep(0.01)
            #4.5.5 switch to continue train process
            #scheduler.step(epoch)
            model.train()
            input = Variable(input).cuda()
            target = Variable(torch.from_numpy(np.array(target)).long()).cuda()
            output = model(input)
            loss = criterion(output, target)

            precision1_train, precision2_train = accuracy(output, target, topk=(1, 2))
            train_losses.update(loss.item(), input.size(0))
            train_top1.update(precision1_train[0], input.size(0))
            train_top2.update(precision2_train[0], input.size(0))
            #backward
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            lr = get_learning_rate(optimizer)
            print('\r', end='', flush=True)
            print('%0.8f %5.1f %6.1f | %0.3f %0.3f %0.3f | %0.3f %0.3f %0.3f | %s' % (
                lr, iter / len(train_dataloader) + epoch, epoch,
                valid_loss[0], valid_loss[1], valid_loss[2],
                train_losses.avg, train_top1.avg, train_top2.avg,
                time_to_str((timer() - start), 'min')), end='', flush=True)
    # best_model = torch.load(config.best_models + os.sep + str(fold) + 'model_best.pth.tar')
    # model.load_state_dict(best_model["state_dict"])
    # test(test_dataloader, model, fold)

if __name__ == "__main__":
    main()
```
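A caveat worth noting (my observation, not from the original code): because the model is wrapped in `torch.nn.DataParallel` before training, every key in the saved `state_dict` carries a `module.` prefix. Loading a checkpoint into a bare, unwrapped model therefore needs the prefix stripped:

```python
# illustrative only; the checkpoint path follows the resume logic in main()
checkpoint = torch.load(config.best_models + str(fold) + "/model_best.pth.tar")
plain_model = get_net()  # not wrapped in DataParallel
state_dict = {k.replace("module.", "", 1): v
              for k, v in checkpoint["state_dict"].items()}
plain_model.load_state_dict(state_dict)
```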

6. Training Results
```
------------------------------------ [START 2018-10-22 19:47:48] ----------------------------------------
loading train dataset
100%|██████████| 4572/4572 [00:00<00:00, 589769.58it/s]
loading train dataset
100%|██████████| 2520/2520 [00:00<00:00, 603496.98it/s]
** start training here! **
                          |------------ VALID -------------|----------- TRAIN -------------|
lr          iter   epoch  | loss   top-1   top-2  | loss   top-1   top-2  | time
----------------------------------------------------------------------------------------------------
0.00010000   0.5    0.0   | 0.578  82.063  91.706 | 1.661  63.354  72.242 | 0 hr 01 min
0.00010000   1.5    1.0   | 0.254  93.532  96.270 | 0.936  78.442  85.356 | 0 hr 04 min
0.00010000   2.5    2.0   | 0.226  94.563  97.619 | 0.691  83.567  89.771 | 0 hr 06 min
0.00010000   3.5    3.0   | 0.186  91.944  97.976 | 0.551  86.738  92.206 | 0 hr 09 min
0.00010000   4.5    4.0   | 0.214  95.357  99.087 | 0.461  88.771  93.700 | 0 hr 11 min
0.00010000   5.5    5.0   | 0.111  97.222  99.246 | 0.399  90.161  94.699 | 0 hr 14 min
```

7. Summary
Whichever framework you use, the one that feels comfortable to you is the best one. Since `pytorch` is still less polished than `keras` and `tensorflow` and has some hard-to-understand bugs, it is best to pin the version you work against; this whole project uses `pytorch 0.4.0`. To be clear, this article came out of integrating several codebases while I worked on image classification problems, adding the modules I needed; the code I drew on is listed in the references. Once more, the full code is at [pytorch-image-classification](https://github.com/spytensor/pytorch-image-classification)!

8. References
- [pytorch-classification](https://github.com/bearpaw/pytorch-classification)
- [pytorch-best-practice](https://github.com/chenyuntc/pytorch-best-practice)
