1. Introduction Today is day three of using the PyTorch framework to run the MNIST handwritten-digit dataset, and the focus is on training the network. This blog mainly records a learning path and collects the study materials.
Note: the code below is written for Python 2.7.
Day 1 (building the LeNet network): https://blog.csdn.net/qq_36627158/article/details/108098147
Day 2 (loading the MNIST dataset): https://blog.csdn.net/qq_36627158/article/details/108119048
Day 3 (training the model): https://blog.csdn.net/qq_36627158/article/details/108163693
Day 4 (single-example testing): https://blog.csdn.net/qq_36627158/article/details/108183655
2. Code (mnist_train.py) Thanks to 凯神 for providing the code and the patient guidance!
from lenet import Net
import torch
import torch.optim as optim
import torch.nn.functional as F
import matplotlib.pyplot as plt
from mnist_load import testset_loader, trainset_loader

LEARNING_RATE = 0.001
MOMENTUM = 0.9
EPOCH = 5

if torch.cuda.is_available():
    device = torch.device('cuda')
    print 'cuda'
else:
    device = torch.device('cpu')
    print 'cpu'

mnist_model = Net().to(device)

optimizer = optim.SGD(
    mnist_model.parameters(),
    lr=LEARNING_RATE,
    momentum=MOMENTUM
)
# save_model
def save_checkpoint(checkpoint_path, model, optimizer):
    # state_dict: a Python dictionary object that:
    #   - for a model, maps each layer to its parameter tensor;
    #   - for an optimizer, contains info about the optimizer's states and hyperparameters used.
    state = {
        'model': model.state_dict(),
        'optimizer': optimizer.state_dict()
    }
    torch.save(state, checkpoint_path)
    print 'model saved to ', checkpoint_path
# train
def mnist_train(epoch, save_interval):
    iteration = 0
    loss_plt = []

    for ep in range(epoch):
        mnist_model.train()  # set training mode (mnist_test() switches the model to eval mode)

        for batch_idx, batch_data in enumerate(trainset_loader):
            images, labels = batch_data
            images = images.to(device)
            labels = labels.to(device)

            optimizer.zero_grad()

            output = mnist_model(images)

            loss = F.cross_entropy(output, labels)
            loss_plt.append(loss.item())  # store the scalar, not the tensor, so the graph can be freed

            loss.backward()
            optimizer.step()

            print 'Train Epoch:', ep+1, '\tBatch: ', batch_idx+1, '/', len(trainset_loader), '\tLoss: ', loss.item()

            # different from before: saving model checkpoints
            if iteration % save_interval == 0 and iteration > 0:
                save_checkpoint('module/pytorch-mnist-batchsize-128-%i.pth' % iteration, mnist_model, optimizer)

            iteration += 1

        mnist_test()

    # save the final model
    save_checkpoint('module/pytorch-mnist-batch-128-%i.pth' % iteration, mnist_model, optimizer)

    plt.plot(loss_plt, label='loss')
    plt.legend()
    plt.show()
# test
def mnist_test():
    mnist_model.eval()  # set evaluation mode

    test_loss = 0
    correct = 0

    with torch.no_grad():
        for images, labels in testset_loader:
            images = images.to(device)
            labels = labels.to(device)

            output = mnist_model(images)

            # sum the per-sample losses so the average over the whole test set comes out right
            test_loss += F.cross_entropy(output, labels, reduction='sum').item()
            pred = output.max(1, keepdim=True)[1]  # get the index of the max log-probability
            correct += pred.eq(labels.view_as(pred)).sum().item()

    test_loss /= len(testset_loader.dataset)

    print '\nTest set: Average loss:', test_loss, '\tAccuracy:', (100. * correct / len(testset_loader.dataset)), '%\n'


if __name__ == '__main__':
    mnist_train(EPOCH, save_interval=1000)
3. Materials 1. The torch.optim optimization-algorithms package:
https://pytorch.org/docs/stable/optim.html
4. Details 1. OSError: [Errno 12] Cannot allocate memory
At first I thought my machine was simply too low-spec (not enough RAM) to load that many images per batch, so I kept lowering the BATCH_SIZE hyperparameter. When the error persisted no matter how small BATCH_SIZE got, I realized memory capacity probably was not the problem.
It turned out to be the number of worker processes used to load the data (batches):
https://blog.csdn.net/breeze210/article/details/99679048
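A minimal sketch of the fix, assuming the DataLoader is built in mnist_load.py roughly as in the day-2 post (trainset here stands in for the torchvision MNIST dataset; the exact names may differ):

from torch.utils.data import DataLoader

trainset_loader = DataLoader(
    trainset,
    batch_size=128,   # BATCH_SIZE was never the culprit
    shuffle=True,
    num_workers=0     # fewer data-loading workers; 0 loads batches in the main process
)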
2. You need to create the module folder yourself beforehand
Well, it turns out that when Python writes a file, it does not automatically create missing folders along the path. Noted!
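A small sketch using the standard library to create the folder before the first save_checkpoint call:

import os

# create the checkpoint folder up front so torch.save() does not fail
if not os.path.exists('module'):
    os.makedirs('module')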
3. The momentum parameter of the optimizer (I still need to read more material on optimizers)
凯神's explanation: MOMENTUM (momentum) is a parameter used by stochastic gradient descent when updating the model weights
https://www.lizenghai.com/archives/29512.html
https://pytorch.org/docs/stable/optim.html
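A rough sketch of the idea in plain Python; this follows the update rule PyTorch's optim.SGD applies when momentum is set, and the gradient values are made up:

MOMENTUM = 0.9
LEARNING_RATE = 0.001

w, v = 1.0, 0.0                   # one weight and its velocity
grads = [0.5, 0.4, 0.45, 0.5]     # hypothetical gradients over four iterations

for g in grads:
    v = MOMENTUM * v + g          # velocity: a decaying sum of past gradients
    w -= LEARNING_RATE * v        # the step follows the velocity, not the raw gradient

Because the velocity remembers previous gradients, consecutive steps in the same direction speed up, while oscillating gradients partially cancel out.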
4. model.to(device)
For a tensor, .to(device) returns a copy on the specified device; for a model (nn.Module), it moves the parameters and buffers onto the device and returns the module itself. All subsequent computation then runs on that device.
https://www.jb51.net/article/178049.htm
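A short sketch using the device and mnist_model from the code above (assuming, as in the day-1 post, that Net takes a (N, 1, 28, 28) batch):

import torch

# for a plain tensor, .to(device) returns a copy on the target device
x = torch.zeros(1, 1, 28, 28)   # a fake MNIST-shaped batch
x_dev = x.to(device)

# for a module, .to(device) moves its parameters/buffers in place and
# returns the module itself, hence `mnist_model = Net().to(device)` above
output = mnist_model(x_dev)     # model and input now sit on the same device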
5. Module.parameters()
https://blog.csdn.net/qq_39463274/article/details/105295272?utm_medium=distribute.pc_relevant_t0.none-task-blog-BlogCommendFromMachineLearnPai2-1.channel_param&depth_1-utm_source=distribute.pc_relevant_t0.none-task-blog-BlogCommendFromMachineLearnPai2-1.channel_param
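For reference, a quick way to inspect what parameters() hands to the optimizer, using named_parameters() so the layer names show up (mnist_model is the network from the code above):

# every learnable tensor in the network, with its name and shape
for name, param in mnist_model.named_parameters():
    print name, param.size()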
6. if __name__ == "__main__"
https://blog.csdn.net/yjk13703623757/article/details/77918633
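A minimal illustration (demo.py is a hypothetical file name):

# demo.py
def main():
    print 'running as a script'

if __name__ == '__main__':
    # True only when the file is run directly (python demo.py),
    # not when another module does `import demo`
    main()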
7. state_dict()
- https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.state_dict
- https://blog.csdn.net/VictoriaW/article/details/72821329?utm_medium=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-1.channel_param&depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-1.channel_param
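A quick look at both state_dicts, using the mnist_model and optimizer defined in mnist_train.py above:

# the model's state_dict maps each layer name to its parameter tensor
state = mnist_model.state_dict()
for key in state:
    print key, state[key].size()

# the optimizer's state_dict holds its hyperparameters and internal state
print optimizer.state_dict()['param_groups']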
8. checkpoint
A way to save the current state of your experiment so that you can pick up from where you left off.
https://www.cnblogs.com/Arborday/p/9740253.html
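For completeness, a loader that mirrors save_checkpoint above; a small sketch (load_checkpoint is my name, not from the original post):

def load_checkpoint(checkpoint_path, model, optimizer):
    # restore the two state_dicts written by save_checkpoint
    state = torch.load(checkpoint_path)
    model.load_state_dict(state['model'])
    optimizer.load_state_dict(state['optimizer'])
    print 'model loaded from ', checkpoint_path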
9. Why optimizer.zero_grad() is needed
Because the later backward pass accumulates gradients into each parameter's .grad field instead of overwriting it, the gradients must be cleared so the previous iteration's gradients do not contaminate the current one.
https://blog.csdn.net/scut_salmon/article/details/82414730
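The accumulation is easy to see on a toy tensor (a standalone sketch, independent of the MNIST code):

import torch

x = torch.ones(1, requires_grad=True)

(2 * x).sum().backward()
print x.grad    # tensor([2.])

# a second backward pass without zeroing adds to the old gradient
(2 * x).sum().backward()
print x.grad    # tensor([4.]) -- accumulated, not replaced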
10. The difference between optimizer.step() and loss.backward()
At first I could not keep straight what these two functions do. An analogy from a video made it click:
In linear regression, the weight-update formula is w_new = w_old - lr * gradient.
loss.backward() corresponds to computing the gradient;
optimizer.step() corresponds to applying w_new = w_old - lr * gradient with that gradient.
https://v.qq.com/x/page/t0554h33liw.html
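The division of labor is visible on a single weight (a self-contained sketch):

import torch
import torch.optim as optim

w = torch.tensor([1.0], requires_grad=True)
opt = optim.SGD([w], lr=0.1)

loss = (3 * w).sum()    # d(loss)/dw = 3
loss.backward()         # computes the gradient and stores it in w.grad
print w.grad            # tensor([3.])
print w.data            # tensor([1.]) -- the weight has not moved yet

opt.step()              # applies w_new = w_old - lr * gradient
print w.data            # tensor([0.7000]) = 1.0 - 0.1 * 3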
11. with torch.no_grad() and model.eval()
Use both. They do different things and have different scopes.
with torch.no_grad(): disables tracking of gradients in autograd.
model.eval(): changes the forward() behaviour of the module it is called on, e.g. it disables dropout and makes batch norm use the population statistics.
https://www.cnblogs.com/shiwanghualuo/p/11789018.html
https://blog.csdn.net/songyu0120/article/details/103884586?utm_medium=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-2.channel_param&depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-2.channel_param
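model.eval() is already visible in mnist_test() above; the effect of torch.no_grad() can be checked directly (a standalone sketch):

import torch

x = torch.ones(1, requires_grad=True)

with torch.no_grad():
    y = 2 * x
print y.requires_grad    # False -- no graph is recorded inside the block

z = 2 * x
print z.requires_grad    # True -- outside the block autograd tracks the op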