2021 Homework 1: COVID-19 Cases Prediction (Regression)
My final optimized version:
https://github.com/Orange-yy/ML2021/blob/main/%E2%80%9CML2021Spring_HW1_ipynb%E2%80%9D%EF%BC%88%E6%94%B9%E8%BF%9B%E7%89%88%EF%BC%89.ipynb
Objectives:
- Solve a regression problem with deep neural networks (DNN).
- Understand basic DNN training tips.
- Get familiar with PyTorch.
tr_path = 'covid.train.csv'  # path to training data
tt_path = 'covid.test.csv'   # path to testing data

!gdown --id '19CCyCgJrUxtvgZF53vnctJiOJ23T5mqF' --output covid.train.csv
!gdown --id '1CE240jLm2npU-tdz81-oVKEF3T2yfT1O' --output covid.test.csv
Import Some Packages
# PyTorch
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# For data preprocess
import numpy as np
import csv
import os

# For plotting
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure

myseed = 42069  # set a random seed for reproducibility
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
# The next few lines fix the random seeds of the libraries used later
np.random.seed(myseed)
torch.manual_seed(myseed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(myseed)
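What is torch.backends.cudnn.deterministic?
As the name suggests, setting it to True makes cuDNN return a deterministic convolution algorithm on every call, i.e. the default algorithm. Combined with a fixed Torch random seed, this should ensure that the network produces the same output for the same input on every run.
torch.backends.cudnn.benchmark = False
Setting torch.backends.cudnn.benchmark = True makes the program spend a little extra time at startup searching for the fastest convolution implementation for each convolution layer, which speeds up the network. It suits cases where the network structure is fixed (not dynamic) and the input shape (batch size, image size, input channels) does not change, which covers most common setups. Conversely, if the convolution configuration keeps changing, the program repeats the search over and over and ends up slower. It is set to False here in favor of reproducibility.
For details, see: https://blog.csdn.net/byron123456sfsfsfa/article/details/96003317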
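The seeding steps described above can be collected into one helper. The following is a minimal sketch, not part of the sample code: the name set_seed is hypothetical, and seeding Python's own random module is an extra step assumed here for completeness. On CPU, drawing random numbers after re-seeding should give identical values.

```python
import random

import numpy as np
import torch


def set_seed(seed):
    """Fix the seeds of Python, NumPy and PyTorch (hypothetical helper)."""
    random.seed(seed)                      # assumption: not seeded in the sample code
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    # Prefer deterministic cuDNN kernels over auto-tuned ones
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


set_seed(42069)
first = torch.rand(3)
set_seed(42069)
second = torch.rand(3)
print(torch.equal(first, second))  # True
```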
Some Utilities (for plotting)
You do not need to modify this part.
def get_device():
    ''' Get device (if GPU is available, use GPU) '''
    return 'cuda' if torch.cuda.is_available() else 'cpu'

def plot_learning_curve(loss_record, title=''):
    ''' Plot learning curve of your DNN (train & dev loss) '''
    total_steps = len(loss_record['train'])
    x_1 = range(total_steps)
    x_2 = x_1[::len(loss_record['train']) // len(loss_record['dev'])]
    figure(figsize=(6, 4))
    plt.plot(x_1, loss_record['train'], c='tab:red', label='train')
    plt.plot(x_2, loss_record['dev'], c='tab:cyan', label='dev')
    plt.ylim(0.0, 5.)
    plt.xlabel('Training steps')
    plt.ylabel('MSE loss')
    plt.title('Learning curve of {}'.format(title))
    plt.legend()
    plt.show()

def plot_pred(dv_set, model, device, lim=35., preds=None, targets=None):
    ''' Plot prediction of your DNN '''
    if preds is None or targets is None:
        model.eval()
        preds, targets = [], []
        for x, y in dv_set:
            x, y = x.to(device), y.to(device)
            with torch.no_grad():
                pred = model(x)
                preds.append(pred.detach().cpu())
                targets.append(y.detach().cpu())
        preds = torch.cat(preds, dim=0).numpy()
        targets = torch.cat(targets, dim=0).numpy()

    figure(figsize=(5, 5))
    plt.scatter(targets, preds, c='r', alpha=0.5)
    plt.plot([-0.2, lim], [-0.2, lim], c='b')
    plt.xlim(-0.2, lim)
    plt.ylim(-0.2, lim)
    plt.xlabel('ground truth value')
    plt.ylabel('predicted value')
    plt.title('Ground Truth v.s. Prediction')
    plt.show()
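In plot_learning_curve, the training loss is logged once per step while the dev loss is logged once per epoch, so the dev points have to be spread out along the step axis. The sketch below replays that indexing with made-up record lengths (9 steps per epoch matches 2430 training samples at batch size 270; the 3 epochs are an arbitrary illustration).

```python
# Training loss is logged once per step, dev loss once per epoch.
steps_per_epoch = 9   # e.g. 2430 training samples / batch size 270
n_epochs = 3          # illustrative value
train_losses = [0.0] * (steps_per_epoch * n_epochs)
dev_losses = [0.0] * n_epochs

x_1 = range(len(train_losses))
stride = len(train_losses) // len(dev_losses)
x_2 = x_1[::stride]   # one x position per dev record

print(list(x_2))  # [0, 9, 18]
```

The two lengths only line up like this when the number of training records is a multiple of the number of dev records, which holds for the sample configuration.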
Preprocess
We have three kinds of datasets:
- train: for training
- dev: for validation
- test: for testing (w/o target value)
COVID19Dataset below does:
- read .csv files
- extract features
- split covid.train.csv into train/dev sets
- normalize features
Finishing the TODO below might make you pass the medium baseline.
Some notes on the COVID19Dataset class:
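Before tackling any machine learning problem, the data has to be read and preprocessed. PyTorch provides many utilities that make both steps easy.
torch.utils.data.Dataset is an abstract class representing a custom dataset. You define your own dataset class by inheriting from it, and only need to implement two methods: __len__ and __getitem__.
Inheriting from torch.utils.data.Dataset gives us the dataset class we need, and we can iterate over it to fetch samples one at a time. That alone makes batching, shuffling, or parallel loading awkward, so PyTorch also provides torch.utils.data.DataLoader, which defines a new iterator that groups the output of a custom dataset, or of one of PyTorch's built-in datasets, into Tensors of the given batch size. These batches can then be fed to the model (since PyTorch 0.4, wrapping them in Variable is no longer necessary).
In short, torch.utils.data.Dataset and torch.utils.data.DataLoader together make reading data simple and fast.
For details, see: https://blog.csdn.net/qq_36653505/article/details/83351808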
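To make the Dataset / DataLoader division of labor concrete, here is a small sketch on made-up data. ToyDataset and its contents are hypothetical and exist only to illustrate the interface: the dataset returns one sample at a time, and the loader stacks samples into batches.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyDataset(Dataset):
    """Hypothetical 10-sample dataset used only to illustrate the interface."""

    def __init__(self):
        self.data = torch.arange(20, dtype=torch.float32).reshape(10, 2)
        self.target = torch.arange(10, dtype=torch.float32)

    def __getitem__(self, index):
        # Return one (features, target) pair
        return self.data[index], self.target[index]

    def __len__(self):
        return len(self.data)


loader = DataLoader(ToyDataset(), batch_size=4, shuffle=False)
batch_shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in loader]
print(batch_shapes)  # [((4, 2), (4,)), ((4, 2), (4,)), ((2, 2), (2,))]
```

With 10 samples and a batch size of 4, the last batch holds the remaining 2 samples because drop_last defaults to False.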
class COVID19Dataset(Dataset):
    ''' Dataset for loading and preprocessing the COVID19 dataset '''
    def __init__(self,
                 path,
                 mode='train',
                 target_only=False):
        self.mode = mode

        # Read data into numpy arrays
        with open(path, 'r') as fp:
            data = list(csv.reader(fp))  # read row by row into a list
            data = np.array(data[1:])[:, 1:].astype(float)

        if not target_only:
            feats = list(range(93))
        else:
            # TODO: Using 40 states & 2 tested_positive features (indices = 57 & 75)
            pass

        if mode == 'test':
            # Testing data
            # data: 893 x 93 (40 states + day 1 (18) + day 2 (18) + day 3 (17))
            data = data[:, feats]
            self.data = torch.FloatTensor(data)
        else:
            # Training data (train/dev sets)
            # data: 2700 x 94 (40 states + day 1 (18) + day 2 (18) + day 3 (18))
            target = data[:, -1]
            data = data[:, feats]

            # Splitting training data into train & dev sets
            if mode == 'train':
                indices = [i for i in range(len(data)) if i % 10 != 0]
            elif mode == 'dev':
                indices = [i for i in range(len(data)) if i % 10 == 0]

            # Convert data into PyTorch tensors
            self.data = torch.FloatTensor(data[indices])
            self.target = torch.FloatTensor(target[indices])

        # Normalize features (you may remove this part to see what will happen)
        self.data[:, 40:] = \
            (self.data[:, 40:] - self.data[:, 40:].mean(dim=0, keepdim=True)) \
            / self.data[:, 40:].std(dim=0, keepdim=True)

        self.dim = self.data.shape[1]

        print('Finished reading the {} set of COVID19 Dataset ({} samples found, each dim = {})'
              .format(mode, len(self.data), self.dim))

    def __getitem__(self, index):
        # Returns one sample at a time
        if self.mode in ['train', 'dev']:
            # For training
            return self.data[index], self.target[index]
        else:
            # For testing (no target)
            return self.data[index]

    def __len__(self):
        # Returns the size of the dataset
        return len(self.data)
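Two pieces of index arithmetic in the class above can be checked without loading any data. The sketch below is one possible reading of the TODO comment (40 state columns plus the tested_positive columns at indices 57 and 75), followed by the modulo-10 train/dev split applied to the 2700 training rows mentioned in the comments. Treat the feature list as an assumption, not the official answer.

```python
# One reading of the TODO: 40 state columns plus the two earlier-day
# tested_positive columns (indices 57 and 75).
feats = list(range(40)) + [57, 75]

# Train/dev split used by COVID19Dataset: every 10th row goes to dev.
n_rows = 2700
train_idx = [i for i in range(n_rows) if i % 10 != 0]
dev_idx = [i for i in range(n_rows) if i % 10 == 0]

print(len(feats), len(train_idx), len(dev_idx))  # 42 2430 270
```

Since the normalization line in the class is hard-coded to columns 40 and onward, it still lines up with this shorter feature list: the two tested_positive columns end up at positions 40 and 41.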
DataLoader
A DataLoader loads data from a given Dataset into batches.
def prep_dataloader(path, mode, batch_size, n_jobs=0, target_only=False):
    ''' Generates a dataset, then puts it into a dataloader. '''
    dataset = COVID19Dataset(path, mode=mode, target_only=target_only)  # Construct dataset
    dataloader = DataLoader(
        dataset, batch_size,
        shuffle=(mode == 'train'), drop_last=False,
        num_workers=n_jobs, pin_memory=True)  # Construct dataloader
    return dataloader
Deep Neural Network
NeuralNet is an nn.Module designed for regression. The DNN consists of 2 fully-connected layers with ReLU activation. This module also includes a function cal_loss for calculating loss.
class NeuralNet(nn.Module):
    ''' A simple fully-connected deep neural network '''
    def __init__(self, input_dim):
        super(NeuralNet, self).__init__()

        # Define your neural network here
        # TODO: How to modify this model to achieve better performance?
        self.net = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1)
        )

        # Mean squared error loss
        self.criterion = nn.MSELoss(reduction='mean')

    def forward(self, x):
        ''' Given input of size (batch_size x input_dim), compute output of the network '''
        return self.net(x).squeeze(1)

    def cal_loss(self, pred, target):
        ''' Calculate loss '''
        # TODO: you may implement L1/L2 regularization here
        return self.criterion(pred, target)
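The regularization TODO in cal_loss could be approached roughly as follows. This is only a sketch on a tiny stand-in network with random data: the helper l2_penalty and the strength l2_lambda are hypothetical, and the value 1e-4 is an untuned placeholder. An alternative that needs no change to the loss is the weight_decay argument of torch.optim.SGD.

```python
import torch
import torch.nn as nn


def l2_penalty(model):
    """Sum of squared parameters (hypothetical helper, not in the sample code)."""
    return sum((p ** 2).sum() for p in model.parameters())


# Small stand-in network and random data, for illustration only
net = nn.Sequential(nn.Linear(3, 4), nn.ReLU(), nn.Linear(4, 1))
criterion = nn.MSELoss(reduction='mean')
l2_lambda = 1e-4  # assumed regularization strength, to be tuned

x = torch.randn(8, 3)
y = torch.randn(8)
pred = net(x).squeeze(1)
mse = criterion(pred, y)
loss = mse + l2_lambda * l2_penalty(net)  # regularized training loss
loss.backward()

print(loss.item() >= mse.item())  # True
```

If the penalty is added inside cal_loss, the dev loss reported during validation would include it as well, so it may be worth keeping the plain MSE available for model selection.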
Train/Dev/Test
Training
def train(tr_set, dv_set, model, config, device):
    ''' DNN training '''

    n_epochs = config['n_epochs']  # Maximum number of epochs

    # Setup optimizer
    optimizer = getattr(torch.optim, config['optimizer'])(
        model.parameters(), **config['optim_hparas'])

    min_mse = 1000.
    loss_record = {'train': [], 'dev': []}  # for recording training loss
    early_stop_cnt = 0
    epoch = 0
    while epoch < n_epochs:
        model.train()  # set model to training mode
        for x, y in tr_set:  # iterate through the dataloader
            optimizer.zero_grad()  # set gradient to zero
            x, y = x.to(device), y.to(device)  # move data to device (cpu/cuda)
            pred = model(x)  # forward pass (compute output)
            mse_loss = model.cal_loss(pred, y)  # compute loss
            mse_loss.backward()  # compute gradient (backpropagation)
            optimizer.step()  # update model with optimizer
            loss_record['train'].append(mse_loss.detach().cpu().item())

        # After each epoch, test your model on the validation (development) set.
        dev_mse = dev(dv_set, model, device)
        if dev_mse < min_mse:
            # Save model if your model improved
            min_mse = dev_mse
            print('Saving model (epoch = {:4d}, loss = {:.4f})'
                  .format(epoch + 1, min_mse))
            torch.save(model.state_dict(), config['save_path'])  # Save model to specified path
            early_stop_cnt = 0
        else:
            early_stop_cnt += 1

        epoch += 1
        loss_record['dev'].append(dev_mse)
        if early_stop_cnt > config['early_stop']:
            # Stop training if your model stops improving for "config['early_stop']" epochs.
            break

    print('Finished training after {} epochs'.format(epoch))
    return min_mse, loss_record
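The early-stopping rule in the loop above can be replayed on a plain list of numbers to see exactly when it fires. In this sketch the function name run_early_stopping and the dev losses are made up; the counter logic mirrors the training loop, including the strict > comparison.

```python
def run_early_stopping(dev_losses, patience):
    """Replay the loop's early-stopping rule on per-epoch dev losses."""
    best = 1000.0       # same initial value as min_mse in train()
    bad_epochs = 0      # epochs since the last improvement
    epochs_run = 0
    for loss in dev_losses:
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        epochs_run += 1
        if bad_epochs > patience:
            break
    return best, epochs_run


best, epochs_run = run_early_stopping([1.0, 0.8, 0.9, 0.85, 0.95, 0.7], patience=2)
print(best, epochs_run)  # 0.8 5
```

Because the comparison is strict, training stops only after patience + 1 consecutive epochs without improvement, so the final value 0.7 in this made-up sequence is never reached.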
Validation
def dev(dv_set, model, device):
    model.eval()  # set model to evaluation mode
    total_loss = 0
    for x, y in dv_set:  # iterate through the dataloader
        x, y = x.to(device), y.to(device)  # move data to device (cpu/cuda)
        with torch.no_grad():  # disable gradient calculation
            pred = model(x)  # forward pass (compute output)
            mse_loss = model.cal_loss(pred, y)  # compute loss
        total_loss += mse_loss.detach().cpu().item() * len(x)  # accumulate loss
    total_loss = total_loss / len(dv_set.dataset)  # compute averaged loss

    return total_loss
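The dev function multiplies each batch loss by len(x) before dividing by the dataset size. A small arithmetic sketch with made-up per-sample errors shows why: when the last batch is smaller, a plain average of batch means would not equal the mean over all samples.

```python
# Per-sample squared errors for 5 samples, split into batches of size 2, 2, 1.
errors = [1.0, 3.0, 2.0, 4.0, 10.0]
batches = [errors[0:2], errors[2:4], errors[4:5]]

# What a mean-reduced MSE loss would return for each batch
batch_means = [sum(b) / len(b) for b in batches]

# Weighted by batch size, as in dev(): recovers the true per-sample mean
weighted = sum(m * len(b) for m, b in zip(batch_means, batches)) / len(errors)
# Plain average of batch means: over-weights the small last batch
unweighted = sum(batch_means) / len(batch_means)

print(weighted, unweighted)  # 4.0 5.0
```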
Testing
def test(tt_set, model, device):
    model.eval()  # set model to evaluation mode
    preds = []
    for x in tt_set:  # iterate through the dataloader
        x = x.to(device)  # move data to device (cpu/cuda)
        with torch.no_grad():  # disable gradient calculation
            pred = model(x)  # forward pass (compute output)
            preds.append(pred.detach().cpu())  # collect prediction
    preds = torch.cat(preds, dim=0).numpy()  # concatenate all predictions and convert to a numpy array
    return preds
Setup Hyper-parameters
config contains hyper-parameters for training and the path to save your model.
device = get_device()                 # get the current available device ('cpu' or 'cuda')
os.makedirs('models', exist_ok=True)  # The trained model will be saved to ./models/
target_only = False                   # TODO: Using 40 states & 2 tested_positive features

# TODO: How to tune these hyper-parameters to improve your model's performance?
config = {
    'n_epochs': 3000,                # maximum number of epochs
    'batch_size': 270,               # mini-batch size for dataloader
    'optimizer': 'SGD',              # optimization algorithm (optimizer in torch.optim)
    'optim_hparas': {                # hyper-parameters for the optimizer (depends on which optimizer you are using)
        'lr': 0.001,                 # learning rate of SGD
        'momentum': 0.9              # momentum for SGD
    },
    'early_stop': 200,               # early stopping epochs (the number of epochs since your model's last improvement)
    'save_path': 'models/model.pth'  # your model will be saved here
}
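The 'optimizer' entry is a class name that train() looks up in torch.optim with getattr, and 'optim_hparas' is unpacked into keyword arguments. The sketch below shows that lookup in isolation on a stand-in nn.Linear model; switching to Adam and the learning rate shown are assumptions for illustration, not recommended settings.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)  # stand-in model for illustration

config = {
    'optimizer': 'Adam',             # any optimizer class name in torch.optim
    'optim_hparas': {'lr': 0.001},   # assumed value; keys must match that optimizer's arguments
}

# Same lookup as in train(): resolve the class by name, then pass the hyper-parameters
optimizer = getattr(torch.optim, config['optimizer'])(
    model.parameters(), **config['optim_hparas'])

print(type(optimizer).__name__)  # Adam
```

Note that 'momentum' is an SGD argument; leaving it in 'optim_hparas' while switching to Adam would raise a TypeError.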
Load data and model
tr_set = prep_dataloader(tr_path, 'train', config['batch_size'], target_only=target_only)
dv_set = prep_dataloader(tr_path, 'dev', config['batch_size'], target_only=target_only)
tt_set = prep_dataloader(tt_path, 'test', config['batch_size'], target_only=target_only)
model = NeuralNet(tr_set.dataset.dim).to(device)# Construct model and move to device
Start Training!
model_loss, model_loss_record = train(tr_set, dv_set, model, config, device)
plot_learning_curve(model_loss_record, title='deep model')
del model
model = NeuralNet(tr_set.dataset.dim).to(device)
ckpt = torch.load(config['save_path'], map_location='cpu')# Load your best model
model.load_state_dict(ckpt)
plot_pred(dv_set, model, device)# Show prediction on the validation set
Testing
The predictions of your model on the testing set will be stored at pred.csv.
def save_pred(preds, file):
    ''' Save predictions to specified file '''
    print('Saving results to {}'.format(file))
    with open(file, 'w', newline='') as fp:  # newline='' avoids blank rows on Windows
        writer = csv.writer(fp)
        writer.writerow(['id', 'tested_positive'])
        for i, p in enumerate(preds):
            writer.writerow([i, p])

preds = test(tt_set, model, device)  # predict COVID-19 cases with your model
save_pred(preds, 'pred.csv')         # save prediction file to pred.csv
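To see what the saved file looks like without running the model, the same writer calls can be pointed at an in-memory buffer. The prediction values below are made up for illustration.

```python
import csv
import io

preds = [12.5, 7.25]  # made-up prediction values

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['id', 'tested_positive'])  # header row
for i, p in enumerate(preds):
    writer.writerow([i, p])                 # one row per test sample

print(repr(buffer.getvalue()))
# 'id,tested_positive\r\n0,12.5\r\n1,7.25\r\n'
```

csv.writer terminates rows with \r\n by default, which is why the csv documentation recommends opening real files with newline=''.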
Hints
Simple Baseline
- Run sample code
Medium Baseline
- Feature selection: 40 states + 2 tested_positive (TODO in dataset)
Strong Baseline
- Feature selection (what other features are useful?)
- DNN architecture (layers? dimension? activation function?)
- Training (mini-batch? optimizer? learning rate?)
- L2 regularization
- There are some mistakes in the sample code, can you find them?
If you copy or reuse this code, you must credit the original author.
E.g.
Source: Heng-Jui Chang @ NTUEE (https://github.com/ga642381/ML2021-Spring/blob/main/HW01/HW01.ipynb)
Optimization reference: https://github.com/wolfparticle/machineLearningDeepLearning/blob/main/homework_code/hw1/HW1_local参考代码/HW1_local.ipynb