Heartbeat Anomaly Detection -- Training a 1D-Convolution CNN with Keras and K-Fold Cross-Validation

This comes from a competition on AI研习社 (AI Yanxishe).
The goal is to decide whether an ECG recording is normal, a two-class problem: normal = 0, abnormal = 1.
After downloading the dataset and opening ptbdb_train.csv, you will find 7000 rows; the first 187 columns of each row are the ECG signal and the last column is the label.
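As a quick check of that layout, the csv can be loaded with pandas and its shape printed (a minimal sketch; the '../heartbeat/' path is the one used in the training code later on):

import pandas as pd

# Load the training csv (no header row) and inspect its shape.
train_df = pd.read_csv('../heartbeat/ptbdb_train.csv', header=None)
print(train_df.shape)        # expected: (7000, 188) -> 187 signal columns + 1 label column
print(train_df.iloc[0, -1])  # label of the first sample (0 or 1)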
This is a binary classification task. Because the data are time series with contextual dependencies, a 1D convolutional network is a natural fit; an LSTM could also work, but I am less familiar with LSTMs.
This walkthrough (Heartbeat Anomaly Detection -- Training a 1D-Convolution CNN with Keras and K-Fold Cross-Validation) is organized into the following parts:
Data preprocessing
Model construction
Model training
Model testing
Data preprocessing. Data preprocessing has two parts: 1. converting the labels to one-hot encoding, and 2. reading the data.
1. Convert labels to one-hot encoding. Each row has one label with value 0 or 1, and the network built later has two outputs, so the label is converted to one-hot form.

import numpy as np

# Convert a label to a one-hot vector
def convert2oneHot(index, Lens):
    hot = np.zeros((Lens,))
    hot[int(index)] = 1
    return hot
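A quick usage check (the expected outputs follow directly from the function's definition):

print(convert2oneHot(0, 2))  # [1. 0.] -> normal
print(convert2oneHot(1, 2))  # [0. 1.] -> abnormal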

2. Data reading. This part reads both the training data and the test data, using generators (yield) to produce batches.
import math

# Training / validation data generator
def train_gen(df, batch_size=20, train=True):
    img_list = np.array(df)
    # number of batches per pass over the data; the train flag is kept for the
    # call sites below but does not change the step count
    steps = math.ceil(img_list.shape[0] / batch_size)
    while True:
        for i in range(steps):
            batch_list = img_list[i * batch_size : i * batch_size + batch_size]
            np.random.shuffle(batch_list)
            batch_x = np.array([file for file in batch_list[:, :-1]])
            batch_y = np.array([convert2oneHot(label, 2) for label in batch_list[:, -1]])
            yield batch_x, batch_y

# Test data generator
def test_gen(df, batch_size=20):
    img_list = np.array(df)
    steps = math.ceil(len(img_list) / batch_size)  # number of batches per pass
    while True:
        for i in range(steps):
            batch_list = img_list[i * batch_size : i * batch_size + batch_size]
            batch_x = np.array([file for file in batch_list[:, :]])
            print(batch_x.shape)
            yield batch_x
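Before training it is worth pulling one batch from the generator to confirm the shapes (a minimal sanity check, assuming train_data_df has already been loaded as in the training code further down):

gen = train_gen(train_data_df, batch_size=20, train=True)
batch_x, batch_y = next(gen)
print(batch_x.shape)  # (20, 187) -> 20 samples, 187 time steps each
print(batch_y.shape)  # (20, 2)   -> one-hot labels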

Model construction. The model is built with the Keras framework; the Sequential API makes stacking layers very convenient, so that is what is used here.
Each sample is a time series of length 187.
Below is a 1D-convolution model put together fairly casually; it still needs tuning to fit the data better.
from keras.models import Sequential
from keras.layers import Reshape, Conv1D, MaxPooling1D, GlobalAveragePooling1D, Dropout, Dense

TIME_PERIODS = 187

def build_model(input_shape=(TIME_PERIODS,), num_classes=2):
    model = Sequential()
    model.add(Reshape((TIME_PERIODS, 1), input_shape=input_shape))
    model.add(Conv1D(16, 8, strides=2, activation='relu', input_shape=(TIME_PERIODS, 1)))
    model.add(Conv1D(16, 8, strides=2, activation='relu', padding="same"))
    # model.add(MaxPooling1D(2))
    model.add(Conv1D(64, 4, strides=2, activation='relu', padding="same"))
    model.add(Conv1D(64, 4, strides=2, activation='relu', padding="same"))
    # model.add(MaxPooling1D(2))
    model.add(Conv1D(256, 4, strides=2, activation='relu', padding="same"))
    model.add(Conv1D(256, 4, strides=2, activation='relu', padding="same"))
    # model.add(MaxPooling1D(2))
    model.add(Conv1D(512, 2, strides=1, activation='relu', padding="same"))
    model.add(Conv1D(512, 2, strides=1, activation='relu', padding="same"))
    # model.add(MaxPooling1D(2))
    model.add(GlobalAveragePooling1D())
    model.add(Dropout(0.3))
    model.add(Dense(num_classes, activation='softmax'))
    return model
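A quick usage check: building the model and printing its summary confirms that the final Dense layer has two softmax outputs, matching the one-hot labels.

model = build_model()
model.summary()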

Model training. With the data and the model ready, training can start. scikit-learn's KFold is used for cross-validated training.
There are 10 folds, so in each fold 90% of the data is used for training and 10% for validation.
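The split itself only produces index arrays; here is a minimal sketch of what KFold hands out per fold, illustrated on a toy array of 10 rows instead of the real dataframe:

from sklearn.model_selection import KFold
import numpy as np

toy = np.arange(10)
skf = KFold(n_splits=10, random_state=233, shuffle=True)
for fold_idx, (train_idx, val_idx) in enumerate(skf.split(toy)):
    # 9 indices go to training and 1 to validation in each of the 10 folds
    print(fold_idx, train_idx, val_idx)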
import pandas as pd
import keras
from keras.optimizers import Adam
from sklearn.model_selection import KFold

data_fath = ['../heartbeat/ptbdb_train.csv', '../heartbeat/ptbdb_test.csv']
train_data = pd.read_csv(data_fath[0], header=None)
train_data_df = pd.DataFrame(train_data)

batch_size = 20
skf = KFold(n_splits=10, random_state=233, shuffle=True)
for flod_idx, (train_idx, val_idx) in enumerate(skf.split(train_data_df, train_data_df)):
    train_data = train_data_df.iloc[train_idx]
    val_data = train_data_df.iloc[val_idx]
    len_train = train_data.shape[0]
    len_val = val_data.shape[0]

    train_iterr = train_gen(train_data, batch_size, True)
    val_iterr = train_gen(val_data, batch_size, False)

    # Checkpoint callback: training leaves 10 model files, one per fold.
    ckpt = keras.callbacks.ModelCheckpoint(
        filepath='best_model_{}.h5'.format(flod_idx),
        monitor='val_loss',
        save_best_only=True,
        verbose=1)

    model = build_model()
    # Adam optimizer; another optimizer could be swapped in here.
    opt = Adam(0.0002)
    model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
    print(model.summary())

    model.fit_generator(
        generator=train_iterr,
        steps_per_epoch=len_train // batch_size,
        epochs=100,
        initial_epoch=0,
        validation_data=val_iterr,
        validation_steps=len_val // batch_size,
        callbacks=[ckpt],
    )

The training log looks roughly like this:
Epoch 00001: val_loss improved from inf to 0.36047, saving model to best_model_0.h5
Epoch 2/100
315/315 [==============================] - 3s 11ms/step - loss: 0.4000 - accuracy: 0.8365 - val_loss: 0.3006 - val_accuracy: 0.8700
Epoch 00002: val_loss improved from 0.36047 to 0.30064, saving model to best_model_0.h5
Epoch 3/100
315/315 [==============================] - 3s 11ms/step - loss: 0.3306 - accuracy: 0.8684 - val_loss: 0.2190 - val_accuracy: 0.8771
Epoch 00003: val_loss improved from 0.30064 to 0.21901, saving model to best_model_0.h5
Epoch 4/100
315/315 [==============================] - 3s 9ms/step - loss: 0.2674 - accuracy: 0.8948 - val_loss: 0.1419 - val_accuracy: 0.9014
Epoch 00004: val_loss improved from 0.21901 to 0.14192, saving model to best_model_0.h5
Epoch 5/100
315/315 [==============================] - 4s 13ms/step - loss: 0.2161 - accuracy: 0.9181 - val_loss: 0.1205 - val_accuracy: 0.9271
Epoch 00005: val_loss improved from 0.14192 to 0.12052, saving model to best_model_0.h5
Epoch 6/100
315/315 [==============================] - 4s 11ms/step - loss: 0.1775 - accuracy: 0.9346 - val_loss: 0.1750 - val_accuracy: 0.9257

Model testing. After training there are 10 models. Each one is run on the test set, the predictions are fused, and the result is written to a csv file.
from keras.models import load_model

batch_size = 20
result = np.zeros(shape=(1000,))
data_fath = ['../heartbeat/ptbdb_train.csv', '../heartbeat/ptbdb_test.csv']
test_data = pd.read_csv(data_fath[1], header=None)
test_data_df = pd.DataFrame(test_data)

# Test data generator; each model consumes exactly one full pass
# over the 1000 test rows (50 batches of 20).
test_iter = test_gen(test_data_df, batch_size=20)

for i in range(10):
    h5 = './best_model_{}.h5'.format(i)
    model = load_model(h5)
    pres = model.predict_generator(generator=test_iter, steps=math.ceil(1000 / batch_size), verbose=1)
    print('pres.shape is {}'.format(pres.shape))
    ohpres = np.argmax(pres, axis=1)
    print('ohpres.shape is {}'.format(ohpres.shape))
    print(type(ohpres))
    result += ohpres
    print('result shape is {}'.format(result.shape))

# Majority vote across the 10 models: label 1 only if more than 5 models predict 1.
result = [1.0 if result[i] > 5 else 0.0 for i in range(len(result))]
df = pd.DataFrame()
df["id"] = np.arange(0, len(result))
df["label"] = result
df.to_csv("submmit.csv", header=None, index=None)
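The fusion at the end is a plain majority vote: each of the 10 models casts a 0/1 vote per sample, the votes are summed, and a sample is labelled 1 only when more than 5 models voted 1 (so a 5-5 tie resolves to 0). A toy illustration:

votes = np.array([7, 3, 5, 10, 0])               # summed 0/1 votes from 10 models, one entry per sample
labels = [1.0 if v > 5 else 0.0 for v in votes]
print(labels)  # [1.0, 0.0, 0.0, 1.0, 0.0]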

The whole pipeline runs end to end, but as the saying goes: a tiger offline, a 55 online... the validation scores look great while the leaderboard score is mediocre. The model architecture is not well designed and needs to be reworked.
I will tidy up the code and upload it to GitHub later as a record.
Game Over!
