Algorithms | [Machine Learning] COVID-19 CT Image Recognition Based on Logistic Regression


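Contents

  • [Machine Learning] COVID-19 CT Image Recognition Based on Logistic Regression
    • 1. Linear Models and Regression
    • 2. Solving for the Parameters with Least Squares
    • 3. Log-Linear Regression
    • 4. Logistic Regression
    • 5. COVID-19 CT Image Recognition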

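[Machine Learning] COVID-19 CT Image Recognition Based on Logistic Regression
This post uses Logistic Regression to recognize COVID-19 in CT images. It walks through the concepts first and then the code, so that each step of the implementation is easy to follow.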
1. Linear Models and Regression
$$
f(x)=w_{1} x_{1}+w_{2} x_{2}+\ldots+w_{d} x_{d}+b
$$
where $x=(x_1,x_2,\ldots,x_d)$ is a sample described by $d$ attributes. In vectorized form:
$$
f(x)=w^{T} x+b
$$
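As a minimal sketch of the vectorized form, the snippet evaluates $f(x)=w^{T}x+b$ with NumPy. The function name `linear_model` and the numeric values of `w`, `b`, and `x` are made up for illustration and do not come from the original post.

```python
import numpy as np

def linear_model(x, w, b):
    """Return f(x) = w^T x + b for one sample (1-D) or a batch (2-D, one row per sample)."""
    return np.asarray(x) @ np.asarray(w) + b

# Hypothetical parameters for a sample with d = 3 attributes.
w = np.array([0.5, -1.0, 2.0])
b = 0.25
x = np.array([2.0, 1.0, 0.5])

# 0.5*2.0 - 1.0*1.0 + 2.0*0.5 + 0.25 = 1.25
print(linear_model(x, w, b))
```

Passing a 2-D array with one sample per row returns one prediction per row, which is the same matrix product the TensorFlow code in section 5 relies on.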

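2. Solving for the Parameters with Least Squares
The goal of linear regression is to minimize the error between the predicted values and the true values.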
$$
\begin{aligned}
\left(w^{*}, b^{*}\right) &=\underset{(w, b)}{\arg \min } \sum_{i=1}^{m}\left(f\left(x_{i}\right)-y_{i}\right)^{2} \\
&=\underset{(w, b)}{\arg \min } \sum_{i=1}^{m}\left(y_{i}-w x_{i}-b\right)^{2}
\end{aligned}
$$
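Writing the squared error as $E_{(w,b)}=\sum_{i=1}^{m}\left(y_{i}-w x_{i}-b\right)^{2}$, we find its minimum by taking the partial derivatives with respect to $w$ and $b$ and setting them to zero.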
$$
\left.\begin{array}{c}
\frac{\partial E_{(w, b)}}{\partial w}=2\left(w \sum_{i=1}^{m} x_{i}^{2}-\sum_{i=1}^{m}\left(y_{i}-b\right) x_{i}\right)=0 \\
\frac{\partial E_{(w, b)}}{\partial b}=2\left(m b-\sum_{i=1}^{m}\left(y_{i}-w x_{i}\right)\right)=0
\end{array}\right\}
$$
Solving these two equations gives:
$$
\begin{gathered}
w=\frac{\sum_{i=1}^{m} y_{i}\left(x_{i}-\bar{x}\right)}{\sum_{i=1}^{m} x_{i}^{2}-\frac{1}{m}\left(\sum_{i=1}^{m} x_{i}\right)^{2}} \\
b=\frac{1}{m} \sum_{i=1}^{m}\left(y_{i}-w x_{i}\right)
\end{gathered}
$$
where $\bar{x}=\frac{1}{m} \sum_{i=1}^{m} x_{i}$.
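The closed-form solution can be checked numerically. The sketch below implements the two formulas for $w$ and $b$ directly and compares the result with NumPy's `np.polyfit`. The function name `fit_least_squares` and the toy data are assumptions made for this illustration.

```python
import numpy as np

def fit_least_squares(x, y):
    """Closed-form least squares for the one-dimensional model y ≈ w*x + b."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m = x.size
    x_bar = x.mean()
    # w = sum(y_i * (x_i - x_bar)) / (sum(x_i^2) - (sum(x_i))^2 / m)
    w = np.sum(y * (x - x_bar)) / (np.sum(x ** 2) - np.sum(x) ** 2 / m)
    # b = mean(y_i - w * x_i)
    b = np.mean(y - w * x)
    return w, b

# Toy data: points near the line y = 2x + 1 with small hand-picked offsets.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0 + np.array([0.1, -0.1, 0.0, 0.1, -0.1])

w, b = fit_least_squares(x, y)
w_ref, b_ref = np.polyfit(x, y, deg=1)  # returns [slope, intercept]

print(w, b)
print(np.allclose([w, b], [w_ref, b_ref]))
```

Both routes minimize the same squared error, so they should agree up to floating-point rounding.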
3. Log-Linear Regression
The purpose of log-linear regression is to let a linear model fit a nonlinear, more complex target function.
$$
\begin{gathered}
f(x)=wx+b \\
g(x)=e^{x} \\
g(f(x))=e^{wx+b}
\end{gathered}
$$
In other words, a target of the form $y=e^{wx+b}$ becomes linear after taking logarithms: $\ln y=wx+b$.
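A short sketch of this idea: generate data from $y=e^{wx+b}$, fit a straight line to $\ln y$, and map the prediction back through the exponential. The values `true_w` and `true_b` are invented for the example, and `np.polyfit` stands in for the least-squares solution from section 2.

```python
import numpy as np

# Hypothetical ground-truth parameters, used only to generate toy data.
true_w, true_b = 1.5, -0.5
x = np.linspace(0.0, 2.0, 20)
y = np.exp(true_w * x + true_b)  # noise-free exponential target

# Fit a straight line to ln(y): this is ordinary linear regression.
w, b = np.polyfit(x, np.log(y), deg=1)

# Map the linear prediction back through g(x) = e^x.
y_hat = np.exp(w * x + b)

print(w, b)
print(np.allclose(y_hat, y))
```

The model stays linear in its parameters; only the target is transformed, which is why the same least-squares machinery still applies.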

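4. Logistic Regression
Purpose of Logistic Regression: despite the word "regression" in its name, it is in essence a classification algorithm.
The essence of Logistic Regression: it is a discriminative model. Building on linear regression, it applies the sigmoid function to squash the output of the linear model into the interval (0, 1), which gives the output a probabilistic interpretation.
The sigmoid function, which yields the class probability: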
$$
h_{\theta}(x)=g\left(\theta^{T} x\right)=\frac{1}{1+e^{-\theta^{T} x}}
$$
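A minimal NumPy sketch of the sigmoid, written so that large negative inputs do not overflow in `exp`. The name `sigmoid` and the test inputs are illustrative choices, not code from the original post.

```python
import numpy as np

def sigmoid(z):
    """Numerically stable sigmoid: 1 / (1 + exp(-z))."""
    z = np.asarray(z, dtype=float)
    e = np.exp(-np.abs(z))  # always in (0, 1], so exp never overflows
    # For z >= 0 use 1 / (1 + e^{-z}); for z < 0 use the equivalent e^{z} / (1 + e^{z}).
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

print(sigmoid(0.0))
print(sigmoid(np.array([-1000.0, -2.0, 2.0, 1000.0])))
```

Mathematically the output lies strictly inside (0, 1), but in floating point it can round to exactly 0 or 1 for extreme inputs. That is one reason to clip predictions before taking a logarithm.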

The loss function of Logistic Regression: once the sigmoid function has produced a predicted class probability, a loss function is used to optimize the logistic model.
Suppose the training set is
$$
\left\{\left(x^{1}, y^{1}\right),\left(x^{2}, y^{2}\right), \ldots,\left(x^{m}, y^{m}\right)\right\}
$$
Let $x=\left[x_{0}, x_{1}, \ldots, x_{n}\right]^{T}$ with $x_{0}=1$, so that each sample has $n$ features, and let $y \in\{0,1\}$.
The loss function is defined as:
$$
\begin{aligned}
&J(\theta)=\frac{1}{m} \sum_{i=1}^{m} \operatorname{cost}\left(h_{\theta}\left(x^{i}\right), y^{i}\right) \\
&\operatorname{cost}\left(h_{\theta}(x), y\right)= \begin{cases}-\log \left(h_{\theta}(x)\right) & \text { if } y=1 \\ -\log \left(1-h_{\theta}(x)\right) & \text { if } y=0\end{cases}
\end{aligned}
$$
Combining the two cases into one expression:
$$
\operatorname{cost}\left(h_{\theta}(x), y\right)=-y \log \left(h_{\theta}(x)\right)-(1-y) \log \left(1-h_{\theta}(x)\right)
$$
Substituting this into $J(\theta)$ gives the cost function:
$$
J(\theta)=-\frac{1}{m} \sum_{i=1}^{m}\left[y^{i} \log \left(h_{\theta}\left(x^{i}\right)\right)+\left(1-y^{i}\right) \log \left(1-h_{\theta}\left(x^{i}\right)\right)\right]
$$
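The cost function $J(\theta)$ can be sketched in a few lines of NumPy. The function name `binary_cross_entropy`, the clipping constant `eps`, and the example probabilities are assumptions made for this illustration.

```python
import numpy as np

def binary_cross_entropy(h, y, eps=1e-9):
    """J = -mean(y*log(h) + (1-y)*log(1-h)) for predicted probabilities h and labels y in {0, 1}."""
    # Clip so that log never receives exactly 0.
    h = np.clip(np.asarray(h, dtype=float), eps, 1.0 - eps)
    y = np.asarray(y, dtype=float)
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

# Made-up labels and predicted probabilities of class 1.
y = np.array([1, 0, 1, 0])
mostly_right = np.array([0.9, 0.1, 0.8, 0.2])
mostly_wrong = 1.0 - mostly_right

print(binary_cross_entropy(mostly_right, y))
print(binary_cross_entropy(mostly_wrong, y))
```

Confident correct predictions give a small cost, and confident wrong predictions are penalized heavily, which is the behavior the piecewise definition is designed to produce.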
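Gradient descent for Logistic Regression: gradient descent is used to find the parameters that minimize the cost function. With learning rate $\alpha$, every parameter $\theta_{j}$ is updated as follows: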
$$
\begin{aligned}
\theta_{j} &=\theta_{j}-\alpha \frac{\partial J(\theta)}{\partial \theta_{j}} \\
&=\theta_{j}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{i}\right)-y^{i}\right) x_{j}^{i}
\end{aligned}
$$
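The update rule can be sketched as batch gradient descent in NumPy. Everything here is illustrative: the function name `fit_logistic_regression`, the learning rate and step count, and the synthetic two-cluster data are assumptions, not the setup used for the CT images.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, alpha=0.1, steps=2000):
    """Batch gradient descent on the cross-entropy cost J(theta)."""
    m = X.shape[0]
    Xb = np.hstack([np.ones((m, 1)), X])  # prepend x_0 = 1 for the bias term
    theta = np.zeros(Xb.shape[1])
    for _ in range(steps):
        h = sigmoid(Xb @ theta)
        grad = Xb.T @ (h - y) / m        # (1/m) * sum_i (h(x^i) - y^i) * x_j^i
        theta -= alpha * grad            # theta_j = theta_j - alpha * dJ/dtheta_j
    return theta

# Synthetic data: two Gaussian clusters, one per class.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)),
               rng.normal(2.0, 1.0, size=(50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

theta = fit_logistic_regression(X, y)
probs = sigmoid(np.hstack([np.ones((100, 1)), X]) @ theta)
acc = np.mean((probs >= 0.5) == y)
print(theta, acc)
```

The vectorized line `Xb.T @ (h - y) / m` computes the gradient for all $j$ at once, so all parameters are updated simultaneously, as the formula requires.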
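Generalizing Logistic Regression (Softmax Regression):
Logistic Regression solves binary classification problems. For multi-class problems the usual choice is softmax regression, which is the generalization of Logistic Regression to multiple classes.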
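To make the relationship concrete, the sketch below implements a softmax and checks that, with two classes, the probability of class 0 equals the sigmoid of the difference between the two logits. The function names and the example logits are invented for the illustration.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical two-class logits, one row per sample.
logits = np.array([[2.0, -1.0],
                   [0.5, 0.5],
                   [-3.0, 1.0]])

p = softmax(logits)
p_class0 = sigmoid(logits[:, 0] - logits[:, 1])

print(p)
print(np.allclose(p[:, 0], p_class0))
```

Each softmax row sums to 1, whereas applying a sigmoid independently to each of two output columns gives two probabilities that need not sum to 1.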
5. COVID-19 CT Image Recognition
Dataset and code:
Link: https://pan.baidu.com/s/1Ay4Y3Cr-0i–dlDl0zt1xg
Extraction code: 7irt
    import os

    import cv2
    import matplotlib.pyplot as plt
    import numpy as np
    import tensorflow as tf
    from sklearn.model_selection import train_test_split

    # Two classes: normal and COVID
    num_classes = 2
    # 128 * 128 pixels per image
    num_features = 16384
    # Keep the learning rate small
    learning_rate = 0.0001
    # Basic hyperparameters
    training_steps = 1000
    batch_size = 32
    display_step = 200

    # Build the dataset: one sub-folder per class under ./CT/
    root_path = "./CT/"
    label_of = {'CT_COVID': 0, 'CT_NonCOVID': 1}
    imgs = []
    labels = []

    for files in os.listdir(root_path):
        if files not in label_of:
            continue  # skip anything that is not one of the two class folders
        path = os.path.join(root_path, files)
        for img_name in os.listdir(path):
            img_path = os.path.join(path, img_name)
            img = cv2.imread(img_path, 0)  # 0 = read as grayscale
            if img is None:
                continue  # skip files that OpenCV cannot decode
            img = cv2.resize(img, (128, 128))
            imgs.append(img)
            labels.append(label_of[files])

    # Split into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(imgs, labels, test_size=0.2)
    print(y_test)

    # Convert images to float32 arrays and labels to integer arrays
    x_train, x_test = np.array(x_train, np.float32), np.array(x_test, np.float32)
    y_train, y_test = np.array(y_train), np.array(y_test)

    print(x_train.shape)
    print(x_test.shape)

    # Flatten each image into a 1-D vector of 16384 features (128 * 128)
    x_train, x_test = x_train.reshape([-1, num_features]), x_test.reshape([-1, num_features])
    # Normalize pixel values from [0, 255] to [0, 1]
    x_train, x_test = x_train / 255, x_test / 255

    # Shuffle and batch the data
    train_data = tf.data.Dataset.from_tensor_slices((x_train, y_train))
    train_data = train_data.repeat().shuffle(5000).batch(batch_size).prefetch(1)

    # Weight matrix of shape [16384, 2]: number of features by number of classes
    W = tf.Variable(tf.ones([num_features, num_classes]), name="weight")
    # Bias of shape [2]: number of classes
    b = tf.Variable(tf.zeros([num_classes]), name="bias")

    # Logistic regression model; softmax is used here, the sigmoid variant is commented out
    def logistic_regression(x):
        return tf.nn.softmax(tf.matmul(x, W) + b)
        # return tf.nn.sigmoid(tf.matmul(x, W) + b)

    # Cross-entropy loss
    def cross_entropy(y_pred, y_true):
        # One-hot encode the labels
        y_true = tf.one_hot(y_true, depth=num_classes)
        # Clip the predictions to avoid log(0)
        y_pred = tf.clip_by_value(y_pred, 1e-9, 1.)
        # Sum over classes for each sample, then average over the batch
        return tf.reduce_mean(-tf.reduce_sum(y_true * tf.math.log(y_pred), axis=1))

    # Accuracy
    def accuracy(y_pred, y_true):
        correct_prediction = tf.equal(tf.argmax(y_pred, 1), tf.cast(y_true, tf.int64))
        return tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

    # Optimizer: Adam
    optimizer = tf.optimizers.Adam(learning_rate)

    # One optimization step
    def run_optimization(x, y):
        # Wrap the computation in a GradientTape for automatic differentiation
        with tf.GradientTape() as g:
            pred = logistic_regression(x)
            loss = cross_entropy(pred, y)
        # Compute the gradients
        gradients = g.gradient(loss, [W, b])
        # Update W and b from the gradients
        optimizer.apply_gradients(zip(gradients, [W, b]))

    # Training loop
    for step, (batch_x, batch_y) in enumerate(train_data.take(training_steps), 1):
        # Update W and b
        run_optimization(batch_x, batch_y)
        if step % display_step == 0:
            pred = logistic_regression(batch_x)
            loss = cross_entropy(pred, batch_y)
            acc = accuracy(pred, batch_y)
            print("step: %i, loss: %f, accuracy: %f" % (step, loss, acc))

    # Evaluate the model on the held-out test set
    pred = logistic_regression(x_test)
    print("Test Accuracy: %f" % accuracy(pred, y_test))

    # Visualize predictions
    n_images = 5
    test_images = x_test[:n_images]
    font = {'color': 'red', 'size': 20, 'family': 'Times New Roman', 'style': 'italic'}
    predictions = logistic_regression(test_images)

    for i in range(n_images):
        label = np.argmax(predictions.numpy()[i])
        print(label)
        plt.imshow(np.reshape(test_images[i], [128, 128]))
        if label == 0:
            plt.text(28, 0.1, "Prediction : COVID", fontdict=font)
        else:
            plt.text(28, 0.1, "Prediction : Normal", fontdict=font)
        plt.show()
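A single accuracy number does not show which class the errors fall on. As a possible complement to the test accuracy, the sketch below builds a confusion matrix and per-class recall with NumPy. The helper `confusion_matrix` and the label arrays are made up for illustration; they are not results from the CT dataset. The class coding follows the script (0 = COVID, 1 = non-COVID).

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes=2):
    """cm[t, p] counts samples whose true class is t and predicted class is p."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Made-up labels and predictions, for illustration only.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1, 1, 0, 1, 0])

cm = confusion_matrix(y_true, y_pred)
recall_per_class = cm.diagonal() / cm.sum(axis=1)

print(cm)
print(recall_per_class)
```

With the training script, `y_pred` could be obtained as `np.argmax(logistic_regression(x_test).numpy(), axis=1)` and compared against `y_test`.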

Ablation study on the COVID-19 CT images:
Method                 Accuracy
Logistic Regression    0.4800
Softmax Regression     0.7133
Quantitative analysis:
[Two figures in the original post; images not preserved]
