My NLP Practice Journey 02
After the data and model introductions in the previous post, we can now formally begin the data analysis.
Reading the Data
Without further ado, here is the code:
import pandas as pd
import numpy as np

path = './data/'
train = pd.read_csv(path + 'train_set.csv', sep='\t')
test = pd.read_csv(path + 'test_a.csv', sep='\t')
train.head()
This uses pandas' read_csv function; sep='\t' means the field separator is a tab character, and you can set it to whatever your file uses. One thing worth noting: when re-reading a CSV that I had saved earlier, I often ended up with an extra index column; setting index_col=0 avoids this.
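A minimal sketch of the index_col=0 behaviour, using a toy frame (not the real competition data) and an in-memory buffer instead of a file:

```python
import io
import pandas as pd

# Toy frame standing in for the saved data
df = pd.DataFrame({"label": [0, 1], "text": ["57 44 66", "7399 3750"]})
csv_text = df.to_csv()  # with no path, to_csv returns the CSV as a string

bad = pd.read_csv(io.StringIO(csv_text))                # index leaks in as "Unnamed: 0"
good = pd.read_csv(io.StringIO(csv_text), index_col=0)  # index restored cleanly

print(bad.columns.tolist())
print(good.columns.tolist())
```

The stray "Unnamed: 0" column in the first read is exactly the saved index; index_col=0 tells pandas to use it as the index again.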
The pandas function for saving a CSV file is df.to_csv(), where df is the DataFrame you read in. For example:
train.to_csv('./data/train.csv')
In addition, for larger files you can save the data in another format to speed up reading, such as HDF5:
train.to_hdf("train.h5", "train", format ="table", mode="w")
train=pd.read_hdf("train.h5", "train")
This noticeably speeds up loading. I can attest to this personally: with files around 7-8 GB, read times grow considerably, so a faster format is well worth it.
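Note that to_hdf requires the optional PyTables dependency. pandas' built-in pickle round-trip gives a similar binary-format speedup with no extra installs; a minimal sketch on a toy frame (not the real competition data):

```python
import pandas as pd

# Toy stand-in for the competition data
df = pd.DataFrame({"text": ["2967 6758 339 2021"] * 1000, "label": [0] * 1000})

df.to_pickle("train_demo.pkl")  # binary dump; reloads much faster than parsing CSV
restored = pd.read_pickle("train_demo.pkl")

print(restored.equals(df))
```

Pickle preserves dtypes exactly, so the reloaded frame is identical to the original.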
Also, reading a large CSV consumes a lot of memory. Because of my machine's limited specs I often ran out of memory; after studying code from more experienced practitioners, I found that converting columns to smaller dtypes can cut memory usage substantially. The code is below:
def reduce_memory(data):
    start_memory = data.memory_usage().sum() / 1024**2
    print("Memory usage of dataframe is:", start_memory, "MB")
    NAlist = []  # keeps track of columns that had missing values filled in
    for col in data.columns:
        # Only numeric columns; strings are excluded
        if ('int' in data[col].dtype.name) or ('float' in data[col].dtype.name):
            try:
                # Print current column type
                print("******************************")
                print("Column:", col)
                print("dtype before:", data[col].dtype)

                IsInt = False
                value_max = data[col].max()
                value_min = data[col].min()

                # Integer dtypes do not support NA, so NA must be filled first
                if not np.isfinite(data[col]).all():
                    NAlist.append(col)
                    data[col] = data[col].fillna(value_min - 1)

                # Test whether the column can be converted to an integer
                asint = data[col].fillna(0).astype(np.int64)
                result = (data[col] - asint).sum()
                if -0.01 < result < 0.01:
                    IsInt = True

                # Pick the smallest integer / unsigned integer dtype that fits
                if IsInt:
                    if value_min >= 0:
                        if value_max < 255:
                            data[col] = data[col].astype(np.uint8)
                        elif value_max < 65535:
                            data[col] = data[col].astype(np.uint16)
                        elif value_max < 4294967295:
                            data[col] = data[col].astype(np.uint32)
                        else:
                            data[col] = data[col].astype(np.uint64)
                    else:
                        if value_min > np.iinfo(np.int8).min and value_max < np.iinfo(np.int8).max:
                            data[col] = data[col].astype(np.int8)
                        elif value_min > np.iinfo(np.int16).min and value_max < np.iinfo(np.int16).max:
                            data[col] = data[col].astype(np.int16)
                        elif value_min > np.iinfo(np.int32).min and value_max < np.iinfo(np.int32).max:
                            data[col] = data[col].astype(np.int32)
                        else:
                            data[col] = data[col].astype(np.int64)
                else:
                    # Floats are downcast to 32 bit
                    data[col] = data[col].astype(np.float32)

                print("dtype after:", data[col].dtype)
                print("******************************")
            except Exception:
                print("dtype after: Failed")
        else:
            print("dtype remain:", data[col].dtype)

    print("___MEMORY USAGE AFTER COMPLETION:___")
    end_memory = data.memory_usage().sum() / 1024**2
    print("Memory usage is:", end_memory, "MB")
    print("This is", 100 * end_memory / start_memory, "% of the initial size")
    print("Missing value list:", NAlist)
    return data
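For simple cases, much the same downcasting is available out of the box through pd.to_numeric with its downcast argument; a minimal sketch with a hypothetical column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"count": np.arange(1000, dtype=np.int64)})
before = df["count"].memory_usage()

# Downcast to the smallest unsigned integer dtype that can hold 0..999
df["count"] = pd.to_numeric(df["count"], downcast="unsigned")
after = df["count"].memory_usage()

print(df["count"].dtype)   # uint16, since 999 exceeds the uint8 range
print(after < before)
```

int64 uses 8 bytes per value and uint16 only 2, so memory for this column drops to roughly a quarter.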
Data Analysis
Training set size: 200,000 rows
Test set size: 50,000 rows
The labels cover 14 classes in total; as the table shows, the higher the class index, the fewer training examples it has.
train.groupby('label').count()/len(train)
![Label distribution table](https://img.it610.com/image/info8/bfcba3a9d653456ba76af76c8a92c95b.jpg)
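The same per-class proportions can be read off directly with value_counts; a sketch on hypothetical labels mimicking the skew described above:

```python
import pandas as pd

# Hypothetical labels: class 0 most frequent, higher indices rarer
train_demo = pd.DataFrame({"label": [0, 0, 0, 1, 1, 2]})
dist = train_demo["label"].value_counts(normalize=True).sort_index()
print(dist)
```

normalize=True converts raw counts into fractions of the dataset, which makes the class imbalance immediately visible.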
Text length statistics:
train['count']=train['text'].apply(lambda x:len(x.split(' ')))
train['count'].describe()
![Text length statistics from describe()](https://img.it610.com/image/info8/d6bb1b4c7cb1449bab30c693f4a1f0fa.jpg)
The describe function shows the maximum, minimum, mean, and other statistics of the text length; the mean length is around 907 tokens.
import matplotlib.pyplot as plt

# Per-label length statistics (train_count was not defined above;
# it is reconstructed here from the 'count' column)
train_count = train.groupby('label')['count'].agg(['max', 'min', 'mean']).reset_index()

figure = plt.figure()
ax1 = figure.add_subplot(3, 1, 1)
ax1.plot(train_count['label'], train_count['max'])
ax2 = figure.add_subplot(3, 1, 2)
ax2.plot(train_count['label'], train_count['min'])
ax3 = figure.add_subplot(3, 1, 3)
ax3.plot(train_count['label'], train_count['mean'])
![Max/min/mean text length per label](https://img.it610.com/image/info8/1c5a6401a7764474ba36afa8d3073c25.jpg)
Length statistics per label; this doesn't seem particularly useful, though...
1. Assuming that characters 3750, 900, and 648 are the sentence punctuation marks, how many sentences does each news article contain on average?
import re

train['count'] = train['text'].apply(lambda x: len(re.split('3750|900|648', x)))
print(train['count'].mean())
# Output: 80.80237
2. Count the most frequent character in each news category.
from collections import Counter

train = pd.read_csv('./data/train_set.csv', sep='\t')
for i in range(14):
    tmp = train[train['label'] == i]['text']
    word_count = Counter(" ".join(tmp.values.tolist()).split())
    print(i, word_count.most_common(1)[0])
# Output:
# 0 ('3750', 1267331)
# 1 ('3750', 1200686)
# 2 ('3750', 1458331)
# 3 ('3750', 774668)
# 4 ('3750', 360839)
# 5 ('3750', 715740)
# 6 ('3750', 469540)
# 7 ('3750', 428638)
# 8 ('3750', 242367)
# 9 ('3750', 178783)
# 10 ('3750', 180259)
# 11 ('3750', 83834)
# 12 ('3750', 87412)
# 13 ('3750', 33796)
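Since '3750' tops every single class, it carries essentially no class information; excluding the presumed punctuation characters gives a more informative per-class statistic. A sketch on toy data (the real train_set.csv is assumed unavailable here):

```python
from collections import Counter

import pandas as pd

# Toy stand-in for train_set.csv (the real file is not bundled here)
train_demo = pd.DataFrame({
    "label": [0, 0, 1],
    "text": ["3750 648 57 57", "3750 57 900", "3750 2465 2465"],
})
punct = {"3750", "900", "648"}  # the presumed punctuation characters

for label, texts in train_demo.groupby("label")["text"]:
    # Drop punctuation tokens before counting
    words = [w for w in " ".join(texts).split() if w not in punct]
    print(label, Counter(words).most_common(1)[0])
```

With punctuation removed, the top character per class starts to differ between classes, which is a first hint of class-discriminative vocabulary.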