My NLP Practice Journey 02
After the data and model introductions in the previous post, we can now formally begin the data analysis.
Reading the Data
Without further ado, here is the code:
import pandas as pd
import numpy as np

path = './data/'
train = pd.read_csv(path + 'train_set.csv', sep='\t')
test = pd.read_csv(path + 'test_a.csv', sep='\t')
train.head()
This uses pandas' read_csv function; sep='\t' means the field separator is a tab character, and you can set it to whatever your file uses. One thing worth noting: when re-reading a CSV that I had saved earlier, I often ended up with an extra index column; setting index_col=0 avoids this.
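A minimal sketch of the index_col=0 behaviour, using a toy frame (not the real competition data) and an in-memory buffer instead of a file:

```python
import io
import pandas as pd

# Toy frame standing in for the saved data
df = pd.DataFrame({"label": [0, 1], "text": ["57 44 66", "7399 3750"]})
csv_text = df.to_csv()  # with no path, to_csv returns the CSV as a string

bad = pd.read_csv(io.StringIO(csv_text))                # index leaks in as "Unnamed: 0"
good = pd.read_csv(io.StringIO(csv_text), index_col=0)  # index restored cleanly

print(bad.columns.tolist())
print(good.columns.tolist())
```

The stray "Unnamed: 0" column in the first read is exactly the saved index; index_col=0 tells pandas to use it as the index again.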
The pandas function for saving a CSV file is df.to_csv(), where df is the DataFrame you read in. For example:
train.to_csv('./data/train.csv')
In addition, for larger files you can save the data in another format to speed up reading, such as HDF5:
train.to_hdf("train.h5", "train", format ="table", mode="w")
train=pd.read_hdf("train.h5", "train")
This noticeably speeds up loading. I can attest to this personally: with files around 7-8 GB, read times grow considerably, so a faster format is well worth it.
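Note that to_hdf requires the optional PyTables dependency. pandas' built-in pickle round-trip gives a similar binary-format speedup with no extra installs; a minimal sketch on a toy frame (not the real competition data):

```python
import pandas as pd

# Toy stand-in for the competition data
df = pd.DataFrame({"text": ["2967 6758 339 2021"] * 1000, "label": [0] * 1000})

df.to_pickle("train_demo.pkl")  # binary dump; reloads much faster than parsing CSV
restored = pd.read_pickle("train_demo.pkl")

print(restored.equals(df))
```

Pickle preserves dtypes exactly, so the reloaded frame is identical to the original.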
Also, reading a large CSV consumes a lot of memory. Because of my machine's limited specs I often ran out of memory; after studying code from more experienced practitioners, I found that converting columns to smaller dtypes can cut memory usage substantially. The code is below:
def reduce_memory(data):
    start_memory = data.memory_usage().sum() / 1024**2
    print("Memory usage of dataframe is:", start_memory, "MB")
    NAlist = []  # keeps track of columns that had missing values filled in
    for col in data.columns:
        # Only numeric columns; strings are excluded
        if ('int' in data[col].dtype.name) or ('float' in data[col].dtype.name):
            try:
                # Print current column type
                print("******************************")
                print("Column:", col)
                print("dtype before:", data[col].dtype)

                IsInt = False
                value_max = data[col].max()
                value_min = data[col].min()

                # Integer dtypes do not support NA, so NA must be filled first
                if not np.isfinite(data[col]).all():
                    NAlist.append(col)
                    data[col] = data[col].fillna(value_min - 1)

                # Test whether the column can be converted to an integer
                asint = data[col].fillna(0).astype(np.int64)
                result = (data[col] - asint).sum()
                if -0.01 < result < 0.01:
                    IsInt = True

                # Pick the smallest integer / unsigned integer dtype that fits
                if IsInt:
                    if value_min >= 0:
                        if value_max < 255:
                            data[col] = data[col].astype(np.uint8)
                        elif value_max < 65535:
                            data[col] = data[col].astype(np.uint16)
                        elif value_max < 4294967295:
                            data[col] = data[col].astype(np.uint32)
                        else:
                            data[col] = data[col].astype(np.uint64)
                    else:
                        if value_min > np.iinfo(np.int8).min and value_max < np.iinfo(np.int8).max:
                            data[col] = data[col].astype(np.int8)
                        elif value_min > np.iinfo(np.int16).min and value_max < np.iinfo(np.int16).max:
                            data[col] = data[col].astype(np.int16)
                        elif value_min > np.iinfo(np.int32).min and value_max < np.iinfo(np.int32).max:
                            data[col] = data[col].astype(np.int32)
                        else:
                            data[col] = data[col].astype(np.int64)
                else:
                    # Floats are downcast to 32 bit
                    data[col] = data[col].astype(np.float32)

                print("dtype after:", data[col].dtype)
                print("******************************")
            except Exception:
                print("dtype after: Failed")
        else:
            print("dtype remain:", data[col].dtype)

    print("___MEMORY USAGE AFTER COMPLETION:___")
    end_memory = data.memory_usage().sum() / 1024**2
    print("Memory usage is:", end_memory, "MB")
    print("This is", 100 * end_memory / start_memory, "% of the initial size")
    print("Missing value list:", NAlist)
    return data
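For simple cases, much the same downcasting is available out of the box through pd.to_numeric with its downcast argument; a minimal sketch with a hypothetical column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"count": np.arange(1000, dtype=np.int64)})
before = df["count"].memory_usage()

# Downcast to the smallest unsigned integer dtype that can hold 0..999
df["count"] = pd.to_numeric(df["count"], downcast="unsigned")
after = df["count"].memory_usage()

print(df["count"].dtype)   # uint16, since 999 exceeds the uint8 range
print(after < before)
```

int64 uses 8 bytes per value and uint16 only 2, so memory for this column drops to roughly a quarter.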
Data Analysis
Training set size: 200,000 rows
Test set size: 50,000 rows
The labels cover 14 classes in total; as the table shows, the higher the class index, the fewer training examples it has.
train.groupby('label').count()/len(train)
![Label distribution table](https://img.it610.com/image/info8/bfcba3a9d653456ba76af76c8a92c95b.jpg)
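The same per-class proportions can be read off directly with value_counts; a sketch on hypothetical labels mimicking the skew described above:

```python
import pandas as pd

# Hypothetical labels: class 0 most frequent, higher indices rarer
train_demo = pd.DataFrame({"label": [0, 0, 0, 1, 1, 2]})
dist = train_demo["label"].value_counts(normalize=True).sort_index()
print(dist)
```

normalize=True converts raw counts into fractions of the dataset, which makes the class imbalance immediately visible.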
Text length statistics:
train['count']=train['text'].apply(lambda x:len(x.split(' ')))
train['count'].describe()
![Text length statistics from describe()](https://img.it610.com/image/info8/d6bb1b4c7cb1449bab30c693f4a1f0fa.jpg)
The describe function shows the maximum, minimum, mean, and other statistics of the text length; the mean length is around 907 tokens.
import matplotlib.pyplot as plt

# Per-label length statistics (train_count was not defined above;
# it is reconstructed here from the 'count' column)
train_count = train.groupby('label')['count'].agg(['max', 'min', 'mean']).reset_index()

figure = plt.figure()
ax1 = figure.add_subplot(3, 1, 1)
ax1.plot(train_count['label'], train_count['max'])
ax2 = figure.add_subplot(3, 1, 2)
ax2.plot(train_count['label'], train_count['min'])
ax3 = figure.add_subplot(3, 1, 3)
ax3.plot(train_count['label'], train_count['mean'])
![Max/min/mean text length per label](https://img.it610.com/image/info8/1c5a6401a7764474ba36afa8d3073c25.jpg)
Length statistics per label; this doesn't seem particularly useful, though...
1. Assuming that characters 3750, 900, and 648 are the sentence punctuation marks, how many sentences does each news article contain on average?
import re

train['count'] = train['text'].apply(lambda x: len(re.split('3750|900|648', x)))
print(train['count'].mean())
# Output: 80.80237
2. Count the most frequent character in each news category.
from collections import Counter

train = pd.read_csv('./data/train_set.csv', sep='\t')
for i in range(14):
    tmp = train[train['label'] == i]['text']
    word_count = Counter(" ".join(tmp.values.tolist()).split())
    print(i, word_count.most_common(1)[0])
# Output:
# 0 ('3750', 1267331)
# 1 ('3750', 1200686)
# 2 ('3750', 1458331)
# 3 ('3750', 774668)
# 4 ('3750', 360839)
# 5 ('3750', 715740)
# 6 ('3750', 469540)
# 7 ('3750', 428638)
# 8 ('3750', 242367)
# 9 ('3750', 178783)
# 10 ('3750', 180259)
# 11 ('3750', 83834)
# 12 ('3750', 87412)
# 13 ('3750', 33796)
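Since '3750' tops every single class, it carries essentially no class information; excluding the presumed punctuation characters gives a more informative per-class statistic. A sketch on toy data (the real train_set.csv is assumed unavailable here):

```python
from collections import Counter

import pandas as pd

# Toy stand-in for train_set.csv (the real file is not bundled here)
train_demo = pd.DataFrame({
    "label": [0, 0, 1],
    "text": ["3750 648 57 57", "3750 57 900", "3750 2465 2465"],
})
punct = {"3750", "900", "648"}  # the presumed punctuation characters

for label, texts in train_demo.groupby("label")["text"]:
    # Drop punctuation tokens before counting
    words = [w for w in " ".join(texts).split() if w not in punct]
    print(label, Counter(words).most_common(1)[0])
```

With punctuation removed, the top character per class starts to differ between classes, which is a first hint of class-discriminative vocabulary.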