NLP Notes
Day1 1. Download the dataset
2. Inspect the dataset contents and understand the task
The task statement is as follows
[Image: task statement]
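One way to inspect the file is to load only a few rows and look at the columns. The sketch below uses a small in-memory sample rather than the real train_set.csv. The tab separator and a text column of space-separated token ids match what the Day2 code reads; the label column, the helper name peek, and all sample values are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical sample mimicking the layout: tab-separated, one row per document.
# The 'label' column and every value below are assumptions, not real data.
sample_tsv = (
    "label\ttext\n"
    "2\t57 44 66 56 2\n"
    "11\t4464 486 6352\n"
    "3\t7346 4068 5074 3747 5681\n"
)

def peek(source, nrows=5):
    """Load at most nrows rows of a tab-separated file into a DataFrame."""
    return pd.read_csv(source, sep='\t', nrows=nrows)

sample_df = peek(io.StringIO(sample_tsv))
print(sample_df.shape)    # (rows, columns)
print(sample_df.dtypes)   # column types
print(sample_df.head())   # first rows
```

For the real data, a file path such as './train_set.csv' can be passed to peek in place of the StringIO object.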
Day2
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter

# Load the training set (tab-separated). nrows=100 reads only the first 100 rows;
# the counts quoted in the comments below are too large to come from a 100-row sample.
train_df = pd.read_csv('./train_set.csv', sep='\t', nrows=100)

# Number of space-separated tokens in each text
train_df['text_len'] = train_df['text'].apply(lambda x: len(x.split(' ')))
print(train_df['text_len'].describe())

# Histogram of text lengths
_ = plt.hist(train_df['text_len'], bins=200)
plt.xlabel('Text char count')
plt.title("Histogram of char count")

# Total occurrences of each token across all texts
all_lines = ' '.join(list(train_df['text']))
word_count = Counter(all_lines.split(" "))
word_count = sorted(word_count.items(), key=lambda d: d[1], reverse=True)

print(len(word_count))
# 6869
print(word_count[0])
# ('3750', 7482224)
print(word_count[-1])
# ('3133', 1)

# Number of texts each token appears in (each token is counted once per text)
train_df['text_unique'] = train_df['text'].apply(lambda x: ' '.join(list(set(x.split(' ')))))
all_lines = ' '.join(list(train_df['text_unique']))
word_count = Counter(all_lines.split(" "))
word_count = sorted(word_count.items(), key=lambda d: int(d[1]), reverse=True)

print(word_count[0])
# ('3750', 197997)
print(word_count[1])
# ('900', 197653)
print(word_count[2])
# ('648', 191975)
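The two counts above measure different things: the first counts every occurrence of a token, the second counts how many texts contain it at least once. A minimal sketch on a toy corpus makes the difference visible. The documents and the names docs, total_count, doc_freq, and coverage are made up for illustration.

```python
from collections import Counter

# Hypothetical toy corpus: each document is a string of space-separated token ids.
docs = ["1 2 2 3", "2 3 3", "2 4"]

# Total occurrences: count every token in every document.
total_count = Counter(tok for doc in docs for tok in doc.split(' '))

# Document frequency: count each token at most once per document.
doc_freq = Counter(tok for doc in docs for tok in set(doc.split(' ')))

# Share of documents that contain each token.
coverage = {tok: n / len(docs) for tok, n in doc_freq.items()}

print(total_count.most_common(2))
print(doc_freq.most_common(2))
print(coverage['2'])
```

Here token '2' occurs four times in total but in three documents. Tokens with a high document frequency, such as '3750', '900' and '648' in the output above, are spread across many texts rather than concentrated in a few.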
Day3 Left blank for now