Naive Bayes Text Classification with sklearn

This script reads local files for analysis. Tokenization supports both Chinese and English, and you can swap in jieba for word segmentation.
You can define your own training samples: the directory structure is the data_log folder in the current project, where first-level directories are the category labels and second-level entries are the sample files (see the sketch below).
The author's training set, for reference only: http://download.csdn.net/download/yl3395017/10236998
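For illustration, here is a minimal sketch of both points above. The file names under data_log are placeholders, and the jieba integration is an assumption for readers who want Chinese segmentation (jieba must be installed separately; tokenizer= is a standard CountVectorizer parameter):

# Expected data_log layout: first-level directories are the category
# labels, second-level entries are the sample files, e.g. (placeholder names)
#   data_log/category_a/sample1.txt
#   data_log/category_b/sample2.txt

# Optional: swap in jieba for Chinese word segmentation.
# CountVectorizer accepts any callable via its tokenizer= parameter.
import jieba
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(tokenizer=jieba.lcut)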


from sklearn.datasets import load_files

# Load the dataset.
training_data = load_files('./data_log', encoding='utf-8')

'''
Feature extraction, step 1: raw term counts (bag of words).
'''
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(training_data.data)

'''
Feature extraction, step 2: TF-IDF features built from the term counts.
'''
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

'''
Train a naive Bayes classifier and make a simple prediction.
'''
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, training_data.target)

docs_new = ['danger_degree:1; breaking_sighn:0; event']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, training_data.target_names[category]))

'''
Evaluate the model on a held-out test set.
'''
from sklearn import metrics
import numpy as np

# twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)
testing_data = load_files('./predict_test_log', encoding='utf-8')
docs_test = testing_data.data
X_test_counts = count_vect.transform(docs_test)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
predicted = clf.predict(X_test_tfidf)
print(metrics.classification_report(testing_data.target, predicted, target_names=testing_data.target_names))
print("accuracy\t" + str(np.mean(predicted == testing_data.target)))
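The same count / TF-IDF / naive Bayes steps can be chained with scikit-learn's Pipeline so that fit and predict work on raw text directly. A minimal sketch reusing the same data_log directory and sample document from above:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Chain counting -> TF-IDF -> naive Bayes into a single estimator.
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

training_data = load_files('./data_log', encoding='utf-8')
text_clf.fit(training_data.data, training_data.target)
predicted = text_clf.predict(['danger_degree:1; breaking_sighn:0; event'])
print(training_data.target_names[predicted[0]])

This also avoids a common pitfall: because the fitted vectorizer and transformer travel with the pipeline, new documents are always transformed with the vocabulary learned at training time.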


