使用Python的潜在语义分析 _数据分析

本文概述

主题建模
文本分类和主题建模之间的比较
潜在语义分析
确定最佳主题数
使用Gensim实施LSA
LSA的优缺点
主题建模的用例
总结

发现主题对于多种目的都是有益的, 例如用于将文档聚类, 组织在线可用内容以进行信息检索和推荐。多个内容提供商和新闻社正在使用主题模型向读者推荐文章。同样, 招聘公司也正在使用这些方法来提取职位描述, 并将其与候选技能集相对应。如果你看到数据科学家的工作, 那就是从大量收集的数据中提取” 知识” 。通常, 收集的数据是非结构化的。你需要强大的工具和技术来分析和理解大量的非结构化数据。
主题建模是一种文本挖掘技术, 它提供了一种方法, 用于识别同现关键字以汇总大量文本信息。它有助于发现文档中的隐藏主题, 使用这些主题对文档进行注释, 以及组织大量非结构化数据。
在本教程中, 你将涵盖以下主题：

什么是主题建模？
文本分类和主题建模之间的比较
潜在语义分析
使用Gensim在Python中实现LSA
确定文档中的最佳主题数
LSA的优缺点
主题建模的用例
总结

主题建模主题建模会自动发现给定文档中的隐藏主题。它是一种无监督的文本分析算法, 用于从给定文档中查找单词组。这些词组代表一个主题。一个文档有可能与多个主题相关联。例如, “ 患者” , “ 医生” , “ 疾病” , “ 癌症” , “ 健康” 等组词代表主题” 医疗保健” 。与使用正则表达式的基于规则的文本搜索相比, 主题建模是一种不同的游戏。

文章图片
文本分类和主题建模之间的比较文本分类是有监督的机器学习问题, 其中文本文档或文章分类为一组预定义的类。主题建模是在文本文档中发现一组同时出现的单词的过程。这些组共同出现的相关单词构成” 主题” 。这是无监督学习的一种形式, 因此可能的主题集是未知的。主题建模可用于解决文本分类问题。主题建模将识别文档中存在的主题” , 而文本分类将文本分为一个类。

文章图片
潜在语义分析 LSA(潜在语义分析)也称为LSI(潜在语义索引)LSA使用单词袋(BoW)模型, 这会生成术语文档矩阵(文档中术语的出现)。行代表术语, 列代表文档。 LSA通过使用奇异值分解对文档项矩阵执行矩阵分解来学习潜在主题。 LSA通常用作降维或降噪技术。

文章图片
奇异值分解(SVD)
SVD是一种矩阵分解方法, 表示两个矩阵乘积中的一个矩阵。它在信号处理, 心理学, 社会学, 气候和大气科学, 统计和天文学方面提供了各种有用的应用程序。

文章图片

M是一个m×m矩阵
U是m×n的左奇异矩阵
Σ是一个n×n对角矩阵, 具有非负实数。
V是m×n的右奇异矩阵
V *是n×m矩阵, 是V的转置。

单位矩阵：这是一个方矩阵, 其中主对角线的所有元素均为1, 所有其他元素均为0。
对角矩阵：它是一个矩阵, 其中除主对角线之外的所有条目均为零。
奇异矩阵：如果矩阵的行列式为0或不具有矩阵逆的方阵, 则该矩阵是奇异的。
确定最佳主题数确定主题建模中的k(主题数)的最佳方法是什么？在给定的语料库文本中确定最佳主题数是一项艰巨的任务。我们可以使用以下选项来确定最佳主题数：

确定最佳主题数的一种方法是将每个主题视为一个聚类, 并使用Silhouette系数找出聚类的有效性。
主题连贯性度量是用于识别主题数量的现实度量。

主题一致性度量是一种广泛用于评估主题模型的度量。它使用潜在变量模型。每个生成的主题都有一个单词列表。在主题连贯性度量中, 你将找到主题中单词的成对单词相似度得分的平均值/中位数。主题连贯评分模型的高价值将被认为是一个很好的主题模型。
使用Gensim实施LSA 导入所需的库

#import modulesimport os.pathfrom gensim import corporafrom gensim.models import LsiModelfrom nltk.tokenize import RegexpTokenizerfrom nltk.corpus import stopwordsfrom nltk.stem.porter import PorterStemmerfrom gensim.models.coherencemodel import CoherenceModelimport matplotlib.pyplot as plt

加载数据中首先创建用于加载article.csv的数据加载功能。你可以在此处下载数据。

def load_data(path, file_name):"""Input: path and file_namePurpose: loading text fileOutput : list of paragraphs/documents andtitle(initial 100 words considred as title of document)"""documents_list = []titles=[]with open( os.path.join(path, file_name) , "r") as fin:for line in fin.readlines():text = line.strip()documents_list.append(text)print("Total Number of Documents:", len(documents_list))titles.append( text[0:min(len(text), 100)] )return documents_list, titles

预处理数据 【使用Python的潜在语义分析】数据加载功能完成后, 需要对文本进行预处理。采取以下步骤对文本进行预处理：

标记文本文章
删除停用词
在文本关节上执行词干

def preprocess_data(doc_set):"""Input: docuemnt listPurpose: preprocess text (tokenize, removing stopwords, and stemming)Output : preprocessed text"""# initialize regex tokenizertokenizer = RegexpTokenizer(r'\w+')# create English stop words listen_stop = set(stopwords.words('english'))# Create p_stemmer of class PorterStemmerp_stemmer = PorterStemmer()# list for tokenized documents in looptexts = []# loop through document listfor i in doc_set:# clean and tokenize document stringraw = i.lower()tokens = tokenizer.tokenize(raw)# remove stop words from tokensstopped_tokens = [i for i in tokens if not i in en_stop]# stem tokensstemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]# add tokens to listtexts.append(stemmed_tokens)return texts

准备语料库下一步是准备语料库。在这里, 你需要创建一个文档术语矩阵和术语词典。

def prepare_corpus(doc_clean):"""Input: clean documentPurpose: create term dictionary of our courpus and Converting list of documents (corpus) into Document Term MatrixOutput : term dictionary and Document Term Matrix"""# Creating the term dictionary of our courpus, where every unique term is assigned an index. dictionary = corpora.Dictionary(doc_clean)dictionary = corpora.Dictionary(doc_clean)# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]# generate LDA modelreturn dictionary, doc_term_matrix

使用Gensim创建LSA模型创建语料库后, 你可以使用LSA生成模型。

def create_gensim_lsa_model(doc_clean, number_of_topics, words):"""Input: clean document, number of topics and number of words associated with each topicPurpose: create LSA model using gensimOutput : return LSA model"""dictionary, doc_term_matrix=prepare_corpus(doc_clean)# generate LSA modellsamodel = LsiModel(doc_term_matrix, num_topics=number_of_topics, id2word = dictionary)# train modelprint(lsamodel.print_topics(num_topics=number_of_topics, num_words=words))return lsamodel

确定主题数还需要采取额外的步骤来通过确定最佳主题数量来优化结果。在这里, 你将生成连贯分数, 以确定最佳主题数。

def compute_coherence_values(dictionary, doc_term_matrix, doc_clean, stop, start=2, step=3):"""Input: dictionary : Gensim dictionarycorpus : Gensim corpustexts : List of input textsstop : Max num of topicspurpose : Compute c_v coherence for various number of topicsOutput: model_list : List of LSA topic modelscoherence_values : Coherence values corresponding to the LDA model with respective number of topics"""coherence_values = []model_list = []for num_topics in range(start, stop, step):# generate LSA modelmodel = LsiModel(doc_term_matrix, num_topics=number_of_topics, id2word = dictionary)# train modelmodel_list.append(model)coherencemodel = CoherenceModel(model=model, texts=doc_clean, dictionary=dictionary, coherence='c_v')coherence_values.append(coherencemodel.get_coherence())return model_list, coherence_values

让我们绘制连贯性得分值。

def plot_graph(doc_clean, start, stop, step):dictionary, doc_term_matrix=prepare_corpus(doc_clean)model_list, coherence_values = compute_coherence_values(dictionary, doc_term_matrix, doc_clean, stop, start, step)# Show graphx = range(start, stop, step)plt.plot(x, coherence_values)plt.xlabel("Number of Topics")plt.ylabel("Coherence score")plt.legend(("coherence_values"), loc='best')plt.show()

start, stop, step=2, 12, 1plot_graph(clean_text, start, stop, step)

文章图片
你可以轻松评估此图。在这里, 你在X轴上有许多主题, 在Y轴上有相干得分。在主题数中, 有7个具有最高的连贯性得分, 因此最佳主题数是7。
运行以上所有功能

# LSA Modelnumber_of_topics=7words=10document_list, titles=load_data("", "articles.txt")clean_text=preprocess_data(document_list)model=create_gensim_lsa_model(clean_text, number_of_topics, words)

Total Number of Documents: 4551[(0, '0.361*"trump" + 0.272*"say" + 0.233*"said" + 0.166*"would" + 0.160*"clinton" + 0.140*"peopl" + 0.136*"one" + 0.126*"campaign" + 0.123*"year" + 0.110*"time"'), (1, '-0.389*"citi" + -0.370*"v" + -0.356*"h" + -0.355*"2016" + -0.354*"2017" + -0.164*"unit" + -0.159*"west" + -0.157*"manchest" + -0.116*"apr" + -0.112*"dec"'), (2, '0.612*"trump" + 0.264*"clinton" + -0.261*"eu" + -0.148*"say" + -0.137*"would" + 0.135*"donald" + -0.134*"leav" + -0.134*"uk" + 0.119*"republican" + -0.110*"cameron"'), (3, '-0.400*"min" + 0.261*"eu" + -0.183*"goal" + -0.152*"ball" + -0.132*"play" + 0.128*"said" + 0.128*"say" + -0.126*"leagu" + 0.122*"leav" + -0.122*"game"'), (4, '0.404*"bank" + -0.305*"eu" + -0.290*"min" + 0.189*"year" + -0.164*"leav" + -0.153*"cameron" + 0.143*"market" + 0.140*"rate" + -0.139*"vote" + -0.133*"say"'), (5, '0.310*"bank" + -0.307*"say" + -0.221*"peopl" + 0.203*"trump" + 0.166*"1" + 0.164*"min" + 0.163*"0" + 0.152*"eu" + 0.152*"market" + -0.138*"like"'), (6, '0.570*"say" + 0.237*"min" + -0.170*"vote" + 0.158*"govern" + -0.154*"poll" + 0.122*"tax" + 0.115*"statement" + 0.115*"bank" + 0.112*"budget" + -0.108*"one"')]

主题1：” 王牌” , “ 说” , “ 说” , “ 将” , “ 克林顿” , “ 人” , “ 一个” , “ 竞选” , “ 年” , “ 时间” ‘ (美国总统选举)
主题2：” citi” , ” v” , ” h” , ” 2016″ , ” 2017″ , ” unit” , ” west” , ” manchest” , ” apr” , ” dec” ‘ (英超)
主题3：” 王牌” , “ 克林顿” , “ 欧盟” , “ 说” , “ 将” , “ 唐纳德” , “ 叶” , “ 英国” , “ 共和党” , “ 卡梅隆” (美国总统大选, 脱欧)
主题4：min” , ” eu” , ” goal” , ” ball” , ” play” , ” said” , ” say” , ” leagu” , ” leav” , ” game” (英超)
主题5：” 银行” , “ 欧盟” , “ 最低” , “ 年” , “ 最低” , “ 喀麦隆” , “ 市场” , “ 利率” , “ 投票” , “ 说” (脱欧和市场状况)
主题6：” 银行” , “ 说” , “ 人” , “ 王牌” , ” 1″ , “ 最小” , “ 欧盟” , “ 市场” , “ 喜欢” (政治情况和市场条件)
主题7：” 说” , “ 分钟” , “ 投票” , “ 政府” , “ 民意测验” , “ 税” , “ 声明” , “ 银行” , “ 预算” , “ 一项” (美国总统选举和财务计划)

在这里, 使用潜在语义分析发现了7个主题。其中一些是重叠的主题。为了更准确地捕获多个含义, 我们需要尝试LDA(潜在Dirichlet分配)。我将把这作为练习留给你, 使用Gensim进行尝试并分享你的观点。
LSA的优缺点 LSA算法是最简单的方法, 易于理解和实现。与向量空间模型相比, 它还提供了更好的结果。与其他可用算法相比, 它更快, 因为它仅涉及文档术语矩阵分解。
潜在主题的维度取决于矩阵的等级, 因此我们无法扩展该限制。 LSA分解矩阵是一个高密度矩阵, 因此很难索引各个维。 LSA无法捕获单词的多种含义。与LDA(潜在的Dirichlet分配)相比, 实现起来并不容易。它提供的精度低于LDA。
主题建模的用例使用此技术的简单应用程序是文本分析, 推荐系统和信息检索中记录的群集。主题建模的更详细用例是：

简历概述：它可以帮助招聘人员快速浏览简历。他们可以减少筛选简历的工作量。
搜索引擎优化：通过识别主题和关联的关键字, 可以轻松标记在线文章, 博客和文档, 从而可以优化搜索结果。
推荐系统优化：推荐系统根据用户个人资料和以前的历史记录充当信息过滤器和顾问。它可以帮助我们根据过去的访问来发现未访问的相关内容。
改善客户支持：发现客户投诉和反馈中的相关主题和相关关键字, 以获取示例产品和服务规格, 部门和分支机构的详细信息。此类信息可帮助公司在各个部门直接投诉。
在医疗保健行业, 主题建模可以帮助我们从非结构化医疗报告中提取有用和有价值的信息。此信息可用于患者治疗和医学科学研究。

总结在本教程中, 你涵盖了有关主题建模的许多详细信息。你已经了解了什么是主题建模, 什么是潜在语义分析, 如何构建各自的模型, 如何使用LSA生成主题。此外, 你还介绍了一些基本概念, 例如奇异值分解, 主题一致性得分。
希望你现在可以利用主题建模来分析自己的数据集。感谢你阅读本教程！
如果你想了解有关Python的更多信息, 请参加srcmini的统计思维案例研究课程。