Forward Maximum Matching Word Segmentation (Python)

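Forward maximum matching was the earliest algorithm proposed in China for Chinese word segmentation. Because it is simple and easy to implement, it is still used as a coarse first-pass segmenter. By today's standards its accuracy is far too low to be satisfactory on its own. This post is only a practice exercise.
The text to segment is: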
我和你共同创造美好的新生活
Lexicon:
共同，创造，美好，的，新，生活
Expected segmentation:
我 和 你 共同 创造 美好 的 新 生活

# Python 3.4.3
lexicon = ('共同', '创造', '美好', '的', '新', '生活')  # For convenience, the lexicon is hard-coded.
wordSeg = []    # List that collects the segmented words
maxWordLen = 3  # Maximum word length is set to 3

with open('test.txt', 'r', encoding='utf-8') as src:
    # strip() drops the trailing newline so it is not emitted as a one-character word
    sentence = src.read().strip()
sentenceLen = len(sentence)

startPoint = 0
while startPoint < sentenceLen:  # Scan from the first character to the last
    matched = False  # Assume no lexicon word matches
    # Try lengths from the maximum down to 1, never reading past the end of the sentence
    for i in range(min(maxWordLen, sentenceLen - startPoint), 0, -1):
        string = sentence[startPoint:startPoint + i]  # Characters startPoint .. startPoint+i-1
        if string in lexicon:
            wordSeg.append(string)
            matched = True
            break
    if not matched:  # Nothing in the lexicon matched
        i = 1
        wordSeg.append(sentence[startPoint])  # Fall back to a single-character word
    startPoint += i

with open('WordSeg.txt', 'w', encoding='utf-8') as des:
    for word in wordSeg:
        des.write(word + ' ')
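
The same greedy matching can also be wrapped in a function so it can be tried without any input or output files. The sketch below is a variant, not the script above: the name `forward_max_match` is made up for illustration, the lexicon is stored in a set, and the maximum word length is derived from the lexicon instead of being fixed at 3.

```python
def forward_max_match(sentence, lexicon):
    """Greedy left-to-right segmentation; unknown characters become single-character words."""
    words = set(lexicon)  # set membership is O(1) on average
    max_len = max((len(w) for w in words), default=1)  # longest lexicon entry
    result = []
    start = 0
    while start < len(sentence):
        # Try the longest candidate first, never reading past the end of the sentence.
        for size in range(min(max_len, len(sentence) - start), 0, -1):
            candidate = sentence[start:start + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in words:
                result.append(candidate)
                start += size
                break
    return result


lexicon = ('共同', '创造', '美好', '的', '新', '生活')
print(' '.join(forward_max_match('我和你共同创造美好的新生活', lexicon)))
# 我 和 你 共同 创造 美好 的 新 生活
```

Deriving the maximum length from the lexicon avoids testing candidates that are longer than any entry, which can never match.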

Segmentation result:
我 和 你 共同 创造 美好 的 新 生活
Recall: 100%
Precision: 100%
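
The post does not say how these two figures were computed. One common approach, sketched here as an assumption, compares character spans: a predicted word counts as correct only if a word with the same start and end offsets appears in the reference segmentation. The helper names `to_spans` and `precision_recall` are hypothetical, and the expected segmentation given earlier is used as the reference.

```python
def to_spans(words):
    """Convert a word list into a set of (start, end) character offsets."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans


def precision_recall(predicted, gold):
    """Precision = correct / predicted words; recall = correct / reference words."""
    pred_spans, gold_spans = to_spans(predicted), to_spans(gold)
    correct = len(pred_spans & gold_spans)
    return correct / len(pred_spans), correct / len(gold_spans)


gold = '我 和 你 共同 创造 美好 的 新 生活'.split()       # reference segmentation
predicted = '我 和 你 共同 创造 美好 的 新 生活'.split()  # segmenter output
precision, recall = precision_recall(predicted, gold)
print('precision = {:.0%}, recall = {:.0%}'.format(precision, recall))
# precision = 100%, recall = 100%
```

Both values are 100% here because every word in this one test sentence is either in the lexicon or a single character, so the result says little about accuracy on real text.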
This post was inspired by "Blueliner,fighting!!!", with thanks.