
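Background
Anyone who has used NLTK, the well-known NLP toolkit, knows that the 'model' subpackage was removed when NLTK moved to version 3.0. Many of the interfaces it depended on changed substantially, the migration of 'model' ran into problems, and the maintainers removed it temporarily but have yet to merge it back. That is a real pity, because it contained the Ngram model we use so often!
However, the maintainers do provide a base class for Ngram models, BaseNgramModel, on the 'model' branch, and users can build their own models on top of it. Starting from this base class, I implemented a recursive NgramCounter and then re-implemented the Katz backoff smoothed Ngram model from 2.x. The code is on GitHub. Below is a brief walkthrough of the implementation.

BaseNgramModel
Let's first look at what BaseNgramModel looks like: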

# Names the excerpt relies on
from math import log

from nltk import compat

NEG_INF = float("-inf")


@compat.python_2_unicode_compatible
class BaseNgramModel(object):
    """An example of how to consume NgramCounter to create a language model.

    This class isn't intended to be used directly, folks should inherit from it
    when writing their own ngram models.
    """

    def __init__(self, ngram_counter):
        self.ngram_counter = ngram_counter
        # for convenient access save top-most ngram order ConditionalFreqDist
        self.ngrams = ngram_counter.ngrams[ngram_counter.order]
        self._ngrams = ngram_counter.ngrams
        self._order = ngram_counter.order

        self._check_against_vocab = self.ngram_counter.check_against_vocab

    def check_context(self, context):
        """Makes sure context not longer than model's ngram order and is a tuple."""
        if len(context) >= self._order:
            raise ValueError("Context is too long for this ngram order: {0}".format(context))
        # ensures the context argument is a tuple
        return tuple(context)

    def score(self, word, context):
        """
        This is a dummy implementation. Child classes should define their own
        implementations.

        :param word: the word to get the probability of
        :type word: str
        :param context: the context the word is in
        :type context: Tuple[str]
        """
        return 0.5

    def logscore(self, word, context):
        """
        Evaluate the log probability of this word in this context.

        This implementation actually works, child classes don't have to
        redefine it.

        :param word: the word to get the probability of
        :type word: str
        :param context: the context the word is in
        :type context: Tuple[str]
        """
        score = self.score(word, context)
        if score == 0.0:
            return NEG_INF
        return log(score, 2)

    def entropy(self, text):
        """
        Calculate the approximate cross-entropy of the n-gram model for a
        given evaluation text.
        This is the average log probability of each word in the text.

        :param text: words to use for evaluation
        :type text: Iterable[str]
        """
        normed_text = (self._check_against_vocab(word) for word in text)
        H = 0.0  # entropy is conventionally denoted by "H"
        processed_ngrams = 0
        for ngram in self.ngram_counter.to_ngrams(normed_text):
            context, word = tuple(ngram[:-1]), ngram[-1]
            H += self.logscore(word, context)
            processed_ngrams += 1
        return - (H / processed_ngrams)

    def perplexity(self, text):
        """
        Calculates the perplexity of the given text.
        This is simply 2 ** cross-entropy for the text.

        :param text: words to calculate perplexity of
        :type text: Iterable[str]
        """
        return pow(2.0, self.entropy(text))
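
To make the arithmetic in logscore, entropy and perplexity concrete, here is a minimal standalone sketch. It does not use NLTK: the free functions below are hypothetical stand-ins for the methods above, and they take a list of probabilities directly instead of a text.

```python
from math import log2

NEG_INF = float("-inf")


def logscore(score):
    """Base-2 log probability, with -inf standing in for log(0)."""
    return NEG_INF if score == 0.0 else log2(score)


def entropy(scores):
    """Negated average log probability over the scored n-grams."""
    total = 0.0
    count = 0
    for s in scores:
        total += logscore(s)
        count += 1
    return -(total / count)


def perplexity(scores):
    """Perplexity is 2 raised to the cross-entropy."""
    return 2.0 ** entropy(scores)


# The base class's placeholder score() always returns 0.5,
# so every n-gram contributes exactly one bit.
dummy_scores = [0.5, 0.5, 0.5, 0.5]
print(entropy(dummy_scores))     # 1.0
print(perplexity(dummy_scores))  # 2.0
```

With the dummy score of 0.5 the result is the same for any text, which is why a derived class has to override score before these numbers mean anything.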

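As you can see, to re-implement NgramModel by inheriting from this class, we have two main tasks:
  1. Implement the class behind the initialization parameter ngram_counter
  2. Override the score method in the derived class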
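NgramCounter
From the code above we can see that the class of the ngram_counter parameter must provide the following attributes and methods:
  • order: attribute, int, the model order
  • ngrams: attribute, dict, the conditional frequency distributions (ConditionalFreqDist) of each order, keyed by order
  • vocabulary: attribute, set, the ngram vocabulary
  • to_ngrams: method, (list) -> yields tuple, generates ngrams from the input text
  • check_against_vocab: method, (str) -> str, maps a word according to the vocabulary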
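
The interface above can be sketched without NLTK to show what each member holds. This is only an illustration under assumptions: the helper names count_ngrams and count_all_orders are hypothetical, plain Counter objects stand in for ConditionalFreqDist, and the empty string is assumed as the left padding symbol.

```python
from collections import Counter, defaultdict


def to_ngrams(tokens, order, pad_left=True, left_pad_symbol=""):
    """Yield n-gram tuples, optionally left-padded with order - 1 symbols."""
    padded = list(tokens)
    if pad_left:
        padded = [left_pad_symbol] * (order - 1) + padded
    for i in range(len(padded) - order + 1):
        yield tuple(padded[i:i + order])


def count_ngrams(sentences, order):
    """Return (context -> Counter of next tokens, set of seen n-grams)."""
    counts = defaultdict(Counter)
    seen = set()
    for sent in sentences:
        for ngram in to_ngrams(sent, order):
            seen.add(ngram)
            counts[ngram[:-1]][ngram[-1]] += 1
    return counts, seen


def count_all_orders(sentences, order):
    """Collect the count tables for every order from `order` down to 1."""
    return {n: count_ngrams(sentences, n)[0] for n in range(order, 0, -1)}


tables = count_all_orders([["a", "b", "a", "b"]], 2)
print(tables[2][("a",)]["b"])  # 2: "b" follows "a" twice
print(tables[1][()]["a"])      # 2: unigram counts use the empty context
```

Keeping one table per order in a single dict is what later lets a backoff model drop from the bigram table to the unigram table when an n-gram was never seen.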
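Implementing these is a piece of cake. The only part that needs care is the recursive generation of the lower-order models, because the Katz backoff smoothed model relies on that data structure. As an aside, although Python makes no distinction between public and private attributes, try not to access attributes directly from outside the class; protect them with @property and @xxx.setter instead, for the usual reasons. The implementation is as follows: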
from nltk.probability import ConditionalFreqDist
from nltk.util import ngrams


class NgramCounter(object):
    """
    NgramCounter members required by the NLTK 3.0 base class 'BaseNgramModel':

    - order: attribute, int, the model order
    - ngrams: attribute, dict, the conditional frequency distributions of each order
    - vocabulary: attribute, set, the ngram vocabulary
    - to_ngrams: method, (list) -> yields tuple, generates ngrams from the input text
    - check_against_vocab: method, (str) -> str, maps a word according to the vocabulary
    """

    def __init__(self, order: int, train: list, pad_left: bool = True,
                 pad_right: bool = False, left_pad_symbol: str = '',
                 right_pad_symbol: str = '', recursive: bool = True):
        """
        :param order: model order
        :param train: training samples
        :param pad_left: whether to pad on the left
        :param pad_right: whether to pad on the right
        :param left_pad_symbol: left padding symbol
        :param right_pad_symbol: right padding symbol
        :param recursive: whether to build the lower-order models
        """
        self._ngrams = dict()

        # The model order must be greater than 0
        assert (order > 0), order
        self._order = order

        # Padding settings
        assert (isinstance(pad_left, bool))
        assert (isinstance(pad_right, bool))
        self._pad_left = pad_left
        self._pad_right = pad_right
        self._left_pad_symbol = left_pad_symbol
        self._right_pad_symbol = right_pad_symbol

        cfd = ConditionalFreqDist()
        self._vocabulary = set()

        # Input adaptation: if the training data is a single sentence
        # (a list of strings), wrap it in a list of sentences
        if (train is not None) and isinstance(train[0], str):
            train = [train]

        for sent in train:
            for ngram in self.to_ngrams(sent):
                self._vocabulary.add(ngram)
                context = tuple(ngram[:-1])
                token = ngram[-1]
                # NB: ConditionalFreqDist no longer has an 'inc' method,
                # so the count is incremented like this instead
                cfd[context][token] += 1

        self._ngrams[self._order] = cfd

        # NB, key code: recursively build the lower-order NgramCounter.
        # When recursing, also fetch the distributions of order - 2 down to 1.
        if recursive and not order == 1:
            self._backoff = NgramCounter(order - 1, train,
                                         pad_left=pad_left,
                                         left_pad_symbol=left_pad_symbol,
                                         pad_right=pad_right,
                                         right_pad_symbol=right_pad_symbol)
            # Walk the backoff chain and collect each lower-order distribution
            cursor = self._backoff
            while cursor is not None:
                self._ngrams[cursor.order] = cursor.ngrams[cursor.order]
                cursor = cursor.backoff
        else:
            self._backoff = None

    @property
    def order(self) -> int:
        return self._order

    @property
    def vocabulary(self) -> set:
        return self._vocabulary

    @property
    def ngrams(self) -> dict:
        return self._ngrams

    @property
    def backoff(self) -> 'NgramCounter':
        # None for a unigram counter or when recursive=False
        return self._backoff

    def check_against_vocab(self, word) -> str:
        """
        Unknown words are currently passed through unchanged.

        :param word:
        """
        return word

    def to_ngrams(self, text):
        # Returns an iterator over ngram tuples
        return ngrams(text, self._order,
                      pad_left=self._pad_left,
                      pad_right=self._pad_right,
                      left_pad_symbol=self._left_pad_symbol,
                      right_pad_symbol=self._right_pad_symbol)

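With a recursive NgramCounter in hand, we can inherit from BaseNgramModel and revive NgramModel. Two points need attention:
  1. Call the parent class constructor first, because it initializes the various attributes
  2. Take care with the recursion into the lower-order models
Talk is cheap, show me the code: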
from nltk.probability import ConditionalProbDist


class NgramModel(BaseNgramModel):
    """
    Re-implementation of NgramModel on top of the base class 'BaseNgramModel'.

    Note:
    1. The original methods 'prob' and 'logprob' have been renamed
       'score' and 'logscore' respectively.
    2. The original 'entropy' padded the input text explicitly, whereas the
       'entropy' of 'BaseNgramModel' does not. However, the base 'entropy'
       calls 'to_ngrams' of 'NgramCounter', which already pads, so we do not
       need to override 'entropy'.
    """

    def __init__(self, ngram_counter, estimator=None, *estimator_args, **estimator_kwargs):
        super(NgramModel, self).__init__(ngram_counter)

        # Set the probability estimator (smoother); fall back to the default.
        # '_estimator' is the module-level default, not shown in this excerpt.
        if estimator is None:
            estimator = _estimator

        # Use the estimator to build the ngram model
        if not estimator_args and not estimator_kwargs:
            self._model = ConditionalProbDist(self.ngrams, estimator, len(self.ngrams))
        else:
            self._model = ConditionalProbDist(self.ngrams, estimator,
                                              *estimator_args, **estimator_kwargs)

        # Left padding consumed by '_generate_one'. Assumption: it mirrors the
        # counter's padding settings; the original excerpt never defines it.
        self._lpad = ((ngram_counter._left_pad_symbol,) * (self._order - 1)
                      if ngram_counter._pad_left else ())

        # Recursively build the lower-order model
        if self._order > 1 and self.ngram_counter.backoff is not None:
            self._backoff = NgramModel(self.ngram_counter.backoff, estimator,
                                       *estimator_args, **estimator_kwargs)

    def score(self, word, context):
        """
        Evaluate the probability of this word in this context using Katz Backoff.

        :param word: the word to get the probability of
        :type word: str
        :param context: the context the word is in
        :type context: list(str)
        """
        context = tuple(context)
        # NB: the base class 'BaseNgramModel' has already bound '_ngrams' to the
        # counter's collection of ConditionalFreqDist objects. The vocabulary is
        # now the 'vocabulary' attribute of NgramCounter, so the 2.x test
        # if (context + (word,) in self._ngrams) or (self._n == 1):
        # becomes:
        if (context + (word,) in self.ngram_counter.vocabulary) or (self._order == 1):
            return self[context].prob(word)
        else:
            return self._alpha(context) * self._backoff.score(word, context[1:])

    def _alpha(self, tokens):
        return self._beta(tokens) / self._backoff._beta(tokens[1:])

    def _beta(self, tokens):
        return self[tokens].discount() if tokens in self else 1

    def choose_random_word(self, context):
        """
        Randomly select a word that is likely to appear in this context.

        :param context: the context the word is in
        :type context: list(str)
        """
        return self.generate(1, context)[-1]

    # NB, this will always start with same word if the model
    # was trained on a single text
    def generate(self, num_words, context=()):
        """
        Generate random text based on the language model.

        :param num_words: number of words to generate
        :type num_words: int
        :param context: initial words in generated string
        :type context: list(str)
        """
        text = list(context)
        for i in range(num_words):
            text.append(self._generate_one(text))
        return text

    def _generate_one(self, context):
        # 'self._n' from 2.x is now 'self._order'
        context = (self._lpad + tuple(context))[-self._order + 1:]
        if context in self:
            return self[context].generate()
        elif self._order > 1:
            return self._backoff._generate_one(context[1:])
        else:
            return '.'

    def __contains__(self, item):
        return tuple(item) in self._model

    def __getitem__(self, item):
        return self._model[tuple(item)]

    def __repr__(self):
        return '<NgramModel with %d %d-grams>' % (len(self.ngram_counter.vocabulary), self._order)
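
The backoff recursion in score, _alpha and _beta can be followed by hand on a toy example. The sketch below is not NLTK code and not a full Katz implementation: ToyBackoffModel is a hypothetical class, its probability tables are hand-written and assumed to be already discounted, and it only mirrors the shape of the three methods above.

```python
class ToyBackoffModel:
    """Backoff over hand-written, already-discounted probability tables."""

    def __init__(self, order, tables, seen):
        self.order = order
        self.table = tables[order]  # context -> {word: discounted prob}
        self.seen = seen            # n-gram tuples observed in training
        self.backoff = (ToyBackoffModel(order - 1, tables, seen)
                        if order > 1 else None)

    def beta(self, context):
        """Probability mass left over for words unseen after this context."""
        if context in self.table:
            return 1.0 - sum(self.table[context].values())
        return 1.0

    def alpha(self, context):
        """Backoff weight: leftover mass here over leftover mass one order down."""
        return self.beta(context) / self.backoff.beta(context[1:])

    def score(self, word, context):
        context = tuple(context)
        if context + (word,) in self.seen or self.order == 1:
            return self.table[context].get(word, 0.0)
        return self.alpha(context) * self.backoff.score(word, context[1:])


tables = {
    1: {(): {"a": 0.25, "b": 0.125, "c": 0.125}},  # leftover mass 0.5
    2: {("a",): {"b": 0.75}},                      # leftover mass 0.25
}
seen = {("a",), ("b",), ("c",), ("a", "b")}
model = ToyBackoffModel(2, tables, seen)

print(model.score("b", ("a",)))  # 0.75: the bigram ("a", "b") was seen
print(model.score("c", ("a",)))  # 0.0625: alpha 0.5 times unigram 0.125
```

In the real model the tables and the leftover mass come from the estimator through prob() and discount(), so the numbers depend on which smoothing is chosen.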

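Conclusion
The revived model produces exactly the same results as the original model in 2.x. You can verify this yourself, or simply run the test code on GitHub.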
