Evaluation - Text Summarization - ROUGE

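Recently, while evaluating generated summaries, I needed to revisit the definition of ROUGE.
Along the way, I also came across the criteria people use to judge a text summary: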

Is the summary fluent?
Is the summary adequate? For example, is its length appropriate, and does it cover all of the most important information in the source text?

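Here ROUGE is used to measure adequacy. It does so by simple counting: how many n-grams in the generated summary match the n-grams of the reference summary (the ground truth)?
(There may be several reference summaries. In that case the ROUGE-1 score is computed against each reference and the results are averaged; some implementations take the maximum instead.)
Because ROUGE is based on content overlap, it can tell whether the generated summary and the reference summary discuss roughly the same concepts, but it cannot tell whether the two reach consistent conclusions or whether the generated summary is sensible.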
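
The counting described above can be sketched in a few lines of Python. This is only a minimal illustration, not the official ROUGE toolkit: it assumes whitespace tokenization, reports recall only, and averages over references as described above. The function names and the example sentences are made up for this note.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Counter intersection clips each match at the smaller of the two counts.
    overlap = sum((cand & ref).values())
    return overlap / total


def rouge_n_multi(candidate, references, n=1):
    """Average the per-reference recall (assumption: plain averaging)."""
    scores = [rouge_n_recall(candidate, ref, n) for ref in references]
    return sum(scores) / len(scores)


# Hypothetical example sentences, tokenized on whitespace.
candidate = "the cat was found under the bed".split()
references = [
    "the cat was under the bed".split(),
    "the cat hid under the bed".split(),
]

print(rouge_n_recall(candidate, references[0], n=1))  # 1.0
print(rouge_n_recall(candidate, references[0], n=2))  # 0.8
print(rouge_n_multi(candidate, references, n=1))      # roughly 0.917
```

Note that the candidate scores a perfect ROUGE-1 recall against the first reference even though it contains an extra word ("found"); recall alone does not penalize that.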
Wikipedia explains the variants as follows.
ROUGE-N: Overlap of N-grams between the system and reference summaries.

ROUGE-1 refers to the overlap of 1-gram (each word) between the system and reference summaries.
ROUGE-2 refers to the overlap of bigrams between the system and reference summaries.

ROUGE-L: Longest Common Subsequence (LCS) based statistics. Longest common subsequence problem takes into account sentence level structure similarity naturally and identifies longest co-occurring in sequence n-grams automatically.
ROUGE-W: Weighted LCS-based statistics that favors consecutive LCSes.
ROUGE-S: Skip-bigram based co-occurrence statistics. Skip-bigram is any pair of words in their sentence order.
ROUGE-SU: Skip-bigram plus unigram-based co-occurrence statistics.
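
To make the skip-bigram idea concrete, here is a small sketch of a ROUGE-S style recall. It assumes whitespace tokenization and no limit on the gap between the two words of a pair; real implementations often cap that distance. The function names and the two example sentences are illustrative choices for this note.

```python
from collections import Counter
from itertools import combinations


def skip_bigrams(tokens):
    """Count every pair of words in sentence order, with any gap allowed."""
    return Counter(combinations(tokens, 2))


def rouge_s_recall(candidate, reference):
    """Fraction of reference skip-bigrams that also occur in the candidate."""
    cand, ref = skip_bigrams(candidate), skip_bigrams(reference)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clip each match at the smaller count, as with ordinary n-grams.
    return sum((cand & ref).values()) / total


reference = "police killed the gunman".split()
candidate = "police kill the gunman".split()

# Shared pairs: (police, the), (police, gunman), (the, gunman).
print(rouge_s_recall(candidate, reference))  # 0.5
```

A four-word sentence has six skip-bigrams, and three of them survive the changed verb, so the recall is 3/6.
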
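ROUGE closely resembles BLEU, but BLEU is precision-oriented while ROUGE is recall-oriented.
In addition, the words ROUGE matches need not be contiguous (this applies to the LCS-based ROUGE-L and the skip-bigram ROUGE-S), whereas BLEU's n-grams must consist of words that appear consecutively.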
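For example, at the character level the longest common subsequence of the two sentences “我喜欢吃香蕉” ("I like eating bananas") and “我刚才吃了一个香蕉” ("I just ate a banana") is “我吃香蕉” ("I eat bananas").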
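
The example above can be checked with a standard dynamic-programming LCS. This is a minimal sketch that treats each character as a token; which sentence plays the reference and which the candidate is an arbitrary choice made here only to show how recall and precision differ.

```python
def lcs(a, b):
    """Return one longest common subsequence of sequences a and b."""
    m, n = len(a), len(b)
    # dp[i][j] holds the LCS length of a[:i] and b[:j].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    # Walk back through the table to recover the subsequence itself.
    out, i, j = [], m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]


reference = list("我喜欢吃香蕉")        # assumed to be the reference
candidate = list("我刚才吃了一个香蕉")  # assumed to be the generated text
common = lcs(reference, candidate)

print("".join(common))               # 我吃香蕉
print(len(common) / len(reference))  # LCS-based recall: 4 / 6
print(len(common) / len(candidate))  # LCS-based precision: 4 / 9
```

The matched characters are scattered across the second sentence, which is exactly the non-contiguous matching that contiguous n-gram overlap cannot capture.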