Calling the Stanford Parser API

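Note: the parser builds parse trees for Chinese sentences that have already been word-segmented.
Parser download: https://nlp.stanford.edu/software/lex-parser.shtml
API (Java): after adding the JAR files to your project, import the following classes in your Java program.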

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.io.*;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.TaggedWord;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;

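Sentences without POS tags
When a segmented sentence without POS tags is passed to the parser, the parser tags it automatically and then builds the parse tree. The code below is adapted from the sample code shipped with the parser. The sample reads sentences through a wrapper class (DocumentPreprocessor), and when I ran it on my own data it behaved as if it had been designed for a single sentence: it did not split sentences automatically. I could not find documentation on this, so I added a simple manual sentence split on top of the sample. The String variables doc, oh, en and up each hold one kind of sentence-ending punctuation mark.
A text file is used as input, and the generated trees are also written to a text file.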
class ParserDemo {
    public static void main(String[] args) {
        String parserModel = "edu\\stanford\\nlp\\models\\lexparser\\chinesePCFG.ser.gz";
        String testFile = "C:\\Users\\codinglee\\Desktop\\NLP\\Project_coding\\data\\test.txt";
        String outTest = "C:\\Users\\codinglee\\Desktop\\NLP\\Project_coding\\cpp\\cpp\\testTree.txt";
        demoDP(parserModel, testFile, outTest);
    }

    /**
     * demoDP demonstrates turning a file into tokens and then parse
     * trees. Note that the trees are printed by calling pennPrint on
     * the Tree object. It is also possible to pass a PrintWriter to
     * pennPrint if you want to capture the output.
     * This code will work with any supported language.
     */
    public static void demoDP(String parserModel, String filename, String outname) {
        // This option shows loading, sentence-segmenting and tokenizing
        // a file using DocumentPreprocessor.
        LexicalizedParser lp = LexicalizedParser.loadModel(parserModel);
        TreebankLanguagePack tlp = lp.treebankLanguagePack(); // language pack of the loaded model (unused below)
        try {
            FileWriter fw = new FileWriter(outname);
            PrintWriter pw = new PrintWriter(fw);
            for (List<HasWord> sentence : new DocumentPreprocessor(filename)) {
                String doc = "。";
                String oh = "!";
                String en = "?";
                String up = "''";
                List<String> ends = java.util.Arrays.asList(doc, oh, en, up);
                int n = sentence.size(), cur = 0, next = 0, step = 1;
                // NOTE: the loop condition, the punctuation test and the start of the
                // log message were lost from the published source. The scan below is a
                // reconstruction (an assumption): cut after a run of sentence-ending
                // tokens, or at the end of the token list.
                for (; cur < n; ) {
                    String word = sentence.get(next).word();
                    next += 1;
                    boolean endsHere = ends.contains(word);
                    boolean moreFollow = next < n && ends.contains(sentence.get(next).word());
                    if ((endsHere && !moreFollow) || next == n) {
                        System.out.print(step + "th sentence --->>> ");
                        System.out.print(cur);
                        System.out.print(", ");
                        System.out.print(next);
                        System.out.print(": ");
                        System.out.println(sentence.subList(cur, next));
                        Tree parse = lp.apply(sentence.subList(cur, next));
                        parse.pennPrint(pw);
                        cur = next;
                        step += 1;
                    }
                }
            }
            pw.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
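The manual split described above can be tried out without the parser on the classpath. The following is a minimal sketch that works on plain strings instead of HasWord tokens; the class name SentenceSplitter and the rule "cut after a run of sentence-ending tokens" are assumptions made for illustration, not part of the Stanford Parser API or of the original sample.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of the Stanford Parser API.
class SentenceSplitter {
    // Sentence-ending tokens; mirrors the doc, oh, en and up variables.
    static final Set<String> ENDS =
            new HashSet<>(Arrays.asList("。", "!", "?", "''"));

    // Splits a flat token list into sentences. A cut is made after a run of
    // sentence-ending tokens, or at the end of the list.
    static List<List<String>> split(List<String> tokens) {
        List<List<String>> sentences = new ArrayList<>();
        int n = tokens.size();
        int cur = 0; // start index of the sentence being collected
        for (int next = 1; next <= n; next++) {
            boolean endsHere = ENDS.contains(tokens.get(next - 1));
            boolean moreFollow = next < n && ENDS.contains(tokens.get(next));
            if ((endsHere && !moreFollow) || next == n) {
                sentences.add(new ArrayList<>(tokens.subList(cur, next)));
                cur = next;
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<String> tokens =
                Arrays.asList("我", "爱", "北京", "。", "你", "好", "吗", "?");
        for (List<String> sentence : split(tokens)) {
            System.out.println(sentence);
        }
    }
}
```

Each sublist returned by split corresponds to one sentence.subList(cur, next) range passed to lp.apply in the parser code. Keeping a run of end marks together avoids producing a "sentence" that consists only of a closing quote.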

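Sentences with POS tags
If you already have POS-tagged sentences and want the parser to use your own tags, use TaggedWord to attach a tag to each word. In the code below, "dev.txt" holds the segmented sentences, "devAttr.txt" holds the POS tags for the words in "dev.txt", and "devTree.txt" receives the generated parse trees.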
class ParserDemo {
    public static void main(String[] args) {
        String parserModel = "edu\\stanford\\nlp\\models\\lexparser\\chinesePCFG.ser.gz";
        String leeWord = "C:\\Users\\codinglee\\Desktop\\NLP\\Project_coding\\data\\dev.txt";
        String leeAttr = "C:\\Users\\codinglee\\Desktop\\NLP\\Project_coding\\data\\devAttr.txt";
        String leeOut = "C:\\Users\\codinglee\\Desktop\\NLP\\Project_coding\\data\\devTree.txt";
        demoLee(parserModel, leeWord, leeAttr, leeOut);
    }

    public static void demoLee(String parserModel, String leeWord, String leeAttr, String leeOut) {
        // Load the parser model; words and tags are read line by line
        // from two parallel files.
        LexicalizedParser lp = LexicalizedParser.loadModel(parserModel);
        TreebankLanguagePack tlp = lp.treebankLanguagePack(); // language pack of the loaded model (unused below)
        try {
            FileWriter fw = new FileWriter(leeOut);
            PrintWriter pw = new PrintWriter(fw);
            FileReader frWord = new FileReader(leeWord);
            BufferedReader brWord = new BufferedReader(frWord);
            FileReader frAttr = new FileReader(leeAttr);
            BufferedReader brAttr = new BufferedReader(frAttr);
            String line = "";
            String[] words = null;
            String[] attrs = null;
            int counter = 1;
            while ((line = brWord.readLine()) != null) {
                if (line.length() > 0) {
                    words = line.split(" ");
                    attrs = brAttr.readLine().split(" ");
                    //while (attrs.length == 0)
                    //    attrs = brAttr.readLine().split(" ");
                    System.out.println(counter + "th parser is moving on --->>> " + words.length + " - " + attrs.length);
                    List<TaggedWord> sentence = new ArrayList<TaggedWord>();
                    // NOTE: the published source is cut off inside the loop header below.
                    // Everything from this loop to the end of the class is a
                    // reconstruction based on the description above (an assumption).
                    for (int i = 0; i < words.length; i++) {
                        // Pair each word with the tag at the same position.
                        sentence.add(new TaggedWord(words[i], attrs[i]));
                    }
                    Tree parse = lp.apply(sentence);
                    parse.pennPrint(pw);
                    counter += 1;
                }
            }
            brWord.close();
            brAttr.close();
            pw.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
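The code above relies on each line of "dev.txt" having exactly as many tokens as the matching line of "devAttr.txt". The following is a minimal sketch of that pairing step with an explicit length check, written on plain strings so it runs without the parser. The class name WordTagAligner and the whitespace-splitting rule are assumptions for illustration; in the real program each pair would become a new TaggedWord(word, tag).

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the Stanford Parser API.
class WordTagAligner {
    // Splits a line on runs of whitespace; a blank line yields no tokens.
    static String[] tokens(String line) {
        String trimmed = line.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Pairs the i-th word with the i-th tag and fails early when the
    // two lines do not have the same number of tokens.
    static List<Map.Entry<String, String>> align(String wordLine, String tagLine) {
        String[] words = tokens(wordLine);
        String[] tags = tokens(tagLine);
        if (words.length != tags.length) {
            throw new IllegalArgumentException(
                    "word/tag count mismatch: " + words.length + " vs " + tags.length);
        }
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            pairs.add(new SimpleEntry<>(words[i], tags[i]));
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> p : align("我 爱 北京", "PN VV NR")) {
            System.out.println(p.getKey() + "/" + p.getValue());
        }
    }
}
```

Failing on a mismatch up front gives a clearer error than an ArrayIndexOutOfBoundsException raised in the middle of building the sentence.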
