Book|Book Review: Foundations of Statistical Natural Language Processing

Author: Christopher Manning & Hinrich Schütze
Publisher: MIT Press
Publication date: May 28, 1999
Source: downloaded PDF version
Goodreads: 4.15 (219 ratings)
Douban: 9.0 (61 ratings)
Excerpts:
What is the best way to read this book and teach from it? The book is organized into four parts: Preliminaries (part I), Words (part II), Grammar (part III), and Applications and Techniques (part IV).
Part I lays out the mathematical and linguistic foundation that the other parts build on. Concepts and techniques introduced here are referred to throughout the book.
Part II covers word-centered work in Statistical NLP. There is a natural progression from simple to complex linguistic phenomena in its four chapters on collocations, n-gram models, word sense disambiguation, and lexical acquisition, but each chapter can also be read on its own.
The four chapters in part III, Markov Models, tagging, probabilistic context-free grammars, and probabilistic parsing, build on each other, and so they are best presented in sequence. However, the tagging chapter can be read separately with occasional references to the Markov Model chapter.
The topics of part IV are four applications and techniques: statistical alignment and machine translation, clustering, information retrieval, and text categorization. Again, these chapters can be treated separately according to interests and time available, with the few dependencies between them marked appropriately.
“Statistical considerations are essential to an understanding of the operation and development of languages”
(Lyons 1968: 98)
“One’s ability to produce and recognize grammatical utterances is not based on notions of statistical approximation and the like”
(Chomsky 1957: 16)
“You say: the point isn’t the word, but its meaning, and you think of the meaning as a thing of the same kind as the word, though also different from the word. Here the word, there the meaning. The money, and the cow that you can buy with it. (But contrast: money, and its use.)”
(Wittgenstein 1968, Philosophical Investigations, §120)
“For a large class of cases—though not for all—in which we employ the word ‘meaning’ it can be defined thus: the meaning of a word is its use in the language.”
(Wittgenstein 1968, §43)
“Now isn’t it queer that I say that the word ‘is’ is used with two different meanings (as the copula and as the sign of equality), and should not care to say that its meaning is its use; its use, that is, as the copula and the sign of equality?” (Wittgenstein 1968, §561)
The aim of a linguistic science is to be able to characterize and explain the multitude of linguistic observations circling around us, in conversations, writing, and other media. Part of that has to do with the cognitive side of how humans acquire, produce, and understand language, part of it has to do with understanding the relationship between linguistic utterances and the world, and part of it has to do with understanding the linguistic structures by which language communicates. In order to approach the last problem, people have proposed that there are rules which are used to structure linguistic expressions. This basic approach has a long history that extends back at least 2000 years, but in this century the approach became increasingly formal and rigorous as linguists explored detailed grammars that attempted to describe what were well-formed versus ill-formed utterances of a language.
However, it has become apparent that there is a problem with this conception. Indeed it was noticed early on by Edward Sapir, who summed it up in his famous quote “All grammars leak” (Sapir 1921: 38). It is just not possible to provide an exact and complete characterization of well-formed utterances that cleanly divides them from all other sequences of words, which are regarded as ill-formed utterances. This is because people are always stretching and bending the ‘rules’ to meet their communicative needs. Nevertheless, it is certainly not the case that the rules are completely ill-founded. Syntactic rules for a language, such as that a basic English noun phrase consists of an optional determiner, some number of adjectives, and then a noun, do capture major patterns within the language. But somehow we need to make things looser, in accounting for the creativity of language use.
Finally, Chomskyan linguistics, while recognizing certain notions of categorical competition between principles, depends on categorical principles, which sentences either do or do not satisfy. In general, the same was true of American structuralism. But the approach we will pursue in Statistical NLP draws from the work of Shannon, where the aim is to assign probabilities to linguistic events, so that we can say which sentences are ‘usual’ and ‘unusual’. An upshot of this is that while Chomskyan linguists tend to concentrate on categorical judgements about very rare types of sentences, Statistical NLP practitioners are interested in good descriptions of the associations and preferences that occur in the totality of language use. Indeed, they often find that one can get good real world performance by concentrating on common types of sentences.
We believe that much of the skepticism towards probabilistic models for language (and for cognition in general) stems from the fact that the well-known early probabilistic models (developed in the 1940s and 1950s) are extremely simplistic. Because these simplistic models clearly do not do justice to the complexity of human language, it is easy to view probabilistic models in general as inadequate. One of the insights we hope to promote in this book is that complex probabilistic models can be as explanatory as complex non-probabilistic models – but with the added advantage that they can explain phenomena that involve the type of uncertainty and incompleteness that is so pervasive in cognition in general and in language in particular.
The Brown corpus is probably the most widely known corpus. It is a tagged corpus of about a million words that was put together at Brown University in the 1960s and 1970s. It is a balanced corpus. That is, an attempt was made to make the corpus a representative sample of American English at the time. Genres covered are press reportage, fiction, scientific text, legal text, and many others. Unfortunately, one has to pay to obtain the Brown corpus, but it is relatively inexpensive for research purposes. Many institutions with NLP research have a copy available, so ask around. The Lancaster-Oslo-Bergen (LOB) corpus was built as a British English replication of the Brown corpus.
The Susanne corpus is a 130,000 word subset of the Brown corpus, which has the advantage of being freely available. It is also annotated with information on the syntactic structure of sentences – the Brown corpus only disambiguates on a word-for-word basis. A larger corpus of syntactically annotated (or parsed) sentences is the Penn Treebank. The text is from the Wall Street Journal. It is more widely used, but not available for free.
In addition to texts, we also need dictionaries. WordNet is an electronic dictionary of English. Words are organized into a hierarchy. Each node consists of a synset of words with identical (or close to identical) meanings. There are also some other relations between words that are defined, such as meronymy or part-whole relations. WordNet is free and can be downloaded from the internet.
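To make the synset hierarchy and the part-whole relations concrete, here is a minimal sketch using NLTK's WordNet interface (my choice of tool; the book itself predates NLTK). The word "car" is just an arbitrary example.

```python
# Minimal WordNet exploration with NLTK (assumes `pip install nltk`
# and `nltk.download('wordnet')` have been run).
from nltk.corpus import wordnet as wn

for synset in wn.synsets('car'):
    # Each synset groups words with (near-)identical meanings.
    print(synset.name(), synset.lemma_names())

first = wn.synsets('car')[0]      # the motor-vehicle sense
print(first.hypernyms())          # more general concepts higher in the hierarchy
print(first.part_meronyms())      # part-whole relations, e.g. car -> accelerator
```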
In his book Human Behavior and the Principle of Least Effort, Zipf argues that he has found a unifying principle, the Principle of Least Effort, which underlies essentially the entire human condition (the book even includes some questionable remarks on human sexuality!). The Principle of Least Effort argues that people will act so as to minimize their probable average rate of work (i.e., not only to minimize the work that they would have to do immediately, but taking due consideration of future work that might result from doing work poorly in the short term). The evidence for this theory is certain empirical laws that Zipf uncovered, and his presentation of these laws begins where his own research began, in uncovering certain statistical distributions in language. We will not comment on his general theory here, but will mention some of his empirical language laws.
References to Zipf’s law in the Statistical NLP literature invariably refer to the above law (that a word’s frequency is inversely proportional to its rank in the frequency table), but Zipf actually proposed a number of other empirical laws relating to language which were also taken to illustrate the Principle of Least Effort. At least two others are of some interest to the concerns of Statistical NLP. One is the suggestion that the number of meanings of a word is correlated with its frequency. Again, Zipf argues that conservation of speaker effort would prefer there to be only one word with all meanings while conservation of hearer effort would prefer each meaning to be expressed by a different word.
Zipf finds empirical support for this result (in his study, words of frequency rank about 10,000 average about 2.1 meanings, words of rank about 5000 average about 3 meanings, and words of rank about 2000 average about 4.6 meanings).
A second result concerns the tendency of content words to clump. For a word one can measure the number of lines or pages between each occurrence of the word in a text, and then calculate the frequency F of different interval sizes I. For words of frequency at most 24 in a 260,000 word corpus, Zipf found that the number of intervals of a certain size was inversely related to the interval size (F ∝ I^(-p), where p varied between about 1 and 1.3 in Zipf’s studies). In other words, most of the time content words occur near another occurrence of the same word.
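Zipf's basic rank-frequency law is easy to check empirically. Below is a small sketch, assuming NLTK and its copy of the Brown corpus (my choice of corpus, not the book's); the law predicts that rank times frequency stays roughly constant.

```python
# Empirical check of Zipf's law (frequency proportional to 1/rank) on the
# Brown corpus. Assumes nltk is installed and `nltk.download('brown')` was run.
from collections import Counter
from nltk.corpus import brown

counts = Counter(w.lower() for w in brown.words())
ranked = counts.most_common()                     # [(word, freq), ...] sorted by frequency

for rank in (1, 10, 100, 1000, 10000):
    word, freq = ranked[rank - 1]
    # Zipf's law predicts rank * freq to be roughly constant across ranks.
    print(f"rank {rank:>5}  {word:<12} freq {freq:>6}  rank*freq {rank * freq}")
```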
Lexicographers and linguists (although rarely those of a generative bent) have long been interested in collocations. A collocation is any turn of phrase or accepted usage where somehow the whole is perceived to have an existence beyond the sum of the parts. Collocations include compounds (disk drive), phrasal verbs (make up), and other stock phrases (bacon and eggs). They often have a specialized meaning or are idiomatic, but they need not be. For example, at the time of writing, a favorite expression of bureaucrats in Australia is international best practice. Now there appears to be nothing idiomatic about this expression; it is simply two adjectives modifying a noun in a productive and semantically compositional way. But, nevertheless, the frequent use of this phrase as a fixed expression accompanied by certain connotations justifies regarding it as a collocation. Indeed, any expression that people repeat because they have heard others using it is a candidate for a collocation.
A modification that might be less obvious, but which is very effective, is to filter the collocations and remove those that have parts of speech (or syntactic categories) that are rarely associated with interesting collocations. There simply are no interesting collocations that have a preposition as the first word and an article as the second word. The two most frequent patterns for two word collocations are “adjective noun” and “noun noun” (the latter are called noun-noun compounds). Table 1.5 shows which bigrams are selected from the corpus if we only keep adjective noun and noun-noun bigrams. Almost all of them seem to be phrases that we would want to list in a dictionary – with some exceptions like last year and next year.
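The filtering idea described here is easy to prototype. Below is a rough sketch in the spirit of the passage, not the book's exact pipeline: it uses NLTK's default Penn-Treebank-style tagger (an assumption on my part) and keeps only adjective-noun and noun-noun bigrams, ranked by frequency.

```python
# Frequency-based collocation discovery with a part-of-speech filter.
# Assumes nltk plus its tokenizer and tagger model data are downloaded.
from collections import Counter
import nltk

ADJ = {"JJ", "JJR", "JJS"}
NOUN = {"NN", "NNS", "NNP", "NNPS"}

def pos_filtered_bigrams(text):
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        # Keep only "adjective noun" and "noun noun" patterns.
        if (t1 in ADJ or t1 in NOUN) and t2 in NOUN:
            counts[(w1.lower(), w2.lower())] += 1
    return counts.most_common()

print(pos_filtered_bigrams("The New York Stock Exchange saw strong growth last year."))
```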
As a final illustration of data exploration, suppose we are interested in the syntactic frames in which verbs appear. People have researched how to get a computer to find these frames automatically, but we can also just use the computer as a tool to find appropriate data. For such purposes, people often use a Key Word In Context (KWIC) concordancing program which produces displays of data such as the one in figure 1.3. In such a display, all occurrences of the word of interest are lined up beneath one another, with surrounding context shown on both sides.
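A bare-bones KWIC display takes only a few lines of code. The sketch below is my own toy version, not the concordancing program referred to in the book: every occurrence of a keyword is printed on its own line, aligned, with a fixed window of context on each side.

```python
# A toy Key Word In Context (KWIC) concordance: each hit of the keyword is
# shown bracketed, with `width` characters of context on either side.
def kwic(text, keyword, width=30):
    lower, key = text.lower(), keyword.lower()
    start = lower.find(key)
    while start != -1:
        left = text[max(0, start - width):start]
        right = text[start + len(key):start + len(key) + width]
        print(f"{left:>{width}} [{text[start:start + len(key)]}] {right}")
        start = lower.find(key, start + 1)

kwic("She showed off the gift. He showed her the door. They showed up late.", "showed")
```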
So far we have examined the notion of entropy, and seen roughly how it is a guide to determining efficient codes for sending messages, but how does this relate to understanding language? The secret to this is to return to the idea that entropy is a measure of our uncertainty. The more we know about something, the lower the entropy will be because we are less surprised by the outcome of a trial.
Alternately, we can think of entropy as a matter of how surprised we will be. Suppose that we are trying to predict the next word in a Simplified Polynesian text. That is, we are examining P(w|h), where w is the next word and h is the history of words seen so far.
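The "less surprised, lower entropy" point can be written out directly as H(X) = -Σ p(x) log2 p(x). A minimal sketch, with next-word distributions I made up purely for illustration:

```python
import math

def entropy(probs):
    # H(X) = -sum_x p(x) * log2 p(x); terms with p(x) = 0 contribute nothing.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A sharply peaked next-word distribution has low entropy (little surprise)...
print(entropy([0.9, 0.05, 0.03, 0.02]))   # ~0.62 bits
# ...while a uniform distribution over the same four words has the maximum, 2 bits.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits
```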
Pronouns are a separate small class of words that act like variables in that they refer to a person or thing that is somehow salient in the discourse context. For example, the pronoun she in sentence (3.5) refers to the most salient person (of feminine gender) in the context of use, which is Mary.
Brown tags. NN is the Brown tag for singular nouns (candy, woman). The Brown tag set also distinguishes two special types of nouns, proper nouns (or proper names), and adverbial nouns. Proper nouns are names like Mary, Smith, or United States that refer to particular persons or things. Proper nouns are usually capitalized. The tag for proper nouns is NNP. Adverbial nouns (tag NR) are nouns like home, west and tomorrow that can be used without modifiers to give information about the circumstances of the event described, for example the time or the location. They have a function similar to adverbs (see below). The tags mentioned so far have the following plural equivalents: NNS (plural nouns), NNPS (plural proper nouns), and NRS (plural adverbial nouns). Many also have possessive or genitive extensions: NN$ (possessive singular nouns), NNS$ (possessive plural nouns), NNP$ (possessive proper nouns), and NR$ (possessive adverbial nouns).
Brown tags. The Brown tag for adjectives (in the positive form) is JJ, for comparatives JJR, for superlatives JJT. There is a special tag, JJS, for the ‘semantically’ superlative adjectives chief, main, and top. Numbers are subclasses of adjectives. The cardinals, such as one, two, and 6,000,000, have the tag CD. The ordinals, such as first, second, tenth, and mid-twentieth, have the tag OD.
Brown tags. The Brown tag set uses VB for the base form (take), VBZ for the third person singular (takes), VBD for the past tense (took), VBG for gerund and present participle (taking), and VBN for the past participle (taken). The tag for modal auxiliaries (can, may, must, could, might, ...) is MD. Since be, have, and do are important in forming tenses and moods, the Brown tag set has separate tags for all forms of these verbs. We omit them here, but they are listed in table 4.6.
Brown tags. The tags for adverbs are RB (ordinary adverb: simply, late, well, little), RBR (comparative adverb: later, better, less), RBT (superlative adverb: latest, best, least), * (not), QL (qualifier: very, too, extremely), and QLP (post-qualifier: enough, indeed). Two tags stand for parts of speech that have both adverbial and interrogative functions: WQL (wh-qualifier: how) and WRB (wh-adverb: how, when, where). The Brown tag for prepositions is IN, while particles have the tag RP.
Brown tags. The tag for conjunctions is CC. The tag for subordinating conjunctions is CS.
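Since the excerpts above are essentially a catalogue of tags, a quick way to see them in use is to read tagged words off the Brown corpus itself. A minimal sketch with NLTK (my choice; tag spellings vary slightly between versions of the tag set, so treat the output as illustrative):

```python
# Peeking at Brown corpus tags with NLTK. Assumes nltk is installed and
# `nltk.download('brown')` has been run.
from collections import Counter
from nltk.corpus import brown

print(brown.tagged_words()[:10])           # first few (word, tag) pairs

tag_counts = Counter(tag for _, tag in brown.tagged_words())
print(tag_counts.most_common(10))          # which tags dominate the corpus
```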
