Richer linguistic content is available from some corpora, such as part-of-speech tags, dialogue tags, syntactic trees, and so forth; we will see these in later chapters. Although Project Gutenberg contains thousands of books, it represents established literature. It is important to consider less formal language as well. NLTK's small collection of web text includes content from a Firefox discussion forum, conversations overheard in New York, the movie script of Pirates of the Caribbean, personal advertisements, and wine reviews:
There is also a corpus of instant messaging chat sessions, originally collected by the Naval Postgraduate School for research on automatic detection of Internet predators. The corpus contains over 10,000 posts, anonymized by replacing usernames with generic names of the form "UserNNN", and manually edited to remove any other identifying information. The corpus is organized into 15 files, where each file contains several hundred posts collected on a given date, for an age-specific chatroom (teens, 20s, 30s, 40s), plus a generic adults chatroom.
The filename contains the date, chatroom, and number of posts; e.g., 10-19-20s_706posts.xml contains 706 posts gathered from the 20s chat room on 10/19/2006. The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. This corpus contains text from 500 sources, and the sources have been categorized by genre, such as news, editorial, and so on. Table 1.1 gives an example of each genre.
We can access the corpus as a list of words, or a list of sentences (where each sentence is itself just a list of words). We can optionally specify particular categories or files to read. The Brown Corpus is a convenient resource for studying systematic differences between genres, a kind of linguistic inquiry known as stylistics. Let's compare genres in their usage of modal verbs. The first step is to produce the counts for a particular genre.
Remember to import nltk before doing the following. Your Turn: Choose a different section of the Brown Corpus, and adapt the previous example to count a selection of wh words, such as what, when, where, who, and why. Next, we need to obtain counts for each genre of interest.
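The counting step can be sketched as follows. The helper name modal_counts is our own illustrative choice, not part of NLTK, and the commented Brown Corpus lines assume the corpus data has been downloaded:

```python
import nltk

MODALS = ['can', 'could', 'may', 'might', 'must', 'will']

def modal_counts(words):
    """Count selected modal verbs in a list of words (case-insensitive)."""
    fdist = nltk.FreqDist(w.lower() for w in words)
    return {m: fdist[m] for m in MODALS}

# A tiny illustration with an inline word list:
print(modal_counts(['We', 'will', 'see', 'what', 'we', 'can', 'do']))

# Applied to one Brown genre (assumes the corpus data is installed):
# from nltk.corpus import brown
# print(modal_counts(brown.words(categories='news')))
```

The same helper adapts directly to the Your Turn exercise: replace MODALS with a list of wh words.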
We'll use NLTK's support for conditional frequency distributions. These are presented systematically in 2, where we also unpick the following code line by line. For the moment, you can ignore the details and just concentrate on the output. Observe that the most frequent modal in the news genre is will, while the most frequent modal in the romance genre is could. Would you have predicted this? The idea that word counts might distinguish genres will be taken up again in chap-data-intensive.
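A minimal sketch of the pattern involved. The corpus-dependent construction is left as a comment (it assumes the Brown data is installed), and a tiny inline sample stands in for the real (genre, word) pairs so that the per-condition maximum can be read off with max():

```python
import nltk

# The one-line construction over the whole corpus (assumes the Brown
# Corpus data is installed):
# from nltk.corpus import brown
# cfd = nltk.ConditionalFreqDist(
#     (genre, word)
#     for genre in brown.categories()
#     for word in brown.words(categories=genre))

# The same structure on a tiny inline sample:
cfd = nltk.ConditionalFreqDist(
    [('news', 'will'), ('news', 'will'), ('news', 'could'),
     ('romance', 'could'), ('romance', 'could'), ('romance', 'will')])
print(cfd['news'].max())     # most frequent sample under the 'news' condition
print(cfd['romance'].max())  # most frequent sample under 'romance'
```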
The Reuters Corpus contains 10,788 news documents totaling 1.3 million words. The documents have been classified into 90 topics, and grouped into two sets, called "training" and "test". This split is for training and testing algorithms that automatically detect the topic of a document, as we will see in chap-data-intensive. Unlike the Brown Corpus, categories in the Reuters corpus overlap with each other, simply because a news story often covers multiple topics. We can ask for the topics covered by one or more documents, or for the documents included in one or more categories. For convenience, the corpus methods accept a single fileid or a list of fileids.
Similarly, we can specify the words or sentences we want in terms of files or categories. The first handful of words in each of these texts are the titles, which by convention are stored in upper case. In 1, we looked at the Inaugural Address Corpus, but treated it as a single text. The graph in fig-inaugural used "word offset" as one of the axes; this is the numerical index of the word in the corpus, counting from the first word of the first address.
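The uppercase-title convention can be seen with a small helper. The function title_words is our own illustrative code, not part of NLTK, and the commented Reuters calls (including the fileid) are assumptions that require the corpus data to be installed:

```python
def title_words(words):
    """Return the leading run of all-uppercase words, which by the
    convention described above is the document title."""
    title = []
    for w in words:
        if w.isupper():
            title.append(w)
        else:
            break
    return title

# A tiny inline illustration:
print(title_words(['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'Mounting', 'trade']))

# With the Reuters corpus (assumes the corpus data is installed;
# the fileid below is illustrative):
# from nltk.corpus import reuters
# print(reuters.categories('training/9865'))
# print(title_words(reuters.words('training/9865')))
```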
However, the corpus is actually a collection of 55 texts, one for each presidential address. An interesting property of this collection is its time dimension. Notice that the year of each text appears in its filename. To get the year out of the filename, we extract the first four characters, using fileid[:4].
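The slicing step looks like this. The inline fileids are a few real examples from the Inaugural Address Corpus; the commented lines over the whole corpus assume the data is installed:

```python
# Extract the year from each fileid by slicing off the first four characters.
fileids = ['1789-Washington.txt', '1793-Washington.txt', '2009-Obama.txt']
years = [fileid[:4] for fileid in fileids]
print(years)

# Over the whole corpus (assumes the corpus data is installed):
# from nltk.corpus import inaugural
# print([fileid[:4] for fileid in inaugural.fileids()])
```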
Let's look at how the words America and citizen are used over time. The following code converts the words in the Inaugural corpus to lowercase using w.lower(), then checks whether they start with either of the "targets" america or citizen using startswith(). Thus it will count words like American's and Citizens.
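The pattern can be sketched as follows. The sample list is a tiny inline stand-in for the corpus: each (fileid, words) pair plays the role of one address, whereas with the real data the words would come from inaugural.words(fileid) (corpus data assumed installed):

```python
import nltk

sample = [
    ('1789-Washington.txt', ['Fellow', 'Citizens', 'of', 'the', 'Senate']),
    ('2009-Obama.txt', ['America', 'and', 'Americans', 'citizens']),
]

# Pair each matching word's target with the year taken from the fileid.
cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid, words in sample
    for w in words
    for target in ['america', 'citizen']
    if w.lower().startswith(target))

print(cfd['america']['2009'])   # 'America' and 'Americans' both match
print(cfd['citizen']['1789'])   # 'Citizens' matches
```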
We'll learn about conditional frequency distributions in 2; for now, just consider the output, shown in Figure 1. Many text corpora contain linguistic annotations, representing POS tags, named entities, syntactic structures, semantic roles, and so forth. NLTK provides convenient ways to access several of these corpora, and has data packages containing corpora and corpus samples, freely downloadable for use in teaching and research. NLTK comes with corpora for many languages, though in some cases you will need to learn how to manipulate character encodings in Python before using these corpora (see 3).
The last of these corpora, udhr, contains the Universal Declaration of Human Rights in over 300 languages. The fileids for this corpus include information about the character encoding used in the file, such as UTF8 or Latin1. Let's use a conditional frequency distribution to examine the differences in word lengths for a selection of languages included in the udhr corpus. The output is shown in Figure 1. Note that True and False are Python's built-in boolean values. Your Turn: Pick a language of interest in udhr. Now plot a frequency distribution of the letters of the text using nltk.FreqDist and its plot() method. Unfortunately, for many languages, substantial corpora are not yet available.
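The word-length comparison follows the same conditional-frequency pattern. Here a small inline dict stands in for udhr.words(lang + '-Latin1'); with the real corpus (data assumed installed) the conditions would be the language names:

```python
import nltk

# Tiny stand-in word lists for two udhr languages:
sample = {
    'English': ['all', 'human', 'beings', 'are', 'born', 'free'],
    'German_Deutsch': ['alle', 'Menschen', 'sind', 'frei', 'geboren'],
}

# Condition = language, event = word length.
cfd = nltk.ConditionalFreqDist(
    (lang, len(word))
    for lang, words in sample.items()
    for word in words)

print(sorted(cfd.conditions()))
print(cfd['English'][3])   # number of 3-letter English words in the sample

# A cumulative plot per language (needs matplotlib):
# cfd.plot(cumulative=True)
```

The cumulative=True keyword is the source of the True/False remark in the text above.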
Often there is insufficient government or industrial support for developing language resources, and individual efforts are piecemeal and hard to discover or re-use.
Some languages have no established writing system, or are endangered. See 7 for suggestions on how to locate language resources. We have seen a variety of corpus structures so far; these are summarized in 1. The simplest kind lacks any structure: it is just a collection of texts. Often, texts are grouped into categories that might correspond to genre, source, author, language, etc. Sometimes these categories overlap, notably in the case of topical categories as a text can be relevant to more than one topic. Occasionally, text collections have temporal structure, news collections being the most common example.
NLTK's corpus readers support efficient access to a variety of corpora, and can be used to work with new corpora. We illustrate the difference between some of the corpus access methods below. If you have your own collection of text files that you would like to access using the above methods, you can easily load them with the help of NLTK's PlaintextCorpusReader.
The second parameter of the PlaintextCorpusReader initializer can be a list of fileids, like ['a.txt', 'test/b.txt'], or a pattern that matches all fileids. We can use the BracketParseCorpusReader to access this corpus. We introduced frequency distributions in 3. We saw that given some list mylist of words or other items, FreqDist(mylist) would compute the number of occurrences of each item in the list.
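Loading your own files can be sketched like this; here we build a tiny throwaway corpus in a temporary directory, and the filenames are purely illustrative:

```python
import os
import tempfile
from nltk.corpus import PlaintextCorpusReader

# Create a tiny corpus on disk:
root = tempfile.mkdtemp()
with open(os.path.join(root, 'a.txt'), 'w') as f:
    f.write('Hello world.')
with open(os.path.join(root, 'b.txt'), 'w') as f:
    f.write('Another small document.')

# First parameter: the corpus root; second: fileids (here a regex pattern).
reader = PlaintextCorpusReader(root, r'.*\.txt')
print(reader.fileids())
print(list(reader.words('a.txt')))
```

The reader then supports the same words() access pattern as NLTK's built-in plaintext corpora.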
Here we will generalize this idea. When the texts of a corpus are divided into several categories, by genre, topic, author, etc., we can maintain separate frequency distributions for each category. This will allow us to study systematic differences between the categories. A conditional frequency distribution is a collection of frequency distributions, each one for a different "condition". The condition will often be the category of the text. Figure 2.
A frequency distribution counts observable events, such as the appearance of words in a text. A conditional frequency distribution needs to pair each event with a condition. So instead of processing a sequence of words, we have to process a sequence of pairs. Each pair has the form (condition, event).
If we were processing the entire Brown Corpus by genre, there would be 15 conditions (one per genre), and 1,161,192 events (one per word). In 1 we saw a conditional frequency distribution where the condition was the section of the Brown Corpus, and for each condition we counted words.
Let's break this down, and look at just two genres, news and romance. For each genre, we loop over every word in the genre, producing pairs consisting of the genre and the word. We can now use this list of pairs to create a ConditionalFreqDist, and save it in a variable cfd. As usual, we can type the name of the variable to inspect it, and verify it has two conditions:
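The two-genre construction can be sketched as follows. The inline word lists are tiny stand-ins for brown.words(categories=genre), which assumes the corpus data is installed:

```python
import nltk

texts = {
    'news': ['the', 'jury', 'said', 'the', 'election', 'was', 'fair'],
    'romance': ['she', 'said', 'she', 'could', 'not', 'stay'],
}

# One (genre, word) pair per word in each genre:
genre_word = [(genre, word)
              for genre in ['news', 'romance']
              for word in texts[genre]]
print(len(genre_word))    # one pair per word
print(genre_word[:2])     # pairs begin with the genre

cfd = nltk.ConditionalFreqDist(genre_word)
print(cfd.conditions())   # the two genres
```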
Let's access the two conditions, and satisfy ourselves that each is just a frequency distribution:.
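Indexing a ConditionalFreqDist by a condition yields an ordinary frequency distribution, which can be checked directly (the inline pairs here are illustrative):

```python
import nltk

cfd = nltk.ConditionalFreqDist(
    [('news', 'the'), ('news', 'the'), ('news', 'said'),
     ('romance', 'could'), ('romance', 'said')])

# Each condition maps to a plain FreqDist:
print(type(cfd['news']).__name__)
print(cfd['news'].most_common(2))
print(cfd['romance']['could'])
```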
Apart from combining two or more frequency distributions, and being easy to initialize, a ConditionalFreqDist provides some useful methods for tabulation and plotting. The plot in 1 was produced with a conditional frequency distribution.
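The tabulation and plotting methods can be sketched on a small inline distribution; with real data the pairs would come from a corpus as above:

```python
import nltk

cfd = nltk.ConditionalFreqDist(
    [('news', 'will'), ('news', 'will'), ('news', 'must'),
     ('romance', 'could'), ('romance', 'could'), ('romance', 'will')])

# tabulate() prints a condition-by-sample table; both keyword arguments
# are optional and default to everything in the distribution.
cfd.tabulate(conditions=['news', 'romance'], samples=['could', 'must', 'will'])

# plot() draws the same information as line graphs (needs matplotlib):
# cfd.plot(conditions=['news', 'romance'], samples=['could', 'must', 'will'])
```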