NLP, NLTK and ngram


New Fascination for Me - April 2012

NLP, or Natural Language Processing, is the use of computers to process language as it is actually written and spoken. Language is a living organism and, like all organisms, it changes. To English majors (mainly those with Ph.D. theses on their minds) it is a magnificent tool, one that can produce a concordance of an author's work. I have found concordances fascinating from the moment, years ago, when I understood what they were: a "dictionary" of every word in, say, a novel, together with a portion of the sentence in which the word appears, so you can see every instance of where that word is used in that book.

The Stanford NLP Group is responsible for much of the research being done in this area, but the NLTK, the Natural Language Toolkit, was developed at the University of Pennsylvania starting in 2001. It is an evolving toolkit that provides functions for turning raw text into various forms of usable data.

The first day I experimented with this, I quickly realized that this is the kind of processing behind Google's Ngram Viewer. The Ngram Viewer is a wonderful way to "waste" hours of time watching societies change, trends come and go, and words coming into and fading out of use. For instance, go to the Ngram Viewer and enter "complex,stress" without the quotes.
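An "n-gram," by the way, is just a run of n consecutive words, and counting them across millions of books is what the viewer plots. As a minimal sketch of the idea, here are NLTK's bigrams and ngrams helpers applied to a sample sentence of my own:

		import nltk

		tokens = "the quick brown fox jumps over the lazy dog".split()
		list(nltk.bigrams(tokens))    # [('the', 'quick'), ('quick', 'brown'), ...]
		list(nltk.ngrams(tokens, 3))  # the same idea for three-word runs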

One of the more exciting things about NLP is how easy it is to try out:

  1. Buy the book, Natural Language Processing with Python, at amazon.com (book royalties go to support development of the NLTK).
  2. Download and install Python from python.org.
  3. Download and install NLTK from nltk.org (a quick sanity check follows this list).
  4. From here on, follow the book. It is an excellent book, and very well written.
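To check that everything installed correctly (step 3 above), a quick test in the Python interpreter looks something like this. nltk.download() opens NLTK's downloader, from which you can fetch the "book" collection of sample texts; the concordance line is the book's own first example:

		import nltk              # should import without errors
		nltk.download()          # opens the downloader; grab the "book" collection
		from nltk.book import *  # loads the sample texts used throughout the book
		text1.concordance("whale")  # text1 is Moby Dick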

Within an hour I had figured out the basics: I downloaded the text of Virginia Woolf's 1941 novel, Between the Acts, from Project Gutenberg Australia, turned it into an NLTK Text object, and played with the concordance method. Admittedly, I have some Python experience, but not a whole lot. In fact, the commands you need to get this far are mostly NLTK commands with very little Python. Here are the commands, all issued in the Python interpreter:

		import nltk

		raw = open('books/woolf-acts.txt', 'r').read()  # the Gutenberg plain text
		tokens = nltk.wordpunct_tokenize(raw)           # split into word and punctuation tokens
		woolfa = nltk.Text(tokens)                      # wrap the tokens in an NLTK Text
		woolfa.concordance("rain")                      # every context where "rain" appears

The books directory is on my machine and contains the plain text downloaded from the Gutenberg site. woolfa is the name I give to the processed text. And, finally, I use the NLTK concordance method to pull out every context in which Woolf used the word "rain" in Between the Acts.
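Once you have a Text object, a few more one-liners are worth trying on the same session. These are standard NLTK methods applied to the woolfa and tokens names above (my own extension of the session; most_common needs NLTK 3 or later):

		woolfa.similar("rain")         # words that appear in contexts like "rain"
		woolfa.collocations()          # word pairs that occur together unusually often
		fdist = nltk.FreqDist(tokens)  # frequency of every token in the novel
		fdist.most_common(20)          # the twenty most frequent tokens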

Concurrent with this, I found out that the next course I am taking at udacity.com is taught by the head of Google research, Peter Norvig, who works in AI and Natural Language Processing. The course is not about NLP, but I wish it were!

(Photo: notebook and keyboard for CS101 at udacity.com)