Friday, June 20, 2008

Natural Language Processing

[This was originally posted at http://timstall.dotnetdevelopersjournal.com/natural_language_processing.htm]

I got very bad marks in grammar during middle school. It wasn't until after I graduated public school that I cared about writing, and hence grammar. Now, I find it quite fascinating, so I was interested to read about a CodePlex project - SharpNLP. In their own words:

SharpNLP is a collection of natural language processing tools written in C#. Currently it provides the following NLP tools:

  • a sentence splitter

  • a tokenizer

  • a part-of-speech tagger

  • a chunker (used to "find non-recursive syntactic annotations such as noun phrase chunks")

  • a parser

  • a name finder

  • a coreference tool

  • an interface to the WordNet lexical database

So, you could type in a paragraph, and it parses that out into the different sentences, and then different words and parts of speech. Of course, I began to think if maybe this could be extended to be like the FxCop of technical articles. For example, according to the Microsoft Manual of Style for Technical Publications, the rules to merely capitalize a sub-title are non-trivial - there are 10 rules and it takes a full page to explain them. I wonder in theory if you could have a NLP (Natural Language Processing) tool run through a subtitle (as an input string), and apply the rules, much like you can parse a arithmetic expression for correct syntax. I was starting to toy around with it, but it seemed to quickly get difficult, so maybe I look at it another day.

 

In the meantime, if you're interested in NLP, check out SharpNLP.

 

No comments:

Post a Comment