> NLTK is very well-documented and easy to work with.
Agreed.
> It's also amazingly complete -- tokenizers, stemmers, POS taggers, classifers, etc. etc etc.
Don't agree. The biggest missing piece is a statistical parser which forms the basis for a lot of further linguistic analysis. It is hard to beat Stanford Parser for that. Check out https://github.com/wavii/pfp which has Python bindings.
For most of the ML stuff, you would be better off going to a specialist library like Scikits.learn directly. They are faster and implementations are more accurate. ( I found some of the implementations not quite correct in NLTK. For example, Naive Bays classifier which a lot of first time users use. The difference in results may not be much in practice but it is still incorrect.)
It is definitely a very good place to start but better alternatives exist for many of the pieces.
>Don't agree. The biggest missing piece is a statistical
>parser which forms the basis for a lot of further linguistic
>analysis.
Conceded and agreed. This is the one major gap. But I still maintain it's a remarkably complete toolkit. Plus you get to work in Python, which is a big advantage for me.
What's wrong with the Naive Bayes classifier? Did you submit a patch?
Likewise, I totally agree with you that there are faster/more accurate/more efficient implementations of many of the tools in the NLTK. If performance is a must, then you're better of prototyping in NLTK then using a specialized library. But in terms of completeness and ease of use, NLTK is very strong.
EDIT: I'm not sure why abhaga is being downvoted. There was nothing disrespectful in his response to me. Disagreement is an important part of intelligent discussion. Upvoting to counter the downvote(s).
The problem I found is that it mixes up the binomial and the multinomial event models for the naive bayes (See http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pd... for reference). It computes the probabilities as per the binomial event model but doesn't include the probabilities of missing events. This was my understanding from reading the source code.
> Plus you get to work in Python, which is a big advantage for me.
Indeed. I so wish someone would build a dependency parser on top of pfp so that I can ditch Stanford parser. I have used https://github.com/dasmith/stanford-corenlp-python for interfacing with Stanford toolkit but it is somewhat brittle.
As far as I know, NLTK has no C dependencies other than its general dependency on NumPy. I think they are keeping the toolkit in pure Python on purpose (but I may be wrong about that). That said there are SVM implementations in pure Python -- PyMVPA, for one.
Agreed.
> It's also amazingly complete -- tokenizers, stemmers, POS taggers, classifers, etc. etc etc.
Don't agree. The biggest missing piece is a statistical parser which forms the basis for a lot of further linguistic analysis. It is hard to beat Stanford Parser for that. Check out https://github.com/wavii/pfp which has Python bindings.
For most of the ML stuff, you would be better off going to a specialist library like Scikits.learn directly. They are faster and implementations are more accurate. ( I found some of the implementations not quite correct in NLTK. For example, Naive Bays classifier which a lot of first time users use. The difference in results may not be much in practice but it is still incorrect.)
It is definitely a very good place to start but better alternatives exist for many of the pieces.