
> NLTK is very well-documented and easy to work with.

Agreed.

> It's also amazingly complete -- tokenizers, stemmers, POS taggers, classifiers, etc.

I don't agree. The biggest missing piece is a statistical parser, which forms the basis for a lot of further linguistic analysis. It is hard to beat the Stanford Parser for that. Check out https://github.com/wavii/pfp, which has Python bindings.

For most of the ML stuff, you would be better off going directly to a specialist library like Scikits.learn. Such libraries are faster and their implementations are more accurate. (I found some of the implementations in NLTK not quite correct; for example, the Naive Bayes classifier, which a lot of first-time users reach for. The difference in results may not be much in practice, but it is still incorrect.)

NLTK is definitely a very good place to start, but better alternatives exist for many of the pieces.



> Don't agree. The biggest missing piece is a statistical parser which forms the basis for a lot of further linguistic analysis.

Conceded and agreed. This is the one major gap. But I still maintain it's a remarkably complete toolkit. Plus you get to work in Python, which is a big advantage for me.

What's wrong with the Naive Bayes classifier? Did you submit a patch?

Likewise, I totally agree with you that there are faster/more accurate/more efficient implementations of many of the tools in NLTK. If performance is a must, then you're better off prototyping in NLTK and then switching to a specialized library. But in terms of completeness and ease of use, NLTK is very strong.

EDIT: I'm not sure why abhaga is being downvoted. There was nothing disrespectful in his response to me. Disagreement is an important part of intelligent discussion. Upvoting to counter the downvote(s).


> What's wrong with the Naive Bayes classifier?

The problem I found is that it mixes up the binomial and the multinomial event models for Naive Bayes (see http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pd... for reference). It computes the probabilities as per the binomial event model, but it doesn't include the probabilities of the missing events, i.e. the features that are absent from the document. This was my understanding from reading the source code.
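To make the distinction concrete, here is a minimal sketch of the two ways of scoring. This is not NLTK's actual code; the function names, feature names, and probabilities are all made up for illustration. Under the binomial (Bernoulli) event model every vocabulary feature contributes to the likelihood, whether it is present or absent, whereas the second function scores only the features that are present.

```python
import math

def bernoulli_log_likelihood(present, feature_probs):
    """Full binomial (Bernoulli) event model: every vocabulary feature
    contributes, as log p if present and log (1 - p) if absent."""
    total = 0.0
    for feature, p in feature_probs.items():
        total += math.log(p) if feature in present else math.log(1.0 - p)
    return total

def present_only_log_likelihood(present, feature_probs):
    """Hypothetical variant that scores only the observed features and
    silently drops the terms for absent ones."""
    return sum(math.log(feature_probs[f]) for f in present if f in feature_probs)

# Made-up values of P(feature present | class) for two classes.
spam = {"free": 0.8, "meeting": 0.1, "offer": 0.7}
ham = {"free": 0.2, "meeting": 0.6, "offer": 0.1}

doc = {"free"}  # the document contains only the word "free"

# Log-likelihood ratio (spam vs. ham) under each scoring scheme.
full_ratio = bernoulli_log_likelihood(doc, spam) - bernoulli_log_likelihood(doc, ham)
partial_ratio = present_only_log_likelihood(doc, spam) - present_only_log_likelihood(doc, ham)

print(full_ratio, partial_ratio)
```

With these numbers both schemes still favour the same class, but the ratios differ because the absent features "meeting" and "offer" carry evidence that the second scheme ignores. That matches the point that results may be close in practice while the computation is not the model it claims to be.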

> Plus you get to work in Python, which is a big advantage for me.

Indeed. I so wish someone would build a dependency parser on top of pfp so that I could ditch the Stanford parser. I have used https://github.com/dasmith/stanford-corenlp-python for interfacing with the Stanford toolkit, but it is somewhat brittle.


No SVM support either. I could try to add it, I guess; libSVM has Python bindings already.


As far as I know, NLTK has no C dependencies other than its general dependency on NumPy. I think they are keeping the toolkit in pure Python on purpose (but I may be wrong about that). That said, there are SVM implementations in pure Python -- PyMVPA, for one.



