Pandas 0.7.0 released: Python data analysis library (pydata.org)
182 points by wesm on Feb 10, 2012 | 26 comments


Pandas is looking very nice in general, and I'm happy to find HDF5 in there too :)

http://pandas.pydata.org/pandas-docs/dev/io.html#hdf5-pytabl...


Ooooh, this is really cool! R is nice, but switching to it is a pain when working in Python.

Together with scikits.learn, this could prove really useful in machine learning and data analysis projects.


Pandas is literally in my top 3 favorite open-source projects. I use v0.6x regularly and it is absolutely fantastic. Highly recommend. Can't wait to try some of these new features as well.


I'm curious: What are your other two favorites?


If I were starting fresh, I'm not sure I'd go with R anymore -- but it's so hard to leave R when you've got a toolkit of 50+ packages you need. =(


Wes, was just on your blog and see you are also into data visualization. Wondering if you have any recommendations for web based charting tools? I've used flot (jquery plugin), but looking for alternatives. Thanks!


Very interested in d3 integration. Some people (http://github.com/mikedewar/D3py) have already started working in that direction. The IPython HTML notebook makes JavaScript visualization combined with pandas a very attractive option going forward, especially if you can come up with a way to have an interactive plot with backend computations being handled by pandas. pandas currently does not emit JSON; I would love to adapt UltraJSON or another library to turn DataFrame objects into JSON very fast and efficiently.
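To make the JSON idea concrete, here is a minimal standard-library sketch of the kind of output a JavaScript charting layer could consume. It is not pandas or UltraJSON code: a plain dict of columns stands in for a DataFrame, and the function name `columns_to_json_records` and the sample data are assumptions made up for illustration. Missing floats are written as `null`, since bare `NaN` is not valid JSON.

```python
import json
import math


def columns_to_json_records(columns):
    """Serialize a column-oriented table to a JSON array of row objects.

    `columns` maps column name -> list of values; all lists are assumed
    to have the same length. Float NaN is written as JSON null.
    """
    names = list(columns)
    n_rows = len(columns[names[0]]) if names else 0
    rows = []
    for i in range(n_rows):
        row = {}
        for name in names:
            value = columns[name][i]
            if isinstance(value, float) and math.isnan(value):
                value = None  # becomes null in the JSON output
            row[name] = value
        rows.append(row)
    # allow_nan=False makes json.dumps raise instead of emitting bare NaN.
    return json.dumps(rows, allow_nan=False)


# Hypothetical sample data standing in for a small DataFrame.
table = {"ticker": ["AAPL", "MSFT"], "price": [493.42, float("nan")]}
payload = columns_to_json_records(table)
```

A row-per-object layout like this is convenient for d3-style data joins, at the cost of repeating the column names in every row; a column-oriented layout would be more compact for large frames.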


I've used http://www.amcharts.com (and its mapping companion http://www.ammap.com). It was originally in Flash, but now there's also a JS version (haven't tried it yet). I think I've read here on HN that a new flot fork has been released with jQuery requirements, but I can't find the link.




I tried to find information on how fast the operations are in Pandas, but couldn't see any numbers. Does anybody have opinions about that aspect?


I've written quite a bit about performance on my blog: http://blog.wesmckinney.com. The historical (v)benchmarks page is a good resource (but doesn't compare to any other libraries): http://pandas.pydata.org/pandas-docs/vbench/


What are you using to display code on your blog? It's really nice!


Recent posts use the Crayon syntax highlighter for Wordpress. Though I'm thinking about ditching WP eventually for a workflow more like http://jseabold.net/blog/2012/01/project-genesis.html.


Wes is a rockstar


> NaN (not a number) is the standard missing data marker used in pandas

That's just wrong.


Is it? For lack of NA bit patterns in NumPy it's either use a special value (like NaN) or use masked arrays. If you choose the latter, I say to you: good luck.


NaN as commonly used already has a meaning: it's the result of a calculation whose inputs were known, and the calculation is known to be undefined for those specific inputs. "Unknown" means something entirely different: that we don't know what the inputs were, but if we did they are unlikely to have been NaN.

Conflating the two concepts means you can't tell the difference given the result set. It's just a happy accident that "unknown" and NaN have identical propagation rules, but that doesn't mean that it's safe to use one in place of the other. Reading up on it, it looks like Octave and Matlab can treat NaN as "missing data", though, so I guess there's a certain "industry standard behaviour" to follow so as not to surprise users, but it's still less than ideal.

In an ideal world, we could define an explicit "missing data" quiet NaN which would have a distinct visual representation - I suspect this is doable with access to the float exponent bits, but I don't know how Python could take advantage of it.
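As a rough sketch of that idea in Python: the `struct` module gives access to the raw bits of a double, so a NaN can be tagged with a distinguishing payload. Note that the payload lives in the significand (mantissa) bits; the exponent bits are all ones for every NaN. The payload value `MISSING_PAYLOAD` and the helper names below are assumptions invented for this example, and nothing in NumPy or pandas interprets such a payload. R's `NA` for doubles is built the same way, as a NaN with a particular payload.

```python
import math
import struct

# Assumed, arbitrary payload used to tag "missing" NaNs in this sketch.
MISSING_PAYLOAD = 0x7A5

# Low 51 bits of the significand (everything below the quiet bit).
PAYLOAD_MASK = 0x0007FFFFFFFFFFFF


def make_nan(payload):
    """Build a quiet NaN carrying `payload` in its low significand bits."""
    # 0x7FF8... sets the exponent to all ones plus the quiet bit.
    bits = 0x7FF8000000000000 | (payload & PAYLOAD_MASK)
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def nan_payload(x):
    """Return the low 51 significand bits of a float's bit pattern."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return bits & PAYLOAD_MASK


def is_missing(x):
    """True only for NaNs tagged with MISSING_PAYLOAD."""
    return math.isnan(x) and nan_payload(x) == MISSING_PAYLOAD


missing = make_nan(MISSING_PAYLOAD)
computed = float("inf") - float("inf")  # NaN from an undefined calculation
```

Both values satisfy `math.isnan`, but only `missing` satisfies `is_missing`. The catch is that payload propagation through arithmetic is hardware-dependent rather than guaranteed, so the tag is only reliable on values that have not been computed with.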


Well, masked arrays are a very good solution for the right problem (i.e. temporarily or permanently flagging data as "bad" while preserving the original data). Not to rehash the old debate, but they're quite handy when you need them.

I do agree that NaNs are a better choice for truly missing data, but I'm biased just because they use less memory. They're not a solution for non-floating point data, though.
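A small standard-library illustration of the non-floating-point limitation, with made-up sample values: a float can hold NaN, an integer cannot, and a separate validity mask is one way around that (the masked-array idea in miniature, not NumPy's actual `numpy.ma` implementation).

```python
import math

nan = float("nan")

# A float column can carry NaN alongside real values.
float_column = [1.5, nan, 3.0]
n_missing = sum(math.isnan(v) for v in float_column)

# An integer has no NaN bit pattern, so the conversion fails outright.
try:
    int(nan)
    int_conversion_error = None
except ValueError as exc:
    int_conversion_error = str(exc)

# Workaround: keep the integers intact and track validity separately.
values = [10, 0, 30]         # the 0 is a placeholder, not real data
mask = [False, True, False]  # True marks a missing entry
valid_total = sum(v for v, m in zip(values, mask) if not m)
```

The mask costs extra memory per element, which is the trade-off mentioned above, but it works for any dtype and preserves the original values.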

Great job on Pandas, by the way!


Pandas + statsmodels = #rstats domination. Really like where this project is going.


It'd be better if you lost the ragging on C#/C++/Java and went positive by accentuating the great abilities of interpreted languages like Python for rapid prototyping, which is what you are doing when you are iteratively improving analysis.

FYI it's fun to hear an academic ragging on "unmaintainable code".


Who's the academic you're referring to (if it's me, you're misinformed)?

One of the strengths of Python is that you can use it to build critical production systems (which I've done for many years in the financial industry). You come up against a lot of people who think "Java/C++/C# are the only suitable systems languages".


I use Python at work heavily. I also equally use Java, C++, C#, Ruby, and shell scripting. I use what's good for what I'm trying to do, and I like having several choices.

I'm merely pointing out that the language bashing is not productive. The writeup should point out the positives and stop trying to turn the differences between languages into a parallel of the state of American political discourse.


Is there anything similar to Pandas that runs on the JVM?



Is there anything close to this for Ruby?



