Pandas is literally in my top 3 favorite open-source projects. I use v0.6x regularly and it is absolutely fantastic. Highly recommend. Can't wait to try some of these new features as well.
Wes, was just on your blog and see you are also into data visualization. Wondering if you have any recommendations for web-based charting tools? I've used flot (a jQuery plugin), but I'm looking for alternatives. Thanks!
Very interested in d3 integration. Some people (http://github.com/mikedewar/D3py) have already started working in that direction. The IPython HTML notebook makes JavaScript visualization combined with pandas a very attractive option going forward, especially if you can come up with a way to have an interactive plot with backend computations being handled by pandas. pandas currently does not emit JSON; I would love to adapt UltraJSON or another library to turn DataFrame objects into JSON quickly and efficiently.
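For anyone who wants to feed a DataFrame to a JavaScript charting library in the meantime, here is a minimal sketch that uses only `DataFrame.to_dict` and the standard-library `json` module. The helper name `frame_to_json` is hypothetical, not a pandas API, and this makes no attempt at the speed an UltraJSON-based approach would give. Newer pandas releases also ship a `DataFrame.to_json` method, so treat this as an illustration of the idea rather than the recommended route.

```python
import json
import math

import pandas as pd


def frame_to_json(df):
    """Serialize a DataFrame to a JSON array of row objects (hypothetical helper)."""
    records = df.to_dict(orient="records")
    cleaned = [
        {
            # JSON has no NaN literal, so missing floats become null.
            str(key): (None if isinstance(value, float) and math.isnan(value) else value)
            for key, value in row.items()
        }
        for row in records
    ]
    return json.dumps(cleaned)


df = pd.DataFrame({"x": [1.0, 2.0], "y": [0.5, float("nan")]})
payload = frame_to_json(df)  # a string a d3 script could parse with JSON.parse
```

The row-oriented layout is an assumption about what the charting side wants; a column-oriented layout would be a small change to the `orient` argument.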
I've used http://www.amcharts.com (and its mapping companion http://www.ammap.com). It was originally in Flash but now there's also a JS version (haven't tried it yet).
I think I've read here on HN that a new flot fork has been released with jQuery requirements, but I can't find the link.
Is it? For lack of NA bit patterns in NumPy, you either use a special value (like NaN) or use masked arrays. If you choose the latter, I say to you: good luck.
NaN as commonly used already has a meaning: it's the result of a calculation whose inputs were known, and the calculation is known to be undefined for those specific inputs. "Unknown" means something entirely different: that we don't know what the inputs were, but if we did they are unlikely to have been NaN.
Conflating the two concepts means you can't tell the difference given the result set. It's just a happy accident that "unknown" and NaN have identical propagation rules, but that doesn't mean that it's safe to use one in place of the other. Reading up on it, it looks like Octave and Matlab can treat NaN as "missing data", though, so I guess there's a certain "industry standard behaviour" to follow so as not to surprise users, but it's still less than ideal.
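To make the point concrete, here is a small sketch in plain Python (no pandas involved) showing that a NaN produced by an undefined calculation and a NaN used as a stand-in for "unknown" cannot be told apart once they have passed through arithmetic. The variable names are only illustrative.

```python
import math

# NaN from a calculation: the inputs are known, the result is undefined.
undefined = float("inf") - float("inf")

# NaN used as a placeholder: the input itself was never observed.
missing = float("nan")

# Both propagate the same way through later arithmetic.
total_from_undefined = undefined + 1.0
total_from_missing = missing + 1.0

# Downstream, all we can ask is "is it NaN?", and both answer yes.
indistinguishable = math.isnan(total_from_undefined) and math.isnan(total_from_missing)
```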
In an ideal world, we could define an explicit "missing data" quiet NaN which would have a distinct visual representation - I suspect this is doable with access to the float's fraction (payload) bits, but I don't know how Python could take advantage of it.
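As a rough sketch of how Python could get at those bits: the `struct` module can reinterpret a 64-bit integer as a double and back. In an IEEE 754 double every NaN has all exponent bits set, so the tag has to live in the fraction bits. The tag value `MISSING_PAYLOAD` and both helper names below are made up for illustration, and this assumes the platform stores floats as IEEE 754 doubles.

```python
import math
import struct

QUIET_NAN_BITS = 0x7FF8000000000000  # exponent all ones, quiet bit set
PAYLOAD_MASK = 0x0007FFFFFFFFFFFF    # the remaining 51 fraction bits
MISSING_PAYLOAD = 0x7A2              # hypothetical tag meaning "missing data"


def make_tagged_nan(payload):
    """Build a quiet NaN carrying `payload` in its fraction bits."""
    bits = QUIET_NAN_BITS | (payload & PAYLOAD_MASK)
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def nan_payload(value):
    """Read the fraction-bit payload back out of a float."""
    bits = struct.unpack("<Q", struct.pack("<d", value))[0]
    return bits & PAYLOAD_MASK


missing = make_tagged_nan(MISSING_PAYLOAD)
is_nan = math.isnan(missing)              # still an ordinary NaN to the math library
is_missing = nan_payload(missing) == MISSING_PAYLOAD
```

One caveat: hardware is not required to carry a payload through arithmetic unchanged, so a tag like this is only reliable on stored values, not on the results of computations.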
Well, masked arrays are a very good solution for the right problem (i.e. temporarily or permanently flagging data as "bad" while preserving the original data). Not to rehash the old debate, but they're quite handy when you need them.
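A quick sketch of that use case with `numpy.ma`, assuming a sentinel of -999.0 marks a bad reading (the data and the sentinel are invented for the example):

```python
import numpy as np

# Sensor readings where -999.0 is a (hypothetical) "bad value" sentinel.
readings = np.array([1.0, 2.0, -999.0, 4.0])

# Flag the bad entry without overwriting it.
masked = np.ma.masked_array(readings, mask=(readings == -999.0))

mean_valid = masked.mean()     # computed over the three unmasked values only
valid_count = masked.count()   # number of unmasked entries
original_value = masked.data[2]  # the flagged value is still there underneath
```

Because the mask is stored separately from the data, the flag can be lifted later and the original value is recovered, which is something overwriting with NaN cannot do.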
I do agree that NaNs are a better choice for truly missing data, but I'm biased just because they use less memory. They're not a solution for non-floating-point data, though.
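Both halves of that can be seen in a few lines of NumPy. This is only a sketch with toy arrays: it shows an integer array refusing a NaN, an integer list with a NaN being upcast to float, and the extra bytes a mask adds on top of the data.

```python
import numpy as np

# An integer array has no bit pattern for NaN, so the assignment fails.
ints = np.array([1, 2, 3], dtype=np.int64)
try:
    ints[0] = np.nan
    assignment_failed = False
except ValueError:
    assignment_failed = True

# Mixing integers with NaN silently upcasts the whole array to float.
mixed = np.array([1, 2, np.nan])

# Memory: a NaN-marked float array versus data plus a boolean mask.
floats = np.array([1.0, np.nan, 3.0])
masked = np.ma.masked_array([1.0, -1.0, 3.0], mask=[False, True, False])
nan_bytes = floats.nbytes
masked_bytes = masked.data.nbytes + masked.mask.nbytes
```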
It'd be better if you lost the ragging on C#/C++/Java and went positive by accentuating the great abilities of interpreted languages like Python for rapid prototyping, which is what you are doing when you are iteratively improving analysis.
FYI it's fun to hear an academic ragging on "unmaintainable code".
Who's the academic you're referring to (if it's me, you're misinformed)?
One of the strengths of Python is that you can use it to build critical production systems (which I've done for many years in the financial industry). You come up against a lot of people who think "Java/C++/C# are the only suitable systems languages".
I use Python at work heavily. I also equally use Java, C++, C#, Ruby, and shell scripting. I use what's good for what I'm trying to do, and I like having several choices.
I'm merely pointing out that the language bashing is not productive. The writeup should point out the positives and stop trying to turn the differences between languages into a parallel of the state of American political discourse.
http://pandas.pydata.org/pandas-docs/dev/io.html#hdf5-pytabl...