
I find myself analyzing piles of JSON files, or single large JSON files, now and then, usually for one-off or infrequent analyses. If I'll never touch the data again, the first thing I usually do is collect it all, do whatever is needed to get it into a pandas dataframe, and then throw away the original files. Thankfully I haven't had an issue with the RAM this requires in the last few years (I'm working with 64 GB of RAM).
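For what it's worth, a minimal sketch of that loading step (the records here are made up; in practice they'd come from `json.load` over each file):

```python
import pandas as pd

# Hypothetical records, as if parsed from a pile of JSON files.
records = [
    {"id": 1, "user": {"name": "a"}, "score": 10},
    {"id": 2, "user": {"name": "b"}, "score": 20},
]

# json_normalize flattens nested objects into dotted columns like "user.name".
df = pd.json_normalize(records)
print(df.columns.tolist())
```

After this the original files can go away; everything lives in `df`.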

If I think I'll have to do at least a few more analyses in the future, perhaps with a growing dataset, I'll usually put the data into SQLite. If possible I try to keep it simple, with a single table, even if that means a non-normalized schema. As for tooling, I typically go with `dataset`, a SQLAlchemy wrapper that's super simple to use and still lets me drop down to raw SQL when I need to. I haven't fully explored the JSON capabilities of SQLite itself, but I've been meaning to. If DuckDB gets similar features, that would certainly be worth looking into.

In terms of "meta-format", I usually like object-per-line (JSON Lines) or a single array of objects. It's easy to add more records and pretty self-explanatory. It may be inefficient, but if that becomes an issue, it's time to move away from plain JSON anyway.
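The object-per-line format needs nothing beyond the stdlib, which is part of the appeal (the filename and record here are placeholders):

```python
import json

path = "events.ndjson"  # hypothetical filename

# Appending a record is just writing one more line -- no need to
# re-parse or rewrite the existing file.
with open(path, "a") as f:
    f.write(json.dumps({"event": "click", "ts": 1}) + "\n")

# Reading back: each line is an independent JSON document.
with open(path) as f:
    records = [json.loads(line) for line in f if line.strip()]
```

Since each line parses on its own, you can also stream through huge files without loading them whole.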

As for query languages, I usually don't do anything too complicated, so I don't think much about it. Having SQL (as when using `dataset`) is nice. I also have to use MongoDB for some tasks, and I find its query language good enough for most of what I need, since I'm not usually dealing with highly relational data.
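On the SQLite JSON capabilities mentioned above: a small sketch of what querying stored JSON text looks like, assuming your SQLite build includes the JSON functions (the bundled one in recent Python versions does):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE docs (body TEXT)")
con.execute(
    "INSERT INTO docs VALUES (?)",
    ('{"user": {"name": "a"}, "score": 10}',),
)

# json_extract pulls values out of the JSON text with a path expression,
# so nested documents can be queried without a normalized schema.
row = con.execute(
    "SELECT json_extract(body, '$.user.name'), json_extract(body, '$.score')"
    " FROM docs"
).fetchone()
print(row)
```

That gets you Mongo-ish document queries while keeping plain SQL for everything else.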


