>Storing every package tarball as an attachment on a CouchDB document is painful. It works fine for a few hundred docs, but at well over a quarter million package versions, this makes view generation and compaction much too onerous, and the file becomes too large to ever successfully compact.
How big is the average npm package? Is this issue due to CouchDB's treatment of attachments, or simply due to the size and volume of the packages themselves?
>On creation, attachments go into a special _attachments attribute of the document. They are encoded in a JSON structure that holds the name, the content_type and the base64 encoded data of an attachment. A document can have any number of attachments.
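For anyone who hasn't seen one, here is a rough sketch of the document shape that quote describes, assuming the usual inline form where each entry under `_attachments` is keyed by the attachment name. The package name, file name and bytes are made up for illustration, and nothing here talks to a real CouchDB.

```python
import base64
import json


def doc_with_inline_attachment(doc_id, name, content_type, data):
    """Build a document body carrying one inline, base64-encoded attachment."""
    return {
        "_id": doc_id,
        "_attachments": {
            name: {
                "content_type": content_type,
                # base64 turns every 3 raw bytes into 4 ASCII characters
                "data": base64.b64encode(data).decode("ascii"),
            }
        },
    }


# Hypothetical package and payload, purely for illustration.
raw = b"fake tarball bytes"
doc = doc_with_inline_attachment(
    "example-pkg", "example-pkg-1.0.0.tgz", "application/octet-stream", raw
)
encoded = doc["_attachments"]["example-pkg-1.0.0.tgz"]["data"]

print(json.dumps(doc, indent=2))
print(len(raw), "raw bytes ->", len(encoded), "base64 characters")
```

The base64 step alone makes the upload body about a third larger than the raw tarball, which is separate from whatever the database does with the bytes once they are stored.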
Attachments are by default stored inline in the document itself.
Why upload attachments to the skim db and then (more or less) immediately pull them off onto manta? Is that so the client upload software doesn't have to change? If that's the case, would it make sense for the client upload software to change now so it can upload directly to the separate attachment store?
Good question! There are a few benefits to doing it this way.
1. Backwards compatibility, as you mention. We do plan on re-evaluating and seeing what client changes would allow us to do this more efficiently, but there's a big time lag on rolling out a new npm and people actually adopting it. It's safe to say that someone out there will still be using the current release in 2 years or so, so if we can keep it working, then that's a friendly thing to do.
2. If the Skim Worker daemon falls over, the attachments are still going somewhere, and we can always have it catch up later. Apart from disasters, it also means we can treat this daemon a bit roughly. If we change a config or spin up a new one, or otherwise mess with it, no biggie.
3. In the race where you publish, and then someone fetches it right away, before the Skim Worker gets to it, the Fastly configs can detect the 404 and pull it out of the DB directly.
4. If Manta ever goes down, the Skim Worker will start failing (and pinging nagios, of course) and if need be, we can have the binary GETs go first to FullfatDB and subsequently to SkimDB. That Fastly config change takes about 30s to roll out, so we can mitigate downtime very quickly.
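To make points 3 and 4 concrete, here is a minimal sketch of the "try one store, fall through on a 404" idea. The real logic lives in the Fastly config, not in Python; the function name, the in-memory dicts standing in for Manta, SkimDB and FullfatDB, and the tarball path are all hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

# A fetcher takes a tarball path and returns its bytes, or None for a 404.
Fetcher = Callable[[str], Optional[bytes]]


def fetch_with_fallback(
    path: str, backends: Sequence[Tuple[str, Fetcher]]
) -> Tuple[str, bytes]:
    """Try each backend in order and return (backend_name, body) for the first hit."""
    for name, fetch in backends:
        body = fetch(path)
        if body is not None:
            return name, body
    raise LookupError(f"{path} not found in any backend")


# Hypothetical in-memory stand-ins for the real stores.
path = "/example-pkg/-/example-pkg-1.0.0.tgz"
manta = {}                       # the Skim Worker has not copied the tarball yet
skimdb = {path: b"tarball"}      # the attachment is still on the document
fullfatdb = {path: b"tarball"}

# Point 3: normal order, falling back to the DB when Manta returns a 404.
normal = [("manta", manta.get), ("skimdb", skimdb.get)]
served_by, body = fetch_with_fallback(path, normal)

# Point 4: with Manta down, only the ordering of backends changes.
manta_down = [("fullfatdb", fullfatdb.get), ("skimdb", skimdb.get)]
served_by_outage, _ = fetch_with_fallback(path, manta_down)

print(served_by, served_by_outage)
```

The point of writing it this way is that the mitigation is just a different list of backends, which is roughly why a config change is enough and the daemon itself can be treated roughly.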
Eventually, we'll probably restructure the PUT endpoints so that it's a bit more clever, but still maintain backwards compatibility in our public API surface.
Totally off topic, but it has always bugged me: why does the npm site display email addresses in clear text on profile pages? I don't see the added value in doing that (except for making email harvesters' job easier).
Feel free to chime in if you have a suggestion or even submit a patch; npm Inc only has 4 full-time engineers right now (we're hiring!) so it will be a while before we get to this otherwise :-)
In general, hiding email addresses doesn't usually make the job of harvesters appreciably harder, and does make the life of genuine users a bit more painful.
We'll probably start hiding email addresses altogether once we have a messaging system so that it's still easy for npm users to contact one another when they need to. Until then, I'd accept a patch to do the standard silly "hiding" thing with some JS that shows it on the page.