New Pip resolver takes a long time to complete

sjburt · on Dec 3, 2020

They need to store the version requirements metadata outside the packages. Having to download the entire wheel just to see whether it is compatible is ridiculous. This could all be computed by downloading an index file, then performing the resolution.

They made a half-assed attempt when pypa/pip added the "data-requires-python" tag, but that covers python interpreter version only, it needed to have been done for all dependencies.

What's especially irksome is that Debian and RPM both solved this 20+ years ago and python has refused to learn any lessons.

jrochkind1 · on Dec 3, 2020

rubygems went through several iterations of API to try to deal with this, including yes, starting with a dependency-info-only API.

I wish different language/platform communities were better at learning from each other (and this does go in all directions). In the field of software these days, we don't do much learning from prior art. It's just too hard to keep up with it all.

Heck, I recall this happening with rubygems -- the introductio of the dependencies API to efficiently get the minimum data needing for resolving dependencies before downloading actual packages, and then several iterations on it -- and I can't find any actual documentary evidence of it to share right now. I don't know how anyone WOULD use it as prior art; the source code for a fairly complex project in a language you aren't familiar with isn't going to work, even if you knew to go look for it, which why would you.

orf · on Dec 4, 2020

Pypi does have a json API that returns package dependencies, if the uploaded file is compatible.

See the “requires’ key here: https://pypi.org/pypi/Django/json

woodruffw · on Dec 3, 2020

This is the subject of discussion in this Warehouse (the PyPI backend) issue[1].

The TL;DR is that it probably requires a PEP.

FD: I've been working on adjacent features in PyPI.

[1]: https://github.com/pypa/warehouse/issues/8254

danmur · on Dec 4, 2020

Same reason poetry has slow (but accurate) dependency resolution. It would be nice to have as part of the repo API.

cozzyd · on Dec 3, 2020

yeah, why not just use libsolv or something?

mst · on Dec 3, 2020

I built a resolver like this for perl/CPAN at once point.

Then realised it would have this problem a few days before I'd planned to release, and didn't ship it.

I am feeling very lucky right now.

looperhacks · on Dec 3, 2020

Did I get it right, the problem is basically that the dependency info is turing complete and decided "at installtime", thus the dependency graph cannot be computed without installing many versions of all dependencies? (I do not use python, so I don't know much about how pip works)

Are there other languages that have dependencies decided at installtime?

cipherboy · on Dec 3, 2020

Most distros have install time dependency resolution, such as yum/dnf using libsolv. Repos are a collection of packages alongside metadata tables containing package versions and their dependencies. The tables get cached locally and are used to resolve dependencies prior to downloading any actual RPMs.

JackC · on Dec 3, 2020

Right, for packages distributed as source (vs. having prebuild "binary wheels") the dependencies are specified in python code. Example from one comment on the bug:

  setup(
    install_requires=[random.choice(["urllib3", "requests"])]
  )

This example wouldn't make any sense, but you could imagine installing different dependencies for x86 CPUs or something via a runtime check, and there are lots of packages that use this for checking python versions even though there's now a static way to do that.

So that leads to this situation, from another comment:

"And as an example, [the botocore package, which has releases nearly daily] depends on python-dateutil>=2.1,<3.0.0. So if [your dependency constraints are] to install python-dateutil 3.0.0 and botocore, pip will have to backtrack through every release of botocore before it can be sure that there isn't one that works with dateutil 3.0.0. ... And worse still, if an ancient version of botocore does have an unconstrained dependency on python-dateutil, we could end up installing it with dateutil 3.0.0, and have a system that, while technically consistent, doesn't actually work."

Sounds like there's a long term plan that could fix this situation. Binary wheels already have the needed metadata in a way that could be exposed by PyPI via a fast "fetch dependency constraints for all versions" API, but isn't yet. And for source dists there's a very new plan ( https://www.python.org/dev/peps/pep-0643/ ) to let them indicate that they don't modify install_requires at runtime, so their deps could also be exposed via API, but the ecosystem will have to catch up with that.

I dunno what pip does in the meantime, though!

(I'm a python dev but haven't followed this beyond skimming the bug, so I hope I'm getting this right.)

vmchale · on Dec 3, 2020

> dependency info is turing complete

It's NP-complete in general I think, definitely the case in Haskell-land. I think OCaml sets an explicit timeout for dependency resolution?

> the dependency graph cannot be computed without installing many versions of all dependencies?

Does PyPi not have an index or something? (with statically known package bounds?)

rdlfo · on Dec 3, 2020

Perhaps something like mamba [1] could speed up the resolver. That is if the resolver is actually achieving anything at all.

[1] https://github.com/mamba-org/mamba

_3sno · on Dec 3, 2020

My understanding is mamba, like conda, just call pip. So it likely wouldn't make a difference.

The pip section in a env file is just a list of arguments passed through to the pip install command. Prior to pip 20.3 we had to add `--use-feature=2020-resolver` to get an install that resolved for our teams that used mamba.

martin_renou · on Dec 4, 2020

Mamba and conda do not call pip.

You can install a pip package inside a conda environment. But when running `mamba install` or `conda install`, pip is not involved at all.

rdlfo · on Dec 4, 2020

I got a bit confused by the statements in the thread. If the issue is downloading wheels to check dependencies, mamba should be a better alternative, as not only it is downloading packages in parallel but it uses a different dependency solver.

Iwan-Zotow · on Dec 3, 2020

No, conda is not calling pip

_3sno · on Dec 4, 2020

You're wrong, it does.

Conda installs conda packages and conda uses pip to install pip packages. However a pip package can be converted to a conda package, and then in that case the dependency will be installed by conda and not pip.

martin_renou · on Dec 4, 2020

You're wrong, it does not.

You can install a pip package in a conda env. But this is actually not recommended.

When using `mamba install` or `conda install`, pip is not involved at all.

uranusjr · on Dec 4, 2020

You’re correct, Conda does not call pip when it installs packages. But pip does have a subtle involvement here: Many Python packages available to Conda are packaged based on installations made by pip, and (IMO lazily and incorrectly) inherits a lot of pip characteristics. This makes the packages “appear” to be installed by pip, and sometimes induce interpolation inconsistencies.

cbaines · on Dec 3, 2020

I used to think package managers and dependency resolvers were inseperable, I even did a little bit of work on the APT resolver.

Since using GNU Guix though, I'm so glad it doesn't have a dependency resolver as part of building or installing packages! It's so much better for it, no slow or unpredictable resolving, you know what it's going to do.

I think this is one reason why I've never used pip for managing Python software, I've only ever used Debian, and then Guix.

ryukafalz · on Dec 4, 2020

As a Guix user/occasional packager who has run into broken packages a few times (including Python ones)... I do wish it had a dependency resolver that could at least be run outside the build/install process, e.g. when updating libraries.

If I'm updating a library that has a bunch of dependent packages, it's hard to know whether or not something will break downstream. Sometimes this is unavoidable and you really do need to just test everything thoroughly, but sometimes the dependent packages have more knowledge of what library versions they need... and Guix doesn't seem to be aware of this.

FRidh · on Dec 4, 2020

Exactly. Python maintainer of Nixpkgs here.

What we need is to be able to use a resolver such as included by pip or poetry, to build up our package set. In Nixpkgs this is nowadays unfortunately done manually, and I suppose the same goes for Guix. In Nixpkgs the reason is simple: too eager pinning makes it impossible to resolve a package set that works with the entire set.

Now that pip has a resolver what is needed is a way to use constraints not to set only lower and upper bounds, but to enforce a version when resolving. That makes it usable for downstream integrators to construct their primary package set. One could then even make the next step and construct "stable" sets that extend the primary set.

yjftsjthsd-h · on Dec 3, 2020

Wait, how does guix avoid a dependency resolver? It still has dependencies for packages... are they just hardcoding exact dependencies (including versions) and relying on the "can install multiple versions of a package" property to make that work? That seems inefficient, though I guess maybe they need that for the "this hash means this exact binary" outcome?

cbaines · on Dec 3, 2020

Guix has packages, and packages have inputs (like dependencies), and you're right in that normally package definitions specify the exact dependencies it the code (they're hardcoded).

Guix package definitions are truely code though, so if you want to generate packages on the fly by using a dependency resolver, you can totally write some code to make that happen.

With respect to inefficiency, what do you mean? It's quite time efficient when building and installing to not have to attempt to resolve dependencies.

yjftsjthsd-h · on Dec 4, 2020

> With respect to inefficiency, what do you mean? It's quite time efficient when building and installing to not have to attempt to resolve dependencies.

It's going to burn disk space like crazy, isn't it? If I install packages foo-1.0 and bar-1.0 and foo-1.0 uses glibc-2.31 and bar-1.0 uses glibc-2.30 then I now have 2 versions of glibc... but that scales to every package I install and every library every one of them uses. Of course, this can be fixed... by automatically building new versions of every package any time any of their dependencies changes, in which case we're probably not wasting much disk because we traded and are now burning CPU time like there's no tomorrow (and network, and disk I/O, and memory, and anything else used in package builds). Basically, this sounds like reinventing static binaries, with all the downsides thereof.

cbaines · on Dec 4, 2020

Normally you'd just be using one glibc version. Packages that you installed a while ago may be using a different one, but it's not like every package has massively divergent dependencies.

Because of the immutable store, you can do file level deduplication, so if you have multiple versions of the same packages, you can deduplicate the identical files.

I think the worries about disk usage are relevant, but only on systems with small amounts of storage. These are still relevant though, and it's an important area to improve on. As for burning CPU time, I don't think there's a perfect solution to avoid this, but I think Guix is pretty good. Guix provides substitutes, so you don't have to build things locally on every machine (I'm looking at you Rubygems, pip and Python stuff is pretty bad also).

jrochkind1 · on Dec 3, 2020

If they are specifying exact hard-coded dependency versions... if a X -> Y -> A, and B -> A, and M -> N -> A... and a security patch is released for A, then X, Y, B, M and N all need new releases to specify new exact dependencies, in order to use the new A'?

It's disastrous for security patches, only highly inconvenient for things like performance improvement releases. But this is why we have dependency resolving, right? What am I missing?

cbaines · on Dec 4, 2020

No, you just update the package definition for A and release the updated package definitions, people will then be using the updated A, whether directly or through other packages (X, Y, B, M, N, ...).

There is an issue here of rebuilding all those dependent packages with the updated A, especially if it's something like glibc. Guix includes a mechanism called grafts that allows for package replacements, which allows avoiding this, and this is often used for releasing security fixes.

jrochkind1 · on Dec 4, 2020

I think I must be misunderstanding. When you said:

> you're right in that normally package definitions specify the exact dependencies it the code (they're hardcoded).

I thought that meant that package X might specify a dependency on A version 2.4.2 exactly. So if you want it to use A 2.4.3 instead, a new release of package X would have to be created, specifying 2.4.3.

But I think this is not in fact what you mean? In which case I don't yet understand the system you are describing and what you mean by not doing dependency resolution.

microtonal · on Dec 4, 2020

I thought that meant that package X might specify a dependency on A

In Nix and Guix, you would indeed not specify that X needs package A version 2.4.2. You could see it as X using the 'build recipe' of A as its dependency. So, if you bump the version of A to 2.4.3, then all packages that use A as a dependency will be rebuilt (or substituted from the a binary cache if the Guix/NixOS build infrastructure has already built the updated packages).

jrochkind1 · on Dec 4, 2020

Huh, in that case I have the converse question -- when specifying dependencies for X, can you not say A 1.x but not 2.x, because 2.x has or is expected to have backwards breaking changes? Otherwise, when A releases 2.0 with some backwards incompatibilities, and all packages that use A as a dependency are rebuilt, don't they break?

These issues (allow updates within limits) are what I understand as the point of dependency resolution, I'm trying to understand how you do without it.

microtonal · on Dec 5, 2020

These issues (allow updates within limits) are what I understand as the point of dependency resolution, I'm trying to understand how you do without it.

In such a case e.g. nixpkgs makes two attributes: a_1 and a_2 (and alias a to a_2). Packages that still require A 1.x will have a_1 as one of their dependencies, the rest a.

This is avoided as much as possible, but is sometimes necessary. Common examples are Gtk 2 applications or C/C++ applications that can only be built against a Python 2 interpreter.

dang · on Dec 3, 2020

Recent and related: https://news.ycombinator.com/item?id=25253236

qz2 · on Dec 3, 2020

It'll still be faster than NuGet. I had to wait 12 minutes (!) to install a package the other day while it resolved dependencies on a single package.

riedel · on Dec 3, 2020

Maybe by fixing the solver they can finally prove P=NP ;)

oehtXRwMkIs · on Dec 3, 2020

I think that will be proven before python fixes their package management.

noitpmeder · on Dec 3, 2020

At least they are trying to fix it, vs other languages where there are literally no paths forward. IMO python has one of the more mature packaging ecosystems out there at this point. It is so incredibly easy.

cozzyd · on Dec 3, 2020

But why should package management be language-dependent? No reason to stay within the confines of one programming language...

nawgz · on Dec 3, 2020

>IMO Python has one of the more mature packaging ecosystems out there

Ouch, what languages are you using where this is true? For example, JS / NPM / Yarn are absolutely blowing pip out of the water. I guess I can imagine Java or C++ users having your perspective though

uranusjr · on Dec 4, 2020

Every language community has its own priorities, and the package manager grows different traits to meet community needs. Those traits would come at a price, and different comunnities (and their package managers) would make different tradeoffs due to their different priorities. Python has a very involved history dealing with platform-native stuff, and provide a lot of convinience aroud specifying, providing, and obtaining those native stuff. But the complexity would leak into platform-agnostic packages, and gives an impression that packaging is worse when “all you want” is some .py scripts. JavaScript provides a much better interface for packaging and installation, but the expense is that native modules are much more difficult to package, and tend to fail spectacularly on installation when something does not work out. Many more things are like this from each side.

There are a lot of smart people working on these things in each ecosystem, and when you think some package manager is far surperior over others in every way, you are more than often simply wrong. Or saying it the other way (and paraphrasing a commenter from another thread), the only package manager you think is good is from the ecosystem you are not deeply familiar with.

nawgz · on Dec 4, 2020

>The only package manager you think is good is from the ecosystem

Or maybe, the only package manager you think is good is the one that has features you value and doesn't fail in ways you wouldn't expect?

For example, I have been trying to work with numpy and pandas a bit - two of the biggest Python libraries - and use them on my MacBook (itself a popular item). These installs fail, in the middle of an ungrokable stack of install logs. I have to shuffle thru useless messages to eventually track down the source, and then try to find a new compatible version. So sure, maybe there's native code in there, but I think claiming "useful summaries are not the package manager's job" is silly

But I also have had other terrible experiences: packaging is a bit overwrought; the terrible import/export/module system in Python means your dependency names have nothing to do with where you import from; SSL Certificate errors crop up at random on domains with valid certificate.

I am sure it's good enough to use, but I think it's a bit pie-in-the-sky to claim all package managers are good and you just have to be in the community. Shitty software exists

FloatArtifact · on Dec 3, 2020

Distribution still is a pain unless you're going through pip.

pydry · on Dec 3, 2020

who doesn't use pip?