> (no file system would need a path longer than 260 characters, right? :o) )
This is poking fun at Windows' PATH_MAX, but it's very easy to hit BINPRM_BUF_SIZE's 128 character path limit for #! lines; maybe not so much for humans, but with CI systems installing things in weird deep directories (e.g. virtualenvs ... see https://github.com/pypa/virtualenv/issues/596).
So the old adage about glass houses and stones is still good advice ;)
Unless they split the line without whitespace, that example is already non-portable because it's not well defined what happens to shebangs with more than one argument. IIRC linux merges all of them into a single argument, but other unixen do different things.
I had to check to see if it was a repost of e.g. an April joke proposal or something like that.
Nope. One guy wants to arbitrarily limit assorted Python things to 1,000,000, cites vague potential optimization advantages (in CPython, yeah, right...), and doesn't even want to detail the impact of said optimizations when asked.
IMO this is just a waste of time for everybody and should be firmly rejected ASAP. Even just proposing to limit to 1M "source" lines is extremely naive and the sign of somebody without much real-world experience: there are many cases where "source" code is not the actual source and has orders of magnitude more lines.
I still don't understand why anyone would want this. It seems pretty hostile to the users of Python without any real benefit. Maybe the benefit is the people who are pushing for the change get to smugly know that people won't do horrible things with the language. Maybe I don't get it, but... why?
It’s about having a complete specification so that the compiler and VM can be correct.
Correctness when you have arbitrarily long input gets complicated very easily. I was originally skeptical when reading, it seemed ridiculous and arbitrary, until I realized that’s what they were going for.
It is not. Python already doesn't have a spec, it has the CPython interpreter. This is repainting the facade while you still have a sump pump that isn't working.
There are a myriad of better ways to improve Python and this isn't it.
> Correctness when you have arbitrarily long input gets complicated very easily.
Indeed. Accepting arbitrary inputs frequently requires an arbitrary amount of resources. Even if there are no vulnerabilities to exploit, a denial of service attack may still be possible because processing untrusted input can result in massive memory allocations.
These variable length integers may consume an unbounded amount of memory when decoded. I think it's wise to constrain this by default and let the caller increase or turn off the limits if they want:
decode(buffer) # max=9, per spec
decode(buffer, max=64)
decode(buffer, max=math.inf)
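One way to realize the API sketched above (the `decode` name and `max` parameter come from the comment; the LEB128-style encoding and the default of 9 are assumptions):

```python
import math

def decode(buffer, max=9):
    """Decode one unsigned LEB128-style varint from `buffer`.

    Refuses inputs longer than `max` bytes; pass max=math.inf
    to lift the cap entirely (at your own risk).
    """
    result = 0
    shift = 0
    for i, byte in enumerate(buffer):
        if i >= max:
            raise ValueError(f"varint longer than {max} bytes")
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:          # high bit clear: last byte
            return result
        shift += 7
    raise ValueError("truncated varint")
```

With a default cap in place, a hostile 20-byte input raises instead of allocating a 140-bit integer; callers who genuinely need big values opt in explicitly.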
> Picking an arbitrary limit less than 2^32 is certainly safer for many reasons, and very unlikely to impact real usage. We already have some real limits well below 10^6 (such as if/else depth and recursion limits).
> A range of -1,000,000 to 1,000,000 could fit in 21 bits, and three of those values could be packed into a 64-bit word.
In theory this could bring security and performance benefits. In practice, if you claim to be limited by MAX_UINT, you're likely to encounter issues if you try to get close to that limit without very careful coding around overflow.
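For illustration, the 21-bit packing from the quote can be sketched in Python (the field layout and two's-complement sign handling are my assumptions, not the PEP's):

```python
BITS = 21
MASK = (1 << BITS) - 1  # 21-bit field mask

def pack3(a, b, c):
    """Pack three signed values in [-1_000_000, 1_000_000] into one
    64-bit word, 21 bits per field (two's complement within a field)."""
    word = 0
    for shift, v in zip((0, BITS, 2 * BITS), (a, b, c)):
        assert -1_000_000 <= v <= 1_000_000
        word |= (v & MASK) << shift
    return word

def unpack3(word):
    """Recover the three signed fields from a packed word."""
    out = []
    for shift in (0, BITS, 2 * BITS):
        v = (word >> shift) & MASK
        if v >> (BITS - 1):          # sign-extend the 21-bit field
            v -= 1 << BITS
        out.append(v)
    return tuple(out)
```

Note the sign-extension step on unpack: that masking and branching is exactly the "careful coding around overflow" cost the comment above is pointing at.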
If we lowered the speed limit of highways to 25 mph, we would prevent many highway deaths, and could narrow the lanes to fit 4 lanes where previously there were 3 lanes.
Extended reading on this? I found myself chuckling at the absurdity of this at first, and then kept reading finding out that this is actually quite a good idea. What topics would this come under? General compiler architecture?
This could be easily worked around: having a huge string to exec would not trigger any of the proposed limits. But I'm not sure useful apps are packed that way. Multiple namespaces in a single file sound pretty tricky to do/manage.
* The steering council decided that they will make the final decision on it directly. [1]
* There has definitely been a less than enthusiastic response from the Python devs overall and from some of the steering council members specifically, including GvR [2]
Good old ActivePerl 5.10.1 does it in about 1.8 seconds on my Windows laptop, while Strawberry Perl 5.26 is taking quite a while (multiple minutes already). I wonder why there's such a difference.
Program for reference.
use strict;
use warnings;
my $max_i = 1_000_000;
for my $i (0 .. $max_i) {
    print "my \$x$i = $i;\n";
}
Edit: It took 80 minutes! Wow. There must be something significantly different going on, but I wouldn't know where to look.
- Running `lua5.3 big.py` takes about 1/6 the time on my Linux VM. (I had to use a big.py only half the size because this VM was too small for the original.)
So Python isn't the fastest at this but I wouldn't call it terrible. Basically all of that time was parsing/compiling, since the time is the same with big.py changed to `def foo(): x0=0 # etc.`
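For reproducibility, the big.py being timed can be regenerated with something like this (the exact contents of the original file are an assumption; per the comments above it's a million sequential assignments):

```python
# Write a million sequential assignments, one per line,
# to approximate the big.py benchmarked above.
N = 1_000_000
with open("big.py", "w") as f:
    for i in range(N):
        f.write(f"x{i} = {i}\n")
```

Then `time python3 big.py` measures the parse/compile/execute cost end to end.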
a million sequential memory accesses should not take 5 seconds. Maybe 5 ms. Most likely the time is spent on VM /interpreter startup and shutdown, OS overhead, etc.
Given the downvotes, I'm fairly certain I missed the direction of discussion. 5 seconds is an astronomical amount of time to do something as described in the parent comment.
Is 5 seconds such a long time really? For parsing a file, then creating and evaluating 1 million local variables, all inserted into the local namespace?
Made me wonder, how long does it take to compile a million line long C program? Let's see:
N = int(1E6)
print("""#include <stdio.h>
int main() {
""")
for x in range(N):
    print(f"int x{x}={x};")
print("""
return 0;
}
""")
now generate the code and compile it:
time gcc big.c
drumroll ... ummm ... look at that ... doesn't finish, hangs forever (or at least longer than I cared to stick around, which was 10 minutes) ...
notably, for 10,000 variables it takes just 0.3 seconds, but raising that number one order of magnitude to 100,000 makes it hang, at 100% CPU and 3% memory used. Something has awful scaling behavior behind the scenes.
But now we're timing an optimizing compiler, and not an interpreted program of 1 million lines. I understand that you couldn't generate a program of 1 million lines, but still, I would expect compilation to take longer than execution for just about any non-looping program. Again, I am missing the direction of this discussion.
Parts of compilation (optimal register allocation, for instance) are technically NP-hard, meaning it can scale very poorly, as you described, whereas interpretation avoids that work (no register allocation, etc.).
The main purpose of the exercise was to explore what happens when you throw 1M lines of code at a language; our estimates of what happens could be wildly inaccurate.
I was pleased with Python finishing in 5 seconds because I thought it might not even work at all due to some internal implementation detail that I was not aware of - like the ruby example that fails with a stack error. I am pleased to see that Python can handle it.
As for the GCC compilation I find it quite unexpected that it compiles 10K lines in just 0.3 seconds, but then seems to hang on 100K lines. Not so easy to explain.
Remember how, before Unix, file systems used to enforce what went into a file instead of letting you store whatever you wanted? That kind of suffix/file-type validation carried over in part to early Windows. People were sufficiently annoyed to create Unix, which had no such BS.
Fewer rules are better than more rules. You can always put whatever you want into best practices, but you can't expect people to follow hidden rules like this one.
The point here isn't that they are adding restrictions. Those restrictions are already there whether you like them or not, and in many cases they are quite arbitrary. This is a proposal to standardize many of the restrictions on a single arbitrary number, one that is large enough to account for most people's needs and small enough to make implementation easier. One of the motivations spelled out in the article is precisely to avoid hidden rules.
PyPy is in a spot where, if you're after speed, you would do it in another language, while the package accessibility that CPython has is gone. I don't see it going mainstream, but it's a nice project, and I can see why it has more limitations.
Following your argument I should be able to write any random words to a file and expect the Python interpreter to run it as the program I intended?
Don't see it as a forced rule to restrict people (I can't really see how 1 million lines restricts handwritten code anyway, and generators can work around the problem), but as a contract that sets the boundary on what people can expect. Python interpreter developers know they don't have to handle arbitrarily large programs and can optimize their code for everything below 1 million, and Python script/app developers know within what limits they can expect their code to work.
Exactly: you can put arbitrary things in a .py file, and if it doesn't run, it's your problem.
Unix didn't limit what you can put in a file, but it doesn't promise your program will work when you fill it with random strings.
The point is, the interpreter developers can feel free to optimize for everything below 1 million; putting a hard cap is pointless. It's the same reason most shell scripts are suffixed with .sh despite not needing the suffix at all.
The problem with 1M lines of code as a limit is that some ML tools actually export a hyper-fast-executing C (or other language) program that emulates the built/trained model.
I don't know why people would assume that's enough of a limit if there really isn't a need to limit that aspect.
8 maximum parameters on a function
96 maximum constants
32 max globals
24 max locals
The self-flagellating q developer will enjoy these limits under the guise of good coding practice enforcement (if you have more than 24 variables in your function, you should split your function up).
Are there common assembly/CPU native ways of partitioning a 64 bit register to store multiple smaller size values and do work on them? (I haven't written assembly since some MIPSs in college.)
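Yes: besides dedicated SIMD instructions (SSE/AVX on x86 operate on 8/16/32/64-bit lanes of wide registers), there's the "SWAR" trick (SIMD within a register) of doing lane-wise work in an ordinary 64-bit register with masks. A sketch of the technique in Python, since that's the language at hand (this demonstrates the bit manipulation, not any particular ISA):

```python
L = 0x7F7F7F7F7F7F7F7F  # low 7 bits of every 8-bit lane
H = 0x8080808080808080  # high bit of every 8-bit lane

def add8_lanes(a, b):
    """SWAR: add eight 8-bit lanes packed in one 64-bit word,
    keeping carries from crossing lane boundaries."""
    low = (a & L) + (b & L)       # per-lane sum of the low 7 bits
    return low ^ ((a ^ b) & H)    # restore each lane's high bit
```

Each lane wraps modulo 256 independently; overflow in one lane never leaks into its neighbor, which is the whole point of the masking.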
21 bits seems weird if the goal is optimization. 24 (a multiple of 2, 4 and 8) would seem more natural and incrementally less controversial (8 MLOC vs 1, assuming signed line numbers)
You can't efficiently address 21-bit values. You can efficiently address 8-, 16-, 32-, and 64-bit values, and some architectures can efficiently address 24-bit values. No architecture makes it efficient to address 21-bit values. If you wanted to pack stuff more tightly you could pack eight 24-bit values into 192 bits; all of those are multiples of one byte. But even that is dumb.
On the scale of a program with one million lines of code, those extra 11 bits constitute a space savings of .00001%, at a not-inconsequential time cost to access weirdly aligned values. Thanks for the byte? Am I allowed to spend it all in one place?
Honestly, were I charged with implementing this feature I would ignore it in the name of performance and just stick those line refs in a 64 bit pointer to the line of code/bytecode. I'm honestly not even seeing that the proposed advantages to this idea are advantageous, even in the absence of the catastrophic negatives.
But what proportion of the Python interpreter's memory is consumed storing line numbers? How much slower will it run if it has to unpack the numbers any time it needs to use them? Does Python even store triplets of line numbers? (This seems unlikely to me, so to achieve any saving at all will require invasive code changes.) These questions could be answered with hard data, but the proposal has no data at all.
> But what proportion of the Python interpreter's memory is consumed storing line numbers?
I don't think the packing part was talking just about line numbers. They propose a number of things that would be limited to 1e6 elements. The question about performance is a good one and needs to be studied. This proposal is arguing from a different perspective though, one that focuses on reducing ambiguity rather than optimizing for performance. Python as a language is an exercise in sacrificing performance for other objectives, so I think that's fine for a language like this to consider this question even if it means performance could be impacted negatively.