One million lines of code ought to be enough for anybody (lwn.net)
111 points by chmaynard on Dec 18, 2019 | 57 comments


> (no file system would need a path longer than 260 characters, right? :o) )

This is poking fun at Windows' MAX_PATH, but it's very easy to hit BINPRM_BUF_SIZE's 128-character path limit for #! lines; maybe not so much for humans, but with CI systems installing things in weird deep directories (e.g. virtualenvs ... see https://github.com/pypa/virtualenv/issues/596).

So the old adage about glass houses and stones is still good advice ;)


Nix developers ran into this very limit.

https://lwn.net/Articles/779997/


Unless they split the line without whitespace, that example is already non-portable because it's not well defined what happens to shebangs with more than one argument. IIRC linux merges all of them into a single argument, but other unixen do different things.
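A quick way to see the merging behavior (a Linux-specific sketch; /bin/echo stands in for a real interpreter, and the temp-file approach is just for illustration):

```python
import os
import stat
import subprocess
import tempfile

# Write a script whose shebang line appears to carry two arguments.
# On Linux, everything after the interpreter path is handed over as
# ONE argv entry; other unixes split or drop the extras.
with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
    f.write("#!/bin/echo -n hello\n")
    path = f.name
os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)

out = subprocess.run([path], capture_output=True, text=True).stdout
# On Linux, /bin/echo receives argv ["-n hello", path]: "-n hello" was
# merged into a single argument, so echo prints it literally rather
# than treating -n as a flag.
print(out)
os.unlink(path)
```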


And people struggle to find examples of Linux bending over backwards for compatibility…


I didn't hit that until I started using node.js...

Then npm (versions < 3) installed modules in an insanely nested way and actually caused problems.


In Unix, there are lots of limits people eventually run into, and then either hack around or go back and painfully rearchitect:

- max size of a command line

- max open files

- kernel stack size

etc...
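Several of these can be queried from Python's standard library; a quick sketch (the values obviously vary per system, and RLIMIT_STACK covers the user stack, not the fixed kernel stack):

```python
import os
import resource

# Maximum bytes of arguments + environment passed to exec
# (the "max size of a command line" limit)
print("ARG_MAX:", os.sysconf("SC_ARG_MAX"))

# Per-process open-file limit, as (soft, hard) pair
print("RLIMIT_NOFILE:", resource.getrlimit(resource.RLIMIT_NOFILE))

# Per-process user stack size limit, as (soft, hard) pair
print("RLIMIT_STACK:", resource.getrlimit(resource.RLIMIT_STACK))
```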


I had to check to see if it was a repost of e.g. an April joke proposal or something like that.

Nope. One guy randomly wants to arbitrarily limit various Python things to 1,000,000, cites vague potential optimization advantages (in CPython, yeah, right...), and doesn't even want to detail the impact of said optimizations when asked further.

IMO this is just a waste of time for everybody and should be firmly rejected ASAP. Even just proposing to limit "source" files to 1M lines is extremely naive and the sign of somebody without much real-world experience: there are many cases where "source" code is not the actual source, and has orders of magnitude more lines.


I still don't understand why anyone would want this. It seems pretty hostile to the users of Python without any real benefit. Maybe the benefit is the people who are pushing for the change get to smugly know that people won't do horrible things with the language. Maybe I don't get it, but... why?


It’s about having a complete specification so that the compiler and VM can be correct.

Correctness when you have arbitrarily long input gets complicated very easily. I was originally skeptical when reading, it seemed ridiculous and arbitrary, until I realized that’s what they were going for.


It is not. Python already doesn't have a spec, it has the CPython interpreter. This is repainting the facade while you still have a sump pump that isn't working.

There are a myriad of better ways to improve Python and this isn't it.


> Correctness when you have arbitrarily long input gets complicated very easily.

Indeed. Accepting arbitrary inputs frequently requires an arbitrary amount of resources. Even if there are no vulnerabilities to exploit, a denial of service attack may still be possible because processing untrusted input can result in massive memory allocations.

A concrete example I've recently dealt with:

https://github.com/multiformats/unsigned-varint/blob/master/...

These variable length integers may consume an unbounded amount of memory when decoded. I think it's wise to constrain this by default and let the caller increase or turn off the limits if they want:

  decode(buffer) # max=9, per spec
  decode(buffer, max=64)
  decode(buffer, max=math.inf)
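A minimal sketch of such a bounded decoder, assuming the LEB128-style encoding the multiformats spec uses (the function name and signature here are made up for illustration):

```python
def decode_uvarint(buf: bytes, max_bytes: int = 9) -> tuple[int, int]:
    """Decode an unsigned varint, refusing inputs longer than max_bytes.

    Returns (value, bytes_consumed). The spec's default cap of 9 bytes
    bounds both the memory and the CPU spent on hostile input.
    """
    result = 0
    for i, byte in enumerate(buf):
        if i >= max_bytes:
            raise ValueError(f"varint longer than {max_bytes} bytes")
        result |= (byte & 0x7F) << (7 * i)   # low 7 bits carry payload
        if byte & 0x80 == 0:                 # high bit clear: final byte
            return result, i + 1
    raise ValueError("truncated varint")

print(decode_uvarint(bytes([0x80, 0x01])))  # (128, 2)
```

Raising or removing the cap is then an explicit decision by the caller, e.g. `decode_uvarint(buf, max_bytes=64)`.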


From the article:

> Picking an arbitrary limit less than 2^32 is certainly safer for many reasons, and very unlikely to impact real usage. We already have some real limits well below 10^6 (such as if/else depth and recursion limits).

> A range of -1,000,000 to 1,000,000 could fit in 21 bits, and three of those values could be packed into a 64-bit word.

In theory this could bring security and performance benefits. In practice, if you claim to be limited by MAX_UINT, you're likely to encounter issues if you get close to that limit without very careful coding around overflow.
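The packing the article describes reads roughly like this (illustrative only; the values are biased by 1,000,000 so the signed range fits in 21 unsigned bits):

```python
BIAS = 1_000_000       # shifts [-1e6, 1e6] into [0, 2e6]
MASK = (1 << 21) - 1   # 2**21 = 2,097,152 > 2,000,001 distinct values

def pack3(a: int, b: int, c: int) -> int:
    """Pack three values in [-1_000_000, 1_000_000] into one 64-bit word."""
    return ((a + BIAS) & MASK) | (((b + BIAS) & MASK) << 21) | (((c + BIAS) & MASK) << 42)

def unpack3(w: int) -> tuple[int, int, int]:
    return ((w & MASK) - BIAS,
            ((w >> 21) & MASK) - BIAS,
            ((w >> 42) & MASK) - BIAS)

print(unpack3(pack3(-1_000_000, 0, 1_000_000)))  # (-1000000, 0, 1000000)
```

Three 21-bit fields use 63 of the 64 bits, which is exactly the tight fit (and the unpacking shifts) being debated further down the thread.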


If we lowered the speed limit of highways to 25 mph, we would prevent many highway deaths, and could narrow the lanes to fit 4 lanes where previously there were 3 lanes.


Any extended reading on this? I found myself chuckling at the absurdity of it at first, then kept reading and found that this is actually quite a good idea. What topics would this come under? General compiler architecture?


Restricting memory usage of Python objects seems like a good idea.


List of languages that compile to python:

https://github.com/vindarel/languages-that-compile-to-python

This change is almost guaranteed to break most of these.


Also, aren't there packers that compress your entire Python project into a single file for easy deployment?


This could be easily worked around: having a huge string to exec would not trigger any of the proposed limits. But I'm not sure useful apps are packed that way. Multiple namespaces in a single file sounds pretty tricky to do/manage.


Just want to note that this is only a proposal, for which:

* The steering council decided that they will make the final decision on it directly. [1]

* There has definitely been a less-than-enthusiastic response from the Python devs overall, and from some of the steering council members specifically, including GvR. [2]

[1] https://mail.python.org/archives/list/python-dev@python.org/... [2] https://mail.python.org/archives/list/python-dev@python.org/...


It made me wonder how fast Python would parse a simplistic program with 1 million lines.

  N = int(1E6)

  for x in range(N):
      print (f"x{x}={x}")
generates lines like:

  x0=0
  x1=1
  x2=2
  ...
running it with:

  time python big.py
takes just 5 seconds - I am impressed! Maybe the limit is too low.


Out of curiosity I did the same thing in Ruby for comparison.

   N = 1E6.to_int
   N.times { |i| puts "x#{i} = #{i}" }
Then running it with:

   time ruby big.rb

It never completed, because it allocates variables on the stack.

   Traceback (most recent call last):
   big.rb: stack level too deep (SystemStackError)
   ruby big.rb  1451.52s user 23.74s system 97% cpu 25:08.23 total


Good old ActivePerl 5.10.1 does it in about 1.8 seconds on my Windows laptop, while Strawberry Perl 5.26 is taking quite a while (multiple minutes already). I wonder why there's such a difference.

Program for reference.

  use strict;
  use warnings;

  my $max_i = 1_000_000;

  for my $i (0 .. $max_i) {
    print "my \$x$i = $i;\n";
  }
Edit: It took 80 minutes! Wow. There must be something significantly different going on, but I wouldn't know where to look.


You are likely just benchmarking the time it takes to print one million lines to stdout.


I assume the first script isn't the benchmarked program, but that it just generates it. The pure parsing of the second script would be the benchmark.

Having said that, the fact it takes 5 seconds is pretty arbitrary—we can't really say it's good or bad unless there is some basis for comparison.


I might not have explained properly: the output of the program was saved into a file,

  python generate.py > big.py
then this file was executed with Python, that took 5 seconds:

  time python big.py
and yes, I had no idea what to expect, hence the test.


Some context:

- big.py is almost 15MB

- Running `lua5.3 big.py` takes about 1/6 the time on my Linux VM. (I had to use a big.py only half the size because this VM was too small for the original.)

So Python isn't the fastest at this but I wouldn't call it terrible. Basically all of that time was parsing/compiling, since the time is the same with big.py changed to `def foo(): x0=0 # etc.`
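That parse/compile-dominated claim can be checked directly with CPython's compile() builtin (a rough sketch; N is scaled down from the thread's million to keep the run short):

```python
import time

N = 100_000  # scaled down from 1,000,000 for a quick run
src = "\n".join(f"x{i}={i}" for i in range(N))

t0 = time.perf_counter()
code = compile(src, "big.py", "exec")  # parsing + bytecode compilation only
t1 = time.perf_counter()
namespace = {}
exec(code, namespace)                  # running the N assignments
t2 = time.perf_counter()

print(f"compile: {t1 - t0:.3f}s  exec: {t2 - t1:.3f}s")
```

On CPython the compile step typically dominates for a file like this, which matches the observation above.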


A million sequential memory accesses should not take 5 seconds; maybe 5 ms. Most likely the time is spent on VM/interpreter startup and shutdown, OS overhead, etc.


That's not a million sequential memory accesses.

Assuming a fairly naive python interpreter, it involves

- Parsing a million lines of code

- Outputting bytecode for a million lines of code

- Running a million of lines of bytecode, including

- Allocating a million int objects (ints are heap allocated in python)

- Hashing a million identifiers

- Putting those million identifiers into a hashmap, each pointing to the corresponding int object

- Deallocating those million int objects

Plus VM/Interpreter startup/shutdown. But we can benchmark that by running an empty python file and it's not significant.


Given the downvotes, I'm fairly certain I missed the direction of discussion. 5 seconds is an astronomical amount of time to do something as described in the parent comment.


Is 5 seconds such a long time really? For parsing a file, then creating and evaluating 1 million local variables, all inserted into the local namespace?

Made me wonder, how long does it take to compile a million line long C program? Let's see:

  N = int(1E6)

  print("""#include <stdio.h>
  int main() {
  """)

  for x in range(N):
     print (f"int x{x}={x};")

  print("""
    return 0;
  }
  """)
now generate the code and compile it:

  time gcc big.c 
drumroll ... ummm ... look at that ... it doesn't finish, hangs forever (or at least longer than I cared to stick around, which was 10 minutes) ...

Notably, for 10,000 variables it takes just 0.3 seconds, but raising that number one order of magnitude to 100,000 makes it hang at 100% CPU and 3% memory used. Something has awful scaling behind the scenes.


But now we're timing an optimizing compiler, and not an interpreted program of 1 million lines. I understand that you couldn't generate a program of 1 million lines, but still, I would expect compilation to take longer than execution for just about any non-looping program. Again, I am missing the direction of this discussion.

Optimizing compilation is technically NP-hard (optimal register allocation, for instance), meaning it can scale very poorly, as you described, whereas interpretation is not (no register allocation, etc).

https://cs.stackexchange.com/questions/22435/time-complexity...


The main purpose of the exercise was to explore what happens when you throw 1M lines of code at a language;

our estimates of what happens could be wildly inaccurate.

I was pleased with Python finishing in 5 seconds because I thought it might not even work at all due to some internal implementation detail I wasn't aware of - like the Ruby example that fails with a stack error. It's good to see that Python can handle it.

As for the GCC compilation, I find it quite unexpected that it compiles 10K lines in just 0.3 seconds but then seems to hang on 100K lines. Not so easy to explain.


That code doesn’t imply simply 1 million sequential memory accesses in Python. Python is not C.


It has to read and parse 1 million lines; just reading the file with wc takes 22ms:

  wc big.py
so 5ms is way underestimating the job.


No, don't control what people do!

Remember how, before Unix, file systems used to enforce what a file could contain instead of letting you store whatever in whatever? That kind of suffix/file-type validation carried over in part to early Windows. People were sufficiently fed up to create Unix, which had no such BS.

Fewer rules are better than more rules. You can always put whatever you want into a best-practices guide, but you can't expect people to follow hidden rules like this one.


The point here isn't that they are adding restrictions. Those restrictions are already there whether you like them or not, and in many cases they are quite arbitrary. This is a proposal to standardize many of the restrictions on a single arbitrary number, one that is large enough to account for most people's needs and small enough to make implementation easier. One of the motivations spelled out in the article is precisely to avoid hidden rules.


Maybe it’s not appropriate for the official cpython but would be appropriate for something like pypy?


Maybe, that would be up to the project.

PyPy is in a spot where, if you're after speed, you would do it in another language, while the package accessibility that CPython has is gone. I don't see it going mainstream but it's a nice project - and I can see why it has more limitations.


Following your argument, I should be able to write any random words to a file and expect the Python interpreter to run it as the program I intended?

Don't see it as a forced rule to restrict people - I can't really see how 1 million lines is a restriction on handwritten code anyway, and generators can work around the problem - but more as a contract that sets the boundary on what people can expect. Python interpreter developers know they don't have to expect arbitrarily large programs and can optimize their code for everything below 1 million, and Python script/app developers know within what limits they can expect their code to work.


Exactly. You can put arbitrary things in a .py file; if it doesn't run, it's your problem.

Unix didn't limit what you can put in a file, and it doesn't expect your program to work when you fill it with random strings.

The point is, the interpreter developers can feel free to optimize for everything below 1 million; putting a hard cap is pointless. The same reason most sh files are suffixed .sh despite not needing that suffix at all.


Original title "One million ought to be enough for anybody".

The proposal: https://lwn.net/ml/python-dev/93cf822c-4d67-b8f7-1d91-7d8053... ([Python-Dev] PEP proposal to limit various aspects of a Python program to one million. )


The problem with 1M lines of code as a limit, is that some ML tools actually export a hyper-fast executing C (or other language) program that emulates the built/trained model.

I don't know why people would assume that's enough of a limit if there really isn't a need to limit that aspect.


This reminds me of q/kdb's limits[1]:

    8 maximum parameters on a function
    96 maximum constants
    32 max globals
    24 max locals
The self-flagellating q developer will enjoy these limits under the guise of good-coding-practice enforcement (if you have more than 24 variables in your function, you should split your function up).

1: Table of Limit Errors at http://www.timestored.com/kdb-guides/kdb-database-limits#con...


Are there common assembly/CPU native ways of partitioning a 64 bit register to store multiple smaller size values and do work on them? (I haven't written assembly since some MIPSs in college.)


It is there on x86-64, not sure about other archs.


Isn't that how SIMD works?


100,000 lines was enough for Terry Davis.


21 bits seems weird if the goal is optimization. 24 (a multiple of 2, 4 and 8) would seem more natural and incrementally less controversial (8 MLOC vs 1, assuming signed line numbers)


From the article, the observation is that three 21-bit values can be packed into a 64-bit integer.


Which certainly sounds like a terrible idea, and there's no actual evidence presented of performance gains or benefits.


The obvious benefit would be if your 3 values always fall in that range you would save 129 bits that would otherwise always be zero.


That's dumb though.

You can't efficiently address 21-bit values. You can efficiently address 8-, 16-, 32-, and 64-bit values, and some architectures can efficiently address 24-bit values. No architecture makes it efficient to address 21-bit values. If you wanted to pack stuff more efficiently you could pack 8 24-bit values in 192 bits; all of those are multiples of one byte. But even that is dumb.

On the scale of a program with one million lines of code, those extra 11 bits constitute a space savings of .00001%, at a not-inconsequential time cost to access weirdly aligned values. Thanks for the byte? Am I allowed to spend it all in one place?

Honestly, were I charged with implementing this feature, I would ignore it in the name of performance and just stick those line refs in a 64-bit pointer to the line of code/bytecode. I'm honestly not even seeing that the proposed advantages of this idea are advantageous, even in the absence of the catastrophic negatives.


But what proportion of the Python interpreter's memory is consumed storing line numbers? How much slower will it run if it has to unpack the numbers any time it needs to use them? Does Python even store triplets of line numbers? (This seems unlikely to me, so to achieve any saving at all will require invasive code changes.) These questions could be answered with hard data, but the proposal has no data at all.


> But what proportion of the Python interpreter's memory is consumed storing line numbers?

I don't think the packing part was talking just about line numbers. They propose a number of things that would be limited to 1e6 elements. The question about performance is a good one and needs to be studied. This proposal is arguing from a different perspective, though, one that focuses on reducing ambiguity rather than optimizing for performance. Python as a language is an exercise in sacrificing performance for other objectives, so I think it's fine for a language like this to consider this question even if it means performance could be impacted negatively.


129 bits compared to what?

If we made the limit 2^32 the total size of those 3 values would only be 96 bits, and you would only be saving 32 bits.


Compared to a 64 bit size. But you're still saving space either way.


How many lines of code can an organization effectively maintain?



