Repeat after me: UTF-8 is the sane default in this day and age. This is a good change.
The whole "the ISO 6429 C1 control code 'application program command'" thing is a bit surprising though. (I'm guessing this change doesn't actually avoid this directly? If you sent an APC it'd still do it, it's just that APC is multiple bytes in UTF-8, and hopefully a bit rarer?)
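To make the byte-level point concrete, here is a small sketch of how the C1 APC code point (U+009F) looks in a single-byte encoding versus UTF-8. The variable names are just for illustration, and Latin-1 is only an assumed stand-in for "an 8-bit encoding where C1 controls are one byte".

```python
# U+009F is the C1 control "application program command" (APC).
APC = "\u009f"

# In a single-byte encoding such as Latin-1, APC is one raw byte.
latin1_bytes = APC.encode("latin-1")

# In UTF-8 the same code point takes two bytes.
utf8_bytes = APC.encode("utf-8")

# A lone 0x9F byte is a continuation byte, so it is not valid UTF-8 by itself.
try:
    b"\x9f".decode("utf-8")
    lone_byte_valid = True
except UnicodeDecodeError:
    lone_byte_valid = False

print(latin1_bytes, utf8_bytes, lone_byte_valid)
```

So random binary data is less likely to hit the two-byte sequence by accident, but a well-formed UTF-8 APC would presumably still be decoded as APC.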
> Reinterpreting US-ASCII in an arbitrary encoding
This will likely work, or at least I thought so. The vast majority of encodings are a superset of ASCII, so reinterpreting ASCII as one of them is valid. The only one I know of that isn't is EBCDIC, and I've never seen it used. (Said differently, non-superset-of-ASCII codecs are incredibly rare to encounter, so the above assumption usually holds.) (The reverse, reinterpreting arbitrary data as ASCII, is not going to work out as well.)
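A quick way to check the superset claim is to encode some ASCII text and decode it under a few other codecs. This is only a sketch: `survives_reinterpretation` is a made-up helper, and the codec list is an arbitrary sample from Python's standard codecs, with `cp037` standing in for EBCDIC.

```python
def survives_reinterpretation(text: str, codec: str) -> bool:
    """Return True if ASCII-encoded text decodes to the same text under codec."""
    raw = text.encode("ascii")
    try:
        return raw.decode(codec) == text
    except UnicodeDecodeError:
        return False

sample = "plain ASCII text"

# ASCII-superset encodings round-trip the sample unchanged.
for codec in ("utf-8", "latin-1", "cp1252", "koi8_r"):
    print(codec, survives_reinterpretation(sample, codec))

# cp037 is an EBCDIC code page: letters live at different byte values,
# so the same bytes decode to something else entirely.
print("cp037", survives_reinterpretation(sample, "cp037"))
```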
Though it is rather horrifying how easy it is to dump arbitrary data into a terminal's stream. Unix does not make it easy for a program to do the right thing here. The vast majority of programs, I'd say, really just want to output text. Yet they're connected to a terminal. It would be better if a program could say, "I'm outputting arbitrary binary data", or even "I'm outputting an application/tar+gzip"; the terminal would then know immediately not to interpret this input. And in the case of tar+gzip, it would have the opportunity to do something truly magical: it could visualize the octets (since trying to interpret a gzip as UTF-8 is insane); it could even just note that the output was a tar, and list the tar's contents like `tar -t`. If the program declares itself aware, like "application/terminal.ansi", then okay, you know: it's aware; interpret away.
But it doesn't, so it can't. Part of the difficulty is probably that the TTY is both input and output (not that the input can't also declare a MIME type or something similar). And the vast majority of programs don't escape their user input before sending it to a terminal; it's like one giant "terminal-XSS" or "SQL-injection-for-your-terminal". And it is probably unreasonable to expect it: I don't really know of any good libraries for terminal I/O. Most programs I see that do it assume the world is an xterm, just encode the raw bytes right there, and pray w.r.t. user input.
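For what escaping could look like, here is a minimal sketch. `escape_for_terminal` is a hypothetical helper, not something from an existing library, and the policy (keep newline and tab, hex-escape every other C0, DEL, and C1 control) is my own assumption. It does nothing about other tricky characters such as bidi overrides.

```python
def escape_for_terminal(text: str) -> str:
    """Replace control characters in untrusted text with visible \\xNN escapes."""
    out = []
    for ch in text:
        code = ord(ch)
        is_c0 = code < 0x20 and ch not in "\n\t"  # ESC, BEL, CR, ...
        is_del = code == 0x7F
        is_c1 = 0x80 <= code <= 0x9F              # CSI, OSC, APC, ...
        if is_c0 or is_del or is_c1:
            out.append(f"\\x{code:02x}")
        else:
            out.append(ch)
    return "".join(out)

# Untrusted input that tries to switch the terminal to red text.
user_input = "hello \x1b[31mworld"
print(escape_for_terminal(user_input))
```

The escape byte is printed as the literal four characters `\x1b`, so the terminal never sees a real control sequence.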
Catting the Linux kernel's gzip into tmux can have consequences ranging from "lol" to "I guess we need a new tmux session".
It was also just today that I discovered that neither GNU's `ps` nor `screen` supports Unicode, at least for characters outside the BMP.
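For anyone unsure what "outside the BMP" means in practice, here is a short sketch using U+1F600 as an arbitrary example. It only shows how such a code point is encoded; it makes no claim about why `ps` or `screen` mishandle it.

```python
# U+1F600 lies above U+FFFF, i.e. outside the Basic Multilingual Plane.
ch = "\U0001F600"

is_outside_bmp = ord(ch) > 0xFFFF

# UTF-8 needs four bytes for any code point above U+FFFF.
utf8_len = len(ch.encode("utf-8"))

# UTF-16 needs a surrogate pair: two 16-bit code units, so four bytes.
utf16_units = len(ch.encode("utf-16-le")) // 2

print(is_outside_bmp, utf8_len, utf16_units)
```

Code that assumes one 16-bit unit per character, or at most three UTF-8 bytes, tends to break on exactly these characters.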
UTF-16 isn't a superset of ASCII, for one. Doesn't seem that anyone uses a native UTF-16 terminal, but if you're trying to use grep or whatnot on a UTF-16 encoded file, it'll happily silently not do what you want...
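A rough sketch of why a byte-oriented search misses text in a UTF-16 file: every ASCII character picks up an extra zero byte, so the ASCII pattern never matches. The sample string and variable names are made up, and UTF-16LE without a BOM is assumed.

```python
# The same text as it would be stored in a UTF-16LE file.
data = "needle in a haystack\n".encode("utf-16-le")

# Each ASCII character is followed by a 0x00 byte.
has_nul_bytes = b"\x00" in data

# Searching for the ASCII bytes finds nothing...
ascii_match = b"needle" in data

# ...while searching for the UTF-16LE-encoded pattern does.
utf16_match = "needle".encode("utf-16-le") in data

print(has_nul_bytes, ascii_match, utf16_match)
```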