Personally, I think this approach -- which as far as the engine is concerned looks like a largely-automated "translation" of xetex from C/C++ to Rust -- is misguided.

The xetex code is a byzantine tangle of disparate pieces that evolved over the course of a number of years and several changes of direction; it originally started as a personal tool to address one individual's use case for a "Unicode-capable TeX", and grew from there. While it generally works really well, it's not a solid piece of engineering on which to build the future.

The resulting code doesn't benefit from being Rust, except for the added buzzword-ness. It's simply a translation of the C code, littered with unsafe blocks, as it has not been architected to work with Rust's ownership semantics, borrow checker, etc.

What XeTeX (or LuaTeX, though I don't find the Lua integration important -- but that depends heavily on your use cases) needs is a rewrite that preserves backward compatibility for documents, while re-architecting the engine using a modern language such as Rust. Simply wrapping the 1980s-era code in a "skin" of Rust syntax brings little value.



The advantage of this approach (as opposed to a rewrite), is that you reproduce all the bugs and misfeatures that people in the wild depend on. You can then add a test suite and start refactoring and gradually move to a codebase you're happy with, while breaking few or no users on the way.

Another approach is to split it into blocks and replace parts of the system, but whether that is feasible depends on the software and how modular it is.

Finally, you could try a clean-room rewrite, but that could take years without visible results and is hard to find the motivation for if existing software works ok.


But the advantage of avoiding C-specific bugs is shallow, because the Rust code contains lots of unsafe blocks. And bugfixes made to the original source are not automatically ported to the Rust codebase.

From my point of view the chosen porting "strategy" doesn't make much sense. It is more of a toy project to see what's possible.

What would really make sense would be starting with an extensive test suite and trying to build a properly architected Rust implementation according to it.


Your suggestion would be the start of a clean-room or fresh rewrite - so option 3 above, but those are often surprisingly long and unrewarding endeavours, particularly in old software like this with lots of users used to its quirks and bugs and who disagree about what the spec should be.

Automated translation is obviously not an end goal in itself and doesn't improve code quality, but it could be the start of a successful rewrite as it does at least let them replicate exactly what the old program does, which is very important for end users.

A good example of this strategy being used successfully is the use of the tool c2go to convert the Go runtime and compiler from C to Go. That involved quite a specific tool tailored to the codebase in question, and a lot of manual cleanup afterwards.

https://docs.google.com/document/d/1P3BLR31VA8cvLJLfMibSuTdw...


The unsafe blocks can be removed one by one as time goes on. It's no different to any other legacy refactoring project: get the old code on a new platform, instrument/add unit tests, refactor piece by piece until the end result is acceptable.
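As a hypothetical sketch (not code from the actual XeTeX port), one such incremental step might look like this: a mechanically translated function full of raw pointer arithmetic, next to the safe version a later refactoring pass would replace it with.

```rust
// Hypothetical illustration -- not from the actual XeTeX port.
// A mechanical C-to-Rust translation tends to produce functions like this:
unsafe fn sum_raw(ptr: *const i32, len: usize) -> i32 {
    let mut total = 0;
    for i in 0..len {
        // SAFETY: caller guarantees `ptr` points to at least `len` elements.
        total += unsafe { *ptr.add(i) };
    }
    total
}

// The same routine after one refactoring pass: the `unsafe` is gone, and
// the compiler now enforces bounds and lifetimes for us.
fn sum_safe(values: &[i32]) -> i32 {
    values.iter().sum()
}

fn main() {
    let data = [1, 2, 3, 4];
    let raw = unsafe { sum_raw(data.as_ptr(), data.len()) };
    let safe = sum_safe(&data);
    assert_eq!(raw, safe); // both compute the same sum
}
```

The point is that each such function can be converted and verified in isolation, while the rest of the program keeps running on the translated code.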


> It's no different to any other legacy refactoring project

There are two approaches to a project like that: as you say, you can take something which works and iteratively make it better, or you can derive a specification from the project which works and create a new from-scratch implementation to meet that specification.

My experience with the "from-scratch" approach is that it is very easy to miss details in the specification that will only be found out later, so it is very easy to underestimate the amount of work required. Ironically, that contributes towards making it easier to kick off the project, as it looks like it will be easier and cheaper. Especially if there is a view to drop features in the new version, which is fine until the actual users find out about the plan.

Another issue is that the old system which is being improved is often actually still in use, and the users still want new features even while the new system is being developed. Either those requests can be rejected, or implemented twice (once in the new system, once in the old). When incrementally improving the current system instead, those new features may end up touching areas of the code that have already been improved, making them cheaper to implement, not more expensive.

Basically, I think you're right. Keep the current system working, and improve it without breaking it.


The more I read about legacy (and actively maintained) project refactoring, the more firmly I find myself agreeing that the gradual replacement is the right way to go about things in nearly every case.

Whether we like it or not, the old system is a source of truth about how things are done, so the only way to preserve this knowledge fully is to copy the whole thing as-is and then "restate" parts of that knowledge in a more organised/modern way by refactoring, leaving the rest in place.


If you’re interested in this topic, you should read “Working Effectively with Legacy Code” by Michael Feathers.


It's been on my reading list for a while, I'll take this as a reminder to read it :)


That looks really interesting, thanks for recommending it here.


Everyone has a different way of handling this, but if I had to, I would also take the approach described in this project: a variant of the strangler pattern of refactoring.

However, I don't know of a single one of these C-to-Rust translation projects that has succeeded. Then again, I don't know of any large C-to-Rust rewrite project that has succeeded either, so that doesn't help judge between rewrite patterns. It may just be that rewrites don't yield results at all.


It is not clear whether all the bugs will manifest in the same manner after the translation.


How do you not break people who depend on misfeatures, while eliminating misfeatures?

Hey, I think Windows drive letter names are a misfeature.

Let's have a Windows-in-Rust and they will soon be a thing of the past.


Let's say you have 200 misfeatures, most of which you don't even know exist. You might want to fix 10 of them and not care about the rest, but if you start by breaking all 200 of them, users will be mad and stop using your port.


I don't agree.

Even if an automated translation isn't using the new language in a "proper"/idiomatic way, it can still be a good base to start working from, while ensuring compatibility. From there one can start extracting and rewriting individual pieces and do lots of refactoring.

It is a long and tedious process, but one which has a chance of leading to fewer bugs than a rewrite from scratch, by working on smaller pieces which can be verified one at a time, even if no proper test suite exists.


If I recall correctly, they did a machine-assisted conversion of the Go compiler from C to Go using a similar technique. They wrote a compiler for C to Go, only bothering to handle the particulars of that one codebase, and changing the C if required. This emitted very un-idiomatic Go code, which they could then clean up as required.

What they got immediately was a few classes of bug gone. No more memory leaks, and NPEs and out-of-bounds errors are now defined to fail in a nicer way. Then they could spend time making their new code more idiomatic.


I absolutely agree, but would go even further.

When you say Xetex "not a solid piece of engineering on which to build the future", I think the same thing also applies to the original Tex engine written by Knuth. By modern software engineering standards, the original Tex implementation is a nightmare. It's enormously difficult to extend or add new features, and this has resulted in (1) comparatively few extensions being made and (2) when such extensions have been made (e.g. Xetex), they are very difficult technically.


I agree, but one also can't blame Knuth: he wrote the program the way he knew best (he's a machine-code programmer at heart), under the constraints at the time (portability at various academic sites circa 1980 practically dictated Pascal, then Pascal's limitations required a preprocessor like WEB, etc). In fact, the earlier (TeX78) implementation in the SAIL language was written less monolithically, as a bunch of separate modules.

He also did his best to make the implementation and source code understandable, publishing the program in print as an extensively documented/commented book (another reason for WEB), gave a workshop of 12 lectures about the implementation of the program, even had a semester-long course at Stanford with that book (program source code) as textbook (with exercises and exam problems). He also wrote TeX with hooks and some of its core functionality written as extensions using those hooks, hoping it would show others how to extend it. He has multiple times expressed surprise that more people didn't write their own versions of TeX. “Rewriting a typesetting system is fairly easy.” He seems to have overestimated the ability of others to read his code.

If anything, I think a lesson from the TeX situation is that one's work can be too good: if he had simply published the algorithms at a high level (only the Knuth-Plass line-breaking algorithm was published as an independent paper) then maybe others would have implemented/combined them in interesting ways, but by publishing the entire source code and offering rewards for bugs etc, TeX got a (deserved) reputation as a very high quality stable and bug-free codebase and everyone wanted to use literally TeX itself. What's worse is that for a few years after it was created, TeX was possibly more widely available and more portable (what with its TRIP test and all that) than any single programming language (one had a much higher chance of TeX macros working consistently everywhere TeX was used, than code written in say C or Pascal): so it must have seemed natural to write large things like LaTeX entirely in TeX macros. As “TeX macros” wasn't designed or intended as a full-fledged programming language, we can see the effects today.


> As “TeX macros” wasn't designed or intended as a full-fledged programming language, we can see the effects today.

Making them Turing complete was a conscious decision, though a reluctant one:

> Guy Steele began lobbying for more capabilities early on, and I [Knuth] put many such things into the second version of TEX, TEX82, because of his urging.

http://maps.aanhet.net/maps/pdf/16_15.pdf


Oddly, I disagree with this take. TeX stands as a ridiculously stable codebase -- something that is valued by almost no one in industry today. It suggests that the things we think make good engineering are standing on empirically weak arguments.

Now, aesthetically I absolutely agree. It is an ugly language by most standards. But, if folks tried more extension and less porting to a new language, I'd wager they could get far. Instead, we seem to mainly get attempts at trying an extension by first establishing a new base language. Every time.


> But, if folks tried more extension and less porting to a new language, I'd wager they could get far.

This is what happens - pdfTex, Xetex, Luatex: these are all extensions of the core Tex implementation. The problem is that making these extensions is extraordinarily difficult given the software architecture of Tex. And, notoriously, sharing improvements between extensions is also very very difficult. The end result is that we have comparatively few improvements and complete stagnation in some aspects of the typesetting space (for example, no alternative/improvement to Tex's pagination algorithm).

The problem is that Tex is architected as a monolithic application which you can't easily plug extra stuff into. All of the extensions to Tex have worked by forking the source code entirely, which I think is not a great model.


> Simply wrapping the 1980s-era code

I just checked the Wikipedia page on XeTeX (not knowing what the heck that is at all). It's actually fairly recent in terms of the TeX timeline; it was released in 2004. It has Unicode support, which seems to be the big thing.

That is more recent than, I think, the last time I used TeX; which worked absolutely fine.

Someone rewriting TeX in Rust should work with Knuth's original Pascal sources, in my opinion, not some knock-off (and look at XeTeX behaviors and documentation in order to do the Unicode stuff in a compatible way).


> the last time I used TeX; which worked absolutely fine.

I've heard that the original TeX is the only software in the world that doesn't have any bugs (none found so far).


That's certainly not correct; while it is much closer to bug-free than most software, there have been numerous fixes since the original release (as well as a few enhancements).

See https://ctan.org/tex-archive/systems/knuth/dist/errata, particularly the file "tex82.bug".

Knuth will be reviewing bug reports and potentially issuing additional fixes again next year (see http://www.tug.org/texmfbug/).


I found one years ago that I didn't report. When I issued Ctrl-D on the interactive TeX prompt to bail out, it failed to issue a newline, leaving the operating system prompt juxtaposed to the right of the TeX prompt.

According to ISO C, "[w]hether the last line [of a text stream] requires a terminating new-line character is implementation-defined", so terminating the program without the last character written to stdout (a text stream) being a newline is not maximally portable.

That's a peculiar and possibly unique situation in the standard: whether or not a requirement exists is implementation-defined. Logically, that is as good as it being required, since any implementation can make it required. Those not making it required are just supplying a documented extension in place of undefined behavior.


> Simply wrapping the 1980s-era code in a "skin" of Rust syntax brings little value.

That's right for TeX, but XeTeX was first released in 2004.


Yes, but the bulk of its code is the original TeX code, from 1984. It did not attempt to reimplement or modernise the core code. Until a few years ago, it was even still built from tex.web plus a set of change-files plus some C/C++ libraries. Nowadays, the main change-files have been merged into the WEB source, for easier management, but it's still the old TeX code at heart.

Actually, there was another intermediate stage: XeTeX is in effect a descendant of TeXgX, an extended version of TeX that integrated with the now-discontinued QuickDraw GX graphics and font technology on classic Mac OS. But anyhow, it's still a direct descendant of Knuth's code. (No criticism intended: TeX was -- and still is -- a fantastic piece of work, but its code is from a different era and was shaped by constraints that are irrelevant today.)


> littered with unsafe blocks

Good points. It's unfortunate that headlines rarely distinguish safe Rust from unsafe, when so much of the advantage of Rust depends on it. You may even get a hostile response for asking about it ( https://news.ycombinator.com/item?id=24141493 )


I think you're getting your hostile answers for implying that unsafe Rust is equivalent to C. It's most definitely not. Using unsafe grants you certain powers, but it does not disable all of Rust's features -- for example, the borrow checker is not turned off by unsafe.

See https://doc.rust-lang.org/book/ch19-01-unsafe-rust.html#unsa...
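A small self-contained illustration (my own, not from the linked page): `unsafe` adds the ability to dereference raw pointers, but references are still borrow-checked as usual, inside the block or out.

```rust
fn main() {
    let mut x = 42;

    // `unsafe` grants an extra power: dereferencing a raw pointer.
    let p: *const i32 = &x;
    let read = unsafe { *p };
    assert_eq!(read, 42);

    // It does not switch the borrow checker off. This would be rejected
    // at compile time even though it sits in an unsafe block:
    //
    //     let r1 = &mut x;
    //     let r2 = &mut x;       // error[E0499]: cannot borrow `x` as
    //     unsafe { *r1 += *r2 }  // mutable more than once at a time
    //
    x += 1;
    assert_eq!(x, 43);
}
```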


Safety doesn't just come from language features, it also comes from the language disallowing dangerous actions. Rust's unsafe mode opens the door to undefined behaviour, of the sort that plagues so many C/C++ codebases (buffer overflows etc). A program written in safe Rust offers far better assurances than a program making heavy use of unsafe Rust: safe Rust is unable to result in undefined behaviour. (Bugs in the compiler and standard library may still cause mischief, but that's another matter, the intent of the safe Rust subset is to be guaranteed free from UB.)

You may be right that a program written in 100% unsafe Rust might still be less prone to undefined behaviour than a program written in C, but that's not my point. Excessive use of unsafe features undermines the considerable safety advantages that Rust offers over C, and it's regrettable when this is disregarded.


> Safety doesn't just come from language features, it also comes from the language disallowing dangerous actions. Rust's unsafe mode opens the door to undefined behaviour, of the sort that plagues so many C/C++ codebases (buffer overflows etc). A program written in safe Rust offers far better assurances than a program making heavy use of unsafe Rust: safe Rust is unable to result in undefined behaviour.

Safety is not an absolute, it's a spectrum. No one denies that safe rust is better than unsafe rust on the safety scale.

> You may be right that a program written in 100% unsafe Rust might still be less prone to undefined behavior than a program written in C, but that's not my point. Excessive use of unsafe features undermines the considerable safety advantages that Rust offers over C, and it's regrettable when this is disregarded.

It's not disregarded. The point you are disregarding is that when porting a C application to Rust, unsafe Rust is a step up from C, not a step down from safe Rust. Unless you choose to rewrite from the ground up (which is infeasible in many places), you'll need unsafe Rust, either for bindings or by using tooling that converts the C sources to Rust. But once you have unsafe Rust, you already get all the help that the borrow checker brings, and you can gradually shrink the unsafe code. It's a matter of practicality; you seem to be advocating for absolutes, and I think that's earning you the downvotes you're receiving.


> Safety is not an absolute, it's a spectrum.

Use of the safe subset of Rust means the compiler and standard-library offer you a guarantee of the absence of undefined behaviour. That's an absolute guarantee of safety, under Rust's understanding of the word.

It doesn't give you a guaranteed absence of memory-leaks. It certainly doesn't give you a guarantee of whole-program correctness, as Rust isn't a formal verification framework. Both these properties are beyond the scope of 'safety' as Rust uses it.

> It's not disregarded.

It is. Projects are described as written in Rust, treating safe Rust and unsafe Rust equally.

> The point you are disregarding that when porting a C application to rust, unsafe rust is a step up from C, not step down from safe rust.

I explicitly acknowledged this.

> Unless you choose to rewrite from ground up (which is infeasible in many places), you'll need unsafe rust, either for binding or by using tooling that converts the C sources to rust.

Sure, no disagreement there.

> But once you have unsafe rust, you already get all the help that the borrow checker brings and you can gradually shrink the unsafe code.

Sure, and that's a good use of Rust's unsafe features.

> you seem to be advocating for absolutes

Not really. If I had meant to argue that Rust shouldn't have unsafe features, I'd have done so.

In an ideal world all code would be written in a way that completely closes the door on undefined behaviour, but we agree there are good reasons Rust includes its unsafe features, and there are good practical reasons to use them. I'm advocating for a greater emphasis on the use of the safe subset of Rust.

Written in Rust tells me something about the software. Written in 100% safe Rust tells me much more about the software. That's essentially my point.

This distinction doesn't arise with languages like C and JavaScript. All C is unsafe, and all JavaScript is safe. For languages like Rust and D, there's value in being upfront about the use of their safe subsets.



