So essentially Lichess chose the StackOverflow approach: (rather) beefy servers, instead of "treating them like cattle".
Interesting that they accumulate game state and store it periodically. Unfortunately, it is not very clear where they keep the state of ongoing games: in Redis or on the server itself. Also, the cost breakdown doesn't list a server for Redis, only for the DB.
BTW, their GitHub has a better architecture diagram than the overly simplified one in the article: https://raw.githubusercontent.com/lichess-org/lila/master/pu.... Unfortunately, I'm afraid drawing something like that during an interview may not land you a job at FAANG =(
Note that their cost per game is fairly low: $0.00027, or 3,671 games per dollar.
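For anyone who wants to sanity-check those two figures against each other, here is a quick sketch in Python. The function names are mine, not anything from Lichess, and the only inputs are the two numbers quoted above. The figures agree up to rounding: the rounded cost implies slightly more games per dollar than 3,671, which suggests the unrounded cost is a bit above $0.00027.

```python
def games_per_dollar(cost_per_game_usd: float) -> float:
    """Games one dollar pays for, given the cost of a single game."""
    return 1.0 / cost_per_game_usd


def cost_per_game(games_per_usd: float) -> float:
    """Cost of a single game, given how many games one dollar pays for."""
    return 1.0 / games_per_usd


# Both inputs are the figures quoted in the comment above.
print(round(games_per_dollar(0.00027)))  # implied by the rounded cost
print(round(cost_per_game(3671), 5))     # implied by 3,671 games per dollar
```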
P.S. I'm not saying that Lichess's approach is the best or that FAANG's is the worst. Remember, Lichess had a 10-hour outage precisely because of the architecture it chose (single-datacenter dependency): https://lichess.org/@/Lichess/blog/post-mortem-of-our-longes... . Outages like that are exactly why multi-datacenter and multi-region architectures are drilled into FAANG engineers.
My point is that there are cases where this approach is legit, but the typical interview is laser-focused on different things and most probably won't appreciate the "old style" approach to the problem. I'm sure that if Thibault ever decides to land at FAANG, he will do neither whiteboard coding nor system design.
The downtime here is mostly OVH's fault. They're not known for fast support on hardware failures; that's why they're cheap. If they had this architecture on AWS EC2 and could just spin up a new instance from an AMI, they'd have only a few minutes of downtime with the same simple architecture.
I remember Meta having a few outages of their own. And Outlook as well. So I'm not sure what to think now. But sure, on paper FAANG is redundant and hence better.
In my experience, issues grow exponentially with scale, so handling 10x the traffic might mean 100x the potential issues. Redundancy helps with that: when something inevitably fails, the architecture can recover automatically and the end user doesn't see any degradation. So what works for Lichess wouldn't work for Meta.
Redis runs on the main server, where lila runs, as indicated in the diagram you linked. Moves are buffered in lila; Redis is only used for pub-sub.
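To make "moves are buffered" concrete, here is a minimal sketch of the general buffer-then-flush pattern in Python. This is not lila's actual code (lila is written in Scala, and I haven't checked how it does this); `MoveBuffer`, `persist`, and `flush_every` are hypothetical names made up for illustration.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class MoveBuffer:
    """Keeps recent moves in process memory and writes them out in batches."""

    def __init__(self, persist: Callable[[str, List[str]], None], flush_every: int = 8):
        self._persist = persist          # hypothetical DB write callback
        self._flush_every = flush_every  # batch size that triggers a write
        self._pending: DefaultDict[str, List[str]] = defaultdict(list)

    def add_move(self, game_id: str, move: str) -> None:
        moves = self._pending[game_id]
        moves.append(move)
        if len(moves) >= self._flush_every:
            self.flush(game_id)

    def flush(self, game_id: str) -> None:
        # Remove the batch from memory first, then hand it to the store.
        moves = self._pending.pop(game_id, [])
        if moves:
            self._persist(game_id, moves)

    def flush_all(self) -> None:
        # Would be called on a timer or at shutdown in a real service.
        for game_id in list(self._pending):
            self.flush(game_id)


writes = []
buf = MoveBuffer(lambda gid, moves: writes.append((gid, moves)), flush_every=3)
for mv in ["e4", "e5", "Nf3", "Nc6"]:
    buf.add_move("game-1", mv)
buf.flush_all()
print(writes)
```

The trade-off is the usual one: fewer DB writes per game, but whatever is still in the buffer is lost if the process dies before the next flush.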
Why feel anything about it at all? You work at FAANG: be glad for the money or quit if there isn't any. You don't work at FAANG: bad hiring makes it easier for you to get hired and make money.
You haven't considered the third option: couldn't care less about working at these companies, for various reasons (personal, financial, geography, CV or whatever).
My criticism was mostly aimed at the very poor metrics these companies have introduced into hiring, although I can understand that, given the gigantic number of applications they get, a mechanism for filtering out false positives is acceptable even at the cost of false negatives.
Even more so, that it has spread to companies that don't have those problems and can't afford false negatives.
This is a limited, self-centered way of thinking (not selfish, just "self" in the neutral sense of the word).
Looking at second-order effects, many companies look up to FAANG for "best practices", which often means blindly copying their hiring practices. Without anyone feeling or voicing healthy skepticism, the software hiring world becomes a worse place overall.
Their cost breakdown, for those who are curious: https://docs.google.com/spreadsheets/d/1Si3PMUJGR9KrpE5lngSk...