Best quote:
"Start surprising your Ops and Engineering teams by killing stuff in the middle of the day without warning them. They’ll love you"
It sounds stupid, but if you really do have a resilient and redundant infrastructure, it shouldn't matter. If you fear someone randomly unplugging things, then you have work to do ;-)
Netflix (which also survived the outage) wrote a blog post last year describing a system they call "Chaos Monkey" that does exactly this.
If the point is "test your supposed redundancy," I agree 100%. But please, for your customers' sake, don't do it in the middle of the day, especially if it's your first test.
To be clear, he did say start practicing during planned maintenance times, but I agree: your first try on a live system shouldn't be during peak times.
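For anyone wondering what "kill stuff at random, but start inside a maintenance window" might look like, here is a rough sketch. This is not Netflix's code or any real tool's API: the function names, the 02:00 to 04:00 window, and the `terminate` callback are all made up for illustration.

```python
import random
from datetime import datetime, time


def in_maintenance_window(now, start=time(2, 0), end=time(4, 0)):
    """Return True if `now` falls inside the (assumed) nightly window."""
    return start <= now.time() < end


def chaos_round(instances, now, terminate, peak_ok=False, rng=random):
    """Kill one randomly chosen instance, unless we are outside the window.

    `terminate` is a hypothetical callback that actually takes the
    instance down; `peak_ok=True` lifts the maintenance-window guard
    once the team trusts its redundancy.
    """
    if not peak_ok and not in_maintenance_window(now):
        return None  # too risky right now, skip this round
    if not instances:
        return None  # nothing to kill
    victim = rng.choice(instances)
    terminate(victim)
    return victim


if __name__ == "__main__":
    # Dry run: "terminating" just prints the chosen name.
    servers = ["web-1", "web-2", "db-replica-1"]
    chaos_round(servers, datetime.now(), lambda s: print("killing", s))
```

The idea is to run with the guard on until the first few tests go cleanly, then pass `peak_ok=True` and do it in the middle of the day.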