I previously ran 150,000 AMD GPUs in all conditions at 100% utilization for years. I currently have a multi-million-dollar cluster of enterprise AMD GPUs.
A couple real world points:
1. They generally don't just fail. More likely a repairable component on a board fails and you can send it out to be repaired.
2. For my current stuff, I have a 3 year pro support contract that can be extended. Anything happens, Dell goes and fixes it. We also haven't had someone in our cage at the DC in over 6 months now.
I have to maintain our GPUs. Generally the worst parts are the water-cooling pressure, the HVAC, and the power. I can run them stably only at 300 W per GPU; the nominal max is 310 W. Throttled to 300 W, the system has been stable for a year; before that it burned through two mainboards, with a lot of downtime.
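For reference, capping power on AMD GPUs under ROCm is typically done with `rocm-smi`. A minimal sketch, assuming a ROCm install with `rocm-smi` on the path; the 300 W value and device index 0 are taken from the anecdote above, and flag names may vary between ROCm versions:

```shell
# Show the current power draw and cap for all GPUs
rocm-smi --showpower

# Cap GPU 0 at 300 W (requires root; value is in watts)
sudo rocm-smi -d 0 --setpoweroverdrive 300

# Revert GPU 0 to its default power cap
sudo rocm-smi -d 0 --resetpoweroverdrive
```

The cap does not persist across reboots, so it generally needs to be reapplied from a boot script or systemd unit.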
My experience is that most hardware problems stem from poor power quality and/or poor airflow.
I'm convinced that this is why we haven't had any issues in our current location: zero outside air, zero dust, and insanely well-built, no-expense-spared airflow and power supply/management.