After much anxiety and frustration, RabbitEars is finally back on its server at Silica Broadband and better than ever. There may still be some lingering issues, and I may take things down here and there as I fix configuration problems I overlooked, but we should be good. If you spot anything that doesn't work or is misbehaving, please let me know!
I need to express my thanks to Dave at Silica Broadband, who has generously hosted RabbitEars for more than three years now and has been a massive help in this process. Fantastic to work with, too. I cannot say enough nice things.
We swapped the power supply to no avail, tested the RAM, which turned up no problems, and ultimately bit the bullet and swapped the two 2TB SSDs for two new 4TB SSDs. As the disk replacement necessitated reloading the entire system, I had him change the configuration from the on-board RAID-1 to software RAID-1 within Linux, so now hopefully I can better handle issues like this in the future.
As for what happened, I wish I could say. I'm not clear why a RAID-1 array would fail in the way that this one did, and I'm hoping that switching to software RAID will prevent it from happening again. Basically, the site would either go away entirely, or the disks would go read-only, such that I could see the data and back it up but not write to the disks (that's when database errors would appear). I'd run a disk check and reboot, sometimes a few times in sequence, and then the server would come back to life. My guess is that this odd behavior is tied to the on-board RAID-1 somehow.
I was still dragging my feet as I was leery of trying to reload the whole system remotely. But Scott at Satellite Guys, ever a fount of good ideas, suggested setting up a temporary server with Contabo, the company he uses for Satellite Guys. I went ahead and did so, and it was less expensive than I'd have thought to set myself up for a month. It took some trial and error to get up and running, but I took notes to help support the reload of the permanent server, so it was all beneficial. The next time the system failed hard, Dave wasn't immediately available to bring it back up, so I rolled everyone over to the temporary server. It held up surprisingly well, to the point that I had Russ set up the Live Bandscan on it. It was definitely close to capacity, but everything held up okay. Having the site online meant we weren't scrambling to restore the permanent server, which kept my stress level down. So, many thanks to Scott for the excellent idea and great recommendation. I'll keep this in mind for the future, and despite telling him privately I wasn't planning to keep the temporary server, I have to admit, I'm considering keeping it alive just in case.
Anyway, thanks to everyone for their patience. Hopefully we're in a good spot now!