I had a humbling experience today. Hopefully, there's something in this wall of text that might be helpful to someone else. TLDR at the end.
Each of my Proxmox clusters has a Lenovo Tiny that I use as a supervisor node. This supervisor node's job is to report which nodes are online and offline through /etc/pve/.members, and this information is pulled in through ssh from another Tiny that's in my automation/monitoring/stats cluster.
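For anyone wanting to try the same kind of polling, here is a minimal sketch of it in Python. It is not my actual script. It assumes the file is JSON with a "nodelist" map where each node carries an "online" flag (1 or 0), which is how it looks on the clusters I've seen; check your own file before relying on that. The function names and the ssh options are made up for illustration.

```python
import json
import subprocess
from typing import Dict, Optional

MEMBERS_PATH = "/etc/pve/.members"


def parse_members(raw: str) -> Dict[str, bool]:
    """Map node name -> online flag from the .members JSON text.

    Assumes a top-level "nodelist" object whose entries have an
    "online" field; nodes without that field are treated as offline.
    """
    data = json.loads(raw)
    nodelist = data.get("nodelist", {})
    return {name: bool(info.get("online", 0)) for name, info in nodelist.items()}


def fetch_members(host: str, timeout: float = 10.0) -> Optional[Dict[str, bool]]:
    """Read .members from the supervisor over ssh.

    Returns None when the supervisor cannot be reached or the output
    cannot be parsed, so the caller can tell "no data" apart from
    "every node is offline".
    """
    try:
        result = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", host, "cat", MEMBERS_PATH],
            capture_output=True,
            text=True,
            timeout=timeout,
            check=True,
        )
    except (subprocess.SubprocessError, OSError):
        return None
    try:
        return parse_members(result.stdout)
    except (json.JSONDecodeError, AttributeError):
        return None
```

Returning None for an unreachable supervisor, rather than an empty dict, is a deliberate choice in this sketch: "I have no information" and "everything is off" should never look the same to whatever decides to flip the smart plugs.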
That other Tiny automates shutdowns and startups since I don't have consistent power available to me. We see multiple power outages a day.
Today, the automated startup failed for one of my clusters after a power outage. When I went to see what happened, I saw that the supervisor node's ethernet lights were off. I have all my systems facing back-to-front for both das blinkenlights and the aesthetics of cable management:
View attachment 204976
Reseated the power connector, but still no ethernet lights. Tested the power connector with a multimeter and saw 0V between the centre pin and outside. A dead power adapter? But how? I have a 5000VA mainline stabilizer connected to an inverter which is then connected to an online ups, with MCBs and surge protectors in between all of them. No surge should've made it through.
I spent most of the night turning half of the house upside down looking for a spare Lenovo adapter — I had a spare Tiny so I should've had a power adapter for it somewhere. Didn't find it anywhere.
So I cut off the connector so that I could splice in another adapter. But I wanted to make sure this adapter was dead — and it wasn't! Measured 20V on the bare stripped wire.
I went to check the continuity on the other half of the wire I'd cut off, the one with the connector. There was none. No continuity between the centre pin and the bare wire I'd stripped off. Was there a loose or bad connection from fatigue?
I tried twisting the strain relief and intermittently heard a beep from the multimeter. A few more beeps later, I realized the centre pin is not a power pin: the power contacts are on the inside of the connector, and the centre pin is probably just a sense pin.
The power adapter was perfectly fine. I hastily spliced the wire back together, checked for 20V, found it, and plugged it in. But still no ethernet lights.
So now I'm thinking I need to swap in my spare Tiny. Took this one out and started disassembling it since I needed the processor, memory and storage from it. Briefly considered it might not be starting because of a bad BIOS battery. I've seen that in the past. Battery tests 3.3V. Put it back in and tried powering it on with just one stick of RAM and processor and nothing else.
It powered up! Unbelievable. Did this mean that something died and prevented it from starting up? Tried two sticks of RAM, and it turned on. Plugged in the cooling fan, it turned on. Slotted in the SSD, it turned on. Added the ethernet adapter, and it turned on. What? I closed up the Tiny and it powered up just fine.
Connected the ethernet cables, no lights. Brought in a spare keyboard and monitor and I see Proxmox's login screen. Got in and ran ip a. All interfaces were accounted for, except the physical ones showed NO CARRIER.
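In hindsight, this is the check I should have been able to run remotely or have scripted on the node itself. A rough sketch in Python, assuming an iproute2 recent enough to support JSON output (ip -j link show), where each interface has an "ifname" and a "flags" list that contains NO-CARRIER when the link is down. The function names are hypothetical.

```python
import json
import subprocess
from typing import List


def interfaces_without_carrier(ip_link_json: str) -> List[str]:
    """Return names of interfaces whose flags include NO-CARRIER."""
    links = json.loads(ip_link_json)
    return [
        link["ifname"]
        for link in links
        if "NO-CARRIER" in link.get("flags", [])
    ]


def local_interfaces_without_carrier() -> List[str]:
    """Run `ip -j link show` on this machine and report links with no carrier."""
    result = subprocess.run(
        ["ip", "-j", "link", "show"],
        capture_output=True,
        text=True,
        check=True,
    )
    return interfaces_without_carrier(result.stdout)
```

A node that is up but reports no carrier on every physical port points at the cable or the switch, not at the node, which would have saved me most of the night.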
It's been seven hours into this and there was nothing wrong with this Tiny. It was powering on just fine, I just couldn't see the front fascia with how it was installed so I went by just the ethernet lights and wrongly assumed it was not powering on. The first lesson learned: check both front and rear lights before assuming something is dead.
But why were the ethernet lights off? Turns out when I redid the aesthetic cable management a few months ago, I plugged the cable for this Tiny into a switch that would be powered off during extended power outages.
During such outages, the network is powered off five whole minutes after the cluster is powered off, plenty of time for the automation cluster to see that the nodes went offline. The node in charge of safe shutdowns and startups will only execute a cluster startup if all the nodes are offline, since it does the startup by toggling smart plugs and you don't want to be in a situation where a running node abruptly loses power.
I shifted focus to the automation cluster and the Tiny responsible for shutdowns and startups. Uptime was a few hours; it should've been a few months. Something happened. Somehow a surge got through the many surge protectors, the stabilizer, the inverter, and the eye-wateringly-expensive online-freaking-ups that's not in eco mode and this Tiny was restarted.
When it restarted, it could not communicate with the supervisor node since that didn't have network connectivity. Without any way of knowing which nodes were online, it never executed the automated startup sequence. And that is exactly how I intended things to work but forgot all about it.
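Written out as code, the startup rule and the way it bit me look roughly like this. It is a simplified Python sketch, not the real automation; the function name and the shape of the input (a dict of node name to online flag, or None when the supervisor can't be reached) are assumptions for illustration.

```python
from typing import Dict, Optional, Tuple


def decide_startup(members: Optional[Dict[str, bool]]) -> Tuple[bool, str]:
    """Decide whether it is safe to power-cycle the cluster's smart plugs.

    members is None when the supervisor node could not be queried.
    Returns (start, reason).
    """
    if members is None:
        # No information at all: never toggle plugs blindly.
        return False, "supervisor unreachable"
    if not members:
        return False, "supervisor returned an empty node list"
    online = sorted(name for name, up in members.items() if up)
    if online:
        # Toggling plugs now could cut power to a running node.
        return False, "nodes still online: " + ", ".join(online)
    return True, "all nodes offline"
```

With this rule, decide_startup(None) returns (False, "supervisor unreachable"), which is exactly the state my automation sat in. The fail-safe did its job; what was missing was anything that tells me it is holding and why.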
My brain feels like mush right now. I don't know what safeguards I need to put in place to prevent this from reoccurring, I just know that I need more safeguards. Also, I need to move the Tinys to a DC power system since apparently, AC isn't reliable even with a low-six-digit investment in power rectification. That'll be the other lesson learned.
TLDR: Consider shifting focus to agriculture, it'll be relaxing.