Weird type of hard shutdown under random loads

Stronk

Level G
So here's a strange problem I'm having with my desktop with the following specs:
Ryzen 3600XT
Vega 64 Liquid (Power limited to 250W)
2x16GB 3000 Corsair LPX
1x8GB 3200 Corsair LPX
(All 3 sticks running at 3266MHz @ 1.38V)
MSI B450M Mortar (Non-Max)
Corsair CX550

Now, I'm having shutdowns under random loads. I was using Manjaro until a few days ago, and kept getting segmentation faults or hard shutdowns while compiling a dependency, namely rocfft (via yay). It either showed a segfault or just shut down, at any percentage between 7 and 10. Temps were not a problem, as they never crossed 80C.
To narrow down the problem, I first ran the RAM at stock speed (2133) to see if that was the issue, but still ran into the same problems. My CPU was also overclocked, so I thought that could be it, although highly unlikely. So I did a CMOS reset, and yet again got the same segfaults and hard shutdowns, which left the mobo as the only culprit. But then, I'm not sure why, I locked the CPU clocks to 3GHz, kept the RAM at its regular 3266 speed, and voila: it compiled without a hitch!

So I concluded that the CPU was faulty and almost filed an RMA request with AMD, but then the following happened today.

I switched to Ubuntu 20.04 today and tried running a benchmark (ai-benchmark 0.1.2, in case this is relevant) which puts a medium-high compute (NOT 3D graphics) load on the GPU and an occasional single-thread 100% load on the CPU. I could never finish it because the computer kept shutting down mid-run, and this time even downclocking the CPU doesn't seem to help. I ruled out the GPU by power limiting it to as low as 150W, but still kept getting hard shutdowns. The CPU never went above 40W of power consumption, and the GPU never crossed 60C in temps or exceeded the set power limit.
And the weird part about these shutdowns is that the computer wouldn't switch back on when I pressed the power button afterwards. Instead I had to switch the mains off and on (either at the plug point or on the PSU itself) to get it to power on, i.e., it wouldn't start again without power-cycling.
I've tried running GPU-only stress tests and benchmarks and the computer didn't shut down, and I have run CPU-only benchmarks with no shutdowns either, so I'm really flummoxed as to why this is happening.

Any ideas? Does the weird shutdown behavior point to a PSU problem, or is the CPU still likely at fault? Any replies will be appreciated, thank you.
 
Seems to me like a faulty CPU cooler that isn't connected or functioning properly.
Nope, I've been monitoring temps and that is definitely not the case.
Also, after reading a bit more online, it seems the PSU is at fault. Any pointers on how to go about claiming warranty for Corsair PSUs?
 
Did you enable kdump?
Ah damn, now that you remind me, I didn't install it on Ubuntu. Will do that and see what it has once I encounter another crash. Thanks!
Crashed twice since installing this, nothing in kdump. Seems like a PSU issue after all, what with the power cycling and all. But I'm still confused by the segmentation fault errors I got. Hopefully those were just a bad (i.e., even slightly unstable) overclock on either the CPU or RAM.
 
Double check whether your GPU power limit is really working... it could be insufficient PSU capacity.
It is, made sure of that. The only consistency I have noticed is that the system shuts down when a sudden load is applied across the CPU or GPU, even if it's within the capacity of the PSU.

Plus it was sometimes shutting down even while just compiling rocfft, which only uses the CPU, and the CPU doesn't consume more than 85W.
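For anyone checking the same thing, the "within capacity" reasoning above is just a steady-state sum. A rough sketch in Python, using only the numbers quoted in this thread (550W rating for the CX550, 250W GPU power limit, 85W CPU during compiles). The function name is made up for illustration, and this ignores the rest of the system (board, drives, fans, pump) and any short transient spikes, so treat the result as an upper bound on headroom rather than a verdict on the PSU.

```python
def steady_state_headroom(psu_watts, loads_watts):
    """Return PSU rating minus the sum of sustained component draws, in watts.

    Hypothetical helper: it only adds up sustained figures and says nothing
    about transient spikes or components missing from `loads_watts`.
    """
    return psu_watts - sum(loads_watts.values())


# Figures taken from the posts above; nothing here is measured at the wall.
loads = {
    "gpu_power_limit": 250,   # Vega 64 Liquid, power limited
    "cpu_peak_compile": 85,   # CPU draw while compiling rocfft
}

headroom = steady_state_headroom(550, loads)
print(f"Sustained load: {sum(loads.values())}W, headroom: {headroom}W")
```

On paper that leaves a fair margin, which is why a sustained-load overrun looked unlikely here.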
 
So I did very detailed tests today, and I think I've finally zeroed in on the problem. It wasn't the PSU, but rather the RAM and GPU. First of all, the GPU was spiking to temps of 70C+ (even with a 235W power limit), and since this is a liquid-cooled unit, the shutdown temp is 80C. That is partly the reason why the computer shut down under any compute load. I've made the fan curve much more aggressive and reduced the power limit to 200W for compute tasks, and now the temps peak at 72C - still beyond the thermal throttle limit of 70C, but at least it doesn't shut down.

Second was the RAM - pushing voltage to 1.45V got rid of the segmentation faults, and once that was done I figured out the GPU.

Thank you everyone for your help.
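For anyone chasing a similar problem: short temperature spikes like the ones described above are easy to miss in a monitoring GUI, because the last reading disappears when the machine powers off. Below is a minimal sketch, in Python, of a logger that reads the standard Linux hwmon sysfs files (`/sys/class/hwmon/hwmon*/temp*_input`, values in millidegrees Celsius) and forces every line to disk. The function names and the `temps.log` filename are made up for illustration, and which sensors show up depends on your drivers, so this is not the method used in the thread.

```python
import os
import time
from pathlib import Path


def read_temps(hwmon_root="/sys/class/hwmon"):
    """Return {(chip_name, sensor_label): degrees_celsius} for each temp*_input file."""
    readings = {}
    for chip in sorted(Path(hwmon_root).glob("hwmon*")):
        try:
            name = (chip / "name").read_text().strip()
        except OSError:
            continue  # no readable driver name, skip this chip
        for sensor in sorted(chip.glob("temp*_input")):
            base = sensor.name[: -len("_input")]  # e.g. "temp1"
            try:
                label = (chip / f"{base}_label").read_text().strip()
            except OSError:
                label = base  # not every driver provides a label file
            try:
                millidegrees = int(sensor.read_text().strip())
            except (OSError, ValueError):
                continue  # sensor vanished or returned garbage
            readings[(name, label)] = millidegrees / 1000.0
    return readings


def log_forever(path="temps.log", interval=1.0):
    """Append one line of readings per interval, syncing each line to disk."""
    with open(path, "a") as log:
        while True:
            stamp = time.strftime("%H:%M:%S")
            parts = [
                f"{name}/{label}={temp:.1f}C"
                for (name, label), temp in sorted(read_temps().items())
            ]
            log.write(f"{stamp} {' '.join(parts)}\n")
            log.flush()
            os.fsync(log.fileno())  # so the last line survives a hard shutdown
            time.sleep(interval)


# To use: call log_forever() in a terminal, start the load, and read the
# tail of temps.log after the next shutdown.
```

The `fsync` on every line is slow but deliberate: after a hard power-off, the final entries in the file show what the sensors read just before the machine died.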
 