Debugging Random Reboots with Claude Code: A PSU Power Limit Story

My Linux server started rebooting randomly during CPU benchmarks. I had no idea where to start, so I asked Claude Code to help diagnose. Twenty minutes later, we found the root cause and a working fix.

This is a story about AI-assisted debugging - specifically, how an AI assistant’s systematic approach can cut through hardware issues that would take hours of Googling.

The Problem

I was benchmarking local Whisper models for speech-to-text on a home server (Intel i9-10900K, 550W PSU). During heavy transcription loads, the system would randomly reboot. No warning, no error message - just instant power loss.

I described the symptoms to Claude Code: “Server reboots randomly under CPU load. No kernel panic. What should I check?”

The AI-Guided Diagnosis

Claude Code walked me through a systematic diagnostic process. Each step built on the previous one.

Step 1: Check the Logs

journalctl -b -1

Claude noted that the logs stopped abruptly mid-operation. No error, no shutdown sequence. “This is actually diagnostic,” it explained. “Software crashes leave traces. Instant power loss doesn’t.”

Step 2: Look for Hardware Errors

dmesg | grep -i error

Found Machine Check Errors (MCE). Claude explained these indicate hardware-level problems: thermal, memory, or power delivery.

Step 3: Rule Out Thermal

apt install lm-sensors
sensors

Temps showed 39-51C under load. Well within spec. Claude crossed thermal off the list.

Step 4: Check MCE Details

apt install rasdaemon
ras-mc-ctl --errors

No active errors. The MCE messages were stale.

Step 5: The Diagnosis

Based on the evidence - logs stopping without kernel panic, load-dependent crashes, normal temps - Claude identified the likely cause: PSU power limits.

It asked about my PSU (550W) and looked up the i9-10900K specs. Under Turbo Boost with all cores loaded, this CPU can spike to 250W+. My PSU was undersized.

The Fix Attempts

Claude suggested Intel RAPL to limit CPU power draw:

# Set PL1=125W, PL2=180W
echo 125000000 > /sys/class/powercap/intel-rapl/intel-rapl:0/constraint_0_power_limit_uw
echo 180000000 > /sys/class/powercap/intel-rapl/intel-rapl:0/constraint_1_power_limit_uw

Still crashed.

Tried lower limits (95W/125W). Still crashed.

Claude explained why: “RAPL operates on millisecond timescales. Your PSU’s overcurrent protection trips in microseconds. The PSU cuts power before RAPL can throttle.”

Software can’t fix hardware that fails faster than software can react.

The Working Fix

Claude’s solution: disable Turbo Boost entirely to prevent power spikes.

echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo

System became stable. Claude then wrote a systemd service to make it persistent:

# /etc/systemd/system/disable-turbo.service
[Unit]
Description=Disable CPU Turbo Boost
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

systemctl daemon-reload
systemctl enable disable-turbo.service

Why AI-Assisted Debugging Worked

I could have Googled “random Linux reboots” and spent hours reading forum posts about kernel bugs, driver issues, and memory problems. Instead, Claude Code:

Asked the right questions - immediately focused on whether logs showed clean shutdown vs. power cut
Followed a systematic process - ruled out causes one by one instead of jumping to conclusions
Knew the domain - understood MCE errors, RAPL timing, PSU OCP behavior
Explained the “why” - didn’t just give commands, but explained why RAPL couldn’t work

The debugging took about 20 minutes of back-and-forth. Most of that was waiting for package installs and running tests.

The Trade-offs

With Turbo disabled, the i9-10900K runs at 3.7GHz base instead of boosting to 5.3GHz. About 30% slower for my benchmarks.

The proper fix is a 750W+ PSU. But for now, disabling Turbo keeps the server stable.

For the Whisper benchmarks: local inference was 10-20x slower than cloud APIs (Groq) even with Turbo. The conclusion held - use cloud for production.

Key Takeaways

Random reboots without kernel panic = power issue, not software. Logs stopping abruptly is the tell.
Intel CPUs lie about power - the i9-10900K’s 125W “TDP” can spike to 250W+ under Turbo
RAPL can’t save you from PSU trips - hardware protection is faster than software throttling
AI assistants excel at systematic debugging - they don’t get distracted by red herrings or skip steps
The fix isn’t always hardware - disabling Turbo is a valid workaround when PSU upgrade isn’t immediate

Next time you hit a weird hardware issue, try describing it to Claude Code. The systematic approach might save you hours of forum diving.

Related: CLAUDE.md: Building Persistent Memory for AI Coding Agents - Make Claude Code remember your project’s quirks between sessions.

The Problem#

The AI-Guided Diagnosis#

Step 1: Check the Logs#

Step 2: Look for Hardware Errors#

Step 3: Rule Out Thermal#

Step 4: Check MCE Details#

Step 5: The Diagnosis#

The Fix Attempts#

The Working Fix#

Why AI-Assisted Debugging Worked#

The Trade-offs#

Key Takeaways#