Haven’t heard much about DreamHost’s Infrastructure team (aside from Parts 1 and 2 of this series)? That means they’re doing their job!
In their world, silence is golden, and chaos is always just a few milliseconds away.
While the rest of us are streaming, sleeping, or building our online presence, this team is walking cold aisles, checking dashboards, replacing RAID cards, and preparing for emergencies that (hopefully) never happen.
This is a peek into the war rooms, weird moments, and 24/7 vigilance that keep DreamHost, and your site, online.

The Day Portland Went Dark
If there’s one story that sums up the team’s resilience, it’s the Portland outage of November 2023.
“PGE (Portland General Electric) was performing maintenance on one of the two power feeds to the data center,” explained Chris Lewis, DreamHost’s Manager of Data Center Operations, regarding an issue with the building’s power supply.
Then the other power feed failed, causing a complete power loss.
Batteries kicked in. Generators roared to life and everything seemed OK. But when PGE turned the maintenance feed back on, it caused a switching malfunction, sending a dangerous power charge that sparked a fire and blew out a bank of breakers. With minutes of backup battery left, the team faced a worst-case scenario: complete data center failure.
“Almost 75-80% of our racks went down, including core networking,” Chris said. “It was madness.”
And yet, within 14 hours, thanks to a total team collaboration, everything was back up. Here’s how the team recovered service significantly faster than the estimated 2 days to a week (worst case):
- Anticipation of Catastrophe
  - The team had anticipated that a catastrophic failure (not this one specifically) would eventually take out the power, and had proactively migrated to a setup where most machines could start without network dependency, saving significant time.
- Preventative Maintenance and Resilient Systems
  - Consistent preventative maintenance and resilient storage (RAID arrays and zpools) protected data and allowed for sane replacement times rather than a rush to restore broken arrays.
- Proximity and Preparedness of On-Call Staff
  - On-call admins live within an hour of the data center and were on-site quickly (in this instance, the on-call admin was on-site twenty minutes after being alerted).
  - With the power loss locking doors from the outside and disabling physical security systems, having someone on-site to manually verify identities was critical, since letting anyone onto the floor without proper checks would be a major security violation (or worse, a Mission: Impossible breach).
- Skilled and Global TechOps Team with Clear Leadership
  - A vast, skilled, and global TechOps team provided fresh shifts and remote oversight.
  - Clear leadership funneled information and prioritized tasks, enabled by a deep understanding of the infrastructure’s layers and trust in the team to keep everyone updated on the status of all the moving pieces.
💡Did You Know?
DreamHost’s diesel generators can power entire data centers for up to 24 hours as long as fuel deliveries keep coming. During the outage, they were the only reason lights (and sites) stayed on.
Redundancy: The Art of Being Paranoid (In a Good Way)
Disaster recovery isn’t made up on the spot; it’s engineered well in advance.
“We assume something will break,” said Luke Odom, DreamHost’s Director of IT Operations. “So we design for failure.”
- Power? Every rack has redundant PDUs (power distribution units) on separate feeds.
- Network? Multiple ISPs (internet service providers) via different entry points.
- Storage? RAID arrays, replication layers, and backups ready to go.
Chris puts it plainly, invoking Murphy’s Law: “Anything that can go wrong, will go wrong.” It’s not a matter of if failure occurs, but when. The critical factor is how fast the team can recover or, better yet, prevent such events from occurring in the first place.

Monitoring: The Silent Shield
Keeping things online isn’t glamorous, but it’s constant.
“We walk the floor, check lights, review dashboards every day,” Chris said. “Sometimes the hardware doesn’t report its own failure, so we go looking.”
That diligence means small issues get caught before they become full-blown outages.
Current projects include upgrading firmware, patching code vulnerabilities, and rolling out better reporting so issues can be spotted (and solved) even faster.
“We monitor everything we can,” Chris said. “And when something slips through, we figure out why and fix the system.”
When Automation Breaks (Because It Does)
Automation is useful, until it isn’t.
“We break automation on the regular,” Luke said. “Half of what we do isn’t really automation, it’s just tools that make our hands-on work faster and easier.”
A major security vulnerability, for example, forced the team to reboot nearly everything and upgrade the BIOS across their infrastructure. It was tedious and unautomated, until they built a tool to streamline repeatable steps. They still use it today.
Server provisioning that once required hours now takes 20 minutes using OpenStack and Ansible. But if firmware, OS versions, or drivers change, scripts fail, and it’s back to hands-on work.
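To make that failure mode concrete, here is a minimal sketch of a preflight check that decides whether a host is safe to hand to automation or should fall back to hands-on work. It is not DreamHost's tooling: the `Host` fields, the version strings, and the `TESTED_COMBINATIONS` set are all hypothetical, invented for illustration.

```python
from dataclasses import dataclass

# Hypothetical firmware/OS pairs the provisioning scripts were last tested against.
TESTED_COMBINATIONS = {
    ("2.4.1", "debian-12"),
    ("2.5.0", "debian-12"),
}


@dataclass(frozen=True)
class Host:
    name: str
    firmware: str
    os_version: str


def provisioning_path(host: Host) -> str:
    """Return 'automated' for a tested combination, otherwise 'manual'."""
    if (host.firmware, host.os_version) in TESTED_COMBINATIONS:
        return "automated"
    return "manual"


if __name__ == "__main__":
    # Example hosts (names and versions are made up).
    for host in (Host("web-01", "2.5.0", "debian-12"),
                 Host("web-02", "3.0.0", "debian-12")):
        print(host.name, provisioning_path(host))
```

The idea is simply to catch a version drift before a script fails halfway through, so a human picks up the job from a clean starting point.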

War Story #2: Cow Field Edition
Sometimes, infrastructure support happens in… unconventional places.
While vacationing in southern Georgia, Luke got a call: RAID failure. Hardware down and servers offline. As a last resort, the team phoned Luke for help.
“I sat on a four-wheeler while herding cows,” he said. “I walked the tech through reassembling the array with one working drive. We duplicated it, rebuilt the RAID, and got the customers back online.”
Yes, from the middle of a cow field.
War Story #3: Gamer Jimmy, Fraud Slayer
Nearly a decade ago, DreamHost was hit with a wave of fraudulent sign-ups. Hackers would compromise accounts, order dedicated servers or VPS offerings, and use them to launch attacks or mine crypto.
Enter: Gamer Jimmy.
Not a team member, but a notorious hacker whose activity inspired an internal script.
“One of the guys wrote a script named after ‘Gamer Jimmy,’” Chris said. “It scanned for fraud indicators and auto-rejected suspicious requests.”
It worked.
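The original script isn't public, so the following is only a rough sketch of what "scan for fraud indicators and auto-reject" can look like in practice. Every indicator, weight, domain, and threshold below is an assumption made up for illustration, not something DreamHost has described.

```python
from dataclasses import dataclass

# Hypothetical values, chosen only to make the example concrete.
DISPOSABLE_DOMAINS = {"mailinator.example", "tempmail.example"}
REJECT_THRESHOLD = 3


@dataclass
class SignupRequest:
    account_age_days: int
    failed_payments: int
    servers_requested: int
    email_domain: str


def fraud_score(req: SignupRequest) -> int:
    """Add up the weights of every (hypothetical) indicator that fires."""
    score = 0
    if req.account_age_days < 1:
        score += 1  # brand-new account
    if req.failed_payments > 0:
        score += 2  # earlier payment attempts failed
    if req.servers_requested > 3:
        score += 1  # unusually large first order
    if req.email_domain in DISPOSABLE_DOMAINS:
        score += 2  # throwaway email provider
    return score


def review(req: SignupRequest) -> str:
    """Auto-reject when the combined score reaches the threshold."""
    return "reject" if fraud_score(req) >= REJECT_THRESHOLD else "allow"
```

A brand-new account with a failed payment scores 3 and is rejected, while a long-standing account ordering a single server scores 0 and goes through.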
💡Did You Know?
The FBI once installed a secret surveillance tap on a DreamHost customer, hidden under the data center floor. It was discovered later, during a data center shutdown. The asset tag location was literally: underfloor.
DreamHost Infrastructure: The Unsung Heroes
At the end of the day, DreamHost’s infrastructure team isn’t just about blinking lights and airflow. They’re the ones who drop everything—meetings, sleep, even vacations—to keep your site online.
“You don’t see us until something breaks,” Chris said. “But that’s kind of the point. If we’re invisible, we’re doing it right.”
And if you do see them?
There’s probably a burning switchboard or a four-wheeler involved. And somehow, your server still made it.
