Rolling upgrades are here — upgrade Ceph with zero downtime

NodeFoundry 0.14 ships rolling upgrade support for major Ceph version transitions. Here’s how we built it and what to expect.

The problem with Ceph upgrades

Upgrading Ceph has historically meant scheduling downtime windows. Each OSD daemon has to stop, update, and restart, and doing this naively across a cluster leaves placement groups degraded and can cause I/O stalls that last for hours.

The community’s answer is cephadm, which handles rolling restarts. But orchestrating the full upgrade — draining OSDs, respecting failure domain boundaries, watching health before moving on — requires careful sequencing that cephadm leaves to the operator.

What NodeFoundry does differently

When you trigger an upgrade via nf cluster upgrade or the dashboard, NodeFoundry:

Validates preconditions: the cluster must be HEALTH_OK, all OSDs must be up+in, and no backfill may be in progress.
Drains one host at a time: sets the noout and norebalance flags, marks the host's OSDs down, and waits for PGs to reach active+clean.
Upgrades the host: updates Ceph packages and restarts daemons in the correct order (mon → mgr → osd).
Waits for recovery: polls until the host's OSDs return to up+in and PG health is clean before moving to the next host.
Respects --max-concurrent: for large clusters, you can upgrade multiple hosts in parallel within a failure domain.
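
To make the sequencing concrete, here is a minimal sketch of the per-host loop described above. It is not NodeFoundry's actual implementation: FakeCluster, rolling_upgrade, and poll are hypothetical names, the cluster is an in-memory stand-in that only records actions, and clearing the flags before waiting for recovery is an assumption on our part.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FakeCluster:
    """Hypothetical in-memory stand-in for a cluster; it only records actions."""
    hosts: List[str]
    healthy: bool = True
    log: List[str] = field(default_factory=list)

    def is_healthy(self) -> bool:
        # A real check would cover HEALTH_OK, OSDs up+in, and no backfill.
        return self.healthy

    def set_flags(self, *flags: str) -> None:
        self.log.append("set " + ",".join(flags))

    def unset_flags(self, *flags: str) -> None:
        self.log.append("unset " + ",".join(flags))

    def upgrade_host(self, host: str) -> None:
        # Restart order from the post: mon, then mgr, then osd.
        for daemon in ("mon", "mgr", "osd"):
            self.log.append(f"restart {daemon}@{host}")


def poll(check: Callable[[], bool], attempts: int = 30, delay: float = 0.0) -> bool:
    """Call check() until it returns True or the attempts run out."""
    for _ in range(attempts):
        if check():
            return True
        time.sleep(delay)
    return False


def rolling_upgrade(cluster, wait_until=poll) -> bool:
    """Upgrade hosts one at a time; stop at the first host that fails to recover."""
    if not cluster.is_healthy():
        return False  # precondition failed, nothing has been touched
    for host in cluster.hosts:
        cluster.set_flags("noout", "norebalance")
        try:
            cluster.upgrade_host(host)
        finally:
            # Assumption: flags are cleared even if the upgrade step raises.
            cluster.unset_flags("noout", "norebalance")
        if not wait_until(cluster.is_healthy):
            return False  # do not move on while the cluster is unhealthy
    return True
```

Calling rolling_upgrade(FakeCluster(hosts=["node-1", "node-2"])) walks both hosts in order and leaves the recorded actions in the cluster's log, which makes the ordering easy to inspect.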

The whole process is visible in the dashboard as a live timeline, with per-node status and an estimated completion time.

Trying it

# Upgrade to Squid, one node at a time
nf cluster upgrade prod-01 --ceph-version squid

# Upgrade OS packages alongside Ceph, two nodes at a time
nf cluster upgrade prod-01 --os-packages --max-concurrent 2

# Preview what would happen without executing
nf cluster upgrade prod-01 --ceph-version squid --dry-run
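
For intuition about what --max-concurrent and --dry-run imply, the sketch below plans batches that never span more than one failure domain and describes them without executing anything. The function names, the host-to-domain mapping, and the output wording are all hypothetical; this is not the output format of the nf CLI.

```python
from collections import defaultdict
from typing import Dict, List


def plan_batches(host_domains: Dict[str, str], max_concurrent: int = 1) -> List[List[str]]:
    """Group hosts into batches; every batch stays inside one failure domain."""
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    by_domain: Dict[str, List[str]] = defaultdict(list)
    for host, domain in host_domains.items():
        by_domain[domain].append(host)
    batches: List[List[str]] = []
    for domain in sorted(by_domain):
        hosts = sorted(by_domain[domain])
        # Slice each domain's hosts into chunks of at most max_concurrent.
        for i in range(0, len(hosts), max_concurrent):
            batches.append(hosts[i:i + max_concurrent])
    return batches


def dry_run(host_domains: Dict[str, str], max_concurrent: int = 1) -> List[str]:
    """Describe the plan as text lines instead of executing it."""
    plan = plan_batches(host_domains, max_concurrent)
    return [f"step {n}: upgrade {', '.join(batch)}" for n, batch in enumerate(plan, start=1)]
```

With three hosts in one rack, one host in another, and max_concurrent set to 2, the plan has three steps: two hosts from the first rack, then the remaining host from that rack, then the host in the second rack.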

Rolling upgrades are available on NodeFoundry 0.14+. Existing clusters will see the Upgrade button appear in the cluster header once the agent updates.

We're happy to walk you through it.