We’ve been operating Ceph in production for 12 years

Before NodeFoundry, our team spent over a decade operating Ceph clusters — for cloud providers, financial services firms, and data-intensive startups. We’ve seen multi-petabyte clusters, we’ve handled 3 AM OSD failures, and we’ve written more Ansible playbooks for Ceph than we care to count.

The operational gap

Ceph is genuinely the right answer for software-defined storage. It’s battle-tested, it scales, and the community is excellent. But operating it requires expertise that’s hard to hire and institutional knowledge that lives in runbooks, not tooling.

Every team we talked to had some version of the same story: a senior engineer who “knows Ceph” carries the on-call burden, and when that person leaves, institutional knowledge walks out the door with them.

What we decided to build

NodeFoundry encodes what good Ceph operations look like into software. Not a managed service — you run it on your hardware, in your network, with your data staying where it belongs. But the operational intelligence — the drain sequences, the upgrade orchestration, the SMART monitoring, the failure domain awareness — that’s built in.

We’re building the tool we always wished existed.

We're happy to walk you through it.