Changelog
New releases, fixes, and improvements to NodeFoundry.
Added
- Rolling upgrade support for major Ceph version transitions (Reef → Squid). NodeFoundry drains each OSD, upgrades the host OS and Ceph packages, and recommissions before moving to the next node.
- IPMI power-cycle and hard reboot actions are now available directly from the node list view in the dashboard.
- New `nf cluster health --verbose` flag shows per-OSD and per-placement-group status inline.
- API endpoint `POST /v1/clusters/{id}/upgrade` with progress streaming via `GET /v1/operations/{id}`.
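Consumed from a client, the new upgrade endpoint pairs naturally with polling `GET /v1/operations/{id}`. A minimal Python sketch, assuming a JSON response with `status` and `progress` fields (the exact response shape is an assumption, not the documented API), with a stubbed `fetch` standing in for a real HTTP client:

```python
import time

def wait_for_operation(op_id, fetch, poll_interval=0.0):
    """Poll GET /v1/operations/{id} until the operation finishes.

    `fetch` is any callable mapping a path to a decoded JSON dict; the
    {"status": ..., "progress": ...} shape is assumed for illustration.
    """
    while True:
        op = fetch(f"/v1/operations/{op_id}")
        if op["status"] in ("succeeded", "failed"):
            return op
        time.sleep(poll_interval)

# Stub fetch that simulates an upgrade operation progressing to completion.
_responses = iter([
    {"status": "running", "progress": 30},
    {"status": "running", "progress": 70},
    {"status": "succeeded", "progress": 100},
])
result = wait_for_operation("op-123", lambda path: next(_responses))
```

In a real client, `fetch` would issue an authenticated HTTP GET; injecting it as a callable keeps the loop testable.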
Changed
- Dashboard cluster overview page redesigned with a 24-hour utilization timeline and IOPS sparklines per node.
- API rate limits for Professional tier raised from 300 to 600 requests per minute.
- The `nf cluster create --wait` flag now streams live progress instead of polling.
Fixed
- S.M.A.R.T. polling failure on certain NVMe devices that report no temperature sensor.
- `nf node list` now correctly shows nodes that are in maintenance mode rather than showing them as offline.
- Race condition in the OSD activation sequence on nodes with more than 24 drives.
Fixed
- Race condition in OSD drain that could stall upgrade operations on clusters with more than 64 OSDs.
- Alert webhook retries were not respecting exponential backoff; on network errors they flooded the endpoint.
- Node registration timing issue on certain Supermicro BIOS versions that send DHCP requests before the NIC link is stable.
- Off-by-one error in the PG count calculation when creating erasure-coded pools with `k=4, m=2`.
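For reference, the retry behavior the webhook fix restores can be sketched as a standard exponential backoff schedule; the `base`, `cap`, and jitter defaults below are illustrative assumptions, not NodeFoundry's actual parameters:

```python
import random

def backoff_delays(attempts, base=1.0, cap=60.0, jitter=False):
    """Exponential backoff schedule: base * 2**n, capped at `cap` seconds.

    With jitter=True each delay is drawn uniformly from [0, capped],
    which spreads retries out instead of flooding the endpoint.
    """
    delays = []
    for n in range(attempts):
        capped = min(cap, base * (2 ** n))
        delays.append(random.uniform(0, capped) if jitter else capped)
    return delays
```

Without the backoff, every network error triggered an immediate retry; with it, `backoff_delays(5)` yields delays of 1, 2, 4, 8, and 16 seconds.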
Added
- S.M.A.R.T. monitoring for HDDs and NVMe devices, with configurable alert thresholds per drive model. Predictive failure events generate dashboard warnings and optional webhook alerts.
- Centralized log aggregation: OSD, monitor, and service logs from all nodes are indexed in Loki and searchable from the dashboard and via `nf log search`.
- Webhook alerting: route cluster health events to Slack, PagerDuty, or any HTTP endpoint. Configure per-severity thresholds and retry policies.
- New `nf log search` CLI command with `--since`, `--until`, `--node`, and `--level` flags.
Changed
- Minimum supported Ceph version raised to Pacific (16.2.x). Luminous and Nautilus clusters are no longer supported for new deployments.
- iPXE bootstrap image now uses a read-only tmpfs overlay to improve boot reliability on nodes with marginal disks.
- DHCP lease timeout reduced from 24 hours to 4 hours to improve re-registration after IP changes.
Added
- NVMe-oF gateway service provisioning. Deploy and manage NVMe-oF targets on top of RBD pools from the dashboard and CLI.
- Multi-cluster selector in the dashboard: switch between clusters without navigating away.
- `nf pool create --erasure` with support for named erasure code profiles and automatic PG count calculation.
- Node maintenance mode: `nf node maintenance <name>` drains OSDs and suppresses alerts while you perform hardware work.
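The automatic PG count calculation likely follows the common Ceph sizing heuristic, shown here as a sketch; the `target_pgs_per_osd=100` default and power-of-two rounding come from upstream Ceph guidance and are not necessarily NodeFoundry's exact formula:

```python
def ec_pg_count(num_osds, k, m, target_pgs_per_osd=100):
    """Rough PG count for an erasure-coded pool using the common Ceph
    heuristic: (OSDs * target PGs per OSD) / pool size, rounded up to
    the next power of two. Pool size for an EC pool is k + m.
    """
    size = k + m
    raw = (num_osds * target_pgs_per_osd) / size
    # Round up to the next power of two.
    pgs = 1
    while pgs < raw:
        pgs *= 2
    return pgs
```

For example, 24 OSDs with a `k=4, m=2` profile (pool size 6) give a raw target of 400 PGs, rounded up to 512.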
Changed
- Cluster create API now returns an `operation_id` and supports streaming progress via SSE at `/v1/operations/{id}/stream`.
- Dashboard metrics retention increased from 7 days to 30 days on Professional tier.
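A client consuming the SSE stream only needs to decode `data:` lines. This minimal parser is a sketch, and the `progress`/`phase` payload fields are assumed for illustration rather than taken from the API docs:

```python
import json

def parse_sse(stream_lines):
    """Yield decoded JSON payloads from the `data:` lines of an SSE stream.

    Events are separated by blank lines; other SSE fields (event:, id:)
    are ignored in this sketch.
    """
    for line in stream_lines:
        line = line.rstrip("\n")
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):].strip())

# Simulated stream for a cluster-create operation (field names assumed).
sample = [
    'data: {"progress": 40, "phase": "bootstrapping monitors"}\n',
    '\n',
    'data: {"progress": 100, "phase": "done"}\n',
]
events = list(parse_sse(sample))
```

In practice the lines would come from an open HTTP response to `/v1/operations/{id}/stream` rather than a list.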
Fixed
- CephFS MDS placement on single-rack topologies put the active and standby daemons on the same host.
- CRUSH map generation failed on clusters where node hostnames contained uppercase letters.
Added
- General availability. NodeFoundry is out of beta.
- Object Gateway (RGW) service provisioning with S3-compatible endpoint and configurable zonegroups.
- API key management: create scoped read-only and read-write keys per account.
- Ceph Squid (19.x) support.