iPXE booting 100 nodes: what we learned scaling the bootstrap network

Running large-scale tests revealed subtle issues in iPXE chaining, TFTP congestion, and DHCP lease timing. This is what we fixed.

When we tested NodeFoundry against a 100-node cluster, the iPXE bootstrap process — which works perfectly at 10–20 nodes — started showing intermittent failures. Here’s what broke and how we fixed it.

TFTP congestion

iPXE defaults to fetching the kernel and initrd over TFTP, which has no congestion control. At 100 simultaneous boots, the master node’s TFTP server was getting saturated, causing timeouts and retries that cascaded into longer boot times.

Fix: We switched to HTTP-based iPXE chaining. The master node serves a small initial iPXE script over TFTP, which immediately chains to an HTTP endpoint. HTTP handles concurrency gracefully, and boot times dropped from 4+ minutes to under 90 seconds at 100 nodes.
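The pattern looks roughly like this. The snippet below is an illustrative sketch, not our exact production script: the server address and path are placeholders, and it assumes a standard iPXE build with an embedded script.

```
#!ipxe
# Tiny embedded script served over TFTP; its only job is to
# acquire an address and hand off to HTTP immediately.
dhcp
chain http://192.0.2.10/boot.ipxe
```

Because the TFTP payload is now a few hundred bytes instead of a multi-megabyte kernel and initrd, the TFTP server never saturates; all the heavy transfers ride over HTTP, which handles 100 concurrent clients without breaking a sweat.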

DHCP lease exhaustion

Our test subnet was a /24 (254 usable addresses), and with IPMI, PXE, and OS interfaces all requesting leases, 100 nodes meant roughly 300 concurrent lease requests. A mass reboot pushed us straight into lease exhaustion.

Fix: NodeFoundry now reserves a dedicated DHCP range for PXE boot and issues short-TTL leases (5 minutes) during the bootstrap phase. Post-boot, nodes request a permanent lease on the storage network.
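If you're running dnsmasq as your DHCP server (we're not assuming you are; adapt to your server of choice), the dedicated short-TTL PXE pool can be expressed like this. The address ranges are illustrative:

```
# Tag requests coming from PXE firmware by their vendor class
dhcp-vendorclass=set:pxe,PXEClient

# Dedicated bootstrap pool with 5-minute leases for PXE clients
dhcp-range=tag:pxe,10.0.0.100,10.0.0.220,5m

# Everything else (IPMI, booted OS) draws from the normal pool
dhcp-range=10.0.0.10,10.0.0.99,12h
```

The short TTL means abandoned bootstrap leases recycle quickly, so even a cluster-wide reboot can't pin the pool for long.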

iPXE script chaining timeouts

iPXE has a built-in timeout for chain loading. On networks where switch ports are slow to start forwarding traffic (our 100GbE switches had noticeable STP convergence delays), the chain request would occasionally time out before the switch finished converging.

Fix: We added configurable chain timeout parameters in the master node’s iPXE template, with a default of 30 seconds to cover STP convergence windows.
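In iPXE script terms, the fix looks something like the sketch below (server address is a placeholder; iPXE's `--timeout` takes milliseconds). Pairing the timeout with a retry loop rides out convergence windows longer than a single attempt:

```
#!ipxe
# Retry the chain load until the switch port is forwarding.
# 30000 ms matches our template's default chain timeout.
:fetch
chain --timeout 30000 http://192.0.2.10/boot.ipxe || goto retry
exit

:retry
sleep 2
goto fetch
```

A bounded retry count is worth adding in production so a genuinely dead boot server fails fast instead of looping forever.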

Want to see it for yourself?

We're happy to walk you through it.

No pitch deck. Just a real conversation about your infrastructure, your cluster size, and whether NodeFoundry is the right fit. If it's not, we'll tell you.