Infrastructure is not real just because it boots once. It becomes real when a second person can maintain it, a damaged node can be rebuilt from documented materials, and predictable failures are handled with checklists instead of panic.
Every community node should have a legible operating baseline: where the documentation lives, what hardware is installed, where the backups go, who can respond, and how to rebuild the box after a disk failure, bad update, power event, or cabinet swap.
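A baseline like this can live as a small machine-readable record next to the documentation itself. The sketch below is a minimal Python version with hypothetical field names and URLs; the real fields should follow whatever the node spec defines.

    from dataclasses import dataclass

    @dataclass
    class NodeBaseline:
        docs_url: str            # where the documentation lives
        hardware: list[str]      # installed hardware, by inventory tag
        backup_targets: list[str]  # where the backups go
        stewards: list[str]      # who can respond
        rebuild_doc: str         # pointer to the bare-metal rebuild procedure

    # Illustrative values only; every name and URL here is made up.
    baseline = NodeBaseline(
        docs_url="https://docs.example.net/node-01",
        hardware=["chassis-01", "nvme-0", "nvme-1", "ups-01"],
        backup_targets=["local-snapshots", "peer-mirror", "offsite-archive"],
        stewards=["steward-a", "steward-b"],
        rebuild_doc="https://docs.example.net/node-01/rebuild",
    )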
A serious node keeps more than one copy, in more than one place, with more than one recovery use. Runtime data, configuration, public docs, package caches, and encrypted private archives do not all belong in the same backup lane.
Do not ask one copy to do every job. Local snapshots restore quickly, peer mirrors restore collaboratively, and encrypted off-site archives protect against site-wide loss.
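One way to keep the lanes from collapsing into each other is to write the mapping down. The following is an illustrative Python sketch with made-up lane and data-class names; the point it demonstrates is that each class of data has exactly one lane, and each lane has a distinct restore use.

    BACKUP_LANES = {
        # data class         (backup lane,         restore use)
        "runtime-data":      ("local-snapshot",    "fast single-node recovery"),
        "configuration":     ("local-snapshot",    "fast single-node recovery"),
        "public-docs":       ("peer-mirror",       "collaborative rebuild"),
        "package-cache":     ("peer-mirror",       "collaborative rebuild"),
        "private-archive":   ("offsite-encrypted", "survives site-wide loss"),
    }

    def lane_for(data_class: str) -> tuple[str, str]:
        # Every class of data has exactly one lane; failing loudly beats
        # silently backing something up into the wrong place.
        return BACKUP_LANES[data_class]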
The clean restore path matters more than the theoretical elegance of the original deployment. If a service node dies, a community team should know how to image storage, import the saved configuration, restore current data, verify network roles, and re-enter federation safely.
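That restore path can be drilled as an ordered, fail-fast checklist. The Python sketch below uses hypothetical stub functions in place of the real procedures; the shape worth keeping is the fixed ordering and the hard stop before re-entering federation.

    def image_storage() -> bool:
        return True  # re-image disks from documented install media

    def import_config() -> bool:
        return True  # apply the saved configuration bundle

    def restore_data() -> bool:
        return True  # restore current data from the newest verified backup

    def verify_network_roles() -> bool:
        return True  # confirm addresses, VLANs, and service roles match the spec

    def reenter_federation() -> bool:
        return True  # re-establish peer trust only after everything above passes

    RESTORE_STEPS = [
        ("image storage", image_storage),
        ("import configuration", import_config),
        ("restore data", restore_data),
        ("verify network roles", verify_network_roles),
        ("re-enter federation", reenter_federation),
    ]

    def run_restore() -> None:
        # Stop at the first failure so a half-restored node never rejoins peers.
        for name, step in RESTORE_STEPS:
            print(f"[restore] {name}")
            if not step():
                raise SystemExit(f"[restore] failed at: {name}")
        print("[restore] complete")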
Not every failure is equal. Power, backhaul, storage, misconfiguration, and security events need different first moves. A useful runbook narrows the response quickly: isolate, preserve, restore, communicate, and only then optimize.
The best incident process makes it obvious what kind of failure happened, what not to touch yet, who should know, and when to shift from stabilization into restore.
Power events: protect the access layer, confirm battery or UPS runtime, shut down non-critical services first, and log whether the outage was local only or community-wide.
Backhaul loss: keep the node serving locally. Record queue growth, pause non-essential outbound jobs, and avoid confusing users by pretending WAN-dependent features are still healthy.
Storage faults: stop writes if corruption is suspected, preserve evidence, identify the last good snapshot, and move quickly toward media replacement rather than squeezing life out of a dying disk.
Security events: rotate secrets, isolate affected services, capture logs, notify named stewards, and prefer temporary service reduction over silent compromise.
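These four playbooks can be encoded as a lookup table so the first moves are never a judgment call made under stress. The sketch below is illustrative Python; the entries simply restate the playbooks above, with the generic isolate-preserve-restore-communicate-optimize order as a fallback for unclassified failures.

    FIRST_MOVES = {
        "power": [
            "protect the access layer",
            "confirm battery or UPS runtime",
            "shut down non-critical services first",
            "log whether the outage was local only or community-wide",
        ],
        "backhaul": [
            "keep the node serving locally",
            "record queue growth",
            "pause non-essential outbound jobs",
            "mark WAN-dependent features as degraded",
        ],
        "storage": [
            "stop writes if corruption is suspected",
            "preserve evidence",
            "identify the last good snapshot",
            "replace the failing media",
        ],
        "security": [
            "rotate secrets",
            "isolate affected services",
            "capture logs",
            "notify named stewards",
        ],
    }

    def first_moves(failure_class: str) -> list[str]:
        # Unknown classes fall back to the generic response order.
        return FIRST_MOVES.get(
            failure_class,
            ["isolate", "preserve", "restore", "communicate", "optimize"],
        )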
Community-run systems survive because knowledge is shared deliberately. The operational layer should define who can approve changes, who can perform a restore, where contacts live, what spare kit is stocked, and how a new steward learns the environment without guessing.
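Custody boundaries are easier to audit when they are written as data rather than remembered. A minimal sketch, assuming hypothetical steward names and action labels:

    ROLES = {
        "approve-change":  {"steward-a", "steward-b"},
        "perform-restore": {"steward-a", "steward-c"},
    }

    def may(person: str, action: str) -> bool:
        # Default-deny: an action missing from the table is nobody's to take.
        return person in ROLES.get(action, set())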
Use the node spec for cabinet layout, VLAN structure, services, and the hardware baseline this runbook is meant to support.
Open Node Spec
The federation guide explains how peers and mirrors cooperate. This page explains how operators keep those relationships healthy over time.
Open Federation Guide
The service matrix defines locality, sync scope, mirrored artifacts, and approval boundaries so restore and incident work happen against clear trust rules.
Open Service Matrix
The identity guide makes the operator side explicit: credential custody, break-glass access, peer trust suspension, and what has to rotate when a steward leaves or a secret is exposed.
Open Identity & Trust Guide
The operator handbook defines steward roles, escalation ladders, and custody boundaries so incident response does not quietly depend on one person's informal authority.
Open Operator Handbook
The runbooks turn this operations layer into concrete procedures for identity/session issues, mirrors, relay/realtime failures, and backup recovery.
Open Service Runbooks
The terminal and power chain matter because they feed the operational system. Recovery planning should include the endpoint hardware, not just the rack.
Open Device Blueprint
The homepage now ties together blueprint, network stack, federation, operations, service governance, stewardship, and runbooks as one practical autonomy stack.
Open Homepage Network Layer