🛠️ Operations & Recovery Runbook · v1.0 · 2026

E-UBI Operations Runbook

Infrastructure is not real because it boots once. It becomes real when a second person can maintain it, a damaged node can be rebuilt from documented materials, and predictable failures are handled with checklists instead of panic.

🗂️ Labeled assets 💾 Verified backups ♻️ Repeatable restores 🚨 Incident lanes 👥 Steward handoff
Expansion of the Community Networking / Local Infrastructure Nodes work and future network systems: Caleb Bott. This runbook layer turns the node and federation pages into something communities can actually sustain: backup cadence, restore sequence, incident triage, spare-part policy, and operational handoff that survives volunteer turnover.

Operational readiness means the system is explainable under stress

Every community node should have a legible operating baseline: where the documentation lives, what hardware is installed, where the backups go, who can respond, and how to rebuild the box after a disk failure, bad update, power event, or cabinet swap.

📐 Build docs
port map, BOM, VLANs, labels
🧪 Routine checks
backups, storage, link health
♻️ Restore drill
rebuild from image + snapshot
👥 Steward handoff
new operator can take over
Local source of truth
Docs mirror + printed quick sheet
The node should carry its own topology notes, credential workflow, device map, and recovery references locally, with a short paper quick-start stored in the cabinet.
Named ownership
Primary + secondary stewards
Two people should know how to inspect health, replace disks, rotate credentials, and perform a basic restore. One heroic maintainer is not a resilience strategy.
Failure inventory
Know the likely breakpoints
Power bricks, SSDs, cabling, radios, router configs, and WAN failover paths should be treated as expected failure surfaces and documented that way.
Recovery objective
Rejoin service in hours, not weeks
The goal is not perfect uptime. The goal is a restore path that a community team can execute with modest tools and clear instructions.
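
This baseline can also be kept machine-checkable, so a stressed operator does not have to remember where things live. A minimal sketch in Python, assuming a hypothetical manifest at /srv/docs/node.json; the path and every field name are illustrative, not a fixed schema:

#!/usr/bin/env python3
"""Readiness check: confirm the node can describe itself under stress.

Assumes a hypothetical manifest at /srv/docs/node.json recording where the
docs, backups, and steward contacts live. Paths and fields are illustrative.
"""
import json
import os
import sys

REQUIRED_FIELDS = [
    "hostname",       # matches the documented hostname pattern
    "stewards",       # primary + secondary contacts
    "docs_mirror",    # local path to the docs mirror
    "snapshot_path",  # where nightly snapshots land
    "quick_sheet",    # digital copy of the printed cabinet quick-start
]

def check_baseline(manifest_path="/srv/docs/node.json"):
    with open(manifest_path) as f:
        manifest = json.load(f)
    problems = []
    for field in REQUIRED_FIELDS:
        if not manifest.get(field):
            problems.append(f"missing manifest field: {field}")
    # Paths named in the manifest should actually exist on disk.
    for field in ("docs_mirror", "snapshot_path"):
        path = manifest.get(field)
        if path and not os.path.exists(path):
            problems.append(f"{field} points at a missing path: {path}")
    # Two named stewards, not one heroic maintainer.
    if len(manifest.get("stewards") or []) < 2:
        problems.append("fewer than two named stewards")
    return problems

if __name__ == "__main__":
    issues = check_baseline()
    for issue in issues:
        print(f"BASELINE FAIL: {issue}")
    sys.exit(1 if issues else 0)

Run during routine checks, a non-zero exit is an early signal that the node has stopped being explainable, well before anything is actually broken.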

Backups should form a ladder, not a single hope

A serious node keeps more than one copy, in more than one place, with more than one recovery use. Runtime data, configuration, public docs, package caches, and encrypted private archives do not all belong in the same backup lane.

💾 Backup and recovery ladder

[Diagram: backup ladder for an E-UBI community node. Four tiers with restore arrows returning to the node: live node storage (database, docs, configs, service state; the fast restore source for current service), local snapshot (nightly image + config export on a cabinet or nearby backup SSD), peer/regional mirror (docs, packages, signed snapshots; recovery from a neighboring site), and encrypted off-site archive (weekly/monthly cold storage with retained history; the last line of defense, not daily runtime).]

Do not ask one copy to do every job. Local snapshots restore quickly, peer mirrors restore collaboratively, and encrypted off-site archives protect against site-wide loss.
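
The ladder can also be expressed as a restore-source chooser that walks outward tier by tier and records which one was used. A minimal sketch, assuming hypothetical tier paths and a deliberately cheap verification stand-in:

#!/usr/bin/env python3
"""Pick the nearest trustworthy restore source by walking the ladder outward.

Tier order matches the diagram; the paths and the verify step are
illustrative assumptions, not a prescribed layout.
"""
import os

# Nearest first: local snapshot, then peer mirror, then off-site archive.
RESTORE_TIERS = [
    ("local-snapshot", "/mnt/backup-ssd/snapshots/latest"),
    ("peer-mirror", "peer1.example.net:/srv/mirror/node-a/latest"),
    ("offsite-archive", "/mnt/cold/archive/latest.tar.gz.gpg"),
]

def verified(source):
    """Stand-in check: local paths must exist and be non-empty.

    A real drill would verify snapshot checksums or signatures here.
    """
    if ":" in source:  # remote peer path; assume reachable for the sketch
        return True
    return os.path.exists(source) and os.path.getsize(source) > 0

def choose_restore_source():
    for tier_name, source in RESTORE_TIERS:
        if verified(source):
            # Document which recovery tier was used, per the runbook.
            print(f"restoring from tier: {tier_name} ({source})")
            return tier_name, source
    raise RuntimeError("no trustworthy restore source; escalate to stewards")

if __name__ == "__main__":
    choose_restore_source()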

# Example maintenance cadence
[daily]
check = service health, queue depth, free disk, WAN state
[weekly]
check = successful snapshots, UPS/DC runtime, log noise, package cache integrity
[monthly]
check = test restore on spare media, rotate credentials, verify contact sheet
[quarterly]
check = cabinet cleanout, replace suspect cables, review steward handoff docs
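
Saved as a plain INI file, that cadence can be read directly with Python's configparser. A minimal sketch, assuming the block above is stored as a hypothetical cadence.ini:

#!/usr/bin/env python3
"""Print the maintenance checklist for one period (daily, weekly, ...)."""
import configparser
import sys

def checklist(period, path="cadence.ini"):
    config = configparser.ConfigParser()
    config.read(path)
    if period not in config:
        raise SystemExit(f"unknown period: {period}")
    # Each section holds one comma-separated 'check' list.
    return [item.strip() for item in config[period]["check"].split(",")]

if __name__ == "__main__":
    period = sys.argv[1] if len(sys.argv) > 1 else "daily"
    for item in checklist(period):
        print(f"[ ] {item}")

Printing checkboxes instead of automating the checks is deliberate: the cadence exists so that a human looks at the node on a schedule.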

The restore drill should be rehearsed before disaster

The clean restore path matters more than the theoretical elegance of the original deployment. If a service node dies, a community team should know how to image storage, import the saved configuration, restore current data, verify network roles, and re-enter federation safely; a consolidated drill sketch follows the numbered steps.

1
Replace or re-image the failed boot/storage media
Start from a known-good base image with documented package versions, hostname pattern, network plan, and service definitions rather than attempting ad hoc surgery on a corrupted disk.
2
Restore the node identity and config set
Reapply interface assignments, VLAN/trunk expectations, service env files, PM2 or systemd definitions, scheduled jobs, and the local docs mirror index before restoring dynamic data.
3
Recover current data from the nearest trustworthy source
Prefer the newest verified local snapshot. If that is unavailable, step outward to a peer mirror or encrypted archive, documenting which recovery tier was used.
4
Validate local services before rejoining broader sync
Check docs, dashboards, databases, relay queues, and cached assets locally first. Do not reconnect a confused node to federation until its local state is coherent.
5
Resume queued sync and publish an operator note
Once the node is healthy, release its queued jobs, watch for drift or duplicates, and log what failed, what was restored, and what should be improved before the next incident.
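
Held together as one script, the sequence survives stress better than memory does. A minimal orchestration sketch of the five stages, where restore-image, apply-config, and the other commands are hypothetical placeholders for the site's own documented procedures:

#!/usr/bin/env python3
"""Restore drill skeleton: run the five stages in order, stop on failure.

Every command string below is a placeholder for a site-specific, documented
procedure; the point is the fixed ordering and the operator note at the end.
"""
import datetime
import subprocess

STAGES = [
    ("re-image media", "restore-image --base /srv/images/node-base.img"),
    ("restore identity and config", "apply-config --from /mnt/backup-ssd/config-export"),
    ("recover current data", "restore-data --source local-snapshot"),
    ("validate local services", "node-healthcheck --local-only"),
    ("resume sync", "release-queue --watch-drift"),
]

def drill():
    notes = []
    for name, cmd in STAGES:
        print(f"--> {name}: {cmd}")
        # check=True aborts the drill the moment a stage fails.
        subprocess.run(cmd, shell=True, check=True)
        notes.append(f"{datetime.datetime.now().isoformat()} OK: {name}")
    # Publish the operator note: what ran, in what order.
    with open("/srv/docs/operator-log.txt", "a") as f:
        f.write("\n".join(notes) + "\n")

if __name__ == "__main__":
    drill()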

Treat incidents as bounded lanes, not vague emergencies

Not every failure is equal. Power, backhaul, storage, misconfiguration, and security events need different first moves. A useful runbook narrows the response quickly: isolate, preserve, restore, communicate, and only then optimize.

🚨 Incident triage lanes

[Diagram: incident triage lanes for E-UBI infrastructure. Five categories (power event: UPS/DC runtime, graceful shutdown; backhaul loss: WAN down, LAN still useful; storage fault: disk errors, snapshot failure; config drift: VLANs, services, bad deploy; security concern) feed into first actions (stabilize power, preserve logs, freeze risky changes, isolate faults, switch to local mode if upstream is gone, decide the restore tier before improvising, notify the steward list), communications (local status board, operator log entry, public notice if needed, peer alert for restore help), and outcomes (recover, rebuild, escalate, document).]

The best incident process makes it obvious what kind of failure happened, what not to touch yet, who should know, and when to shift from stabilization into restore; a lane dispatcher sketch follows the lane notes below.

Power event

Protect the access layer, confirm battery or UPS runtime, shut down non-critical services first, and log whether the outage was local only or community-wide.

Backhaul loss

Keep the node serving locally. Record queue growth, pause non-essential outbound jobs, and avoid confusing users by pretending the WAN-dependent features are still healthy.

Storage fault

Stop writes if corruption is suspected, preserve evidence, identify the last good snapshot, and move quickly toward media replacement rather than squeezing life out of a dying disk.

Config drift

Freeze further changes, compare the running state against the operator log and the saved config set, and decide on a restore tier before improvising; a bad deploy or VLAN edit is usually undone by reapplying known-good configuration, not by live experimentation.

Security concern

Rotate secrets, isolate affected services, capture logs, notify named stewards, and prefer temporary service reduction over silent compromise.
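
The lane-first discipline can be written down as a small dispatcher, so first moves are read rather than improvised. A minimal sketch, with lane names and action strings drawn from the diagram and lane notes above; the script itself is illustrative:

#!/usr/bin/env python3
"""Incident triage: name the lane first, then follow its fixed first moves."""
import sys

LANES = {
    "power": [
        "confirm UPS/DC runtime; protect the access layer",
        "shut down non-critical services first",
        "log whether the outage is local-only or community-wide",
    ],
    "backhaul": [
        "keep the node serving locally; record queue growth",
        "pause non-essential outbound jobs",
        "mark WAN-dependent features as degraded",
    ],
    "storage": [
        "stop writes if corruption is suspected; preserve evidence",
        "identify the last good snapshot",
        "plan media replacement, not disk heroics",
    ],
    "config": [
        "freeze risky changes",
        "decide the restore tier before improvising",
    ],
    "security": [
        "rotate secrets; isolate affected services",
        "capture logs; notify named stewards",
        "prefer temporary service reduction over silent compromise",
    ],
}

def triage(lane):
    if lane not in LANES:
        raise SystemExit(f"unknown lane {lane!r}; lanes: {', '.join(LANES)}")
    print(f"== incident lane: {lane} ==")
    for step in LANES[lane]:
        print(f"  - {step}")
    print("then: status board, operator log entry, peer alert if restore help is needed")

if __name__ == "__main__":
    triage(sys.argv[1] if len(sys.argv) > 1 else "power")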

Sustainable infrastructure needs a human handoff model

Community-run systems survive because knowledge is shared deliberately. The operational layer should define who can approve changes, who can perform a restore, where contacts live, what spare kit is stocked, and how a new steward learns the environment without guessing.

Spare kit
SSD, cables, PSU, AP, labeled boot media
Cheap, failure-prone components should be locally stocked. Waiting weeks for a replacement power adapter is not a sophisticated resilience posture.
Operator packet
Contacts, diagrams, credentials workflow
Keep a current contact sheet, change log, and credential rotation procedure accessible to authorized stewards so turnover does not reset the whole operation.
Change discipline
One log for material changes
Router updates, service changes, VLAN edits, tunnel rotation, and backup destination changes should land in a single operator log rather than scattered chat messages; a minimal log-entry sketch follows these cards.
Training mode
Shadow, drill, then own
A new volunteer should first observe, then perform a drill with supervision, then take responsibility for a bounded maintenance task with written sign-off.
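
The single operator log can be as simple as an append-only JSONL file in the docs mirror. A minimal sketch, assuming a hypothetical path and entry fields:

#!/usr/bin/env python3
"""Append one material change to the single operator log.

The path and field names are illustrative; the invariant is one file,
append-only, one entry per material change.
"""
import datetime
import getpass
import json

LOG_PATH = "/srv/docs/operator-log.jsonl"  # assumption: lives in the docs mirror

def log_change(category, summary, rollback_hint=""):
    entry = {
        "when": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "who": getpass.getuser(),
        "category": category,       # e.g. router, service, vlan, tunnel, backup
        "summary": summary,
        "rollback": rollback_hint,  # how to undo this change if it misbehaves
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(entry) + "\n")

if __name__ == "__main__":
    log_change("vlan", "moved AP-2 onto guest VLAN 30",
               rollback_hint="reapply switch config export from 2026-01-10")

Recording a rollback hint with every entry keeps the log useful during incidents, not just during audits.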