🛠️ Operations & Recovery Runbook · v1.0 · 2026

E-UBI Operations Runbook

Infrastructure is not real because it boots once. It becomes real when a second person can maintain it, a damaged node can be rebuilt from documented materials, and predictable failures are handled with checklists instead of panic.

🗂️ Labeled assets 💾 Verified backups ♻️ Repeatable restores 🚨 Incident lanes 👥 Steward handoff
Expansion of the Community Networking / Local Infrastructure Nodes work and future network systems: Caleb Bott. This runbook layer turns the node and federation pages into something communities can actually sustain: backup cadence, restore sequence, incident triage, spare-part policy, and operational handoff that survives volunteer turnover.

Operational readiness means the system is explainable under stress

Every community node should have a legible operating baseline: where the documentation lives, what hardware is installed, where the backups go, who can respond, and how to rebuild the box after a disk failure, bad update, power event, or cabinet swap.

📐 Build docs
port map, BOM, VLANs, labels
🧪 Routine checks
backups, storage, link health
♻️ Restore drill
rebuild from image + snapshot
👥 Steward handoff
new operator can take over
Local source of truth
Docs mirror + printed quick sheet
The node should carry its own topology notes, credential workflow, device map, and recovery references locally, with a short paper quick-start stored in the cabinet.
Named ownership
Primary + secondary stewards
Two people should know how to inspect health, replace disks, rotate credentials, and perform a basic restore. One heroic maintainer is not a resilience strategy.
Failure inventory
Know the likely breakpoints
Power bricks, SSDs, cabling, radios, router configs, and WAN failover paths should be treated as expected failure surfaces and documented that way.
Recovery objective
Rejoin service in hours, not weeks
The goal is not perfect uptime. The goal is a restore path that a community team can execute with modest tools and clear instructions.
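
This baseline can also be kept machine-checkable, so a stressed operator does not have to remember where things live. A minimal sketch in Python, assuming a hypothetical manifest at /srv/docs/node.json; the path and every field name are illustrative, not a fixed schema:

#!/usr/bin/env python3
"""Readiness check: confirm the node can describe itself under stress.

Assumes a hypothetical manifest at /srv/docs/node.json recording where the
docs, backups, and steward contacts live. Paths and fields are illustrative.
"""
import json
import os
import sys

REQUIRED_FIELDS = [
    "hostname",       # matches the documented hostname pattern
    "stewards",       # primary + secondary contacts
    "docs_mirror",    # local path to the docs mirror
    "snapshot_path",  # where nightly snapshots land
    "quick_sheet",    # digital copy of the printed cabinet quick-start
]

def check_baseline(manifest_path="/srv/docs/node.json"):
    with open(manifest_path) as f:
        manifest = json.load(f)
    problems = []
    for field in REQUIRED_FIELDS:
        if not manifest.get(field):
            problems.append(f"missing manifest field: {field}")
    # Paths named in the manifest should actually exist on disk.
    for field in ("docs_mirror", "snapshot_path"):
        path = manifest.get(field)
        if path and not os.path.exists(path):
            problems.append(f"{field} points at a missing path: {path}")
    # Two named stewards, not one heroic maintainer.
    if len(manifest.get("stewards") or []) < 2:
        problems.append("fewer than two named stewards")
    return problems

if __name__ == "__main__":
    issues = check_baseline()
    for issue in issues:
        print(f"BASELINE FAIL: {issue}")
    sys.exit(1 if issues else 0)

Run during routine checks, a non-zero exit is an early signal that the node has stopped being explainable, well before anything is actually broken.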

Backups should form a ladder, not a single hope

A serious node keeps more than one copy, in more than one place, with more than one recovery use. Runtime data, configuration, public docs, package caches, and encrypted private archives do not all belong in the same backup lane.

💾 Backup and recovery ladder

[Diagram: backup ladder for an E-UBI community node. Four tiers with restore arrows returning to the node: live node storage (database, docs, configs, service state; the fast restore source for current service), local snapshot (nightly image + config export on a cabinet or nearby backup SSD), peer/regional mirror (docs, packages, signed snapshots; recovery from a neighboring site), and encrypted off-site archive (weekly/monthly cold storage with retained history; the last line of defense, not daily runtime).]

Do not ask one copy to do every job. Local snapshots restore quickly, peer mirrors restore collaboratively, and encrypted off-site archives protect against site-wide loss.
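
The ladder can also be expressed as a restore-source chooser that walks outward tier by tier and records which one was used. A minimal sketch, assuming hypothetical tier paths and a deliberately cheap verification stand-in:

#!/usr/bin/env python3
"""Pick the nearest trustworthy restore source by walking the ladder outward.

Tier order matches the diagram; the paths and the verify step are
illustrative assumptions, not a prescribed layout.
"""
import os

# Nearest first: local snapshot, then peer mirror, then off-site archive.
RESTORE_TIERS = [
    ("local-snapshot", "/mnt/backup-ssd/snapshots/latest"),
    ("peer-mirror", "peer1.example.net:/srv/mirror/node-a/latest"),
    ("offsite-archive", "/mnt/cold/archive/latest.tar.gz.gpg"),
]

def verified(source):
    """Stand-in check: local paths must exist and be non-empty.

    A real drill would verify snapshot checksums or signatures here.
    """
    if ":" in source:  # remote peer path; assume reachable for the sketch
        return True
    return os.path.exists(source) and os.path.getsize(source) > 0

def choose_restore_source():
    for tier_name, source in RESTORE_TIERS:
        if verified(source):
            # Document which recovery tier was used, per the runbook.
            print(f"restoring from tier: {tier_name} ({source})")
            return tier_name, source
    raise RuntimeError("no trustworthy restore source; escalate to stewards")

if __name__ == "__main__":
    choose_restore_source()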

# Example maintenance cadence
[daily]
check = service health, queue depth, free disk, WAN state
[weekly]
check = successful snapshots, UPS/DC runtime, log noise, package cache integrity
[monthly]
check = test restore on spare media, rotate credentials, verify contact sheet
[quarterly]
check = cabinet cleanout, replace suspect cables, review steward handoff docs
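
Saved as a plain INI file, that cadence can be read directly with Python's configparser. A minimal sketch, assuming the block above is stored as a hypothetical cadence.ini:

#!/usr/bin/env python3
"""Print the maintenance checklist for one period (daily, weekly, ...)."""
import configparser
import sys

def checklist(period, path="cadence.ini"):
    config = configparser.ConfigParser()
    config.read(path)
    if period not in config:
        raise SystemExit(f"unknown period: {period}")
    # Each section holds one comma-separated 'check' list.
    return [item.strip() for item in config[period]["check"].split(",")]

if __name__ == "__main__":
    period = sys.argv[1] if len(sys.argv) > 1 else "daily"
    for item in checklist(period):
        print(f"[ ] {item}")

Printing checkboxes instead of automating the checks is deliberate: the cadence exists so that a human looks at the node on a schedule.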

The restore drill should be rehearsed before disaster

The clean restore path matters more than the theoretical elegance of the original deployment. If a service node dies, a community team should know how to image storage, import the saved configuration, restore current data, verify network roles, and re-enter federation safely; a consolidated drill sketch follows the numbered steps.

1
Replace or re-image the failed boot/storage media
Start from a known-good base image with documented package versions, hostname pattern, network plan, and service definitions rather than attempting ad hoc surgery on a corrupted disk.
2
Restore the node identity and config set
Reapply interface assignments, VLAN/trunk expectations, service env files, PM2 or systemd definitions, scheduled jobs, and the local docs mirror index before restoring dynamic data.
3
Recover current data from the nearest trustworthy source
Prefer the newest verified local snapshot. If that is unavailable, step outward to a peer mirror or encrypted archive, documenting which recovery tier was used.
4
Validate local services before rejoining broader sync
Check docs, dashboards, databases, relay queues, and cached assets locally first. Do not reconnect a confused node to federation until its local state is coherent.
5
Resume queued sync and publish an operator note
Once the node is healthy, release its queued jobs, watch for drift or duplicates, and log what failed, what was restored, and what should be improved before the next incident.
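
Held together as one script, the sequence survives stress better than memory does. A minimal orchestration sketch of the five stages, where restore-image, apply-config, and the other commands are hypothetical placeholders for the site's own documented procedures:

#!/usr/bin/env python3
"""Restore drill skeleton: run the five stages in order, stop on failure.

Every command string below is a placeholder for a site-specific, documented
procedure; the point is the fixed ordering and the operator note at the end.
"""
import datetime
import subprocess

STAGES = [
    ("re-image media", "restore-image --base /srv/images/node-base.img"),
    ("restore identity and config", "apply-config --from /mnt/backup-ssd/config-export"),
    ("recover current data", "restore-data --source local-snapshot"),
    ("validate local services", "node-healthcheck --local-only"),
    ("resume sync", "release-queue --watch-drift"),
]

def drill():
    notes = []
    for name, cmd in STAGES:
        print(f"--> {name}: {cmd}")
        # check=True aborts the drill the moment a stage fails.
        subprocess.run(cmd, shell=True, check=True)
        notes.append(f"{datetime.datetime.now().isoformat()} OK: {name}")
    # Publish the operator note: what ran, in what order.
    with open("/srv/docs/operator-log.txt", "a") as f:
        f.write("\n".join(notes) + "\n")

if __name__ == "__main__":
    drill()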

Treat incidents as bounded lanes, not vague emergencies

Not every failure is equal. Power, backhaul, storage, misconfiguration, and security events need different first moves. A useful runbook narrows the response quickly: isolate, preserve, restore, communicate, and only then optimize.

🚨 Incident triage lanes

[Diagram: incident triage lanes for E-UBI infrastructure. Five categories (power event: UPS/DC runtime, graceful shutdown; backhaul loss: WAN down, LAN still useful; storage fault: disk errors, snapshot failure; config drift: VLANs, services, bad deploy; security concern) feed into first actions (stabilize power, preserve logs, freeze risky changes, isolate faults, switch to local mode if upstream is gone, decide the restore tier before improvising, notify the steward list), communications (local status board, operator log entry, public notice if needed, peer alert for restore help), and outcomes (recover, rebuild, escalate, document).]

The best incident process makes it obvious what kind of failure happened, what not to touch yet, who should know, and when to shift from stabilization into restore; a lane dispatcher sketch follows the lane notes below.

Power event

Protect the access layer, confirm battery or UPS runtime, shut down non-critical services first, and log whether the outage was local only or community-wide.

Backhaul loss

Keep the node serving locally. Record queue growth, pause non-essential outbound jobs, and avoid confusing users by pretending the WAN-dependent features are still healthy.

Storage fault

Stop writes if corruption is suspected, preserve evidence, identify the last good snapshot, and move quickly toward media replacement rather than squeezing life out of a dying disk.

Config drift

Freeze further changes, compare the running state against the operator log and the saved config set, and decide on a restore tier before improvising; a bad deploy or VLAN edit is usually undone by reapplying known-good configuration, not by live experimentation.

Security concern

Rotate secrets, isolate affected services, capture logs, notify named stewards, and prefer temporary service reduction over silent compromise.
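
The lane-first discipline can be written down as a small dispatcher, so first moves are read rather than improvised. A minimal sketch, with lane names and action strings drawn from the diagram and lane notes above; the script itself is illustrative:

#!/usr/bin/env python3
"""Incident triage: name the lane first, then follow its fixed first moves."""
import sys

LANES = {
    "power": [
        "confirm UPS/DC runtime; protect the access layer",
        "shut down non-critical services first",
        "log whether the outage is local-only or community-wide",
    ],
    "backhaul": [
        "keep the node serving locally; record queue growth",
        "pause non-essential outbound jobs",
        "mark WAN-dependent features as degraded",
    ],
    "storage": [
        "stop writes if corruption is suspected; preserve evidence",
        "identify the last good snapshot",
        "plan media replacement, not disk heroics",
    ],
    "config": [
        "freeze risky changes",
        "decide the restore tier before improvising",
    ],
    "security": [
        "rotate secrets; isolate affected services",
        "capture logs; notify named stewards",
        "prefer temporary service reduction over silent compromise",
    ],
}

def triage(lane):
    if lane not in LANES:
        raise SystemExit(f"unknown lane {lane!r}; lanes: {', '.join(LANES)}")
    print(f"== incident lane: {lane} ==")
    for step in LANES[lane]:
        print(f"  - {step}")
    print("then: status board, operator log entry, peer alert if restore help is needed")

if __name__ == "__main__":
    triage(sys.argv[1] if len(sys.argv) > 1 else "power")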

Sustainable infrastructure needs a human handoff model

Community-run systems survive because knowledge is shared deliberately. The operational layer should define who can approve changes, who can perform a restore, where contacts live, what spare kit is stocked, and how a new steward learns the environment without guessing.

Spare kit
SSD, cables, PSU, AP, labeled boot media
Cheap, failure-prone components should be locally stocked. Waiting weeks for a replacement power adapter is not a sophisticated resilience posture.
Operator packet
Contacts, diagrams, credentials workflow
Keep a current contact sheet, change log, and credential rotation procedure accessible to authorized stewards so turnover does not reset the whole operation.
Change discipline
One log for material changes
Router updates, service changes, VLAN edits, tunnel rotation, and backup destination changes should land in a single operator log rather than scattered chat messages; a minimal log-entry sketch follows these cards.
Training mode
Shadow, drill, then own
A new volunteer should first observe, then perform a drill with supervision, then take responsibility for a bounded maintenance task with written sign-off.
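
The single operator log can be as simple as an append-only JSONL file in the docs mirror. A minimal sketch, assuming a hypothetical path and entry fields:

#!/usr/bin/env python3
"""Append one material change to the single operator log.

The path and field names are illustrative; the invariant is one file,
append-only, one entry per material change.
"""
import datetime
import getpass
import json

LOG_PATH = "/srv/docs/operator-log.jsonl"  # assumption: lives in the docs mirror

def log_change(category, summary, rollback_hint=""):
    entry = {
        "when": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "who": getpass.getuser(),
        "category": category,       # e.g. router, service, vlan, tunnel, backup
        "summary": summary,
        "rollback": rollback_hint,  # how to undo this change if it misbehaves
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(entry) + "\n")

if __name__ == "__main__":
    log_change("vlan", "moved AP-2 onto guest VLAN 30",
               rollback_hint="reapply switch config export from 2026-01-10")

Recording a rollback hint with every entry keeps the log useful during incidents, not just during audits.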