Service-class runbooks · v1.0 · 2026

E-UBI Service Runbooks

Runbooks turn architecture into repeatable maintenance. This page focuses on the service classes that most directly determine whether a community node can still authenticate users, serve public docs, relay social activity, and recover after outages without improvising under stress.

🔐 identity/session · 📚 mirrors & artifacts · ⚡ relay & realtime · 💾 backup lanes · 🛠️ safe degraded modes
Community Networking / Local Infrastructure Nodes expansion and future network systems work: Caleb Bott.

These runbooks are written against the current TheEtherNet implementation: username-first auth, 15-minute access tokens, 7-day persisted refresh tokens, 24-hour guest sessions, JWT-backed Socket.IO handshakes, and process-level rather than schema-level moderation boundaries.

Start with the service classes that define survivability

Not every process deserves the same urgency. The first runbooks should cover the services that keep identity legible, local documentation reachable, public coordination possible, and recovery material fresh enough to matter.

Identity / session broker: login, refresh, guest access
The path that turns a username into a usable session and keeps that session alive through refresh rotation. Breakage here looks like repeated 401s, failed login, or dead guest access.
Docs / package mirrors: offline knowledge and static artifacts
If manuals, procedures, images, and public notices disappear, the site loses the instructions needed to repair itself and coordinate the next move.
TheEtherNet relay: REST + realtime social lane
Posting, messaging, and presence rely on both HTTP bearer auth and Socket.IO handshake auth. Failures often look partial rather than total, so runbooks must verify both paths.
Backup / sync worker: snapshots, exports, remote copies
A quiet backup failure becomes tomorrow’s outage. This lane protects restore credibility and regional recovery capacity.

Treat auth as a service boundary, not just a route file

The current TheEtherNet implementation uses username-based login, short access tokens, refresh-token rotation in PostgreSQL, and a client interceptor that retries once on 401 by posting to `/ethernet-api/auth/refresh`. That means the auth runbook must verify both the server-side token store and the client-side refresh behavior.
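The client half of that boundary is small enough to sketch. Below is a minimal version of the single-retry interceptor; only the refresh route (`/ethernet-api/auth/refresh`) and the retry-once-then-logout rule come from the implementation notes, while the token store, response shape, and `logout()` helper are illustrative assumptions.

```ts
// Sketch of the documented single-retry 401 behavior (not the actual client code).
import axios, { AxiosError, InternalAxiosRequestConfig } from "axios";

const api = axios.create({ baseURL: "/ethernet-api" });
let accessToken: string | null = null;

api.interceptors.request.use((config) => {
  // Bearer on every REST request, per the auth path above.
  if (accessToken) config.headers.Authorization = `Bearer ${accessToken}`;
  return config;
});

api.interceptors.response.use(undefined, async (error: AxiosError) => {
  const original = error.config as
    | (InternalAxiosRequestConfig & { _retried?: boolean })
    | undefined;
  // Retry exactly once: a second 401 means refresh itself is broken.
  if (error.response?.status === 401 && original && !original._retried) {
    original._retried = true;
    try {
      // Assumption: the refresh token rides an HTTP-only cookie.
      const { data } = await axios.post("/ethernet-api/auth/refresh");
      accessToken = data.accessToken; // assumed response field
      return api(original);           // replay the failed request
    } catch {
      logout();
    }
  }
  return Promise.reject(error);
});

function logout(): void {
  // Hypothetical helper: the real client's logout path is not shown here.
  accessToken = null;
  window.location.assign("/login");
}
```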

🔐 Current auth/session path

👤 `/auth/login`: username + password
🎟️ access 15m: Bearer on every REST request
♻️ refresh 7d: stored in PostgreSQL, rotated on refresh
🔌 Socket.IO handshake: JWT verified again for realtime
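The rotation step in the middle of that path is where silent breakage tends to hide. A minimal rotation sketch against PostgreSQL follows; the `refresh_tokens` table and its columns are hypothetical stand-ins, not the actual TheEtherNet schema.

```ts
// Sketch of consume-and-replace refresh rotation. Table and column
// names (refresh_tokens, token_hash, user_id, expires_at) are assumptions.
import { createHash, randomBytes } from "node:crypto";
import { Pool } from "pg";

const pool = new Pool(); // connection settings via PG* env vars

const sha256 = (t: string) => createHash("sha256").update(t).digest("hex");

export async function rotateRefreshToken(presented: string) {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    // Atomically consume the old token so a replayed token fails.
    const { rows } = await client.query(
      `DELETE FROM refresh_tokens
        WHERE token_hash = $1 AND expires_at > now()
        RETURNING user_id`,
      [sha256(presented)],
    );
    if (rows.length === 0) {
      await client.query("ROLLBACK");
      return null; // expired, unknown, or already rotated
    }
    const next = randomBytes(32).toString("hex");
    await client.query(
      `INSERT INTO refresh_tokens (user_id, token_hash, expires_at)
       VALUES ($1, $2, now() + interval '7 days')`, // 7d, per the path above
      [rows[0].user_id, sha256(next)],
    );
    await client.query("COMMIT");
    return { userId: rows[0].user_id, refreshToken: next };
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}
```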
1. Classify the auth failure
Separate login rejection, refresh failure, guest-session failure, and realtime auth failure. They share components but do not imply the same root cause.
2. Verify the username-first path
Confirm the account exists by `username`, not email. Check whether the affected account is a guest account, because guests cannot recover through the normal password-login route.
3. Check refresh-token health
If users log in but then bounce into repeated 401s, inspect refresh-token rotation, expiry, database reachability, and clock correctness before blaming the client.
4. Verify the client retry path once
The Axios client retries once after a 401 using `/ethernet-api/auth/refresh`. If that request fails, the client logs out. That makes refresh health a visible user-impact boundary.
5. Re-test Socket.IO separately
Realtime auth uses the same JWT family but a different connection path. A successful REST login does not prove the websocket handshake and proxy path are healthy; a combined probe sketch follows this list.
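Here is one way to exercise those paths as four separate signals. The base URL, the `/profile` route, and the response shapes are assumptions for illustration; the login route is assumed to sit behind the documented `/ethernet-api` prefix.

```ts
// Probe login, bearer REST, refresh, and the socket handshake independently.
// Requires Node 18+ (global fetch) and socket.io-client.
import { io } from "socket.io-client";

const BASE = "https://node.example.org"; // hypothetical node address

async function probeAuth(username: string, password: string) {
  // Login path (username-first, per step 2).
  const login = await fetch(`${BASE}/ethernet-api/auth/login`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ username, password }),
  });
  console.log("login:", login.status);
  const { accessToken } = await login.json(); // assumed response shape

  // Bearer-protected REST path (assumed route name).
  const me = await fetch(`${BASE}/ethernet-api/profile`, {
    headers: { Authorization: `Bearer ${accessToken}` },
  });
  console.log("rest bearer:", me.status);

  // Refresh path (documented route; token assumed cookie-borne).
  const refresh = await fetch(`${BASE}/ethernet-api/auth/refresh`, {
    method: "POST",
  });
  console.log("refresh:", refresh.status);

  // Socket.IO handshake: same JWT family, different path (step 5).
  const socket = io(BASE, {
    path: "/ethernet-socket.io",
    auth: { token: accessToken },
  });
  socket.on("connect", () => { console.log("socket: ok"); socket.disconnect(); });
  socket.on("connect_error", (err) => console.log("socket:", err.message));
}
```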

What to look for first

401 bursts after a previously valid session, invalid refresh-token responses, guest-only failures, or sockets rejecting with unauthorized while normal page loads still work.

Safe degraded mode

Keep read-only or already-authenticated local use alive where possible, but avoid issuing risky trust changes until refresh integrity and time sync are verified.

Escalate when

Refresh tokens appear invalid across multiple users, guest creation fails unexpectedly, or any fix requires rotating secrets, changing trust material, or break-glass recovery.

# Identity/session quick facts from current TheEtherNet
register            = username + password (+ optional display name)
login               = username + password
access_token        = 15 minutes
refresh_token       = 7 days, stored in PostgreSQL
guest_username      = ghost_xxxxxxxx
guest_session       = 24h access + 24h refresh
client_401_behavior = single refresh attempt, then logout

Protect the knowledge layer first

Mirrors and static artifact lanes are often treated as secondary because they do not feel interactive. In practice they are the fastest path back to competence during an outage. If the community loses the docs mirror, it also loses diagrams, restore steps, and public notices.

1. Verify the local copy before the remote path
Check the local mirror or static bundle on the node itself. Only after the local copy is confirmed should you chase CDN, federation, or external publication problems.
2. Check freshness, not just existence
A mirror can be up and still be dangerously stale. Verify the latest update time, known-good build artifacts, and whether the change log aligns with what operators believe is current.
3. Protect storage headroom
Mirrors quietly fail when disks fill, copy jobs half-complete, or old artifacts accumulate. Treat free space as an operational signal, not a cleanup chore. A freshness-and-headroom probe sketch follows this list.
4. Degrade to local read-only if necessary
If remote publication is broken, keep the local mirror readable. Losing edit/publish convenience is preferable to losing the practical documentation needed for repair.
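One way to script steps 2 and 3 together. The mirror path, manifest filename, and both thresholds are assumptions to adjust per node; it uses Node's `fs.statfs`, so it needs Node 18.15 or newer.

```ts
// Freshness-and-headroom probe for a local docs mirror.
import { promises as fs } from "node:fs";
import path from "node:path";

const MIRROR_DIR = "/srv/mirror/docs";     // hypothetical path
const MAX_AGE_HOURS = 48;                  // illustrative staleness bound
const MIN_FREE_BYTES = 5 * 1024 ** 3;      // illustrative 5 GiB floor

async function checkMirror(): Promise<void> {
  // Freshness: stat a file the sync job touches on every run; a bare
  // directory mtime only changes when direct entries change.
  const manifest = await fs.stat(path.join(MIRROR_DIR, "manifest.json"));
  const ageHours = (Date.now() - manifest.mtimeMs) / 3_600_000;
  console.log(
    ageHours > MAX_AGE_HOURS
      ? `STALE: last sync ${ageHours.toFixed(1)}h ago`
      : `fresh: last sync ${ageHours.toFixed(1)}h ago`,
  );

  // Headroom: copy jobs half-complete silently when the disk fills.
  const { bavail, bsize } = await fs.statfs(MIRROR_DIR);
  const freeGiB = (bavail * bsize) / 1024 ** 3;
  console.log(
    bavail * bsize < MIN_FREE_BYTES
      ? `LOW SPACE: ${freeGiB.toFixed(1)} GiB free`
      : `headroom ok: ${freeGiB.toFixed(1)} GiB free`,
  );
}

checkMirror().catch((err) => { console.error(err); process.exit(1); });
```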

Typical symptoms

404s on public docs, stale BOMs, missing diagrams, broken asset paths, or a mirror that loads locally but not through the regional or public route.

Likely causes

Build artifact drift, bad copy job, storage pressure, stale sync job, or remote publication failure rather than a total local-service loss.

Escalate when

The only current copy is at risk, checksum/signature confidence is gone, or a mirror mismatch could mislead operators during a real repair event.

Verify both the REST path and the socket path

TheEtherNet’s social layer is split across authenticated HTTP routes and Socket.IO connections. The site proxy sends REST calls through `/ethernet-api` and websocket traffic through `/ethernet-socket.io`, so a relay incident can be partial: page loads may work while live messaging fails, or auth may work for sockets but not for post creation.
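A sketch of what that two-lane split can look like at the proxy, assuming an Express front proxy with `http-proxy-middleware` v2; the upstream target and ports are illustrative, and only the two public prefixes come from the implementation.

```ts
// Two proxy lanes: plain HTTP for REST, upgrade-aware for Socket.IO.
import express from "express";
import { createProxyMiddleware } from "http-proxy-middleware";

const app = express();
const UPSTREAM = "http://127.0.0.1:4000"; // hypothetical relay address

// REST lane. Mounting at app root with a context path keeps the
// /ethernet-api prefix on the forwarded request (v2 behavior).
app.use(createProxyMiddleware("/ethernet-api", {
  target: UPSTREAM,
  changeOrigin: true,
}));

// Realtime lane. ws: true is what keeps Socket.IO upgrade requests
// working; an incident can break one lane while the other stays up.
app.use(createProxyMiddleware("/ethernet-socket.io", {
  target: UPSTREAM,
  changeOrigin: true,
  ws: true,
}));

app.listen(8080);
```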

⚡ Relay boundary to verify

🌐 site proxy: `/ethernet-api` + `/ethernet-socket.io`
🧱 Express routes: posts, messages, profile, friends
🔌 Socket.IO auth: token in `handshake.auth.token`
👥 steward moderation boundary: no schema-level role system today
1. Confirm whether the failure is HTTP, websocket, or both
Test a normal authenticated page action and a realtime action separately. Do not assume a posting failure and a messaging failure share a root cause.
2. Verify bearer auth before chasing business logic
Because routes are protected by shared JWT middleware, an expired or missing bearer token can masquerade as a posts, friends, or profile failure.
3. Check the socket handshake path
Socket.IO expects the token in the auth handshake. If REST works but realtime fails, focus on handshake auth, proxy routing, and token freshness rather than the database first. A server-side handshake sketch follows this list.
4. Separate technical outage from governance dispute
The current schema does not implement a dedicated moderator or steward role. If the issue is about who is allowed to act rather than whether the socket connects, route it through the handbook and identity policy instead of treating it as a purely technical bug.
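The server-side view of step 3, sketched as Socket.IO middleware. The token location (`handshake.auth.token`) comes from the implementation; the secret sourcing and JWT payload shape are assumptions.

```ts
// Verify the JWT again on the realtime path, per the relay boundary above.
import { Server } from "socket.io";
import jwt from "jsonwebtoken";

const io = new Server(3001, { path: "/ethernet-socket.io" });

io.use((socket, next) => {
  const token = socket.handshake.auth?.token as string | undefined;
  if (!token) return next(new Error("unauthorized: missing token"));
  try {
    // Same JWT family as REST, verified independently for realtime.
    const payload = jwt.verify(token, process.env.JWT_SECRET!) as { sub: string };
    socket.data.userId = payload.sub; // assumed claim name
    next();
  } catch {
    next(new Error("unauthorized: bad or expired token"));
  }
});

io.on("connection", (socket) => {
  console.log("realtime connected:", socket.data.userId);
});
```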

Backups are believable only if the restore path stays warm

The backup lane is where service continuity becomes real. Snapshots, exports, and peer copies should be recent enough, verifiable enough, and documented enough that a damaged node can come back without guessing which file matters or who is allowed to unlock it.

Primary signal: age of the last known-good snapshot
A green job status matters less than the age of the last restorable copy.
Secondary signal: off-site or peer copy presence
One local disk is convenience, not resilience. Verify at least one separate failure domain.
Custody rule: restore rights are explicit
Possession of an encrypted archive should not silently imply authority to restore or inspect its full contents.
Drill target: rebuild one service class end to end
Runbooks only mature when an actual restore drill proves they are complete and ordered correctly.
1. Confirm the latest good copy and its location
Record which snapshot is considered current, where it lives, and who can authorize use of it. A probe covering both backup signals follows this list.
2. Verify restore metadata before disaster day
Keep checksums, encryption notes, service ordering, and recovery contacts close to the archive so the backup is operationally meaningful.
3. Restore the minimum useful slice first
Bring back docs, identity, and local coordination before chasing full cosmetic parity. Restore priority should follow community usefulness, not technical neatness.
4. Log what changed during recovery
Record new tokens, new peer trust, replaced hardware, service versions, and what was intentionally left offline. Tomorrow’s steward will need that timeline.
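A probe covering the primary and secondary signals above. Every path, host, and threshold here is a placeholder to adapt; the peer check assumes key-based SSH access to a second failure domain.

```ts
// Check snapshot age locally and copy presence on a peer node.
import { promises as fs } from "node:fs";
import path from "node:path";
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

const SNAPSHOT_DIR = "/srv/backups/snapshots";   // hypothetical path
const PEER = "steward@peer-node.example.org";    // hypothetical peer
const MAX_AGE_HOURS = 26;                        // illustrative bound

async function checkBackups(): Promise<void> {
  // Primary signal: a green job status matters less than restorable age.
  const entries = await fs.readdir(SNAPSHOT_DIR);
  let newest = 0;
  for (const name of entries) {
    const s = await fs.stat(path.join(SNAPSHOT_DIR, name));
    newest = Math.max(newest, s.mtimeMs);
  }
  const ageHours = (Date.now() - newest) / 3_600_000;
  console.log(
    ageHours > MAX_AGE_HOURS
      ? `STALE SNAPSHOT: ${ageHours.toFixed(1)}h old`
      : `snapshot age ok: ${ageHours.toFixed(1)}h`,
  );

  // Secondary signal: verify a separate failure domain holds a copy.
  const { stdout } = await run("ssh", [PEER, "ls", "-1", SNAPSHOT_DIR]);
  console.log(stdout.trim().length > 0 ? "peer copy present" : "NO PEER COPY FOUND");
}

checkBackups().catch((err) => { console.error(err); process.exit(1); });
```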