Service-class runbooks · v1.0 · 2026

E-UBI Service Runbooks

Runbooks turn architecture into repeatable maintenance. This page focuses on the service classes that most directly determine whether a community node can still authenticate users, serve public docs, relay social activity, and recover after outages without improvising under stress.

🔐 identity/session · 📚 mirrors & artifacts · ⚡ relay & realtime · 💾 backup lanes · 🛠️ safe degraded modes
Community Networking / Local Infrastructure Nodes expansion and future network systems work: Caleb Bott.

These runbooks are written against the current TheEtherNet implementation: username-first auth, 15-minute access tokens, 7-day persisted refresh tokens, 24-hour guest sessions, JWT-backed Socket.IO handshakes, and process-level rather than schema-level moderation boundaries.

Start with the service classes that define survivability

Not every process deserves the same urgency. The first runbooks should cover the services that keep identity legible, local documentation reachable, public coordination possible, and recovery material fresh enough to matter.

Identity / session broker: login, refresh, guest access
The path that turns a username into a usable session and keeps that session alive through refresh rotation. Breakage here looks like repeated 401s, failed login, or dead guest access.
Docs / package mirrors: offline knowledge and static artifacts
If manuals, procedures, images, and public notices disappear, the site loses the instructions needed to repair itself and coordinate the next move.
TheEtherNet relay: REST + realtime social lane
Posting, messaging, and presence rely on both HTTP bearer auth and Socket.IO handshake auth. Failures often look partial rather than total, so runbooks must verify both paths.
Backup / sync worker: snapshots, exports, remote copies
A quiet backup failure becomes tomorrow’s outage. This lane protects restore credibility and regional recovery capacity.

Treat auth as a service boundary, not just a route file

The current TheEtherNet implementation uses username-based login, short access tokens, refresh-token rotation in PostgreSQL, and a client interceptor that retries once on 401 by posting to `/ethernet-api/auth/refresh`. That means the auth runbook must verify both the server-side token store and the client-side refresh behavior.
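The client half of that boundary is small enough to sketch. Below is a minimal version of the single-retry interceptor; only the refresh route (`/ethernet-api/auth/refresh`) and the retry-once-then-logout rule come from the implementation notes, while the token store, response shape, and `logout()` helper are illustrative assumptions.

```ts
// Sketch of the documented single-retry 401 behavior (not the actual client code).
import axios, { AxiosError, InternalAxiosRequestConfig } from "axios";

const api = axios.create({ baseURL: "/ethernet-api" });
let accessToken: string | null = null;

api.interceptors.request.use((config) => {
  // Bearer on every REST request, per the auth path above.
  if (accessToken) config.headers.Authorization = `Bearer ${accessToken}`;
  return config;
});

api.interceptors.response.use(undefined, async (error: AxiosError) => {
  const original = error.config as
    | (InternalAxiosRequestConfig & { _retried?: boolean })
    | undefined;
  // Retry exactly once: a second 401 means refresh itself is broken.
  if (error.response?.status === 401 && original && !original._retried) {
    original._retried = true;
    try {
      // Assumption: the refresh token rides an HTTP-only cookie.
      const { data } = await axios.post("/ethernet-api/auth/refresh");
      accessToken = data.accessToken; // assumed response field
      return api(original);           // replay the failed request
    } catch {
      logout();
    }
  }
  return Promise.reject(error);
});

function logout(): void {
  // Hypothetical helper: the real client's logout path is not shown here.
  accessToken = null;
  window.location.assign("/login");
}
```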

🔐 Current auth/session path

👤 `/auth/login`: username + password
🎟️ access 15m: Bearer on every REST request
♻️ refresh 7d: stored in PostgreSQL, rotated on refresh
🔌 Socket.IO handshake: JWT verified again for realtime
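The rotation step in the middle of that path is where silent breakage tends to hide. A minimal rotation sketch against PostgreSQL follows; the `refresh_tokens` table and its columns are hypothetical stand-ins, not the actual TheEtherNet schema.

```ts
// Sketch of consume-and-replace refresh rotation. Table and column
// names (refresh_tokens, token_hash, user_id, expires_at) are assumptions.
import { createHash, randomBytes } from "node:crypto";
import { Pool } from "pg";

const pool = new Pool(); // connection settings via PG* env vars

const sha256 = (t: string) => createHash("sha256").update(t).digest("hex");

export async function rotateRefreshToken(presented: string) {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    // Atomically consume the old token so a replayed token fails.
    const { rows } = await client.query(
      `DELETE FROM refresh_tokens
        WHERE token_hash = $1 AND expires_at > now()
        RETURNING user_id`,
      [sha256(presented)],
    );
    if (rows.length === 0) {
      await client.query("ROLLBACK");
      return null; // expired, unknown, or already rotated
    }
    const next = randomBytes(32).toString("hex");
    await client.query(
      `INSERT INTO refresh_tokens (user_id, token_hash, expires_at)
       VALUES ($1, $2, now() + interval '7 days')`, // 7d, per the path above
      [rows[0].user_id, sha256(next)],
    );
    await client.query("COMMIT");
    return { userId: rows[0].user_id, refreshToken: next };
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}
```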
1. Classify the auth failure
Separate login rejection, refresh failure, guest-session failure, and realtime auth failure. They share components but do not imply the same root cause.
2. Verify the username-first path
Confirm the account exists by `username`, not email. Check whether the affected account is a guest account, because guests cannot recover through the normal password-login route.
3. Check refresh-token health
If users log in but then bounce into repeated 401s, inspect refresh-token rotation, expiry, database reachability, and clock correctness before blaming the client.
4. Verify the client retry path once
The Axios client retries once after a 401 using `/ethernet-api/auth/refresh`. If that request fails, the client logs out. That makes refresh health a visible user-impact boundary.
5. Re-test Socket.IO separately
Realtime auth uses the same JWT family but a different connection path. A successful REST login does not prove the websocket handshake and proxy path are healthy; a combined probe sketch follows this list.
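Here is one way to exercise those paths as four separate signals. The base URL, the `/profile` route, and the response shapes are assumptions for illustration; the login route is assumed to sit behind the documented `/ethernet-api` prefix.

```ts
// Probe login, bearer REST, refresh, and the socket handshake independently.
// Requires Node 18+ (global fetch) and socket.io-client.
import { io } from "socket.io-client";

const BASE = "https://node.example.org"; // hypothetical node address

async function probeAuth(username: string, password: string) {
  // Login path (username-first, per step 2).
  const login = await fetch(`${BASE}/ethernet-api/auth/login`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ username, password }),
  });
  console.log("login:", login.status);
  const { accessToken } = await login.json(); // assumed response shape

  // Bearer-protected REST path (assumed route name).
  const me = await fetch(`${BASE}/ethernet-api/profile`, {
    headers: { Authorization: `Bearer ${accessToken}` },
  });
  console.log("rest bearer:", me.status);

  // Refresh path (documented route; token assumed cookie-borne).
  const refresh = await fetch(`${BASE}/ethernet-api/auth/refresh`, {
    method: "POST",
  });
  console.log("refresh:", refresh.status);

  // Socket.IO handshake: same JWT family, different path (step 5).
  const socket = io(BASE, {
    path: "/ethernet-socket.io",
    auth: { token: accessToken },
  });
  socket.on("connect", () => { console.log("socket: ok"); socket.disconnect(); });
  socket.on("connect_error", (err) => console.log("socket:", err.message));
}
```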

What to look for first

401 bursts after a previously valid session, invalid refresh-token responses, guest-only failures, or sockets rejecting with unauthorized while normal page loads still work.

Safe degraded mode

Keep read-only or already-authenticated local use alive where possible, but avoid issuing risky trust changes until refresh integrity and time sync are verified.

Escalate when

Refresh tokens appear invalid across multiple users, guest creation fails unexpectedly, or any fix requires rotating secrets, changing trust material, or break-glass recovery.

# Identity/session quick facts from current TheEtherNet
register            = username + password (+ optional display name)
login               = username + password
access_token        = 15 minutes
refresh_token       = 7 days, stored in PostgreSQL
guest_username      = ghost_xxxxxxxx
guest_session       = 24h access + 24h refresh
client_401_behavior = single refresh attempt, then logout

Protect the knowledge layer first

Mirrors and static artifact lanes are often treated as secondary because they do not feel interactive. In practice they are the fastest path back to competence during an outage. If the community loses the docs mirror, it also loses diagrams, restore steps, and public notices.

1. Verify the local copy before the remote path
Check the local mirror or static bundle on the node itself. Only after the local copy is confirmed should you chase CDN, federation, or external publication problems.
2. Check freshness, not just existence
A mirror can be up and still be dangerously stale. Verify the latest update time, known-good build artifacts, and whether the change log aligns with what operators believe is current.
3. Protect storage headroom
Mirrors quietly fail when disks fill, copy jobs half-complete, or old artifacts accumulate. Treat free space as an operational signal, not a cleanup chore. A freshness-and-headroom probe sketch follows this list.
4. Degrade to local read-only if necessary
If remote publication is broken, keep the local mirror readable. Losing edit/publish convenience is preferable to losing the practical documentation needed for repair.
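One way to script steps 2 and 3 together. The mirror path, manifest filename, and both thresholds are assumptions to adjust per node; it uses Node's `fs.statfs`, so it needs Node 18.15 or newer.

```ts
// Freshness-and-headroom probe for a local docs mirror.
import { promises as fs } from "node:fs";
import path from "node:path";

const MIRROR_DIR = "/srv/mirror/docs";     // hypothetical path
const MAX_AGE_HOURS = 48;                  // illustrative staleness bound
const MIN_FREE_BYTES = 5 * 1024 ** 3;      // illustrative 5 GiB floor

async function checkMirror(): Promise<void> {
  // Freshness: stat a file the sync job touches on every run; a bare
  // directory mtime only changes when direct entries change.
  const manifest = await fs.stat(path.join(MIRROR_DIR, "manifest.json"));
  const ageHours = (Date.now() - manifest.mtimeMs) / 3_600_000;
  console.log(
    ageHours > MAX_AGE_HOURS
      ? `STALE: last sync ${ageHours.toFixed(1)}h ago`
      : `fresh: last sync ${ageHours.toFixed(1)}h ago`,
  );

  // Headroom: copy jobs half-complete silently when the disk fills.
  const { bavail, bsize } = await fs.statfs(MIRROR_DIR);
  const freeGiB = (bavail * bsize) / 1024 ** 3;
  console.log(
    bavail * bsize < MIN_FREE_BYTES
      ? `LOW SPACE: ${freeGiB.toFixed(1)} GiB free`
      : `headroom ok: ${freeGiB.toFixed(1)} GiB free`,
  );
}

checkMirror().catch((err) => { console.error(err); process.exit(1); });
```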

Typical symptoms

404s on public docs, stale BOMs, missing diagrams, broken asset paths, or a mirror that loads locally but not through the regional or public route.

Likely causes

Build artifact drift, bad copy job, storage pressure, stale sync job, or remote publication failure rather than a total local-service loss.

Escalate when

The only current copy is at risk, checksum/signature confidence is gone, or a mirror mismatch could mislead operators during a real repair event.

Verify both the REST path and the socket path

TheEtherNet’s social layer is split across authenticated HTTP routes and Socket.IO connections. The site proxy sends REST calls through `/ethernet-api` and websocket traffic through `/ethernet-socket.io`, so a relay incident can be partial: page loads may work while live messaging fails, or auth may work for sockets but not for post creation.
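A sketch of what that two-lane split can look like at the proxy, assuming an Express front proxy with `http-proxy-middleware` v2; the upstream target and ports are illustrative, and only the two public prefixes come from the implementation.

```ts
// Two proxy lanes: plain HTTP for REST, upgrade-aware for Socket.IO.
import express from "express";
import { createProxyMiddleware } from "http-proxy-middleware";

const app = express();
const UPSTREAM = "http://127.0.0.1:4000"; // hypothetical relay address

// REST lane. Mounting at app root with a context path keeps the
// /ethernet-api prefix on the forwarded request (v2 behavior).
app.use(createProxyMiddleware("/ethernet-api", {
  target: UPSTREAM,
  changeOrigin: true,
}));

// Realtime lane. ws: true is what keeps Socket.IO upgrade requests
// working; an incident can break one lane while the other stays up.
app.use(createProxyMiddleware("/ethernet-socket.io", {
  target: UPSTREAM,
  changeOrigin: true,
  ws: true,
}));

app.listen(8080);
```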

⚡ Relay boundary to verify

🌐 site proxy: `/ethernet-api` + `/ethernet-socket.io`
🧱 Express routes: posts, messages, profile, friends
🔌 Socket.IO auth: token in `handshake.auth.token`
👥 steward moderation boundary: no schema-level role system today
1. Confirm whether the failure is HTTP, websocket, or both
Test a normal authenticated page action and a realtime action separately. Do not assume a posting failure and a messaging failure share a root cause.
2. Verify bearer auth before chasing business logic
Because routes are protected by shared JWT middleware, an expired or missing bearer token can masquerade as a posts, friends, or profile failure.
3. Check the socket handshake path
Socket.IO expects the token in the auth handshake. If REST works but realtime fails, focus on handshake auth, proxy routing, and token freshness rather than the database first. A server-side handshake sketch follows this list.
4. Separate technical outage from governance dispute
The current schema does not implement a dedicated moderator or steward role. If the issue is about who is allowed to act rather than whether the socket connects, route it through the handbook and identity policy instead of treating it as a purely technical bug.
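The server-side view of step 3, sketched as Socket.IO middleware. The token location (`handshake.auth.token`) comes from the implementation; the secret sourcing and JWT payload shape are assumptions.

```ts
// Verify the JWT again on the realtime path, per the relay boundary above.
import { Server } from "socket.io";
import jwt from "jsonwebtoken";

const io = new Server(3001, { path: "/ethernet-socket.io" });

io.use((socket, next) => {
  const token = socket.handshake.auth?.token as string | undefined;
  if (!token) return next(new Error("unauthorized: missing token"));
  try {
    // Same JWT family as REST, verified independently for realtime.
    const payload = jwt.verify(token, process.env.JWT_SECRET!) as { sub: string };
    socket.data.userId = payload.sub; // assumed claim name
    next();
  } catch {
    next(new Error("unauthorized: bad or expired token"));
  }
});

io.on("connection", (socket) => {
  console.log("realtime connected:", socket.data.userId);
});
```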

Backups are believable only if the restore path stays warm

The backup lane is where service continuity becomes real. Snapshots, exports, and peer copies should be recent enough, verifiable enough, and documented enough that a damaged node can come back without guessing which file matters or who is allowed to unlock it.

Primary signal: age of the last known-good snapshot
A green job status matters less than the age of the last restorable copy.
Secondary signal: off-site or peer copy presence
One local disk is convenience, not resilience. Verify at least one separate failure domain.
Custody rule: restore rights are explicit
Possession of an encrypted archive should not silently imply authority to restore or inspect its full contents.
Drill target: rebuild one service class end to end
Runbooks only mature when an actual restore drill proves they are complete and ordered correctly.
1. Confirm the latest good copy and its location
Record which snapshot is considered current, where it lives, and who can authorize use of it. A probe covering both backup signals follows this list.
2. Verify restore metadata before disaster day
Keep checksums, encryption notes, service ordering, and recovery contacts close to the archive so the backup is operationally meaningful.
3. Restore the minimum useful slice first
Bring back docs, identity, and local coordination before chasing full cosmetic parity. Restore priority should follow community usefulness, not technical neatness.
4. Log what changed during recovery
Record new tokens, new peer trust, replaced hardware, service versions, and what was intentionally left offline. Tomorrow’s steward will need that timeline.
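A probe covering the primary and secondary signals above. Every path, host, and threshold here is a placeholder to adapt; the peer check assumes key-based SSH access to a second failure domain.

```ts
// Check snapshot age locally and copy presence on a peer node.
import { promises as fs } from "node:fs";
import path from "node:path";
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

const SNAPSHOT_DIR = "/srv/backups/snapshots";   // hypothetical path
const PEER = "steward@peer-node.example.org";    // hypothetical peer
const MAX_AGE_HOURS = 26;                        // illustrative bound

async function checkBackups(): Promise<void> {
  // Primary signal: a green job status matters less than restorable age.
  const entries = await fs.readdir(SNAPSHOT_DIR);
  let newest = 0;
  for (const name of entries) {
    const s = await fs.stat(path.join(SNAPSHOT_DIR, name));
    newest = Math.max(newest, s.mtimeMs);
  }
  const ageHours = (Date.now() - newest) / 3_600_000;
  console.log(
    ageHours > MAX_AGE_HOURS
      ? `STALE SNAPSHOT: ${ageHours.toFixed(1)}h old`
      : `snapshot age ok: ${ageHours.toFixed(1)}h`,
  );

  // Secondary signal: verify a separate failure domain holds a copy.
  const { stdout } = await run("ssh", [PEER, "ls", "-1", SNAPSHOT_DIR]);
  console.log(stdout.trim().length > 0 ? "peer copy present" : "NO PEER COPY FOUND");
}

checkBackups().catch((err) => { console.error(err); process.exit(1); });
```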