ADR-0020: Observability Strategy

Status

Accepted

Date

2025-11-28

Context

Anchorpipe recently added rate limiting, idempotency, and cron cleanup. These features require telemetry to confirm their effectiveness and to diagnose incidents.
Existing logging was inconsistent, and metrics covered only ingestion latency.
CI/CD requires visibility into health endpoints, Redis usage, and queue throughput before promoting releases.

Decision

Adopt a three-pillar observability plan covering logging, metrics, and tracing/telemetry:

Logging
- All server modules use the shared logger (@/lib/server/logger), which injects timestamps and request IDs.
- Structured objects are logged instead of concatenated strings to ease parsing (e.g., logger.info('SIEM forwarding completed', { total, success, failed })).
- Security-sensitive logs (HMAC auth) redact secrets but include repo IDs for auditing.
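The logging rules above can be sketched as follows. This is a minimal illustration, not the actual @/lib/server/logger implementation: the function names formatLogEntry and redact, the list of redacted keys, and the output field names are all assumptions made for the example.

```typescript
// Hypothetical sketch of structured, redacted log output.
// None of these names come from the real shared logger.
type LogFields = Record<string, unknown>;

// Assumed list of keys whose values must never reach the log sink.
const REDACTED_KEYS = new Set(["secret", "signature", "hmac", "authorization"]);

// Replace the values of sensitive keys; keep everything else (e.g. repoId) intact.
function redact(fields: LogFields): LogFields {
  const out: LogFields = {};
  for (const [key, value] of Object.entries(fields)) {
    out[key] = REDACTED_KEYS.has(key.toLowerCase()) ? "[REDACTED]" : value;
  }
  return out;
}

// Build one JSON log line. Reserved keys are written last so that
// caller-supplied fields cannot overwrite the timestamp or request ID.
function formatLogEntry(
  level: "info" | "warn" | "error",
  message: string,
  fields: LogFields,
  requestId: string,
  now: Date = new Date(),
): string {
  return JSON.stringify({
    ...redact(fields),
    timestamp: now.toISOString(),
    level,
    requestId,
    message,
  });
}

// Example: an HMAC auth failure keeps the repo ID but drops the signature.
const exampleLine = formatLogEntry(
  "warn",
  "HMAC auth failed",
  { repoId: "repo-1", signature: "abc123" },
  "req-1",
);
```

Emitting a single JSON object per line keeps the output parseable by log shippers, and centralising redaction in one helper avoids each module deciding for itself what counts as a secret.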
Metrics
- HTTP metrics captured via httpRequestDurationMs (Prometheus histogram) for /api/status, ingestion, and future critical routes.
- The rate-limit middleware exposes X-RateLimit-* headers to clients, while Redis counters feed dashboards.
- Background jobs (idempotency cleanup) log counts so they can be scraped via log-based metrics.
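As an illustration of the rate-limit headers mentioned above, the sketch below derives a header set from a counter snapshot. The ADR only specifies the X-RateLimit-* prefix, so the three header names used here (Limit, Remaining, Reset) follow a common convention and are an assumption, as are the RateLimitState shape and the rateLimitHeaders function name.

```typescript
// Hypothetical snapshot of a rate-limit counter (e.g. as read from Redis).
interface RateLimitState {
  limit: number;               // maximum requests allowed in the window
  used: number;                // requests counted so far in the window
  resetAtEpochSeconds: number; // when the window resets (Unix seconds)
}

// Derive response headers from the counter snapshot.
// Header names are assumed; the ADR only says "X-RateLimit-*".
function rateLimitHeaders(state: RateLimitState): Record<string, string> {
  // Never report a negative remaining count if the counter overshoots.
  const remaining = Math.max(0, state.limit - state.used);
  return {
    "X-RateLimit-Limit": String(state.limit),
    "X-RateLimit-Remaining": String(remaining),
    "X-RateLimit-Reset": String(state.resetAtEpochSeconds),
  };
}

// Example: 40 of 100 requests used in the current window.
const exampleHeaders = rateLimitHeaders({
  limit: 100,
  used: 40,
  resetAtEpochSeconds: 1_764_288_000,
});
```

Deriving headers and dashboard values from the same counter snapshot means clients and operators see consistent numbers.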
Tracing / Telemetry
- recordTelemetry is used for high-value events (api.status, ingestion flows). Each event stores request IDs and durations for correlation.
- Distributed tracing is not yet deployed, but this ADR mandates propagating x-request-id so that later adoption of OpenTelemetry (OTEL) is straightforward.
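Propagating x-request-id could look roughly like the sketch below: reuse an incoming ID when one is present, otherwise mint one, and attach it to outgoing calls. The helper names (resolveRequestId, withRequestId, fallbackRequestId) and the plain-object header representation are assumptions for illustration. The fallback generator is deliberately simple, and real code would more likely use a UUID.

```typescript
const REQUEST_ID_HEADER = "x-request-id";

// Simple fallback ID generator (assumption); not cryptographically strong.
function fallbackRequestId(): string {
  const time = Date.now().toString(36);
  const rand = Math.random().toString(36).slice(2, 10);
  return `req-${time}-${rand}`;
}

// Reuse the caller's request ID if it sent a non-empty one, otherwise generate one.
function resolveRequestId(
  incoming: Record<string, string | undefined>,
  generate: () => string = fallbackRequestId,
): string {
  // Header names are case-insensitive, so compare in lower case.
  for (const [name, value] of Object.entries(incoming)) {
    if (name.toLowerCase() === REQUEST_ID_HEADER && value && value.trim() !== "") {
      return value.trim();
    }
  }
  return generate();
}

// Copy the outgoing headers and attach the request ID for downstream services.
function withRequestId(
  outgoing: Record<string, string>,
  requestId: string,
): Record<string, string> {
  return { ...outgoing, [REQUEST_ID_HEADER]: requestId };
}

// Example: an inbound request already carries an ID, so it is reused downstream.
const exampleId = resolveRequestId({ "X-Request-Id": "abc-123" });
const exampleOutgoing = withRequestId({ accept: "application/json" }, exampleId);
```

Because the same ID travels in logs, telemetry events, and outbound headers, it can later serve as a correlation key alongside trace context when distributed tracing is introduced.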

Alternatives Considered

Rely solely on third-party APM: rejected due to cost and the need for on-prem/self-hosting flexibility.
Custom logging per module: rejected because it reintroduces inconsistency.

Positive Consequences

Operators can correlate rate limit responses, ingestion outcomes, and cron jobs via shared request IDs.
Metrics feed into alerts (e.g., ingestion latency spikes, failed SIEM forwarding) without per-team instrumentation.
Documentation for rate limiting, idempotency, and testing now references observability hooks so new contributors follow the pattern.

Negative Consequences

Additional maintenance is needed to ensure new modules emit metrics.
Logging and metrics add a slight performance overhead, mitigated by batching and async logging.

Follow-up Actions

Ensure the middleware, ingestion route, SIEM forwarder, and GitHub App service log structured objects with requestId.
Expand Prometheus metrics in future milestones to include queue depth and Redis latency.
Track these decisions in the new architecture guides so docs, tests, and ADRs stay aligned.