Observability
Quidnug exposes rich operational telemetry at /metrics and health at
/api/health. First-party dashboards and alert rules ship in-repo.
Prometheus metrics
Hit GET /metrics for the full exposition. Key metric families:
| Metric | Type | Meaning |
|---|---|---|
| quidnug_tx_accepted_total{type} | counter | Accepted transactions by type. |
| quidnug_tx_rejected_total{type,code} | counter | Rejections by type and error code. |
| quidnug_block_tier_total{tier} | counter | Blocks classified per PoT tier. |
| quidnug_trust_query_depth | histogram | Depth of successful trust queries. |
| quidnug_trust_query_duration_seconds | histogram | End-to-end query latency. |
| quidnug_gossip_outbound_total{kind} | counter | Messages pushed per gossip kind. |
| quidnug_gossip_inbound_total{kind,src} | counter | Messages received by kind and source. |
| quidnug_guardian_recovery_inflight | gauge | Recoveries currently inside a time-lock window. |
| quidnug_epoch_rotations_total{quid} | counter | Rotations observed per subject. |
| quidnug_nonce_ledger_size | gauge | Size of the nonce ledger (per-signer highest). |
| quidnug_http_request_duration_seconds | histogram | API latency per handler. |
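
As an illustration of working with the exposition directly, the sketch below parses Prometheus text format and sums quidnug_tx_rejected_total by its code label. It is a minimal parser written for this example, not an official client, and the sample text is made up: the type and code label values (CODE_A, CODE_B, and so on) are placeholders, not real Quidnug output.

```python
import re
from collections import defaultdict

# Matches sample lines such as: name{label="value",...} 123
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (name, labels, value) for each sample line, skipping comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or a # HELP / # TYPE comment
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a sample line this simple parser understands
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        yield name, labels, float(value)


def rejections_by_code(text):
    """Sum quidnug_tx_rejected_total across all types, grouped by code."""
    totals = defaultdict(float)
    for name, labels, value in parse_samples(text):
        if name == "quidnug_tx_rejected_total":
            totals[labels.get("code", "")] += value
    return dict(totals)


# Made-up exposition text; label values are placeholders for illustration.
SAMPLE = """\
# TYPE quidnug_tx_rejected_total counter
quidnug_tx_rejected_total{type="type_a",code="CODE_A"} 7
quidnug_tx_rejected_total{type="type_b",code="CODE_A"} 3
quidnug_tx_rejected_total{type="type_a",code="CODE_B"} 1
quidnug_tx_accepted_total{type="type_a"} 90
"""

# Print codes from most to least frequent.
for code, count in sorted(rejections_by_code(SAMPLE).items(), key=lambda kv: -kv[1]):
    print(code, count)
```

Counters are cumulative since process start, so for a rate you would compare two scrapes taken some interval apart rather than read a single value.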
Grafana dashboard
Import deploy/observability/grafana-dashboard.json as-is. It covers the
full quidnug_* family with rows for:
- Traffic: accepted vs rejected, per type.
- Trust queries: depth, latency, cache hit rate.
- Consensus: tier distribution, orphaned blocks.
- Gossip: outbound push, inbound receive, peer health.
- Guardians: active quorums, recoveries in flight, vetoes.
- Resources: Go runtime, HTTP p50/p95/p99 by handler.
Prometheus alert rules
deploy/observability/prometheus-alerts.yml ships production-ready rules:
- TxRejectRate: rejection rate above X% over 5 minutes.
- TrustQueryP99Slow: p99 trust query latency above N seconds.
- PeerUnhealthy: fewer than K healthy peers.
- GuardianRecoveryStuck: recovery held in time-lock for more than 1.5 × the window.
- NodeStarved: rate limiter rejecting a non-trivial fraction of legitimate traffic.
Tune thresholds to your deployment; they ship with reasonable defaults for a three-node consortium.
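
To make a threshold like the one in TrustQueryP99Slow concrete, here is a rough sketch of estimating a quantile from cumulative histogram buckets by linear interpolation, similar in spirit to what Prometheus's histogram_quantile() does. The bucket bounds, counts, and the threshold value are all invented for the example; the shipped rule file defines the real expressions.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    The list must include the +Inf bucket. Values are assumed to be spread
    evenly inside each bucket, so the result is an approximation.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # fall back to the highest finite bound
            if count == prev_count:
                return bound  # empty bucket, nothing to interpolate
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Invented bucket data for a latency histogram, in seconds.
BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 98), (2.5, 100), (math.inf, 100)]

P99_THRESHOLD_SECONDS = 1.0  # hypothetical value standing in for N

p99 = histogram_quantile(0.99, BUCKETS)
print("p99 estimate:", p99, "alert:", p99 > P99_THRESHOLD_SECONDS)
```

The estimate is only as fine-grained as the bucket layout, so a threshold that sits between two widely spaced bucket bounds will trigger less precisely than one near a bound. In a real alert the bucket counts would normally be rates over a window rather than raw totals.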
Health probes
- GET /api/health: liveness + readiness. Returns details about the trust engine, nonce ledger, and peer connectivity.
- GET /api/info: version, feature flags, supported domains.
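
A probe against /api/health could be sketched as follows. This assumes that HTTP 200 means healthy and that the body is JSON; neither is confirmed here, so check the actual response before relying on it. The function names and the base URL in the usage note are hypothetical.

```python
import json
import urllib.error
import urllib.request


def interpret_health(status, body):
    """Turn a status code and body into (healthy, details).

    Assumption: only HTTP 200 counts as healthy. A non-JSON body is kept
    verbatim under the "raw" key instead of being dropped.
    """
    try:
        details = json.loads(body)
    except json.JSONDecodeError:
        details = {"raw": body}
    return status == 200, details


def probe_health(base_url, timeout=2.0):
    """Call GET /api/health on base_url and interpret the response."""
    url = base_url.rstrip("/") + "/api/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return interpret_health(resp.status, body)
    except urllib.error.HTTPError as err:
        # Non-2xx responses still carry a body worth inspecting.
        body = err.read().decode("utf-8", errors="replace")
        return interpret_health(err.code, body)
    except OSError as err:
        # Connection refused, DNS failure, timeout, and similar.
        return False, {"error": str(err)}
```

For example, `healthy, details = probe_health("http://localhost:8080")` would return False with an error entry if nothing is listening; the port here is a placeholder for wherever your node serves its API.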
Structured logs
Logs are JSON when LOG_LEVEL is not pretty. Ship them to any
log aggregator; the event and quidnug_tx_* fields are stable
across versions.
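
Because the event field is stable, a quick tally of log lines by event is often a useful first pass before reaching for an aggregator. The sketch below assumes one JSON object per line; the event names and the quidnug_tx_example field in the sample are invented for illustration.

```python
import json
from collections import Counter


def count_events(lines):
    """Count log records by their "event" field, ignoring non-JSON lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a stray non-JSON line from another process
        if isinstance(record, dict) and "event" in record:
            counts[str(record["event"])] += 1
    return counts


def tx_fields(record):
    """Return only the quidnug_tx_* fields of a parsed log record."""
    return {k: v for k, v in record.items() if k.startswith("quidnug_tx_")}


# Invented log lines; real event names and fields will differ.
SAMPLE_LINES = [
    '{"event": "example_event_a", "quidnug_tx_example": "placeholder"}',
    '{"event": "example_event_b"}',
    '{"event": "example_event_a"}',
    "not json at all",
]

for event, n in count_events(SAMPLE_LINES).most_common():
    print(event, n)
```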
When something goes wrong
Start at the metrics dashboard. The rejection counters broken down by
code are almost always where you’ll find the story; see
error codes and FAQ.