
Observability Code Design

Grasp the overall status of the AI Model Governance platform at a glance with 6 health indicators.

| Indicator | PromQL | Threshold (Green / Yellow / Red) |
| --- | --- | --- |
| Request Count | `sum(rate(application_usecase_command_requests_total[5m]))` | Normal range / ±50% fluctuation / ±80% fluctuation |
| Success Rate | `sum(rate(responses{response_status="success"}[5m])) / sum(rate(responses[5m])) * 100` | > 99.9% / > 99% / < 99% |
| P95 Latency | `histogram_quantile(0.95, sum(rate(duration_bucket[5m])) by (le))` | < 200ms / < 500ms / > 500ms |
| Error Rate | `sum(rate(responses{response_status="failure"}[5m])) / sum(rate(responses[5m])) * 100` | < 0.1% / < 1% / > 1% |
| Availability | `1 - (sum(rate(responses{error_type="exceptional"}[5m])) / sum(rate(responses[5m])))` | > 99.9% / > 99.5% / < 99.5% |
| Throughput | `sum(rate(responses{response_status="success"}[5m]))` | Stable relative to baseline / -20% / -50% |
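The Green/Yellow/Red bands above reduce to simple threshold checks. A minimal sketch for the Success Rate and P95 Latency rows, assuming the thresholds in the table (function names are illustrative, not part of the platform):

```python
def classify_success_rate(pct: float) -> str:
    """Success Rate row: > 99.9% is green, > 99% is yellow, otherwise red."""
    if pct > 99.9:
        return "green"
    if pct > 99.0:
        return "yellow"
    return "red"


def classify_p95_latency(seconds: float) -> str:
    """P95 Latency row: < 200ms is green, < 500ms is yellow, otherwise red."""
    if seconds < 0.2:
        return "green"
    if seconds < 0.5:
        return "yellow"
    return "red"
```

The same pattern extends to the remaining four indicators; the fluctuation-based rows (Request Count, Throughput) additionally need a baseline value to compare against.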
```
L1 Scorecard Dashboard
├── Row 1: Stat panels x 6 (health indicators)
├── Row 2: Time series graph (request count + error rate overlay)
├── Row 3: Time series graph (P95/P99 latency)
└── Row 4: Table (recent error top 10 by error.code)
```

The L2 dashboard drills down by request.layer x request.category.name x request.handler.name dimensions.

```promql
# Requests per second by handler
sum(rate(application_usecase_command_requests_total[5m])) by (request_handler_name)

# P95 latency by handler
histogram_quantile(0.95,
  sum(rate(application_usecase_command_duration_bucket[5m])) by (le, request_handler_name)
)

# Error rate by handler
sum(rate(application_usecase_command_responses_total{response_status="failure"}[5m])) by (request_handler_name)
  / sum(rate(application_usecase_command_responses_total[5m])) by (request_handler_name) * 100

# Distribution by error.type
sum(rate(application_usecase_command_responses_total{response_status="failure"}[5m])) by (error_type)

# P95 latency by repository
histogram_quantile(0.95,
  sum(rate(adapter_repository_duration_bucket[5m])) by (le, request_handler_name)
)

# Error rate by external service
sum(rate(adapter_external_service_responses_total{response_status="failure"}[5m])) by (request_handler_name)
  / sum(rate(adapter_external_service_responses_total[5m])) by (request_handler_name) * 100

# Published events by event type
sum(rate(adapter_event_requests_total[5m])) by (request_handler_name)

# Processing time by event handler
histogram_quantile(0.95,
  sum(rate(application_usecase_event_duration_bucket[5m])) by (le, request_handler_name)
)
```
```
L2 Drill-Down Dashboard
├── Variable: $layer (application/adapter), $category, $handler
├── Row 1: Selected handler request count + error rate
├── Row 2: P50/P95/P99 latency time series
├── Row 3: error.type distribution (expected vs exceptional)
├── Row 4: error.code top 10 table
└── Row 5: DomainEvent publish → Handler chain
```
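The three dashboard variables can be populated from the metric labels themselves. A hypothetical sketch using Grafana's `label_values()` variable query (the label names are inferred from the `request.*` naming convention used throughout this document; the surrounding dashboard JSON structure is omitted):

```yaml
# Hypothetical Grafana template-variable queries for the L2 dashboard.
# label_values(<label>) lists all values of a label across the Prometheus
# datasource; chained/filtered variants exist but are omitted here.
$layer:    label_values(request_layer)           # application / adapter
$category: label_values(request_category_name)
$handler:  label_values(request_handler_name)
```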

P0 — Critical (Respond Immediately)

| Condition | PromQL | Action |
| --- | --- | --- |
| error.type=exceptional spike | `rate(responses{error_type="exceptional"}[5m]) > 0.01` | System error: check infrastructure, review logs |
| Overall error rate > 10% | `rate(responses{response_status="failure"}[5m]) / rate(responses[5m]) > 0.1` | Declare an incident immediately |
| External service timeout cascade | Multiple external services show simultaneous error.type=exceptional | Check for a dependency service outage |
```yaml
groups:
  - name: ai-governance-p0
    rules:
      - alert: ExceptionalErrorRateHigh
        expr: |
          sum(rate(application_usecase_command_responses_total{error_type="exceptional"}[5m]))
            / sum(rate(application_usecase_command_responses_total[5m]))
          > 0.01
        for: 2m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "System error rate exceeded 1%"
          description: "error.type=exceptional ratio: {{ $value | humanizePercentage }}"
```

P1 — Warning (Respond Within 15 Minutes)

| Condition | PromQL | Action |
| --- | --- | --- |
| Key handler P95 > 1s | `histogram_quantile(0.95, duration_bucket{request_handler_name="..."}) > 1` | Analyze slow queries and external API latency |
| Error rate > 5% | `rate(responses{response_status="failure"}[5m]) / rate(responses[5m]) > 0.05` | Classify by error.code and identify root causes |
| EventHandler processing delay | `histogram_quantile(0.95, application_usecase_event_duration_bucket) > 5` | Analyze the event handler bottleneck |
```yaml
- alert: HandlerLatencyHigh
  expr: |
    histogram_quantile(0.95,
      sum(rate(application_usecase_command_duration_bucket[5m])) by (le, request_handler_name)
    ) > 1.0
  for: 5m
  labels:
    severity: warning
    team: backend
  annotations:
    summary: "{{ $labels.request_handler_name }} P95 latency exceeded 1 second"
```

P2 — Info (Review During Business Hours)

| Condition | PromQL | Action |
| --- | --- | --- |
| P95 > 500ms | `histogram_quantile(0.95, duration_bucket) > 0.5` | Check performance trends, register in the backlog |
| New error.code appears | Previously unseen error.code emerges | Analyze the new error path |
| Traffic pattern change | Request count ±50% relative to baseline | Review capacity planning |
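The P2 conditions can be expressed as alert rules in the same style as the P0/P1 examples. A hypothetical sketch for the first row (the alert name and labels are illustrative, not part of the platform's actual rule set):

```yaml
# Hypothetical P2 rule mirroring the P0/P1 rules above.
- alert: HandlerLatencyElevated
  expr: |
    histogram_quantile(0.95,
      sum(rate(application_usecase_command_duration_bucket[5m])) by (le, request_handler_name)
    ) > 0.5
  for: 15m
  labels:
    severity: info
    team: backend
  annotations:
    summary: "{{ $labels.request_handler_name }} P95 latency exceeded 500ms"
```

A longer `for:` window than the P0/P1 rules keeps these informational alerts from firing on short-lived blips.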

```csharp
public sealed class CreateDeploymentCommand
{
    public sealed record Request(
        string ModelId,                                    // Default(L+T): Unbounded ID
        string EndpointUrl,                                // Default(L+T): URL
        [CtxTarget(CtxPillar.All)] string Environment,     // All(L+T+MetricsTag): Bounded (2 values)
        [CtxTarget(CtxPillar.Default | CtxPillar.MetricsValue)]
        decimal DriftThreshold                             // Default + MetricsValue: numeric
    ) : ICommandRequest<Response>;
}

public sealed record ReportedEvent(
    ModelIncidentId IncidentId,                            // Default(L+T)
    [CtxTarget(CtxPillar.All)] IncidentSeverity Severity,  // All(L+T+MetricsTag): Bounded (4 values)
    ModelDeploymentId DeploymentId                         // Default(L+T)
) : DomainEvent;
```

Generated ctx.* Field Mapping Example (CreateDeploymentCommand)

| ctx Field | Logging | Tracing | MetricsTag | MetricsValue |
| --- | --- | --- | --- | --- |
| ctx.create_deployment_command.request.model_id | O | O | - | - |
| ctx.create_deployment_command.request.endpoint_url | O | O | - | - |
| ctx.create_deployment_command.request.environment | O | O | O | - |
| ctx.create_deployment_command.request.drift_threshold | O | O | - | O |
```promql
# Error rate by deployment environment (Staging vs Production)
sum(rate(application_usecase_command_responses_total{response_status="failure"}[5m]))
  by (ctx_create_deployment_command_request_environment)
  / sum(rate(application_usecase_command_responses_total[5m]))
  by (ctx_create_deployment_command_request_environment) * 100

# Auto-isolation latency by incident severity
histogram_quantile(0.95,
  sum(rate(application_usecase_event_duration_bucket[5m]))
    by (le, ctx_reported_event_severity)
)
```
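The `ctx_*` label names in these queries are the ctx.* field names from the mapping table with dots replaced by underscores, since Prometheus labels cannot contain dots. A small sketch of that mapping (the function name is illustrative):

```python
def ctx_field_to_prom_label(field: str) -> str:
    """Map a ctx.* field name to its Prometheus label name by
    replacing dots with underscores."""
    return field.replace(".", "_")
```

For example, `ctx.create_deployment_command.request.environment` becomes the `ctx_create_deployment_command_request_environment` label used above.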

When an alert is triggered, analyze the root cause in the following order:

  1. Check L1 Scorecard — Assess overall health status
  2. L2 Drill-Down — Identify the problematic handler by request.handler.name
  3. Distributed Tracing — Search for slow traces of the handler, analyze span chain
  4. ctx.* Segments — Check whether failures are concentrated in a specific environment (Staging/Production) or severity
  5. Detailed Logs — Identify root cause via error.code and @error fields
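Step 2 can also be scripted against the Prometheus HTTP API rather than the dashboard. A minimal sketch that submits the per-handler error-rate query from the L2 section to the standard `/api/v1/query` instant-query endpoint (the Prometheus base URL is hypothetical):

```python
from urllib.parse import urlencode

# Per-handler error rate, as defined in the L2 drill-down section above.
ERROR_RATE_BY_HANDLER = (
    'sum(rate(application_usecase_command_responses_total{response_status="failure"}[5m]))'
    " by (request_handler_name)"
    " / sum(rate(application_usecase_command_responses_total[5m]))"
    " by (request_handler_name) * 100"
)


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus instant-query URL for the given PromQL expression."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical Prometheus endpoint:
url = instant_query_url("http://prometheus:9090", ERROR_RATE_BY_HANDLER)
```

The JSON response can then be sorted by value to surface the worst handler before moving on to trace and log analysis.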

Check the actual Observable Port status and pipeline configuration in the Implementation Results.