
Observability Develop

project-spec -> architecture-design -> domain-develop -> application-develop -> adapter-develop -> observability-develop -> test-develop

Performed after Observable Port and CtxEnricher are implemented in the adapter-develop skill. Assumes Functorium’s 3-Pillar (Logging/Metrics/Tracing) pipeline is registered in DI.

The Functorium framework is strong at observability collection. [GenerateObservablePort] automatically provides Logging/Metrics/Tracing to all adapters, and CtxEnricher simultaneously propagates business context to the 3-Pillar.

However, collection alone is not enough. Without a strategy for how to analyze collected data, how to determine which metrics are healthy, and how to act when problems occur — dashboards become “graphs you just look at.”

The observability-develop skill bridges this gap: instrument -> analyze -> alert -> act.

| Phase | Activity | Deliverable |
| --- | --- | --- |
| 1. Observability Strategy | KPI-to-metric mapping, baseline setting, ctx.* propagation strategy | Observability strategy document |
| 2. Dashboard Design | L1 scorecard, L2 drilldown, DomainEvent tracking | Dashboard layout |
| 3. Alert Design | P0/P1/P2 classification, thresholds, alert hygiene | Alert rules document |
| 4. Analysis + Action | Distributed tracing diagnosis, hypothesis-experiment, review templates | Analysis procedure document |
The skill is organized around five activities:

  • Design observability
  • Design the dashboard
  • Analyze metrics
  • Set up alerts
  • Analyze performance

The observability strategy maps business performance indicators to Functorium observation fields:

| Business KPI | Technical Metric | Functorium Field |
| --- | --- | --- |
| User response time | P95 latency | response.elapsed (Histogram) |
| Service availability | Error rate | response.status + error.type |
| Per-feature usage | Request count | request.handler.name (Counter) |
| Payment success rate | Success/failure ratio | response.status by request.handler.name |
| Metric | Command Baseline | Query Baseline | External API Baseline |
| --- | --- | --- | --- |
| P95 Latency | < 200ms | < 50ms | < 1000ms |
| Error Rate | < 0.1% | < 0.1% | < 1% |
| Throughput | > 100 RPS | > 500 RPS | - |
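The baselines above can be encoded as data and checked against observed samples. A minimal sketch, assuming hypothetical metric names and a simple dict layout (none of this is a Functorium API):

```python
# Baseline thresholds from the table above, keyed by request category.
# All names here are illustrative, not part of Functorium.
BASELINES = {
    "command":      {"p95_latency_ms": 200,  "error_rate": 0.001, "min_rps": 100},
    "query":        {"p95_latency_ms": 50,   "error_rate": 0.001, "min_rps": 500},
    "external_api": {"p95_latency_ms": 1000, "error_rate": 0.01,  "min_rps": None},
}

def check_baseline(category: str, p95_ms: float, error_rate: float, rps: float) -> list[str]:
    """Return a list of baseline violations for one observed sample."""
    b = BASELINES[category]
    violations = []
    if p95_ms > b["p95_latency_ms"]:
        violations.append(f"P95 {p95_ms}ms exceeds {b['p95_latency_ms']}ms")
    if error_rate > b["error_rate"]:
        violations.append(f"error rate {error_rate:.3%} exceeds {b['error_rate']:.1%}")
    if b["min_rps"] is not None and rps < b["min_rps"]:
        violations.append(f"throughput {rps} RPS below {b['min_rps']} RPS")
    return violations

# A query handler at 80ms P95 violates its 50ms baseline:
print(check_baseline("query", 80, 0.0005, 600))
```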
| Ctx Pillar | Purpose | Example Fields | Cardinality |
| --- | --- | --- | --- |
| Logging only | Debug/detailed data | Request body, parameter details | Unlimited |
| Logging + Tracing (Default) | Identifiers, tracing context | customer_id, order_id | High |
| All (+ MetricsTag) | Segment analysis | customer_tier, region | Must be low |
| MetricsValue | Numeric recording | order_total_amount | - |

Cardinality Rule: Only promote fields with a bounded set of unique values to MetricsTag (customer_tier with 3-5 values is fine; customer_id with millions of values is forbidden).
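One way to enforce the rule is an explicit allowlist between ctx.* fields and metrics tags. A hypothetical sketch (the field names and helper are illustrative, not a Functorium API):

```python
# Only pre-approved low-cardinality fields may be promoted to MetricsTag;
# everything else stays in logs/traces. Names are illustrative.
ALLOWED_METRICS_TAGS = {"customer_tier", "region"}

def to_metrics_tags(ctx: dict) -> dict:
    """Keep only allowlisted fields, preventing unbounded metric series."""
    return {k: v for k, v in ctx.items() if k in ALLOWED_METRICS_TAGS}

ctx = {"customer_id": "c-93814", "customer_tier": "gold", "region": "eu-west"}
print(to_metrics_tags(ctx))  # customer_id is dropped before reaching metrics
```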

| Indicator | PromQL Example | Status |
| --- | --- | --- |
| Request Count | `rate(usecase_request_total[5m])` | Throughput trend |
| Success Rate | `1 - (error_total / request_total)` | 99.9% or higher |
| P95 Latency | `histogram_quantile(0.95, duration_bucket)` | < 200ms |
| Error Rate | `rate(error_total[5m]) / rate(request_total[5m])` | < 0.1% |
| Exceptional Errors | `rate(error_total{error_type="exceptional"}[5m])` | Converge to 0 |
| DomainEvent Throughput | `rate(event_publish_total[5m])` | Trend check |

To identify bottlenecks, decompose latency along three dimensions: request.layer x request.category.name x request.handler.name.

| Priority | Condition | Example | Response |
| --- | --- | --- | --- |
| P0 (Immediate) | error.type = "exceptional" spike | DB connection failure, external API timeout | On-call page |
| P1 (1 hour) | P95 > 1s or error rate > 5% | Specific handler performance degradation | Slack alert |
| P2 (Daily) | P95 > 500ms or new error code | Gradual performance degradation | Dashboard review |
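The P0 and P1 rows could be expressed as Prometheus alerting rules. A hedged sketch: the expressions follow the document's instrument-naming convention and the error_total series shown above, but the exact exported metric and label names are assumptions about the deployment:

```yaml
groups:
  - name: application-usecase-alerts
    rules:
      # P0: exceptional errors page on-call immediately
      - alert: ExceptionalErrorSpike
        expr: rate(error_total{error_type="exceptional"}[5m]) > 0
        for: 2m
        labels:
          priority: P0
      # P1: sustained P95 latency over 1s routes to Slack
      - alert: UsecaseP95LatencyHigh
        expr: |
          histogram_quantile(0.95,
            rate(application_usecase_command_duration_bucket[5m])) > 1
        for: 10m
        labels:
          priority: P1
```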

When a problem signal is detected, diagnose the cause with distributed tracing:

  1. Signal detection — Identify anomaly from dashboard/alert
  2. Trace query — Search request.handler.name = "X" AND duration > threshold
  3. Span analysis — Check which child span is consuming time
  4. Hypothesis — DB N+1? Cache miss? External API delay?
  5. Experiment — Apply improvement and compare against baseline

Fields automatically collected by the Functorium Source Generator.

| Field | Description | Example |
| --- | --- | --- |
| request.layer | Architecture layer | "application", "adapter" |
| request.category.name | Request category | "usecase", "repository", "event" |
| request.category.type | CQRS type | "command", "query", "event" |
| request.handler.name | Handler class name | "CreateProductCommand" |
| request.handler.method | Handler method name | "Handle", "GetById" |
| response.status | Response status | "success", "failure" |
| response.elapsed | Processing time (seconds) | Recorded as Histogram instrument |
| error.type | Classification | Description | Alert Response |
| --- | --- | --- | --- |
| expected | Business error | Domain rule violation, validation failure | Monitor only (normal flow) |
| exceptional | System error | DB connection failure, external API timeout | P0/P1 alert (immediate response) |
| aggregate | Composite error | Multiple validation failure accumulation | Monitor (Apply pattern result) |

error.code is a domain-specific error code. E.g.: "ProductName.Required", "Order.InvalidTransition".
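The two fields combine naturally in alert routing: error.type decides whether to page, while the first segment of error.code gives a grouping key. A hypothetical sketch (the routing function is illustrative, not a Functorium API):

```python
# Route errors per the taxonomy above: only "exceptional" pages on-call.
PAGE_TYPES = {"exceptional"}

def route_error(error_type: str, error_code: str) -> str:
    """Return an alert route plus a grouping key derived from error.code."""
    subject = error_code.split(".", 1)[0]   # "ProductName.Required" -> "ProductName"
    action = "page" if error_type in PAGE_TYPES else "monitor"
    return f"{action}:{subject}"

print(route_error("exceptional", "Order.InvalidTransition"))  # page:Order
print(route_error("expected", "ProductName.Required"))        # monitor:ProductName
```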

| Component | Pattern | Example |
| --- | --- | --- |
| Meter Name | {service.namespace}.{layer}[.{category}] | AiGovernance.application.usecase |
| Instrument Name | {layer}.{category}[.{cqrs}].{type} | application.usecase.command.duration |

Uses dot separation, lowercase, and plural forms.

  • Collection is just the beginning — Design the full cycle: instrument -> analyze -> alert -> act
  • Start from business KPIs — Don’t just look at technical metrics; translate to business impact
  • Cardinality management — Forbid high-cardinality fields in MetricsTag (prevent unbounded series)
  • Alerts must be actionable — If you can’t answer “What should I do when I receive this alert?”, remove it