
Observability Type Design Decisions

Maps the SLOs and KPIs defined in business requirements to Functorium metrics.

| Business KPI | Functorium Metric | Field/Tag |
|---|---|---|
| Model registration completion rate | application.usecase.command.responses | request.handler.name=RegisterModelCommand, response.status |
| Incident auto-quarantine response time | application.usecase.event.duration | request.handler.name=QuarantineDeploymentOnCriticalIncidentHandler |
| Compliance assessment pass rate | application.usecase.command.responses | request.handler.name=InitiateAssessmentCommand, response.status |
| External service stability | adapter.external_service.responses | response.status, error.type |
| Overall error rate | application.usecase.command.responses | response.status=failure |
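As a sketch of how one KPI row translates into a query, the model registration completion rate could be computed like this, assuming the Prometheus exporter renders metric dots as underscores and appends `_total` to counters, and assuming `response.status` uses a `success` value alongside `failure`:

```promql
# Hypothetical: RegisterModelCommand completion rate over 5 minutes.
sum(rate(application_usecase_command_responses_total{request_handler_name="RegisterModelCommand", response_status="success"}[5m]))
/
sum(rate(application_usecase_command_responses_total{request_handler_name="RegisterModelCommand"}[5m]))
```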

| Metric | Instrument | Type | Description |
|---|---|---|---|
| UseCase request count | application.usecase.{cqrs}.requests | Counter | Command/Query/Event request count |
| UseCase processing time | application.usecase.{cqrs}.duration | Histogram | Command/Query/Event processing time (seconds) |
| UseCase error count | application.usecase.{cqrs}.responses | Counter | response.status=failure filter |

| Metric | Instrument | Type | Description |
|---|---|---|---|
| Repository call count | adapter.repository.requests | Counter | Repository method call count |
| Repository processing time | adapter.repository.duration | Histogram | Repository processing time (seconds) |
| External Service call count | adapter.external_service.requests | Counter | External service call count |
| External Service processing time | adapter.external_service.duration | Histogram | External service processing time (seconds) |

| Metric | Instrument | Type | Description |
|---|---|---|---|
| Event publication count | adapter.event.requests | Counter | DomainEvent publication count |
| Event handler processing time | application.usecase.event.duration | Histogram | EventHandler processing time (seconds) |

| Category | Metric | Target | PromQL |
|---|---|---|---|
| Command UseCase | P95 latency | < 200ms | `histogram_quantile(0.95, sum(rate(application_usecase_command_duration_bucket[5m])) by (le))` |
| Query UseCase | P95 latency | < 50ms | `histogram_quantile(0.95, sum(rate(application_usecase_query_duration_bucket[5m])) by (le))` |
| Repository | P95 latency | < 100ms | `histogram_quantile(0.95, sum(rate(adapter_repository_duration_bucket[5m])) by (le))` |
| External Service | P95 latency | < 500ms | `histogram_quantile(0.95, sum(rate(adapter_external_service_duration_bucket[5m])) by (le))` |
| Overall error rate | Error ratio | < 0.1% | `sum(rate(application_usecase_command_responses_total{response_status="failure"}[5m])) / sum(rate(application_usecase_command_responses_total[5m]))` |
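The latency targets above can be turned into Prometheus alerting rules. The following is a sketch for the Command UseCase P95 SLO; the group name, alert name, and thresholds (`for`, `severity`) are illustrative, not part of this project:

```yaml
groups:
  - name: functorium-slo
    rules:
      - alert: CommandUseCaseP95LatencyHigh
        # P95 over 5-minute windows exceeds the 200ms target
        expr: |
          histogram_quantile(0.95,
            sum(rate(application_usecase_command_duration_bucket[5m])) by (le)
          ) > 0.2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Command UseCase P95 latency above the 200ms SLO"
```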

The ctx.* fields and CtxPillar propagation strategy used in this project.

| Property | CtxPillar | Rationale |
|---|---|---|
| Name (string) | Default (L+T) | Model name has unbounded cardinality -> MetricsTag prohibited |
| Version (string) | Default (L+T) | SemVer string -> for trace searching |
| Purpose (string) | Logging only | Long string (500 chars) -> for debug/audit |
| Property | CtxPillar | Rationale |
|---|---|---|
| ModelId (string) | Default (L+T) | Unbounded ID -> for trace searching |
| EndpointUrl (string) | Default (L+T) | URL -> for trace searching |
| Environment (string) | All (L+T+MetricsTag) | Bounded (2 values: Staging/Production) -> safe for segment analysis |
| DriftThreshold (decimal) | Default + MetricsValue | Numeric -> for Histogram distribution analysis |
| Property | CtxPillar | Rationale |
|---|---|---|
| DeploymentId (string) | Default (L+T) | Unbounded ID -> for trace searching |
| Severity (string) | All (L+T+MetricsTag) | Bounded (4 values: Critical/High/Medium/Low) -> safe for segment analysis |
| Description (string) | Logging only | Long string (2000 chars) -> for debug/audit |
| Cardinality Level | MetricsTag Allowed | Example (this project) |
|---|---|---|
| Fixed (bool) | Safe | |
| Bounded/Low (enum, < 20 values) | Conditionally allowed | Environment (2 values), Severity (4 values), RiskTier (4 values), DeploymentStatus (6 values) |
| Unbounded (string, Guid) | Prohibited | ModelId, DeploymentId, EndpointUrl |
| Numeric (decimal, int) | Warning | DriftThreshold -> use MetricsValue instead |
```
Is the property for debugging? (Purpose, Description)
├── YES -> Logging only: [CtxTarget(CtxPillar.Logging)]
└── NO  -> Needs trace searching?
    ├── NO  -> [CtxIgnore]
    └── YES -> Use as metric segment?
        ├── NO  -> Default (L+T)
        └── YES -> Is cardinality bounded?
            ├── YES -> [CtxTarget(CtxPillar.All)]
            └── NO  -> Is it numeric?
                ├── YES -> [CtxTarget(CtxPillar.Default | CtxPillar.MetricsValue)]
                └── NO  -> Keep Default (MetricsTag prohibited)
```
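The decision tree above can be sketched as attribute annotations on a command. `HypotheticalDeployModelCommand` below is illustrative only (it mixes properties from the tables on this page), and the exact `CtxTarget`/`CtxIgnore` attribute shapes are assumed from the names used in this document:

```csharp
// Sketch only: one property per branch of the decision tree.
public sealed record HypotheticalDeployModelCommand(
    // Unbounded ID -> keep Default (L+T): logged and traced, no MetricsTag
    string ModelId,

    // Bounded (Staging/Production) -> safe as a metric segment
    [property: CtxTarget(CtxPillar.All)]
    string Environment,

    // Numeric -> Histogram distribution analysis via MetricsValue
    [property: CtxTarget(CtxPillar.Default | CtxPillar.MetricsValue)]
    decimal DriftThreshold,

    // Long free text -> debug/audit only
    [property: CtxTarget(CtxPillar.Logging)]
    string Description
);
```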

In the next step, we materialize this metric design into dashboards, alerts, and code patterns in Code Design.