Intelligent alerting with MultiTool

Intelligent alerting detects production issues by comparing a service's current behavior to its recent baseline, rather than to thresholds you configure manually. When telemetry shows the service performing meaningfully worse than its baseline, MultiTool generates an alert.

This page explains how the model works, how it differs from threshold-based alerting, and what tradeoffs shape alert behavior.

How intelligent alerting works

MultiTool ingests OpenTelemetry data from your service and builds a statistical model of its recent behavior. That model is the service's baseline. When you deploy a new version, MultiTool begins building a second model from the new version's telemetry and compares the two.

The comparison asks whether the distribution of a given metric has shifted relative to the baseline. If the new version's distribution has shifted toward worse behavior (for example, a larger share of 5xx responses) and the shift is statistically significant, MultiTool fires an alert. This is a test for change in central tendency, not a check against any fixed value.
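To make the idea of a statistically significant shift concrete, here is a minimal sketch using a one-sided two-proportion z-test on error rates. This page does not specify which test MultiTool runs internally, so the choice of test, the function name `error_rate_shift`, and the `alpha` default are all assumptions made for illustration.

```python
import math


def error_rate_shift(baseline_errors, baseline_total,
                     candidate_errors, candidate_total, alpha=0.05):
    """One-sided two-proportion z-test: is the candidate's error rate higher?

    Illustrative only; not MultiTool's actual comparison.
    """
    p_base = baseline_errors / baseline_total
    p_cand = candidate_errors / candidate_total

    # Pooled error rate under the null hypothesis of "no change".
    pooled = (baseline_errors + candidate_errors) / (baseline_total + candidate_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / baseline_total + 1 / candidate_total))
    if se == 0:
        # No variation at all (all successes or all errors): nothing to test.
        return {"z": 0.0, "p_value": 1.0, "alert": False}

    z = (p_cand - p_base) / se
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return {"z": z, "p_value": p_value, "alert": p_value < alpha}


# Baseline: 50 errors in 10,000 requests. New version: 40 errors in 2,000.
result = error_rate_shift(50, 10_000, 40, 2_000)
print(result["alert"])
```

Note that nothing in the function refers to an acceptable error rate. The only inputs are the two sets of observations, and the result depends on how far apart they are relative to the amount of data behind them.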

The baseline is derived from the telemetry of your currently running version and is re-established when you deploy. Alerts are therefore scoped to the lifecycle of a deployment: the question intelligent alerting answers is "is this version worse than the one it replaced?"

Thresholds versus baseline comparison

Traditional alerting systems ask operators to define thresholds in advance. A team might configure an alert when average CPU utilization exceeds 90% for five minutes, or when 5xx errors exceed a known acceptable rate.

That approach gives teams control, but it creates ongoing maintenance work. Each service needs its own thresholds across several metrics, and those thresholds need to change as traffic patterns, infrastructure, and expectations evolve. Once a team owns more than a handful of services, tuning and retuning thresholds becomes a substantial amount of work.

The cost of getting thresholds wrong runs in both directions. Thresholds set too loosely produce false negatives: real incidents go undetected. Thresholds set too tightly produce false positives: operators are paged for non-events and, over time, learn to discount the alerting system. This effect, commonly called alert fatigue, is well documented and contributes to operator burnout.

Intelligent alerting avoids asking teams to define acceptable values up front. The service's own recent behavior becomes the reference point, and the comparison is always relative.
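The difference between the two approaches can be illustrated with a deliberately simplified sketch. Both rules below are hypothetical: the 1% threshold and the "twice the baseline" ratio are made-up values, and the relative rule leaves out the statistical significance check that a real baseline comparison would include.

```python
def threshold_alert(current_rate, threshold=0.01):
    """Fixed-threshold rule: alert when the error rate exceeds a preset value."""
    return current_rate > threshold


def baseline_alert(baseline_rate, current_rate, min_ratio=2.0):
    """Relative rule (hypothetical): alert when the rate is at least
    min_ratio times the service's own baseline rate."""
    if baseline_rate == 0:
        return current_rate > 0
    return current_rate / baseline_rate >= min_ratio


# Quiet service: errors rise from 0.1% to 0.8%. The fixed threshold misses it.
print(threshold_alert(0.008), baseline_alert(0.001, 0.008))  # False True

# Noisy service: errors sit near 3% as usual. The fixed threshold pages anyway.
print(threshold_alert(0.031), baseline_alert(0.03, 0.031))   # True False
```

The same fixed threshold is too loose for the first service and too tight for the second, which is the maintenance problem described above.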

Supported signals

Intelligent alerting currently supports baseline comparison on HTTP response status codes. A rise in 5xx responses typically indicates backend failure; a rise in 4xx responses may reflect client noise, a broken route, or drift between client and server behavior. For most services, status codes are the most direct signal of correctness, and they were the first signal we built intelligent alerting around.
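As a rough illustration of what a status code signal looks like before any comparison happens, the sketch below groups status codes into classes. It assumes each span is available as a plain dictionary of attributes and that the status code is stored under `http.response.status_code`, the attribute name used by the OpenTelemetry HTTP semantic conventions. The function name and input shape are hypothetical and do not describe MultiTool's ingestion pipeline.

```python
from collections import Counter


def status_class_counts(spans):
    """Count HTTP responses per status class (2xx, 4xx, 5xx, ...).

    `spans` is assumed to be an iterable of attribute dictionaries.
    Spans without a valid integer status code are skipped.
    """
    counts = Counter()
    for attributes in spans:
        code = attributes.get("http.response.status_code")
        if isinstance(code, int) and 100 <= code <= 599:
            counts[f"{code // 100}xx"] += 1
    return counts


spans = [
    {"http.response.status_code": 200},
    {"http.response.status_code": 200},
    {"http.response.status_code": 404},
    {"http.response.status_code": 503},
    {"rpc.system": "grpc"},  # not an HTTP span, ignored
]
print(status_class_counts(spans))
```

Counts like these, collected separately for the baseline and for the new version, are the kind of input a baseline comparison would work from.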

The following signals are on the near-term roadmap:

Latency and throughput. Latency reflects user wait time; throughput reflects how efficiently the service handles work.
CPU and memory. Resource saturation reduces headroom for traffic spikes and eventually contributes to latency and failed requests.
Supplementary signals. Disk space, IOPS, network bandwidth, and — for queue-backed services — queue depth and dead-letter counts.

Baseline comparison is designed to generalize across these signals. The same statistical approach that works for status codes applies to latency distributions, resource utilization, and queue metrics, so each new signal extends intelligent alerting rather than changing how it behaves.

What makes a change meaningful

Web services fluctuate from minute to minute, and small changes in telemetry are often noise. Intelligent alerting's job is to separate real regressions from that noise with controllable confidence.

Two questions determine whether an alert fires:

Is there enough data to make a reliable judgment? With very little data, almost any difference between the baseline and the new version could be explained by chance. With enough data, even small shifts can be identified with confidence.
Is the observed shift large enough to matter? A tiny movement in central tendency may be technically detectable but not worth paging anyone about.

These two questions correspond to the two ways an alerting system can be wrong: firing on noise (a false positive) and missing a real regression (a false negative). They map onto the tradeoffs you can tune.
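The two questions can be read as two gates that an observation must pass before an alert fires. The sketch below is one way to express that, under stated assumptions: the parameter names `min_samples`, `min_shift`, and `alpha` are hypothetical and are not MultiTool configuration keys, and the `p_value` is assumed to come from whatever significance test is in use.

```python
def should_alert(baseline_rate, candidate_rate, candidate_samples, p_value,
                 min_samples=500, min_shift=0.005, alpha=0.05):
    """Apply both gates: enough data, and a shift large enough to matter."""
    # Gate 1: with too little data, a difference could easily be chance.
    if candidate_samples < min_samples:
        return False
    # Gate 2: a detectable but tiny shift is not worth paging anyone about.
    if candidate_rate - baseline_rate < min_shift:
        return False
    # Only then does statistical significance decide.
    return p_value < alpha


# Large shift, but only 40 requests observed so far: hold off.
print(should_alert(0.005, 0.05, candidate_samples=40, p_value=0.01))

# Plenty of data and significant, but the shift is 0.1 percentage points.
print(should_alert(0.005, 0.006, candidate_samples=50_000, p_value=0.001))

# Enough data, meaningful shift, significant result.
print(should_alert(0.005, 0.02, candidate_samples=2_000, p_value=0.0001))
```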

Tuning speed, certainty, and sensitivity

Alerting always involves tradeoffs between how fast you learn about a problem, how certain you are that it's real, and how small a problem you want to catch. MultiTool exposes these tradeoffs as explicit configuration.

If you want to...	You need to...	The tradeoff is...
Reduce false positives	Wait for more telemetry before firing	Slower time to alert
Alert faster	Fire on less telemetry	More false positives
Catch smaller regressions	Detect smaller shifts in central tendency	More data required, or more false positives
Ignore minor noise	Only detect larger shifts	Slower or weaker detection of subtle regressions

These tradeoffs are inherent to any statistical detection system. What MultiTool provides is the ability to position your team on the curve rather than accept a default. A team running safety-critical infrastructure will tune differently from a team running a consumer-facing feature flag service, and intelligent alerting accommodates both.
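The arithmetic behind these tradeoffs can be sketched with a standard sample-size approximation for comparing two proportions. This is a textbook formula, not something taken from MultiTool, and the function name, parameters, and defaults are assumptions. It estimates how many requests per version are needed to detect a rise in error rate from `p_base` to `p_new` at a given false positive rate (`alpha`) and detection probability (`power`).

```python
import math
from statistics import NormalDist


def required_samples(p_base, p_new, alpha=0.05, power=0.8):
    """Approximate requests needed per version to detect p_base -> p_new
    with a one-sided test (normal approximation, equal group sizes)."""
    if p_new <= p_base:
        raise ValueError("p_new must be greater than p_base")
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # controls false positives
    z_power = NormalDist().inv_cdf(power)      # controls missed regressions
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p_new - p_base) ** 2)


# Doubling of a 1% error rate versus a much subtler rise to 1.2%.
print(required_samples(0.01, 0.02))
print(required_samples(0.01, 0.012))

# Demanding fewer false positives raises the amount of data needed.
print(required_samples(0.01, 0.02, alpha=0.01))
```

Because the required sample grows with the inverse square of the shift, halving the size of the regression you want to catch roughly quadruples the telemetry needed, which is why catching smaller regressions costs either time or false positives.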

How this relates to SLOs

Intelligent alerting and SLOs solve different problems.

SLOs define longer-horizon reliability goals — commitments about availability, latency percentiles, or error budgets measured over days or weeks. Intelligent alerting focuses on short-term behavioral changes, such as whether a new deployment is producing more errors than the version it replaced.

The two approaches work well together. SLOs track whether a service is meeting reliability expectations over time. Intelligent alerting catches when behavior begins to degrade in the moment.

Next steps

Quickstart guide — get a service reporting to MultiTool and see your first baseline
How MultiTool uses OpenTelemetry — what we ingest, how it's processed, and how baselines are built from it
Using the Claude plugin — investigate alerts and baseline shifts directly from your editor