diff --git a/docset.yml b/docset.yml index bc3520a11f..d38d85bdce 100644 --- a/docset.yml +++ b/docset.yml @@ -124,8 +124,8 @@ subs: ls-pipelines-app: "Logstash Pipelines" maint-windows-app: "Maintenance Windows" maint-windows-cap: "Maintenance windows" - alerting-v2: "experimental alerting features" - alerting-v2-cap: "Experimental alerting features" + alerting-v2-system: "experimental alerting system" + alerting-v2-system-cap: "Experimental alerting system" custom-roles-app: "Custom Roles" data-source: "data view" data-sources: "data views" diff --git a/explore-analyze/alerting.md b/explore-analyze/alerting.md index b0aa40b604..104e7e0822 100644 --- a/explore-analyze/alerting.md +++ b/explore-analyze/alerting.md @@ -8,28 +8,55 @@ applies_to: products: - id: kibana - id: cloud-serverless + - id: elasticsearch + - id: cloud-hosted +navigation_title: Alerting +description: Watch your data and respond to conditions automatically with Elastic alerting. Compare Kibana alerting, the experimental ES|QL-based alerting system, and Watcher to find the right fit. --- -# Alerting +# Alerting [alerting-overview] -Alerting tools in Elasticsearch and Kibana provide functionality to monitor data and notify you about significant changes or events in real time. This page provides an overview of how the key components work. +Elastic alerting helps you watch your data and respond when something needs attention, whether that is a metric crossing a limit, an asset leaving an area on a map, or an unusual pattern in your time series. You set the conditions and how people should be notified. Elastic runs the checks for you. -## Alerts +Elastic offers three alerting systems, summarized below. Each has a **Get started** link to the full guide for that option. If you're not sure which to use, refer to [Choose an alerting system](alerting/choose-an-alerting-system.md). -Alerts are notifications generated when specific conditions are met. 
These notifications are sent to you through channels that you previously set such as email, Slack, webhooks, PagerDuty, and so on. +## {{kib}} alerting -Alerts are created based on rules, which define the criteria for triggering them. Rules monitor the data indexed in Elasticsearch and evaluate conditions on a defined schedule to identify matches. For example, a threshold rule can generate an alert when a value crosses a specific threshold, while a machine learning rule activates an alert when an anomaly detection job identifies an anomaly. +```{applies_to} +stack: ga +serverless: ga +``` + +{{kib}} alerting gives you ready-made rule types that work with applications such as APM, metrics, security, and uptime monitoring. You set conditions on a schedule you choose and send notifications through common channels (email, chat apps, webhooks, on-call tools, and more). Setup uses forms and clear steps, so you do not need to learn a query language first. It is a strong fit when you want broad coverage out of the box. + +[Get started with {{kib}} alerting →](alerting/alerts.md) + +## {{alerting-v2-system-cap}} + +```{applies_to} +stack: experimental 9.5+ +serverless: experimental +``` + +The {{alerting-v2-system}} is built on {{esql}}. You write the query that defines what to watch for, choose how alert episodes are tracked per series, and control notifications through action policies that handle routing, frequency, and notification batching. The {{alerting-v2-system}} also adds alert episode lifecycle tracking, per-series snooze, and rules on alert episodes for correlation and escalation. It is a strong fit when you want full control over what data travels with each alert episode and how your team is notified. + +:::{note} +The {{alerting-v2-system}} runs next to {{kib}} alerting on {{serverless-full}} and {{stack}} 9.5 and later. You do not have to move everything at once. Teams can copy or rebuild rules when they are ready. {{kib}} alerting will remain available. 
+::: + +[Get started with the {{alerting-v2-system}} →](alerting/kibana-alerting-experimental.md) ## Watcher + ```{applies_to} +stack: ga serverless: unavailable ``` -You can use Watcher for alerting and monitoring specific conditions in your data. It enables you to define rules and take automated actions when certain criteria are met. Watcher is a powerful alerting tool for custom use cases and more complex alerting logic. It allows advanced scripting using Painless to define complex conditions and transformations. +Watcher is for unusual or highly tailored setups where you need scripts, chained steps, or close control over {{es}} APIs. It does not use the main {{kib}} rules UI used by {{kib}} alerting. It is available on the {{stack}} only, not in {{serverless-full}}. :::{tip} -For most use cases, you should use Kibana Alerts instead of Watcher. Kibana Alerts allows rich integrations across use cases like APM, metrics, security, and uptime. Prepackaged rule types simplify setup and hide the details of complex, domain-specific detections, while providing a consistent interface across Kibana. - -Watcher is not available in {{serverless-full}}. +For most teams, {{kib}} alerting or the {{alerting-v2-system}} is easier to adopt than Watcher: both work within {{kib}}'s rules UI and don't require writing {{es}} watch definitions. ::: +[Get started with Watcher →](alerting/watcher.md) diff --git a/explore-analyze/alerting/choose-an-alerting-system.md b/explore-analyze/alerting/choose-an-alerting-system.md new file mode 100644 index 0000000000..77ee7db540 --- /dev/null +++ b/explore-analyze/alerting/choose-an-alerting-system.md @@ -0,0 +1,38 @@ +--- +applies_to: + stack: ga + serverless: ga +products: + - id: kibana + - id: cloud-serverless + - id: elasticsearch + - id: cloud-hosted +description: Compare Kibana alerting, the experimental ES|QL-based alerting system, and Watcher by use case and deployment type to select the right tool for your monitoring needs. 
+--- + +# Choose an alerting system [choose-an-alerting-system] + +Elastic has three alerting systems. You only need one. Pick the one that fits how you want to define rules and route notifications. + +## Select by use case + +| Goal | Suggested system | +|---|---| +| Monitor metrics, logs, or uptime with ready-made rules and no query language | [{{kib}} alerting](alerts.md) | +| Use rules built for Security, Observability, APM, or Maps | [{{kib}} alerting](alerts.md) | +| Write {{esql}} to define exactly what to detect and what data each alert episode carries | [{{alerting-v2-system-cap}}](kibana-alerting-experimental.md) {applies_to}`serverless: experimental` {applies_to}`stack: experimental 9.5+` | +| Query alert history in Discover or build dashboards from alert data | [{{alerting-v2-system-cap}}](kibana-alerting-experimental.md) {applies_to}`serverless: experimental` {applies_to}`stack: experimental 9.5+` | +| Manage notification routing, grouping, and throttling in one place, reusable across rules | [{{alerting-v2-system-cap}}](kibana-alerting-experimental.md) {applies_to}`serverless: experimental` {applies_to}`stack: experimental 9.5+` | +| Build highly custom logic with scripting and chained inputs | [Watcher](watcher.md) {applies_to}`stack: ga` {applies_to}`serverless: unavailable` | + +## Compare at a glance + +| | {{kib}} alerting | {{alerting-v2-system-cap}} | Watcher | +|---|---|---|---| +| **Best for** | Teams using built-in rule types with form-based setup | Teams that need full control over detection and notification routing | Custom alerting logic requiring scripting | +| **Rule definition** | Select a rule type and fill in parameters | Write an {{esql}} query | Write a JSON watch definition | +| **Alert data** | In-place updates, limited query support | Append-only events queryable with {{esql}} in Discover | Watch history index | +| **Notifications** | Configured per action on each rule | Centralized action policies, reusable across rules | 
Action-level throttling and conditions | +| **Noise reduction** | Snooze per rule, maintenance windows | Per-episode acknowledge or deactivate, per-series snooze, maintenance windows, match condition routing in action policies | Action conditions and throttling | +| **Available on {{serverless-full}}** | Yes | Yes, {applies_to}`serverless: experimental` | No | +| **Available on {{stack}}** | Yes | Yes, {applies_to}`stack: experimental 9.5+` | Yes | diff --git a/explore-analyze/alerting/kibana-alerting-experimental.md b/explore-analyze/alerting/kibana-alerting-experimental.md new file mode 100644 index 0000000000..161f0cdbf6 --- /dev/null +++ b/explore-analyze/alerting/kibana-alerting-experimental.md @@ -0,0 +1,178 @@ +--- +applies_to: + stack: experimental 9.5+ + serverless: experimental +products: + - id: kibana + - id: cloud-serverless +description: The experimental Kibana alerting system uses ES|QL rules to detect conditions, track problems as alert episodes, and route notifications through reusable action policies. +--- + +# {{alerting-v2-system-cap}} overview [alerting-overview] + +The {{alerting-v2-system}} in {{kib}} is marked **Experimental** in the UI and is subject to change before general availability. It includes rules, alert episodes, action policies, workflows, and connectors. For the existing {{kib}} alerting system, refer to [{{kib}} alerting](alerts.md). + +The {{alerting-v2-system}} watches your {{es}} data continuously, so your team doesn't have to. You define the conditions that matter, such as when to open an issue, who should know, and how often to notify them. The system handles the rest. + +## The core idea [kibana-alerting-v2-overview] + +The {{alerting-v2-system}} separates *detecting* a problem from *acting* on it. Rules focus purely on what to watch for in your data. Action policies handle who gets notified, when, and how, independently of any rule. 
You can build and test detection logic before wiring up any notifications, and update notification routing across all rules in one place without editing the rules themselves. + +## The four building blocks + +The {{alerting-v2-system}} is built around four objects: rules, alert episodes, action policies, and workflows, each with a distinct role. + +### Rules + +A rule defines what to watch for in your data and how often to check. Every rule runs in one of two modes: Alert or Signal. + +- **Alert**: Opens an alert episode when the rule finds a match, keeps it open until the condition clears, and can notify your team or trigger automated actions when the state changes. Use this when you want the system to track an issue and tell someone about it. +- **Signal**: Records each match as a data point with no ongoing tracking and no notifications. Use this when you want to capture activity for later querying and investigation. + + + +### Alert episodes +In Alert mode, the rule opens one alert episode per problem and keeps it open until the condition clears. The alert episode moves through states (pending, active, recovering, inactive), giving you one lifecycle to triage rather than a separate item per rule check. You manage alert episodes on the **Alerts** page. + + +### Action policies +An action policy is the gating layer between an alert episode and a workflow. It decides whether and when to invoke a workflow by evaluating suppression, match conditions, and frequency. You can set conditions to filter which alert episodes it applies to, for example, only critical severity alert episodes from a specific service. Global action policies apply to alert episodes from any rule in the space. Per-rule action policies scope to a single rule for more targeted routing. + + +### Workflows +A workflow is what actually sends the message or runs the automation, for example, posting to Slack or sending an email. Action policies hand off to workflows for delivery. 
Without a workflow attached, no notification is sent. + + +## How the pieces fit together [how-pieces-fit-together] + + +What happens after a rule finds something depends entirely on the rule's mode. + +### Alert mode + +Use Alert mode when you want to track issues and be notified. The rule opens an alert episode when the condition is met and keeps it open until the condition clears. + +``` +Rule runs → the rule's conditions are met → writes a rule event + → alert episode (pending → active) + → [dispatcher] → action policy → workflow → notification + → condition clears (recovering → inactive) + → [dispatcher] → action policy → workflow → notification +``` + +1. The rule evaluates {{esql}} on a schedule and writes a rule event to `.rule-events`. +2. The rule event opens or updates an alert episode, which is tracked until the condition resolves. +3. The dispatcher runs on a short interval, independently of the rule schedule, and picks up active alert episodes. +4. For each active alert episode, the dispatcher evaluates all enabled action policies. Each policy runs the episode through a sequence of gates: suppression, match conditions, and frequency. +5. For policies where the episode clears all gates, the dispatcher invokes the configured workflows. +6. Workflows deliver the notification or run the automation (email, chat, webhook, and so on). + +#### Example: Rule runs in Alert mode + +An SRE team creates a rule in Alert mode that checks checkout service latency every five minutes. When p95 latency exceeds 2 seconds on two or more consecutive checks, the rule opens an alert episode. An action policy with a `rule.tags: "checkout"` matcher picks it up, skips low-severity alert episodes, and sends a Slack message through an on-call workflow. + +The on-call engineer gets one message, investigates, and fixes a slow query. Latency drops, and the alert episode recovers automatically. No dashboard watching required. 
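The gate sequence in steps 4 and 5 above can be pictured with a toy model. The sketch below is only an illustration of the described flow, not the dispatcher's implementation: the `Episode`, `ActionPolicy`, and `dispatch` names, the cooldown-based reading of "frequency", and the gate order are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple


@dataclass(frozen=True)
class Episode:
    """Hypothetical stand-in for an active alert episode."""
    episode_id: str
    severity: str
    tags: Tuple[str, ...] = ()


@dataclass
class ActionPolicy:
    """Hypothetical stand-in for an action policy and its three gates."""
    name: str
    workflow: str
    matches: Callable[[Episode], bool]                      # match-conditions gate
    cooldown_s: float = 0.0                                 # frequency gate (assumed: simple cooldown)
    suppressed_ids: Set[str] = field(default_factory=set)   # suppression gate
    last_sent: Dict[str, float] = field(default_factory=dict)

    def clears_gates(self, ep: Episode, now: float) -> bool:
        # Gate 1: suppression
        if ep.episode_id in self.suppressed_ids:
            return False
        # Gate 2: match conditions
        if not self.matches(ep):
            return False
        # Gate 3: frequency (skip if this episode was notified too recently)
        last = self.last_sent.get(ep.episode_id)
        if last is not None and now - last < self.cooldown_s:
            return False
        return True


def dispatch(episodes: List[Episode], policies: List[ActionPolicy],
             now: float) -> List[Tuple[str, str]]:
    """Evaluate every policy against every episode; return (workflow, episode_id) pairs."""
    invoked = []
    for ep in episodes:
        for policy in policies:
            if policy.clears_gates(ep, now):
                policy.last_sent[ep.episode_id] = now
                invoked.append((policy.workflow, ep.episode_id))
    return invoked


# Mirrors the checkout example: match on a tag, skip low-severity episodes.
on_call = ActionPolicy(
    name="checkout-on-call",
    workflow="slack-on-call",
    matches=lambda ep: "checkout" in ep.tags and ep.severity != "low",
    cooldown_s=300,
)
episodes = [
    Episode("ep-1", "critical", ("checkout",)),
    Episode("ep-2", "low", ("checkout",)),
]
print(dispatch(episodes, [on_call], now=0.0))  # [('slack-on-call', 'ep-1')]
print(dispatch(episodes, [on_call], now=5.0))  # [] (ep-1 is inside the cooldown)
```

Keeping the gates on the policy rather than on the rule reflects the separation the text describes: the same episode list can be routed differently by changing only the policy.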
+ +:::{image} ../images/rule-alert-mode-diagram.png +:alt: Diagram of Alert mode flow. A rule runs ES|QL on a schedule. When it finds a match, it writes a rule event tied to an ongoing alert episode. The alert episode moves through pending, active, recovering, and inactive states. An action policy matches eligible alert episodes and routes them to a workflow, which delivers a notification. +::: + +### Signal mode + +Use Signal mode when you want to record matches for querying and analysis without alerting anyone. The rule writes a signal and stops. An alert episode is not opened, and notifications are not sent. + +``` +Rule runs → the rule's conditions are met → writes a signal + → queryable in Discover + → no alert episode, no action policy, no notification +``` + +1. The rule evaluates {{esql}} on a schedule and writes a rule event (signal) to `.rule-events`. +2. The signal is available for querying in Discover, in dashboards, and with {{esql}}. +3. No alert episode is opened. No action policy is evaluated. No notification is sent. + +#### Example: Rule runs in Signal mode + +A security team wants to track when a rarely used admin API endpoint is called. Individual calls are not inherently suspicious, so they do not want a notification every time the rule fires. The pattern only becomes meaningful in context. They create a rule in Signal mode that checks for requests to the endpoint every hour. + +The rule writes a signal each time it finds a match. No alert episodes are opened, and no notifications go out. + +:::{image} ../images/rule-detect-mode-diagram.png +:alt: Diagram of Signal mode flow. A rule runs ES|QL on a schedule. When it finds a match, it writes a signal to .rule-events. The signal is available for querying in Discover, dashboards, and ES|QL. +::: + +Later, an on-call alert fires for unusual privilege escalation on the same host. 
During the investigation, the team queries `.rule-events` in Discover and finds that the admin API endpoint was called three times in the hour before the escalation. The Signal mode rule did not surface the incident. The Alert mode rule did. But without the signals already in the index, the admin API activity would have been invisible during the post-incident review. + + +## {{alerting-v2-system-cap}} glossary [key-concepts-glossary] + +These terms appear throughout the {{alerting-v2-system}} docs. If a term is unclear while reading, check its definition here before going further. + +**Action policy** +: The gating layer between an alert episode and a workflow. You configure suppression rules, match conditions to filter which alert episodes it applies to, frequency settings to control batching and cooldown, and which workflow should send the message. Global action policies apply to alert episodes from any rule in the space. Per-rule action policies scope to a single rule. + + +**Alert episode** +: The complete record of one problem tracked in Alert mode, from when it was first detected to when it recovered. An alert episode moves through states (pending, active, recovering, inactive) as the situation changes. This is what you view and manage on the **Alerts** page. + +**Breach** +: A single moment when a rule's query finds a match. One breach doesn't necessarily trigger a notification. You can configure a rule to require several consecutive breaches before it confirms the problem is real. + +**Dispatcher** +: The background process in {{kib}} that runs the notification pipeline. On a short interval (around 5 seconds), it evaluates each enabled action policy against active alert episodes and invokes the workflows that send notifications. The dispatcher runs on its own cadence, separate from the rule schedule. + + +**{{esql}}** +: The query language every rule uses to search your data. To learn more, refer to the [{{esql}} reference](elasticsearch://reference/query-languages/esql.md). 
+ +**Notification** +: The message or action delivered when an alert episode matches an action policy and a workflow sends it. Examples include a Slack message, an email, or a webhook call. + + +**Rule** +: The definition of what to watch for in your data, how often to check, and what counts as a problem. Rules run on a schedule. In Signal mode they produce signals. In Alert mode they track ongoing alert episodes. + + +**Rule event** +: A record written to `.rule-events` every time a rule runs and its query finds a match. Every rule produces rule events. In Signal mode the record is a signal. In Alert mode it belongs to an alert episode. + +**Severity** +: A label you can attach to alert episodes to indicate how serious they are. Severity is available as a filter in action policies, so you can route critical alert episodes differently from low-priority ones. + + +**Signal** +: A rule event recorded when a rule runs in Signal mode. Signals are stored and queryable, but they don't open alert episodes or trigger notifications. + +**Threshold** +: The condition a rule uses to decide when something is worth alerting on. This includes both the query that detects the problem and settings that control how many times the condition must be met before an alert episode opens or closes. + + +**Workflow** +: The automation that sends a message or runs an action when an action policy decides a notification should go out. Examples include posting to Slack, sending an email, or calling a webhook. + + +## Next steps + +To understand how the {{alerting-v2-system}} fits into {{kib}}'s alerting options, refer to [Alerting](../alerting.md) or [Choose an alerting system](choose-an-alerting-system.md). 
\ No newline at end of file diff --git a/explore-analyze/images/rule-alert-mode-diagram.png b/explore-analyze/images/rule-alert-mode-diagram.png new file mode 100644 index 0000000000..3c28586a7b Binary files /dev/null and b/explore-analyze/images/rule-alert-mode-diagram.png differ diff --git a/explore-analyze/images/rule-detect-mode-diagram.png b/explore-analyze/images/rule-detect-mode-diagram.png new file mode 100644 index 0000000000..cf35161810 Binary files /dev/null and b/explore-analyze/images/rule-detect-mode-diagram.png differ diff --git a/explore-analyze/toc.yml b/explore-analyze/toc.yml index 2c5ef9becb..4844295fd0 100644 --- a/explore-analyze/toc.yml +++ b/explore-analyze/toc.yml @@ -382,6 +382,8 @@ toc: - file: report-and-share/reporting-troubleshooting-pdf.md - file: alerting.md children: + - file: alerting/choose-an-alerting-system.md + - file: alerting/kibana-alerting-experimental.md - file: alerting/alerts.md children: - file: alerting/alerts/alerting-getting-started.md