Add experimental alerting features alerts pages #6524
Closed
2 commits
159 changes: 159 additions & 0 deletions
explore-analyze/alerting/kibana-alerting-experimental/alerts.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,159 @@ | ||
| --- | ||
| navigation_title: Alerts | ||
| applies_to: | ||
| stack: unavailable | ||
| serverless: preview | ||
| products: | ||
| - id: kibana | ||
| description: "Alert episodes in experimental alerting features: lifecycle states, series and episodes, signals versus alerts, and where to find them." | ||
| --- | ||
|
|
||
| # Alerts in the experimental alerting features | ||
|
|
||
| When a rule fires repeatedly on the same problem, a flat list of events doesn't tell you when the issue started, whether it's still happening, or how long it's been going on. Alert episodes fill that gap. Each episode is a persistent record of one issue on one series, from first breach through recovery, with every evaluation appended to the same history. Nothing is overwritten. | ||
|
|
||
| <!--[CONTENT NEEDED for M2: UI. Once the navigation and page name have been confirmed, add instructions for opening the Alerts page.] | ||
| --> | ||
|
|
||
| ## Alert lifecycle [alert-lifecycle] | ||
|
|
||
| Every alert episode moves through these states: | ||
|
|
||
| ``` | ||
| inactive → pending → active → recovering → inactive | ||
| ``` | ||
|
|
||
| | State | What it means | | ||
| | --- | --- | | ||
| | Inactive | No problem detected, or the problem is fully resolved. When an episode returns to this state, you get a recovery notification. | ||
| | Pending | Errors detected, but the system is waiting to confirm it's a real problem before fully alerting. | | ||
| | Active | Problem confirmed and ongoing. This is when you get notified. | | ||
| | Recovering | Errors have stopped, but the system is waiting to confirm it's truly resolved. | | ||
|
|
||
| <!-- TODO: Uncomment when PR #6523 (rules) is merged: | ||
| Activation and recovery thresholds control how many consecutive evaluations must agree, or how long the condition must persist, before transitioning. Refer to [Configure a rule](rules/configure-a-rule.md#activation-recovery-thresholds) to learn more about these settings. | ||
| --> | ||
| Activation and recovery thresholds control how many consecutive evaluations must agree, or how long the condition must persist, before an episode transitions from `pending` to `active` or from `recovering` to `inactive`. | ||
|
|
||
| ### Example: First breach opens, first clear closes | ||
|
|
||
| A checkout-latency rule runs in Alert mode every 5 minutes. Latency breaches at 14:05 and clears at 14:50: | ||
|
|
||
| 1. **14:00** — Routine check. p95 is within budget. The episode is `inactive`. | ||
| 2. **14:05** — p95 jumps to 3.1s. The first breach is detected. With no activation threshold, the episode opens immediately as `active`. | ||
| 3. **14:10–14:45** — Every evaluation finds high latency. The same episode stays `active`. No new episodes are created. | ||
| 4. **14:50** — p95 drops back under 2s. With no recovery threshold, the episode resolves immediately to `inactive`. | ||
|
|
||
| One problem is tracked in one episode, even though the rule evaluated many times while the condition was ongoing. | ||
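The walkthrough above can be sketched in a few lines of Python. This is only an illustration of the grouping idea, assuming no activation or recovery thresholds. The function name `collapse_into_episodes` and the `(timestamp, breached)` tuples are hypothetical and are not part of the product.

```python
from typing import List, Tuple


def collapse_into_episodes(evaluations: List[Tuple[str, bool]]) -> List[Tuple[str, str]]:
    """Return (opened_at, closed_at) pairs; closed_at is "" while an episode is still open."""
    episodes = []
    opened_at = None
    for timestamp, breached in evaluations:
        if breached and opened_at is None:
            opened_at = timestamp  # first breach opens the episode
        elif not breached and opened_at is not None:
            episodes.append((opened_at, timestamp))  # first clear closes it
            opened_at = None
    if opened_at is not None:
        episodes.append((opened_at, ""))
    return episodes


# 14:00 within budget, 14:05 through 14:45 breaching, 14:50 within budget again
timeline = (
    [("14:00", False)]
    + [(f"14:{minute:02d}", True) for minute in range(5, 50, 5)]
    + [("14:50", False)]
)
episodes = collapse_into_episodes(timeline)
```

Eleven evaluations collapse into a single `("14:05", "14:50")` episode, which mirrors the point made above: repeated breaches extend the open episode instead of creating new ones.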
|
|
||
| <!-- TODO: Add image once alert-episode-example-without-threshold.png is available in explore-analyze/images/ | ||
| :::{image} ../../images/alert-episode-example-without-threshold.png | ||
| :alt: Timeline of a checkout-latency alert episode without thresholds. At 14:05, p95 jumps to 3.1s and the episode opens immediately as active. It stays active through 14:45. At 14:50, p95 drops back under 2s and the episode resolves immediately as inactive. | ||
| ::: | ||
| --> | ||
|
|
||
| ### Example: Waiting for confirmation before opening and closing | ||
|
|
||
| The same checkout-latency rule, now with an activation threshold of 2 consecutive breaches and a recovery threshold of 2 consecutive clears: | ||
|
|
||
| 1. **14:00** — Routine check. p95 is within budget. The episode is `inactive`. | ||
| 2. **14:05** — p95 jumps to 3.1s. The first breach is detected. The episode is created in `pending` and the system starts counting consecutive breaches. | ||
| 3. **14:10** — p95 is still elevated. The second consecutive breach meets the activation threshold. The episode moves from `pending` to `active`, and the engineer is paged. | ||
| 4. **14:15–14:45** — Latency stays elevated. The episode remains `active`. | ||
| 5. **14:50** — p95 drops back under 2s. The first clean check moves the episode to `recovering`. The system starts counting consecutive clears. | ||
| 6. **14:55** — A second consecutive clear meets the recovery threshold. The episode moves from `recovering` to `inactive`. | ||
|
|
||
| Thresholds prevent brief spikes from opening episodes and transient dips from closing them prematurely. The episode waits in `pending` until the problem is confirmed, and waits in `recovering` until the resolution is confirmed. | ||
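As a rough model of the two examples, the lifecycle can be treated as a small state machine driven by consecutive-evaluation counts. The sketch below is a guess at the behavior, not the actual implementation: `EpisodeTracker` and its fields are hypothetical names, it models only count-based thresholds (not duration-based ones), and it assumes that a clear during `pending` returns the episode to `inactive` and that a breach during `recovering` returns it to `active`.

```python
from dataclasses import dataclass


@dataclass
class EpisodeTracker:
    activation_threshold: int = 1  # consecutive breaches needed to become active
    recovery_threshold: int = 1    # consecutive clears needed to become inactive
    status: str = "inactive"
    streak: int = 0                # consecutive evaluations spent in pending or recovering

    def evaluate(self, breached: bool) -> str:
        """Apply one rule evaluation and return the resulting episode status."""
        if self.status in ("inactive", "pending"):
            if breached:
                self.streak += 1
                confirmed = self.streak >= self.activation_threshold
                self.status = "active" if confirmed else "pending"
            else:
                self.status = "inactive"  # assumption: an unconfirmed spike is dropped
        else:  # active or recovering
            if breached:
                self.status = "active"    # assumption: a new breach cancels recovery
            else:
                self.streak += 1
                confirmed = self.streak >= self.recovery_threshold
                self.status = "inactive" if confirmed else "recovering"
        if self.status in ("active", "inactive"):
            self.streak = 0  # counting restarts once a transition is confirmed
        return self.status


tracker = EpisodeTracker(activation_threshold=2, recovery_threshold=2)
# 14:00 clear, 14:05 through 14:45 breaching, 14:50 and 14:55 clear
observations = [False] + [True] * 9 + [False, False]
states = [tracker.evaluate(breached) for breached in observations]
```

With both thresholds set to 2, `states` reads `inactive`, `pending`, `active` (eight times), `recovering`, `inactive`, matching the timeline above. With the default thresholds of 1, the same tracker reproduces the first example, where the episode opens and closes immediately.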
|
|
||
| <!-- TODO: Add image once alert-episode-example-with-activation-threshold.png is available in explore-analyze/images/ | ||
| :::{image} ../../images/alert-episode-example-with-activation-threshold.png | ||
| :alt: Timeline of a checkout-latency alert episode with activation threshold of 2 and recovery threshold of 2. At 14:05, the first breach puts the episode in pending. At 14:10, the second consecutive breach moves it to active. At 14:50, the first clean check moves it to recovering. At 14:55, the second consecutive clear resolves it to inactive. | ||
| ::: | ||
| --> | ||
|
|
||
| ## Series | ||
|
|
||
| A series is the ongoing relationship between a rule and one specific thing it monitors. | ||
|
|
||
| Suppose your rule monitors services. Each service it tracks has its own series: one for `checkout-service`, one for `payment-service`, and so on. A series exists for as long as that rule keeps monitoring that service. | ||
|
|
||
| Think of it like a patient's medical file. The file exists as long as the patient is in the system. Individual health incidents come and go, but the file persists. | ||
|
|
||
| <!-- TODO: Uncomment when PR #6523 (rules) is merged: | ||
| For the fields that identify a series in alert event documents, refer to [Rule event and field reference](rules/rule-event-field-reference.md#rule-reference). | ||
| --> | ||
|
|
||
| ### How series and episodes relate | ||
|
|
||
| An episode lives inside a series. A series can contain many episodes over its lifetime, one for each time that service had a problem. | ||
|
|
||
| ``` | ||
| Series: checkout-service | ||
| │ | ||
| ├── Episode 1: errors on April 10 (active → inactive) | ||
| ├── Episode 2: errors on April 15 (active → inactive) | ||
| └── Episode 3: errors on April 18 (active right now) | ||
| ``` | ||
|
|
||
| The series is the container. Episodes are the individual problems that happened within it. When the series breaches again after recovering, a new episode starts. | ||
|
|
||
| This means you can track "the checkout service was broken from 02:14 to 03:21" and "the payment service was broken at the same time" as separate episodes, even when both come from the same rule. | ||
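The relationship between a rule, its series, and their episodes can be pictured as a nested grouping. The following is a minimal sketch under stated assumptions: the `(series, timestamp, breached)` tuples and the function name are hypothetical, and thresholds are ignored so that every clear closes the open episode.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (series name, timestamp, breached): hypothetical evaluation results from one rule
Evaluation = Tuple[str, str, bool]


def episodes_by_series(evaluations: List[Evaluation]) -> Dict[str, List[List[str]]]:
    """Group breaching timestamps into episodes, tracked separately for each series."""
    episodes: Dict[str, List[List[str]]] = defaultdict(list)
    open_series = set()  # series that currently have an open episode
    for series, timestamp, breached in evaluations:
        if breached:
            if series not in open_series:
                episodes[series].append([])  # breach after recovery starts a new episode
                open_series.add(series)
            episodes[series][-1].append(timestamp)
        else:
            open_series.discard(series)  # a clear closes the open episode
    return dict(episodes)


evaluations = [
    ("checkout-service", "02:14", True),
    ("payment-service", "02:14", True),
    ("checkout-service", "03:21", False),
    ("payment-service", "03:21", False),
    ("checkout-service", "09:00", True),
]
result = episodes_by_series(evaluations)
```

Here `checkout-service` ends up with two episodes and `payment-service` with one, even though all five evaluations come from the same rule.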
|
|
||
| :::{tip} | ||
| Snooze operates at the series level, not the episode level. If you snooze `checkout-service`, you silence all notifications from that series for the duration of the snooze, regardless of how many new episodes start during that time. You're quieting a specific ongoing situation, not a single alert. | ||
| ::: | ||
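To make the series-level scope of snooze concrete, here is a small sketch. It is an illustration only: the `snoozed_until` mapping, the function names, and the date are hypothetical, and the real notification gating may involve more conditions than a single expiry time.

```python
from datetime import datetime, timedelta
from typing import Dict

snoozed_until: Dict[str, datetime] = {}  # keyed by series, not by episode


def snooze(series: str, now: datetime, hours: float) -> None:
    """Silence every episode of this series until the snooze expires."""
    snoozed_until[series] = now + timedelta(hours=hours)


def should_notify(series: str, now: datetime) -> bool:
    """Return False while the series is snoozed, whichever episode is asking."""
    expiry = snoozed_until.get(series)
    return expiry is None or now >= expiry


start = datetime(2025, 4, 18, 14, 0)  # arbitrary example time
snooze("checkout-service", start, hours=2)
```

Because the lookup key is the series, a new episode that opens on `checkout-service` an hour later is still quiet, while `payment-service` keeps notifying as usual.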
|
|
||
| ### A practical way to think about it | ||
|
|
||
| | Concept | Analogy | | ||
| | --- | --- | | ||
| | Rule | A security camera watching the building | | ||
| | Series | The camera's feed for one specific door | | ||
| | Episode | A specific incident caught on that feed | | ||
| | Rule events | The individual video frames | | ||
|
|
||
| The camera runs continuously (rule), always watching door 3 (series). One night someone breaks in. That's an episode. The frames captured during the break-in are the rule events. | ||
|
|
||
| ## Signals versus alerts | ||
|
|
||
| Every time a rule finds a match, it writes a document to `.rule-events`. Whether that document is a signal or an alert depends on the rule's mode, and that choice determines whether the system only records what happened or actively tracks it through to resolution. | ||
|
|
||
| A **signal** is a one-time observation. The system writes it and moves on: no lifecycle, no notifications, no follow-up. An **alert** participates in an episode. The system links it to every other document from the same problem, tracks the lifecycle states, and routes notifications through action policies. | ||
|
|
||
| | Type | What it is | When it's created | | ||
| | --- | --- | --- | | ||
| | Signal | A point-in-time record that the query matched (`type: signal`). Stored in `.rule-events`. | Rules in Detect mode | | ||
| | Alert | A lifecycle-tracked record that belongs to an episode (`type: alert`, with `episode.*` fields). Stored in `.rule-events`. | Rules in Alert mode | ||
|
|
||
| A rule in Detect mode only writes signals. It never opens episodes, so action policies have nothing to match against. | ||
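The difference between the two document types can be sketched as follows. The field names `type`, `status`, `episode.id`, and `episode.status` come from this documentation. The function, the lowercase mode strings, and the reduced set of fields are assumptions made for illustration, and real documents carry more fields.

```python
from typing import Any, Dict, Optional


def build_rule_event(
    mode: str,
    timestamp: str,
    episode_id: Optional[str] = None,
    episode_status: Optional[str] = None,
) -> Dict[str, Any]:
    """Simplified shape of a `.rule-events` document for a matching evaluation."""
    doc: Dict[str, Any] = {"@timestamp": timestamp, "status": "breached"}
    if mode == "detect":
        doc["type"] = "signal"  # point-in-time record with no episode attached
    elif mode == "alert":
        doc["type"] = "alert"   # linked to an episode and its lifecycle state
        doc["episode"] = {"id": episode_id, "status": episode_status}
    else:
        raise ValueError(f"unknown mode: {mode}")
    return doc


signal_doc = build_rule_event("detect", "2025-04-18T14:05:00")
alert_doc = build_rule_event("alert", "2025-04-18T14:05:00", "episode-1", "active")
```

Only the alert document carries `episode` fields, which is why a rule in Detect mode gives action policies nothing to match against.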
|
|
||
| ## Where alerts live | ||
|
|
||
| Alert events are stored in `.rule-events`. Triage actions (acknowledge, snooze, resolve) are stored in `.alert-actions`. Both are queryable in Discover. | ||
|
|
||
| <!-- TODO: Fix later | ||
| Every time you take an action on an episode — acknowledging it, snoozing it, resolving it, editing its tags — {{kib}} writes a new document to `.alert-actions`. These documents are append-only and can be queried in Discover for auditing and metrics such as mean time to acknowledge (MTTA). For field definitions, refer to [Alert states and fields reference](alerts/alert-states-and-fields-reference.md#alert-states-reference). | ||
| --> | ||
|
|
||
| <!--[CONTENT NEEDED for M2: UI. "V2 Alerting Preview" is a development-phase navigation label. Once the navigation and page name have been confirmed, add instructions for opening the Alerts page.] | ||
| --> | ||
|
|
||
| ### Data stream storage and retention | ||
|
|
||
| Both `.rule-events` and `.alert-actions` are data streams: append-only, time-series stores optimized for writes. On every rule evaluation, {{kib}} writes a **new document** to `.rule-events` rather than updating the previous one. Each document is a point-in-time snapshot. The `episode.status` field records the lifecycle state the episode was in at that exact evaluation. Nothing is overwritten. | ||
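One consequence of the append-only model is that an episode's history can be replayed from its snapshots. The sketch below uses an in-memory list as a stand-in for the data stream. The helper names and the nested dictionary layout are assumptions for illustration, not the storage format.

```python
from typing import Any, Dict, List

rule_events: List[Dict[str, Any]] = []  # stand-in for the `.rule-events` data stream


def record_evaluation(timestamp: str, episode_id: str, episode_status: str) -> None:
    """Append a new snapshot; earlier documents are never modified."""
    rule_events.append(
        {
            "@timestamp": timestamp,
            "type": "alert",
            "episode": {"id": episode_id, "status": episode_status},
        }
    )


def episode_history(episode_id: str) -> List[str]:
    """Replay an episode's states in time order from its snapshots."""
    docs = [doc for doc in rule_events if doc["episode"]["id"] == episode_id]
    docs.sort(key=lambda doc: doc["@timestamp"])
    return [doc["episode"]["status"] for doc in docs]


for ts, status in [
    ("14:05", "pending"),
    ("14:10", "active"),
    ("14:50", "recovering"),
    ("14:55", "inactive"),
]:
    record_evaluation(ts, "episode-1", status)

history = episode_history("episode-1")
```

Four evaluations leave four documents behind, and filtering on the shared episode ID recovers the full progression from `pending` to `inactive`.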
|
|
||
| <!-- TODO: Fix later | ||
| Because every evaluation produces its own document, you can reconstruct the full history of an episode by querying all documents that share the same `episode.id`. Refer to [Query alerts and signals in Discover](alerts/query-alerts-and-signals-in-discover.md#explore-alerts-discover) for example queries. | ||
|
|
||
| --> | ||
|
|
||
| Retention is managed automatically through ILM. Older backing indices move through storage tiers and are deleted when the retention window expires. You do not need to manually remove documents. {{kib}} manages versioning, retention, and lifecycle for both streams. Do not change their mappings or index settings. | ||
|
|
||
| ## Related pages | ||
|
|
||
| - **[View and manage alerts](alerts/view-and-manage-alerts.md):** Open the alert episodes table, triage active episodes, and acknowledge, snooze, or resolve them. | ||
| - **[Query alerts and signals in Discover](alerts/query-alerts-and-signals-in-discover.md):** Use {{esql}} to query `.rule-events` and `.alert-actions` for ad hoc analysis and dashboards. | ||
| - **[Alert states and fields reference](alerts/alert-states-and-fields-reference.md):** Look up lifecycle states, field names, and episode document structure. | ||
| <!-- TODO: Uncomment when PR #6525 (workflows/notifications) is merged: | ||
| - **[Notifications](notifications.md):** Set up action policies to route alert episodes to the right people and channels. | ||
| - **[Notification gating](notifications/notification-gating.md):** Understand how acknowledge, snooze, and deactivate control whether an episode triggers a notification. | ||
| --> | ||
58 changes: 58 additions & 0 deletions
...erting/kibana-alerting-experimental/alerts/alert-states-and-fields-reference.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,58 @@ | ||
| --- | ||
| navigation_title: Alert states and fields | ||
| applies_to: | ||
| stack: unavailable | ||
| serverless: preview | ||
| products: | ||
| - id: kibana | ||
| description: "Reference for episode status, `.rule-events` row status, and `.alert-actions` document fields in the experimental alerting features." | ||
| --- | ||
|
|
||
| # Alert states and fields reference [alert-states-reference] | ||
|
|
||
|
|
||
| Alert states and fields are part of the experimental alerting features in Kibana. Use these tables when you read alert UI state, query `.rule-events` or `.alert-actions` in Discover, or align API payloads with what operators see. For triage controls (acknowledge, snooze, resolve, tags) and how they map to storage, refer to [Alert actions](view-and-manage-alerts.md#alert-actions). | ||
| <!-- TODO: Uncomment when PR #6523 (rules) is merged: | ||
| For rule evaluation fields on `.rule-events`, refer to [Rule event and field reference](../rules/rule-event-field-reference.md#rule-reference). | ||
| --> | ||
|
|
||
| ## Episode status | ||
|
|
||
| The `episode.status` field appears on documents with `type: alert` in `.rule-events`. It records the lifecycle state of the alert episode at the time of that evaluation. | ||
|
|
||
| | Value | Description | | ||
| |---|---| | ||
| | `inactive` | Episode is not breaching: either no problem has been detected or the problem is fully resolved. | ||
| | `pending` | Condition met but activation thresholds not yet satisfied. | | ||
| | `active` | Episode is actively breaching per rule logic. | | ||
| | `recovering` | Condition clearing but recovery thresholds not yet satisfied. | | ||
|
|
||
| <!--[CONTENT NEEDED for M2: M2 adds two first-class severity fields to the episode document: | ||
|
|
||
| - `episode.severity` - The severity from the most recent rule event (current state). Updated each evaluation. | ||
| - `episode.severity_max` - The highest severity seen over the episode's lifetime. Never decreases; enables "peaked at CRITICAL" display in the episode detail UI. | ||
|
|
||
| Add a new table or rows for these fields once M2 ships. The episode detail UI is also expected to change to surface these fields directly. Several M2 details are still open: the enforced value set (or lack thereof), whether de-escalation triggers policy re-evaluation, and whether manual override is supported. Do not document specifics until resolved.] | ||
| --> | ||
|
|
||
| ## Rule event status | ||
|
|
||
| The `status` field appears on all documents in `.rule-events`, for both `type: signal` and `type: alert`. It reflects the outcome of a single rule evaluation row, independent of the episode lifecycle. | ||
|
|
||
| | Value | Description | | ||
| |---|---| | ||
| | `breached` | Condition met for this evaluation row. | | ||
| | `recovered` | Recovery path satisfied for this evaluation row. | | ||
| | `no_data` | The evaluation returned no data, and the rule's no-data handling recorded that outcome for this row. | ||
|
|
||
| ## Alert action document fields | ||
|
|
||
| When a user or the system records an action on an alert episode, {{kib}} writes a document to `.alert-actions`. Use this stream for triage history, operational metrics such as mean time to acknowledge (MTTA), and auditing. It does not store what your rule query returned on each run — that output is in `.rule-events`. | ||
|
|
||
| | Field | Type | Description | | ||
| |---|---|---| | ||
| | `@timestamp` | date | When the action was recorded. | | ||
| | `episode.id` | keyword | Target episode. | | ||
| | `rule.id` | keyword | Rule that owns the episode. | | ||
| | `action.type` | keyword | The action type, for example: <br>- `acknowledge`: User acknowledged the alert.<br>- `snooze`: Notifications snoozed for a period.<br>- `tag`: Tag applied to the alert.<br>- `fire`: Notification or escalation fired for the episode.<br>- `unmatched`: No action policy matched the episode, so no workflow ran for it under these policies. <br><br> For the full set of action types and UI behavior, refer to [Alert actions](view-and-manage-alerts.md#alert-actions). | | ||
| | `episode.status_count` | long | Count of consecutive evaluations in the current `episode.status`. Only set when `episode.status` is `pending` or `recovering`.<br>For example, if the episode stays `pending` for three rule evaluations in a row, the value is `3`. | |
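As an example of the operational metrics this stream supports, the sketch below computes a mean time to acknowledge from documents already fetched into memory. It rests on several assumptions: MTTA is measured from an episode's first `active` snapshot in `.rule-events` to its first `acknowledge` action, documents are nested dictionaries, and timestamps are ISO 8601 strings without a zone suffix. The function name is hypothetical.

```python
from datetime import datetime
from statistics import mean
from typing import Any, Dict, List


def mean_time_to_acknowledge(
    rule_events: List[Dict[str, Any]], alert_actions: List[Dict[str, Any]]
) -> float:
    """Average seconds from first `active` snapshot to first acknowledge, per episode."""
    first_active: Dict[str, datetime] = {}
    for doc in rule_events:
        if doc.get("type") == "alert" and doc["episode"]["status"] == "active":
            ts = datetime.fromisoformat(doc["@timestamp"])
            episode_id = doc["episode"]["id"]
            first_active[episode_id] = min(ts, first_active.get(episode_id, ts))

    first_ack: Dict[str, datetime] = {}
    for doc in alert_actions:
        if doc["action"]["type"] == "acknowledge":
            ts = datetime.fromisoformat(doc["@timestamp"])
            episode_id = doc["episode"]["id"]
            first_ack[episode_id] = min(ts, first_ack.get(episode_id, ts))

    # Only episodes that were both active and acknowledged contribute.
    deltas = [
        (first_ack[eid] - first_active[eid]).total_seconds()
        for eid in first_ack
        if eid in first_active
    ]
    return mean(deltas) if deltas else float("nan")


sample_events = [
    {"@timestamp": "2025-04-18T14:10:00", "type": "alert", "episode": {"id": "ep-1", "status": "active"}},
    {"@timestamp": "2025-04-18T14:15:00", "type": "alert", "episode": {"id": "ep-1", "status": "active"}},
    {"@timestamp": "2025-04-18T15:00:00", "type": "alert", "episode": {"id": "ep-2", "status": "active"}},
]
sample_actions = [
    {"@timestamp": "2025-04-18T14:16:00", "episode": {"id": "ep-1"}, "action": {"type": "acknowledge"}},
    {"@timestamp": "2025-04-18T15:02:00", "episode": {"id": "ep-2"}, "action": {"type": "acknowledge"}},
    {"@timestamp": "2025-04-18T15:05:00", "episode": {"id": "ep-2"}, "action": {"type": "snooze"}},
]
mtta_seconds = mean_time_to_acknowledge(sample_events, sample_actions)
```

In the sample data, the two episodes were acknowledged after 360 and 120 seconds, so the result is 240 seconds. Actions other than `acknowledge`, such as the `snooze` entry, are ignored.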
31 changes: 31 additions & 0 deletions
...ing/kibana-alerting-experimental/alerts/query-alerts-and-signals-in-discover.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,31 @@ | ||
| --- | ||
| navigation_title: Query alerts in Discover | ||
| applies_to: | ||
| stack: unavailable | ||
| serverless: preview | ||
| products: | ||
| - id: kibana | ||
| description: "Use {{esql}} in Discover against `.rule-events` and `.alert-actions`: sample queries, trends, and MTTA-style analysis for the experimental alerting features." | ||
| --- | ||
|
|
||
| # Query alerts and signals in Discover [explore-alerts-discover] | ||
|
|
||
|
|
||
| Querying alerts and signals in Discover is part of the experimental alerting features in Kibana. Discover gives you direct {{esql}} access to everything the experimental alerting features record, including rule evaluation history, episode progressions, triage actions, and operational metrics like mean time to acknowledge. | ||
|
|
||
| The Alerts UI shows current episode state. Discover lets you go further: ask arbitrary questions, spot trends over time, replay how a specific incident unfolded, or correlate alert history with other data in your environment. | ||
|
|
||
| To use this page, open Discover, select {{esql}}, paste a query from the examples below, then adjust the time range and placeholders (`YOUR_RULE_ID`, `YOUR_GROUP_HASH`) to match your environment. | ||
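If you generate queries from a script instead of pasting them by hand, a small helper can fill in the placeholder. The sketch below builds an {{esql}} query string that lists recent triage actions for one rule. The field names come from the alert action field reference, but the helper itself is hypothetical, and it assumes `.alert-actions` stays directly queryable, which may change while these features are experimental.

```python
def recent_actions_query(rule_id: str, limit: int = 50) -> str:
    """Build an ES|QL query listing the latest triage actions for one rule."""
    # Escape backslashes and double quotes so the ID is safe inside a string literal.
    escaped = rule_id.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "FROM .alert-actions\n"
        f'| WHERE rule.id == "{escaped}"\n'
        "| SORT @timestamp DESC\n"
        "| KEEP @timestamp, episode.id, action.type\n"
        f"| LIMIT {int(limit)}"
    )


query = recent_actions_query("YOUR_RULE_ID", limit=20)
```

Replace `YOUR_RULE_ID` with a real rule ID, paste the resulting text into the {{esql}} editor in Discover, and adjust the time range as needed.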
|
|
||
| <!--[CONTENT NEEDED: The queries on this page use `.rule-events` and `.alert-actions` directly. Confirm whether these will remain the intended query surface, or whether users should query an ES|QL view or a stable user-facing data stream instead. Update all examples accordingly before publishing.] | ||
| --> | ||
|
|
||
| <!--[CONTENT NEEDED for M2: Review and expand the query examples below once M2 field renames (`group_hash` → `series.key`, new `series.tracked_by`, `episode.severity`, `episode.severity_max`) are finalized. Add examples that take advantage of the new first-class severity and series fields.] | ||
| --> | ||
|
|
||
| <!-- TODO: Fix later | ||
| For field names, types, and episode fields, refer to [Alert states and fields reference](alert-states-and-fields-reference.md#alert-states-reference). For triage in the product UI, refer to [View, manage, and reference alerts](view-and-manage-alerts.md). | ||
|
|
||
| --> | ||
| <!-- TODO: Uncomment when PR #6523 (rules) is merged and restore full sentence: | ||
| For field names, types, and episode fields, refer to [Alert states and fields reference](alert-states-and-fields-reference.md#alert-states-reference) and [Rule event and field reference](../rules/rule-event-field-reference.md#rule-reference). For triage in the product UI, refer to [View, manage, and reference alerts](view-and-manage-alerts.md). | ||
| --> | ||
|
|
||