From 96bd0d7d1379b2d4e5b704cba5d57619fdbdad81 Mon Sep 17 00:00:00 2001
From: tfishler1 <165731132+tfishler1@users.noreply.github.com>
Date: Wed, 7 Jan 2026 10:17:53 -0800
Subject: [PATCH 1/6] Revise VM watch collectors documentation
Updated the date and modified sections for clarity and accuracy regarding VM watch collectors.
---
.../vm-watch-collector-suite.md | 65 +++++++------------
1 file changed, 24 insertions(+), 41 deletions(-)
diff --git a/articles/virtual-machines/vm-watch-collector-suite.md b/articles/virtual-machines/vm-watch-collector-suite.md
index a48dce81f08..d7a0d1dd881 100644
--- a/articles/virtual-machines/vm-watch-collector-suite.md
+++ b/articles/virtual-machines/vm-watch-collector-suite.md
@@ -5,14 +5,14 @@ author: ofemifowode
ms.author: ofemifowode
ms.service: azure-virtual-machines
ms.topic: concept-article
-ms.date: 02/05/2025
+ms.date: 01/07/2026
ms.subservice: monitoring
# Customer intent: As a systems administrator, I want to implement VM watch collectors to monitor VM health metrics and logs, so that I can proactively identify issues, optimize performance, and ensure the reliability of the virtual machine environment.
---
-# VM watch Collectors Suite
+# VMwatch Plugin Collections
-VM watch collectors are designed to gather VM health data on various resources like disk and network, by running health checks within the VM. This suite of collectors aid in identifying issues, monitoring performance trends, and optimizing resources to enhance the overall user experience.
+VMWatch is implemented with an Infra-Plugin model for functional scalability. VMWatch Infra is responsible for scheduling each Plugin's execution, and each Plugin measures the VM health of a specific area and emits VM health Signals (Check, Metric, Eventlog). Below is a summary of all the available Plugins in VMWatch, the Signals they emit, and their parameter configurations.
This article provides a summary of all available collectors in VM watch, along with the corresponding checks, metrics, logs, and parameter configurations. For detailed descriptions of each check, metric, and log, refer to the [VM watch overview](/azure/virtual-machines/azure-vm-watch) page.
@@ -24,8 +24,8 @@ This article assumes that you're familiar with:
> [!NOTE]
> | **Name** | **Description** |
> |---|---|
-> | **Collector** | Logical grouping of similar tests where you can collect checks, metrics, and logs to determine the health of a particular resource |
-> | **Signals** | What is emitted to reflect the health status of VMs. The three types of signals emitted are checks, metrics, and logs |
+> | **Plugin Name** | Logical grouping of similar tests where you can collect checks, metrics, and logs to determine the health of a particular resource |
+> | **Description** | A short description of the Plugin |
> | **Group** | Indicates whether the collectors are part of the core or optional group. Core group collectors are enabled by default, while optional group collectors can be enabled or disabled based on your requirements |
> | **Tags** | Used to categorize and filter checks, metrics, and logs |
> | **Eligibility** | Determines whether a collector is eligible to be executed based on the environment attributes you specify |
@@ -35,45 +35,28 @@ This article assumes that you're familiar with:
### Groups, tags and corresponding checks, metrics, and event logs
-| Collector Name | Group | Tags | Checks | Metrics | Event Logs |
-|---|---|---|---|---|---|
-| outbound_connectivity| Core|Network|
|||
-| dns| Core|Network| |||
-| tcp_stats | Core|Network| | - SegmentsRetransmitted
- TCPSynRetransmits (Linux only)
- NormalizedSegmentsRetransmitted
- ConnectionResets
- NormalizedConnectionResets
- FailedConnectionAttempts
- NormalizedFailedConnectionAttempts
- ActiveConnectionOpenings
- PassiveConnectionOpenings
- CurrentConnections
- SegmentsReceived
- SegmentsSent
||
-| clock_skew | Core|Clock| |||
-| disk_io | Core|Disk|| - UsedSpaceInBytes
- FreeSpaceInBytes
- CapacityInBytes
- UsedPercent
||
-| disk_iops | Core|Disk||||
-| imds | Core|IMDS| |||
-| process | Core|Process| |||
-| process_memory | Core|Process| | - ProcessRSSPercent
- ProcessPageFaults
- MachineMemoryTotalInBytes
- MachineMemoryUsedPercent
- TotalPageFaults
||
-| process_cpu | Core|Process| | - ProcessCPUCoreUsage
- ProcessCPUMachineUsage
- MachineTotalCpuUsage
||
-| process_monitor | Optional|Process||||
-| system_error | Core|OS| |||
-| az_storage_blob | Optional|AzBlob| |||
-| hardware_health_monitor | Optional|Hardware| | | |
-| hardware_health_nvidia_smi | Optional|Hardware| | | - hardware_health_nvidia_smi
|
+| Plugin Name | Description | Eligibility | Group | Tags | Default Behavior | Overwritable Parameters | Checks | Metrics |
+|---|---|---|---|---|---|---|---|---|
+| outbound_connectivity| Verify the outbound connectivity to a remote URL from the VM. |Eligible if EnvironmentAttribute "OutboundConnectivityDisabled" is not set or set to "false"| Core|*Network|This Plugin is executed every 60s. In each execution, it sends an http GET request to http://www.msftconnecttest.com/connecttest.txt with a timeout of 30s.| - OUTBOUND_CONNECTIVITY_INTERVAL: the execution interval of the Collector. Default: 60s
- OUTBOUND_CONNECTIVITY_URLS: the URLs that this Collector sends http GET requests to. URLs are provided as a string using `,` as separator. Default: `http://www.msftconnecttest.com/connecttest.txt`
- OUTBOUND_CONNECTIVITY_TIMEOUT_IN_MILLISECONDS: the http GET request time-out in milliseconds. Default: 30000
- OUTBOUND_CONNECTIVITY_TOTAL_ATTEMPTS: the total number of attempts to send http request if the previous one fails. Default: 1
- OUTBOUND_CONNECTIVITY_RETRY_INTERVAL_IN_SECONDS: the retry interval in seconds if the previous http request fails. Default: 10
| *outbound_connectivity |LatencyInNanoSeconds|
+| dns| Verify if the target DNS name can be resolved. |Eligible if EnvironmentAttribute "OutboundConnectivityDisabled" is not set or set to "false"| Core| * Network| This Plugin is executed every 180s. In each execution, it tries to resolve the DNS name www.msftconnecttest.com. The verification is marked as "Failed" if the DNS name cannot be resolved.| - DNS_INTERVAL: the execution interval of the Collector. Default: 180s
- DNS_NAMES: the domain names to be resolved separated by `,`. Default: `www.msftconnecttest.com`
| * dns ||
+| tcp_stats | Collect the TCP statistics of the VM |Always eligible| Core | * Network|This Plugin is executed every 180s. In each execution, it collects the TCP statistics of the last 180s.| - TCP_STATS_INTERVAL: the execution interval of the Collector. Default: 180s
|| - SegmentsRetransmitted
- TCPSynRetransmits (Linux only)
- NormalizedSegmentsRetransmitted
- ConnectionResets
- NormalizedConnectionResets
- FailedConnectionAttempts
- NormalizedFailedConnectionAttempts
- ActiveConnectionOpenings
- PassiveConnectionOpenings
- CurrentConnections
- SegmentsReceived
- SegmentsSent
||
+| clock_skew | Verify the clock skew between the VM and the remote NTP server|Eligible if EnvironmentAttribute "OutboundConnectivityDisabled" is not set or set to "false"| Core| * Clock| This Plugin is executed every 180s. In each execution, it retrieves the clock offset between the remote NTP server time.windows.com and the VM. The verification is marked as "Failed" if the clock skew is larger than 5.0 seconds. On a Windows VM, if connecting to the remote NTP server fails, it falls back to checking the Windows Time Service with the w32tm command. The verification is marked as "Failed" if the w32tm command returns "Leap Indicator: 3(not synchronized)".| - CLOCK_SKEW_INTERVAL: the execution interval of the Collector. Default: 180s
- CLOCK_SKEW_NTP_SERVER: the remote NTP server used to calculate clock skew. Default: time.windows.com
- CLOCK_SKEW_TIME_SKEW_THRESHOLD_IN_SECONDS: the threshold in seconds of clock offset to mark the verification as "Failed". Default: 5.0
|clockskew|
+| disk_io | Verify the disk IO availability (including folder creation/deletion, file creation, write, read, deletion) in each selected disk/partition mount point. Also collect the disk usage metrics of these mount points. If no mount points are specified, apply these operations on each available mount point.|Always eligible if mount points are not specified. If mount points are explicitly specified, only eligible when data disks are attached to the VM|Core|Disk | This Plugin is executed every 60s. In each execution, it verifies the disk IO availability in each available mount point by creating a folder, creating a file, writing bytes to it, deleting the file, and deleting the folder. Then it collects the disk usage info including used space, free space, total capacity and used percentage from each mount point. It also gets the disk LUN and disk serial number and reports them as a property.| - DISK_IO_INTERVAL: the execution interval of the Plugin. Default: 60s
- DISK_IO_MOUNT_POINTS: the mount points separated by `,`. No default value
- DISK_IO_IGNORE_FS_LIST: the file system list that should be ignored separated by `,`. Default: tmpfs,devtmpfs,devfs,iso9660,overlay,aufs,squashfs,autofs
- DISK_IO_FILENAME: the name of the file used to verify the file read/write. Default: vmwatch-{timestamp}.txt
| * disk_io | - UsedSpaceInBytes
- FreeSpaceInBytes (Linux only)
- CapacityInBytes
- UsedPercent
- WriteLatencyNs
- SyncLatencyNs
- ReadLatencyNs
||
+| disk_iops | Collect the disk read/write operations per second from all available disk devices or explicitly specified disk devices.|Always eligible|Core|* Disk|This Plugin is executed every 180s. In each execution, it collects the disk read and write operations per second metrics from each available disk device.| - DISK_IOPS_INTERVAL: the execution interval of the Collector. Default: 180s
- DISK_IOPS_DEVICES: the device names separated by `,`. No default value
- DISK_IOPS_IGNORE_DEVICE_REGEX: the regex of the device name that should be ignored. Default: loop
|| - WriteOps
- ReadOps
- DiskReadBytesPerSec
- DiskTransfersPerSec
||
+| vm_cpu | Collect the machine total CPU usage, CPU count, and the usage of each CPU core.| Always eligible| Core| *CPU| This Plugin is executed every 180s. In each execution, it collects machine total CPU usage, CPU count, and the usage of each CPU core| - VM_CPU_INTERVAL: the execution interval of the Plugin. Default: 180s
|| - TotalCpuUsage
- CpuCount
- PerCore
||
+| vm_blip | Measure the elapsed time in milliseconds for the given measurement interval, to detect any VM blip.| Always eligible| Core| *Blip| This Plugin is executed every 11s. In each execution, it measures the elapsed time in milliseconds of the given interval of 10 seconds| - VM_BLIP_INTERVAL: the execution interval of the Plugin. Default: 11s
- VM_BLIP_MEASUREMENT_INTERVAL_IN_SECONDS: the given measurement interval. Default: 10s
||||
+| imds | Query the IMDS endpoint from within the VM and verify the reponse of the IMDS query response. |Always eligible| Core| * IMDS| This Plugin is executed every 180s. In each execution, it queries the IMDS endpoint http://169.254.169.254/metadata/instance/compute and verifies the reponse body contains the information (SubscriptionId, ResourceGroup, VMId, ResourceId) of the VM. The query timeout is 10s. If the query fails, it will retry at most another 3 more times with an interval of 15s, 30s and 45s.| - IMDS_INTERVAL: the execution interval of the Plugin. Default: 180s
- IMDS_ENDPOINT: the URL of the IMDS endpoint. Default:http://169.254.169.254/metadata/instance/compute
- IMDS_TIMEOUT_IN_SECONDS: the timeout in seconds of each query. Default: 10
- IMDS_QUERY_TOTAL_ATTEMPTS: the total number of attempts to send http request if the previous one fails. Default: 4
- IMDS_RETRY_INTERVAL_IN_SEONDS: the retry interval in seconds if the previous http request fails. Default: 15, 30, 45
| * imds| |
+| process | Verify if a process can be created and executed. |Always eligible| Core| * Process|This Plugin is executed every 180s. In each execution, it creates and executes command ${SYTEM_DIR}\system32\cmd.exe /c echo hello in Windows machine and /bin/sh -c echo hello in Linux machine. The timeout of process execution is 10s.| - PROCESS_INTERVAL: the execution interval of the Collector. Default: 180s
- PROCESS_TIMEOUT: the time-out of process execution. Default: 10s
| * process|
+| process_memory | Collect the top 3 processes using the most memory resource. Report each process's memory usage over the machine total memory and its Page Fault Counter. Also report the machine's total memory, machine's used memory percentage and total Page Faults Counter.|Always eligible|Core |Process|This Plugin is executed every 180s. In each execution, it selects the top 3 processes with the most memory usage and reports the ProcessRSSPercent, ProcessPageFaults, MachineMemoryTotalInBytes, MachineMemoryUsedPercent and TotalPageFaults.| - PROCESS_MEMORY_INTERVAL: the execution interval of the Plugin. Default: 180s
|| - ProcessRSSPercent
- ProcessPageFaults
- MachineMemoryTotalInBytes
- MachineMemoryUsedPercent
- TotalPageFaults
||
+| process_cpu | Collect the top 3 processes using the most CPU resource. Report each process's CPU usage over the CPU core and machine. Also report the machine's total CPU usage.|Always eligible|Core |Process|This Plugin is executed every 180s. In each execution, it selects the top 3 processes with the most CPU usage and reports the ProcessCoreUsage, ProcessMachineUsage and MachineTotalCpuUsage.| - PROCESS_CPU_INTERVAL: the execution interval of the Plugin. Default: 180s
|| - ProcessCPUCoreUsage
- ProcessCPUMachineUsage
- MachineTotalCpuUsage
||
+| process_monitor | Verify if the selected process is running and collect its running time in seconds.|Always eligible|Optional|Process|Not executed.| - PROCESS_MONITOR_INTERVAL: the execution interval of the Collector. Default: 180s
- PROCESS_MONITOR_PROCESS_NAMES: the Regular Expression of process names to be monitored separated by `,`. No default value
||||
+| system_error | Collect the error at system level event log (Windows only).|Eligible in Windows machine|Core|OS| The plugin is exeucted every 3 mins. In each execution, it subscribes to the "System" Channel of Windows EventLog and queries Events with Level defined in SystemData <=2 (including LOG_ALWAYS, Critital, Error). The measurementTarget is defined as Source_EventId_ShortHash of the EventLog using default Windows locale. A cap of no more than 10 different measurementTargets is applied in each collection.| - SYSTEM_ERROR_MEASUREMENT_TARGET_CAP: the cap of total different measurementTargets in each collection. Default: 10
||||
+| az_storage_blob | Verify if the VM can access the selected Azure Storage Blob by using either Managed Identity or SAS token.|Eligible if EnvironmentAttribute "OutboundConnectivityDisabled" is not set or set to "false"|Optional| *AzBlob|Not executed.| - AZ_STORAGE_BLOB_INTERVAL: the execution interval of the Collector. Default: 180s
- AZ_STORAGE_ACCOUNT_NAME: the Azure Storage account name. No default value
- AZ_STORAGE_CONTAINER_NAME: the Azure Storage Container name. No default value
- AZ_STORAGE_BLOB_NAME: the Azure Storage Blob name. No default value
- AZ_STORAGE_BLOB_DOMAIN_NAME: the Azure Storage domain name. No default value
- AZ_STORAGE_SAS_TOKEN_BASE64: the Base64 encoded Azure Storage SAS token. No default value
- AZ_STORAGE_USE_MANAGED_IDENTITY: if the managed identity will be used for authentication. Default: false
- AZ_STORAGE_MANAGED_IDENTITY_CLIENT_ID: the managed identity client ID for authentication. No default value
| |
+| hardware_health_monitor | Collect hardware health info from the Windows event log. Currently only disk-related critical events are collected, including events with ID 7, 500, 504, 505, 512 and 549. These events carry critical metrics about disk health status; for directly attached NVMe devices, they are available only on the VM side. With these metrics, it is possible to monitor and alert on disk status and thereby improve VM service availability.|Eligible in Windows machine|Optional|*Hardware |Not executed.| - HARDWARE_HEALTH_MONITOR_INTERVAL: the execution interval of the Collector. Default: 180s
|| |
+| hardware_health_nvidia_smi | Collect GPU stats, including memory usage, GPU usage, temperature, and others, by running the nvidia-smi command (Linux Ubuntu only)|Eligible in Linux Ubuntu machine|Optional |*Hardware |Not executed.| - HARDWARE_HEALTH_NVIDIA_SMI_INTERVAL: the execution interval of the Collector. Default: 60s
- HARDWARE_HEALTH_NVIDIA_SMI_INTERVAL: the time-out of running /usr/bin/nvidia-smi command. Default: 10s
|| - hardware_health_nvidia_smi
|
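
The "Overwritable Parameters" column above lists each parameter with its default, and list-valued parameters use `,` as the separator. The following is a minimal Python sketch of how defaults and user overrides could be combined. It is not VMWatch's implementation: the names `DEFAULTS`, `effective_parameters`, and `split_list` are hypothetical, and how overrides are actually supplied to VMWatch is not covered here. Only the parameter names and default values are taken from the outbound_connectivity row.

```python
# Defaults copied from the outbound_connectivity row of the table.
DEFAULTS = {
    "OUTBOUND_CONNECTIVITY_INTERVAL": "60s",
    "OUTBOUND_CONNECTIVITY_URLS": "http://www.msftconnecttest.com/connecttest.txt",
    "OUTBOUND_CONNECTIVITY_TIMEOUT_IN_MILLISECONDS": "30000",
    "OUTBOUND_CONNECTIVITY_TOTAL_ATTEMPTS": "1",
    "OUTBOUND_CONNECTIVITY_RETRY_INTERVAL_IN_SECONDS": "10",
}


def effective_parameters(overrides):
    """Merge user overrides onto the defaults (hypothetical helper).

    Unknown keys are rejected so that a misspelled parameter name
    does not silently leave the default in place.
    """
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


def split_list(value):
    """Split a ','-separated parameter value, dropping empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


if __name__ == "__main__":
    params = effective_parameters({"OUTBOUND_CONNECTIVITY_TOTAL_ATTEMPTS": "3"})
    print(params["OUTBOUND_CONNECTIVITY_TOTAL_ATTEMPTS"])
    print(split_list(params["OUTBOUND_CONNECTIVITY_URLS"]))
```

Rejecting unknown keys is a design choice of this sketch, not documented VMWatch behavior.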
-### Eligibility, default behavior, and overwritable parameters
-| Collector Name | Eligibility | Default Behavior | Overwritable Parameters |
-|---|---|---|---|
-| outbound_connectivity| Eligible if EnvironmentAttribute "OutboundConnectivityDisabled" isn't set or set to "false" |This collector is executed every 60s. In each execution, it sends an http GET request to `http://www.msftconnecttest.com/connecttest.txt` with a time-out of 5s. If the request fails, it retries at most two more times with and interval of 10s. The verification is marked as "Failed" if all the retries fail. | - OUTBOUND_CONNECTIVITY_INTERVAL: the execution interval of the Collector. Default: 60s
- OUTBOUND_CONNECTIVITY_URLS: the URLs that this Collector sends http GET requests to. URLs are provided as a string using `,` as separator. Default: `http://www.msftconnecttest.com/connecttest.txt`
- OUTBOUND_CONNECTIVITY_TIMEOUT_IN_MILLISECONDS: the http GET request time-out in milliseconds. Default: 5000
- OUTBOUND_CONNECTIVITY_TOTAL_ATTEMPTS: the total number of attempts to send http request if the previous one fails. Default: 3
- OUTBOUND_CONNECTIVITY_RETRY_INTERVAL_IN_SECONDS: the retry interval in seconds if the previous http request fails. Default: 10
|
-| dns| Eligible if EnvironmentAttribute "OutboundConnectivityDisabled" isn't set or set to "false" |This Collector is executed every 180s. In each execution, it tries to resolve the DNS name `www.msftconnecttest.com` . The verification is marked as "Failed" if the DNS name can't be resolved. | - DNS_INTERVAL: the execution interval of the Collector. Default: 180s
- DNS_NAMES: the domain names to be resolved separated by `,`. Default: `www.msftconnecttest.com`
|
-| tcp_stats| Always eligible |This collector is executed every 180s. In each execution, it collects the TCP statistics of the last 180s. | - TCP_STATS_INTERVAL: the execution interval of the Collector. Default: 180s
|
-| clock_skew| Eligible if EnvironmentAttribute "OutboundConnectivityDisabled" isn't set or set to "false"|This collector is executed every 180s. In each execution, it retrieves the clock offset between the remote NTP server `time.windows.com` and the VM. The verification is marked as "Failed" if the clock skew is larger than 5.0 seconds. In Windows VM, if connecting to remote NTP server fails, it fallbacks to check Windows Time Service with w32tm command. The verification is marked as "Failed" if the w32tm command returns "Leap Indicator: 3(not synchronized)". | - CLOCK_SKEW_INTERVAL: the execution interval of the Collector. Default: 180s
- CLOCK_SKEW_NTP_SERVER: the remote NTP server used to calculate clock skew. Default: time.windows.com
- CLOCK_SKEW_TIME_SKEW_THRESHOLD_IN_SECONDS: the threshold in seconds of clock offset to mark the verification as "Failed". Default: 5.0
|
-| disk_io| Always eligible if mount points aren't specified. If mount points are explicitly specified, only eligible when data disks are attached to the VM |This collector is executed every 180s. In each execution, it verifies the disk io availability in each available mount point by creating a folder, creating a file, writing bytes to it, deleting it and delete the folder. Then it collects the disk usage info including used space, free space, total capacity and used percentage from each mount point. | - DISK_IO_INTERVAL: the execution interval of the Collector. Default: 180s
- DISK_IO_MOUNT_POINTS: the mount points separated by `,`. No default value
- DISK_IO_IGNORE_FS_LIST: the file system list that should be ignored separated by `,`. Default: tmpfs,devtmpfs,devfs,iso9660,overlay,aufs,squashfs,autofs
- DISK_IO_FILENAME: the name of the file used to verify the file read/write. Default: vmwatch-{timestamp}.txt
|
-| disk_iops| Always eligible |This collector is executed every 180s. In each execution, it collects the disk read and write operations per second metrics from each available disk device. | - DISK_IOPS_INTERVAL: the execution interval of the Collector. Default: 180s
- DISK_IOPS_DEVICES: the device names separated by `,`. No default value
- DISK_IOPS_IGNORE_DEVICE_REGEX: the regex of the device name that should be ignored. Default: loop
|
-| imds| Always eligible|This collector is executed every 180s. In each execution, it queries the IMDS endpoint `http://169.254.169.254/metadata/instance/compute` and verifies the response body contains the information (SubscriptionId, ResourceGroup, VMId, ResourceId) of the VM. The query time-out is 10s. If the query fails, it retries at most another three more times with an interval of 15s, 30s, and 45s. | - IMDS_INTERVAL: the execution interval of the Collector. Default: 180s
- IMDS_ENDPOINT: the URL of the IMDS endpoint. Default:`http://169.254.169.254/metadata/instance/compute`
- IMDS_TIMEOUT_IN_SECONDS: the time-out in seconds of each query. Default: 10
- IMDS_QUERY_TOTAL_ATTEMPTS: the total number of attempts to send http request if the previous one fails. Default: 4
- IMDS_RETRY_INTERVAL_IN_SEONDS: the retry interval in seconds if the previous http request fails. Default: 15, 30, 45
|
-| process| Always eligible|This collector is executed every 180s. In each execution, it creates and executes command `${SYTEM_DIR}\system32\cmd.exe /c echo hello` in Windows machine and `/bin/sh -c echo hello` in Linux machine. The time-out of process execution is 10s. | - PROCESS_INTERVAL: the execution interval of the Collector. Default: 180s
- PROCESS_TIMEOUT: the time-out of process execution. Default: 10s
|
-| process_memory| Always eligible|This collector is executed every 180s. In each execution, it selects the top three processes with the most memory usage and reports the ProcessRSSPercent, ProcessPageFaults, MachineMemoryTotalInBytes, MachineMemoryUsedPercent, and TotalPageFaults. | - PROCESS_MEMORY_INTERVAL: the execution interval of the Collector. Default: 180s
|
-| process_cpu| Always eligible|This collector is executed every 180s. In each execution, it selects the top three processes with the most CPU usage and reports the ProcessCoreUsage, ProcessMachineUsage, and MachineTotalCpuUsage. | - PROCESS_CPU_INTERVAL: the execution interval of the Collector. Default: 180s
|
-| process_monitor| Always eligible|Not executed. If explicitly enabled by the user, this collector verifies if the selected process is running and collect its running time in seconds. | - PROCESS_MONITOR_INTERVAL: the execution interval of the Collector. Default: 180s
- PROCESS_MONITOR_PROCESS_NAMES: the Regular Expression of process names to be monitored separated by `,`. No default value
|
-| system_error| Eligible in Windows machine|The Collector is executed every three mins. In each execution, it subscribes to the "System" channel of Windows EventLog and queries events with level defined in SystemData <=2 (including LOG_ALWAYS, Critical, Error). The measurementTarget is defined as Source_EventId of the EventLog using default Windows locale. A cap of no more than 10 different measurementTargets is applied in each collection. | - SYSTEM_ERROR_MEASUREMENT_TARGET_CAP: the cap of total different measurementTargets in each collection. Default: 10
|
-| az_storage_blob| Eligible if EnvironmentAttribute "OutboundConnectivityDisabled" isn't set or set to "false" |Not executed. If explicitly enabled by the user, this collector verifies if the VM can have access to the selected Azure Storage Blob by using either Managed Identity or SAS token. | - AZ_STORAGE_BLOB_INTERVAL: the execution interval of the Collector. Default: 180s
- AZ_STORAGE_ACCOUNT_NAME: the Azure Storage account name. No default value
- AZ_STORAGE_CONTAINER_NAME: the Azure Storage Container name. No default value
- AZ_STORAGE_BLOB_NAME: the Azure Storage Blob name. No default value
- AZ_STORAGE_BLOB_DOMAIN_NAME: the Azure Storage domain name. No default value
- AZ_STORAGE_SAS_TOKEN_BASE64: the Base64 encoded Azure Storage SAS token. No default value
- AZ_STORAGE_USE_MANAGED_IDENTITY: if the managed identity will be used for authentication. Default: false
- AZ_STORAGE_MANAGED_IDENTITY_CLIENT_ID: the managed identity client ID for authentication. No default value
|
-| hardware_health_monitor| Eligible in Windows machine|Not executed. If explicitly enabled by the user, this collector collects hardware health info from Windows event log, currently only disk related critical events are collected, including events with ID 7, 500, 504, 505, 512 and 549. | - HARDWARE_HEALTH_MONITOR_INTERVAL: the execution interval of the Collector. Default: 180s
|
-| hardware_health_nvidia_smi | Eligible in Linux Ubuntu machine|Not executed. If explicitly enabled by the user, this collector collects hardware health info from Windows event log, currently only disk related critical events are collected, including events with ID 7, 500, 504, 505, 512 and 549. | - HARDWARE_HEALTH_NVIDIA_SMI_INTERVAL: the execution interval of the Collector. Default: 60s
- HARDWARE_HEALTH_NVIDIA_SMI_INTERVAL: the time-out of running /usr/bin/nvidia-smi command. Default: 10s
|
-
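
The imds row above describes a query that is attempted up to 4 times in total, waiting 15s, 30s, and 45s between attempts. The following is a minimal Python sketch of that retry schedule under stated assumptions: `run_with_retries` and its arguments are hypothetical names, the query itself is passed in as a callable rather than performed here, and the interval list is assumed to be non-empty. It is not VMWatch's implementation.

```python
import time
from typing import Callable, Sequence


def run_with_retries(
    query: Callable[[], bool],
    total_attempts: int = 4,                          # IMDS_QUERY_TOTAL_ATTEMPTS default
    retry_intervals: Sequence[float] = (15, 30, 45),  # retry interval defaults, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True as soon as query() succeeds, False once all attempts fail."""
    for attempt in range(total_attempts):
        if query():
            return True
        # Wait only if another attempt remains.
        if attempt < total_attempts - 1:
            # Assumption: reuse the last interval if fewer intervals than retries are given.
            delay = retry_intervals[min(attempt, len(retry_intervals) - 1)]
            sleep(delay)
    return False


if __name__ == "__main__":
    # Simulate two failures followed by a success, recording waits instead of sleeping.
    outcomes = iter([False, False, True])
    waits = []
    print(run_with_retries(lambda: next(outcomes), sleep=waits.append), waits)
```

Passing `sleep` as an argument keeps the sketch testable without real delays.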
### Next steps
From 6f2f3678f4a378e92b46f59d87e7435c666e9620 Mon Sep 17 00:00:00 2001
From: Diana Richards
Date: Wed, 7 Jan 2026 15:44:16 -0600
Subject: [PATCH 2/6] spelling fix
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---
articles/virtual-machines/vm-watch-collector-suite.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/articles/virtual-machines/vm-watch-collector-suite.md b/articles/virtual-machines/vm-watch-collector-suite.md
index d7a0d1dd881..8981497123c 100644
--- a/articles/virtual-machines/vm-watch-collector-suite.md
+++ b/articles/virtual-machines/vm-watch-collector-suite.md
@@ -50,7 +50,7 @@ This article assumes that you're familiar with:
| process_memory | Collect the top 3 processes using the most memory resource. Report each process's memory usage over the machine total memory and its Page Fault Counter. Also report the machine's total memory, machine's used memory percentage and total Page Faults Counter.|Always eligible|Core |Process|This Plugin is executed every 180s. In each execution, it selects the top 3 processes with the most memory usage and reports the ProcessRSSPercent, ProcessPageFaults, MachineMemoryTotalInBytes, MachineMemoryUsedPercent and TotalPageFaults.| - PROCESS_MEMORY_INTERVAL: the execution interval of the Plugin. Default: 180s
|| - ProcessRSSPercent
- ProcessPageFaults
- MachineMemoryTotalInBytes
- MachineMemoryUsedPercent
- TotalPageFaults
||
| process_cpu | Collect the top 3 processes using the most CPU resource. Report each process's CPU usage over the CPU core and machine. Also report the machine's total CPU usage.|Always eligible|Core |Process|This Plugin is executed every 180s. In each execution, it selects the top 3 processes with the most CPU usage and reports the ProcessCoreUsage, ProcessMachineUsage and MachineTotalCpuUsage.| - PROCESS_CPU_INTERVAL: the execution interval of the Plugin. Default: 180s
|| - ProcessCPUCoreUsage
- ProcessCPUMachineUsage
- MachineTotalCpuUsage
||
| process_monitor | Verify if the selected process is running and collect its running time in seconds.|Always eligible|Optional|Process|Not executed.| - PROCESS_MONITOR_INTERVAL: the execution interval of the Collector. Default: 180s
- PROCESS_MONITOR_PROCESS_NAMES: the Regular Expression of process names to be monitored separated by `,`. No default value
||||
-| system_error | Collect the error at system level event log (Windows only).|Eligible in Windows machine|Core|OS| The plugin is exeucted every 3 mins. In each execution, it subscribes to the "System" Channel of Windows EventLog and queries Events with Level defined in SystemData <=2 (including LOG_ALWAYS, Critital, Error). The measurementTarget is defined as Source_EventId_ShortHash of the EventLog using default Windows locale. A cap of no more than 10 different measurementTargets is applied in each collection.| - SYSTEM_ERROR_MEASUREMENT_TARGET_CAP: the cap of total different measurementTargets in each collection. Default: 10
||||
+| system_error | Collect the error at system level event log (Windows only).|Eligible in Windows machine|Core|OS| The plugin is executed every 3 mins. In each execution, it subscribes to the "System" Channel of Windows EventLog and queries Events with Level defined in SystemData <=2 (including LOG_ALWAYS, Critical, Error). The measurementTarget is defined as Source_EventId_ShortHash of the EventLog using default Windows locale. A cap of no more than 10 different measurementTargets is applied in each collection.| - SYSTEM_ERROR_MEASUREMENT_TARGET_CAP: the cap of total different measurementTargets in each collection. Default: 10
||||
| az_storage_blob | Verify if the VM can access the selected Azure Storage Blob by using either Managed Identity or SAS token.|Eligible if EnvironmentAttribute "OutboundConnectivityDisabled" is not set or set to "false"|Optional| *AzBlob|Not executed.| - AZ_STORAGE_BLOB_INTERVAL: the execution interval of the Collector. Default: 180s
- AZ_STORAGE_ACCOUNT_NAME: the Azure Storage account name. No default value
- AZ_STORAGE_CONTAINER_NAME: the Azure Storage Container name. No default value
- AZ_STORAGE_BLOB_NAME: the Azure Storage Blob name. No default value
- AZ_STORAGE_BLOB_DOMAIN_NAME: the Azure Storage domain name. No default value
- AZ_STORAGE_SAS_TOKEN_BASE64: the Base64 encoded Azure Storage SAS token. No default value
- AZ_STORAGE_USE_MANAGED_IDENTITY: if the managed identity will be used for authentication. Default: false
- AZ_STORAGE_MANAGED_IDENTITY_CLIENT_ID: the managed identity client ID for authentication. No default value
| |
| hardware_health_monitor | Collect hardware health info from the Windows event log. Currently only disk-related critical events are collected, including events with ID 7, 500, 504, 505, 512 and 549. These events carry critical metrics about disk health status; for directly attached NVMe devices, they are available only on the VM side. With these metrics, it is possible to monitor and alert on disk status and thereby improve VM service availability.|Eligible in Windows machine|Optional|*Hardware |Not executed.| - HARDWARE_HEALTH_MONITOR_INTERVAL: the execution interval of the Collector. Default: 180s
|| |
| hardware_health_nvidia_smi | Collect GPU stats, including memory usage, GPU usage, temperature, and others, by running the nvidia-smi command (Linux Ubuntu only)|Eligible in Linux Ubuntu machine|Optional |*Hardware |Not executed.| - HARDWARE_HEALTH_NVIDIA_SMI_INTERVAL: the execution interval of the Collector. Default: 60s
- HARDWARE_HEALTH_NVIDIA_SMI_INTERVAL: the time-out of running /usr/bin/nvidia-smi command. Default: 10s
|| - hardware_health_nvidia_smi
|
From 4da9d5bdc5f0c44361c3d77753ca9b824970502f Mon Sep 17 00:00:00 2001
From: Diana Richards
Date: Wed, 7 Jan 2026 15:44:36 -0600
Subject: [PATCH 3/6] spelling fix
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---
articles/virtual-machines/vm-watch-collector-suite.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/articles/virtual-machines/vm-watch-collector-suite.md b/articles/virtual-machines/vm-watch-collector-suite.md
index 8981497123c..41eb7481c69 100644
--- a/articles/virtual-machines/vm-watch-collector-suite.md
+++ b/articles/virtual-machines/vm-watch-collector-suite.md
@@ -45,7 +45,7 @@ This article assumes that you're familiar with:
| disk_iops | Collect the disk read/write operations per second from all available disk devices or explicitly specified disk devices.|Always eligible|Core|* Disk|This Plugin is executed every 180s. In each execution, it collects the disk read and write operations per second metrics from each available disk device.| - DISK_IOPS_INTERVAL: the execution interval of the Collector. Default: 180s
- DISK_IOPS_DEVICES: the device names separated by `,`. No default value
- DISK_IOPS_IGNORE_DEVICE_REGEX: the regex of the device name that should be ignored. Default: loop
|| - WriteOps
- ReadOps
- DiskReadBytesPerSec
- DiskTransfersPerSec
||
| vm_cpu | Collect the machine total CPU usage, CPU count, and the usage of each CPU core.| Always eligible| Core| *CPU| This Plugin is executed every 180s. In each execution, it collects machine total CPU usage, CPU count, and the usage of each CPU core| - VM_CPU_INTERVAL: the execution interval of the Plugin. Default: 180s
|| - TotalCpuUsage
- CpuCount
- PerCore
||
| vm_blip | Measure the elapsed time in milliseconds for the given measurement interval, to detect any VM blip.| Always eligible| Core| *Blip| This Plugin is executed every 11s. In each execution, it measures the elapsed time in milliseconds of the given interval of 10 seconds| - VM_BLIP_INTERVAL: the execution interval of the Plugin. Default: 11s
- VM_BLIP_MEASUREMENT_INTERVAL_IN_SECONDS: the given measurement interval. Default: 10s
||||
-| imds | Query the IMDS endpoint from within the VM and verify the reponse of the IMDS query response. |Always eligible| Core| * IMDS| This Plugin is executed every 180s. In each execution, it queries the IMDS endpoint http://169.254.169.254/metadata/instance/compute and verifies the reponse body contains the information (SubscriptionId, ResourceGroup, VMId, ResourceId) of the VM. The query timeout is 10s. If the query fails, it will retry at most another 3 more times with an interval of 15s, 30s and 45s.| - IMDS_INTERVAL: the execution interval of the Plugin. Default: 180s
- IMDS_ENDPOINT: the URL of the IMDS endpoint. Default:http://169.254.169.254/metadata/instance/compute
- IMDS_TIMEOUT_IN_SECONDS: the timeout in seconds of each query. Default: 10
- IMDS_QUERY_TOTAL_ATTEMPTS: the total number of attempts to send http request if the previous one fails. Default: 4
- IMDS_RETRY_INTERVAL_IN_SEONDS: the retry interval in seconds if the previous http request fails. Default: 15, 30, 45
| * imds| |
+| imds | Query the IMDS endpoint from within the VM and verify the query response. |Always eligible| Core| * IMDS| This Plugin is executed every 180s. In each execution, it queries the IMDS endpoint http://169.254.169.254/metadata/instance/compute and verifies that the response body contains the VM's information (SubscriptionId, ResourceGroup, VMId, ResourceId). The query timeout is 10s. If the query fails, it retries up to 3 more times, at intervals of 15s, 30s, and 45s.| - IMDS_INTERVAL: the execution interval of the Plugin. Default: 180s
- IMDS_ENDPOINT: the URL of the IMDS endpoint. Default:http://169.254.169.254/metadata/instance/compute
- IMDS_TIMEOUT_IN_SECONDS: the timeout in seconds of each query. Default: 10
- IMDS_QUERY_TOTAL_ATTEMPTS: the total number of attempts to send http request if the previous one fails. Default: 4
- IMDS_RETRY_INTERVAL_IN_SEONDS: the retry interval in seconds if the previous http request fails. Default: 15, 30, 45
| * imds| |
| process | Verify if a process can be created and executed. |Always eligible| Core| * Process|This Plugin is executed every 180s. In each execution, it creates and executes command ${SYTEM_DIR}\system32\cmd.exe /c echo hello in Windows machine and /bin/sh -c echo hello in Linux machine. The timeout of process execution is 10s.| - PROCESS_INTERVAL: the execution interval of the Collector. Default: 180s
- PROCESS_TIMEOUT: the time-out of process execution. Default: 10s
| * process|
| process_memory | Collect the top 3 processes using the most memory resource. Report each process's memory usage over the machine total memory and its Page Fault Counter. Also report the machine's total memory, machine's used memory percentage and total Page Faults Counter.|Always eligible|Core |Process|This Plugin is executed every 180s. In each execution, it selects the top 3 processes with the most memory usage and reports the ProcessRSSPercent, ProcessPageFaults, MachineMemoryTotalInBytes, MachineMemoryUsedPercent and TotalPageFaults.| - PROCESS_MEMORY_INTERVAL: the execution interval of the Plugin. Default: 180s
|| - ProcessRSSPercent
- ProcessPageFaults
- MachineMemoryTotalInBytes
- MachineMemoryUsedPercent
- TotalPageFaults
||
| process_cpu | Collect the top 3 processes using the most CPU resource. Report each process's CPU usage over the CPU core and machine. Also report the machine's total CPU usage.|Always eligible|Core |Proces|This Plugin is executed every 180s. In each execution, it selects the top 3 processes with the most CPU usage and reports the ProcessCoreUsage, ProcessMachineUsage and MachineTotalCpuUsage.| - PROCESS_CPU_INTERVAL: the execution interval of the Plugin. Default: 180s
|| - ProcessCPUCoreUsage
- ProcessCPUMachineUsage
- MachineTotalCpuUsage
||
From 1eefd989ae4fd4e3f4d4acd4b5165e4bee812c2a Mon Sep 17 00:00:00 2001
From: Diana Richards
Date: Wed, 7 Jan 2026 15:44:46 -0600
Subject: [PATCH 4/6] spelling fix
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---
articles/virtual-machines/vm-watch-collector-suite.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/articles/virtual-machines/vm-watch-collector-suite.md b/articles/virtual-machines/vm-watch-collector-suite.md
index 41eb7481c69..9fcba463c18 100644
--- a/articles/virtual-machines/vm-watch-collector-suite.md
+++ b/articles/virtual-machines/vm-watch-collector-suite.md
@@ -48,7 +48,7 @@ This article assumes that you're familiar with:
| imds | Query the IMDS endpoint from within the VM and verify the query response. |Always eligible| Core| * IMDS| This Plugin is executed every 180s. In each execution, it queries the IMDS endpoint http://169.254.169.254/metadata/instance/compute and verifies that the response body contains the VM's information (SubscriptionId, ResourceGroup, VMId, ResourceId). The query timeout is 10s. If the query fails, it retries up to 3 more times, at intervals of 15s, 30s, and 45s.| - IMDS_INTERVAL: the execution interval of the Plugin. Default: 180s
- IMDS_ENDPOINT: the URL of the IMDS endpoint. Default:http://169.254.169.254/metadata/instance/compute
- IMDS_TIMEOUT_IN_SECONDS: the timeout in seconds of each query. Default: 10
- IMDS_QUERY_TOTAL_ATTEMPTS: the total number of attempts to send http request if the previous one fails. Default: 4
- IMDS_RETRY_INTERVAL_IN_SEONDS: the retry interval in seconds if the previous http request fails. Default: 15, 30, 45
| * imds| |
| process | Verify if a process can be created and executed. |Always eligible| Core| * Process|This Plugin is executed every 180s. In each execution, it creates and executes command ${SYTEM_DIR}\system32\cmd.exe /c echo hello in Windows machine and /bin/sh -c echo hello in Linux machine. The timeout of process execution is 10s.| - PROCESS_INTERVAL: the execution interval of the Collector. Default: 180s
- PROCESS_TIMEOUT: the time-out of process execution. Default: 10s
| * process|
| process_memory | Collect the top 3 processes using the most memory resource. Report each process's memory usage over the machine total memory and its Page Fault Counter. Also report the machine's total memory, machine's used memory percentage and total Page Faults Counter.|Always eligible|Core |Process|This Plugin is executed every 180s. In each execution, it selects the top 3 processes with the most memory usage and reports the ProcessRSSPercent, ProcessPageFaults, MachineMemoryTotalInBytes, MachineMemoryUsedPercent and TotalPageFaults.| - PROCESS_MEMORY_INTERVAL: the execution interval of the Plugin. Default: 180s
|| - ProcessRSSPercent
- ProcessPageFaults
- MachineMemoryTotalInBytes
- MachineMemoryUsedPercent
- TotalPageFaults
||
-| process_cpu | Collect the top 3 processes using the most CPU resource. Report each process's CPU usage over the CPU core and machine. Also report the machine's total CPU usage.|Always eligible|Core |Proces|This Plugin is executed every 180s. In each execution, it selects the top 3 processes with the most CPU usage and reports the ProcessCoreUsage, ProcessMachineUsage and MachineTotalCpuUsage.| - PROCESS_CPU_INTERVAL: the execution interval of the Plugin. Default: 180s
|| - ProcessCPUCoreUsage
- ProcessCPUMachineUsage
- MachineTotalCpuUsage
||
+| process_cpu | Collect the top 3 processes using the most CPU resource. Report each process's CPU usage over the CPU core and machine. Also report the machine's total CPU usage.|Always eligible|Core |Process|This Plugin is executed every 180s. In each execution, it selects the top 3 processes with the most CPU usage and reports the ProcessCPUCoreUsage, ProcessCPUMachineUsage and MachineTotalCpuUsage.| - PROCESS_CPU_INTERVAL: the execution interval of the Plugin. Default: 180s
|| - ProcessCPUCoreUsage
- ProcessCPUMachineUsage
- MachineTotalCpuUsage
||
| process_monitor | Verify if the selected process is running and collect its running time in seconds.|Always eligible|Optional|Process|Not executed.| - PROCESS_MONITOR_INTERVAL: the execution interval of the Collector. Default: 180s
- PROCESS_MONITOR_PROCESS_NAMES: the Regular Expression of process names to be monitored separated by `,`. No default value
||||
| system_error | Collect errors from the system-level event log (Windows only).|Eligible in Windows machine|Core|OS| The plugin is executed every 3 mins. In each execution, it subscribes to the "System" Channel of Windows EventLog and queries Events with Level defined in SystemData <=2 (including LOG_ALWAYS, Critical, Error). The measurementTarget is defined as Source_EventId_ShortHash of the EventLog using default Windows locale. A cap of no more than 10 different measurementTargets is applied in each collection.| - SYSTEM_ERROR_MEASUREMENT_TARGET_CAP: the cap of total different measurementTargets in each collection. Default: 10
||||
| az_storage_blob | Verify if the VM can have access to the selected Azure Storage Blob by using either Managed Identity or SAS token.|Eligible if EnrironmentAttribute "OutboundConnectivityDisbled" is not set or set to "false"|Optional| *AzBlob|Not executed.| - AZ_STORAGE_BLOB_INTERVAL: the execution interval of the Collector. Default: 180s
- AZ_STORAGE_ACCOUNT_NAME: the Azure Storage account name. No default value
- AZ_STORAGE_CONTAINER_NAME: the Azure Storage Container name. No default value
- AZ_STORAGE_BLOB_NAME: the Azure Storage Blob name. No default value
- AZ_STORAGE_BLOB_DOMAIN_NAME: the Azure Storage domain name. No default value
- AZ_STORAGE_SAS_TOKEN_BASE64: the Base64 encoded Azure Storage SAS token. No default value
- AZ_STORAGE_USE_MANAGED_IDENTITY: if the managed identity will be used for authentication. Default: false
- AZ_STORAGE_MANAGED_IDENTITY_CLIENT_ID: the managed identity client ID for authentication. No default value
| |
From f974de7006182d13115f2af504c64850d2d02a85 Mon Sep 17 00:00:00 2001
From: tfishler1 <165731132+tfishler1@users.noreply.github.com>
Date: Wed, 7 Jan 2026 13:59:56 -0800
Subject: [PATCH 5/6] Update
articles/virtual-machines/vm-watch-collector-suite.md
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---
articles/virtual-machines/vm-watch-collector-suite.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/articles/virtual-machines/vm-watch-collector-suite.md b/articles/virtual-machines/vm-watch-collector-suite.md
index 9fcba463c18..734932f184c 100644
--- a/articles/virtual-machines/vm-watch-collector-suite.md
+++ b/articles/virtual-machines/vm-watch-collector-suite.md
@@ -12,7 +12,7 @@ ms.subservice: monitoring
# VMwatch Plugin Collections
-VMWatch is implemented with an Infra-Plugin model for functional scalability. VMWatch Infra is reponsible for the scheduling of each Plugin's execution, and each Plugin is targeted to measure the VM health of a spefic area and emit the VM health Signals (Check, Metric, Eventlog). Below is a summary of all the available Plugins in VMWatch, the Signals they emit and their parameter configurations.
+VMWatch is implemented with an Infra-Plugin model for functional scalability. VMWatch Infra is responsible for scheduling each Plugin's execution, and each Plugin measures the VM health of a specific area and emits VM health Signals (Check, Metric, Eventlog). Below is a summary of all the available Plugins in VMWatch, the Signals they emit, and their parameter configurations.
This article provides a summary of all available collectors in VM watch, along with the corresponding checks, metrics, logs, and parameter configurations. For detailed descriptions of each check, metric, and log, refer to the [VM watch overview](/azure/virtual-machines/azure-vm-watch) page.
From fe9fa2945d373e76ecf5df441eedc3825ed6f7e5 Mon Sep 17 00:00:00 2001
From: tfishler1 <165731132+tfishler1@users.noreply.github.com>
Date: Wed, 7 Jan 2026 14:00:05 -0800
Subject: [PATCH 6/6] Update
articles/virtual-machines/vm-watch-collector-suite.md
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---
articles/virtual-machines/vm-watch-collector-suite.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/articles/virtual-machines/vm-watch-collector-suite.md b/articles/virtual-machines/vm-watch-collector-suite.md
index 734932f184c..8164d9cd71b 100644
--- a/articles/virtual-machines/vm-watch-collector-suite.md
+++ b/articles/virtual-machines/vm-watch-collector-suite.md
@@ -25,7 +25,7 @@ This article assumes that you're familiar with:
> | **Name** | **Description** |
> |---|---|
> | **Plugin Name** | Logical grouping of similar tests where you can collect checks, metrics, and logs to determine the health of a particular resource |
-> | **Description** | a Short Description about the Plugin |
+> | **Description** | A short description of the Plugin |
> | **Group** | Indicates whether the collectors are part of the core or optional group. Core group collectors are enabled by default, while optional group collectors can be enabled or disabled based on your requirements |
> | **Tags** | Used to categorize and filter checks, metrics, and logs |
> | **Eligibility** | Determines whether a collector is eligible to be executed based on the environment attributes you specify |