diff --git a/articles/virtual-machines/vm-watch-collector-suite.md b/articles/virtual-machines/vm-watch-collector-suite.md
index a48dce81f08..8164d9cd71b 100644
--- a/articles/virtual-machines/vm-watch-collector-suite.md
+++ b/articles/virtual-machines/vm-watch-collector-suite.md
@@ -5,14 +5,14 @@ author: ofemifowode
 ms.author: ofemifowode
 ms.service: azure-virtual-machines
 ms.topic: concept-article
-ms.date: 02/05/2025
+ms.date: 01/07/2026
 ms.subservice: monitoring
 # Customer intent: As a systems administrator, I want to implement VM watch collectors to monitor VM health metrics and logs, so that I can proactively identify issues, optimize performance, and ensure the reliability of the virtual machine environment.
 ---
 
-# VM watch Collectors Suite
+# VMwatch Plugin Collections
 
-VM watch collectors are designed to gather VM health data on various resources like disk and network, by running health checks within the VM. This suite of collectors aid in identifying issues, monitoring performance trends, and optimizing resources to enhance the overall user experience.
+VMWatch is implemented with an Infra-Plugin model for functional scalability. VMWatch Infra is responsible for scheduling each Plugin's execution, and each Plugin measures the VM health of a specific area and emits VM health Signals (Check, Metric, Eventlog). Below is a summary of all the available Plugins in VMWatch, the Signals they emit, and their parameter configurations.
 
 This article provides a summary of all available collectors in VM watch, along with the corresponding checks, metrics, logs, and parameter configurations. For detailed descriptions of each check, metric, and log, refer to the [VM watch overview](/azure/virtual-machines/azure-vm-watch) page.
 
@@ -24,8 +24,8 @@ This article assumes that you're familiar with:
 > [!NOTE]
 > | **Name** | **Description** |
 > |---|---|
-> | **Collector** | Logical grouping of similar tests where you can collect checks, metrics, and logs to determine the health of a particular resource |
-> | **Signals** | What is emitted to reflect the health status of VMs. The three types of signals emitted are checks, metrics, and logs |
+> | **Plugin Name** | Logical grouping of similar tests where you can collect checks, metrics, and logs to determine the health of a particular resource |
+> | **Description** | A short description of the Plugin |
 > | **Group** | Indicates whether the collectors are part of the core or optional group. Core group collectors are enabled by default, while optional group collectors can be enabled or disabled based on your requirements |
 > | **Tags** | Used to categorize and filter checks, metrics, and logs |
 > | **Eligibility** | Determines whether a collector is eligible to be executed based on the environment attributes you specify |
@@ -35,45 +35,28 @@ This article assumes that you're familiar with:
 
 ### Groups, tags and corresponding checks, metrics, and event logs
 
-| Collector Name | Group | Tags | Checks | Metrics | Event Logs |
-|---|---|---|---|---|---|
-| outbound_connectivity| Core|Network| |||
-| dns| Core|Network| |||
-| tcp_stats | Core|Network| | ||
-| clock_skew | Core|Clock| |||
-| disk_io | Core|Disk|| ||
-| disk_iops | Core|Disk||||
-| imds | Core|IMDS| |||
-| process | Core|Process| |||
-| process_memory | Core|Process| |||
-| process_cpu | Core|Process| |||
-| process_monitor | Optional|Process||||
-| system_error | Core|OS| |||
-| az_storage_blob | Optional|AzBlob| |||
-| hardware_health_monitor | Optional|Hardware| | | |
-| hardware_health_nvidia_smi | Optional|Hardware| | | |
+| Plugin Name | Description | Eligibility | Group | Tags | Default Behavior | Overwritable Parameters | Checks | Metrics |
+|---|---|---|---|---|---|---|---|---|
+| outbound_connectivity| Verify the outbound connectivity to a remote URL from the VM. |Eligible if EnvironmentAttribute "OutboundConnectivityDisabled" is not set or set to "false"| Core|*Network|This Plugin is executed every 60s. In each execution, it sends an HTTP GET request to http://www.msftconnecttest.com/connecttest.txt with a timeout of 30s.|| *outbound_connectivity |LatencyInNanoSeconds|
+| dns| Verify if the target DNS name can be resolved. |Eligible if EnvironmentAttribute "OutboundConnectivityDisabled" is not set or set to "false"| Core| * Network| This Plugin is executed every 180s. In each execution, it tries to resolve the DNS name www.msftconnecttest.com . The verification is marked as "Failed" if the DNS name cannot be resolved.|| * dns ||
+| tcp_stats | Collect the TCP statistics of the VM |Always eligible| Core | * Network|This Plugin is executed every 180s. In each execution, it collects the TCP statistics of the last 180s.| ||||
+| clock_skew | Verify the clock skew between the VM and the remote NTP server|Eligible if EnvironmentAttribute "OutboundConnectivityDisabled" is not set or set to "false"| Core| * Clock| This Plugin is executed every 180s. In each execution, it retrieves the clock offset between the remote NTP server time.windows.com and the VM. The verification is marked as "Failed" if the clock skew is larger than 5.0 seconds. In a Windows VM, if connecting to the remote NTP server fails, it falls back to checking the Windows Time Service with the w32tm command. The verification is marked as "Failed" if the w32tm command returns "Leap Indicator: 3(not synchronized)".| |clockskew|
+| disk_io | Verify the disk IO availability (including folder creation/deletion, file creation, write, read, deletion) in each selected disk/partition mount point. Also collect the disk usage metrics of these mount points. If no mount points are specified, apply these operations on each available mount point.|Always eligible if mount points are not specified. If mount points are explicitly specified, only eligible when data disks are attached to the VM|Core|Disk | This Plugin is executed every 60s. In each execution, it verifies the disk IO availability in each available mount point by creating a folder, creating a file, writing bytes to it, deleting it, and deleting the folder. Then it collects the disk usage info including used space, free space, total capacity and used percentage from each mount point. It also gets the disk LUN and disk serial number and reports them as a property.| | * disk_io |||
+| disk_iops | Collect the disk read/write operations per second from all available disk devices or explicitly specified disk devices.|Always eligible|Core|* Disk|This Plugin is executed every 180s. In each execution, it collects the disk read and write operations per second metrics from each available disk device.| ||||
+| vm_cpu | Collect the machine total CPU usage, CPU count, and the usage of each CPU core.| Always eligible| Core| *CPU| This Plugin is executed every 180s. In each execution, it collects machine total CPU usage, CPU count, and the usage of each CPU core| ||||
+| vm_blip | Measure the elapsed time in milliseconds for the given measurement interval, to detect any VM blip.| Always eligible| Core| *Blip| This Plugin is executed every 11s. In each execution, it measures the elapsed time in milliseconds of the given interval of 10 seconds| ||||
+| imds | Query the IMDS endpoint from within the VM and verify the IMDS query response. |Always eligible| Core| * IMDS| This Plugin is executed every 180s. In each execution, it queries the IMDS endpoint http://169.254.169.254/metadata/instance/compute and verifies the response body contains the information (SubscriptionId, ResourceGroup, VMId, ResourceId) of the VM. The query timeout is 10s. If the query fails, it retries at most 3 more times with intervals of 15s, 30s and 45s.|| * imds| |
+| process | Verify if a process can be created and executed. |Always eligible| Core| * Process|This Plugin is executed every 180s. In each execution, it creates and executes command ${SYTEM_DIR}\system32\cmd.exe /c echo hello on Windows machines and /bin/sh -c echo hello on Linux machines. The timeout of process execution is 10s.| | * process|
+| process_memory | Collect the top 3 processes using the most memory resource. Report each process's memory usage over the machine total memory and its Page Fault Counter. Also report the machine's total memory, machine's used memory percentage and total Page Faults Counter.|Always eligible|Core |Process|This Plugin is executed every 180s. In each execution, it selects the top 3 processes with the most memory usage and reports the ProcessRSSPercent, ProcessPageFaults, MachineMemoryTotalInBytes, MachineMemoryUsedPercent and TotalPageFaults.|||||
+| process_cpu | Collect the top 3 processes using the most CPU resource. Report each process's CPU usage over the CPU core and machine. Also report the machine's total CPU usage.|Always eligible|Core |Process|This Plugin is executed every 180s. In each execution, it selects the top 3 processes with the most CPU usage and reports the ProcessCoreUsage, ProcessMachineUsage and MachineTotalCpuUsage.|||||
+| process_monitor | Verify if the selected process is running and collect its running time in seconds.|Always eligible|Optional|Process|Not executed.| ||||
+| system_error | Collect the error at system level event log (Windows only).|Eligible in Windows machine|Core|OS| The plugin is executed every 3 mins. In each execution, it subscribes to the "System" Channel of Windows EventLog and queries Events with Level defined in SystemData <=2 (including LOG_ALWAYS, Critical, Error). The measurementTarget is defined as Source_EventId_ShortHash of the EventLog using default Windows locale. A cap of no more than 10 different measurementTargets is applied in each collection.| ||||
+| az_storage_blob | Verify if the VM can have access to the selected Azure Storage Blob by using either Managed Identity or SAS token.|Eligible if EnvironmentAttribute "OutboundConnectivityDisabled" is not set or set to "false"|Optional| *AzBlob|Not executed.|| |
+| hardware_health_monitor | Collect hardware health info from the Windows event log. Currently only disk-related critical events are collected, including events with ID 7, 500, 504, 505, 512 and 549. These events include critical metrics about disk health status; for directly attached NVMe devices, this information is available only on the VM side. With these metrics it is possible to monitor and alert on disk status, and thus improve VM service availability|Eligible in Windows machine|Optional|*Hardware |Not executed.||| |
+| hardware_health_nvidia_smi | Collect GPU stats, including memory usage, GPU usage, temperature, and others, by running the nvidia-smi command (Linux Ubuntu only)|Eligible in Linux Ubuntu machine|Optional |*Hardware |Not executed.| |||
-### Eligibility, default behavior, and overwritable parameters
-| Collector Name | Eligibility | Default Behavior | Overwritable Parameters |
-|---|---|---|---|
-| outbound_connectivity| Eligible if EnvironmentAttribute "OutboundConnectivityDisabled" isn't set or set to "false" |This collector is executed every 60s. In each execution, it sends an http GET request to `http://www.msftconnecttest.com/connecttest.txt` with a time-out of 5s. If the request fails, it retries at most two more times with and interval of 10s. The verification is marked as "Failed" if all the retries fail. | |
-| dns| Eligible if EnvironmentAttribute "OutboundConnectivityDisabled" isn't set or set to "false" |This Collector is executed every 180s. In each execution, it tries to resolve the DNS name `www.msftconnecttest.com` . The verification is marked as "Failed" if the DNS name can't be resolved. | |
-| tcp_stats| Always eligible |This collector is executed every 180s. In each execution, it collects the TCP statistics of the last 180s. | |
-| clock_skew| Eligible if EnvironmentAttribute "OutboundConnectivityDisabled" isn't set or set to "false"|This collector is executed every 180s. In each execution, it retrieves the clock offset between the remote NTP server `time.windows.com` and the VM. The verification is marked as "Failed" if the clock skew is larger than 5.0 seconds. In Windows VM, if connecting to remote NTP server fails, it fallbacks to check Windows Time Service with w32tm command. The verification is marked as "Failed" if the w32tm command returns "Leap Indicator: 3(not synchronized)". | |
-| disk_io| Always eligible if mount points aren't specified. If mount points are explicitly specified, only eligible when data disks are attached to the VM |This collector is executed every 180s. In each execution, it verifies the disk io availability in each available mount point by creating a folder, creating a file, writing bytes to it, deleting it and delete the folder. Then it collects the disk usage info including used space, free space, total capacity and used percentage from each mount point. | |
-| disk_iops| Always eligible |This collector is executed every 180s. In each execution, it collects the disk read and write operations per second metrics from each available disk device. | |
-| imds| Always eligible|This collector is executed every 180s. In each execution, it queries the IMDS endpoint `http://169.254.169.254/metadata/instance/compute` and verifies the response body contains the information (SubscriptionId, ResourceGroup, VMId, ResourceId) of the VM. The query time-out is 10s. If the query fails, it retries at most another three more times with an interval of 15s, 30s, and 45s. | |
-| process| Always eligible|This collector is executed every 180s. In each execution, it creates and executes command `${SYTEM_DIR}\system32\cmd.exe /c echo hello` in Windows machine and `/bin/sh -c echo hello` in Linux machine. The time-out of process execution is 10s. | |
-| process_memory| Always eligible|This collector is executed every 180s. In each execution, it selects the top three processes with the most memory usage and reports the ProcessRSSPercent, ProcessPageFaults, MachineMemoryTotalInBytes, MachineMemoryUsedPercent, and TotalPageFaults. | |
-| process_cpu| Always eligible|This collector is executed every 180s. In each execution, it selects the top three processes with the most CPU usage and reports the ProcessCoreUsage, ProcessMachineUsage, and MachineTotalCpuUsage. | |
-| process_monitor| Always eligible|Not executed. If explicitly enabled by the user, this collector verifies if the selected process is running and collect its running time in seconds. | |
-| system_error| Eligible in Windows machine|The Collector is executed every three mins. In each execution, it subscribes to the "System" channel of Windows EventLog and queries events with level defined in SystemData <=2 (including LOG_ALWAYS, Critical, Error). The measurementTarget is defined as Source_EventId of the EventLog using default Windows locale. A cap of no more than 10 different measurementTargets is applied in each collection. | |
-| az_storage_blob| Eligible if EnvironmentAttribute "OutboundConnectivityDisabled" isn't set or set to "false" |Not executed. If explicitly enabled by the user, this collector verifies if the VM can have access to the selected Azure Storage Blob by using either Managed Identity or SAS token. | |
-| hardware_health_monitor| Eligible in Windows machine|Not executed. If explicitly enabled by the user, this collector collects hardware health info from Windows event log, currently only disk related critical events are collected, including events with ID 7, 500, 504, 505, 512 and 549. | |
-| hardware_health_nvidia_smi | Eligible in Linux Ubuntu machine|Not executed. If explicitly enabled by the user, this collector collects hardware health info from Windows event log, currently only disk related critical events are collected, including events with ID 7, 500, 504, 505, 512 and 549. | |
-
 ### Next steps