From 603b21dcb8d7a40bc3b9fc9c13e29b92c0d1a36f Mon Sep 17 00:00:00 2001
From: Wei-Chiu Chuang
Date: Wed, 13 May 2026 14:32:03 -0700
Subject: [PATCH] HDDS-15270. Document ReplicationSupervisorMetrics in datanode decommission guide.

Add Datanode-side ReplicationSupervisor metrics alongside MeasuredReplicator under a combined Datanode metrics section.

Co-authored-by: Cursor
---
 .../03-datanodes/01-datanode-decommission.md | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/docs/05-administrator-guide/03-operations/03-node-decommissioning-and-maintenance/03-datanodes/01-datanode-decommission.md b/docs/05-administrator-guide/03-operations/03-node-decommissioning-and-maintenance/03-datanodes/01-datanode-decommission.md
index f5537557a2..633deeaec9 100644
--- a/docs/05-administrator-guide/03-operations/03-node-decommissioning-and-maintenance/03-datanodes/01-datanode-decommission.md
+++ b/docs/05-administrator-guide/03-operations/03-node-decommissioning-and-maintenance/03-datanodes/01-datanode-decommission.md
@@ -90,9 +90,11 @@ These metrics are available on the SCM and provide a cluster-wide view of the re
 - `replicasCreatedTotal` (`replication_manager_metrics_replicas_created_total`): The total number of container replicas successfully created.
 - `replicateContainerCmdsDeferredTotal` (`replication_manager_metrics_replicate_container_cmds_deferred_total`): The number of replication commands deferred because source Datanodes were overloaded. If this value is high, it might indicate that the source Datanodes (including the decommissioning one) are too busy.
 
-#### Datanode-side Metrics (`MeasuredReplicator` metrics)
+#### Datanode-side Metrics
 
-These metrics are available on each Datanode. For a decommissioning node, they show its activity as a source of replicas. For other nodes, they show their activity as targets. The name in parentheses is the corresponding Prometheus metric name.
+These metrics are available on each DataNode. Together, `MeasuredReplicator` metrics describe transfer-level outcomes for replication work (for a decommissioning node this is mainly as a replica source; for other nodes they reflect activity as replication targets). `ReplicationSupervisorMetrics` track replication and reconstruction tasks managed by the Replication Supervisor (queues, lifecycle counts, and concurrency limits). The name in parentheses is the corresponding Prometheus metric name.
+
+##### `MeasuredReplicator` metrics
 
 - `success` (`measured_replicator_success`): The number of successful replication tasks.
 - `successTime` (`measured_replicator_success_time`): The total time spent on successful replication tasks.
@@ -102,6 +104,17 @@ These metrics are available on each Datanode. For a decommissioning node, they s
 - `failureBytes` (`measured_replicator_failure_bytes`): The total bytes that failed to be transferred.
 - `queueTime` (`measured_replicator_queue_time`): The total time tasks spend in the replication queue. A high value might indicate the Datanode is overloaded.
 
+##### `ReplicationSupervisorMetrics`
+
+- `numInFlightReplications` (`replication_supervisor_metrics_num_in_flight_replications`): Total number of pending replications and reconstructions (both low and normal priority).
+- `numQueuedReplications` (`replication_supervisor_metrics_num_queued_replications`): Number of replication tasks currently in the queue.
+- `numRequestedReplications` (`replication_supervisor_metrics_num_requested_replications`): Total number of replication tasks requested.
+- `numSuccessReplications` (`replication_supervisor_metrics_num_success_replications`): Total number of successful replication tasks.
+- `numFailureReplications` (`replication_supervisor_metrics_num_failure_replications`): Total number of failed replication tasks.
+- `numTimeoutReplications` (`replication_supervisor_metrics_num_timeout_replications`): Number of replication requests that timed out before being processed.
+- `numSkippedReplications` (`replication_supervisor_metrics_num_skipped_replications`): Number of replication requests skipped (for example, if the container is already present).
+- `maxReplicationStreams` (`replication_supervisor_metrics_max_replication_streams`): Maximum number of concurrent replication tasks allowed to run simultaneously.
+
 By monitoring these metrics, administrators can get a clear picture of the decommissioning progress and identify potential bottlenecks.
 
 ## Removing Decommissioned DataNodes from the List