diff --git a/docs/05-administrator-guide/03-operations/03-node-decommissioning-and-maintenance/03-datanodes/01-datanode-decommission.md b/docs/05-administrator-guide/03-operations/03-node-decommissioning-and-maintenance/03-datanodes/01-datanode-decommission.md
index f5537557a2..633deeaec9 100644
--- a/docs/05-administrator-guide/03-operations/03-node-decommissioning-and-maintenance/03-datanodes/01-datanode-decommission.md
+++ b/docs/05-administrator-guide/03-operations/03-node-decommissioning-and-maintenance/03-datanodes/01-datanode-decommission.md
@@ -90,9 +90,11 @@ These metrics are available on the SCM and provide a cluster-wide view of the re
 - `replicasCreatedTotal` (`replication_manager_metrics_replicas_created_total`): The total number of container replicas successfully created.
 - `replicateContainerCmdsDeferredTotal` (`replication_manager_metrics_replicate_container_cmds_deferred_total`): The number of replication commands deferred because source Datanodes were overloaded. If this value is high, it might indicate that the source Datanodes (including the decommissioning one) are too busy.
 
-#### Datanode-side Metrics (`MeasuredReplicator` metrics)
+#### Datanode-side Metrics
 
-These metrics are available on each Datanode. For a decommissioning node, they show its activity as a source of replicas. For other nodes, they show their activity as targets. The name in parentheses is the corresponding Prometheus metric name.
+These metrics are available on each DataNode. Together, `MeasuredReplicator` metrics describe transfer-level outcomes for replication work (for a decommissioning node this is mainly as a replica source; for other nodes they reflect activity as replication targets). `ReplicationSupervisorMetrics` track replication and reconstruction tasks managed by the Replication Supervisor (queues, lifecycle counts, and concurrency limits). The name in parentheses is the corresponding Prometheus metric name.
+
+##### `MeasuredReplicator` metrics
 
 - `success` (`measured_replicator_success`): The number of successful replication tasks.
 - `successTime` (`measured_replicator_success_time`): The total time spent on successful replication tasks.
@@ -102,6 +104,17 @@ These metrics are available on each Datanode. For a decommissioning node, they s
 - `failureBytes` (`measured_replicator_failure_bytes`): The total bytes that failed to be transferred.
 - `queueTime` (`measured_replicator_queue_time`): The total time tasks spend in the replication queue. A high value might indicate the Datanode is overloaded.
 
+##### `ReplicationSupervisorMetrics`
+
+- `numInFlightReplications` (`replication_supervisor_metrics_num_in_flight_replications`): Total number of pending replications and reconstructions (both low and normal priority).
+- `numQueuedReplications` (`replication_supervisor_metrics_num_queued_replications`): Number of replication tasks currently in the queue.
+- `numRequestedReplications` (`replication_supervisor_metrics_num_requested_replications`): Total number of replication tasks requested.
+- `numSuccessReplications` (`replication_supervisor_metrics_num_success_replications`): Total number of successful replication tasks.
+- `numFailureReplications` (`replication_supervisor_metrics_num_failure_replications`): Total number of failed replication tasks.
+- `numTimeoutReplications` (`replication_supervisor_metrics_num_timeout_replications`): Number of replication requests that timed out before being processed.
+- `numSkippedReplications` (`replication_supervisor_metrics_num_skipped_replications`): Number of replication requests skipped (for example, if the container is already present).
+- `maxReplicationStreams` (`replication_supervisor_metrics_max_replication_streams`): Maximum number of concurrent replication tasks allowed to run simultaneously.
+
 By monitoring these metrics, administrators can get a clear picture of the decommissioning progress and identify potential bottlenecks.
 
 ## Removing Decommissioned DataNodes from the List