Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -67,5 +67,5 @@ A typical production Ozone cluster includes the following services:
### Ozone Configuration

- **Monitoring**: Install [Prometheus](../../operations/observability/prometheus) and [Grafana](../../operations/observability/grafana) for monitoring the Ozone cluster. For audit logs, consider using a log ingestion framework such as the ELK Stack (Elasticsearch, Logstash, and Kibana) with FileBeat, or other similar frameworks. Alternatively, you can use Apache Ranger to manage audit logs.
- **Pipeline Limits**: Increase the number of allowed write pipelines to better suit your workload by adjusting `ozone.scm.datanode.pipeline.limit`. See the [Multi-Raft](./multi-raft) documentation for the formula to calculate appropriate pipeline limits based on your metadata disk configuration.
- **Pipeline Limits**: Increase the number of allowed write pipelines to better suit your workload by adjusting `ozone.scm.datanode.pipeline.limit`. See [Calculating Ratis Pipeline Limits](./calculating-ratis-pipeline-limits) for the sizing formula, trade-offs, and production guidance. See [Multi-Raft](./multi-raft) for Datanode metadata directory configuration.
- **Heap Sizes**: Configure sufficient heap sizes for Ozone Manager (OM), Storage Container Manager (SCM), Recon, Datanode, S3 Gateway (S3G), and HttpFS services to ensure stability.
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,8 @@
sidebar_label: Calculating Ratis Pipeline Limits
---

import RatisPipelineCalculator from '@site/src/components/RatisPipelineCalculator';

# Calculating Ratis Pipeline Limits

ReplicationFactor.THREE is controlled by three configuration properties that limit the
Expand Down Expand Up @@ -51,10 +53,59 @@ And the configuration is:

SCM will attempt to create and maintain approximately **24** open, FACTOR_THREE Ratis pipelines.

**Production Recommendation:**
<RatisPipelineCalculator />

## Sizing Trade-offs

Pipeline limits balance write throughput against Datanode resource usage. Each open Ratis pipeline is one Ratis
replication group, so the number of pipelines directly affects how much concurrent write capacity the cluster
exposes.

**Benefits of more pipelines**

- **Higher write parallelism**: Each pipeline is an independent Ratis group, so more pipelines allow more
concurrent write streams across the cluster.
- **Lower write latency under contention**: With more pipelines, different clients are less likely to share the
same Ratis group, reducing queueing on a single Raft leader.

**Costs of too many pipelines**

- **Datanode resource pressure**: Each pipeline consumes CPU, memory, and metadata-disk I/O on the Datanodes
that participate in it. Very high pipeline counts can overload individual nodes.
- **Follower lag and tail latency**: When a Datanode hosts many Raft groups, followers can fall behind on log
replication (index lag). This delays WATCH responses and increases write tail latency for pipelines on that
node.

### Ozone vs HDFS write pipelines

Ozone and HDFS take different approaches to grouping Datanodes for replicated writes:

| | Ozone (Ratis pipelines) | HDFS (stateless pipelines) |
| --- | --- | --- |
| **Pipeline model** | SCM creates and tracks a bounded set of explicit Ratis pipelines | Any three Datanodes can form an ad hoc "pipeline" for three block replicas |
| **Control & visibility** | Centralized limits and monitoring via SCM (`ozone admin pipeline list`, SCM UI) | No equivalent cluster-wide pipeline registry |
| **Load spreading** | Bounded by configured pipeline limits; may cap peak write throughput | Blocks can spread across any DNs, spreading load more evenly |
| **Operational trade-off** | More predictable capacity planning; may sacrifice some peak throughput | Higher potential throughput; harder to observe and cap concurrent write groups |

![Ozone Ratis pipelines vs HDFS stateless block placement](ratis-vs-hdfs-pipelines.png)

## Production Recommendation

For most production deployments, start with the dynamic per-disk limit (`ozone.scm.datanode.pipeline.limit=0`).
This lets pipeline capacity scale naturally with metadata disk resources. A good starting value for
`ozone.scm.pipeline.per.metadata.disk` is **2**.

Increase pipeline limits only when you observe write contention—for example, many clients sharing the same
pipelines or sustained write queueing—and Datanode resources (CPU, memory, metadata-disk I/O) have headroom.
Use `ozone.scm.ratis.pipeline.limit` as a safety cap rather than the primary tuning knob.

Monitor the **Pipeline Statistics** section in the SCM web UI, or run `ozone admin pipeline list`, to confirm
the actual number of open pipelines aligns with your configured targets. Watch for follower index lag or
elevated resource usage on Datanodes as signals that pipeline counts may be too high.

## Related Topics

For most production deployments, using the dynamic per-disk limit (`ozone.scm.datanode.pipeline.limit=0`) is
recommended, as it allows the cluster to scale pipeline capacity naturally with its resources. You can use the
global limit (`ozone.scm.ratis.pipeline.limit`) as a safety cap if needed. A good starting value for
`ozone.scm.pipeline.per.metadata.disk` is **2**. Monitor the section **Pipeline Statistics** in SCM web UI, or run
the command `ozone admin pipeline list` to see if the actual number of pipelines aligns with your configured targets.
Limits on this page apply to **open Ratis pipelines** (ReplicationFactor.THREE). The number of OPEN containers
and containers per pipeline are separate concerns and are not covered here. For background on Multi-Raft and
Datanode metadata directories, see [Multi-Raft](./multi-raft). For EC pipeline sizing, see
[Calculating EC Pipeline Limits](./calculating-ec-pipeline-limits).
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
112 changes: 112 additions & 0 deletions src/components/RatisPipelineCalculator/calculateRatisPipelineLimit.js
Original file line number Diff line number Diff line change
@@ -0,0 +1,112 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/

const REPLICATION_FACTOR = 3;

function toPositiveInt(value, fallback = 0) {
const parsed = Number.parseInt(String(value), 10);
if (!Number.isFinite(parsed) || parsed < 0) {
return fallback;
}
return parsed;
}

/**
* Calculate SCM's target number of open FACTOR_THREE Ratis pipelines.
*
* Mirrors the formulas documented on the Calculating Ratis Pipeline Limits page.
*/
export function calculateRatisPipelineLimit({
useDynamicLimit,
datanodePipelineLimit,
pipelinesPerMetadataDisk,
globalLimit,
diskGroups,
healthyDatanodes,
}) {
const steps = [];
let rawTarget = 0;

if (useDynamicLimit) {
const perDisk = Math.max(toPositiveInt(pipelinesPerMetadataDisk, 2), 1);
const groups = Array.isArray(diskGroups) ? diskGroups : [];
let totalSlots = 0;

groups.forEach((group, index) => {
const datanodeCount = toPositiveInt(group.datanodeCount, 0);
const metadataDisks = Math.max(toPositiveInt(group.metadataDisks, 0), 1);
const groupSlots = datanodeCount * perDisk * metadataDisks;
totalSlots += groupSlots;

if (datanodeCount > 0) {
steps.push({
label: `Group ${index + 1} pipeline slots`,
formula: `${datanodeCount} datanodes * (${perDisk} pipelines/disk * ${metadataDisks} disks/datanode)`,
value: groupSlots,
});
}
});

rawTarget = Math.floor(totalSlots / REPLICATION_FACTOR);
steps.push({
label: 'Raw target from dynamic limit',
formula: `(${totalSlots}) / ${REPLICATION_FACTOR}`,
value: rawTarget,
});
} else {
const perDatanodeLimit = Math.max(toPositiveInt(datanodePipelineLimit, 2), 1);
const datanodes = toPositiveInt(healthyDatanodes, 0);
const totalSlots = perDatanodeLimit * datanodes;
rawTarget = Math.floor(totalSlots / REPLICATION_FACTOR);

steps.push({
label: 'Raw target from fixed datanode limit',
formula: `(${perDatanodeLimit} * ${datanodes} healthy datanodes) / ${REPLICATION_FACTOR}`,
value: rawTarget,
});
}

const cap = toPositiveInt(globalLimit, 0);
let finalTarget = rawTarget;
let globalLimitApplied = false;

if (cap > 0) {
finalTarget = Math.min(rawTarget, cap);
globalLimitApplied = finalTarget < rawTarget;
steps.push({
label: 'Compare with global limit',
formula: `min(${rawTarget}, ${cap})`,
value: finalTarget,
});
}

return {
steps,
rawTarget,
finalTarget,
globalLimitApplied,
};
}

export function createDefaultDiskGroups() {
return [
{ id: 'group-1', datanodeCount: 8, metadataDisks: 4 },
{ id: 'group-2', datanodeCount: 2, metadataDisks: 2 },
];
}
Loading
Loading