Skip to content

ref(rpc): always apply a limit to GetTrace queries#8108

Draft
phacops wants to merge 2 commits into master from claude/gracious-maxwell-bkyzac

@phacops phacops commented Jun 25, 2026

Contributor

Summary

EndpointGetTrace could issue unbounded ClickHouse queries. When a request arrived without a limit and no global cap was configured — which is the default, since ENDPOINT_GET_TRACE_PAGINATION_MAX_ITEMS = 0 ("0 means no limit") — or whenever pagination was disabled, limit was left as None and the query ran with no LIMIT clause at all. These unbounded queries are the worst offenders for both ClickHouse and granian memory.

This came out of investigating the snuba.rpc.eap_trace_request_without_limit metric, which turned out to be doubly misleading:

  • It was emitted unconditionally on every GetTrace request, so it tracked total call volume, not requests without a limit (~3.5M/week, exclusively endpoint_name:endpointgettrace).
  • The sibling endpoints already cap themselves (the trace-item table resolver and GetTraces both fall back to a default row limit), so GetTrace was the one genuinely emitting LIMIT-less SQL.

Changes

  • Always bound the query. _get_pagination_limit now always returns a positive int, falling back to a default row limit (_DEFAULT_ROW_LIMIT = 10_000, matching the sibling table/traces endpoints) when there is neither a user-supplied limit nor a configured cap.
  • Apply the limit even when pagination is disabled. The limit is now computed regardless of the enable_trace_pagination flag; only the cross-item decrement / page-token logic stays gated on pagination being enabled. (Trade-off: with pagination disabled and a trace larger than the cap, results are truncated to the cap with no page token — bounded beats unbounded, and pagination is on by default.)
  • Fix the misleading metric. eap_trace_request_without_limit now increments only when the request arrives without a user-supplied limit (in_msg.limit <= 0), so the name is accurate and it gives real signal on how often the default cap is relied upon, per referrer.
  • Updated the ENDPOINT_GET_TRACE_PAGINATION_MAX_ITEMS doc comment to reflect that <= 0 now means "fall back to the default" rather than "no limit".
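The fallback and the metric condition described above can be sketched as follows. This is a minimal illustration, not the code in the PR: the function names, the argument order, and the choice to take the smaller of the user limit and the configured cap when both are positive are assumptions. Only the `10_000` default and the `in_msg.limit <= 0` condition come from the description.

```python
# Value taken from the PR description; everything else here is a hypothetical sketch.
_DEFAULT_ROW_LIMIT = 10_000


def resolve_pagination_limit(user_limit: int, configured_max_items: int) -> int:
    """Hypothetical stand-in for _get_pagination_limit.

    Always returns a positive int. Values <= 0 are treated as "not set".
    Assumption: when both values are set, the smaller one wins.
    """
    positives = [v for v in (user_limit, configured_max_items) if v > 0]
    if not positives:
        # Neither a user limit nor a configured cap: fall back to the default.
        return _DEFAULT_ROW_LIMIT
    return min(positives)


def request_arrived_without_limit(user_limit: int) -> bool:
    """Mirrors the described metric condition (in_msg.limit <= 0)."""
    return user_limit <= 0
```

Under these assumptions, `resolve_pagination_limit(0, 0)` falls back to the default, while `resolve_pagination_limit(500, 100)` is capped at 100. The metric predicate depends only on the user-supplied limit, so a request that relies on a configured cap still counts as arriving without a limit.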

Out of scope (flagged for follow-up)

The GetTraces (plural) cross-item path technically has a LIMIT, but its default fallback is _TRACE_LIMIT = 50_000_000, a nominal cap so high that it is effectively unbounded for memory. It is left untouched here because it does emit a LIMIT clause, and lowering it has broader trace-explorer implications that deserve a separate discussion.

Test plan

  • New parametrized unit test test_get_pagination_limit_is_always_bounded covering the full truth table (no/negative/positive configured cap × with/without user limit) — asserts the result is always a positive int.
  • New integration test test_query_is_bounded_when_pagination_disabled spies on the built Snuba request and asserts the query carries _DEFAULT_ROW_LIMIT even with pagination disabled and no user limit.
  • Existing GetTrace / pagination tests still describe the same behavior (data sets are far below the 10k cap).
  • CI (full EAP suite requires ClickHouse; could not run locally in this environment).
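The truth-table check in the first bullet could look roughly like the sketch below. It is written as a plain loop over `itertools.product` rather than as the parametrized test in the PR, and the sample values, `check_always_bounded`, and `resolve_limit_stand_in` are hypothetical names used only for illustration; the real test targets `_get_pagination_limit`.

```python
from itertools import product
from typing import Callable

DEFAULT_ROW_LIMIT = 10_000  # value from the PR description

# Sample values chosen for illustration only.
CONFIGURED_CAPS = (0, -1, 500)  # no cap, negative cap, positive cap
USER_LIMITS = (0, 100)          # without / with a user-supplied limit


def check_always_bounded(resolve: Callable[[int, int], int]) -> int:
    """Run resolve(user_limit, cap) over the full truth table.

    Asserts every result is a positive int and returns the number of cases checked.
    """
    cases = list(product(CONFIGURED_CAPS, USER_LIMITS))
    for cap, user_limit in cases:
        result = resolve(user_limit, cap)
        assert isinstance(result, int) and result > 0, (cap, user_limit, result)
    return len(cases)


def resolve_limit_stand_in(user_limit: int, configured_cap: int) -> int:
    """Hypothetical stand-in for the function under test."""
    positives = [v for v in (user_limit, configured_cap) if v > 0]
    return min(positives) if positives else DEFAULT_ROW_LIMIT
```

Calling `check_always_bounded(resolve_limit_stand_in)` walks all six combinations. A resolver that simply echoes the user limit would fail on the cases where that limit is 0, which is the regression the test is meant to catch.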

🤖 Generated with Claude Code

https://claude.ai/code/session_01TheCLK7ZEBCnw5bc3Qboth



EndpointGetTrace could issue unbounded ClickHouse queries: when a request
arrived without a limit and no global cap was configured (the default,
ENDPOINT_GET_TRACE_PAGINATION_MAX_ITEMS = 0), or whenever pagination was
disabled, `limit` was left as None and the query ran with no LIMIT clause.
These unbounded queries are the worst offenders for ClickHouse and granian
memory.

Ensure every GetTrace query is bounded:
- `_get_pagination_limit` now always returns a positive int, falling back to
  a default row limit (10k, matching the sibling table/traces endpoints) when
  no user limit and no configured cap are present.
- The limit is computed and applied even when pagination is disabled; the
  cross-item decrement/page-token logic stays gated on pagination being on.

Also fix the misleading `eap_trace_request_without_limit` metric. It was
emitted unconditionally on every request, so it tracked total call volume
rather than requests without a limit. It now increments only when the request
arrives without a user-supplied limit, making the name accurate and giving
real signal on how often the default cap is relied upon.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01TheCLK7ZEBCnw5bc3Qboth
@phacops phacops requested review from a team as code owners June 25, 2026 20:45
@phacops phacops marked this pull request as draft June 25, 2026 22:47