Skip to content

Feat/analytics api hardening#2

Merged
Abdssamie merged 9 commits into
mainfrom
feat/analytics-api-hardening
Apr 22, 2026
Merged

Feat/analytics api hardening#2
Abdssamie merged 9 commits into
mainfrom
feat/analytics-api-hardening

Conversation

@Abdssamie

Copy link
Copy Markdown
Owner

No description provided.

@Abdssamie Abdssamie merged commit d6bb551 into main Apr 22, 2026
2 checks passed
@greptile-apps

greptile-apps Bot commented Apr 22, 2026

Copy link
Copy Markdown

Greptile Summary

This PR significantly hardens the analytics API by removing the ingestDedupes table, event-ID-based deduplication, status-based aggregation fields (aggregationStatus, aggregationAttempts, aggregationError), and the server-side site-config pathway from registerRoutes. Sites must now be created explicitly via createSite before any events can be ingested, making the security boundary explicit. The PR also introduces a new edge-hour query strategy that uses full-hour rollups minus a small raw-event correction for partial-hour boundaries, fixing a double-range counting bug in buildExactRangePlan, and adds comprehensive high-load and integrity tests.

Key changes:

  • Ingest simplification: Removes deduplication cache, event ID field, aggregation retry/failure state, and ensureSite auto-upsert path; ingestBatch now returns {accepted, rejected} (no duplicate field).
  • Aggregation refactor: aggregateEventBatchreducePendingSiteEvents using an aggregatedAt: null index instead of status-based index; removes retryFailedEvents and aggregatePending admin mutations.
  • Edge-hour analytics: buildEdgeHourPlan + choosePartialHourStrategy add a rollup-subtract-raw correction for partial-hour query boundaries, avoiding full raw scans for mostly-covered hours.
  • Admin API trimming: exposeAdminApi removes ensureSite, aggregatePending, retryFailedEvents, pruneExpired; createSite now takes writeKeyHash (pre-hashed) instead of writeKey.
  • Two concerns to address: (1) failed aggregation events are silently left with aggregatedAt: null — there is no error sentinel and no recovery path, so permanently failing events are retried on every future ingest batch for that site; (2) the subtract helpers (subtractTopRows, subtractOverviewTotals) can produce negative field values when unaggregated events exist in the excluded sub-range, yielding incorrect dashboard numbers.

Confidence Score: 3/5

Two P1 correctness issues need resolution before merge: failed-aggregation events have no sentinel state (causing unbounded retry churn), and the subtract helpers can produce negative counters in partial-hour edge queries.

The PR's architectural direction is sound — removing the deduplication table, status fields, and server-side site config all simplify the system meaningfully, and the new edge-hour query strategy is a genuine improvement. The high-load and integrity test coverage is strong. However, two concrete correctness gaps remain: (1) a failed event in aggregateEventsByIds is left with aggregatedAt:null indefinitely — every subsequent ingest batch for the site will re-attempt it with no backoff and no way to distinguish it from legitimately pending events; (2) subtractTopRows/subtractOverviewTotals can emit negative field values when unaggregated events exist in the excluded time slice, producing wrong dashboard numbers.

src/component/ingest.ts (failed-event sentinel), src/component/analytics.ts (subtractTopRows and subtractOverviewTotals negative-value guards)

Important Files Changed

Filename Overview
src/component/ingest.ts Removes deduplication table, event-ID field, failed-state tracking, and admin repair mutations; replaces status-based aggregation with a null/timestamp scheme — but failed events are silently left as permanently pending with no sentinel and no recovery path.
src/component/analytics.ts Adds edge-hour rollup strategy (partial-hour boundaries use rollup minus excluded raw slice) and fixes a double-range bug in buildExactRangePlan; subtract helpers can produce negative field values when pending events exist in the excluded sub-range.
src/component/schema.ts Drops ingestDedupes table and aggregation-status/attempt fields from events; aggregatedAt changes from optional number to optional union(number, null); existing rows with no aggregatedAt field are effectively invisible to the new pending-events index.
src/client/http.ts Removes server-side site config (sites/site options) from registerRoutes; ingest now relies purely on DB-resident sites, eliminating the ensureSite auto-create path.
src/client/api.ts Trims exposeAdminApi surface: removes ensureSite, aggregatePending, retryFailedEvents, pruneExpired, and dedupeRetentionMs; auth wrappers for retained mutations look correct.
example/convex/example.ts Adds setupDefaultSite mutation (idempotent create-or-rotate-key); removes eventId from ingest args; admin wrapper trimmed to match removed API surface.
example/src/loadHarness.ts New dev-only Playwright load harness that monkey-patches window.fetch to collect timing samples; patch is never torn down and stale SDK references may accumulate across HMR cycles.
src/component/high-load.test.ts New high-load correctness test with 120 visitors, shard fanout verification, and compaction; correctly exercises the full ingest → aggregate → compact → query pipeline.
src/component/analytics.integrity.test.ts Adds two new integrity tests for partial-hour top-sources and overview edge correction; tests are well-structured and exercise the new edge-hour plan logic.

Sequence Diagram

sequenceDiagram
    participant Browser
    participant HTTP as HTTP Ingest Route
    participant IB as ingestBatch (mutation)
    participant RPSE as reducePendingSiteEvents (internalMutation)
    participant AEB as aggregateEventsByIds
    participant DB as Convex DB

    Browser->>HTTP: POST /analytics/ingest (writeKey + events)
    HTTP->>HTTP: hashWriteKey(writeKey)
    HTTP->>IB: ctx.runMutation(ingestBatch, {writeKeyHash, events})
    IB->>DB: query sites by writeKeyHash
    DB-->>IB: site (active)
    IB->>DB: insert events (aggregatedAt: null)
    IB->>RPSE: scheduler.runAfter(0, reducePendingSiteEvents)
    IB-->>HTTP: {accepted, rejected}
    HTTP-->>Browser: 200 OK

    RPSE->>DB: query events where aggregatedAt=null (take 101)
    DB-->>RPSE: pendingEvents
    RPSE->>AEB: aggregateEventsByIds(eventIds)
    AEB->>DB: upsertVisitor, upsertSession
    AEB->>DB: accumulateRollupShards → flushRollupShards
    AEB->>DB: patch event.aggregatedAt = now
    AEB-->>RPSE: {aggregated, skipped, failed}
    alt hasMore
        RPSE->>RPSE: scheduler.runAfter(0, reducePendingSiteEvents)
    end

    note over DB: getOverview query uses buildEdgeHourPlan for partial-hour boundaries, buildExactRangePlan for full-hour/day rollups
Loading

Comments Outside Diff (4)

  1. src/component/ingest.ts, line 162-233 (link)

    P1 Failed events silently left as permanently pending

    When aggregateEventsByIds catches an exception for an event, the event's aggregatedAt remains null (the catch block only increments failed with no write to the event). This has two consequences:

    1. Infinite retry churn — every future call to reducePendingSiteEvents for this site (triggered by any new ingest batch) will pick up the failing event again, attempt aggregation, fail again, and repeat. For a permanently broken event (e.g. site deleted mid-flight, a schema mismatch, or a bug in the aggregation logic), this repeats until the raw-event retention cutoff removes the record.

    2. No recovery path — the old retryFailedEvents / aggregatePending admin mutations were removed; there is no operator-visible "failed" state and no way to distinguish "not yet aggregated" from "permanently failing".

    Consider stamping a sentinel value (e.g. aggregatedAt: -1) on permanent failures so the index predicate can exclude them:

    // in the catch block of aggregateEventsByIds:
    catch (error) {
      failed += 1;
      // Stamp a sentinel so this event is excluded from future scans
      await ctx.db.patch(event._id, { aggregatedAt: -1 });
    }

    And adjust the index predicate to .eq("aggregatedAt", null) (which would then correctly exclude the sentinel).

  2. src/component/analytics.ts, line 2125-2142 (link)

    P1 subtractTopRows allows a key with mixed-sign fields to persist

    The deletion guard if (current.count <= 0 && current.pageviewCount <= 0) only removes a key when both counters reach zero or below. If one goes negative while the other is still positive (for example a key with count: 5, pageviewCount: -1 after an over-subtraction), the entry survives with a negative field and is emitted to the caller.

    This can materialise when there are unaggregated ("pending") events in the excluded range: the rollup only counts aggregated events, but summarizeRawTopDimension counts all events. If new events in the excluded slice haven't been aggregated yet, the subtraction over-subtracts and pageviewCount (or count) can go below zero.

    A minimal guard would be to clamp both values:

    current.count = Math.max(0, current.count - row.count);
    current.pageviewCount = Math.max(0, current.pageviewCount - row.pageviewCount);
    if (current.count === 0 && current.pageviewCount === 0) {
      byKey.delete(row.key);
      continue;
    }

    The same issue exists in subtractOverviewTotals — the counters are subtracted without a floor.

  3. src/component/analytics.ts, line 2170-2194 (link)

    P1 subtractOverviewTotals can produce negative counters

    Like subtractTopRows, this function unconditionally subtracts all six fields with no floor. If unaggregated events exist in the excluded range but are not yet reflected in the hourly rollup, fields like bounceCount, sessionCount, or uniqueVisitorCount can go negative, producing nonsensical overview numbers (e.g. a negative bounce count fed into the bounce-rate calculation).

  4. example/src/loadHarness.ts, line 841-858 (link)

    P2 window.fetch patch is never torn down

    Once installAnalyticsLoadHarness is called with enabled: true, the global window.fetch is replaced permanently for the lifetime of the page. A subsequent call with enabled: false (or enabled: true when window.__analyticsLoadHarness already exists) just returns early, leaving the patched fetch in place. Because analytics is passed as an argument and captured via closure, if the SDK is re-created (e.g. during HMR), the harness may hold a stale reference.

    Consider exposing an uninstall() method that restores originalFetch, or documenting that the harness is intentionally permanent for the page session.

Reviews (1): Last reviewed commit: "refactor: simplify session state" | Re-trigger Greptile

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant