MarkdownOffice: Realtime Collaboration at Discord Scale

Deep System Design Analysis — 1000 Concurrent Editors on a Single Document

1. Core Architectural Challenges

1.1 The Fundamental Tension

Building realtime collaboration for a structured document graph (not a plain text editor) at 1000 concurrent users introduces a class of problems that are qualitatively different from both:

Plain text collaboration (Google Docs circa 2010)
Large-scale messaging (Discord, Slack)

MarkdownOffice sits at the intersection of both, with added complexity from:

Structured semantics: Block operations are not character insertions. Moving a block, changing a slide's layout, or updating a spreadsheet formula carries semantic weight.
Cross-document dependency graphs: An edit in Doc A can cascade to Doc B, C, and N via cross-references, formula dependencies, and AI-generated content.
Multi-agent editing: AI agents are first-class citizens in the edit graph, capable of generating bursts of 10–100 operations from a single user action.

1.2 The 1000-Editor Reality Check

At steady-state with 1000 concurrent editors:

Inbound operations:
  - User ops: 1000 users × 2 ops/sec = 2,000 ops/sec
  - Cursor/presence: 1000 users × 5 events/sec = 5,000 events/sec
  - AI agent ops: assume 10% trigger agents, 30 downstream ops = 6,000 ops/sec
  Total inbound: ~13,000 events/sec

Naive fanout (broadcast to all):
  - 13,000 × 999 recipients = ~13,000,000 messages/sec
  - At 200 bytes/message = 2.6 GB/s outbound
  - COMPLETELY IMPOSSIBLE without radical filtering

The problem is not sending operations — it is fanout explosion.

1.3 Structured Document-Specific Challenges

Challenge	Plain Text Editor	MarkdownOffice Block Graph
Conflict unit	Character insertion	Block operation (move, resize, nest, split)
CRDT compatibility	Well-solved (RGA, YATA)	Partially solved — tree CRDTs are young
Operation complexity	Insert/delete	Insert, delete, move, reparent, split, merge, transform
Dependency tracking	None	Cross-block, cross-document
Semantic validation	None required	Formula integrity, slide consistency, citation validity

2. Why Naive Approaches Fail

2.1 Why Single-Process Phoenix Channel Fails

# DO NOT DO THIS
defmodule MarkdownOffice.DocumentChannel do
  use Phoenix.Channel

  def handle_in("operation", op, socket) do
    # This becomes a bottleneck at ~100 users
    # Single process mailbox receives 13,000 msgs/sec
    # BEAM scheduler cannot keep up
    # Mailbox grows unbounded → OOM → crash
    broadcast!(socket, "operation", op)
    {:noreply, socket}
  end
end

Why it fails:

A single GenServer processes messages sequentially — one mailbox
At 13,000 events/sec, a process handling each in ~10µs can only do ~100,000/sec — fine on paper
But broadcast to 1000 sockets from one process creates a synchronous fanout bottleneck
Phoenix's broadcast! goes through PubSub, but the topic subscription list iteration for 1000 subscribers is expensive
Memory explosion: Document state + operation history + 1000 subscription refs in one process

2.2 Why Naive OT Fails at Scale

Operational Transformation requires a central authority to assign operation order:

Client A sends op at t=100ms
Client B sends op at t=101ms
Server must:
  1. Receive both
  2. Pick an order
  3. Transform one against the other
  4. Broadcast transformed versions

At 1000 clients × 2 ops/sec = 2000 transforms/sec at the server
Each transform requires O(history_length) lookups
With long sessions: history grows → transforms become O(n²)

More fundamentally: OT's correctness depends on linearizing all operations through a single coordinator. At 1000 users, that coordinator is both a bottleneck and a single point of failure.

2.3 Why Raw CRDT Sync Fails for Structured Documents

Standard CRDTs (Yjs, Automerge) were designed for sequential text or key-value maps. The MarkdownOffice block graph is a tree CRDT with positional semantics — a fundamentally harder problem:

Problem: Two users simultaneously reparent the same block

User A: Move Block-7 under Block-3
User B: Move Block-7 under Block-5

Naive CRDT merge: Block-7 appears under BOTH Block-3 and Block-5
Result: Duplicate block in the document tree — INVALID STATE

Standard Yjs cannot handle this. Tree CRDTs that prevent cycle creation and duplication (Kleppmann et al., 2022 "A highly-available move operation") exist but are complex to implement with block-level semantics.

2.4 Why Syncing Raw Markdown Fails

Markdown is NOT a stable sync format:

1. Parser instability: Different parsers produce different ASTs from same string
   "# Hello\n\nWorld" vs "# Hello\n\nWorld\n" → different ASTs

2. AST invalidation: Any edit can re-parse the entire document
   Edit one character in a table cell → entire table re-parsed
   
3. Whitespace non-determinism: Indentation, blank lines carry semantic weight
   in some Markdown variants but not others

4. MarkdownSheet formulas: =SUM(A1:A10) has no natural Markdown representation
   Any round-trip through Markdown loses formula semantics

5. Block ID loss: After Markdown round-trip, stable block IDs are gone
   Version history becomes unreplayable

Markdown is an I/O format only. The sync layer must never see it.

3. Recommended Collaboration Model

3.1 Block-Level Delta CRDT with Semantic Operations

The recommended model is a hybrid that does not exist as an off-the-shelf library — you must implement it:

┌─────────────────────────────────────────────────┐
│           Collaboration Layer Stack              │
├─────────────────────────────────────────────────┤
│  L4: Semantic Operation Layer                   │
│      (move_block, split_block, update_formula)  │
├─────────────────────────────────────────────────┤
│  L3: Block-Level CRDT Layer                     │
│      (tree structure, block ordering)           │
├─────────────────────────────────────────────────┤
│  L2: Text CRDT Layer (within blocks)            │
│      (character-level, Yjs YATA compatible)     │
├─────────────────────────────────────────────────┤
│  L1: Metadata LWW Layer                         │
│      (block properties, styles, metadata)       │
├─────────────────────────────────────────────────┤
│  L0: Transport / Ordering Layer                 │
│      (HLC timestamps, causal delivery)          │
└─────────────────────────────────────────────────┘

3.2 Operation Taxonomy

// All operations are sealed variants with full type information
type BlockOperation =
  // Structure operations — must go through CRDT tree
  | { type: "insert_block"; parentId: BlockId; afterId: BlockId | null; block: Block }
  | { type: "delete_block"; id: BlockId }
  | { type: "move_block"; id: BlockId; newParentId: BlockId; afterId: BlockId | null }
  | { type: "split_block"; id: BlockId; at: TextPosition; newBlockId: BlockId }
  | { type: "merge_blocks"; id: BlockId; nextId: BlockId }

  // Content operations — delegate to text CRDT within block
  | { type: "edit_text"; blockId: BlockId; delta: YjsDelta }

  // Property operations — Last Write Wins
  | { type: "set_property"; blockId: BlockId; key: string; value: unknown; hlc: HLC }

  // Presence operations — ephemeral, NOT persisted
  | { type: "cursor_move"; userId: UserId; blockId: BlockId; offset: number }
  | { type: "viewport_change"; userId: UserId; visibleBlocks: BlockId[] }

// Operations carry Hybrid Logical Clocks for causal ordering
type OperationEnvelope = {
  op: BlockOperation;
  hlc: HLC;           // Hybrid Logical Clock timestamp
  userId: UserId;
  sessionId: string;
  causedBy: HLC[];    // Causal dependencies (for non-commutative ops)
}

3.3 The Move Problem — Solved

For the critical "concurrent move" problem in the tree CRDT:

Algorithm (from Kleppmann 2022, adapted for MarkdownOffice):

Operation: Move(nodeId, newParentId, afterId, opId)

Invariants to maintain:
  1. No cycles (nodeId cannot be ancestor of newParentId)
  2. No duplication (node appears exactly once in tree)
  3. Deterministic: same set of operations → same tree, regardless of order

Implementation:
  - Every node tracks: {id, parentId, position_clock, tombstoned}
  - Position is a Fractional Index (LSEQ-style) within sibling list
  - On concurrent moves of same node:
    - Only the move with LOWER HLC timestamp "wins" (is applied)
    - The other move is discarded (logged for audit, not applied)
    - This is a deliberate LWW-over-position policy

This means: last mover loses (the earlier, more "committed" move wins)
The alternative (last-writer-wins on position) leads to phantom copies.

4. Recommended Distributed Architecture

4.1 The Core Insight: Partition by Block Range

A single document with 1000 editors should never be managed by a single process. Instead:

Document = Collection of Block Partitions
         = P₁ ∪ P₂ ∪ ... ∪ Pₙ

Each partition:
  - Contains a contiguous range of blocks (by document order)
  - Has its own GenServer (BlockPartitionServer)
  - Has its own PubSub topic
  - Handles ~50–200 blocks

A user subscribes ONLY to partitions whose blocks are in their viewport

Why this works:

1000 users editing a 2000-block document → each user sees ~50 blocks at a time
Each BlockPartitionServer handles ~30-50 active subscribers (not 1000)
Fanout per operation: ~50 recipients × operation broadcast (not 999)
System scales horizontally by adding more partition processes

4.2 Partition Strategy for Document Types

MarkdownDoc: Partition by heading sections
  Doc → [Section-1: blocks 1-80] [Section-2: blocks 81-160] ...

MarkdownSlide: Partition by slide
  Deck → [Slide-1] [Slide-2] ... [Slide-N]
  (Natural partition boundaries — users view one slide at a time)

MarkdownSheet: Partition by row range + column range (2D sharding)
  Sheet → [A1:Z100] [A101:Z200] ... (row partitions)
         → [A1:M500] [N1:Z500] ... (column partitions for wide sheets)
  Cell dependencies can cross partitions → requires dependency tracking

4.3 Message Flow Architecture

User Operation Flow:

[Client] 
  ↓ WebSocket (Phoenix Channel)
[UserSession Process — 1 per connection]
  ↓ validate, assign HLC, check ACL
[DocumentCoordinator — 1 per document]
  ↓ determine affected partition(s)
[BlockPartitionServer — 1 per partition]
  ↓ apply CRDT operation to partition state
  ↓ broadcast to subscribers of this partition
[UserSession Processes of subscribers]
  ↓ filter by user's viewport
[Client] ← only receives ops for visible blocks

4.4 Viewport Subscription Registry

defmodule MarkdownOffice.ViewportRegistry do
  # Each user registers which blocks they're viewing
  # Operations are filtered BEFORE fanout, not after
  
  @type viewport :: %{
    user_id: String.t(),
    document_id: String.t(),
    visible_blocks: MapSet.t(),      # Block IDs currently on screen
    subscribed_partitions: [atom()]  # PubSub topics
  }
  
  def update_viewport(user_id, visible_blocks) do
    current = get_viewport(user_id)
    new_partitions = blocks_to_partitions(visible_blocks)
    old_partitions = current.subscribed_partitions
    
    # Subscribe to newly visible partitions
    (new_partitions -- old_partitions)
    |> Enum.each(&Phoenix.PubSub.subscribe(MO.PubSub, partition_topic(&1)))
    
    # Unsubscribe from no-longer-visible partitions  
    (old_partitions -- new_partitions)
    |> Enum.each(&Phoenix.PubSub.unsubscribe(MO.PubSub, partition_topic(&1)))
    
    update_registry(user_id, %{visible_blocks: visible_blocks, subscribed_partitions: new_partitions})
  end
end

5. Recommended Elixir/Erlang Topology

5.1 Supervision Tree

MarkdownOffice.Application (root supervisor)
├── MarkdownOffice.PubSub (Phoenix.PubSub :pg adapter, partitioned)
├── MarkdownOffice.ClusterSupervisor
│   ├── MarkdownOffice.NodeRegistry      (libcluster auto-discovery)
│   └── MarkdownOffice.HLCServer         (global HLC synchronization)
├── MarkdownOffice.Endpoint              (Phoenix Endpoint)
│   └── [1000x UserSession processes per hot document]
└── MarkdownOffice.DocumentSupervisor (DynamicSupervisor)
    └── per document:
        ├── DocumentCoordinator          (1 per document, global state)
        ├── DocumentSupervisor           (nested DynamicSupervisor)
        │   ├── BlockPartitionServer[0]  (blocks 0–99)
        │   ├── BlockPartitionServer[1]  (blocks 100–199)
        │   └── ... (N partitions)
        ├── PresenceServer               (cursor/viewport state only)
        ├── OperationLog                 (event source buffer)
        └── SnapshotWorker               (async, periodic snapshotting)

5.2 Process Design Details

defmodule MarkdownOffice.BlockPartitionServer do
  use GenServer
  
  # State is a partition-local view of the document
  defstruct [
    partition_id: nil,
    document_id: nil,
    blocks: %{},              # %{block_id => Block}
    crdt_state: nil,          # RGA/YATA state per block
    sibling_order: nil,       # Fractional index tree for ordering
    hlc: nil,                 # Partition-local HLC
    operation_buffer: [],     # Unacked operations
    subscribers: MapSet.new() # Active viewport subscribers
  ]
  
  # Critical: process mailbox must never overflow
  # Operations arrive at ~200/sec per partition (from all users)
  # Each operation must complete in <1ms to maintain 200hz throughput
  
  def handle_cast({:apply_op, envelope}, state) do
    with :ok <- validate_causality(envelope, state),
         {:ok, new_state} <- apply_crdt_op(envelope.op, state),
         :ok <- broadcast_to_subscribers(envelope, new_state) do
      
      # Async write to operation log — don't block the fast path
      Task.start(fn -> OperationLog.append(envelope) end)
      {:noreply, new_state}
    else
      {:error, :out_of_order} ->
        # Buffer and retry after causal dep is received
        {:noreply, buffer_pending(envelope, state)}
    end
  end
  
  # Back-pressure: if mailbox > threshold, apply flow control
  def handle_call(:get_pressure, _from, state) do
    {:reply, Process.info(self(), :message_queue_len), state}
  end
end

5.3 Why NOT 1 Process per Block

The tempting architecture — one GenServer per block — fails:

1 process per block:
  - 10,000 block document = 10,000 processes
  - Each process: ~3KB baseline memory = 30MB just for processes
  - Cross-block operations (merge_blocks, move_block) require 
    distributed transactions across 2+ processes
  - Two-phase commit introduces latency and complexity
  - Process scheduling overhead at 10k concurrent processes 
    under load is measurable

BEAM is excellent at 100k+ processes, but inter-process 
coordination for transactional ops is still expensive.

Verdict: 1 process per PARTITION (50-200 blocks) is the sweet spot.
  - For a 2000-block doc: ~20–40 partition processes
  - Cross-partition ops (move block across section boundary) are rare
  - Can be handled via partition coordinator with optimistic locking

5.4 Cluster Topology for Hot Documents

For a "hot" document (1000 concurrent editors):

Option A: All partition processes on ONE node
  - Simpler routing (local process calls, no distribution overhead)
  - Risk: single node failure kills the document
  - 1000 WebSocket connections → 1 node saturated

Option B: Partition processes distributed across cluster
  - Document shard across 3-5 nodes (consistent hashing on partition_id)
  - WebSocket connections load-balanced (any node can connect)
  - Cross-node message via Distributed Erlang (~0.1ms latency)
  - RECOMMENDED for hot documents (>500 users)

Node roles:
  ┌──────────────────────────────────────┐
  │  Load Balancer (HAProxy / Nginx)     │
  └─────────┬────────────────────────────┘
            │
  ┌─────────▼────────────────────────────┐
  │  WS Nodes (websocket-optimized)      │
  │  ┌────────┐ ┌────────┐ ┌────────┐   │
  │  │ Node 1 │ │ Node 2 │ │ Node 3 │   │ ← 333 connections each
  └──┴────────┴─┴────────┴─┴────────┴───┘
            │ Distributed Erlang
  ┌─────────▼────────────────────────────┐
  │  Document Nodes (computation-heavy)  │
  │  ┌────────┐ ┌────────┐ ┌────────┐   │
  │  │ Node 4 │ │ Node 5 │ │ Node 6 │   │ ← partition servers
  └──┴────────┴─┴────────┴─┴────────┴───┘
            │
  ┌─────────▼────────────────────────────┐
  │  Storage Nodes                       │
  │  PostgreSQL (primary + 2 replicas)   │
  └──────────────────────────────────────┘

5.5 Phoenix PubSub at Scale

# Configure PubSub with pool for parallel dispatch
config :markdown_office, MarkdownOffice.PubSub,
  adapter: Phoenix.PubSub.PG2,
  pool_size: 10  # 10 parallel dispatch workers

# Use partitioned topics to reduce broadcast contention
# Instead of one topic per document:
#   "doc:abc123"  (1000 subscribers)
# Use one topic per partition:
#   "doc:abc123:partition:0"   (~30 subscribers)
#   "doc:abc123:partition:1"   (~25 subscribers)
#   ...

# Presence is a separate topic (different QoS)
#   "presence:abc123"  (all 1000 users — but throttled to 1hz)

6. Recommended CRDT Strategy

6.1 CRDT Selection Matrix

Layer	Data Type	CRDT Type	Implementation
Block ordering within parent	Sibling list	LSEQ / Fractional Index	Custom Elixir
Block tree structure	Nested tree	Move Tree CRDT (Kleppmann 2022)	Custom
Text within a block	Character sequence	YATA (Yjs-compatible)	Port Yjs to WASM/JS
Block properties (color, style)	Key-value	LWW-Register with HLC	Simple
Block metadata (type, version)	Key-value	LWW-Register with HLC	Simple
Comments / annotations	List	RGA	Custom

6.2 Fractional Index for Block Ordering

Instead of integer positions (which require renumbering on insert):

Block order uses fractional keys between 0 and 1:
  [0.1: Block-A] [0.5: Block-B] [0.8: Block-C]
  
Insert between A and B:
  new_key = (0.1 + 0.5) / 2 = 0.3
  [0.1: A] [0.3: new] [0.5: B] [0.8: C]

Concurrent inserts at same position:
  Two users insert at position 0.3 simultaneously
  Both generate different fractional keys (using user-id as tiebreaker)
  No conflict — both blocks appear, order determined by key comparison

Precision concern: After many inserts, keys approach equal values
Solution: Rebalancing operation (offline, periodic, logged as an operation)

6.3 Yjs Integration Strategy

Architecture decision: Run Yjs in Elixir via NIFs or as JS worker?

Option A: Port Yjs logic to Elixir
  + Native BEAM process, no FFI overhead
  + Can introspect CRDT state
  - Enormous implementation effort
  - Binary format may drift from Yjs spec

Option B: Run Yjs in Node.js sidecar
  + Full Yjs compatibility
  + Can sync with frontend Yjs directly
  - Latency (Elixir → Node.js → Elixir)
  - Operational complexity

Option C: Yjs on Frontend ONLY, server is authority for structure
  + Server handles block-level CRDT (tree)
  + Client handles text-level CRDT (character sequence within each block)
  + Server validates and relays Yjs updates without understanding them
  - Offline text edits require full Yjs merge on reconnect
  RECOMMENDED for v1

Option D: Custom BEAM CRDT for everything
  + Full control, no dependencies
  - 6-12 months of implementation
  - CRDT bugs are catastrophic and hard to detect
  Only after v2+ with dedicated CRDT engineering team

6.4 Recommended: Hybrid Yjs + Custom Tree CRDT

Text CRDT (within a block):
  - Use Yjs on frontend (JS)
  - Server opaquely relays Yjs Y.Doc updates between clients
  - Server stores Yjs binary snapshots per block
  - Server does NOT interpret Yjs state (just relay + store)

Structure CRDT (block tree):
  - Custom implementation in Elixir
  - Based on Kleppmann move operation paper
  - Operations: insert_block, delete_block, move_block
  - Server is authoritative for block tree structure
  - Frontend OPTIMISTICALLY applies structure ops, may rollback

Property CRDT (block metadata):
  - LWW-Register: last write wins using HLC
  - Server resolves concurrent writes deterministically
  - Properties: type, color, collapsed, width, etc.

7. Presence & Cursor Scaling Strategy

7.1 The Cursor Problem at Scale

Naive presence: 1000 users × 5 cursor updates/sec = 5000 events/sec
Each event fanout to 999 peers = 5,000,000 messages/sec
At 100 bytes/message = 500 MB/sec — IMPOSSIBLE

Optimizations applied in order of impact:

7.2 Four-Layer Presence Optimization

Layer 1: Throttling at Source

Client sends cursor position at max 4 Hz (250ms interval)
Viewport changes at max 2 Hz (500ms interval)
Typing indicators: binary true/false, not position

Effect: 5000 → 4000 events/sec (20% reduction — not enough alone)

Layer 2: Server-Side Cursor Aggregation

defmodule MarkdownOffice.PresenceServer do
  # Collect cursor updates in a 100ms window
  # Batch and broadcast once every 100ms
  
  defp flush_cursors(state) do
    # Build a diff: only changed cursors since last broadcast
    cursor_diff = compute_diff(state.last_broadcast_cursors, state.current_cursors)
    
    # Broadcast compressed delta (not all 1000 cursors)
    Phoenix.PubSub.broadcast(MO.PubSub, "presence:#{state.doc_id}", {
      :cursor_batch,
      cursor_diff,
      state.hlc
    })
    
    # 10 broadcasts/sec × (changed cursors only) ≈ 50-100 cursors per batch
    # vs naive: 5000 events/sec
  end
end

Layer 3: Viewport-Filtered Cursor Delivery

User A is viewing blocks 50-100
User A only receives cursor updates for users whose cursors are in blocks 50-100

Implementation:
  - PresenceServer knows each user's viewport (updated by client)
  - On cursor broadcast, filter by viewport intersection
  - Only deliver cursor if it's in recipient's visible area

Effect: ~50 out of 1000 cursors are visible to any given user
  → 1000 cursors × 4hz = 4000 events/sec inbound
  → Per user: ~200 relevant cursor events/sec delivered
  → Total outbound: 4000 × 50 (avg visible) = 200,000 msgs/sec (vs 5M)

Layer 4: Cursor Clustering

For high-density areas (e.g., 50 users on same paragraph):
  - Cluster nearby cursors into a single aggregated cursor
  - Show "42 users editing here" instead of 42 individual cursors
  - Client-side: only detailed cursors for nearest N users (e.g., 10)
  - Others shown as anonymous colored dots

Threshold: if >10 cursors within same block, cluster them

7.3 Presence Data Model

// Ephemeral — NOT event-sourced, NOT persisted to DB
type UserPresence = {
  userId: string;
  sessionId: string;           // For multi-device
  color: string;               // Consistent per user
  
  // Cursor (updated frequently, throttled to 4hz)
  cursor: {
    blockId: string;
    offset: number;            // Character offset within block
    anchorBlockId?: string;    // For multi-block selection
    anchorOffset?: number;
  } | null;
  
  // Viewport (updated on scroll, throttled to 2hz)
  viewport: {
    visibleBlockIds: string[]; // Currently visible blocks
    topBlockId: string;        // Block at top of viewport
  };
  
  // Activity
  isTyping: boolean;           // True if keystroke in last 3s
  lastSeenAt: number;          // Unix ms — for timeout
}

// Server-side: Phoenix.Presence handles distributed state
// Automatically handles node failures, merges presence from partitioned nodes

7.4 Phoenix.Presence Architecture for 1000 Users

Phoenix.Presence uses CRDT internally (PG2) across nodes.
For 1000 users on 1 document:

Problem: Phoenix.Presence broadcasts full presence diff on every change
  → 1000 users × changes → large diff payloads

Solution: Shadow Presence (presence metadata stripped down)
  - Presence only carries: userId, color, isActive, partition
  - Detailed cursor/viewport tracked in PresenceServer (custom)
  - Phoenix.Presence used only for join/leave/heartbeat
  - Cursor sync uses separate high-frequency topic

topic "presence:doc_id" → join/leave events only (low freq, full broadcast OK)
topic "cursors:doc_id:partition:N" → cursor updates (high freq, viewport filtered)

8. Storage & Persistence Strategy

8.1 Event Sourcing Architecture

-- Core event log (append-only)
CREATE TABLE document_operations (
  id            BIGSERIAL PRIMARY KEY,
  document_id   UUID NOT NULL,
  partition_id  INTEGER NOT NULL,
  hlc           BIGINT NOT NULL,          -- Hybrid Logical Clock
  user_id       UUID NOT NULL,
  session_id    VARCHAR(36) NOT NULL,
  op_type       VARCHAR(50) NOT NULL,     -- 'insert_block', 'edit_text', etc.
  op_data       JSONB NOT NULL,           -- Serialized operation
  caused_by     BIGINT[],                 -- Causal deps (HLC values)
  block_ids     UUID[],                   -- Affected block IDs (for filtering)
  created_at    TIMESTAMPTZ DEFAULT NOW()
);

-- Partition indexes for common queries
CREATE INDEX idx_doc_ops_document ON document_operations(document_id, hlc);
CREATE INDEX idx_doc_ops_partition ON document_operations(document_id, partition_id, hlc);
CREATE INDEX idx_doc_ops_blocks ON document_operations USING GIN(block_ids);

-- Snapshots (periodic full state captures)
CREATE TABLE document_snapshots (
  id            BIGSERIAL PRIMARY KEY,
  document_id   UUID NOT NULL,
  partition_id  INTEGER NOT NULL,
  snapshot_hlc  BIGINT NOT NULL,          -- HLC at time of snapshot
  blocks_json   JSONB NOT NULL,           -- Serialized block graph
  yjs_binary    BYTEA,                    -- Yjs Y.Doc binary per block
  op_count      INTEGER NOT NULL,         -- Number of ops since last snapshot
  created_at    TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_snapshots_doc ON document_snapshots(document_id, partition_id, snapshot_hlc DESC);

8.2 Operation Throughput at Scale

At 1000 users × 2 ops/sec = 2000 ops/sec

PostgreSQL write throughput:
  - Single table insert: ~5000-10000 rows/sec on NVMe
  - With JSONB: ~3000-5000 rows/sec
  - Problem: 2000 ops/sec is borderline, and AI agents add 6000 more

Solutions:
  1. Batch inserts: collect 100ms of ops, insert as single multi-row insert
     2000 ops/sec × 100ms = 200 ops per batch
     Batched: 10 inserts/sec instead of 2000 — well within Postgres capacity

  2. Operations table partitioned by document_id + created_at (range partition)
     Old partitions archived to cold storage (S3 Parquet)
     
  3. Write through Redis Stream first (low latency), async drain to Postgres
     Redis XADD: ~100k ops/sec capacity
     Worker drains Redis → Postgres in batches

8.3 Snapshotting Strategy

Snapshot triggers:
  - Every 500 operations per partition
  - Every 10 minutes (time-based)
  - On explicit user save
  - Before AI agent batch operations

Snapshot process (must be non-blocking):
  1. BlockPartitionServer receives :snapshot signal
  2. Spawn async Task with CURRENT state snapshot
  3. Task serializes to JSONB + Yjs binary
  4. Task writes to document_snapshots table
  5. Task logs new snapshot_hlc (for future replay baseline)
  
Replay from snapshot (for reconnect, new participant):
  1. Load latest snapshot for each partition (snapshot_hlc)
  2. Replay operations from snapshot_hlc to current (from operation log)
  3. Apply replayed ops to restore current state
  4. Average replay: 500 ops × 5ms/op = 2.5s max
  5. Fast path: < 50 ops since snapshot = <250ms replay

8.4 Operation Log Compaction

Problems with unbounded operation logs:
  - Storage cost: 2000 ops/sec × 100 bytes = 200KB/sec = 17GB/day
  - Replay time grows linearly with history

Compaction strategy:
  Retain ALL operations for 30 days (full history, undo support)
  After 30 days: compact to semantic changesets (lossy but auditable)
  After 1 year: retain only hourly snapshots
  
Selective operation pruning (safe to prune):
  - Intermediate cursor moves (keep only final positions)
  - Intermediate typing states (keep only committed versions)
  - Superseded property sets (keep only final value)
  NOT prunable: structural operations (inserts, deletes, moves)

9. AI Agent Integration Strategy

9.1 Agents as First-Class Operation Emitters

AI agents are modeled identically to users in the operation pipeline:
  - Each agent has a userId (synthetic: "agent:summarize:doc_id")
  - Agents emit BlockOperations through the same CRDT pipeline
  - Agents have presence (visible in "who's editing" list)
  - Agent operations carry metadata: reason, confidence, provenance

What's different about agents:
  - Operations may be ASYNC (no user waiting for response)
  - Agents can batch 10-100 operations in a single transaction
  - Agents can be PAUSED if they exceed rate limits
  - Agent operations go through a policy validator before application

9.2 Agent Operation Pipeline

[User Action] → [Policy Engine] → [Agent Trigger Evaluation]
                                          ↓ (if triggers)
                              [Agent Job Queue (Oban)]
                                          ↓ (async, after delay)
                              [Agent Worker Process]
                                          ↓ (generates N ops)
                              [Agent Op Batch]
                                          ↓
                              [Policy Validator] — rejects harmful ops
                                          ↓ (validated)
                              [BlockPartitionServer] ← same path as user ops
                                          ↓
                              [Broadcast to subscribers]
                              [Label: "🤖 Agent Name made changes"]

9.3 Preventing Downstream Operation Explosions

Problem: User edit → triggers AI agent → 100 ops → triggers more agents → cascade

Solution: Operation Budget System

type OperationContext = {
  originOpId: string;        // The user op that started this chain
  depth: number;             // How many agent hops deep we are
  opsRemaining: number;      // Budget remaining in this chain
}

Rules:
  - Root user operation: budget = 500 total downstream ops
  - Each agent hop decrements budget by its op count
  - If budget exhausted: agent op queued for next user save
  - Max depth: 3 agent hops (user → agent1 → agent2 → agent3, then stop)

Implementation:
  - Budget tracked in ETS (fast, per-node)
  - Budget keyed by originOpId
  - Agents check budget before executing
  - Budget expires after 30 seconds (TTL)

9.4 Agent Conflict Resolution

Scenario: User is editing Block-7 while agent is also modifying Block-7

Strategy: Optimistic agent, user always wins

1. Agent generates ops with hlc timestamp at time of agent start
2. User edit arrives with later hlc
3. CRDT merge: user edit applies on top of agent state
4. Agent's text edits: Yjs handles character-level merge correctly
5. Agent's structure ops: if agent moved Block-7, but user edited Block-7 content,
   move is preserved, content edit applies to moved block (correct behavior)

Edge case: Agent deleted Block-7, user edited Block-7 simultaneously
  - CRDT rule: Yjs tombstones the text, block delete wins
  - But: user's edits are captured as a pending op in the deleted block's history
  - Policy: if block deleted within 5s of user edit, offer "restore with edits"

10. Frontend Architecture

10.1 Document Rendering Model

The frontend document renderer must handle:
  - 2000+ blocks without DOM explosion
  - Continuous CRDT updates arriving at ~60hz
  - Cursor overlays for up to 1000 users (clustered)
  - Optimistic local edits with rollback capability

Architecture:
  ┌─────────────────────────────────────────┐
  │           Document View (SvelteKit)     │
  │                                         │
  │  VirtualizedBlockList                   │
  │  ├── renders only visible blocks        │
  │  ├── overscan: 10 blocks above/below    │
  │  └── block height estimation cache      │
  │                                         │
  │  CursorOverlay (canvas-based)           │
  │  ├── NOT DOM elements (too slow at 100+)│
  │  ├── Canvas 2D, 60fps animation frame   │
  │  └── clustered rendering for >10 cursors│
  │                                         │
  │  CRDT Worker (Web Worker)               │
  │  ├── runs Yjs off main thread           │
  │  ├── applies remote ops asynchronously  │
  │  └── posts reconciliation patches       │
  └─────────────────────────────────────────┘

10.2 Virtualized Block Renderer

<!-- VirtualizedBlockList.svelte -->
<script>
  import { onMount } from 'svelte';
  import { blockStore } from '$lib/stores/blocks';
  
  let containerHeight = 0;
  let scrollTop = 0;
  let estimatedHeights = new Map(); // blockId → height
  
  // Only render blocks in viewport + overscan
  $: visibleBlocks = computeVisibleBlocks(
    $blockStore.order,
    scrollTop,
    containerHeight,
    estimatedHeights,
    overscan = 15  // render 15 blocks beyond viewport edge
  );
  
  // Notify server of viewport change (throttled)
  $: debouncedViewportUpdate(visibleBlocks.map(b => b.id), 500);
  
  // Optimistic update: apply local edit immediately
  function applyLocalEdit(blockId, delta) {
    blockStore.optimisticUpdate(blockId, delta);
    // Send to server via WebSocket
    channel.push('operation', buildOp(blockId, delta));
  }
  
  // Rollback on server rejection
  channel.on('op_rejected', ({ opId, reason }) => {
    blockStore.rollback(opId);
    showConflictNotification(reason);
  });
</script>

<div bind:this={container} on:scroll={handleScroll}>
  <!-- Spacer for blocks above viewport (maintains scroll position) -->
  <div style="height: {topSpacerHeight}px" />
  
  {#each visibleBlocks as block (block.id)}
    <BlockRenderer {block} />
  {/each}
  
  <!-- Spacer for blocks below viewport -->
  <div style="height: {bottomSpacerHeight}px" />
</div>

10.3 CRDT Web Worker Architecture

// crdt-worker.js — runs off main thread
import * as Y from 'yjs';

const docs = new Map(); // blockId → Y.Doc

self.onmessage = ({ data }) => {
  const { type, payload } = data;
  
  switch (type) {
    case 'INIT_BLOCK': {
      const ydoc = new Y.Doc();
      if (payload.snapshot) {
        Y.applyUpdateV2(ydoc, payload.snapshot);
      }
      docs.set(payload.blockId, ydoc);
      
      ydoc.on('update', (update, origin) => {
        if (origin !== 'remote') {
          // Local edit: send to server
          self.postMessage({ type: 'LOCAL_UPDATE', blockId: payload.blockId, update });
        }
      });
      break;
    }
    
    case 'REMOTE_UPDATE': {
      const ydoc = docs.get(payload.blockId);
      if (ydoc) {
        Y.applyUpdateV2(ydoc, payload.update, 'remote');
        // Post reconciled state back to main thread
        const text = ydoc.getText('content').toString();
        self.postMessage({ type: 'BLOCK_RECONCILED', blockId: payload.blockId, text });
      }
      break;
    }
    
    case 'LOCAL_EDIT': {
      const ydoc = docs.get(payload.blockId);
      if (ydoc) {
        ydoc.transact(() => {
          // Apply local delta to Y.Doc
          applyDelta(ydoc.getText('content'), payload.delta);
        });
      }
      break;
    }
  }
};

10.4 Canvas-Based Cursor Rendering

// CursorCanvas.ts — renders all cursors on HTML Canvas
class CursorCanvas {
  private ctx: CanvasRenderingContext2D;
  private cursors: Map<string, CursorState> = new Map();
  private blockPositions: Map<string, DOMRect> = new Map();
  
  updateCursors(batch: CursorBatch) {
    for (const [userId, cursor] of Object.entries(batch)) {
      this.cursors.set(userId, cursor);
    }
    this.scheduleRender();
  }
  
  private render() {
    const { ctx } = this;
    ctx.clearRect(0, 0, this.canvas.width, this.canvas.height);
    
    // Cluster cursors by block
    const clusters = this.clusterCursors(this.cursors, this.blockPositions);
    
    for (const cluster of clusters) {
      if (cluster.size === 1) {
        // Render individual cursor caret
        const [userId, cursor] = [...cluster.entries()][0];
        this.drawCaret(cursor, getColor(userId));
        this.drawLabel(cursor, getUserName(userId));
      } else {
        // Render cluster badge: "42 users"
        this.drawClusterBadge(cluster.centroid, cluster.size);
      }
    }
  }
  
  // Performance: only re-render when cursors change
  private scheduleRender = throttle(() => requestAnimationFrame(() => this.render()), 16);
}

11. Failure Scenarios

11.1 Scenario: Node Failure During Active Editing

Context: Node 4 hosts partition servers for partitions 0–7
         All 1000 users are subscribed to some of these
         Node 4 crashes suddenly

Timeline:
  T+0ms   : Node 4 crashes, Erlang distribution detects within 500ms
  T+500ms : Surviving nodes receive :nodedown from Distributed Erlang
  T+500ms : OTP supervisors on surviving nodes detect missing processes
  T+600ms : Document Coordinator detects missing partition servers
  T+700ms : DynamicSupervisor spawns replacement processes on Node 5
  T+800ms : New processes load state from database (latest snapshot + op replay)
            Fast path: < 50 ops since snapshot → < 300ms replay
  T+1100ms: Processes restored, begin accepting operations
  T+1100ms: Clients receive reconnect signal, push pending ops
  T+1500ms: Full recovery

Client experience:
  - 1-2 second pause in receiving remote updates
  - Their own edits are buffered client-side (WebSocket reconnect)
  - On reconnect: client sends operations with original HLC timestamps
  - Server replays and merges with what happened during outage
  - No data loss (client buffers unsent ops)

11.2 Scenario: Network Partition (Split Brain)

Context: 3-node cluster, nodes split into [Node1,2] and [Node3]
         Document Coordinator running on Node1

Node3 side:
  - Cannot reach Document Coordinator
  - Falls back to "degraded mode": accept local edits, buffer for sync
  - CRDTs ensure these edits are eventually mergeable
  - Users on Node3 see their own edits but not Node1/2's edits

Node1/2 side:
  - Continue normally
  - Don't know about Node3 edits

Partition heals:
  - Node3 reconnects to cluster
  - Node3's buffered operations sent to Document Coordinator
  - CRDT merge applied: concurrent edits merged deterministically
  - All clients reconcile to same state
  - Structure ops (move_block, etc.) resolved by HLC ordering

Risk: if partition lasts > snapshot interval (10 min), Node3 may diverge significantly
  - Large merge operation required
  - May show "conflict resolution" UI to users in extreme cases

11.3 Scenario: Slow Client (Head-of-Line Blocking)

Context: One user on 2G mobile, receiving messages slowly
         Their WebSocket channel's send buffer fills up
         Phoenix default: process suspend when buffer full

Problem: 
  - UserSession process for slow client accumulates backpressure
  - If synchronous, it blocks the process
  - In Elixir: process mailbox can grow unbounded → OOM

Solution: Separate fast and slow paths
  defmodule UserSession do
    def handle_info({:op_broadcast, op}, state) do
      case :gen_tcp.send(state.socket, encode(op)) of
        :ok -> 
          {:noreply, state}
        {:error, :timeout} ->
          # Client is slow: drop non-critical messages
          if droppable?(op) do
            {:noreply, state}  # Drop cursor updates, presence
          else
            # For structural ops: mark client as lagging
            {:noreply, %{state | lagging: true, pending_ops: [op | state.pending_ops]}}
          end
      end
    end
    
    # Types that can be safely dropped for slow clients
    defp droppable?({:cursor_move, _}), do: true
    defp droppable?({:presence_update, _}), do: true
    defp droppable?({:viewport_sync, _}), do: true
    defp droppable?(_), do: false  # Structure ops: never drop
  end

11.4 Scenario: AI Agent Runaway

Context: Compliance agent triggered by large document import
         Agent generates 10,000 operations instead of expected 100

Detection:
  - OperationBudget system detects budget exceeded
  - Agent's ops rate > 500 ops/sec for > 5 seconds → trigger alert
  - CircuitBreaker trips for this agent type

Response:
  1. Suspend agent's session (kill agent GenServer)
  2. Mark last 1000 ops as "agent:runaway" in operation log
  3. Rollback last N agent ops (replay log without agent ops)
  4. Notify document owner
  5. Circuit breaker: agent type paused for 1 hour

Rollback implementation:
  - Filter operation log: exclude ops with agent_session_id
  - Replay filtered log from last known-good snapshot
  - This is expensive (full partition replay) but correctness is maintained

12. Scalability Estimation

12.1 Baseline: 1000 Users, 1 Large Document

Document: 2000 blocks, 20 partitions (100 blocks each)
Users: 1000 concurrent editors
Session: 6 hours/day sustained

─── INBOUND TRAFFIC ───
User operations:    1000 × 1.5 ops/sec = 1,500 ops/sec
Cursor events:      1000 × 3 updates/sec = 3,000/sec
Viewport changes:   1000 × 1 update/sec = 1,000/sec
AI agent ops:       150 agents × 20 ops/sec = 3,000/sec
Total inbound:      ~8,500 events/sec

─── OUTBOUND (after optimizations) ───
Structure ops:      1,500/sec × 50 avg recipients = 75,000/sec
Cursor events:      3,000/sec × 50 avg visible cursors = 150,000/sec
AI agent ops:       3,000/sec × 50 avg recipients = 150,000/sec
Total outbound:     ~375,000 messages/sec

Bandwidth (200 bytes avg):
  Inbound:  8,500 × 200B = 1.7 MB/sec
  Outbound: 375,000 × 200B = 75 MB/sec → with gzip: ~15 MB/sec
  Per node (6 nodes): 2.5 MB/sec — very manageable

─── PROCESS COUNT ───
WebSocket sessions:     1,000 processes
Partition servers:      20 processes per document
Document coordinator:   1 process per document
Presence servers:       1 process per document
Supporting processes:   ~50
Total per hot document: ~1,072 processes
BEAM handles 1M+ processes — this is trivial

─── MEMORY ───
WebSocket sessions:     1,000 × 50KB = 50MB
Partition servers:      20 × 5MB (CRDT state) = 100MB
Yjs snapshots per block: 2000 blocks × 5KB = 10MB (in ETS)
Operation buffer:       20 partitions × 2MB pending ops = 40MB
Phoenix PubSub subs:    1,000 subs × 1KB = 1MB
Total per document:     ~200MB RAM per hot document

─── DATABASE ───
Operation log writes: 8,500 events/sec batched → 85 batches/sec
Each batch: 100 ops × 200B = 20KB → 1.7MB/sec to Postgres
Postgres throughput: typically 100MB/sec+ → well within limits
Operation log growth: 1.7MB/sec × 86400 = 147GB/day (full 1000-user day)
  → Compress with ZSTD → ~15GB/day at 10:1 ratio

─── CPU ───
CRDT operations: 8,500/sec × 1ms avg = 8.5 CPU-core-seconds/sec
  → 9 cores dedicated to CRDT computation (oversized for safety)
WebSocket I/O: 375,000 msg/sec → ~2 cores (BEAM handles I/O efficiently)
Serialization: 8,500 msg/sec encode × 0.1ms = 0.85 core
Total: ~12 cores for 1 document cluster

─── CLUSTER SIZING ───
For 1 hot document: 6 nodes × 4 cores × 32GB = adequate with headroom
For 100 hot documents at 1000 users each: 50 nodes × 16 cores × 64GB
  → Assuming 50% node sharing across documents

12.2 Comparison with Real Systems

System	Concurrent Users/Doc	Sync Protocol	Notes
Google Docs	~100 practical	OT	Degrades above 50
Figma	~100	Live Graph (custom CRDT)	Whiteboard, not doc
Notion	~50 practical	Custom OT	Blocks, not full CRDT
Discord (channel)	500k viewers	No CRDT (chat)	Read-mostly
MarkdownOffice	1,000	Block CRDT	This design

The 1000-editor steady-state is 10x beyond Google Docs current limits. It is achievable with the architecture above, but requires:

Viewport-based fanout filtering (mandatory, not optional)
Partition-based block sharding (mandatory)
Cursor canvas rendering (mandatory for >50 cursors)
AI agent rate limiting (mandatory)

13. Tradeoffs

13.1 Chosen Tradeoffs

Decision	What You Gain	What You Give Up
Block-level CRDT (not char-level)	Structure correctness, semantic ops	Must implement custom tree CRDT
Yjs on frontend (server as relay)	Fast text editing, proven library	Server cannot understand text ops
Viewport filtering	20x reduction in fanout	Users miss ops for off-screen blocks briefly
LWW for block properties	Simple, fast	Last write wins, not "smartest merge"
Partition-based sharding	Horizontal scale	Cross-partition ops require coordination
Canvas cursors (not DOM)	100+ cursor performance	Accessibility harder to implement
Batch DB writes (100ms window)	Database survives load	100ms durability window (ops can be lost in crash)

13.2 The Viewport Filtering Caveat

Viewport filtering means: a user editing off-screen will not receive 
updates from off-screen areas. When they scroll there:
  1. Request "catch-up" from server (latest partition state)
  2. Server sends snapshot + ops since client's last HLC for that partition
  3. Client applies catch-up (typically < 100 ops → near-instant)

This is identical to how Figma handles canvas regions and
how Google Docs handles large spreadsheets.

13.3 Consistency Model

MarkdownOffice uses: Causal Eventual Consistency

NOT: strong consistency (too slow, requires coordination)
NOT: eventual consistency without ordering (text would be garbled)

Causal consistency guarantees:
  - If you see op B, you've already seen op A if B causally depends on A
  - If User A edits block, then replies in comment, other users see
    the edit BEFORE the comment (causal ordering preserved)
  - No forward references to operations not yet received

What it does NOT guarantee:
  - Real-time ordering (operations may arrive out of real-time order)
  - You see all operations immediately after they're made
  - For a document with 1000 editors, your view may be 50-200ms behind
    for off-screen areas (acceptable for collaboration)

14. Final Recommended Architecture

14.1 Architecture Summary

TRANSPORT LAYER
  WebSocket via Phoenix Channels
  TLS + compression (permessage-deflate)
  4 WS nodes, load-balanced

SESSION LAYER  
  1 GenServer per connection (UserSession)
  Handles: auth, rate limiting, viewport state, message routing
  
DOCUMENT LAYER
  1 DocumentCoordinator per document (Horde.Registry, cluster-wide unique)
  Partitioned BlockPartitionServers (20 per large document)
  CRDT state in-memory, snapshotted to Postgres
  
PRESENCE LAYER
  Phoenix.Presence for join/leave/heartbeat
  Custom PresenceServer for cursor/viewport sync
  Batched 10hz broadcast, viewport-filtered delivery
  Canvas-based cursor rendering on frontend
  
CRDT LAYER
  Block tree: Custom Move Tree CRDT in Elixir
  Block ordering: Fractional Index (LSEQ variant)
  Text within block: Yjs (frontend) + opaque relay (server)
  Properties: LWW-Register with HLC
  
AI AGENT LAYER
  Agents as synthetic users in same operation pipeline
  Oban job queue for async execution
  Operation budget system (max 500 downstream ops per user trigger)
  Circuit breakers per agent type
  
STORAGE LAYER
  PostgreSQL: operation log (event source) + snapshots
  Redis Streams: write buffer (100ms batch window)
  ETS: hot CRDT state cache, viewport registry
  S3: archived operation logs (> 30 days), Yjs binary snapshots
  
FRONTEND LAYER
  SvelteKit + Svelte 5 (fine-grained reactivity)
  Virtualized block list (only visible blocks rendered)
  Web Worker for Yjs CRDT processing
  Canvas overlay for cursors (not DOM)
  Optimistic updates with HLC-based rollback

14.2 Critical Implementation Order

Phase 1: Foundation (months 1-3)
  ✓ BlockPartitionServer with basic CRDT (insert/delete only)
  ✓ Phoenix Channels + basic fanout
  ✓ Yjs integration (text within blocks)
  ✓ Snapshot + replay

Phase 2: Scale (months 4-6)
  ✓ Viewport subscription system
  ✓ Cursor batching + canvas rendering
  ✓ Move Tree CRDT (reparent operations)
  ✓ Cluster distribution (Horde)

Phase 3: Intelligence (months 7-9)
  ✓ AI agent pipeline
  ✓ Operation budget system
  ✓ Cross-document dependency graph
  ✓ Policy validation layer

Phase 4: Production hardening (months 10-12)
  ✓ Operation log compaction
  ✓ Network partition tests (Jepsen-style)
  ✓ Circuit breakers everywhere
  ✓ Full observability (telemetry + tracing)

14.3 The Non-Negotiables

If any of these are compromised, the system cannot reach 1000 users/document:

Viewport-based fanout filtering — without this, outbound bandwidth is 50x too high
Block partition sharding — without this, single-process bottleneck caps at ~100 users
AI agent rate limiting — without this, one triggered agent can flood the system
CRDT for structure, not OT — at 1000 users, OT serialization at a central server cannot keep up
Canvas cursors, not DOM — at 50+ cursors, DOM-based rendering causes layout thrash and jank
HLC timestamps everywhere — without causal ordering, CRDT convergence is not guaranteed

Analysis depth: distributed systems, BEAM/Erlang, CRDT theory, structured document semantics
Reference systems: Google Docs, Figma, Linear, Discord, Liveblocks, Yjs, Automerge, Kleppmann et al.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation