Building realtime collaboration for a structured document graph (not a plain text editor) at 1000 concurrent users introduces a class of problems that are qualitatively different from both:
- Plain text collaboration (Google Docs circa 2010)
- Large-scale messaging (Discord, Slack)
MarkdownOffice sits at the intersection of both, with added complexity from:
- Structured semantics: Block operations are not character insertions. Moving a block, changing a slide's layout, or updating a spreadsheet formula carries semantic weight.
- Cross-document dependency graphs: An edit in Doc A can cascade to Doc B, C, and N via cross-references, formula dependencies, and AI-generated content.
- Multi-agent editing: AI agents are first-class citizens in the edit graph, capable of generating bursts of 10–100 operations from a single user action.
At steady-state with 1000 concurrent editors:
Inbound operations:
- User ops: 1000 users × 2 ops/sec = 2,000 ops/sec
- Cursor/presence: 1000 users × 5 events/sec = 5,000 events/sec
- AI agent ops: assume 10% trigger agents, 30 downstream ops = 6,000 ops/sec
Total inbound: ~13,000 events/sec
Naive fanout (broadcast to all):
- 13,000 × 999 recipients = ~13,000,000 messages/sec
- At 200 bytes/message = 2.6 GB/s outbound
- COMPLETELY IMPOSSIBLE without radical filtering
The problem is not sending operations — it is fanout explosion.
| Challenge | Plain Text Editor | MarkdownOffice Block Graph |
|---|---|---|
| Conflict unit | Character insertion | Block operation (move, resize, nest, split) |
| CRDT compatibility | Well-solved (RGA, YATA) | Partially solved — tree CRDTs are young |
| Operation complexity | Insert/delete | Insert, delete, move, reparent, split, merge, transform |
| Dependency tracking | None | Cross-block, cross-document |
| Semantic validation | None required | Formula integrity, slide consistency, citation validity |
# DO NOT DO THIS
defmodule MarkdownOffice.DocumentChannel do
use Phoenix.Channel
def handle_in("operation", op, socket) do
# This becomes a bottleneck at ~100 users
# Single process mailbox receives 13,000 msgs/sec
# BEAM scheduler cannot keep up
# Mailbox grows unbounded → OOM → crash
broadcast!(socket, "operation", op)
{:noreply, socket}
end
endWhy it fails:
- A single GenServer processes messages sequentially — one mailbox
- At 13,000 events/sec, a process handling each in ~10µs can only do ~100,000/sec — fine on paper
- But broadcast to 1000 sockets from one process creates a synchronous fanout bottleneck
- Phoenix's
broadcast!goes through PubSub, but the topic subscription list iteration for 1000 subscribers is expensive - Memory explosion: Document state + operation history + 1000 subscription refs in one process
Operational Transformation requires a central authority to assign operation order:
Client A sends op at t=100ms
Client B sends op at t=101ms
Server must:
1. Receive both
2. Pick an order
3. Transform one against the other
4. Broadcast transformed versions
At 1000 clients × 2 ops/sec = 2000 transforms/sec at the server
Each transform requires O(history_length) lookups
With long sessions: history grows → transforms become O(n²)
More fundamentally: OT's correctness depends on linearizing all operations through a single coordinator. At 1000 users, that coordinator is both a bottleneck and a single point of failure.
Standard CRDTs (Yjs, Automerge) were designed for sequential text or key-value maps. The MarkdownOffice block graph is a tree CRDT with positional semantics — a fundamentally harder problem:
Problem: Two users simultaneously reparent the same block
User A: Move Block-7 under Block-3
User B: Move Block-7 under Block-5
Naive CRDT merge: Block-7 appears under BOTH Block-3 and Block-5
Result: Duplicate block in the document tree — INVALID STATE
Standard Yjs cannot handle this. Tree CRDTs that prevent cycle creation and duplication (Kleppmann et al., 2022 "A highly-available move operation") exist but are complex to implement with block-level semantics.
Markdown is NOT a stable sync format:
1. Parser instability: Different parsers produce different ASTs from same string
"# Hello\n\nWorld" vs "# Hello\n\nWorld\n" → different ASTs
2. AST invalidation: Any edit can re-parse the entire document
Edit one character in a table cell → entire table re-parsed
3. Whitespace non-determinism: Indentation, blank lines carry semantic weight
in some Markdown variants but not others
4. MarkdownSheet formulas: =SUM(A1:A10) has no natural Markdown representation
Any round-trip through Markdown loses formula semantics
5. Block ID loss: After Markdown round-trip, stable block IDs are gone
Version history becomes unreplayable
Markdown is an I/O format only. The sync layer must never see it.
The recommended model is a hybrid that does not exist as an off-the-shelf library — you must implement it:
┌─────────────────────────────────────────────────┐
│ Collaboration Layer Stack │
├─────────────────────────────────────────────────┤
│ L4: Semantic Operation Layer │
│ (move_block, split_block, update_formula) │
├─────────────────────────────────────────────────┤
│ L3: Block-Level CRDT Layer │
│ (tree structure, block ordering) │
├─────────────────────────────────────────────────┤
│ L2: Text CRDT Layer (within blocks) │
│ (character-level, Yjs YATA compatible) │
├─────────────────────────────────────────────────┤
│ L1: Metadata LWW Layer │
│ (block properties, styles, metadata) │
├─────────────────────────────────────────────────┤
│ L0: Transport / Ordering Layer │
│ (HLC timestamps, causal delivery) │
└─────────────────────────────────────────────────┘
// All operations are sealed variants with full type information
type BlockOperation =
// Structure operations — must go through CRDT tree
| { type: "insert_block"; parentId: BlockId; afterId: BlockId | null; block: Block }
| { type: "delete_block"; id: BlockId }
| { type: "move_block"; id: BlockId; newParentId: BlockId; afterId: BlockId | null }
| { type: "split_block"; id: BlockId; at: TextPosition; newBlockId: BlockId }
| { type: "merge_blocks"; id: BlockId; nextId: BlockId }
// Content operations — delegate to text CRDT within block
| { type: "edit_text"; blockId: BlockId; delta: YjsDelta }
// Property operations — Last Write Wins
| { type: "set_property"; blockId: BlockId; key: string; value: unknown; hlc: HLC }
// Presence operations — ephemeral, NOT persisted
| { type: "cursor_move"; userId: UserId; blockId: BlockId; offset: number }
| { type: "viewport_change"; userId: UserId; visibleBlocks: BlockId[] }
// Operations carry Hybrid Logical Clocks for causal ordering
type OperationEnvelope = {
op: BlockOperation;
hlc: HLC; // Hybrid Logical Clock timestamp
userId: UserId;
sessionId: string;
causedBy: HLC[]; // Causal dependencies (for non-commutative ops)
}For the critical "concurrent move" problem in the tree CRDT:
Algorithm (from Kleppmann 2022, adapted for MarkdownOffice):
Operation: Move(nodeId, newParentId, afterId, opId)
Invariants to maintain:
1. No cycles (nodeId cannot be ancestor of newParentId)
2. No duplication (node appears exactly once in tree)
3. Deterministic: same set of operations → same tree, regardless of order
Implementation:
- Every node tracks: {id, parentId, position_clock, tombstoned}
- Position is a Fractional Index (LSEQ-style) within sibling list
- On concurrent moves of same node:
- Only the move with LOWER HLC timestamp "wins" (is applied)
- The other move is discarded (logged for audit, not applied)
- This is a deliberate LWW-over-position policy
This means: last mover loses (the earlier, more "committed" move wins)
The alternative (last-writer-wins on position) leads to phantom copies.
A single document with 1000 editors should never be managed by a single process. Instead:
Document = Collection of Block Partitions
= P₁ ∪ P₂ ∪ ... ∪ Pₙ
Each partition:
- Contains a contiguous range of blocks (by document order)
- Has its own GenServer (BlockPartitionServer)
- Has its own PubSub topic
- Handles ~50–200 blocks
A user subscribes ONLY to partitions whose blocks are in their viewport
Why this works:
- 1000 users editing a 2000-block document → each user sees ~50 blocks at a time
- Each BlockPartitionServer handles ~30-50 active subscribers (not 1000)
- Fanout per operation: ~50 recipients × operation broadcast (not 999)
- System scales horizontally by adding more partition processes
MarkdownDoc: Partition by heading sections
Doc → [Section-1: blocks 1-80] [Section-2: blocks 81-160] ...
MarkdownSlide: Partition by slide
Deck → [Slide-1] [Slide-2] ... [Slide-N]
(Natural partition boundaries — users view one slide at a time)
MarkdownSheet: Partition by row range + column range (2D sharding)
Sheet → [A1:Z100] [A101:Z200] ... (row partitions)
→ [A1:M500] [N1:Z500] ... (column partitions for wide sheets)
Cell dependencies can cross partitions → requires dependency tracking
User Operation Flow:
[Client]
↓ WebSocket (Phoenix Channel)
[UserSession Process — 1 per connection]
↓ validate, assign HLC, check ACL
[DocumentCoordinator — 1 per document]
↓ determine affected partition(s)
[BlockPartitionServer — 1 per partition]
↓ apply CRDT operation to partition state
↓ broadcast to subscribers of this partition
[UserSession Processes of subscribers]
↓ filter by user's viewport
[Client] ← only receives ops for visible blocks
defmodule MarkdownOffice.ViewportRegistry do
# Each user registers which blocks they're viewing
# Operations are filtered BEFORE fanout, not after
@type viewport :: %{
user_id: String.t(),
document_id: String.t(),
visible_blocks: MapSet.t(), # Block IDs currently on screen
subscribed_partitions: [atom()] # PubSub topics
}
def update_viewport(user_id, visible_blocks) do
current = get_viewport(user_id)
new_partitions = blocks_to_partitions(visible_blocks)
old_partitions = current.subscribed_partitions
# Subscribe to newly visible partitions
(new_partitions -- old_partitions)
|> Enum.each(&Phoenix.PubSub.subscribe(MO.PubSub, partition_topic(&1)))
# Unsubscribe from no-longer-visible partitions
(old_partitions -- new_partitions)
|> Enum.each(&Phoenix.PubSub.unsubscribe(MO.PubSub, partition_topic(&1)))
update_registry(user_id, %{visible_blocks: visible_blocks, subscribed_partitions: new_partitions})
end
endMarkdownOffice.Application (root supervisor)
├── MarkdownOffice.PubSub (Phoenix.PubSub :pg adapter, partitioned)
├── MarkdownOffice.ClusterSupervisor
│ ├── MarkdownOffice.NodeRegistry (libcluster auto-discovery)
│ └── MarkdownOffice.HLCServer (global HLC synchronization)
├── MarkdownOffice.Endpoint (Phoenix Endpoint)
│ └── [1000x UserSession processes per hot document]
└── MarkdownOffice.DocumentSupervisor (DynamicSupervisor)
└── per document:
├── DocumentCoordinator (1 per document, global state)
├── DocumentSupervisor (nested DynamicSupervisor)
│ ├── BlockPartitionServer[0] (blocks 0–99)
│ ├── BlockPartitionServer[1] (blocks 100–199)
│ └── ... (N partitions)
├── PresenceServer (cursor/viewport state only)
├── OperationLog (event source buffer)
└── SnapshotWorker (async, periodic snapshotting)
defmodule MarkdownOffice.BlockPartitionServer do
use GenServer
# State is a partition-local view of the document
defstruct [
partition_id: nil,
document_id: nil,
blocks: %{}, # %{block_id => Block}
crdt_state: nil, # RGA/YATA state per block
sibling_order: nil, # Fractional index tree for ordering
hlc: nil, # Partition-local HLC
operation_buffer: [], # Unacked operations
subscribers: MapSet.new() # Active viewport subscribers
]
# Critical: process mailbox must never overflow
# Operations arrive at ~200/sec per partition (from all users)
# Each operation must complete in <1ms to maintain 200hz throughput
def handle_cast({:apply_op, envelope}, state) do
with :ok <- validate_causality(envelope, state),
{:ok, new_state} <- apply_crdt_op(envelope.op, state),
:ok <- broadcast_to_subscribers(envelope, new_state) do
# Async write to operation log — don't block the fast path
Task.start(fn -> OperationLog.append(envelope) end)
{:noreply, new_state}
else
{:error, :out_of_order} ->
# Buffer and retry after causal dep is received
{:noreply, buffer_pending(envelope, state)}
end
end
# Back-pressure: if mailbox > threshold, apply flow control
def handle_call(:get_pressure, _from, state) do
{:reply, Process.info(self(), :message_queue_len), state}
end
endThe tempting architecture — one GenServer per block — fails:
1 process per block:
- 10,000 block document = 10,000 processes
- Each process: ~3KB baseline memory = 30MB just for processes
- Cross-block operations (merge_blocks, move_block) require
distributed transactions across 2+ processes
- Two-phase commit introduces latency and complexity
- Process scheduling overhead at 10k concurrent processes
under load is measurable
BEAM is excellent at 100k+ processes, but inter-process
coordination for transactional ops is still expensive.
Verdict: 1 process per PARTITION (50-200 blocks) is the sweet spot.
- For a 2000-block doc: ~20–40 partition processes
- Cross-partition ops (move block across section boundary) are rare
- Can be handled via partition coordinator with optimistic locking
For a "hot" document (1000 concurrent editors):
Option A: All partition processes on ONE node
- Simpler routing (local process calls, no distribution overhead)
- Risk: single node failure kills the document
- 1000 WebSocket connections → 1 node saturated
Option B: Partition processes distributed across cluster
- Document shard across 3-5 nodes (consistent hashing on partition_id)
- WebSocket connections load-balanced (any node can connect)
- Cross-node message via Distributed Erlang (~0.1ms latency)
- RECOMMENDED for hot documents (>500 users)
Node roles:
┌──────────────────────────────────────┐
│ Load Balancer (HAProxy / Nginx) │
└─────────┬────────────────────────────┘
│
┌─────────▼────────────────────────────┐
│ WS Nodes (websocket-optimized) │
│ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │ Node 1 │ │ Node 2 │ │ Node 3 │ │ ← 333 connections each
└──┴────────┴─┴────────┴─┴────────┴───┘
│ Distributed Erlang
┌─────────▼────────────────────────────┐
│ Document Nodes (computation-heavy) │
│ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │ Node 4 │ │ Node 5 │ │ Node 6 │ │ ← partition servers
└──┴────────┴─┴────────┴─┴────────┴───┘
│
┌─────────▼────────────────────────────┐
│ Storage Nodes │
│ PostgreSQL (primary + 2 replicas) │
└──────────────────────────────────────┘
# Configure PubSub with pool for parallel dispatch
config :markdown_office, MarkdownOffice.PubSub,
adapter: Phoenix.PubSub.PG2,
pool_size: 10 # 10 parallel dispatch workers
# Use partitioned topics to reduce broadcast contention
# Instead of one topic per document:
# "doc:abc123" (1000 subscribers)
# Use one topic per partition:
# "doc:abc123:partition:0" (~30 subscribers)
# "doc:abc123:partition:1" (~25 subscribers)
# ...
# Presence is a separate topic (different QoS)
# "presence:abc123" (all 1000 users — but throttled to 1hz)| Layer | Data Type | CRDT Type | Implementation |
|---|---|---|---|
| Block ordering within parent | Sibling list | LSEQ / Fractional Index | Custom Elixir |
| Block tree structure | Nested tree | Move Tree CRDT (Kleppmann 2022) | Custom |
| Text within a block | Character sequence | YATA (Yjs-compatible) | Port Yjs to WASM/JS |
| Block properties (color, style) | Key-value | LWW-Register with HLC | Simple |
| Block metadata (type, version) | Key-value | LWW-Register with HLC | Simple |
| Comments / annotations | List | RGA | Custom |
Instead of integer positions (which require renumbering on insert):
Block order uses fractional keys between 0 and 1:
[0.1: Block-A] [0.5: Block-B] [0.8: Block-C]
Insert between A and B:
new_key = (0.1 + 0.5) / 2 = 0.3
[0.1: A] [0.3: new] [0.5: B] [0.8: C]
Concurrent inserts at same position:
Two users insert at position 0.3 simultaneously
Both generate different fractional keys (using user-id as tiebreaker)
No conflict — both blocks appear, order determined by key comparison
Precision concern: After many inserts, keys approach equal values
Solution: Rebalancing operation (offline, periodic, logged as an operation)
Architecture decision: Run Yjs in Elixir via NIFs or as JS worker?
Option A: Port Yjs logic to Elixir
+ Native BEAM process, no FFI overhead
+ Can introspect CRDT state
- Enormous implementation effort
- Binary format may drift from Yjs spec
Option B: Run Yjs in Node.js sidecar
+ Full Yjs compatibility
+ Can sync with frontend Yjs directly
- Latency (Elixir → Node.js → Elixir)
- Operational complexity
Option C: Yjs on Frontend ONLY, server is authority for structure
+ Server handles block-level CRDT (tree)
+ Client handles text-level CRDT (character sequence within each block)
+ Server validates and relays Yjs updates without understanding them
- Offline text edits require full Yjs merge on reconnect
RECOMMENDED for v1
Option D: Custom BEAM CRDT for everything
+ Full control, no dependencies
- 6-12 months of implementation
- CRDT bugs are catastrophic and hard to detect
Only after v2+ with dedicated CRDT engineering team
Text CRDT (within a block):
- Use Yjs on frontend (JS)
- Server opaquely relays Yjs Y.Doc updates between clients
- Server stores Yjs binary snapshots per block
- Server does NOT interpret Yjs state (just relay + store)
Structure CRDT (block tree):
- Custom implementation in Elixir
- Based on Kleppmann move operation paper
- Operations: insert_block, delete_block, move_block
- Server is authoritative for block tree structure
- Frontend OPTIMISTICALLY applies structure ops, may rollback
Property CRDT (block metadata):
- LWW-Register: last write wins using HLC
- Server resolves concurrent writes deterministically
- Properties: type, color, collapsed, width, etc.
Naive presence: 1000 users × 5 cursor updates/sec = 5000 events/sec
Each event fanout to 999 peers = 5,000,000 messages/sec
At 100 bytes/message = 500 MB/sec — IMPOSSIBLE
Optimizations applied in order of impact:
Layer 1: Throttling at Source
Client sends cursor position at max 4 Hz (250ms interval)
Viewport changes at max 2 Hz (500ms interval)
Typing indicators: binary true/false, not position
Effect: 5000 → 4000 events/sec (20% reduction — not enough alone)
Layer 2: Server-Side Cursor Aggregation
defmodule MarkdownOffice.PresenceServer do
# Collect cursor updates in a 100ms window
# Batch and broadcast once every 100ms
defp flush_cursors(state) do
# Build a diff: only changed cursors since last broadcast
cursor_diff = compute_diff(state.last_broadcast_cursors, state.current_cursors)
# Broadcast compressed delta (not all 1000 cursors)
Phoenix.PubSub.broadcast(MO.PubSub, "presence:#{state.doc_id}", {
:cursor_batch,
cursor_diff,
state.hlc
})
# 10 broadcasts/sec × (changed cursors only) ≈ 50-100 cursors per batch
# vs naive: 5000 events/sec
end
endLayer 3: Viewport-Filtered Cursor Delivery
User A is viewing blocks 50-100
User A only receives cursor updates for users whose cursors are in blocks 50-100
Implementation:
- PresenceServer knows each user's viewport (updated by client)
- On cursor broadcast, filter by viewport intersection
- Only deliver cursor if it's in recipient's visible area
Effect: ~50 out of 1000 cursors are visible to any given user
→ 1000 cursors × 4hz = 4000 events/sec inbound
→ Per user: ~200 relevant cursor events/sec delivered
→ Total outbound: 4000 × 50 (avg visible) = 200,000 msgs/sec (vs 5M)
Layer 4: Cursor Clustering
For high-density areas (e.g., 50 users on same paragraph):
- Cluster nearby cursors into a single aggregated cursor
- Show "42 users editing here" instead of 42 individual cursors
- Client-side: only detailed cursors for nearest N users (e.g., 10)
- Others shown as anonymous colored dots
Threshold: if >10 cursors within same block, cluster them
// Ephemeral — NOT event-sourced, NOT persisted to DB
type UserPresence = {
userId: string;
sessionId: string; // For multi-device
color: string; // Consistent per user
// Cursor (updated frequently, throttled to 4hz)
cursor: {
blockId: string;
offset: number; // Character offset within block
anchorBlockId?: string; // For multi-block selection
anchorOffset?: number;
} | null;
// Viewport (updated on scroll, throttled to 2hz)
viewport: {
visibleBlockIds: string[]; // Currently visible blocks
topBlockId: string; // Block at top of viewport
};
// Activity
isTyping: boolean; // True if keystroke in last 3s
lastSeenAt: number; // Unix ms — for timeout
}
// Server-side: Phoenix.Presence handles distributed state
// Automatically handles node failures, merges presence from partitioned nodesPhoenix.Presence uses CRDT internally (PG2) across nodes.
For 1000 users on 1 document:
Problem: Phoenix.Presence broadcasts full presence diff on every change
→ 1000 users × changes → large diff payloads
Solution: Shadow Presence (presence metadata stripped down)
- Presence only carries: userId, color, isActive, partition
- Detailed cursor/viewport tracked in PresenceServer (custom)
- Phoenix.Presence used only for join/leave/heartbeat
- Cursor sync uses separate high-frequency topic
topic "presence:doc_id" → join/leave events only (low freq, full broadcast OK)
topic "cursors:doc_id:partition:N" → cursor updates (high freq, viewport filtered)
-- Core event log (append-only)
CREATE TABLE document_operations (
id BIGSERIAL PRIMARY KEY,
document_id UUID NOT NULL,
partition_id INTEGER NOT NULL,
hlc BIGINT NOT NULL, -- Hybrid Logical Clock
user_id UUID NOT NULL,
session_id VARCHAR(36) NOT NULL,
op_type VARCHAR(50) NOT NULL, -- 'insert_block', 'edit_text', etc.
op_data JSONB NOT NULL, -- Serialized operation
caused_by BIGINT[], -- Causal deps (HLC values)
block_ids UUID[], -- Affected block IDs (for filtering)
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- Partition indexes for common queries
CREATE INDEX idx_doc_ops_document ON document_operations(document_id, hlc);
CREATE INDEX idx_doc_ops_partition ON document_operations(document_id, partition_id, hlc);
CREATE INDEX idx_doc_ops_blocks ON document_operations USING GIN(block_ids);
-- Snapshots (periodic full state captures)
CREATE TABLE document_snapshots (
id BIGSERIAL PRIMARY KEY,
document_id UUID NOT NULL,
partition_id INTEGER NOT NULL,
snapshot_hlc BIGINT NOT NULL, -- HLC at time of snapshot
blocks_json JSONB NOT NULL, -- Serialized block graph
yjs_binary BYTEA, -- Yjs Y.Doc binary per block
op_count INTEGER NOT NULL, -- Number of ops since last snapshot
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_snapshots_doc ON document_snapshots(document_id, partition_id, snapshot_hlc DESC);At 1000 users × 2 ops/sec = 2000 ops/sec
PostgreSQL write throughput:
- Single table insert: ~5000-10000 rows/sec on NVMe
- With JSONB: ~3000-5000 rows/sec
- Problem: 2000 ops/sec is borderline, and AI agents add 6000 more
Solutions:
1. Batch inserts: collect 100ms of ops, insert as single multi-row insert
2000 ops/sec × 100ms = 200 ops per batch
Batched: 10 inserts/sec instead of 2000 — well within Postgres capacity
2. Operations table partitioned by document_id + created_at (range partition)
Old partitions archived to cold storage (S3 Parquet)
3. Write through Redis Stream first (low latency), async drain to Postgres
Redis XADD: ~100k ops/sec capacity
Worker drains Redis → Postgres in batches
Snapshot triggers:
- Every 500 operations per partition
- Every 10 minutes (time-based)
- On explicit user save
- Before AI agent batch operations
Snapshot process (must be non-blocking):
1. BlockPartitionServer receives :snapshot signal
2. Spawn async Task with CURRENT state snapshot
3. Task serializes to JSONB + Yjs binary
4. Task writes to document_snapshots table
5. Task logs new snapshot_hlc (for future replay baseline)
Replay from snapshot (for reconnect, new participant):
1. Load latest snapshot for each partition (snapshot_hlc)
2. Replay operations from snapshot_hlc to current (from operation log)
3. Apply replayed ops to restore current state
4. Average replay: 500 ops × 5ms/op = 2.5s max
5. Fast path: < 50 ops since snapshot = <250ms replay
Problems with unbounded operation logs:
- Storage cost: 2000 ops/sec × 100 bytes = 200KB/sec = 17GB/day
- Replay time grows linearly with history
Compaction strategy:
Retain ALL operations for 30 days (full history, undo support)
After 30 days: compact to semantic changesets (lossy but auditable)
After 1 year: retain only hourly snapshots
Selective operation pruning (safe to prune):
- Intermediate cursor moves (keep only final positions)
- Intermediate typing states (keep only committed versions)
- Superseded property sets (keep only final value)
NOT prunable: structural operations (inserts, deletes, moves)
AI agents are modeled identically to users in the operation pipeline:
- Each agent has a userId (synthetic: "agent:summarize:doc_id")
- Agents emit BlockOperations through the same CRDT pipeline
- Agents have presence (visible in "who's editing" list)
- Agent operations carry metadata: reason, confidence, provenance
What's different about agents:
- Operations may be ASYNC (no user waiting for response)
- Agents can batch 10-100 operations in a single transaction
- Agents can be PAUSED if they exceed rate limits
- Agent operations go through a policy validator before application
[User Action] → [Policy Engine] → [Agent Trigger Evaluation]
↓ (if triggers)
[Agent Job Queue (Oban)]
↓ (async, after delay)
[Agent Worker Process]
↓ (generates N ops)
[Agent Op Batch]
↓
[Policy Validator] — rejects harmful ops
↓ (validated)
[BlockPartitionServer] ← same path as user ops
↓
[Broadcast to subscribers]
[Label: "🤖 Agent Name made changes"]
Problem: User edit → triggers AI agent → 100 ops → triggers more agents → cascade
Solution: Operation Budget System
type OperationContext = {
originOpId: string; // The user op that started this chain
depth: number; // How many agent hops deep we are
opsRemaining: number; // Budget remaining in this chain
}
Rules:
- Root user operation: budget = 500 total downstream ops
- Each agent hop decrements budget by its op count
- If budget exhausted: agent op queued for next user save
- Max depth: 3 agent hops (user → agent1 → agent2 → agent3, then stop)
Implementation:
- Budget tracked in ETS (fast, per-node)
- Budget keyed by originOpId
- Agents check budget before executing
- Budget expires after 30 seconds (TTL)
Scenario: User is editing Block-7 while agent is also modifying Block-7
Strategy: Optimistic agent, user always wins
1. Agent generates ops with hlc timestamp at time of agent start
2. User edit arrives with later hlc
3. CRDT merge: user edit applies on top of agent state
4. Agent's text edits: Yjs handles character-level merge correctly
5. Agent's structure ops: if agent moved Block-7, but user edited Block-7 content,
move is preserved, content edit applies to moved block (correct behavior)
Edge case: Agent deleted Block-7, user edited Block-7 simultaneously
- CRDT rule: Yjs tombstones the text, block delete wins
- But: user's edits are captured as a pending op in the deleted block's history
- Policy: if block deleted within 5s of user edit, offer "restore with edits"
The frontend document renderer must handle:
- 2000+ blocks without DOM explosion
- Continuous CRDT updates arriving at ~60hz
- Cursor overlays for up to 1000 users (clustered)
- Optimistic local edits with rollback capability
Architecture:
┌─────────────────────────────────────────┐
│ Document View (SvelteKit) │
│ │
│ VirtualizedBlockList │
│ ├── renders only visible blocks │
│ ├── overscan: 10 blocks above/below │
│ └── block height estimation cache │
│ │
│ CursorOverlay (canvas-based) │
│ ├── NOT DOM elements (too slow at 100+)│
│ ├── Canvas 2D, 60fps animation frame │
│ └── clustered rendering for >10 cursors│
│ │
│ CRDT Worker (Web Worker) │
│ ├── runs Yjs off main thread │
│ ├── applies remote ops asynchronously │
│ └── posts reconciliation patches │
└─────────────────────────────────────────┘
<!-- VirtualizedBlockList.svelte -->
<script>
import { onMount } from 'svelte';
import { blockStore } from '$lib/stores/blocks';
let containerHeight = 0;
let scrollTop = 0;
let estimatedHeights = new Map(); // blockId → height
// Only render blocks in viewport + overscan
$: visibleBlocks = computeVisibleBlocks(
$blockStore.order,
scrollTop,
containerHeight,
estimatedHeights,
overscan = 15 // render 15 blocks beyond viewport edge
);
// Notify server of viewport change (throttled)
$: debouncedViewportUpdate(visibleBlocks.map(b => b.id), 500);
// Optimistic update: apply local edit immediately
function applyLocalEdit(blockId, delta) {
blockStore.optimisticUpdate(blockId, delta);
// Send to server via WebSocket
channel.push('operation', buildOp(blockId, delta));
}
// Rollback on server rejection
channel.on('op_rejected', ({ opId, reason }) => {
blockStore.rollback(opId);
showConflictNotification(reason);
});
</script>
<div bind:this={container} on:scroll={handleScroll}>
<!-- Spacer for blocks above viewport (maintains scroll position) -->
<div style="height: {topSpacerHeight}px" />
{#each visibleBlocks as block (block.id)}
<BlockRenderer {block} />
{/each}
<!-- Spacer for blocks below viewport -->
<div style="height: {bottomSpacerHeight}px" />
</div>// crdt-worker.js — runs off main thread
import * as Y from 'yjs';
const docs = new Map(); // blockId → Y.Doc
self.onmessage = ({ data }) => {
const { type, payload } = data;
switch (type) {
case 'INIT_BLOCK': {
const ydoc = new Y.Doc();
if (payload.snapshot) {
Y.applyUpdateV2(ydoc, payload.snapshot);
}
docs.set(payload.blockId, ydoc);
ydoc.on('update', (update, origin) => {
if (origin !== 'remote') {
// Local edit: send to server
self.postMessage({ type: 'LOCAL_UPDATE', blockId: payload.blockId, update });
}
});
break;
}
case 'REMOTE_UPDATE': {
const ydoc = docs.get(payload.blockId);
if (ydoc) {
Y.applyUpdateV2(ydoc, payload.update, 'remote');
// Post reconciled state back to main thread
const text = ydoc.getText('content').toString();
self.postMessage({ type: 'BLOCK_RECONCILED', blockId: payload.blockId, text });
}
break;
}
case 'LOCAL_EDIT': {
const ydoc = docs.get(payload.blockId);
if (ydoc) {
ydoc.transact(() => {
// Apply local delta to Y.Doc
applyDelta(ydoc.getText('content'), payload.delta);
});
}
break;
}
}
};// CursorCanvas.ts — renders all cursors on HTML Canvas
class CursorCanvas {
private ctx: CanvasRenderingContext2D;
private cursors: Map<string, CursorState> = new Map();
private blockPositions: Map<string, DOMRect> = new Map();
updateCursors(batch: CursorBatch) {
for (const [userId, cursor] of Object.entries(batch)) {
this.cursors.set(userId, cursor);
}
this.scheduleRender();
}
private render() {
const { ctx } = this;
ctx.clearRect(0, 0, this.canvas.width, this.canvas.height);
// Cluster cursors by block
const clusters = this.clusterCursors(this.cursors, this.blockPositions);
for (const cluster of clusters) {
if (cluster.size === 1) {
// Render individual cursor caret
const [userId, cursor] = [...cluster.entries()][0];
this.drawCaret(cursor, getColor(userId));
this.drawLabel(cursor, getUserName(userId));
} else {
// Render cluster badge: "42 users"
this.drawClusterBadge(cluster.centroid, cluster.size);
}
}
}
// Performance: only re-render when cursors change
private scheduleRender = throttle(() => requestAnimationFrame(() => this.render()), 16);
}Context: Node 4 hosts partition servers for partitions 0–7
All 1000 users are subscribed to some of these
Node 4 crashes suddenly
Timeline:
T+0ms : Node 4 crashes, Erlang distribution detects within 500ms
T+500ms : Surviving nodes receive :nodedown from Distributed Erlang
T+500ms : OTP supervisors on surviving nodes detect missing processes
T+600ms : Document Coordinator detects missing partition servers
T+700ms : DynamicSupervisor spawns replacement processes on Node 5
T+800ms : New processes load state from database (latest snapshot + op replay)
Fast path: < 50 ops since snapshot → < 300ms replay
T+1100ms: Processes restored, begin accepting operations
T+1100ms: Clients receive reconnect signal, push pending ops
T+1500ms: Full recovery
Client experience:
- 1-2 second pause in receiving remote updates
- Their own edits are buffered client-side (WebSocket reconnect)
- On reconnect: client sends operations with original HLC timestamps
- Server replays and merges with what happened during outage
- No data loss (client buffers unsent ops)
Context: 3-node cluster, nodes split into [Node1,2] and [Node3]
Document Coordinator running on Node1
Node3 side:
- Cannot reach Document Coordinator
- Falls back to "degraded mode": accept local edits, buffer for sync
- CRDTs ensure these edits are eventually mergeable
- Users on Node3 see their own edits but not Node1/2's edits
Node1/2 side:
- Continue normally
- Don't know about Node3 edits
Partition heals:
- Node3 reconnects to cluster
- Node3's buffered operations sent to Document Coordinator
- CRDT merge applied: concurrent edits merged deterministically
- All clients reconcile to same state
- Structure ops (move_block, etc.) resolved by HLC ordering
Risk: if partition lasts > snapshot interval (10 min), Node3 may diverge significantly
- Large merge operation required
- May show "conflict resolution" UI to users in extreme cases
Context: One user on 2G mobile, receiving messages slowly
Their WebSocket channel's send buffer fills up
Phoenix default: process suspend when buffer full
Problem:
- UserSession process for slow client accumulates backpressure
- If synchronous, it blocks the process
- In Elixir: process mailbox can grow unbounded → OOM
Solution: Separate fast and slow paths
defmodule UserSession do
def handle_info({:op_broadcast, op}, state) do
case :gen_tcp.send(state.socket, encode(op)) of
:ok ->
{:noreply, state}
{:error, :timeout} ->
# Client is slow: drop non-critical messages
if droppable?(op) do
{:noreply, state} # Drop cursor updates, presence
else
# For structural ops: mark client as lagging
{:noreply, %{state | lagging: true, pending_ops: [op | state.pending_ops]}}
end
end
end
# Types that can be safely dropped for slow clients
defp droppable?({:cursor_move, _}), do: true
defp droppable?({:presence_update, _}), do: true
defp droppable?({:viewport_sync, _}), do: true
defp droppable?(_), do: false # Structure ops: never drop
end
Context: Compliance agent triggered by large document import
Agent generates 10,000 operations instead of expected 100
Detection:
- OperationBudget system detects budget exceeded
- Agent's ops rate > 500 ops/sec for > 5 seconds → trigger alert
- CircuitBreaker trips for this agent type
Response:
1. Suspend agent's session (kill agent GenServer)
2. Mark last 1000 ops as "agent:runaway" in operation log
3. Rollback last N agent ops (replay log without agent ops)
4. Notify document owner
5. Circuit breaker: agent type paused for 1 hour
Rollback implementation:
- Filter operation log: exclude ops with agent_session_id
- Replay filtered log from last known-good snapshot
- This is expensive (full partition replay) but correctness is maintained
Document: 2000 blocks, 20 partitions (100 blocks each)
Users: 1000 concurrent editors
Session: 6 hours/day sustained
─── INBOUND TRAFFIC ───
User operations: 1000 × 1.5 ops/sec = 1,500 ops/sec
Cursor events: 1000 × 3 updates/sec = 3,000/sec
Viewport changes: 1000 × 1 update/sec = 1,000/sec
AI agent ops: 150 agents × 20 ops/sec = 3,000/sec
Total inbound: ~8,500 events/sec
─── OUTBOUND (after optimizations) ───
Structure ops: 1,500/sec × 50 avg recipients = 75,000/sec
Cursor events: 3,000/sec × 50 avg visible cursors = 150,000/sec
AI agent ops: 3,000/sec × 50 avg recipients = 150,000/sec
Total outbound: ~375,000 messages/sec
Bandwidth (200 bytes avg):
Inbound: 8,500 × 200B = 1.7 MB/sec
Outbound: 375,000 × 200B = 75 MB/sec → with gzip: ~15 MB/sec
Per node (6 nodes): 2.5 MB/sec — very manageable
─── PROCESS COUNT ───
WebSocket sessions: 1,000 processes
Partition servers: 20 processes per document
Document coordinator: 1 process per document
Presence servers: 1 process per document
Supporting processes: ~50
Total per hot document: ~1,072 processes
BEAM handles 1M+ processes — this is trivial
─── MEMORY ───
WebSocket sessions: 1,000 × 50KB = 50MB
Partition servers: 20 × 5MB (CRDT state) = 100MB
Yjs snapshots per block: 2000 blocks × 5KB = 10MB (in ETS)
Operation buffer: 20 partitions × 2MB pending ops = 40MB
Phoenix PubSub subs: 1,000 subs × 1KB = 1MB
Total per document: ~200MB RAM per hot document
─── DATABASE ───
Operation log writes: 8,500 events/sec batched → 85 batches/sec
Each batch: 100 ops × 200B = 20KB → 1.7MB/sec to Postgres
Postgres throughput: typically 100MB/sec+ → well within limits
Operation log growth: 1.7MB/sec × 86400 = 147GB/day (full 1000-user day)
→ Compress with ZSTD → ~15GB/day at 10:1 ratio
─── CPU ───
CRDT operations: 8,500/sec × 1ms avg = 8.5 CPU-core-seconds/sec
→ 9 cores dedicated to CRDT computation (oversized for safety)
WebSocket I/O: 375,000 msg/sec → ~2 cores (BEAM handles I/O efficiently)
Serialization: 8,500 msg/sec encode × 0.1ms = 0.85 core
Total: ~12 cores for 1 document cluster
─── CLUSTER SIZING ───
For 1 hot document: 6 nodes × 4 cores × 32GB = adequate with headroom
For 100 hot documents at 1000 users each: 50 nodes × 16 cores × 64GB
→ Assuming 50% node sharing across documents
| System | Concurrent Users/Doc | Sync Protocol | Notes |
|---|---|---|---|
| Google Docs | ~100 practical | OT | Degrades above 50 |
| Figma | ~100 | Live Graph (custom CRDT) | Whiteboard, not doc |
| Notion | ~50 practical | Custom OT | Blocks, not full CRDT |
| Discord (channel) | 500k viewers | No CRDT (chat) | Read-mostly |
| MarkdownOffice | 1,000 | Block CRDT | This design |
The 1000-editor steady-state is 10x beyond Google Docs current limits. It is achievable with the architecture above, but requires:
- Viewport-based fanout filtering (mandatory, not optional)
- Partition-based block sharding (mandatory)
- Cursor canvas rendering (mandatory for >50 cursors)
- AI agent rate limiting (mandatory)
| Decision | What You Gain | What You Give Up |
|---|---|---|
| Block-level CRDT (not char-level) | Structure correctness, semantic ops | Must implement custom tree CRDT |
| Yjs on frontend (server as relay) | Fast text editing, proven library | Server cannot understand text ops |
| Viewport filtering | 20x reduction in fanout | Users miss ops for off-screen blocks briefly |
| LWW for block properties | Simple, fast | Last write wins, not "smartest merge" |
| Partition-based sharding | Horizontal scale | Cross-partition ops require coordination |
| Canvas cursors (not DOM) | 100+ cursor performance | Accessibility harder to implement |
| Batch DB writes (100ms window) | Database survives load | 100ms durability window (ops can be lost in crash) |
Viewport filtering means: a user editing off-screen will not receive
updates from off-screen areas. When they scroll there:
1. Request "catch-up" from server (latest partition state)
2. Server sends snapshot + ops since client's last HLC for that partition
3. Client applies catch-up (typically < 100 ops → near-instant)
This is identical to how Figma handles canvas regions and
how Google Docs handles large spreadsheets.
MarkdownOffice uses: Causal Eventual Consistency
NOT: strong consistency (too slow, requires coordination)
NOT: eventual consistency without ordering (text would be garbled)
Causal consistency guarantees:
- If you see op B, you've already seen op A if B causally depends on A
- If User A edits block, then replies in comment, other users see
the edit BEFORE the comment (causal ordering preserved)
- No forward references to operations not yet received
What it does NOT guarantee:
- Real-time ordering (operations may arrive out of real-time order)
- You see all operations immediately after they're made
- For a document with 1000 editors, your view may be 50-200ms behind
for off-screen areas (acceptable for collaboration)
TRANSPORT LAYER
WebSocket via Phoenix Channels
TLS + compression (permessage-deflate)
4 WS nodes, load-balanced
SESSION LAYER
1 GenServer per connection (UserSession)
Handles: auth, rate limiting, viewport state, message routing
DOCUMENT LAYER
1 DocumentCoordinator per document (Horde.Registry, cluster-wide unique)
Partitioned BlockPartitionServers (20 per large document)
CRDT state in-memory, snapshotted to Postgres
PRESENCE LAYER
Phoenix.Presence for join/leave/heartbeat
Custom PresenceServer for cursor/viewport sync
Batched 10hz broadcast, viewport-filtered delivery
Canvas-based cursor rendering on frontend
CRDT LAYER
Block tree: Custom Move Tree CRDT in Elixir
Block ordering: Fractional Index (LSEQ variant)
Text within block: Yjs (frontend) + opaque relay (server)
Properties: LWW-Register with HLC
AI AGENT LAYER
Agents as synthetic users in same operation pipeline
Oban job queue for async execution
Operation budget system (max 500 downstream ops per user trigger)
Circuit breakers per agent type
STORAGE LAYER
PostgreSQL: operation log (event source) + snapshots
Redis Streams: write buffer (100ms batch window)
ETS: hot CRDT state cache, viewport registry
S3: archived operation logs (> 30 days), Yjs binary snapshots
FRONTEND LAYER
SvelteKit + Svelte 5 (fine-grained reactivity)
Virtualized block list (only visible blocks rendered)
Web Worker for Yjs CRDT processing
Canvas overlay for cursors (not DOM)
Optimistic updates with HLC-based rollback
Phase 1: Foundation (months 1-3)
✓ BlockPartitionServer with basic CRDT (insert/delete only)
✓ Phoenix Channels + basic fanout
✓ Yjs integration (text within blocks)
✓ Snapshot + replay
Phase 2: Scale (months 4-6)
✓ Viewport subscription system
✓ Cursor batching + canvas rendering
✓ Move Tree CRDT (reparent operations)
✓ Cluster distribution (Horde)
Phase 3: Intelligence (months 7-9)
✓ AI agent pipeline
✓ Operation budget system
✓ Cross-document dependency graph
✓ Policy validation layer
Phase 4: Production hardening (months 10-12)
✓ Operation log compaction
✓ Network partition tests (Jepsen-style)
✓ Circuit breakers everywhere
✓ Full observability (telemetry + tracing)
If any of these are compromised, the system cannot reach 1000 users/document:
- Viewport-based fanout filtering — without this, outbound bandwidth is 50x too high
- Block partition sharding — without this, single-process bottleneck caps at ~100 users
- AI agent rate limiting — without this, one triggered agent can flood the system
- CRDT for structure, not OT — at 1000 users, OT serialization at a central server cannot keep up
- Canvas cursors, not DOM — at 50+ cursors, DOM-based rendering causes layout thrash and jank
- HLC timestamps everywhere — without causal ordering, CRDT convergence is not guaranteed
Analysis depth: distributed systems, BEAM/Erlang, CRDT theory, structured document semantics
Reference systems: Google Docs, Figma, Linear, Discord, Liveblocks, Yjs, Automerge, Kleppmann et al.