Audience: Senior Software Engineers, Systems Architects, Computer Scientists
Version: v1.1.0 (feat/pdf-parser)
Status: Current
- The Fundamental Problem
- Memory Management and Static Allocation
- Zero-Copy Architecture
- Data Invariants and the Retreat Algorithm
- Stability Guarantees via Property-Based Testing
- Accepted Trade-offs
Ingestion tools built on managed runtimes (Python/LangChain, Java/LlamaIndex JVM) share a pathological allocation pattern: each document transits through multiple intermediate heap representations before producing a usable chunk.
A typical Python pipeline:
bytes (disk) → str (UTF-8 decode) → List[str] (split) → List[str] (chunk) → List[str] (overlap)
Each arrow represents a new, independent heap allocation. For a 10 MB document, the pipeline may hold 3–5× the input size simultaneously as Python objects — all subject to reference counting and the GIL. CPython's GC was designed for interactive latency, not batch throughput.
In Go, the problem persists differently. Libraries using strings.Split, regexp.FindAllString, or json.Unmarshal → json.MarshalIndent materialize full intermediate representations. A naive Go pipeline for JSON still produces:
bytes (io.Reader) → interface{} (Unmarshal tree) → string (MarshalIndent) → []byte (chunk)
The interface{} tree from json.Unmarshal is the core offender: each node is a separate heap allocation, generating GC pressure proportional to node count — not byte count.
The zero external dependencies constraint is not ideological — it is architectural. Any external dependency introduces a foreign allocation model that cannot be controlled. The objective was a verifiable guarantee: constant post-GC residual heap independent of input volume.
This is only achievable when you own the lifecycle of every allocation on the critical path.
sync.Pool is commonly used superficially: get a buffer, use it, return it. The problem with this naive pattern is that the return point rarely coincides with the end of the data's useful lifetime.
In GopherDoc, the buffer allocated by the parser cannot be returned immediately after parsing because it is the backing memory for the emitted chunks. Chunks are sub-slices of the original buffer — returning it to the pool while chunks are still in-flight is a classic race condition with silent data corruption consequences.
The solution is a transferable ownership model: the buffer travels with the data to its last consumer.
// Document carries the backing buffer as an explicit field.
// PoolBuf is nil on all chunks except the last.
type Document struct {
ID string
Content []byte // sub-slice of PoolBuf
Metadata map[string]any
PoolBuf *bytes.Buffer // nil except on the last chunk
}
func (d *Document) Release() {
if d.PoolBuf != nil {
pool.PutBuffer(d.PoolBuf)
d.PoolBuf = nil
}
}The pipeline emits N chunks per document. Only the last chunk needs to carry the responsibility of returning the buffer. The implementation uses a prev accumulator to detect the last element of an iterator without lookahead:
var prev *document.Document
for chunk := range chunks { // iter.Seq — pull without buffering
if prev != nil {
emit(ctx, out, Result{Doc: prev}) // prev is not last: PoolBuf = nil
}
prev = chunk
}
if prev != nil {
prev.PoolBuf = doc.PoolBuf // ownership transfer
emit(ctx, out, Result{Doc: prev})
}This pattern is required because iter.Seq (Go 1.23+) does not expose "last element" semantics — the only way to know an element is last is to have seen the next one.
The 2.4 MB post-GC residual heap is a direct consequence of this model: content buffers never accumulate on the heap. Each buffer has exactly one live owner at a time, and that owner returns the buffer to the pool before going out of scope. The pool maintains a bounded number of reusable buffers (64 KB initial capacity, discarded if > 1 MB after growth), ensuring the pool's working set converges to a fixed value regardless of volume processed.
| Model | Steady-state heap for 2.1 GB input |
|---|---|
| Naive (new buffer per document) | Proportional to parallelism × average document size |
| Pool without explicit ownership | Race conditions → corruption or leaks |
| GopherDoc (transferable ownership) | Constant: 2.4 MB |
In Go, a slice is a three-field struct: {ptr *T, len int, cap int}. A sub-slice content[a:b] creates a new three-field struct pointing to the same underlying memory — O(1) cost, zero bytes copied.
content: [ h e l l o w o r l d ]
0 1 2 3 4 5 6 7 8 9 10 11
chunk[0].Content: content[0:5] → ptr=&content[0], len=5
chunk[1].Content: content[3:11] → ptr=&content[3], len=8 (2-byte overlap)
No content bytes are copied between parse and final consumer. The only allocation per chunk is the Document struct itself — hence the guarantee of 1 allocation per chunk for text and Markdown formats.
JSON is the only format where zero-copy does not apply at the parse layer. json.Unmarshal requires materializing the value tree to validate structure before any output. The adopted strategy minimizes the damage: use json.NewEncoder writing directly into a pooled buffer, avoiding the intermediate string allocation of json.MarshalIndent:
buf := pool.GetBuffer()
enc := json.NewEncoder(buf)
enc.SetIndent("", " ")
enc.Encode(v) // writes directly into the pooled bufferThe allocation ratio for JSON converges to ~3.1× input (1× for the Unmarshal tree + ~2× for encoder output). Going below this baseline without reimplementing the JSON parser is not possible — an explicitly accepted trade-off.
A naive fixed-offset chunker fails in two scenarios:
1. Cutting inside a multi-byte rune. UTF-8 encodes codepoints U+0080 and above in 2–4 bytes. A cut at an arbitrary position may separate continuation bytes (10xxxxxx) from the start byte (11xxxxxx), producing invalid sequences that silently corrupt downstream embeddings.
2. Cutting inside a word. LLM tokenizers use BPE (Byte Pair Encoding) — tokens are derived from common substrings in running text. Cutting "retriev|al" produces two degraded tokens instead of one semantically rich token, measurably reducing embedding quality.
The algorithm operates in two passes over the buffer, both O(k) where k ≤ chunkSize:
Pass 1 — Retreat chunk end:
Given: content[pos:end] as chunk candidate
IF content[end] or content[end-1] is whitespace:
return end // clean boundary, no retreat needed
ELSE:
scan content[end-1 .. pos] for whitespace
IF found at i: return i+1
// Fallback: no whitespace in range — align to UTF-8 rune boundary
WHILE end > pos AND NOT utf8.RuneStart(content[end]):
end--
return end
utf8.RuneStart(b) returns true if and only if b & 0xC0 != 0x80 — the byte is not a UTF-8 continuation byte. This guarantees every cut lands on a valid rune start.
Pass 2 — Retreat overlap start:
The overlap rewinds overlapSize bytes from the end of the previous chunk. This point may land inside a word, producing a chunk with a semantically degraded start:
Given: target = end - overlapSize
IF content[target-1] is whitespace: return target // clean start
ELSE:
scan content[target-1 .. minPos] for whitespace
IF found at i: return i+1
// Fallback: no boundary in range — return end (progress guaranteed)
return fallback
The fallback to end (not target) is critical: it guarantees monotonic progress of pos even in dense content without ASCII whitespace (e.g., CJK text without separators), preventing infinite loops.
| Property | Guarantee |
|---|---|
| Complexity per chunk | O(chunkSize) — retreat bounded to the window |
| Total complexity | O(n) — each byte visited at most 2× (chunk + overlap) |
| UTF-8 validity | Guaranteed by utf8.RuneStart in the fallback path |
Monotonicity of pos |
next > pos always, by construction of the fallback |
| No bytes dropped | Verified by property-based testing (see §5) |
Known limitation: chunkSize < 4 on 100% multi-byte content without whitespace (e.g., pure Kanji) can approximate loop behavior — retreat does not converge if end - pos < 4 and no RuneStart is found in range. Documented; recommended mitigation: chunkSize ≥ 16 for CJK corpora.
Unit tests verify specific known cases. For a chunker, this is insufficient: the relevant input space includes all combinations of (content, chunkSize, overlapSize) — a combinatorial space that manual tests cannot cover.
The core property to verify is not "the output for this specific input is correct" but rather "for any valid input, structural invariants hold".
Property 1 — No bytes dropped:
For any position-encoded content (content[i] = '!' + (i % 94), spaces at i%7==6), the union of all chunks must contain each input position exactly once (excluding overlap bytes, which appear in two consecutive chunks).
Verification uses pointer arithmetic to determine the exact offset of each sub-slice within the parent buffer — eliminating the ambiguity of bytes.Equal on content with repeated bytes:
func chunkOffset(parent, sub []byte) int {
if len(sub) == 0 { return 0 }
return int(reflect.ValueOf(sub).Pointer() - reflect.ValueOf(parent).Pointer())
}Property 2 — chunkSize respected:
∀ chunk: len(chunk.Content) ≤ chunkSize
Retreat only reduces chunk size — never increases it. Verified across 200 runs with chunkSize ∈ [16, 512].
Property 3 — UTF-8 integrity:
∀ chunk: utf8.Valid(chunk.Content) == true
Verified with content containing multi-byte runes (U+00E9, U+4E2D, U+1F600) at all offset combinations.
Property 4 — Exact overlap:
The last overlapSize bytes of chunk N are identical to the start of chunk N+1. Verified as sub-slice identity (same pointer, not bytes.Equal).
Property 5 — Zero-copy confirmed:
chunk.Content is a sub-slice of the original buffer — reflect.ValueOf(chunk.Content).Pointer() must fall within [base, base+len(original)). Any violation indicates an unintended copy.
200 runs do not provide exhaustive coverage — they provide high-probability detection of systematic failures. If a property fails on a fraction p of the input space, the probability that 200 consecutive runs do not detect the failure is (1-p)^200. For p = 0.01 (1% of inputs are buggy), this probability is 0.134% — acceptable for non-safety-critical infrastructure.
Position-encoded content was chosen specifically to eliminate false negatives from coincidence: with unique bytes per position, any reordering, drop, or duplication is detectable with certainty.
The transferable ownership model (PoolBuf on the last chunk, Release() mandatory for the consumer) imposes a non-trivial contract on the public API. A consumer that forgets to call Release() does not cause a crash — it causes a gradual pool leak (buffers do not return to the pool; new allocations compensate). This is detectable via heap profiling but not by the compiler.
Alternative considered: automatic return via runtime.SetFinalizer. Discarded because finalizers in Go are non-deterministic with respect to timing — they would introduce unpredictable GC latency on exactly the path being made predictable.
Decision: accept ownership complexity in exchange for determinism.
Normalizing JSON requires structural validation before any output — it is not possible to produce well-formed JSON in streaming without materializing the value tree. The 3.1× ratio is the theoretical minimum for any correct implementation in Go using encoding/json.
Alternative considered: a streaming JSON formatter that normalizes indentation without Unmarshal. Discarded due to implementation complexity and correctness risk on edge cases (arbitrary-precision numbers, Unicode escapes, nested structures).
Decision: accept 3.1× for JSON; document as the worst-case upper bound.
The retreat algorithm assumes ASCII whitespace exists at sufficient frequency to find a boundary within the retreat window. In pure CJK corpora, this assumption fails.
Alternative considered: full Unicode word boundary detection (Unicode Standard Annex #29). Discarded for v1.x due to implementation cost and the zero external dependencies constraint — a complete UAX#29 implementation requires non-trivial lookup tables.
Decision: document the limitation, recommend chunkSize ≥ 16, schedule for v2.0 via deterministic finite automaton over per-rune Unicode properties.
Parsers use a pooled bufio.Reader (64 KB) over io.LimitReader. This eliminates any possibility of seek or re-read on the stream. If a parser requires lookahead beyond the bufio.Reader internal buffer, the abstraction breaks.
For the supported formats (Markdown line-by-line, CSV line-by-line, JSON via json.Decoder, PlainText line-by-line), 64 KB of lookahead is more than sufficient. For future formats requiring seek (e.g., binary formats with an index at end-of-file), pooling bufio.Reader will be insufficient and will require a redesign of the parser contract.
Decision: accept the limitation for v1.x. The io.Reader contract on the Parser interface is intentional — it is not io.ReadSeeker.
Implementation reference: github.com/VesperArch/GopherDoc @ v1.1.0