This document describes the design for storing file attachments associated with MindooDB documents. The approach uses a unified ContentAddressedStore interface that serves both Automerge document changes and attachment chunks, providing secure, efficient, and flexible storage with deduplication, optional local caching, random access, and transparent synchronization.
After exploring various options, we chose to unify document changes and attachment storage under a single ContentAddressedStore interface. This provides:
- Consistent interface: Same store interface for documents and attachments
- Flexible deployment: MindooDB can use separate stores for docs and attachments
- Shared infrastructure: Reuses existing sync, encryption, and authentication patterns
- Type differentiation: Uses the entryType field to distinguish between entry types
The ContentAddressedStore stores entries with a unique id and a contentHash for deduplication. The entryType field distinguishes between:
- doc_create: Document creation (first Automerge change)
- doc_change: Document modification (subsequent Automerge changes)
- doc_snapshot: Automerge snapshot for performance optimization
- doc_delete: Document deletion (tombstone entry)
- attachment_chunk: File attachment chunk
Store implementations can use this field to optimize storage (e.g., inline small doc changes, external storage for large attachment chunks).
Each store entry has two distinct identifiers:
- id: Unique identifier (primary key) for the entry
  - For doc_* entries: <docId>_d_<depsFingerprint>_<automergeHash>
  - For attachment_chunk: <docId>_a_<fileUuid7>_<base62ChunkUuid7>
- contentHash: SHA-256 hash of the encrypted data
  - Used for storage-level deduplication
  - Multiple entries can share the same contentHash
This separation enables:
- Unique metadata per entry (even when content is identical)
- Storage-level deduplication (same bytes stored once)
- No metadata collisions when files share content
Document Entry ID Format:
<docId>_d_<depsFingerprint>_<automergeHash>
- docId: Document UUID7
- d: Type marker for "document"
- depsFingerprint: First 8 chars of SHA256(sorted Automerge deps), or "0" if no deps
- automergeHash: The Automerge change hash
Attachment Chunk ID Format:
<docId>_a_<fileUuid7>_<base62ChunkUuid7>
- docId: Document UUID7 this attachment belongs to
- a: Type marker for "attachment"
- fileUuid7: UUID7 for the whole file (same for all chunks)
- base62ChunkUuid7: Base62-encoded UUID7 for this specific chunk
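The two ID formats above can be sketched as small helper functions. This is a minimal illustration rather than MindooDB's actual utilities: the function names, the "," separator used when hashing the sorted deps, and the base62 alphabet order (digits, uppercase, lowercase) are all assumptions, since the design only fixes the overall layout of the IDs.

```typescript
import { createHash } from "node:crypto";

// Assumed alphabet order; the design does not specify one.
const BASE62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

// First 8 hex chars of SHA-256 over the sorted deps, or "0" when there are none.
// Assumption: deps are joined with "," before hashing.
function depsFingerprint(deps: string[]): string {
  if (deps.length === 0) return "0";
  const joined = [...deps].sort().join(",");
  return createHash("sha256").update(joined).digest("hex").slice(0, 8);
}

// Encode a hex string (hyphens ignored, so UUIDs work) in base62 by
// repeatedly dividing the big-endian base-16 digit array by 62.
function uuidToBase62(uuid: string): string {
  let digits = Array.from(uuid.replace(/-/g, ""), (c) => parseInt(c, 16));
  let out = "";
  while (digits.length > 0) {
    let remainder = 0;
    const quotient: number[] = [];
    for (const d of digits) {
      const acc = remainder * 16 + d;
      const q = Math.floor(acc / 62);
      remainder = acc % 62;
      if (quotient.length > 0 || q > 0) quotient.push(q); // skip leading zeros
    }
    out = BASE62.charAt(remainder) + out;
    digits = quotient;
  }
  return out || "0";
}

function docEntryId(docId: string, deps: string[], automergeHash: string): string {
  return `${docId}_d_${depsFingerprint(deps)}_${automergeHash}`;
}

function attachmentChunkId(docId: string, fileUuid7: string, chunkUuid7: string): string {
  return `${docId}_a_${fileUuid7}_${uuidToBase62(chunkUuid7)}`;
}

// Split an ID into its four parts. Splitting on "_" is safe because UUIDs use
// hyphens, and neither hex nor base62 output contains an underscore.
function parseEntryId(id: string): { docId: string; kind: "d" | "a"; segment3: string; segment4: string } {
  const parts = id.split("_");
  if (parts.length !== 4) throw new Error(`Malformed entry ID: ${id}`);
  const [docId, kind, segment3, segment4] = parts;
  if (kind !== "d" && kind !== "a") throw new Error(`Unknown type marker in: ${id}`);
  return { docId, kind, segment3, segment4 };
}
```

For a document entry, segment3 and segment4 hold the deps fingerprint and the Automerge hash; for an attachment chunk they hold the file UUID7 and the base62 chunk ID.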
MindooDB accepts two store instances in its constructor:
declare class MindooDB {
  constructor(
    tenant: MindooTenant,
    docStore: ContentAddressedStore,
    attachmentStore?: ContentAddressedStore
  );
  getStore(): ContentAddressedStore; // Returns docStore
  getAttachmentStore(): ContentAddressedStore | undefined; // Returns attachmentStore
}
This enables flexible deployment options:
- Single store: Use one store for both documents and attachments (simple deployments)
- Separate stores: Use different stores/backends (e.g., local docs, cloud attachments)
- No attachments: Don't configure an attachment store for document-only use cases
The store deduplicates content at the storage level:
- Metadata: Stored by id (always unique)
- Content: Stored by contentHash (deduplicated)
When two entries have the same contentHash:
- Both entries have their own metadata (different id, docId, dependencyIds)
- The encrypted bytes are stored only once
- When an entry is deleted, orphaned content is cleaned up
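The behavior described above (metadata keyed by id, bytes keyed by contentHash, orphan cleanup on delete) can be illustrated with a reference-counted in-memory store. This is a sketch under assumptions: the class name DedupStore and its methods are hypothetical and are not the API of InMemoryContentAddressedStore.

```typescript
import { createHash } from "node:crypto";

interface EntryMeta {
  id: string;
  docId: string;
  contentHash: string;
}

// Hypothetical store for illustration only.
class DedupStore {
  private meta = new Map<string, EntryMeta>();      // keyed by id (always unique)
  private content = new Map<string, Uint8Array>();  // keyed by contentHash (deduplicated)
  private refCount = new Map<string, number>();     // entries referencing each contentHash

  put(id: string, docId: string, encryptedData: Uint8Array): EntryMeta {
    const existing = this.meta.get(id);
    if (existing) return existing; // same id again: nothing to do
    const contentHash = createHash("sha256").update(encryptedData).digest("hex");
    if (!this.content.has(contentHash)) this.content.set(contentHash, encryptedData);
    this.refCount.set(contentHash, (this.refCount.get(contentHash) ?? 0) + 1);
    const entry: EntryMeta = { id, docId, contentHash };
    this.meta.set(id, entry);
    return entry;
  }

  getData(id: string): Uint8Array | undefined {
    const entry = this.meta.get(id);
    return entry ? this.content.get(entry.contentHash) : undefined;
  }

  delete(id: string): void {
    const entry = this.meta.get(id);
    if (!entry) return;
    this.meta.delete(id);
    const remaining = (this.refCount.get(entry.contentHash) ?? 1) - 1;
    if (remaining <= 0) {
      // No entry references these bytes any more: clean up the orphan.
      this.refCount.delete(entry.contentHash);
      this.content.delete(entry.contentHash);
    } else {
      this.refCount.set(entry.contentHash, remaining);
    }
  }

  storedBlobCount(): number {
    return this.content.size;
  }
}
```

Two entries with identical encrypted bytes keep separate metadata but share one stored blob, and the blob disappears only when the last referencing entry is deleted.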
Chunk Size: 256KB per chunk (configurable)
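As a rough illustration of fixed-size chunking, the sketch below splits a buffer into 256KB pieces and maps a byte range to the chunk indexes it touches, which is what makes random access cheap. The function names are hypothetical, endByte is assumed to be exclusive, and the range mapping assumes every chunk except the last is full-size (which may not hold after appends, where originalSize would be needed instead).

```typescript
const CHUNK_SIZE = 256 * 1024; // default chunk size from the design (configurable)

// Split data into chunks of at most chunkSize bytes (views, no copying).
function splitIntoChunks(data: Uint8Array, chunkSize: number = CHUNK_SIZE): Uint8Array[] {
  const chunks: Uint8Array[] = [];
  for (let offset = 0; offset < data.length; offset += chunkSize) {
    chunks.push(data.subarray(offset, Math.min(offset + chunkSize, data.length)));
  }
  return chunks;
}

// Indexes of the chunks that overlap the half-open range [startByte, endByte).
// Assumption: all chunks except possibly the last are exactly chunkSize long.
function chunkIndexesForRange(
  startByte: number,
  endByte: number,
  chunkSize: number = CHUNK_SIZE
): number[] {
  if (endByte <= startByte) return [];
  const first = Math.floor(startByte / chunkSize);
  const last = Math.floor((endByte - 1) / chunkSize);
  const indexes: number[] = [];
  for (let i = first; i <= last; i++) indexes.push(i);
  return indexes;
}
```

With these defaults, reading the first 1MB of a file touches only chunks 0 through 3, regardless of the total file size.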
Attachment chunks use the same StoreEntry type as document changes:
interface StoreEntry extends StoreEntryMetadata {
  entryType: "attachment_chunk"; // Identifies this as an attachment chunk
  id: string;                    // Unique chunk ID (format: docId_a_fileId_chunkId)
  contentHash: string;           // SHA-256 of encryptedData (for deduplication)
  docId: string;                 // Document this chunk belongs to
  dependencyIds: string[];       // Entry ID of previous chunk (for append-only files)
  createdAt: number;
  createdByPublicKey: string;    // Author's public signing key
  decryptionKeyId: string;       // "default" or named key ID
  signature: Uint8Array;         // Signature over encrypted data
  encryptedData: Uint8Array;     // Encrypted chunk data
  originalSize?: number;         // Plaintext chunk size before encryption
}
Attachments are referenced in Automerge documents via lightweight metadata:
interface AttachmentReference {
  attachmentId: string;    // UUID7 for this attachment instance
  fileName: string;        // Original filename
  mimeType: string;        // MIME type
  size: number;            // Total file size in bytes
  lastChunkId: string;     // Entry ID of the last chunk (enables append-only growth)
  decryptionKeyId: string; // Same key as document ("default" or named)
  createdAt: number;       // When attachment was added
  createdBy: string;       // User public key
}
Document Structure Example:
{
  title: "My Document",
  content: "...",
  _attachments: [
    {
      attachmentId: "123e4567-...",
      fileName: "report.pdf",
      mimeType: "application/pdf",
      size: 5242880,
      lastChunkId: "123e4567-..._a_file-uuid_chunk-id",
      decryptionKeyId: "default",
      createdAt: 1234567890,
      createdBy: "-----BEGIN PUBLIC KEY-----..."
    }
  ]
}
The design supports appending content to files without copying existing data:
- Each chunk has dependencyIds pointing to the previous chunk's entry ID
- Document metadata stores lastChunkId and total size
- To append: Create new chunks pointing back to the previous last chunk
- Update document metadata with new lastChunkId and size
- Use resolveDependencies() to traverse from last to first chunk for reading
- Earlier document revisions return attachment data up to their lastChunkId
This is ideal for log files and other append-only data.
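The append-only mechanism can be sketched as a backward-linked chain of chunks. The ChunkChain class and its counter-based IDs are hypothetical stand-ins; in the real design the IDs follow the attachment chunk ID format and the traversal is done by resolveDependencies().

```typescript
interface ChunkEntry {
  id: string;
  dependencyIds: string[]; // entry ID of the previous chunk, empty for the first chunk
  data: Uint8Array;
}

// Hypothetical in-memory chain for illustration only.
class ChunkChain {
  private entries = new Map<string, ChunkEntry>();
  private nextId = 0;

  // Append a chunk after lastChunkId (undefined for the first chunk) and
  // return the new lastChunkId to store in the document metadata.
  append(lastChunkId: string | undefined, data: Uint8Array): string {
    const id = `chunk-${this.nextId++}`;
    this.entries.set(id, { id, dependencyIds: lastChunkId ? [lastChunkId] : [], data });
    return id;
  }

  // Walk from the last chunk back to the first, then assemble in file order.
  read(lastChunkId: string): Uint8Array {
    const chain: ChunkEntry[] = [];
    let current: string | undefined = lastChunkId;
    while (current !== undefined) {
      const entry: ChunkEntry | undefined = this.entries.get(current);
      if (!entry) throw new Error(`Missing chunk ${current}`);
      chain.push(entry);
      current = entry.dependencyIds[0];
    }
    chain.reverse();
    const total = chain.reduce((sum, e) => sum + e.data.length, 0);
    const out = new Uint8Array(total);
    let offset = 0;
    for (const e of chain) {
      out.set(e.data, offset);
      offset += e.data.length;
    }
    return out;
  }
}
```

Because existing chunks are never rewritten, a document revision that still holds an older lastChunkId reads exactly the bytes that existed at that revision.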
The ContentAddressedStore.resolveDependencies() method enables:
- Attachment streaming: Traverse from last chunk to first
- Document loading: Stop at snapshots when loading document history
interface ContentAddressedStore {
  resolveDependencies(
    startId: string,
    options?: {
      stopAtEntryType?: string; // Stop at "doc_snapshot" for docs
      maxDepth?: number;        // Limit traversal depth
      includeStart?: boolean;   // Include startId in result
    }
  ): Promise<string[]>;
}
Each chunk is encrypted independently using AES-256-GCM with two modes:
Random IV mode (mode byte 0x00):
- IV is randomly generated for each encryption
- Same plaintext produces different ciphertext each time
- No deduplication possible
- More secure for sensitive content
Deterministic IV mode (mode byte 0x01):
- IV is derived from SHA-256(plaintext)[:12]
- Same plaintext + same key = same ciphertext
- Enables tenant-wide deduplication
- Reveals when identical content is stored (acceptable trade-off)
Encrypted Data Format:
[mode byte (1)] [IV (12 bytes)] [ciphertext + GCM tag]
- Key: Same as document (tenant key or named key from decryptionKeyId)
- Mode byte: 0x00 = random IV, 0x01 = deterministic IV
- IV: 12 bytes (random or derived from content)
- contentHash: SHA-256 of complete encrypted payload (mode + IV + ciphertext)
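The payload layout above can be sketched with Node's built-in crypto module. This is an illustrative sketch, not MindooDB's encryption code: the function names are hypothetical, the key is assumed to be a raw 32-byte AES key, and the GCM tag is assumed to be the default 16 bytes appended after the ciphertext.

```typescript
import { createCipheriv, createDecipheriv, createHash, randomBytes } from "node:crypto";

const MODE_RANDOM_IV = 0x00;
const MODE_DETERMINISTIC_IV = 0x01;

// Build [mode byte][IV (12)][ciphertext][GCM tag (16)] and hash the whole payload.
function encryptChunk(
  plaintext: Uint8Array,
  key: Uint8Array, // assumed: raw 32-byte AES-256 key
  deterministic: boolean
): { payload: Buffer; contentHash: string } {
  const iv = deterministic
    ? createHash("sha256").update(plaintext).digest().subarray(0, 12) // SHA-256(plaintext)[:12]
    : randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  const payload = Buffer.concat([
    Buffer.from([deterministic ? MODE_DETERMINISTIC_IV : MODE_RANDOM_IV]),
    iv,
    ciphertext,
    cipher.getAuthTag(),
  ]);
  // contentHash covers the complete encrypted payload, never the plaintext.
  const contentHash = createHash("sha256").update(payload).digest("hex");
  return { payload, contentHash };
}

// Reverse of encryptChunk; throws if the tag does not verify.
function decryptChunk(payload: Uint8Array, key: Uint8Array): Buffer {
  const buf = Buffer.from(payload);
  const iv = buf.subarray(1, 13);
  const tag = buf.subarray(buf.length - 16);
  const ciphertext = buf.subarray(13, buf.length - 16);
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]);
}
```

In deterministic mode, encrypting the same bytes with the same key twice yields the same contentHash, which is what lets the store deduplicate; in random mode the two hashes differ.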
Why encrypt before hashing?
- Security: Content hash doesn't reveal plaintext information
- Per-key deduplication: Same file with different keys = different hashes
- Consistent with document change encryption
Deterministic Encryption Trade-offs:
- Pro: Tenant-wide deduplication (all users encrypting same file = same contentHash)
- Pro: Bandwidth savings on sync (don't transfer duplicate content)
- Pro: Storage savings (one copy of encrypted bytes per contentHash)
- Con: Reveals when identical content exists (metadata pattern)
- Con: Same content always produces same ciphertext (less secure than random IV)
- Document Changes Sync:
  - Sync document entries via docStore
  - Extract attachment references from documents
  - Identify required chunk entry IDs
- Attachment Chunks Sync:
  - Compare chunk IDs with remote attachmentStore
  - Fetch missing chunks using resolveDependencies()
  - Store chunks locally (if optional storage enabled)

- On-Demand Fetching: Only fetch chunks when attachment is accessed
- Streaming: Use resolveDependencies() to traverse chunk chain
- Background Sync: Optionally sync all chunks in background
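The planning part of the two sync steps can be sketched as two pure functions. Both function names are hypothetical, and the chains passed to the second function stand in for the already-awaited results of resolveDependencies() on each lastChunkId; the network transfer itself is left out.

```typescript
interface AttachmentRef {
  attachmentId: string;
  lastChunkId: string;
}

interface DocWithAttachments {
  _attachments?: AttachmentRef[];
}

// Step 1: after syncing documents, collect the chain heads (lastChunkId)
// referenced by their attachment metadata, without duplicates.
function requiredChainHeads(docs: DocWithAttachments[]): string[] {
  const heads = new Set<string>();
  for (const doc of docs) {
    for (const ref of doc._attachments ?? []) heads.add(ref.lastChunkId);
  }
  return Array.from(heads);
}

// Step 2: given the resolved chunk chain for each head, return the chunk IDs
// that are not yet in the local attachment store, each listed once.
function planChunkFetch(chains: string[][], localIds: ReadonlySet<string>): string[] {
  const missing = new Set<string>();
  for (const chain of chains) {
    for (const id of chain) {
      if (!localIds.has(id)) missing.add(id);
    }
  }
  return Array.from(missing);
}
```

With on-demand fetching, the same planning would run for a single attachment at access time instead of for all documents during sync.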
Attachment methods are available on MindooDoc. Write methods (addAttachment, addAttachmentStream, removeAttachment, appendToAttachment) can only be called within the MindooDB.changeDoc() callback. Read methods (getAttachment, getAttachmentRange, streamAttachment, getAttachments) work anywhere.
interface MindooDoc {
  // ========== Write Methods (only within changeDoc callback) ==========

  // Add an attachment from in-memory data
  addAttachment(
    fileData: Uint8Array,
    fileName: string,
    mimeType: string,
    decryptionKeyId?: string
  ): Promise<AttachmentReference>;

  // Add an attachment from a streaming source (memory efficient for large files)
  // Works with ReadableStream, Node streams, async generators, etc.
  addAttachmentStream(
    dataStream: AsyncIterable<Uint8Array>,
    fileName: string,
    mimeType: string,
    decryptionKeyId?: string
  ): Promise<AttachmentReference>;

  // Remove an attachment (removes reference, chunks remain in store)
  removeAttachment(attachmentId: string): Promise<void>;

  // Append data to an existing attachment (for log files, etc.)
  appendToAttachment(attachmentId: string, data: Uint8Array): Promise<void>;

  // ========== Read Methods (work anywhere) ==========

  // Get all attachment references
  getAttachments(): AttachmentReference[];

  // Get full attachment content (fetches chunks, decrypts, assembles)
  getAttachment(attachmentId: string): Promise<Uint8Array>;

  // Get a byte range (random access, only fetches needed chunks)
  getAttachmentRange(
    attachmentId: string,
    startByte: number,
    endByte: number
  ): Promise<Uint8Array>;

  // Stream attachment data chunk by chunk (memory efficient)
  streamAttachment(
    attachmentId: string,
    startOffset?: number
  ): AsyncGenerator<Uint8Array, void, unknown>;
}
interface MindooDB {
  // Get the attachment store (may be same as doc store or separate)
  getAttachmentStore(): ContentAddressedStore | undefined;
}
interface AttachmentStoragePolicy {
  maxLocalSize?: number;       // Maximum total size of attachments
  maxLocalCount?: number;      // Maximum number of chunks (LRU)
  keepForDocuments?: string[]; // Always keep for these doc IDs
  neverStore?: boolean;        // Always fetch from remote
}
- LRU Cache: Track access times, evict least recently used chunks
- Size Threshold: Keep chunks below total size limit
- Document-Based: Keep all chunks for certain documents
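A minimal sketch of the LRU and size-threshold strategies follows, using a Map's insertion order as the recency order. The ChunkLruCache class is hypothetical (it is not the planned AttachmentCacheManager), sizes are assumed to be in bytes, and the document-based pinning from keepForDocuments is left out for brevity.

```typescript
// Hypothetical cache for illustration only.
class ChunkLruCache {
  private chunks = new Map<string, Uint8Array>(); // oldest entry first
  private totalSize = 0;
  private maxLocalSize: number;  // assumed to be bytes
  private maxLocalCount: number;

  constructor(maxLocalSize: number, maxLocalCount: number) {
    this.maxLocalSize = maxLocalSize;
    this.maxLocalCount = maxLocalCount;
  }

  has(id: string): boolean {
    return this.chunks.has(id);
  }

  get(id: string): Uint8Array | undefined {
    const data = this.chunks.get(id);
    if (data === undefined) return undefined;
    // Re-insert so this chunk becomes the most recently used.
    this.chunks.delete(id);
    this.chunks.set(id, data);
    return data;
  }

  put(id: string, data: Uint8Array): void {
    const old = this.chunks.get(id);
    if (old !== undefined) {
      this.totalSize -= old.length;
      this.chunks.delete(id);
    }
    this.chunks.set(id, data);
    this.totalSize += data.length;
    this.evict();
  }

  // Drop least recently used chunks until both limits are satisfied.
  private evict(): void {
    for (const [id, data] of this.chunks) {
      if (this.totalSize <= this.maxLocalSize && this.chunks.size <= this.maxLocalCount) break;
      this.chunks.delete(id);
      this.totalSize -= data.length;
    }
  }
}
```

Note that in this sketch a single chunk larger than maxLocalSize is evicted immediately after being added; a real cache manager would need to decide how to treat that case.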
The ContentAddressedStore.purgeDocHistory(docId) method enables:
- Removing all entries (document changes AND attachment chunks) for a document
- Supporting "right to be forgotten" requirements
- Coordinated cleanup across document and attachment stores
- Automatic cleanup of orphaned content (bytes no longer referenced by any entry)
- Create ContentAddressedStore interface with id/contentHash separation
- Implement InMemoryContentAddressedStore with byte-level deduplication
- Add StoreEntry types with entryType field
- Update MindooDB to accept two stores
- Implement structured ID generation utilities
- Add deterministic encryption for attachments
- Implement chunking and attachment_chunk entries
- Add attachment methods to MindooDoc (within changeDoc() callback)
- Store attachment references in documents (_attachments array)
- Implement addAttachment() for in-memory data
- Implement addAttachmentStream() for streaming uploads
- Implement getAttachment(), getAttachmentRange(), streamAttachment()
- Implement removeAttachment() and appendToAttachment()
- Implement attachment chunk sync
- Add resolveDependencies() usage for streaming
- Integrate with document sync (two-phase sync)
- Implement AttachmentCacheManager with LRU/size policies
- Add storage policies configuration
- Implement eviction logic
- Add transparent remote fetching
- Content-defined chunking (variable-size)
- Tiered storage (external storage migration)
// Create MindooDB with separate stores for docs and attachments
const docStore = docStoreFactory.createStore("mydb");
const attachmentStore = attachmentStoreFactory.createStore("mydb-attachments");
const db = new BaseMindooDB(tenant, docStore, attachmentStore);

// Create a document and add attachment in one changeDoc call
const doc = await db.createDocument();
// Definite assignment (!): the value is set inside the changeDoc callback below
let attachmentRef!: AttachmentReference;
await db.changeDoc(doc, async (d) => {
  d.getData().title = "My Document";
  // Add attachment from in-memory data
  const fileData = new Uint8Array([/* ... */]);
  attachmentRef = await d.addAttachment(fileData, "report.pdf", "application/pdf");
});
console.log(`Attachment ID: ${attachmentRef.attachmentId}`);
console.log(`Size: ${attachmentRef.size} bytes`);

// Add attachment from a stream (memory efficient for large files)
await db.changeDoc(doc, async (d) => {
  // From fetch response
  const response = await fetch('/large-file.pdf');
  await d.addAttachmentStream(response.body!, "large.pdf", "application/pdf");
  // Or from a File input (browser)
  // const file = inputElement.files[0];
  // await d.addAttachmentStream(file.stream(), file.name, file.type);
  // Or from an async generator
  // async function* generateData() { yield new Uint8Array([1,2,3]); }
  // await d.addAttachmentStream(generateData(), "generated.bin", "application/octet-stream");
});

// Read methods work outside changeDoc
const reloadedDoc = await db.getDocument(doc.getId());

// List all attachments
const attachments = reloadedDoc.getAttachments();
console.log(`Document has ${attachments.length} attachments`);

// Get full attachment content
const data = await reloadedDoc.getAttachment(attachmentRef.attachmentId);

// Get a byte range (random access, only fetches needed chunks)
const firstMB = await reloadedDoc.getAttachmentRange(
  attachmentRef.attachmentId,
  0,
  1024 * 1024 // First 1MB
);

// Stream attachment data (memory efficient for large files)
for await (const chunk of reloadedDoc.streamAttachment(attachmentRef.attachmentId)) {
  // Process chunk by chunk
  console.log(`Received ${chunk.length} bytes`);
}

// Stream from an offset
for await (const chunk of reloadedDoc.streamAttachment(attachmentRef.attachmentId, 1024 * 1024)) {
  // Start from 1MB offset
}

// Modify attachments (must be within changeDoc)
await db.changeDoc(reloadedDoc, async (d) => {
  // Append data to an existing attachment (for log files)
  await d.appendToAttachment(attachmentRef.attachmentId, new Uint8Array([4, 5, 6]));
  // Remove an attachment
  await d.removeAttachment(attachmentRef.attachmentId);
});
The unified ContentAddressedStore approach provides:
- Consistent interface: Same store interface for documents and attachments
- Flexible deployment: Separate or combined stores based on needs
- Content-addressable storage: Deduplication via contentHash
- Separate id and contentHash: No metadata collisions with deduplication
- Deterministic encryption: Tenant-wide deduplication for attachments
- Append-only files: Support for log files and growing data
- Streaming support: Memory-efficient upload and download for large files
- Random access: Efficient byte-range retrieval without loading entire files
- Security: Per-chunk encryption using existing key model
- Synchronization: Reuse proven sync patterns from document changes
- GDPR compliance: Coordinated cleanup via purgeDocHistory()
- Future-proof: Design allows migration to external storage
Core infrastructure (Phase 1) and attachment storage (Phase 2) are complete. Future phases will add synchronization, local caching, and advanced features.