Fortémi provides a secure, content-addressable file attachment system for storing files alongside notes. Files are automatically deduplicated, safely validated, and can have their content extracted for search indexing.
The file attachment system supports:
- Content-addressable storage with BLAKE3 hashing for automatic deduplication
- Multiple storage backends (database inline, filesystem, S3-compatible)
- UUIDv7-based filesystem paths for time-ordered, scalable storage
- Automatic content extraction for indexing and search
- EXIF metadata extraction from images (GPS, timestamps, camera info)
- Multi-layer security validation (magic bytes, extension blocklist, size limits)
- Reference counting for safe garbage collection
Formats:
- JPEG (.jpg, .jpeg)
- PNG (.png)
- GIF (.gif)
- WebP (.webp)
- HEIF/HEIC (.heic, .heif)
- TIFF (.tiff, .tif)
- RAW formats (Canon .cr2, Nikon .nef, Sony .arw, etc.)
Automatic Processing:
- EXIF metadata extraction (GPS coordinates, capture time, camera device)
- Image orientation detection
- Dimensions extraction
- Vision model analysis (optional, using LLaVA or similar models)
Example Use Cases:
- Photo attachments with GPS location tracking
- Screenshots for documentation
- Diagrams and charts
Formats:
- PDF (.pdf)
- Markdown (.md)
- Plain text (.txt)
- Microsoft Office (.docx, .xlsx, .pptx)
- OpenDocument (.odt, .ods, .odp)
- HTML (.html, .htm)
Automatic Processing:
- PDF text extraction using
pdftotext - OCR for scanned PDFs using Tesseract (when
pdf_ocrstrategy is enabled) - Office document conversion using Pandoc
- Direct text extraction for plaintext/markdown
Example Use Cases:
- Research papers and publications
- Meeting notes and reports
- Project documentation
Formats:
- MP3 (.mp3)
- WAV (.wav)
- FLAC (.flac)
- M4A (.m4a)
- OGG (.ogg)
- Opus (.opus)
Automatic Processing:
- Audio transcription using Whisper models
- Speech-to-text for voice notes
- Metadata extraction (duration, bitrate, codec)
Example Use Cases:
- Voice memos
- Meeting recordings
- Podcast clips
Formats:
- MP4 (.mp4)
- WebM (.webm)
- MOV (.mov)
- AVI (.avi)
- MKV (.mkv)
Automatic Processing:
- Video multimodal processing (frame extraction + audio transcription)
- Keyframe analysis for visual content
- Audio track transcription
Example Use Cases:
- Tutorial videos
- Screencasts
- Presentation recordings
Formats:
- GLB/GLTF (.glb, .gltf)
- STL (.stl)
- OBJ (.obj)
- FBX (.fbx)
Automatic Processing:
- Structured metadata extraction
- Model bounds and statistics
Example Use Cases:
- CAD designs
- 3D printing models
- Game assets
Formats:
- All programming language extensions (detected by extension)
- Common examples: .rs, .py, .js, .ts, .java, .cpp, .go, etc.
Automatic Processing:
- Code AST parsing using tree-sitter
- Syntax highlighting metadata
- Function/class extraction for semantic search
Example Use Cases:
- Code snippets
- Configuration files
- Scripts and utilities
Formats:
- ZIP (.zip)
- TAR (.tar, .tar.gz, .tgz)
- 7z (.7z)
Automatic Processing:
- Archive listing
- Metadata extraction (compressed size, file count)
Example Use Cases:
- Project backups
- File collections
- Data exports
Fortémi automatically extracts searchable content from attachments using document type-specific extraction strategies.
For: Plaintext, Markdown, CSV, JSON, YAML, XML
How it works:
- Direct UTF-8 text extraction
- No conversion needed
- Fastest strategy
Output:
- Raw text content indexed for full-text search
- Structured data preserved for display
For: Text-based PDF documents
How it works:
- Uses
pdftotextfor text extraction - Preserves text layout when possible
- Handles embedded fonts
Output:
- Searchable text content
- Page metadata (page count, dimensions)
Example extracted metadata:
{
"pages": 10,
"title": "Research Paper Title",
"author": "Author Name",
"creation_date": "2026-01-15"
}For: Scanned PDFs, image-based PDFs
How it works:
- Uses Tesseract OCR for text recognition
- Automatically triggered when
pdftotextyields low confidence - Supports multiple languages
Output:
- OCR-extracted text with confidence scores
- Page-by-page processing metadata
Configuration:
{
"extraction_strategy": "pdf_ocr",
"extraction_config": {
"languages": ["eng", "fra"],
"dpi": 300,
"page_segmentation_mode": 3
}
}For: Images (JPEG, PNG, WebP, HEIF)
How it works:
- Analyzes image content using vision models (LLaVA, GPT-4V)
- Generates descriptive captions
- Identifies objects, scenes, text in images
Output:
- Natural language image description
- Detected objects and concepts
- OCR text from images
Example output:
{
"caption": "A diagram showing system architecture with three main components",
"detected_objects": ["server", "database", "client"],
"extracted_text": "Load Balancer → App Server → PostgreSQL"
}For: Audio files (MP3, WAV, M4A, FLAC)
How it works:
- Uses Whisper models for speech recognition
- Generates timestamped transcriptions
- Supports multiple languages
Output:
- Full transcription text
- Timestamps for each segment
- Speaker diarization (optional)
Example output:
{
"transcript": "Welcome to the meeting. Today we'll discuss...",
"segments": [
{
"start": 0.0,
"end": 5.2,
"text": "Welcome to the meeting."
}
],
"language": "en",
"confidence": 0.94
}For: Video files (MP4, WebM, MOV)
How it works:
- Extracts keyframes for visual analysis
- Transcribes audio track
- Combines visual and audio insights
Output:
- Video transcript
- Keyframe descriptions
- Scene changes and timestamps
For: Source code files
How it works:
- Parses code using tree-sitter
- Extracts functions, classes, imports
- Generates semantic code summaries
Output:
- Function signatures and documentation
- Code structure metadata
- Import dependencies
Example output:
{
"functions": [
{
"name": "compute_hash",
"signature": "fn compute_hash(data: &[u8]) -> String",
"docstring": "Compute BLAKE3 hash of data"
}
],
"imports": ["blake3", "uuid"],
"language": "rust"
}For: Microsoft Office, OpenDocument formats
How it works:
- Uses Pandoc to convert to Markdown
- Preserves document structure
- Extracts images and media
Output:
- Markdown representation
- Embedded images extracted separately
- Document metadata
For: JSON, YAML, XML, TOML
How it works:
- Parses structured data
- Generates human-readable summaries
- Indexes key-value pairs
Output:
- Flattened key-value text
- Schema detection
- Nested structure metadata
For image files, Fortémi automatically extracts EXIF metadata for enhanced provenance tracking and spatial search.
Temporal Metadata:
datetime- Original capture time (fromDateTimeOriginal,DateTimeDigitized, orDateTime)- Timezone offsets (from
OffsetTimeOriginal,OffsetTimeDigitized)
Spatial Metadata (GPS):
latitude- Decimal degrees (positive = North, negative = South)longitude- Decimal degrees (positive = East, negative = West)altitude- Meters above sea level
Device Information:
make- Camera/device manufacturer (e.g., "Apple")model- Device model (e.g., "iPhone 15 Pro")software- Processing software (e.g., "Adobe Lightroom 2026.1")
Image Properties:
orientation- EXIF orientation tag (1-8)dimensions- Width × height in pixels
GPS coordinates are automatically converted to PostGIS-compatible formats for spatial queries.
WKT Format:
-- Point in Well-Known Text format
POINT(2.2945 48.8584)PostGIS Geography:
-- Geographic point with WGS84 (SRID 4326)
ST_SetSRID(ST_MakePoint(2.2945, 48.8584), 4326)::geographyUse Cases:
- Find all photos taken within 5km of a location
- Group photos by geographic region
- Create travel timeline maps
EXIF data may contain sensitive information:
- GPS coordinates reveal photo locations
- Timestamps reveal when photos were taken
- Device info may reveal camera model
Always review EXIF data before sharing notes publicly. Use EXIF stripping tools when privacy is required.
Fortémi implements multi-layer security validation to prevent malicious file uploads.
Files are validated by their actual content (magic bytes), not just extension.
Blocked Signatures:
Windows PE/MZ - 0x4D 0x5A
ELF (Linux) - 0x7F 0x45 0x4C 0x46
Mach-O (macOS) - 0xFE 0xED 0xFA 0xCE/CF
Java Class - 0xCA 0xFE 0xBA 0xBE
WebAssembly - 0x00 0x61 0x73 0x6D
Shell Scripts - #! (shebang)
Why: Prevents execution of disguised binaries (e.g., malware.jpg that's actually an .exe)
Dangerous extensions are explicitly blocked:
Executables:
.exe, .dll, .scr, .com, .msi, .so, .dylib
Scripts:
.sh, .bat, .cmd, .ps1, .vbs, .js, .jse
JVM Bytecode:
.jar, .war, .ear, .class
Packages:
.deb, .rpm, .apk, .dmg, .pkg
Office Macros:
.xlsm, .xlsb, .docm, .pptm (macro-enabled formats)
Other Dangerous:
.reg, .inf, .scf, .lnk, .hta
Why: Defense-in-depth against extension-based attacks
Files larger than the configured maximum are rejected.
Default Limits:
- API upload: 100 MB (configurable via
MAX_UPLOAD_SIZE) - Database inline storage threshold: 10 MB
Why: Prevents denial-of-service and storage exhaustion
Filenames are automatically sanitized to prevent path traversal and filesystem attacks.
Removed/Replaced:
- Path separators (
/,\) - Control characters (
\0,\x01, etc.) - Dangerous characters (
<,>,:,",|,?,*) - Overly long names (truncated to 255 characters)
Example:
// Input: "../../../etc/passwd"
// Output: "........etc.passwd"
// Input: "file<dangerous>.txt"
// Output: "file_dangerous_.txt"
// Input: "" (empty)
// Output: "unnamed_file"Why: Prevents directory traversal and filesystem injection
Upload Request
│
▼
┌─────────────────┐
│ Size Check │ → Reject if > MAX_UPLOAD_SIZE
└────────┬────────┘
│
▼
┌─────────────────┐
│ Extension Check │ → Reject if in blocklist
└────────┬────────┘
│
▼
┌─────────────────┐
│ Magic Byte Check│ → Reject if executable signature
└────────┬────────┘
│
▼
┌─────────────────┐
│ Sanitize Name │ → Clean filename
└────────┬────────┘
│
▼
┌─────────────────┐
│ Compute Hash │ → BLAKE3 for deduplication
└────────┬────────┘
│
▼
Store ✓
Files that fail validation are marked with status: quarantined and:
- Not indexed for search
- Not processed for content extraction
- Visible only to administrators
- Can be manually reviewed and deleted
Files are deduplicated using BLAKE3 hashing to save storage space.
-
Hash Computation:
// Compute content hash let hash = blake3::hash(file_data); let content_hash = format!("blake3:{}", hash.to_hex());
-
Lookup Existing Blob:
SELECT id FROM attachment_blob WHERE content_hash = $1;
-
Reuse or Create:
- If blob exists → reuse existing blob ID
- If new → create new blob and store data
-
Reference Counting:
-- Incremented on attach UPDATE attachment_blob SET reference_count = reference_count + 1 WHERE id = $1; -- Decremented on delete UPDATE attachment_blob SET reference_count = reference_count - 1 WHERE id = $1;
-
Garbage Collection:
- Blobs with
reference_count = 0are candidates for cleanup - Configurable grace period (default: 24 hours)
- Manual cleanup via
cleanup_orphaned_blobs(min_age_hours)
- Blobs with
Example Scenario:
- Upload
document.pdf(5 MB) to 10 different notes - Without deduplication: 50 MB storage
- With deduplication: 5 MB storage + 10 small metadata records
- Savings: 90%
Real-World Benefits:
- Repeated screenshots (e.g., same error message)
- Common reference documents
- Duplicate uploads by different users
Fortémi uses a flexible storage backend system supporting multiple storage types.
For: Small files (<10 MB by default)
Storage Location: PostgreSQL BYTEA column
Advantages:
- Atomic transactions with metadata
- Automatic backup with database
- No filesystem dependencies
Disadvantages:
- Increases database size
- TOAST limit: 1 GB per row
- Slower for large files
Configuration:
let repo = PgFileStorageRepository::new(
pool,
backend,
10_485_760 // 10 MB threshold
);For: Large files (≥10 MB by default)
Storage Location: UUIDv7-based directory structure
Path Format:
blobs/{first-2-hex}/{next-2-hex}/{uuid}.bin
Example:
blobs/01/94/01948f7e-8b2a-7c3d-9e4f-5a6b7c8d9e0f.bin
Advantages:
- No database size increase
- Fast filesystem operations
- Easy to snapshot/backup separately
- Scales to billions of files (65,536 subdirectories)
Disadvantages:
- Requires filesystem access
- Backup complexity (database + filesystem)
Directory Structure:
/var/matric/blobs/
├── 01/
│ ├── 94/
│ │ ├── 01948f7e-8b2a-7c3d-9e4f-5a6b7c8d9e0f.bin
│ │ └── 01949a1b-2c3d-4e5f-6a7b-8c9d0e1f2a3b.bin
│ └── 95/
│ └── 01959c2d-3e4f-5a6b-7c8d-9e0f1a2b3c4d.bin
└── 02/
└── ...
UUIDv7 Benefits:
- Time-ordered: newer files in newer directories
- Natural sharding: first 4 hex chars distribute across 65K dirs
- Predictable paths: deterministic generation from UUID
For: Cloud storage integration
Storage Location: S3, MinIO, Backblaze B2, etc.
Advantages:
- Unlimited scalability
- Geographic replication
- CDN integration
Status: Planned, not yet implemented
Decision Flow:
File Size Check
│
▼
< 10 MB? → YES → Database (inline BYTEA)
│
NO
│
▼
≥ 10 MB → Filesystem (UUIDv7 paths)
Override Threshold:
# Set inline threshold via environment variable
export INLINE_STORAGE_THRESHOLD_MB=5
# Or in Rust configuration
let repo = PgFileStorageRepository::new(
pool,
backend,
5 * 1024 * 1024 // 5 MB threshold
);-- Content-addressable blob storage
CREATE TABLE attachment_blob (
id UUID PRIMARY KEY, -- UUIDv7 blob ID
content_hash TEXT NOT NULL UNIQUE, -- blake3:{64-hex}
content_type TEXT NOT NULL, -- MIME type
size_bytes BIGINT NOT NULL, -- File size
storage_backend TEXT NOT NULL, -- 'database', 'filesystem', 's3'
storage_path TEXT, -- Path for filesystem/s3
data BYTEA, -- Inline data for 'database' backend
reference_count INT NOT NULL DEFAULT 1, -- Number of attachments referencing this
verified_at TIMESTAMPTZ, -- Last integrity check
verification_status TEXT, -- 'verified', 'failed', 'pending'
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Attachment metadata (links notes to blobs)
CREATE TABLE attachment (
id UUID PRIMARY KEY, -- UUIDv7 attachment ID
note_id UUID NOT NULL REFERENCES note(id) ON DELETE CASCADE,
blob_id UUID NOT NULL REFERENCES attachment_blob(id),
filename TEXT NOT NULL, -- Display name
original_filename TEXT, -- Original upload name
document_type_id UUID REFERENCES document_type(id),
status TEXT NOT NULL DEFAULT 'uploaded', -- Processing status
extraction_strategy TEXT, -- Content extraction strategy
extracted_text TEXT, -- Searchable extracted content
extracted_metadata JSONB, -- Structured metadata (EXIF, etc.)
has_preview BOOLEAN NOT NULL DEFAULT FALSE,
is_canonical_content BOOLEAN NOT NULL DEFAULT FALSE,
display_order INT, -- Ordering for multiple attachments
processing_error TEXT, -- Error message if failed
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Indexes
CREATE INDEX idx_attachment_note ON attachment(note_id);
CREATE INDEX idx_attachment_blob ON attachment(blob_id);
CREATE INDEX idx_attachment_status ON attachment(status);
CREATE INDEX idx_blob_hash ON attachment_blob(content_hash);
CREATE INDEX idx_blob_backend ON attachment_blob(storage_backend);
CREATE INDEX idx_blob_orphans ON attachment_blob(reference_count) WHERE reference_count = 0;When using Fortémi through the MCP (Model Context Protocol) server, all binary file operations use filesystem paths instead of base64 encoding. Binary data never passes through the LLM context window.
// Upload a file from disk to a note
upload_attachment({
note_id: "01948f7e-1234-5678-9abc-def012345678",
file_path: "/home/user/docs/research-paper.pdf",
content_type: "application/pdf"
})
// The MCP server reads the file directly from disk
// Returns: { id, note_id, filename, content_type, size_bytes, status }// Download an attachment to disk
download_attachment({
id: "01948f7e-8b2a-7c3d-9e4f-5a6b7c8d9e0f",
output_dir: "/home/user/downloads"
})
// The MCP server writes the file directly to disk
// Returns: { saved_to: "/home/user/downloads/research-paper.pdf", filename, size_bytes, content_type }// Export knowledge shard to file
knowledge_shard({
tags: ["project-x"],
output_path: "/tmp/project-x-shard.json"
})
// Returns: { saved_to: "/tmp/project-x-shard.json", ... }
// Import knowledge shard from file
knowledge_shard_import({
file_path: "/tmp/project-x-shard.json",
merge_strategy: "skip_existing"
})Binary data (images, PDFs, archives) can be megabytes or gigabytes. Passing this through the LLM context window as base64 would:
- Consume tokens unnecessarily
- Exceed context window limits
- Slow down AI responses
Instead, the MCP server handles file I/O directly on the filesystem, and only metadata (paths, sizes, content types) flows through the LLM conversation.
For files over 100 MB, unreliable connections, or mobile uploads where progress reporting is important, use the tus v1.0.0 resumable upload protocol. Unlike a single-request multipart upload, tus lets clients pause and resume transfers without restarting from scratch.
| Method | Path | Purpose |
|---|---|---|
OPTIONS |
/api/v1/notes/{note_id}/attachments/tus |
Discover server capabilities |
POST |
/api/v1/notes/{note_id}/attachments/tus |
Create a new upload session |
HEAD |
/api/v1/notes/{note_id}/attachments/tus/{upload_id} |
Get current offset (for resume) |
PATCH |
/api/v1/notes/{note_id}/attachments/tus/{upload_id} |
Append a chunk |
DELETE |
/api/v1/notes/{note_id}/attachments/tus/{upload_id} |
Cancel the upload |
| Header | Required On | Value |
|---|---|---|
Tus-Resumable |
All requests | 1.0.0 |
Upload-Length |
POST | File size in bytes |
Upload-Offset |
PATCH | Current offset (0 for first chunk) |
Content-Type |
PATCH | application/offset+octet-stream |
Upload-Metadata |
POST (optional) | Comma-separated key base64value pairs |
Key-value pairs where values are base64-encoded, comma-separated:
Upload-Metadata: filename aGVsbG8ud29ybGQ=, content_type dGV4dC9wbGFpbg==
Standard keys: filename, content_type (or filetype).
1. Create upload session
curl -X POST https://memory.example.com/api/v1/notes/{note_id}/attachments/tus \
-H "Authorization: Bearer $TOKEN" \
-H "Tus-Resumable: 1.0.0" \
-H "Upload-Length: 104857600" \
-H "Upload-Metadata: filename bXktdmlkZW8ubXA0, content_type dmlkZW8vbXA0"
# Response: 201 Created
# Location: /api/v1/notes/{note_id}/attachments/tus/{upload_id}
# Upload-Expires: Sun, 23 Feb 2026 16:00:00 GMT2. Send chunks (repeat until offset equals Upload-Length)
curl -X PATCH https://memory.example.com/api/v1/notes/{note_id}/attachments/tus/{upload_id} \
-H "Authorization: Bearer $TOKEN" \
-H "Tus-Resumable: 1.0.0" \
-H "Content-Type: application/offset+octet-stream" \
-H "Upload-Offset: 0" \
--data-binary @chunk-part-0.bin
# Response: 204 No Content (mid-upload)
# Upload-Offset: 52428800When the final chunk lands, response is 200 OK with the completed attachment JSON.
3. Resume after interruption (check current offset first)
curl -X HEAD https://memory.example.com/api/v1/notes/{note_id}/attachments/tus/{upload_id} \
-H "Authorization: Bearer $TOKEN" \
-H "Tus-Resumable: 1.0.0"
# Response: 200 OK
# Upload-Offset: 52428800 ← resume from here
# Upload-Length: 1048576004. Cancel upload
curl -X DELETE https://memory.example.com/api/v1/notes/{note_id}/attachments/tus/{upload_id} \
-H "Authorization: Bearer $TOKEN" \
-H "Tus-Resumable: 1.0.0"
# Response: 204 No ContentWhen the final chunk completes the upload (offset equals Upload-Length), Fortemi automatically:
- Validates file safety (magic bytes, extension blocklist)
- Detects the actual content type
- Stores via the standard attachment pipeline (BLAKE3 dedup, blob reference counting)
- Queues background extraction jobs (content extraction, EXIF, media optimization)
- Deletes the staging file
The 200 OK response body contains the completed Attachment JSON, identical to what POST /attachments returns.
Incomplete sessions expire after 24 hours by default (configurable via TUS_UPLOAD_EXPIRY_HOURS). The Upload-Expires response header shows when a session will be cleaned up. Expired sessions are automatically purged.
The default maximum chunk size is 50 MB (configurable via TUS_CHUNK_MAX_SIZE). Smaller chunks (5-20 MB) resume faster after interruptions; larger chunks have lower overhead for stable connections.
- tus-js-client — Browser and Node.js client with automatic resume
- Uppy — Full-featured file picker with built-in tus support
curl -X POST http://localhost:3000/api/v1/notes/{note_id}/attachments \
-H "Authorization: Bearer $TOKEN" \
-F "file=@document.pdf" \
-F "filename=research-paper.pdf"Response (201 Created):
{
"id": "01948f7e-8b2a-7c3d-9e4f-5a6b7c8d9e0f",
"note_id": "01948f7e-1234-5678-9abc-def012345678",
"blob_id": "01948f7e-4567-89ab-cdef-0123456789ab",
"filename": "research-paper.pdf",
"content_type": "application/pdf",
"size_bytes": 5242880,
"status": "uploaded",
"document_type_name": "pdf",
"has_preview": false,
"is_canonical_content": false,
"created_at": "2026-02-02T12:00:00Z"
}curl -X GET http://localhost:3000/api/v1/attachments/{attachment_id}/download \
-H "Authorization: Bearer $TOKEN" \
-O research-paper.pdfResponse:
Content-Type:application/pdf(original MIME type)Content-Disposition:attachment; filename="research-paper.pdf"- Binary file data
curl -X GET http://localhost:3000/api/v1/notes/{note_id}/attachments \
-H "Authorization: Bearer $TOKEN"Response (200 OK):
{
"attachments": [
{
"id": "01948f7e-8b2a-7c3d-9e4f-5a6b7c8d9e0f",
"note_id": "01948f7e-1234-5678-9abc-def012345678",
"filename": "research-paper.pdf",
"content_type": "application/pdf",
"size_bytes": 5242880,
"status": "completed",
"document_type_name": "pdf",
"has_preview": true,
"is_canonical_content": false,
"created_at": "2026-02-02T12:00:00Z"
},
{
"id": "01948f7e-9abc-def0-1234-56789abcdef0",
"note_id": "01948f7e-1234-5678-9abc-def012345678",
"filename": "diagram.png",
"content_type": "image/png",
"size_bytes": 245760,
"status": "completed",
"document_type_name": "image",
"has_preview": true,
"is_canonical_content": false,
"created_at": "2026-02-02T12:05:00Z"
}
]
}curl -X GET http://localhost:3000/api/v1/attachments/{attachment_id} \
-H "Authorization: Bearer $TOKEN"Response (200 OK):
{
"id": "01948f7e-9abc-def0-1234-56789abcdef0",
"note_id": "01948f7e-1234-5678-9abc-def012345678",
"blob_id": "01948f7e-4567-89ab-cdef-0123456789ab",
"filename": "vacation-photo.jpg",
"original_filename": "IMG_20260125_143022.jpg",
"content_type": "image/jpeg",
"status": "completed",
"extraction_strategy": "vision",
"extracted_text": "A scenic mountain view with snow-capped peaks and a lake in the foreground",
"extracted_metadata": {
"exif": {
"datetime": "2026-01-25T14:30:22Z",
"gps": {
"latitude": 46.5197,
"longitude": 6.6323,
"altitude": 1850.0
},
"device": {
"make": "Apple",
"model": "iPhone 15 Pro"
},
"dimensions": {
"width": 4032,
"height": 3024
}
},
"vision": {
"description": "Scenic mountain landscape",
"objects": ["mountain", "lake", "snow", "sky"],
"confidence": 0.92
}
},
"has_preview": true,
"is_canonical_content": false,
"created_at": "2026-02-02T12:05:00Z",
"updated_at": "2026-02-02T12:05:15Z"
}curl -X DELETE http://localhost:3000/api/v1/attachments/{attachment_id} \
-H "Authorization: Bearer $TOKEN"Response (204 No Content)
Note: The underlying blob is not immediately deleted. It remains in storage until the reference count reaches zero and the garbage collection grace period expires.
curl -X PATCH http://localhost:3000/api/v1/attachments/{attachment_id} \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"filename": "updated-name.pdf",
"is_canonical_content": true
}'Response (200 OK):
{
"id": "01948f7e-8b2a-7c3d-9e4f-5a6b7c8d9e0f",
"filename": "updated-name.pdf",
"is_canonical_content": true,
"updated_at": "2026-02-02T12:30:00Z"
}Do:
- Use descriptive names:
project-architecture-diagram.png - Include dates for versioning:
meeting-notes-2026-02-02.pdf - Use lowercase with hyphens:
research-paper-draft.docx
Don't:
- Use generic names:
document.pdf,image.jpg - Use special characters:
file<1>.txt,doc:final.pdf - Use very long names (>100 characters)
Strategy 1: One Note Per Document
Note: "Research Paper - Machine Learning Applications"
└── Attachment: research-paper.pdf
Strategy 2: Collection Note with Multiple Attachments
Note: "Project Assets"
├── Attachment: architecture-diagram.png
├── Attachment: database-schema.pdf
├── Attachment: api-documentation.md
└── Attachment: deployment-guide.pdf
Strategy 3: Image Gallery
Note: "Conference Photos - AI Summit 2026"
├── Attachment: keynote-speaker.jpg
├── Attachment: panel-discussion.jpg
├── Attachment: networking-event.jpg
└── Attachment: closing-ceremony.jpg
1. Search by Extracted Content:
# Full-text search includes extracted attachment content
curl -X GET "http://localhost:3000/api/v1/search?q=machine+learning"
# Returns notes with "machine learning" in:
# - Note content
# - Attachment extracted text (PDF, OCR, transcripts)2. Filter by Document Type:
# Find all PDF attachments
GET /api/v1/notes?document_type=pdf
# Find all images
GET /api/v1/notes?document_type=image3. Search by EXIF Metadata:
-- Find photos taken in Paris (within 10km radius)
SELECT n.id, n.title, a.filename
FROM note n
JOIN attachment a ON a.note_id = n.id
WHERE a.extracted_metadata->'exif'->'gps' IS NOT NULL
AND ST_DWithin(
ST_SetSRID(ST_MakePoint(
(a.extracted_metadata->'exif'->'gps'->>'longitude')::float,
(a.extracted_metadata->'exif'->'gps'->>'latitude')::float
), 4326)::geography,
ST_SetSRID(ST_MakePoint(2.3522, 48.8566), 4326)::geography,
10000 -- 10km
);4. Search by Capture Date:
-- Find photos taken in January 2026
SELECT n.id, n.title, a.filename,
a.extracted_metadata->'exif'->>'datetime' as capture_time
FROM note n
JOIN attachment a ON a.note_id = n.id
WHERE a.extracted_metadata->'exif'->>'datetime' BETWEEN '2026-01-01' AND '2026-01-31';1. Use Filesystem Backend for Large Files
Configure inline threshold based on your usage:
# For mostly small files (text, configs)
export INLINE_STORAGE_THRESHOLD_MB=10
# For large documents
export INLINE_STORAGE_THRESHOLD_MB=52. Regular Garbage Collection
Run periodic cleanup to remove orphaned blobs:
-- Delete blobs with zero references older than 24 hours
SELECT cleanup_orphaned_blobs(24);3. Optimize Extraction Strategy
Choose appropriate strategies per document type:
// Fast: skip extraction for temporary files
{
"document_type": "temp-file",
"extraction_strategy": "text_native",
"requires_attachment": false
}
// Thorough: full OCR for scanned documents
{
"document_type": "scanned-paper",
"extraction_strategy": "pdf_ocr",
"extraction_config": {
"languages": ["eng"],
"dpi": 300
}
}4. Batch Uploads
For multiple files, upload in parallel:
# Upload multiple files concurrently
for file in *.pdf; do
curl -X POST /api/v1/notes/$NOTE_ID/attachments \
-F "file=@$file" &
done
wait1. Validate File Sources
Only upload files from trusted sources. The security validation catches most threats, but defense-in-depth is important.
2. Review Quarantined Files
Check quarantined attachments regularly:
SELECT a.filename, a.processing_error, a.created_at
FROM attachment a
WHERE a.status = 'quarantined'
ORDER BY a.created_at DESC;3. Strip Sensitive EXIF Data
Before sharing notes publicly:
# Remove EXIF data from images
exiftool -all= image.jpg
# Or use ImageMagick
convert image.jpg -strip image-clean.jpg4. Limit Upload Sizes
Configure appropriate limits:
# Environment variable
export MAX_UPLOAD_SIZE=104857600 # 100 MB
# Nginx proxy
client_max_body_size 100M;5. Regular Integrity Checks
Verify blob integrity periodically:
-- Check unverified blobs
SELECT id, storage_path, size_bytes
FROM attachment_blob
WHERE verification_status IS NULL
OR verified_at < NOW() - INTERVAL '30 days';Cause: File extension is in the blocklist or magic bytes indicate executable.
Solution:
- Check file type:
file --mime-type yourfile.ext - If legitimate, convert to allowed format
- For code files, use plaintext upload with syntax highlighting
Cause: Extraction strategy mismatch or processing failure.
Solution:
- Check attachment status:
GET /api/v1/attachments/{id} - Review
processing_errorfield - Manually trigger re-extraction with correct strategy
Cause: Deduplication not working due to different content hashes.
Solution:
- Files are identical only if byte-for-byte same
- Even minor differences (metadata, compression) create different hashes
- This is intentional for data integrity
Cause: Image doesn't contain GPS EXIF data.
Solution:
- Verify EXIF data:
exiftool image.jpg | grep GPS - Many images have GPS stripped for privacy
- Use manual geolocation tagging if needed
Cause: Database inline storage for large files.
Solution:
- Migrate large blobs to filesystem:
SELECT mark_blob_migrated_to_filesystem(id) FROM attachment_blob WHERE storage_backend = 'database' AND size_bytes > 10485760; -- 10 MB
- Manually copy data to filesystem path
- Update storage backend
Cause: Grace period not expired or cleanup not run.
Solution:
-- Check orphaned blobs
SELECT COUNT(*), SUM(size_bytes) as total_bytes
FROM attachment_blob
WHERE reference_count = 0;
-- Run cleanup (age in hours)
SELECT cleanup_orphaned_blobs(24); -- 24 hour grace period- Document Type Guide - Learn about document types and extraction strategies
- Search Guide - Advanced search techniques for attachment content
- API Reference - Complete API documentation
- Security Best Practices - Operational security guidelines