High-performance embedding API with intelligent caching, batch processing, and production-ready Docker deployment
- Blazing Fast: 3x faster startup, 5x faster batch processing
- Smart Caching: Intelligent LRU caching with auto-offloading
- Production Ready: Docker support, monitoring, security built-in
- Batch Processing: Process multiple texts efficiently in one request
- Observable: Built-in metrics, health checks, and structured logging
- Secure: Rate limiting, CORS, input validation, model whitelisting
- Deploy Anywhere: Docker, Kubernetes, AWS, GCP, Azure ready
- Single & Batch Embeddings - Efficient processing of one or many texts
- Intelligent Model Caching - LRU eviction with TTL-based cleanup
- Auto-offloading - Automatically unload unused models to save memory
- Startup Validation - Pre-validate and warm up models before serving
- Multiple Models - Support for various HuggingFace embedding models
- Async Architecture - Non-blocking I/O with thread pool execution
- 3x Faster Startup - Models cached after validation
- 5x Faster Batch - Batch endpoint vs individual requests
- Optimized Threading - Configurable thread pool workers
- Request Timeouts - Prevent hanging requests
- Processing Metrics - Track response times
- Security: Input validation, rate limiting, CORS, model whitelist
- Monitoring: Metrics endpoint, health checks, structured logs
- Docker: Multi-stage builds, non-root user, health checks
- High Availability: Nginx load balancing ready
- Cloud Ready: Deploy to AWS, GCP, Azure with examples
- Observability: Full metrics, logging, and tracing ready
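The "Async Architecture" and "Request Timeouts" items above can be illustrated with a minimal sketch. It assumes blocking model calls are handed to a thread pool and wrapped in a timeout; `fake_embed`, `embed_async`, and `embed_many` are hypothetical names, and the constants only mirror the `THREAD_POOL_WORKERS` and `REQUEST_TIMEOUT` settings. The project's actual code in `src/main.py` may be organized differently.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

# Illustrative values mirroring THREAD_POOL_WORKERS and REQUEST_TIMEOUT.
THREAD_POOL_WORKERS = 4
REQUEST_TIMEOUT = 300.0

executor = ThreadPoolExecutor(max_workers=THREAD_POOL_WORKERS)


def fake_embed(text: str) -> list[float]:
    """Hypothetical stand-in for a blocking model call (not a real model)."""
    time.sleep(0.01)
    return [float(len(text)), 0.0]


async def embed_async(text: str, timeout: float = REQUEST_TIMEOUT) -> list[float]:
    """Run the blocking call in the pool so the event loop stays responsive."""
    loop = asyncio.get_running_loop()
    future = loop.run_in_executor(executor, fake_embed, text)
    # wait_for raises asyncio.TimeoutError once the timeout elapses.
    return await asyncio.wait_for(future, timeout=timeout)


async def embed_many(texts: list[str]) -> list[list[float]]:
    """Embed several texts concurrently; results keep the input order."""
    return await asyncio.gather(*(embed_async(t) for t in texts))


if __name__ == "__main__":
    print(asyncio.run(embed_many(["Hello", "World", "AI"])))
```

A timeout only stops the caller from waiting; the worker thread still runs its call to completion, so the timeout protects request latency rather than CPU time.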
# Clone the repository
git clone https://github.com/Amirhat/fast-embedding-api.git
cd fast-embedding-api
# Deploy with one command
chmod +x scripts/deploy.sh
./scripts/deploy.sh deploy
# Or use docker-compose
docker-compose up -d

# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Start the server
python -m src.main

# Health check
curl http://localhost:8000/health
# Generate embedding
curl -X POST http://localhost:8000/embed \
-H "Content-Type: application/json" \
-d '{"model_name":"BAAI/bge-small-en-v1.5","text":"Hello, world!"}'
# Batch embedding (5x faster!)
curl -X POST http://localhost:8000/embed/batch \
-H "Content-Type: application/json" \
-d '{"model_name":"BAAI/bge-small-en-v1.5","texts":["Hello","World","AI"]}'

| Document | Description |
|---|---|
| Quick Start Guide | Get started in 5 minutes |
| Production Guide | Deploy to production |
| Docker Setup | Docker configuration |
| Changelog | Version history |
| Project Status | Current status |
import requests
response = requests.post(
"http://localhost:8000/embed",
json={
"model_name": "BAAI/bge-small-en-v1.5",
"text": "Your text here"
}
)
result = response.json()
# Returns: embedding, model_name, dimension, text_length, processing_time_ms

response = requests.post(
"http://localhost:8000/embed/batch",
json={
"model_name": "BAAI/bge-small-en-v1.5",
"texts": ["Text 1", "Text 2", "Text 3"]
}
)
result = response.json()
# Returns: embeddings, model_name, dimension, count, processing_time_ms

curl http://localhost:8000/health
# Returns: status, cache_info, uptime_seconds

curl http://localhost:8000/metrics
# Returns: total_requests, total_embeddings, cache_info, uptime

curl http://localhost:8000/models
# Returns: required_models, cached_models, allowed_models

curl http://localhost:8000/models/BAAI/bge-small-en-v1.5
# Returns: model_name, is_cached, load_time, loaded_at, last_used

| Metric | Result | Status |
|---|---|---|
| Single Embedding | 12.62ms avg | Pass (<15ms) |
| Batch (10 texts) | 4.16ms per text | Pass (<5ms) |
| Throughput | 95.1 req/s | Pass (>80 req/s) |
| Concurrent (10) | 140.2 req/s | Pass (>100 req/s) |
| Cache Speedup | 3.2x faster | Pass (>2x) |
| Batch Speedup | 3.0x faster | Pass (>2.5x) |
| Error Rate | 0.00% | Pass (<1%) |
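The Batch Speedup row is consistent with dividing the single-embedding latency by the per-text batch latency from the first two rows. The sketch below works through that arithmetic. The `speedup` helper is hypothetical, and the benchmark script may compute the figure differently.

```python
def speedup(baseline_ms: float, improved_ms: float) -> float:
    """Return how many times faster the improved path is than the baseline."""
    if improved_ms <= 0:
        raise ValueError("improved_ms must be positive")
    return baseline_ms / improved_ms


# Figures taken from the results table above.
single_ms = 12.62          # average latency of one /embed request
batch_per_text_ms = 4.16   # per-text latency in a 10-text /embed/batch request

batch_speedup = speedup(single_ms, batch_per_text_ms)
print(f"Batch speedup: {batch_speedup:.1f}x")  # → Batch speedup: 3.0x
```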
# Python benchmark
python benchmarks/benchmark.py
# K6 load testing
k6 run benchmarks/k6-load-test.js

See BENCHMARKS.md for detailed results.
# Automated deployment
./scripts/deploy.sh deploy
# View logs
./scripts/deploy.sh logs
# Check status
./scripts/deploy.sh status

# Production
docker-compose up -d
# Development with hot reload
docker-compose --profile dev up fast-embedding-dev
# Scale to 3 instances
docker-compose up -d --scale fast-embedding=3

# Build
docker build -t fast-embedding:latest --target production .
# Run
docker run -d \
--name fast-embedding \
-p 8000:8000 \
-e ENABLE_RATE_LIMIT=true \
-v model_cache:/app/.cache/fastembed \
fast-embedding:latest

# API Settings
HOST=0.0.0.0
PORT=8000
DEBUG=false
LOG_LEVEL=INFO
# Performance
THREAD_POOL_WORKERS=4 # CPU workers
MAX_TEXT_LENGTH=8192 # Max chars
MAX_BATCH_SIZE=32 # Max batch size
REQUEST_TIMEOUT=300 # Timeout (seconds)
# Cache
MODEL_CACHE_TTL=3600 # Cache TTL (seconds)
MAX_CACHED_MODELS=5 # Max cached models
CLEANUP_INTERVAL=60 # Cleanup interval
# Security
ALLOWED_MODELS=model1,model2 # Model whitelist
ENABLE_CORS=true
CORS_ORIGINS=* # Set specific domains in production!
ENABLE_RATE_LIMIT=true
RATE_LIMIT_REQUESTS=100 # Requests per minute
# Monitoring
ENABLE_METRICS=true

Default models validated on startup:
- BAAI/bge-small-en-v1.5
- BAAI/bge-base-en-v1.5
- sentence-transformers/all-MiniLM-L6-v2

Configure them in src/config.py or via the REQUIRED_MODELS environment variable.
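To show how `MAX_CACHED_MODELS` and `MODEL_CACHE_TTL` could interact, here is a minimal sketch of an LRU cache with idle-time expiry. `LRUTTLCache` and its methods are hypothetical names for illustration; the real logic lives in `src/model_manager.py` and may differ. The sketch assumes the TTL counts time since last use and that a periodic task calls `cleanup()` every `CLEANUP_INTERVAL` seconds.

```python
import time
from collections import OrderedDict
from typing import Any, Callable


class LRUTTLCache:
    """Keep at most max_items entries; drop entries idle longer than ttl_seconds."""

    def __init__(self, max_items: int = 5, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_items = max_items
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        # Maps model name -> (model, last_used); order runs oldest to newest.
        self._items: "OrderedDict[str, tuple[Any, float]]" = OrderedDict()

    def get(self, name: str, loader: Callable[[str], Any]) -> Any:
        """Return the cached model, loading it on a miss."""
        if name in self._items:
            model, _ = self._items.pop(name)
        else:
            model = loader(name)
        # Re-insert at the end so this entry is the most recently used.
        self._items[name] = (model, self._clock())
        while len(self._items) > self.max_items:
            self._items.popitem(last=False)  # evict the least recently used
        return model

    def cleanup(self) -> list[str]:
        """Remove entries idle longer than the TTL and return their names."""
        now = self._clock()
        expired = [name for name, (_, used) in self._items.items()
                   if now - used > self.ttl_seconds]
        for name in expired:
            del self._items[name]
        return expired

    def cached_names(self) -> list[str]:
        return list(self._items)
```

The sketch is not thread-safe. A server that loads models from a thread pool would need a lock around `get` and `cleanup`.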
fast-embedding/
├── src/ # Source code
│   ├── __init__.py
│   ├── main.py # Litestar API
│   ├── model_manager.py # Model caching
│   └── config.py # Configuration
│
├── tests/ # Tests
│   ├── test_api.py # API tests
│   └── example_client.py # Usage examples
│
├── benchmarks/ # Performance benchmarks
│   └── benchmark.py # Benchmark suite
│
├── docs/ # Documentation
│   ├── QUICKSTART.md
│   ├── PRODUCTION.md
│   ├── DOCKER_SETUP.md
│   ├── CHANGELOG.md
│   └── PROJECT_STATUS.md
│
├── scripts/ # Utility scripts
│   ├── deploy.sh # Deployment script
│   └── run.sh # Local run script
│
├── .github/ # GitHub workflows
│   └── workflows/
│
├── Dockerfile # Multi-stage Docker build
├── docker-compose.yml # Docker orchestration
├── requirements.txt # Python dependencies
└── README.md # This file
The test suite includes unit tests, integration tests, and performance tests.
# Install test dependencies
make dev-install
# Run all tests
make test
# Run with coverage report
make test-cov
# Run specific test categories
make test-unit # Unit tests only
make test-integration # Integration tests only
make test-fast # Fast mode (minimal output)

tests/
├── conftest.py # Shared fixtures
├── test_config.py # Configuration tests (16 tests)
├── test_model_manager.py # ModelCache tests (25+ tests)
├── test_integration.py # API endpoint tests (40+ tests)
├── test_api.py # Legacy tests (requires server)
└── example_client.py # Example API client
# All tests
pytest tests/ --ignore=tests/test_api.py -v
# Specific test file
pytest tests/test_config.py -v
# Specific test class
pytest tests/test_integration.py::TestEmbedEndpoint -v
# With coverage
pytest tests/ --ignore=tests/test_api.py --cov=src --cov-report=html
# Stop on first failure
pytest tests/ -x
# Run last failed tests
pytest tests/ --lf

# Generate HTML coverage report
make test-cov
# Open the report
open htmlcov/index.html # macOS

# Start server
docker-compose up -d
# Run example client
python tests/example_client.py

For detailed testing documentation, see tests/README.md.
# Run full benchmark suite
python benchmarks/benchmark.py

- Input validation (text length, batch size)
- Model whitelist (ALLOWED_MODELS)
- Rate limiting (configurable)
- CORS configuration
- Request size limits (10MB)
- Timeout protection
- Non-root Docker user
- Authentication: not yet implemented (TODO: JWT/API keys)
- HTTPS: not built in; configure a reverse proxy
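The first two items in the list above (input validation and the model whitelist) could look roughly like the following sketch. `validate_batch` and `ValidationError` are hypothetical names, and the limits simply mirror the `MAX_TEXT_LENGTH` and `MAX_BATCH_SIZE` defaults. The server's real checks and error messages may differ.

```python
from typing import Iterable, Optional

# Illustrative limits mirroring MAX_TEXT_LENGTH and MAX_BATCH_SIZE.
MAX_TEXT_LENGTH = 8192
MAX_BATCH_SIZE = 32


class ValidationError(ValueError):
    """Raised when a request violates a configured limit."""


def validate_batch(model_name: str, texts: list[str],
                   allowed_models: Optional[Iterable[str]] = None) -> None:
    """Check the whitelist, batch size, and per-text length; raise on failure."""
    # A whitelist of None is treated here as "no restriction" (an assumption).
    if allowed_models is not None and model_name not in set(allowed_models):
        raise ValidationError(f"model not allowed: {model_name}")
    if len(texts) > MAX_BATCH_SIZE:
        raise ValidationError(
            f"batch of {len(texts)} exceeds MAX_BATCH_SIZE={MAX_BATCH_SIZE}")
    for index, text in enumerate(texts):
        if len(text) > MAX_TEXT_LENGTH:
            raise ValidationError(
                f"text {index} has {len(text)} chars, "
                f"limit is MAX_TEXT_LENGTH={MAX_TEXT_LENGTH}")
```

Running the checks before a model is loaded means a rejected request never consumes a cache slot or a worker thread.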
# Set specific CORS origins
CORS_ORIGINS=https://yourdomain.com
# Enable rate limiting
ENABLE_RATE_LIMIT=true
# Use model whitelist
ALLOWED_MODELS=BAAI/bge-small-en-v1.5,BAAI/bge-base-en-v1.5
# Always use HTTPS in production (nginx/reverse proxy)

# AWS: push to ECR
docker tag fast-embedding:latest <account>.dkr.ecr.<region>.amazonaws.com/fast-embedding
docker push <account>.dkr.ecr.<region>.amazonaws.com/fast-embedding

# Google Cloud: build the image and deploy to Cloud Run
gcloud builds submit --tag gcr.io/PROJECT-ID/fast-embedding
gcloud run deploy --image gcr.io/PROJECT-ID/fast-embedding --memory 4Gi

# Azure: build in ACR and run as a container instance
az acr build --registry myregistry --image fast-embedding:latest .
az container create --resource-group rg --name fast-embedding \
--image myregistry.azurecr.io/fast-embedding --cpu 4 --memory 4

See PRODUCTION.md for detailed deployment guides.
# Health check
curl http://localhost:8000/health
# Metrics
curl http://localhost:8000/metrics

- Total requests processed
- Total embeddings generated
- Cache statistics (hits, misses, size)
- API uptime
- Model load times
- Processing times
- Prometheus metrics export (coming soon)
- ELK Stack logging
- CloudWatch integration
- Custom dashboards
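The counters listed above (requests, embeddings, cache statistics, uptime) could be tracked in process with something like the sketch below. `Metrics`, `record_request`, and `snapshot` are hypothetical names, and `cache_hit_rate` is an illustrative field; only `total_requests` and `total_embeddings` are taken from the `/metrics` response described earlier.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Metrics:
    """In-process counters; not thread-safe without an external lock."""

    started_at: float = field(default_factory=time.monotonic)
    total_requests: int = 0
    total_embeddings: int = 0
    cache_hits: int = 0
    cache_misses: int = 0

    def record_request(self, embeddings: int, cache_hit: bool) -> None:
        """Count one request, its embeddings, and whether the model was cached."""
        self.total_requests += 1
        self.total_embeddings += embeddings
        if cache_hit:
            self.cache_hits += 1
        else:
            self.cache_misses += 1

    def snapshot(self) -> dict:
        """Return a JSON-serializable view of the current counters."""
        lookups = self.cache_hits + self.cache_misses
        return {
            "total_requests": self.total_requests,
            "total_embeddings": self.total_embeddings,
            "cache_hit_rate": self.cache_hits / lookups if lookups else 0.0,
            "uptime_seconds": time.monotonic() - self.started_at,
        }
```

For example, one batch request of three texts on a cold model followed by a single cached request yields two requests, four embeddings, and a hit rate of 0.5.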
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Litestar - Modern, performant web framework
- FastEmbed - Lightweight embedding library
- Qdrant - Vector database and FastEmbed creators
- Documentation
- Issue Tracker
- Discussions
Star this repo if you find it useful!
Made with ❤️ by the Fast Embedding Team