gatekeeper

A vector database built from scratch — to actually understand how embeddings, semantic search and RAG work under the hood, one small step at a time.

This is a teaching implementation, not production code. Each commit is one learning step, from "what is cosine similarity?" to a full Retrieval-Augmented Generation loop, and finally to reading and contributing to the real thing (Nextcloud's context_chat_backend).

Status

Built so far — each one a single learning step:

Distance metrics by hand — dot, L2, cosine (gatekeeper/distance.py)
Normalizing — unit vectors, so cosine becomes a plain dot product
Brute-force search — VectorStore: add records, get the top-k most similar (gatekeeper/store.py)
Vectorized search — NumpyVectorStore: the same search as one matrix multiply (gatekeeper/numpy_store.py)
Real embeddings — turn actual sentences into vectors with a trained model (gatekeeper/embed.py, optional [embeddings] extra)
Pluggable backends — one EmbeddingBackend interface, swap local (sentence-transformers) for managed AWS Bedrock (Titan / Cohere) with the same store (gatekeeper/backends.py, optional [bedrock] extra)
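To make the first two steps concrete, here is a minimal pure-Python sketch of the three metrics and of normalization. It is an illustration only, not the code in gatekeeper/distance.py; the function names (dot, l2_distance, cosine_similarity, normalize) are assumptions chosen for this example.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Sum of element-wise products."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a: Vector, b: Vector) -> float:
    """Straight-line (Euclidean) distance between two points."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def norm(a: Vector) -> float:
    """Length of a vector: the square root of its dot product with itself."""
    return math.sqrt(dot(a, a))


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between a and b; ignores vector length."""
    denom = norm(a) * norm(b)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(a, b) / denom


def normalize(a: Vector) -> List[float]:
    """Scale a vector to unit length."""
    n = norm(a)
    if n == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / n for x in a]


a, b = [3.0, 4.0], [4.0, 3.0]
print(cosine_similarity(a, b))          # cosine on the raw vectors
print(dot(normalize(a), normalize(b)))  # same value, as a plain dot product
```

Both printed values agree up to floating-point rounding, which is the point of the normalizing step: once every stored vector has unit length, ranking by cosine needs only a dot product.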

Write-up: docs/chapter-1-similarity-and-search.md.
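The brute-force search and pluggable-backend steps listed under Status can be sketched together in a few lines. This is a hedged illustration, not the repo's VectorStore or EmbeddingBackend: the embed method, the ToyVectorStore class and the LetterCountBackend stand-in are all hypothetical names made up for this example, and the "embedding" is just letter counts rather than a trained model.

```python
import heapq
import math
from typing import List, Protocol, Tuple


class EmbeddingBackend(Protocol):
    """Assumed interface: anything that turns text into a vector."""

    def embed(self, text: str) -> List[float]: ...


class LetterCountBackend:
    """Toy stand-in for a real model: counts of a-z, scaled to unit length."""

    def embed(self, text: str) -> List[float]:
        counts = [0.0] * 26
        for ch in text.lower():
            if "a" <= ch <= "z":
                counts[ord(ch) - ord("a")] += 1.0
        length = math.sqrt(sum(c * c for c in counts))
        return [c / length for c in counts] if length else counts


class ToyVectorStore:
    """Keeps (text, vector) pairs and scans all of them on every query."""

    def __init__(self, backend: EmbeddingBackend) -> None:
        self._backend = backend
        self._records: List[Tuple[str, List[float]]] = []

    def add(self, text: str) -> None:
        self._records.append((text, self._backend.embed(text)))

    def search(self, query: str, k: int = 3) -> List[Tuple[float, str]]:
        q = self._backend.embed(query)
        # Vectors are unit length, so the dot product is the cosine similarity.
        scored = (
            (sum(x * y for x, y in zip(q, vec)), text)
            for text, vec in self._records
        )
        return heapq.nlargest(k, scored)


store = ToyVectorStore(LetterCountBackend())
for sentence in ["abc", "xyz", "abd"]:
    store.add(sentence)
for score, text in store.search("abc", k=2):
    print(f"{score:.3f}  {text}")
```

Because the store only calls embed, swapping the toy backend for a local model or a managed service would not change the search code, assuming the replacement exposes the same method.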

Quickstart

python3 -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate
pip install -e ".[dev]"

python -m gatekeeper.distance       # cosine / L2 / dot on toy vectors
python -m gatekeeper.store          # brute-force semantic search demo
python -m gatekeeper.numpy_store    # the same search, vectorized with NumPy
pytest                           # run the checks

# optional: real text embeddings (pulls in torch, so it is heavier; the first run downloads a model)
pip install -e ".[embeddings]"
python -m gatekeeper.embed          # semantic search on real sentences

Why

Most people use a vector database as a black box. This repo opens the box: the geometry of similarity, brute-force vs. approximate search (IVF / HNSW), chunking, and grounding an LLM with retrieved context.
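As one small example of what "chunking" means here, the sketch below splits text into overlapping windows of words so that each piece can be embedded and retrieved on its own. It is an assumption-laden illustration, not code from this repo: the chunk_words name and the size and overlap parameters are hypothetical, and real pipelines often split on sentences or tokens instead.

```python
from typing import List


def chunk_words(text: str, size: int = 50, overlap: int = 10) -> List[str]:
    """Split text into windows of `size` words, sharing `overlap` words."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in the range [0, size)")

    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


for chunk in chunk_words("a b c d e f g", size=4, overlap=2):
    print(chunk)
```

The overlap is a design choice: it keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of storing some words twice.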
