PaperMind

Multi-Agent Research Intelligence Platform for processing research papers with AI agents (ingestion, embeddings, clustering, and RAG).

Tech Stack

  • Frontend: Next.js (TypeScript)
  • Backend: Flask API
  • Database: PostgreSQL + pgvector
  • Cache / Message Broker: Redis
  • Background Jobs: Celery
  • Containerization: Docker

Prerequisites

  • Git (to clone the repository)
  • Docker and Docker Compose
  • make (optional, for the Makefile shortcuts)

Quick Start

  1. Clone and enter the repo

    git clone https://github.com/Amarnath001/paperMind.git
    cd paperMind
  2. Create environment file

    cp .env.example .env
  3. Run with Docker Compose

    make up

    Or use docker compose up --build directly. Other useful commands:

    • make down – stop and remove containers
    • make logs – follow container logs
    • make restart – stop, rebuild, and start
  4. Access the app

    Open the frontend at http://localhost:3000 and the backend API at http://localhost:5000 (default ports; check docker-compose.yml for the exact mappings).

Project Structure

/
├── frontend/          # Next.js TypeScript app
├── backend/           # Flask API
├── infra/             # Docker & deployment configs
├── docs/              # Architecture notes
├── docker-compose.yml
└── .env.example

Local Development (without Docker)

Backend

cd backend
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt
export DATABASE_URL=postgresql://postgres:postgres@localhost:5432/papermind
export REDIS_URL=redis://localhost:6379/0
python run.py

Frontend

cd frontend
npm install
npm run dev

Ensure PostgreSQL and Redis are running (e.g. via docker compose up -d postgres redis).

API

  • GET /healthz – Returns {"status": "ok"} (liveness)
  • GET /readyz – Returns readiness status; checks PostgreSQL and Redis. Returns 503 if either is unreachable.
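
For illustration, a minimal sketch of what such a readiness check can look like in Flask, assuming psycopg2 and redis-py clients (the handler body and response shape are illustrative, not the repo's exact implementation):

import os

import psycopg2
import redis
from flask import Flask, jsonify

app = Flask(__name__)

@app.get("/readyz")
def readyz():
    checks = {}
    try:
        psycopg2.connect(os.environ["DATABASE_URL"]).close()
        checks["postgres"] = "ok"
    except Exception:
        checks["postgres"] = "unreachable"
    try:
        redis.Redis.from_url(os.environ["REDIS_URL"]).ping()
        checks["redis"] = "ok"
    except Exception:
        checks["redis"] = "unreachable"
    # 503 tells orchestrators to keep the instance out of rotation
    status = 200 if all(v == "ok" for v in checks.values()) else 503
    return jsonify(checks), status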

Milestone 1 – Core Application

This milestone introduces the core application layer:

  • Authentication – Email/password signup & login with bcrypt-hashed passwords and JWT-based auth (/auth/signup, /auth/login, /auth/me); the core primitives are sketched after this list.
  • Workspaces – Multi-tenant workspaces with membership and roles; endpoints to create and list workspaces and fetch a workspace (/workspaces, /workspaces/:id).
  • Paper upload – Authenticated PDF uploads (max 20MB) into workspaces via /papers/upload, saving files under the backend uploads/ directory and recording metadata in PostgreSQL.
  • Library listing – Workspace-specific library listing via /papers?workspace_id=..., returning all papers for a workspace.
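
As a rough sketch (not the repo's exact code), the bcrypt and JWT primitives behind these endpoints look like the following, using the bcrypt and PyJWT libraries; the token payload and secret handling are assumptions:

import datetime

import bcrypt
import jwt

SECRET = "change-me"  # illustrative; a real app reads this from the environment

def hash_password(password: str) -> bytes:
    # bcrypt embeds the salt in the resulting hash
    return bcrypt.hashpw(password.encode("utf-8"), bcrypt.gensalt())

def verify_password(password: str, hashed: bytes) -> bool:
    return bcrypt.checkpw(password.encode("utf-8"), hashed)

def issue_token(user_id: int) -> str:
    payload = {
        "sub": str(user_id),
        "exp": datetime.datetime.now(datetime.timezone.utc)
        + datetime.timedelta(hours=12),
    }
    return jwt.encode(payload, SECRET, algorithm="HS256")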

Milestone 2 – Async Ingestion Pipeline

Milestone 2 adds an asynchronous ingestion pipeline and job tracking:

  • Jobs table – Tracks background jobs (jobs table) including type, status (queued, running, completed, failed), progress, and errors.
  • Chunks table – Stores extracted text chunks for each paper (chunks table) with chunk_index, text, and token_count.
  • Celery + Redis – Uses Celery workers with Redis as the broker and result backend for asynchronous processing.
  • PDF extraction – Extracts text from uploaded PDFs using pypdf, then splits text into chunks (~800–1200 characters) with simple paragraph-based chunking (sketched after this list).
  • Job-triggered ingestion – After a successful PDF upload, an ingestion job is created and a Celery task processes the paper in the background (updating paper status from uploaded → processing → ready, or failed).
  • APIs – New /jobs and /jobs/:id endpoints expose job metadata, and paper APIs expose processing status for frontend visibility.
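
A minimal sketch of the paragraph-based chunking described above; the exact thresholds and splitting rules in the repo may differ:

def chunk_text(text: str, max_chars: int = 1200) -> list[str]:
    # Greedily pack whole paragraphs into chunks of at most max_chars
    # (targeting the ~800-1200 character range described above),
    # hard-splitting any single paragraph longer than the limit.
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        while len(para) > max_chars:
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        current = para
    if current:
        chunks.append(current)
    return chunks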

Milestone 3 – Local Embeddings & Semantic Search

Milestone 3 adds local embedding generation and vector search powered by Postgres pgvector and Sentence Transformers: fast, private, and free of external AI costs.

  • Vector Storage – Adds embedding vector(384) columns with ivfflat indexes for both the papers and chunks tables.
  • Local Embedding Pipeline – Uses local, open-source models (default: BAAI/bge-small-en-v1.5) via sentence-transformers, removing the need for paid external APIs.
  • Chained Jobs – The ingestion pipeline automatically queues an embedding job after chunking, chaining the workflow smoothly: upload → ingestion → chunking → embedding → semantic search.
  • Search APIs – Exposes two new semantic vector-search endpoints: /search (find chunks matching a text query in a workspace) and /papers/<id>/similar (find related papers using their cached centroid vector).
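
For illustration, the /search flow can be sketched as follows: embed the query locally, then rank chunks by pgvector cosine distance. The chunks.embedding column and the default model come from the text above; the join and the remaining column names are assumptions:

import os

import psycopg2
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # 384-dimensional vectors

def search_chunks(query: str, workspace_id: int, limit: int = 10):
    # pgvector expects vectors as '[v1,v2,...]' literals; <=> is cosine distance
    vec = "[" + ",".join(str(v) for v in model.encode(query).tolist()) + "]"
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT c.id, c.text, c.embedding <=> %s::vector AS distance
            FROM chunks c
            JOIN papers p ON p.id = c.paper_id
            WHERE p.workspace_id = %s
            ORDER BY distance
            LIMIT %s
            """,
            (vec, workspace_id, limit),
        )
        return cur.fetchall()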

LLM Provider (Gemini)

PaperMind keeps embeddings local (Sentence Transformers + pgvector) and uses an external LLM only for text generation (answering questions, summarisation, future RAG features).

  • Provider: Gemini (via the google-generativeai Python SDK).
  • Usage: LLMs are wrapped by LLMService in backend/app/services/llm_service.py, which exposes a simple generate_text(prompt) API and a generate_answer(question, context_chunks) helper for RAG-style prompts; a condensed sketch follows this list.
  • Model: Default model is gemini-1.5-flash (configurable via GEMINI_MODEL).
  • Configuration:
    • Set GEMINI_API_KEY in your .env (get a key from Google AI Studio).
    • Optional: override LLM_PROVIDER (currently only "gemini" is supported) or GEMINI_MODEL.
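
A condensed sketch of the wrapper described above, using the google-generativeai SDK's real calls (genai.configure, GenerativeModel, generate_content); the prompt format is illustrative:

import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

class LLMService:
    """Condensed sketch of the wrapper described above."""

    def __init__(self) -> None:
        self.model = genai.GenerativeModel(
            os.environ.get("GEMINI_MODEL", "gemini-1.5-flash")
        )

    def generate_text(self, prompt: str) -> str:
        return self.model.generate_content(prompt).text

    def generate_answer(self, question: str, context_chunks: list[str]) -> str:
        # RAG-style prompt: ground the answer in the retrieved chunks
        context = "\n\n".join(context_chunks)
        prompt = (
            "Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}"
        )
        return self.generate_text(prompt)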

This prepares the system for Milestone 4 RAG features, where retrieved chunks from pgvector search will be passed into Gemini for high-quality answer generation.

Milestone 4 – RAG Chat with Citations

Milestone 4 adds a retrieval-augmented chat experience over workspace libraries:

  • Conversation storage – New conversations and messages tables track chat sessions, participants, and message history (including JSONB citations on assistant messages).
  • Retrieval service – A dedicated retrieval layer uses local embeddings + pgvector (chunks.embedding) to fetch the most relevant chunks for a question, scoped to a workspace (and optionally a single paper).
  • RAG answers with Gemini – The LLMService uses Gemini to generate grounded answers from retrieved chunks via generate_answer_with_citations, returning both an answer and structured citation metadata.
  • Chat APIs – New /chat endpoints:
    • POST /chat/conversations – create a workspace-scoped conversation.
    • GET /chat/conversations?workspace_id=... – list conversations in a workspace.
    • GET /chat/conversations/<conversation_id>/messages – fetch message history.
    • POST /chat/ask – ask a question within a conversation, run retrieval, generate an answer with citations, and persist both user and assistant messages (see the sketch after this list).
  • Workspace & paper scoping – All chat and retrieval operations enforce workspace membership, and POST /chat/ask can optionally restrict retrieval to a single paper via paper_id.
  • Frontend chat UI – A simple chat interface at /workspace/[id]/chat shows conversations, message history, and assistant answers with inline citations (paper title, chunk index, and labels like [1], [2]).
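
As a sketch of the POST /chat/ask flow under stated assumptions: retrieve_context_for_question is named in the repo, but its signature and the chunk fields used here (paper_title, chunk_index, text) are illustrative, as is the prompt:

from app.services.retrieval_service import retrieve_context_for_question
from app.services.llm_service import LLMService  # as sketched above

llm = LLMService()

def ask(question: str, workspace_id: int, paper_id: int | None = None) -> dict:
    # fetch the most relevant chunks, scoped to the workspace (and optionally
    # a single paper)
    chunks = retrieve_context_for_question(question, workspace_id, paper_id)
    context = "\n\n".join(
        f"[{i + 1}] ({c['paper_title']}, chunk {c['chunk_index']}) {c['text']}"
        for i, c in enumerate(chunks)
    )
    prompt = (
        "Answer using only the numbered sources below, citing them inline "
        f"as [n].\n\nSources:\n{context}\n\nQuestion: {question}"
    )
    answer = llm.generate_text(prompt)
    citations = [
        {"label": i + 1, "paper_title": c["paper_title"], "chunk_index": c["chunk_index"]}
        for i, c in enumerate(chunks)
    ]
    # the user question and the assistant answer (with JSONB citations) are
    # then persisted to the messages table
    return {"answer": answer, "citations": citations}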

The full pipeline is now:

upload → ingestion → chunking → embedding → retrieval → Gemini answer (with citations).

Retrieval Quality Optimisation: Local Reranking

To improve answer quality without adding any external API costs, PaperMind performs local cross-encoder reranking on top of pgvector search:

  • Two-stage retrieval:
    1. Initial retrieval: pgvector semantic search over chunks.embedding fetches a broader candidate set (default INITIAL_RETRIEVAL_LIMIT = 20).
    2. Local reranking: a Sentence Transformers cross-encoder (cross-encoder/ms-marco-MiniLM-L-6-v2) scores each (question, chunk) pair and selects the best FINAL_CONTEXT_LIMIT chunks (default 5).
  • Implementation:
    • Config values in Config:
      • RERANKER_MODEL (default "cross-encoder/ms-marco-MiniLM-L-6-v2").
      • ENABLE_RERANKING (default true).
      • INITIAL_RETRIEVAL_LIMIT (default 20).
      • FINAL_CONTEXT_LIMIT (default 5).
    • Reranking is implemented in backend/app/services/reranking_service.py and integrated into retrieve_context_for_question in backend/app/services/retrieval_service.py.
  • Why this helps:
    • pgvector recall is high but ranking is purely embedding-based; the cross-encoder re-scores the full question + chunk text jointly, which tends to surface more semantically precise context for Gemini.
    • Everything runs locally (no extra API calls), preserving privacy and keeping RAG costs low.
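
A minimal sketch of the second stage, using the sentence-transformers CrossEncoder API with the model named above; the candidate chunks would come from the initial pgvector retrieval:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(question: str, candidates: list[str], top_k: int = 5) -> list[str]:
    # score each (question, chunk) pair jointly, then keep the best top_k
    scores = reranker.predict([(question, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]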

Milestone 5 – Research Insights

Milestone 5 adds higher-level intelligence over the paper library, beyond retrieval and chat:

  • Paper summaries – After embedding, an analysis job runs a Gemini-powered summarisation step via summarization_service.generate_paper_summary, storing a 3–5 sentence summary in papers.summary.
  • Topic extraction – The same analysis job uses topic_service.extract_paper_topics to extract 5–8 short topics/keywords per paper, stored in papers.topics (as TEXT[]).
  • Paper clustering – The clustering_service groups papers per workspace using KMeans over stored paper embeddings, writing a cluster_id back to each paper (sketched after this list).
  • Workspace insights API – insight_service.get_workspace_insights powers new /insights endpoints to return:
    • total_papers
    • clusters (papers grouped by cluster_id)
    • topics (aggregated topic counts)
    • recent_papers (latest papers with summaries, topics, clusters).
  • Insights dashboard – A new UI at /workspace/[id]/insights surfaces these insights: total papers, top topics, cluster cards (with per-cluster papers and summaries), and a recent papers list.
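
For illustration, the clustering step can be sketched with scikit-learn's KMeans over the stored paper embeddings; the cluster-count heuristic is an assumption:

import numpy as np
from sklearn.cluster import KMeans

def cluster_papers(embeddings: list[list[float]], max_clusters: int = 5) -> list[int]:
    # one embedding per paper in the workspace
    X = np.array(embeddings)
    n_clusters = min(max_clusters, len(X))
    labels = KMeans(n_clusters=n_clusters, n_init="auto", random_state=0).fit_predict(X)
    # each label would be written back to papers.cluster_id
    return labels.tolist()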

The full pipeline now looks like:

upload → ingestion → chunking → embedding → analysis (summary/topics/clusters) → retrieval → reranking → Gemini answer.

Database Migrations

For new databases, the schema is created via backend/db/schema.sql on first container startup.

For existing local databases, you can either:

  • Reset the Postgres volume and re-run initialization:

    docker compose down -v
    docker compose up --build
  • Or apply the migration SQL manually:

    # From the project root, with Postgres running
    docker compose exec postgres psql -U postgres -d papermind -f /docker-entrypoint-initdb.d/schema.sql
    docker compose exec -T postgres psql -U postgres -d papermind < backend/db/migrations/002_milestone2.sql
    docker compose exec -T postgres psql -U postgres -d papermind < backend/db/migrations/003_milestone3.sql

Running the Celery worker

The Celery worker is defined as a separate service in docker-compose.yml and is started automatically with:

docker compose up --build

The worker shares the same code and environment as the backend API and mounts the same uploads volume so that uploaded PDFs are available during ingestion.
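
A hedged sketch of how the Celery app and a chained ingest-then-embed flow can be wired with Redis as broker and result backend; the task names and bodies are illustrative:

import os

from celery import Celery, chain

celery_app = Celery(
    "papermind",
    broker=os.environ.get("REDIS_URL", "redis://localhost:6379/0"),
    backend=os.environ.get("REDIS_URL", "redis://localhost:6379/0"),
)

@celery_app.task
def ingest_paper(paper_id: int) -> int:
    ...  # extract text with pypdf, write chunk rows, update job progress
    return paper_id

@celery_app.task
def embed_paper(paper_id: int) -> int:
    ...  # embed chunks and populate the pgvector columns
    return paper_id

def queue_pipeline(paper_id: int) -> None:
    # called by the upload endpoint after the PDF is saved and a job row exists
    chain(ingest_paper.s(paper_id), embed_paper.s()).delay()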

Database Initialization

When you run:

docker compose up --build

the PostgreSQL container automatically runs the schema defined in:

  • backend/db/schema.sql

This uses Postgres' standard entrypoint mechanism by mounting the file into /docker-entrypoint-initdb.d/schema.sql. The script is only executed on first initialization of the postgres_data volume; subsequent docker compose up runs will not re-apply the schema.

If you want to reset the database state and re-run schema initialization (for example during early development), you can remove the volume:

docker compose down -v

Then start the stack again:

docker compose up --build

Milestone 6 – Production Deployment & CI/CD

Milestone 6 focuses on productionising PaperMind while keeping local development simple:

  • Storage abstraction (local + S3-compatible) – storage_service now supports both local filesystem storage (default for development) and S3-compatible object storage (AWS S3, Cloudflare R2, etc.), controlled via STORAGE_PROVIDER and associated S3_* environment variables; see the sketch after this list. Workers access PDFs through the same abstraction, so ingestion continues to work in either mode.
  • Production configuration – Config is fully environment-driven and grouped by concern (database, Redis, embeddings, LLMs, storage, retrieval). .env.example documents the new storage-related variables and remains safe for local use.
  • Logging & observability basics – The Flask app logs each request with method, path, status, and duration in a structured, container-friendly format and captures uncaught exceptions without exposing details to clients. Celery continues to emit standard worker logs.
  • Rate limiting & security hardening – Sensitive endpoints such as /auth/signup, /auth/login, /papers/upload, and /chat/ask are protected by lightweight rate limits using Flask-Limiter, in addition to existing auth and validation. Upload handling still enforces PDF-only uploads and a 20MB size cap.
  • Automated tests – A minimal pytest-based backend test suite has been added (for example, basic /healthz and auth validation checks), providing a foundation to grow coverage over time.
  • GitHub Actions CI – Workflows under .github/workflows/ now:
    • Run backend tests (backend-ci.yml) against a real Postgres + Redis stack.
    • Build the Next.js frontend (frontend-ci.yml) to catch compile-time issues.
    • Build backend and frontend Docker images (docker-build.yml) to ensure container definitions remain valid.
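
As a sketch of the storage abstraction under STORAGE_PROVIDER (the S3 branch uses boto3; every name other than STORAGE_PROVIDER and the S3_* prefix is an assumption):

import os
from pathlib import Path

import boto3

def save_pdf(key: str, data: bytes) -> None:
    if os.environ.get("STORAGE_PROVIDER", "local") == "s3":
        s3 = boto3.client(
            "s3",
            endpoint_url=os.environ.get("S3_ENDPOINT_URL"),  # e.g. a Cloudflare R2 endpoint
        )
        s3.put_object(Bucket=os.environ["S3_BUCKET"], Key=key, Body=data)
    else:
        # local filesystem mode writes under the backend uploads/ directory
        path = Path("uploads") / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)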

In production you will typically run:

  • the backend API (Flask app behind a WSGI server such as gunicorn) connected to managed Postgres, Redis, and optional S3-compatible storage;
  • one or more Celery workers using the same codebase and environment for ingestion, embedding, and analysis tasks;
  • the frontend (Next.js) either as a container or as a static export behind a CDN, configured to talk to the backend API.

Local development remains unchanged: docker compose up --build starts Postgres, Redis, backend API, Celery worker, and frontend, all using local filesystem storage by default.

Lightweight cloud deployment (Railway / Vercel / Render)

For low-cost or free-tier hosting (e.g. Railway, Render) where a separate Celery worker is not available, the backend supports a single-process, synchronous mode:

  • Embeddings: Use the Gemini API for embeddings (EMBEDDING_PROVIDER=gemini, the default in this mode). No local ML stack (torch, sentence-transformers) is required, so the image stays small.
  • Processing: Set ASYNC_PROCESSING=false (default in production-safe config). After upload, the full pipeline (ingest → chunk → embed → analyze) runs inline in the request. No Redis or Celery worker is required.
  • Reranking: Local cross-encoder reranking is off by default (ENABLE_RERANKING=false / ENABLE_LOCAL_RERANKING=false). Retrieval uses pgvector results directly.
  • Clustering: Set ENABLE_CLUSTERING=false (default) to avoid scikit-learn; the insights page still works with summaries, topics, and recent papers. cluster_id may be null.

Required env for this mode: GEMINI_API_KEY plus your database URL and (if needed) storage settings. The backend binds to the PORT environment variable and listens on 0.0.0.0, as these platforms expect; see the sketch below.
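
A minimal entrypoint sketch for this mode (the create_app factory is assumed):

import os

from app import create_app  # assumed application factory

app = create_app()

if __name__ == "__main__":
    # platforms like Railway and Render inject PORT and route traffic to it
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", "5000")))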

License

Private
