Feat/pdf parser#323

Closed
michaeltomlinsontuks wants to merge 8 commits into dev from feat/pdf-parser

@michaeltomlinsontuks

Contributor

PR: Timetable Solver and PDF Parser Integration

Branch: feat/pdf-parser
Target: dev

Summary

This pull request integrates a Python-based FastAPI PDF parser service and a NestJS solver/parser controller framework. It adds job queues (BullMQ on Redis) so that timetable optimization and PDF parsing run asynchronously instead of holding HTTP requests open during intensive tasks.

Motivation

  • Timetable optimization and PDF processing are computationally expensive operations. Transitioning to an asynchronous architecture prevents request timeouts and improves application scalability.
  • Python is highly suited for data extraction, whereas NestJS is better suited as a centralized API gateway and orchestrator. Separating PDF parsing into a standalone Python microservice allows us to leverage specialized parsing libraries.

Design Decisions

1. Queue-Based Processing via Redis and BullMQ

Decision: Enqueue PDF parsing and solver jobs into distinct BullMQ queues (pdf-parse and solver-optimize). Workers process the jobs asynchronously and report results through callbacks.
Rationale: Standard request-response cycles are inadequate for long-running CPU-bound tasks. This queue structure decouples job submission from completion, providing robust retry logic, backoff, and resource isolation.
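
The decoupling of submission from completion described above can be sketched with a toy in-memory queue. This is only an analogy written with the Python standard library, not BullMQ or the code in this PR: the class `InMemoryJobQueue`, its methods, and the `max_attempts` parameter are hypothetical names, and real backoff delays are omitted. Only the two queue names come from the PR.

```python
import uuid
from collections import deque

# Queue names taken from the PR description.
PDF_PARSE_QUEUE = "pdf-parse"
SOLVER_OPTIMIZE_QUEUE = "solver-optimize"


class InMemoryJobQueue:
    """Toy stand-in for a job queue (hypothetical, not BullMQ)."""

    def __init__(self, name, max_attempts=3):
        self.name = name
        self.max_attempts = max_attempts
        self.pending = deque()  # job ids waiting for a worker
        self.jobs = {}          # job id -> job record

    def submit(self, payload):
        """Record the job and return its id immediately, without doing the work."""
        job_id = uuid.uuid4().hex
        self.jobs[job_id] = {
            "payload": payload,
            "status": "queued",
            "attempts": 0,
            "result": None,
        }
        self.pending.append(job_id)
        return job_id

    def process_next(self, handler):
        """Simulate one worker step: run the handler, re-queue on failure."""
        if not self.pending:
            return None
        job_id = self.pending.popleft()
        job = self.jobs[job_id]
        job["attempts"] += 1
        try:
            job["result"] = handler(job["payload"])
            job["status"] = "completed"
        except Exception as exc:
            if job["attempts"] < self.max_attempts:
                # Retry later; a real queue would also apply a backoff delay.
                job["status"] = "queued"
                self.pending.append(job_id)
            else:
                job["status"] = "failed"
                job["result"] = str(exc)
        return job_id
```

The caller of `submit` gets a job id back at once and can poll the job record later, which is the property that keeps HTTP requests short in the design above.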

2. Standalone, Scalable Python PDF Parsing Service

Decision: Built apps/pdf_parser as a containerized Python service exposing FastAPI endpoints for parsing University of Pretoria (UP) schedule PDFs.
Rationale: Python's ecosystem contains superior tools for PDF layout analysis and text extraction. Containerizing this as a separate microservice behind Traefik makes it modular, reusable, and easy to deploy.

3. Public Worker Callbacks

Decision: Exposed callback endpoints marked with the @Public() decorator, which bypasses the NestJS authentication guards, so that workers can submit processing results back to the core backend.
Rationale: Background workers do not operate with active user session tokens. Public, validated callback endpoints allow workers to report results asynchronously without manual session generation.
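
The PR does not say how the public callbacks are validated. One common approach, shown here purely as an assumption, is for the worker to sign the callback body with a shared secret and for the backend to check that signature before accepting the result. The function names, the secret, and the payload fields below are hypothetical.

```python
import hashlib
import hmac
import json


def sign_callback(secret: bytes, payload: dict) -> str:
    """Return a hex HMAC-SHA256 signature over a canonical JSON encoding."""
    # sort_keys and fixed separators make the encoding deterministic,
    # so both sides sign exactly the same bytes.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_callback(secret: bytes, payload: dict, signature: str) -> bool:
    """Check a signature in constant time to avoid timing side channels."""
    expected = sign_callback(secret, payload)
    return hmac.compare_digest(expected, signature)


# Hypothetical usage: the worker signs, the backend verifies.
secret = b"shared-worker-secret"  # assumption: distributed via environment config
payload = {"jobId": "abc123", "status": "completed"}
signature = sign_callback(secret, payload)
```

With a check like this, an endpoint can stay free of user sessions while still rejecting callbacks that do not come from a trusted worker.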


Files Changed

New Files

| File | Description |
| --- | --- |
| apps/backend/src/pdf-parser/dto/pdf-parser.dto.ts | Defines data transfer objects and validation schemas for PDF parsing submissions and callbacks. |
| apps/backend/src/pdf-parser/pdf-parser.controller.ts | Implements user endpoints to submit PDF parse jobs and a public callback endpoint for parser workers. |
| apps/backend/src/pdf-parser/pdf-parser.module.ts | Integrates the PDF parser controller and service into the NestJS dependency tree. |
| apps/backend/src/pdf-parser/pdf-parser.service.ts | Business logic to enqueue parsing tasks onto the Redis BullMQ queue. |
| apps/backend/src/redis/queue.constants.ts | Declares queue names (pdf-parse and solver-optimize) to prevent string duplication. |
| apps/backend/src/redis/redis-queue.module.ts | Configures NestJS BullMQ globally using the configured REDIS_URL connection parameter. |
| apps/backend/src/solver/dto/solver.dto.ts | Validation schemas and types for submitting optimization tasks and receiving results. |
| apps/backend/src/solver/solver.controller.ts | Endpoints to trigger timetable optimizer solver jobs and handle worker callbacks. |
| apps/backend/src/solver/solver.module.ts | Standard NestJS module for the solver controller and service. |
| apps/backend/src/solver/solver.service.ts | Service handling the enqueuing and status callback processing of solver jobs. |
| apps/pdf_parser/.dockerignore | Defines ignored file patterns for Docker build context optimization. |
| apps/pdf_parser/main.py | Main FastAPI application setting up endpoints, Swagger documentation, and health checks. |
| apps/pdf_parser/parser/__init__.py | Package initialization for parser engine modules. |
| apps/pdf_parser/parser/base_parser.py | Defines the abstract BaseParser interface for schedule document parsing. |
| apps/pdf_parser/parser/data_processor.py | Cleanup and post-processing utility for parsed event data structures. |
| apps/pdf_parser/parser/up_parser.py | Implements the parsing engine specifically tailored to extract schedules from UP PDFs. |
| apps/pdf_parser/pdf_parser.Dockerfile | Configures a multi-stage Docker build environment for Python FastAPI. |
| apps/pdf_parser/requirements.txt | Python package dependency list for PDF parsing, FastAPI, and instrumentation. |
| apps/pdf_parser/static/brand/... | Static brand logo files (e.g., logos for UMTAS, DNS). |
| apps/pdf_parser/swagger_ui.py | Implements a customized, branded HTML layout for FastAPI's Swagger UI documentation. |
| apps/pdf_parser/up_test_pdfs/... | Test PDF schedules (Lectures, Exams, Tests) representing real-world UP structures. |
| apps/pdf_parser/verify_up_parser.py | Verification command-line tool to dry-run and debug the UP PDF parser output. |
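
The file list mentions an abstract BaseParser interface with a UP-specific implementation. The actual method names and data shapes are not shown in this PR description, so the following is only a guess at the general pattern: the `parse` method, the `ParsedEvent` fields, and the `CsvStubParser` class are hypothetical, and the stub reads comma-separated text instead of a real PDF.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass
class ParsedEvent:
    """Hypothetical shape of one timetable entry."""
    module: str
    day: str
    start: str
    end: str
    venue: str = ""


class BaseParser(ABC):
    """Abstract interface: each institution-specific parser implements parse()."""

    @abstractmethod
    def parse(self, document: bytes) -> List[ParsedEvent]:
        """Turn raw document bytes into a list of events."""


class CsvStubParser(BaseParser):
    """Stand-in for a real parser; reads 'module,day,start,end,venue' lines."""

    def parse(self, document: bytes) -> List[ParsedEvent]:
        events = []
        for line in document.decode("utf-8").splitlines():
            parts = [p.strip() for p in line.split(",")]
            if len(parts) < 4:
                continue  # skip blank or malformed rows
            venue = parts[4] if len(parts) > 4 else ""
            events.append(ParsedEvent(parts[0], parts[1], parts[2], parts[3], venue))
        return events
```

Keeping the interface abstract means a parser for another institution's PDF layout could be added without changing the code that calls `parse`.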

Modified Files

| File | Change |
| --- | --- |
| apps/backend/src/app.module.ts | Registers RedisQueueModule, PdfParserModule, and SolverModule. |
| apps/solver/solver.Dockerfile | Copies custom Swagger UI configuration and static brand logos. |
| docker-compose.prod.yml | Integrates pdf-parser container with Traefik routing configuration for production environment. |
| docker-compose.yml | Adds pdf-parser configuration, maps local Redis container port 6379, and defines local profiles. |
| package.json | Adds developer scripts for installing and verifying the PDF parser (pdf-parser:install, pdf-parser:verify). |

API Endpoints (If Applicable)

| Method | Path | Description |
| --- | --- | --- |
| POST | /pdf-parser/submit | Enqueues a new PDF parsing job onto the BullMQ Redis queue. |
| POST | /pdf-parser/callback | Public callback called by workers to report parsing success or failure. |
| POST | /solver/submit | Enqueues a timetable optimization job onto the solver queue. |
| POST | /solver/callback | Public callback called by solver workers to upload computed solutions. |
| POST | /parse (Python App) | Accepts a multipart/form-data file upload and returns parsed schedule data directly. |
| GET | /health (Python App) | Checks that the Python FastAPI server is running properly. |
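
As a rough illustration of calling the Python app's POST /parse endpoint, the sketch below builds a multipart/form-data request using only the standard library. The form field name `file`, the host and port in the URL, and the helper `build_multipart` are assumptions, since the PR description does not specify them. The request is constructed but not sent.

```python
import urllib.request
import uuid


def build_multipart(field: str, filename: str, content: bytes,
                    content_type: str = "application/pdf"):
    """Encode one file as a multipart/form-data body; return (body, header value)."""
    boundary = uuid.uuid4().hex
    parts = [
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"'.encode(),
        f"Content-Type: {content_type}".encode(),
        b"",                          # blank line separates part headers from data
        content,
        f"--{boundary}--".encode(),   # closing boundary
        b"",
    ]
    return b"\r\n".join(parts), f"multipart/form-data; boundary={boundary}"


# Placeholder bytes; a real call would read an actual schedule PDF from disk.
pdf_bytes = b"%PDF-1.4 placeholder"
body, content_type_header = build_multipart("file", "schedule.pdf", pdf_bytes)

# Assumed local URL; adjust to wherever the pdf-parser container is reachable.
request = urllib.request.Request(
    "http://localhost:8000/parse",
    data=body,
    headers={"Content-Type": content_type_header},
    method="POST",
)
# urllib.request.urlopen(request) would send it once the service is running.
```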

@sonarqubecloud

sonarqubecloud Bot commented Jun 1, 2026


Quality Gate failed

Failed conditions
Reliability Rating on New Code: C (required ≥ A)


@Wilmar-Smit Wilmar-Smit closed this Jun 2, 2026

2 participants