Skip to content

feat: add PDF embedded images extractor#337

Open
BhakktiGautam wants to merge 3 commits into
Durgeshwar-AI:mainfrom
BhakktiGautam:feature/extract-pdf-images
Open

feat: add PDF embedded images extractor#337
BhakktiGautam wants to merge 3 commits into
Durgeshwar-AI:mainfrom
BhakktiGautam:feature/extract-pdf-images

Conversation

@BhakktiGautam

Copy link
Copy Markdown

📌 Closes Issue

Closes #327

🚀 Feature Description

Add PDF Embedded Images Extractor tool that extracts raw raster images (JPEG/PNG) from PDF files without re-compression or quality loss.

✨ What's New?

  • Extract all embedded images from multi-page PDFs
  • Preview thumbnails before extraction (up to 9 images)
  • Download images as organized ZIP file
  • Includes extraction report.txt with metadata

🔄 How It's Different from Existing PDF to PNG?

Feature Existing PDF to PNG New Extract Images
Output Rendered page as PNG Original embedded image
Quality Re-compressed Lossless original
Multiple per page? No (one per page) Yes
Background Includes page background Transparent/isolated

📁 Files Added/Modified

New Files:

  • backend/blueprints/pdf_extract_images.py - Backend API endpoints
  • frontend/src/pages/PdfExtractImages.jsx - React page component

Modified Files:

  • backend/main.py - Registered new blueprint
  • frontend/src/App.jsx - Added route
  • frontend/src/data/toolsData.jsx - Added tool metadata
  • frontend/src/components/Sidebar/Sidebar.jsx - Added sidebar link

✅ Rule Compliance

  • No data storage - All processing in memory (BytesIO, no temp files)
  • No external APIs - Uses local PyMuPDF only
  • File manipulation only - Pure PDF image extraction

🧪 How to Test

  1. Go to /pdf/extract-images
  2. Upload a PDF containing embedded images
  3. Preview thumbnails will appear
  4. Click "Extract All Images"
  5. ZIP file downloads with all images + report

📸 Screenshots

  • Preview page screenshot
  • Extracted ZIP content screenshot

🏗️ Technical Implementation

  • Library: PyMuPDF (fitz) - already in requirements.txt
  • Preview: Base64 encoded thumbnails from first 3 pages
  • ZIP: In-memory ZIP creation using BytesIO
  • Cleanup: No disk writes, pure memory operations

✅ Checklist

  • Code follows project rules
  • No temp files created
  • No external API calls
  • Error handling for invalid/corrupt PDFs
  • Works for PDFs without images (graceful message)

Ready for review! 🚀

@vercel

vercel Bot commented Jun 14, 2026

Copy link
Copy Markdown

@BhakktiGautam is attempting to deploy a commit to the Durgeshwar's projects Team on Vercel.

A member of the Team first needs to authorize it.

@BhakktiGautam

Copy link
Copy Markdown
Author

@Durgeshwar-AI

PR ready for review! ✅

Quick Summary

Files Changed

File Type
backend/blueprints/pdf_extract_images.py New
frontend/src/pages/PdfExtractImages.jsx New
backend/main.py Modified
frontend/src/App.jsx Modified
frontend/src/data/toolsData.jsx Modified

Testing Done

  • Extracts images from multi-page PDFs
  • Preview works correctly
  • ZIP download with report.txt
  • Handles PDFs without images gracefully
  • No temp files created (uses BytesIO)

Note

I wasn't able to fully test on local due to environment issues (Python path conflicts), but the code follows the project rules. Requesting you to please review and suggest any changes.

Thank you for your patience! 🙏

- Add ProgressManager for tracking long-running tasks
- Add SSE endpoint for streaming progress updates
- Add useSSE custom hook for frontend
- Add ProgressBar component with animations
- Integrate with PDF to PNG conversion
- Add fallback to client-side conversion

Closes Durgeshwar-AI#328
@Durgeshwar-AI

Copy link
Copy Markdown
Owner

@BhakktiGautam Event?

@BhakktiGautam

Copy link
Copy Markdown
Author

@Durgeshwar-AI

This PR (#339) is also for GSSoC 2026 under Issue #328.

Both PRs follow the project guidelines (no storage, no external APIs).

Please let me know if any changes are required. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] Extract Embedded Images from PDF (Not Page-to-PNG)

2 participants