An AI-powered pipeline that extracts, chunks, and summarizes content from PDF documents using generative models. It handles text, tables, and images, and stores embeddings in ChromaDB for vector search and retrieval.

Features:
- PDF partitioning and intelligent chunking
- Handles text, tables, and images
- AI-generated summaries for mixed content
- Embedding and vector search with ChromaDB
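
To make "partitioning and intelligent chunking" of mixed content more concrete, here is a minimal standard-library sketch of title-based chunking. It is not the Unstructured API and not necessarily how the notebook does it: `Element`, `Chunk`, `chunk_by_title`, and `max_chars` are hypothetical names chosen for illustration, and the sample elements are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical stand-in for one element partitioned out of a PDF."""
    kind: str      # "title", "text", "table", or "image"
    content: str


@dataclass
class Chunk:
    """A group of related elements that would be summarized together."""
    title: str
    texts: list = field(default_factory=list)
    tables: list = field(default_factory=list)
    images: list = field(default_factory=list)

    def text_size(self) -> int:
        # Only text counts toward the size budget in this sketch.
        return sum(len(t) for t in self.texts)


def chunk_by_title(elements, max_chars=200):
    """Start a new chunk at each title, or when the text budget would be exceeded."""
    chunks = []
    current = None
    for el in elements:
        if el.kind == "title":
            current = Chunk(title=el.content)
            chunks.append(current)
            continue
        if current is None:
            # Content that appears before any title goes into an untitled chunk.
            current = Chunk(title="")
            chunks.append(current)
        if el.kind == "text":
            if current.texts and current.text_size() + len(el.content) > max_chars:
                # Split an oversized section, keeping the section title.
                current = Chunk(title=current.title)
                chunks.append(current)
            current.texts.append(el.content)
        elif el.kind == "table":
            current.tables.append(el.content)
        elif el.kind == "image":
            current.images.append(el.content)
    return chunks


# Made-up sample input for illustration only.
elements = [
    Element("title", "Introduction"),
    Element("text", "This report covers quarterly results."),
    Element("table", "<table><tr><td>Q1</td><td>10</td></tr></table>"),
    Element("title", "Methods"),
    Element("text", "Data was collected from public filings."),
    Element("image", "base64-image-placeholder"),
]
chunks = chunk_by_title(elements)
for c in chunks:
    print(c.title, len(c.texts), len(c.tables), len(c.images))
```

Keeping tables and images attached to the chunk they appear in is what lets a later summarization step describe mixed content in context, rather than treating each table or image in isolation.
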

Getting started:
- Clone the repository
- Install dependencies
- Run the notebook to process your PDF and create a searchable vector store
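
As a rough illustration of what a "searchable vector store" does with the generated summaries, the sketch below ranks stored summaries against a query by cosine similarity. It is a toy stand-in and does not use the ChromaDB API: `ToyVectorStore`, `embed`, and `cosine` are hypothetical names, the word-count "embedding" replaces a real embedding model, and the sample summaries are invented.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """In-memory store that keeps (id, summary, vector) triples."""

    def __init__(self):
        self._items = []

    def add(self, doc_id, summary):
        self._items.append((doc_id, summary, embed(summary)))

    def query(self, text, k=1):
        # Rank all stored summaries by similarity to the query, highest first.
        q = embed(text)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[2]), reverse=True)
        return [(doc_id, summary) for doc_id, summary, _ in ranked[:k]]


# Invented summaries, standing in for AI-generated chunk summaries.
store = ToyVectorStore()
store.add("chunk-1", "Quarterly revenue table with totals per region")
store.add("chunk-2", "Diagram of the data collection workflow")
store.add("chunk-3", "Overview of the methods used for sampling")

results = store.query("revenue per region", k=1)
print(results)
```

A real vector database adds persistence, approximate nearest-neighbor indexing, and metadata filtering on top of this basic idea, which is why the project relies on ChromaDB rather than a hand-rolled store.
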

Built with:
- Python
- LangChain
- Unstructured
- ChromaDB
- Google Generative AI

License: MIT