A small, self-contained toolkit to:
- Split multi-invoice PDFs into individual invoice PDF files
- Extract invoice-level metadata (invoice number, date, consignee, dealer)
- Extract and parse tabular item rows from each invoice
- Aggregate and export all items to CSV or Excel
This repository contains two entry points: cli.py (interactive CLI) and ui.py (Tkinter GUI). Core logic lives in custom_modules/.
Contents
cli.py— interactive command-line processor that prompts for output format and filename.ui.py— Tkinter-based GUI with file pickers and progress UI.custom_modules/invoice_splitter.py— PDF splitting logic using PyMuPDF.custom_modules/dataocr.py— invoice-level extraction (PyMuPDF + bold-span heuristics).custom_modules/table_extractor.py— table extraction and column parsing by X coordinate.pdfs/— place source PDFs here (example default used by CLI:pdfs/invoice.pdf).individual_invoice/— default output folder for split PDFs.
Processing steps (full flow)
- Split PDF into individual invoices
- Module:
custom_modules/invoice_splitter.py - How it detects invoices: looks for the text marker
"Tax Invoice"to detect the start of an invoice and"This is a Computer Generated Invoice"to detect the end. When both markers are present the pages between them are saved as a new PDF. - Output filenames:
<originalname>_<batchid>_<count>.pdf(batchid is a short unique id).
- Extract invoice-level fields
- Module:
custom_modules/dataocr.py - Implementation: uses PyMuPDF (
fitz) to walk page text blocks. For specific fields (invoice number, date, consignee, dealer) it looks for known keywords (e.g.Invoice No.,Dated,Consignee,Authorised Signatory) and extracts bold spans nearby when available.
- Extract and parse the items table
- Module:
custom_modules/table_extractor.py - It finds the table region using start/end markers (
S.No.andAmount Chargable(in words)) and selects wide text blocks (>80% page width) inside that region. - It then parses rows by inspecting each line's
x0coordinates and mapping ranges to columns (S.No, Description, HSN, Quantity, Rate, Per/Unit, Discount, Amount). The output per-invoice is a dict like{ "items": [ {"items": ..., "Qnty": ..., "price": ..., "unit": ..., "discount": ..., "total": ...}, ... ] }.
- Aggregate + Export
-
Both
cli.pyandui.pyconvert parsed invoice data into row dictionaries with columns:VCH_SERIES,SALE/PURC_TYPE,MC_NAME,VCH/BILL_DATE,VCH/BILL_NO,PARTY_NAME,ITEM_NAME,QUANTITY,UNIT,PRICE,DISCOUNT_PERCENT,LIST_PRICE_ALT_UNIT,LIST_PRICE,AMOUNT -
Export uses
pandas.DataFrameand supports writing to Excel (.xlsx, viaopenpyxl) or CSV. Both overwrite and append modes are supported.
Installation
Prerequisites
- Python 3.10+ recommended (project
pyproject.tomlrequires >=3.10). - On Windows
tkinteris typically included; if not, install the platform packages for Tk support.
Install dependencies (recommended)
Option A — install with uv (if you use it for process management):
pip install uv
uv syncOption B — plain pip (explicit):
pip install pandas pymupdf openpyxl tqdm
# optional: pip install -r requirements.txtIf you want, I can add a requirements.txt listing these packages so you can use pip install -r requirements.txt.
Running the project
CLI (interactive)
python cli.py
# or with uv: uv run cli.pyBehavior
cli.pydefault input:pdfs/invoice.pdf(change theinput_pdf_filevariable inmain()to use a different file).- The CLI prompts for output format (
excelorcsv) and output filename (defaultinvoice_data). - If the chosen output file already exists the CLI asks whether to
write(overwrite) orappend.
GUI (Tkinter)
python ui.py
# or: uv run ui.pyThe GUI provides file pickers for the input PDF and output folder, options for output filename, format (Excel / CSV) and write mode (overwrite/append). It shows processing progress and a complete/summary dialog on success.
Examples (PowerShell)
Install deps and run GUI:
pip install pandas pymupdf openpyxl tqdm
python ui.pyProcess a PDF and export CSV via CLI:
python cli.py
# choose `csv` when prompted and supply filenameConfiguration and tuning
- Change default input or output paths by editing
cli.pyor by using the GUI. - If your invoice documents use different start/end markers than the defaults, update them in
custom_modules/invoice_splitter.py(search forTax InvoiceandThis is a Computer Generated Invoice). - If parsed table columns are wrong, open
custom_modules/table_extractor.pyand adjust thex0ranges used to detect each column. These ranges are specific to your invoice layout and may need calibration.
Troubleshooting
- No data extracted: confirm the PDF contains selectable text. PyMuPDF reads embedded text; scanned images require OCR (e.g.,
pytesseract) and code changes todataocr.py. - Wrong columns / mis-parsed rows: adjust X-coordinate ranges in
custom_modules/table_extractor.pyand re-run. - Excel export error: ensure
openpyxlis installed. - GUI doesn't start: ensure
tkinteris available in your Python distribution.