TaxGuard-TN is a production-grade, hybrid AI framework designed for the Tamil Nadu Revenue & Disaster Management Department. It combines Neuro-Symbolic AI, Multi-Agent Orchestration, and Bilingual NLP to detect financial fraud and tax evasion with deterministic traceability.
Modern tax evasion schemes are often hidden within complex narratives and multi-layered financial networks. TaxGuard-TN moves beyond traditional black-box machine learning by implementing a Neuro-Symbolic architecture. It doesn't just predict risk; it explains it through statutory legal rules and structured intermediate representations (IR).
- Bilingual Intelligence: Native support for Tamil (தமிழ்) and English document processing.
- Explainable AI (XAI): Every risk score is backed by a logical trace, meeting government transparency standards.
- Hybrid Engine: Fuses semantic NLP signals with hard-coded legal overrides and graph-based network analysis.
- Deterministic Reliability: Guaranteed reproducibility for audit purposes via locked-seed execution.
The system operates through five specialized agents, each providing a unique risk signal (
| Agent | Signal | Role | Core Technology |
|---|---|---|---|
| Narrator (NLP) | Semantic Risk Detection | MuRIL (Indic-BERT), Cosine Similarity | |
| Accountant (Fin) | Financial Anomaly | Forensic Analytics, Statistical Profiling | |
| Graph (Network) | Contagion Analysis | NetworkX, Graph Neural Networks (PyG) | |
| Skeptic (Adv) | Collusion Detection | Adversarial Pattern Matching | |
| Legal (Logic) | Hard Override | PyDatalog, Symbolic Logic Engine |
The system's strongest differentiator. It converts messy free-text from OCR into structured TaxFactIR (JSON) using Pydantic validation. This ensures that the logic engine receives "clean" facts, preventing "garbage-in, garbage-out" scenarios.
- Layout-Aware OCR: Uses Tesseract with custom configurations for Indian invoice formats.
- Indic NLP: Leverages
google/muril-base-casedto handle Tamil Unicode (U+0B80–U+0BFF) and English seamlessly within the same embedding space.
Statutory rules (e.g., "Offshore evasion under Section X") are defined as logical predicates. If a violation is detected, the Legal Agent pins the risk score to 0.99, overriding all probabilistic signals.
NLP_Model/
├── 📂 NLP_MODEL_FULL_CODES/ # Core implementation and entry points
│ ├── main_model.py # Main pipeline execution script
│ ├── NLP_MODEL.pdf # Comprehensive system documentation
│ ├── NLP_REPORT.pdf # Final project report
│ └── 📊 Functional_test.xlsx # Test case templates
├── 📂 NLP_V2/ # Output artifacts and visualizations
│ ├── risk_landscape.html # Interactive 3D risk visualization
│ ├── network_graph.png # Entity relationship visualization
│ └── 📄 master_dataset.csv # Final audit registry with risk scores
├── 📂 TN_Tax_Dataset/ # Processed datasets for Tamil Nadu region
├── 📓 TaxGuard_TN_Kaggle_Notebook.ipynb # Production-ready Kaggle workflow
└── 📓 TaxGuard_TN_Dataset_Creation_Notebook.ipynb # Synthetic data generation engine- Python 3.8+
- Tesseract OCR (installed on system)
- GPU (Optional but recommended for MuRIL embeddings)
git clone https://github.com/lokesh-poreddy/NLP_MODEL.git
pip install -r requirements.txtThe entire pipeline can be triggered via the main script:
python NLP_MODEL_FULL_CODES/main_model.pyNote: Ensure your dataset paths are configured in system_config.json.
TaxGuard-TN generates a suite of interactive and static reports located in the NLP_V2/ directory:
- Risk Heatmaps: Identify geographical clusters of high-risk entities.
- Temporal Anomalies: Track revenue shifts and tax credit spikes over a 30-year trend.
- Explainable Audit Registry: A CSV export detailing the "Why" behind every high-risk flag.
Author: Aaditya Gupta
Contact: GitHub Profile
Special thanks to the open-source community for the Transformers, PyG, and pyDatalog libraries that made this hybrid system possible.
© 2026 TaxGuard-TN. Developed for academic and government research purposes.