This project implements a Natural Language to SQL (NL-to-SQL) model from scratch, without external APIs. It translates simple natural language queries (e.g., "List customers older than 30") into SQL queries (e.g., SELECT * FROM customers WHERE age > 30;) for a predefined database schema. The model is trained in Python using PyTorch and deployed for inference in C++ using tiny-dnn. A SQLite executor validates generated queries.
- Input: Simple natural language queries.
- Output: SQL
SELECTqueries withWHEREclauses. - Training: Seq2seq LSTM model in Python (PyTorch).
- Inference: Lightweight C++ inference engine (
tiny-dnn). - Dataset: Synthetic dataset for a
customerstable; extensible to Spider/WikiSQL. - Execution: SQLite integration for query validation.
nl2sql/
├── data/
│ ├── dataset.json # Synthetic NL-SQL dataset
│ ├── schema.json # Database schema
├── python/
│ ├── train.py # Training script (PyTorch)
│ ├── tokenizer.py # Tokenization logic
│ ├── evaluate.py # Evaluation and SQLite executor
├── cpp/
│ ├── inference.cpp # C++ inference engine
│ ├── schema_loader.cpp # Schema parser
│ ├── executor.cpp # SQLite query executor
├── requirements.txt # Python dependencies
├── README.md # This file
- Python 3.8+:
- Install dependencies:
pip install -r requirements.txt - Required packages:
torch>=1.9.0,nltk>=3.6.0
- Install dependencies:
- C++:
- SQLite: For query execution (
test.db). - Hardware: CPU for training/inference; GPU recommended for faster training.
-
Clone Repository:
git clone <repository-url> cd nl2sql
-
Install Python Dependencies:
pip install -r requirements.txt python -c "import nltk; nltk.download('punkt')" -
Install C++ Dependencies:
- Install
tiny-dnnandnlohmann/json(follow their GitHub instructions). - Install SQLite:
sudo apt-get install libsqlite3-dev(Ubuntu). - Ensure
g++supports C++17.
- Install
-
Create SQLite Database:
- Create
test.dbwith thecustomerstable:sqlite3 test.db
CREATE TABLE customers (name TEXT, age INTEGER, city TEXT); INSERT INTO customers VALUES ('Alice', 35, 'London'), ('Bob', 25, 'New York'), ('Charlie', 40, 'London'); .exit
- Create
Train the seq2seq model using the synthetic dataset:
cd python
python train.py- Outputs:
nl2sql_model.pth(model weights),vocab.json(vocabulary, requires manual export). - Training takes ~5-10 minutes on a CPU for the small dataset (50 epochs).
Convert PyTorch weights to tiny-dnn format (nl2sql_model.bin). This requires a custom script (not provided) to map PyTorch tensors to tiny-dnn's binary format. Alternatively, reimplement the LSTM inference logic in C++.
Compile and run the C++ inference engine:
cd ../cpp
g++ -std=c++17 inference.cpp schema_loader.cpp -o inference -ltiny_dnn -lsqlite3 -I /path/to/nlohmann
./inference- Example output:
NL Query: List customers from London SQL Query: SELECT * FROM customers WHERE city = 'London';
Test generated SQL queries on test.db:
g++ -std=c++17 executor.cpp -o executor -lsqlite3
./executor- Example output:
Alice 35 London Charlie 40 London
Run evaluation in Python to generate and test SQL queries:
cd ../python
python evaluate.py- Example output:
NL Query: List customers from London SQL Query: SELECT * FROM customers WHERE city = 'London'; Results: [('Alice', 35, 'London'), ('Charlie', 40, 'London')]
- Synthetic Dataset (
data/dataset.json): 5 NL-SQL pairs for thecustomerstable. - Schema (
data/schema.json): Definescustomers(name, age, city). - Scaling Up: Replace with Spider (GitHub) or WikiSQL for complex queries. Update
tokenizer.pyto preprocess larger datasets.
- Dataset: The synthetic dataset is small and supports only simple queries.
- Model: Basic LSTM-based seq2seq; lacks attention or transformer capabilities.
- Schema: Hardcoded to
customerstable. Dynamic schema support requires enhancements. - Inference:
tiny-dnnintegration assumes manual weight conversion from PyTorch. - Evaluation: Lacks BLEU score or exact match metrics (add via
nltk.translate.bleu_score).
- Larger Dataset: Use Spider/WikiSQL for diverse schemas and queries.
- Advanced Model: Add attention mechanisms or switch to a transformer architecture.
- Dynamic Schema: Modify
schema_loader.cppto condition on table/column metadata. - Error Handling: Validate SQL syntax before execution.
- Performance: Optimize C++ inference with Eigen or custom LSTM implementation.
- Training Fails: Ensure
dataset.jsonandschema.jsonare valid. Check GPU availability (torch.cuda.is_available()). - Inference Errors: Verify
nl2sql_model.binandvocab.jsonexist and match the model architecture. - SQLite Issues: Ensure
test.dbhas the correct schema and data.
MIT License. See LICENSE file (create one if needed).