NL-to-SQL Model

This project implements a Natural Language to SQL (NL-to-SQL) model from scratch, without external APIs. It translates simple natural language queries (e.g., "List customers older than 30") into SQL queries (e.g., SELECT * FROM customers WHERE age > 30;) for a predefined database schema. The model is trained in Python using PyTorch and deployed for inference in C++ using tiny-dnn. A SQLite executor validates generated queries.

Features

Input: Simple natural language queries.
Output: SQL SELECT queries with WHERE clauses.
Training: Seq2seq LSTM model in Python (PyTorch).
Inference: Lightweight C++ inference engine (tiny-dnn).
Dataset: Synthetic dataset for a customers table; extensible to Spider/WikiSQL.
Execution: SQLite integration for query validation.

Directory Structure

nl2sql/
├── data/
│   ├── dataset.json       # Synthetic NL-SQL dataset
│   ├── schema.json        # Database schema
├── python/
│   ├── train.py           # Training script (PyTorch)
│   ├── tokenizer.py       # Tokenization logic
│   ├── evaluate.py        # Evaluation and SQLite executor
├── cpp/
│   ├── inference.cpp      # C++ inference engine
│   ├── schema_loader.cpp  # Schema parser
│   ├── executor.cpp       # SQLite query executor
├── requirements.txt       # Python dependencies
├── README.md             # This file

Prerequisites

Python 3.8+:
- Install dependencies: pip install -r requirements.txt
- Required packages: torch>=1.9.0, nltk>=3.6.0
C++:
- Compiler: g++ with C++17 support
- Libraries:
  - tiny-dnn: GitHub
  - nlohmann/json: GitHub
  - sqlite3: Install libsqlite3-dev (Ubuntu) or equivalent
SQLite: For query execution (test.db).
Hardware: CPU for training/inference; GPU recommended for faster training.

Setup

Clone Repository:
```
git clone <repository-url>
cd nl2sql
```

Install Python Dependencies:

pip install -r requirements.txt
python -c "import nltk; nltk.download('punkt')"

Install C++ Dependencies:
- Install tiny-dnn and nlohmann/json (follow their GitHub instructions).
- Install SQLite: sudo apt-get install libsqlite3-dev (Ubuntu).
- Ensure g++ supports C++17.

Create SQLite Database:

Create test.db with the customers table:

sqlite3 test.db

CREATE TABLE customers (name TEXT, age INTEGER, city TEXT);
INSERT INTO customers VALUES ('Alice', 35, 'London'), ('Bob', 25, 'New York'), ('Charlie', 40, 'London');
.exit

Usage

1. Train the Model

Train the seq2seq model using the synthetic dataset:

cd python
python train.py

Outputs: nl2sql_model.pth (model weights), vocab.json (vocabulary, requires manual export).
Training takes ~5-10 minutes on a CPU for the small dataset (50 epochs).

2. Export Model Weights

Convert PyTorch weights to tiny-dnn format (nl2sql_model.bin). This requires a custom script (not provided) to map PyTorch tensors to tiny-dnn's binary format. Alternatively, reimplement the LSTM inference logic in C++.

3. Run Inference

Compile and run the C++ inference engine:

cd ../cpp
g++ -std=c++17 inference.cpp schema_loader.cpp -o inference -ltiny_dnn -lsqlite3 -I /path/to/nlohmann
./inference

Example output:

NL Query: List customers from London
SQL Query: SELECT * FROM customers WHERE city = 'London';

4. Execute SQL Queries

Test generated SQL queries on test.db:

g++ -std=c++17 executor.cpp -o executor -lsqlite3
./executor

Example output:
```
Alice  35  London
Charlie  40  London
```

5. Evaluate Model

Run evaluation in Python to generate and test SQL queries:

cd ../python
python evaluate.py

Example output:

NL Query: List customers from London
SQL Query: SELECT * FROM customers WHERE city = 'London';
Results: [('Alice', 35, 'London'), ('Charlie', 40, 'London')]

Dataset

Synthetic Dataset (data/dataset.json): 5 NL-SQL pairs for the customers table.
Schema (data/schema.json): Defines customers(name, age, city).
Scaling Up: Replace with Spider (GitHub) or WikiSQL for complex queries. Update tokenizer.py to preprocess larger datasets.

Limitations

Dataset: The synthetic dataset is small and supports only simple queries.
Model: Basic LSTM-based seq2seq; lacks attention or transformer capabilities.
Schema: Hardcoded to customers table. Dynamic schema support requires enhancements.
Inference: tiny-dnn integration assumes manual weight conversion from PyTorch.
Evaluation: Lacks BLEU score or exact match metrics (add via nltk.translate.bleu_score).

Extending the Project

Larger Dataset: Use Spider/WikiSQL for diverse schemas and queries.
Advanced Model: Add attention mechanisms or switch to a transformer architecture.
Dynamic Schema: Modify schema_loader.cpp to condition on table/column metadata.
Error Handling: Validate SQL syntax before execution.
Performance: Optimize C++ inference with Eigen or custom LSTM implementation.

Troubleshooting

Training Fails: Ensure dataset.json and schema.json are valid. Check GPU availability (torch.cuda.is_available()).
Inference Errors: Verify nl2sql_model.bin and vocab.json exist and match the model architecture.
SQLite Issues: Ensure test.db has the correct schema and data.

License

MIT License. See LICENSE file (create one if needed).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

NL-to-SQL Model

Features

Directory Structure

Prerequisites

Setup

Usage

1. Train the Model

2. Export Model Weights

3. Run Inference

4. Execute SQL Queries

5. Evaluate Model

Dataset

Limitations

Extending the Project

Troubleshooting

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
cpp		cpp
data		data
python		python
README.md		README.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

NL-to-SQL Model

Features

Directory Structure

Prerequisites

Setup

Usage

1. Train the Model

2. Export Model Weights

3. Run Inference

4. Execute SQL Queries

5. Evaluate Model

Dataset

Limitations

Extending the Project

Troubleshooting

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages