Skip to content

AshrafGalibShaik/NL-to-SQL-Model

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

NL to SQL

NL-to-SQL Model

This project implements a Natural Language to SQL (NL-to-SQL) model from scratch, without external APIs. It translates simple natural language queries (e.g., "List customers older than 30") into SQL queries (e.g., SELECT * FROM customers WHERE age > 30;) for a predefined database schema. The model is trained in Python using PyTorch and deployed for inference in C++ using tiny-dnn. A SQLite executor validates generated queries.

Features

  • Input: Simple natural language queries.
  • Output: SQL SELECT queries with WHERE clauses.
  • Training: Seq2seq LSTM model in Python (PyTorch).
  • Inference: Lightweight C++ inference engine (tiny-dnn).
  • Dataset: Synthetic dataset for a customers table; extensible to Spider/WikiSQL.
  • Execution: SQLite integration for query validation.

Directory Structure

nl2sql/
├── data/
│   ├── dataset.json       # Synthetic NL-SQL dataset
│   ├── schema.json        # Database schema
├── python/
│   ├── train.py           # Training script (PyTorch)
│   ├── tokenizer.py       # Tokenization logic
│   ├── evaluate.py        # Evaluation and SQLite executor
├── cpp/
│   ├── inference.cpp      # C++ inference engine
│   ├── schema_loader.cpp  # Schema parser
│   ├── executor.cpp       # SQLite query executor
├── requirements.txt       # Python dependencies
├── README.md             # This file

Prerequisites

  • Python 3.8+:
    • Install dependencies: pip install -r requirements.txt
    • Required packages: torch>=1.9.0, nltk>=3.6.0
  • C++:
    • Compiler: g++ with C++17 support
    • Libraries:
      • tiny-dnn: GitHub
      • nlohmann/json: GitHub
      • sqlite3: Install libsqlite3-dev (Ubuntu) or equivalent
  • SQLite: For query execution (test.db).
  • Hardware: CPU for training/inference; GPU recommended for faster training.

Setup

  1. Clone Repository:

    git clone <repository-url>
    cd nl2sql
  2. Install Python Dependencies:

    pip install -r requirements.txt
    python -c "import nltk; nltk.download('punkt')"
  3. Install C++ Dependencies:

    • Install tiny-dnn and nlohmann/json (follow their GitHub instructions).
    • Install SQLite: sudo apt-get install libsqlite3-dev (Ubuntu).
    • Ensure g++ supports C++17.
  4. Create SQLite Database:

    • Create test.db with the customers table:
      sqlite3 test.db
      CREATE TABLE customers (name TEXT, age INTEGER, city TEXT);
      INSERT INTO customers VALUES ('Alice', 35, 'London'), ('Bob', 25, 'New York'), ('Charlie', 40, 'London');
      .exit

Usage

1. Train the Model

Train the seq2seq model using the synthetic dataset:

cd python
python train.py
  • Outputs: nl2sql_model.pth (model weights), vocab.json (vocabulary, requires manual export).
  • Training takes ~5-10 minutes on a CPU for the small dataset (50 epochs).

2. Export Model Weights

Convert PyTorch weights to tiny-dnn format (nl2sql_model.bin). This requires a custom script (not provided) to map PyTorch tensors to tiny-dnn's binary format. Alternatively, reimplement the LSTM inference logic in C++.

3. Run Inference

Compile and run the C++ inference engine:

cd ../cpp
g++ -std=c++17 inference.cpp schema_loader.cpp -o inference -ltiny_dnn -lsqlite3 -I /path/to/nlohmann
./inference
  • Example output:
    NL Query: List customers from London
    SQL Query: SELECT * FROM customers WHERE city = 'London';
    

4. Execute SQL Queries

Test generated SQL queries on test.db:

g++ -std=c++17 executor.cpp -o executor -lsqlite3
./executor
  • Example output:
    Alice  35  London
    Charlie  40  London
    

5. Evaluate Model

Run evaluation in Python to generate and test SQL queries:

cd ../python
python evaluate.py
  • Example output:
    NL Query: List customers from London
    SQL Query: SELECT * FROM customers WHERE city = 'London';
    Results: [('Alice', 35, 'London'), ('Charlie', 40, 'London')]
    

Dataset

  • Synthetic Dataset (data/dataset.json): 5 NL-SQL pairs for the customers table.
  • Schema (data/schema.json): Defines customers(name, age, city).
  • Scaling Up: Replace with Spider (GitHub) or WikiSQL for complex queries. Update tokenizer.py to preprocess larger datasets.

Limitations

  • Dataset: The synthetic dataset is small and supports only simple queries.
  • Model: Basic LSTM-based seq2seq; lacks attention or transformer capabilities.
  • Schema: Hardcoded to customers table. Dynamic schema support requires enhancements.
  • Inference: tiny-dnn integration assumes manual weight conversion from PyTorch.
  • Evaluation: Lacks BLEU score or exact match metrics (add via nltk.translate.bleu_score).

Extending the Project

  1. Larger Dataset: Use Spider/WikiSQL for diverse schemas and queries.
  2. Advanced Model: Add attention mechanisms or switch to a transformer architecture.
  3. Dynamic Schema: Modify schema_loader.cpp to condition on table/column metadata.
  4. Error Handling: Validate SQL syntax before execution.
  5. Performance: Optimize C++ inference with Eigen or custom LSTM implementation.

Troubleshooting

  • Training Fails: Ensure dataset.json and schema.json are valid. Check GPU availability (torch.cuda.is_available()).
  • Inference Errors: Verify nl2sql_model.bin and vocab.json exist and match the model architecture.
  • SQLite Issues: Ensure test.db has the correct schema and data.

License

MIT License. See LICENSE file (create one if needed).

About

This project implements a **Natural Language to SQL (NL-to-SQL)** model from scratch, without external APIs. It translates simple natural language queries (e.g., "List customers older than 30") into SQL queries (e.g., `SELECT * FROM customers WHERE age > 30;`) for a predefined database schema.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors