Prompt Engineering for Tabular Prediction using LLMs

Titanic Survival Prediction — Comparing 5 Prompting Strategies across 2 LLMs


Overview

This project investigates whether advanced prompt engineering techniques can meaningfully improve the performance of Large Language Models (LLMs) on structured, tabular prediction tasks — an increasingly important challenge in real-world AI deployments such as risk assessment, medical triage, and customer churn prediction.

Using the classic Titanic survival dataset, I designed and evaluated 5 distinct prompting strategies across 2 production LLMs (llama-3.1-8b-instant and llama-3.3-70b-versatile), resulting in 10 experimental conditions evaluated on a 500-passenger held-out test set. All predictions were generated via real LLM API calls — no simulations, no shortcuts.

The project goes beyond simply calling an API. It involves careful prompt design, data leakage prevention, structured output parsing with fallback strategies, rate-limit-aware inference, and a systematic hard-case analysis to understand where LLM reasoning breaks down.
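As a minimal sketch of the leakage prevention mentioned above, the test set can be fixed once, before any prompt is built, so that few-shot examples are only ever drawn from passengers outside it. The 500/391 sizes follow from the 891-passenger dataset; the plain random shuffle, the seed, and the helper name `split_test_and_pool` are assumptions for illustration and may differ from what the notebook does.

```python
import random


def split_test_and_pool(passenger_ids, test_size=500, seed=42):
    """Split unique passenger IDs into a held-out test set and a disjoint few-shot pool.

    Hypothetical helper: the shuffle-based split and the seed are assumptions.
    """
    ids = list(passenger_ids)
    if test_size > len(ids):
        raise ValueError("test_size exceeds the number of passengers")
    rng = random.Random(seed)  # local RNG so the split is repeatable
    rng.shuffle(ids)
    test_ids = set(ids[:test_size])
    pool_ids = set(ids[test_size:])
    return test_ids, pool_ids


# The Titanic training file numbers passengers 1..891.
test_ids, pool_ids = split_test_and_pool(range(1, 892))

# Few-shot examples must come from pool_ids only; this check guards against leakage.
assert test_ids.isdisjoint(pool_ids)
print(len(test_ids), len(pool_ids))  # 500 391
```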


Key Highlights

  • Built and evaluated 10 LLM-based experimental pipelines
  • Issued 7,000 real API inference calls
  • Designed robust prompt engineering strategies (CoT, ToT, Self-Consistency)
  • Implemented fault-tolerant parsing & checkpointing
  • Conducted systematic failure analysis on misclassified cases

Results

| Model | Strategy | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| llama-3.3-70b-versatile | Zero-shot | 0.822 | 0.770 | 0.750 | 0.760 |
| llama-3.3-70b-versatile | Self-consistency | 0.818 | 0.805 | 0.681 | 0.738 |
| llama-3.1-8b-instant | Zero-shot | 0.806 | 0.733 | 0.761 | 0.747 |
| llama-3.1-8b-instant | Self-consistency | 0.806 | 0.736 | 0.755 | 0.745 |
| llama-3.3-70b-versatile | CoT | 0.802 | 0.795 | 0.638 | 0.708 |
| llama-3.1-8b-instant | CoT | 0.800 | 0.806 | 0.617 | 0.699 |
| llama-3.3-70b-versatile | Five-shot | 0.788 | 0.781 | 0.606 | 0.683 |
| llama-3.3-70b-versatile | ToT | 0.778 | 0.781 | 0.569 | 0.658 |
| llama-3.1-8b-instant | ToT | 0.652 | 0.532 | 0.612 | 0.569 |
| llama-3.1-8b-instant | Five-shot | 0.632 | 0.506 | 0.846 | 0.633 |

Best single result: llama-3.3-70b-versatile with zero-shot prompting → 82.2% accuracy, F1 = 0.760


What I Built

5 prompting strategies × 2 LLMs = 10 experimental conditions, each run on 500 test passengers:
- 8 single-call conditions (4 strategies × 2 models) → 500 × 8 = 4,000 API calls
- Self-consistency (3 runs × 2 models) → 500 × 3 × 2 = 3,000 API calls

Total: 4,000 + 3,000 = 7,000 real API calls
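The call budget above can be checked with a few lines of arithmetic. This is only a sketch: the strategy keys are hypothetical labels, and the count assumes one call per passenger for every strategy except self-consistency, with no retries.

```python
N_TEST = 500
MODELS = ["llama-3.1-8b-instant", "llama-3.3-70b-versatile"]

# API calls needed per passenger for each strategy (keys are illustrative labels).
CALLS_PER_PASSENGER = {
    "zero_shot": 1,
    "five_shot": 1,
    "cot": 1,
    "tot": 1,
    "self_consistency": 3,  # three independent runs, then a majority vote
}


def total_calls(n_passengers, models, calls_per_passenger):
    """Total API calls = passengers x models x calls summed over strategies."""
    return n_passengers * len(models) * sum(calls_per_passenger.values())


print(total_calls(N_TEST, MODELS, CALLS_PER_PASSENGER))  # 7000
```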

Prompting Strategies Implemented

| Strategy | Description | Key Design Choice |
|---|---|---|
| Zero-shot | Task description + historical context only | Tests raw LLM knowledge |
| Five-shot | 5 demographically diverse labeled examples | Covers all key survival groups |
| Chain-of-Thought | 5 examples + step-by-step reasoning traces | Forces factor-by-factor analysis |
| Self-Consistency | 3 independent runs from different analytical perspectives + majority vote | Reduces variance in predictions |
| Tree-of-Thought | Explicit hypothesis exploration (survived vs. died) with confidence rating | Prevents premature commitment |
Key Engineering Decisions

  • No data leakage: test set strictly separated from few-shot pool before any prompt construction
  • Robust label parsing: 5-tier fallback parser handles JSON, keyword, and digit-based model outputs
  • Checkpoint-based inference: results saved after every condition so long runs are resumable
  • Rate-limit-aware delays: exponential backoff with per-strategy tuned delays
  • Self-consistency diversity: 5 analytical personas (naval historian, sociologist, statistician, maritime expert, Edwardian historian) to maximise vote diversity
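To illustrate the idea behind the fallback label parser, here is a minimal sketch that tries progressively looser interpretations of a model reply. The tier order, the `"prediction"` JSON key, the keyword patterns, and the default label are all assumptions; this is not the notebook's actual 5-tier parser.

```python
import json
import re


def parse_label(text, default=0):
    """Map a raw model reply to 1 (survived) or 0 (died). Hypothetical sketch."""
    text = text.strip()

    # Tiers 1 and 2: the whole reply is JSON, or a JSON object is embedded in prose.
    for candidate in [text] + re.findall(r"\{[^{}]*\}", text):
        try:
            obj = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            value = str(obj.get("prediction")).strip()
            if value in ("0", "1"):
                return int(value)

    # Tier 3: keywords. Negative phrases are checked first so that
    # "did not survive" is not read as "survive".
    lowered = text.lower()
    if re.search(r"\b(died|perished|(did not|didn't|not) survived?)\b", lowered):
        return 0
    if re.search(r"\bsurvived?\b", lowered):
        return 1

    # Tier 4: a standalone 0 or 1; the last one is taken as the final answer.
    digits = re.findall(r"\b[01]\b", text)
    if digits:
        return int(digits[-1])

    # Tier 5: nothing recognisable, so fall back to the default label.
    return default


print(parse_label('Answer: {"prediction": 0} because of class and sex'))  # 0
print(parse_label("Final answer: 1"))                                     # 1
```

The keyword tier scans the whole reply, so a long reasoning trace that mentions both outcomes can be misread; asking the model for a strict JSON answer keeps most replies in the first tier.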

Visual Results

Metrics Comparison (All 10 Conditions): plots/metrics_comparison.png

Model F1 Comparison: plots/model_comparison.png

Confusion Matrices: plots/confusion_matrices.png

Hard Case Analysis: plots/hard_cases_analysis.png


Hard Case Analysis

I identified 15 passengers who were misclassified in 5 or more of the 10 conditions and analyzed the root causes. Three systematic failure modes emerged:

  1. Male survival blind spot — When a passenger is male, 3rd class, and paid a low fare, both models predict death with near certainty regardless of prompting strategy. Male 3rd-class survivors like Albimona, Andersson, and Niskanen were misclassified across all 10 conditions.

  2. Female fatality failure — The strong female-survival prior makes it nearly impossible for the models to predict death for women. The Allison case (a 2-year-old 1st-class girl who died) was misclassified by every strategy on both models.

  3. Missing feature collapse — When age is absent, the models lose one of their few potential override signals. Combined with male sex and 3rd-class travel, this made correct prediction essentially unreachable.

Key insight: Neither chain-of-thought nor tree-of-thought prompting was sufficient to override these strong statistical priors. This suggests a ceiling on prompting-only approaches for this kind of tabular prediction; fine-tuning is likely required to break through it.
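The hard-case selection described above (passengers misclassified in 5 or more conditions) can be sketched as a count of wrong predictions per row. The column names `PassengerId` and `true_label` match the reproduction snippet in this README; the condition column names and the toy data are made up, and the helper `find_hard_cases` is hypothetical.

```python
import csv
import io


def find_hard_cases(rows, min_errors=5, id_col="PassengerId", label_col="true_label"):
    """Return {passenger_id: number_of_wrong_conditions} for rows at or above min_errors.

    Every column other than id_col and label_col is treated as one
    condition's prediction. Values are compared as strings, as read by csv.
    """
    hard = {}
    for row in rows:
        truth = row[label_col]
        errors = sum(
            1
            for col, pred in row.items()
            if col not in (id_col, label_col) and pred != truth
        )
        if errors >= min_errors:
            hard[row[id_col]] = errors
    return hard


# Toy file with three made-up conditions; the real file has ten.
toy_csv = """PassengerId,true_label,cond_a,cond_b,cond_c
1,1,0,0,0
2,0,0,0,1
3,1,1,1,1
"""
rows = list(csv.DictReader(io.StringIO(toy_csv)))
print(find_hard_cases(rows, min_errors=2))  # {'1': 3}
```

On the real predictions file, `rows` would come from `csv.DictReader(open('results/titanic_predictions_real.csv'))` with the default `min_errors=5`, assuming labels and predictions are stored in the same textual form.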


Project Structure

.
├── notebooks/
│   └── Tabular-prediction.ipynb   # Complete experiment pipeline
├── plots/
│   ├── metrics_comparison.png     # All 4 metrics across 10 conditions
│   ├── model_comparison.png       # F1 comparison between models
│   ├── confusion_matrices.png     # Confusion matrices for all conditions
│   └── hard_cases_analysis.png    # Error distribution + top misclassified
├── report/
│   ├── Tabular_Prediction.docx    # Full written report
│   └── Instruction(1).pdf         # Original assignment specification
├── results/
│   ├── titanic_predictions_real.csv  # Raw predictions for all 10 conditions
│   ├── titanic_metrics.csv           # Computed metrics table
│   └── hard_cases.csv                # Misclassified passenger analysis
├── datasets/
│   └── Titanic-Dataset.csv        # Source dataset (891 passengers)
├── .env.example                   # API key template
├── .gitignore
└── requirements.txt

Getting Started

Reproducibility Note

Due to the stochastic nature of LLM outputs and API-based inference, results may vary slightly across runs. However, overall trends and relative performance across prompting strategies remain consistent.

Prerequisites

  • Python 3.10+
  • A free Groq API key (used for llama-3.1-8b-instant and llama-3.3-70b-versatile)

Installation

# 1. Clone the repository
git clone https://github.com/Umair-IITD/Tabular_Prediction_Using_LLM.git
cd Tabular_Prediction_Using_LLM

# 2. Install dependencies
pip install -r requirements.txt

# 3. Set up your API key
cp .env.example .env
# Edit .env and add your Groq API key:
# GROQ_API_KEY=your_key_here

Running the Experiment

# Open the notebook
jupyter notebook notebooks/Tabular-prediction.ipynb

# Run all cells in order (Step 1 through Step 11)
# The experiment auto-saves checkpoints after each condition
# so it is safe to interrupt and resume

Time estimate: The full run takes approximately 2–4 hours depending on API latency. Self-consistency is the most expensive strategy (3 runs per passenger per model).
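Because a full run takes hours, resumability and rate-limit handling matter in practice. The sketch below shows one way the two ideas could fit together: a retry wrapper with exponential backoff, and a per-condition checkpoint that is reused if it already exists. `RateLimited`, `call_with_backoff`, `run_condition`, the JSON checkpoint format, and the delay values are all hypothetical; in a real run the caught exception would be whatever rate-limit error the API client raises.

```python
import json
import random
import time
from pathlib import Path


class RateLimited(Exception):
    """Hypothetical stand-in for the API client's rate-limit error."""


def call_with_backoff(fn, max_retries=5, base_delay=1.0):
    """Call fn(), retrying on RateLimited with delays of base_delay * 1, 2, 4, ...

    max_retries must be at least 1; the last failure is re-raised.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimited:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * 2 ** attempt
            # Up to 10% jitter so parallel runs do not retry in lockstep.
            time.sleep(delay + random.uniform(0, delay * 0.1))


def run_condition(name, passenger_ids, predict, checkpoint_dir, base_delay=1.0):
    """Run one experimental condition, reusing its checkpoint file if present."""
    directory = Path(checkpoint_dir)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{name}.json"
    if path.exists():
        # Condition already finished in an earlier session: skip all API calls.
        return json.loads(path.read_text())

    predictions = {}
    for pid in passenger_ids:
        predictions[str(pid)] = call_with_backoff(
            lambda: predict(pid), base_delay=base_delay
        )
    # Save only after the whole condition completes.
    path.write_text(json.dumps(predictions))
    return predictions
```

Here `predict` is any function that takes a passenger ID and returns a 0/1 label, for example a wrapper around one prompt strategy and one model. Checkpointing per condition keeps the logic simple, at the cost of repeating a condition that is interrupted midway.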

Reproducing Specific Results

If you only want to reproduce the metrics and charts without re-running inference:

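# In any Python environment with the results/ folder present:
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

results = pd.read_csv('results/titanic_predictions_real.csv')
# Every column other than the ID and the ground truth holds one condition's predictions
pred_cols = [c for c in results.columns if c not in ['PassengerId', 'true_label']]
y_true = results['true_label']

for col in pred_cols:
    acc  = accuracy_score(y_true, results[col])
    prec = precision_score(y_true, results[col], zero_division=0)
    rec  = recall_score(y_true, results[col], zero_division=0)
    f1   = f1_score(y_true, results[col], zero_division=0)
    print(f'{col:<50} Acc: {acc:.3f}  Prec: {prec:.3f}  Rec: {rec:.3f}  F1: {f1:.3f}')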

Technical Stack

| Component | Technology |
|---|---|
| LLM Inference | Groq API |
| Models | llama-3.1-8b-instant, llama-3.3-70b-versatile |
| Data Processing | pandas, numpy |
| Evaluation Metrics | scikit-learn |
| Visualisation | matplotlib, seaborn |
| Progress Tracking | tqdm |
| Environment Management | python-dotenv |
| Notebook | Jupyter |

Requirements

groq
pandas
numpy
scikit-learn
matplotlib
seaborn
tqdm
python-dotenv
jupyter

Install all with:

pip install -r requirements.txt

Findings Summary

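  1. Larger models are more robust to complex prompting. llama-3.3-70b stayed within 4.4 accuracy points of its zero-shot baseline under every strategy (0.778 to 0.822), and self-consistency and CoT raised its precision (0.805 and 0.795 vs. 0.770) at the cost of recall. llama-3.1-8b degraded sharply under five-shot (0.632) and ToT (0.652), suggesting a capacity threshold for complex prompt structures.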

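  2. Self-consistency is the most reliable advanced strategy. At the cost of 3x the API calls, it came closest to zero-shot on both models (0.818 vs. 0.822 accuracy for the 70B model, 0.806 vs. 0.806 for the 8B model) without the instability seen in five-shot or ToT, although it did not beat zero-shot on accuracy or F1.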

  3. Zero-shot is surprisingly competitive. For well-represented domains (like the Titanic, which LLMs have seen extensively in pre-training), zero-shot with contextual hints matched or exceeded every more complex strategy on accuracy and F1 for both models.

  4. Prompting appears to hit a ceiling on this tabular task. The hard case analysis shows that the models' priors on sex and class were not overridden by any of the five prompting strategies when a passenger's features aligned with the dominant survival pattern. Fine-tuning or retrieval-augmented approaches would likely be needed to push accuracy significantly higher.


Author

Md Umair Alam
B.Tech, IIT Delhi


License

MIT License — see LICENSE for details.

