Prompt Engineering for Tabular Prediction using LLMs

Titanic Survival Prediction — Comparing 5 Prompting Strategies across 2 LLMs

Overview

This project investigates whether advanced prompt engineering techniques can meaningfully improve the performance of Large Language Models (LLMs) on structured, tabular prediction tasks — an increasingly important challenge in real-world AI deployments such as risk assessment, medical triage, and customer churn prediction.

Using the classic Titanic survival dataset, I designed and evaluated 5 distinct prompting strategies across 2 production LLMs (llama-3.1-8b-instant and llama-3.3-70b-versatile), resulting in 10 experimental conditions evaluated on a 500-passenger held-out test set. All predictions were generated via real LLM API calls — no simulations, no shortcuts.

The project goes beyond simply calling an API. It involves careful prompt design, data leakage prevention, structured output parsing with fallback strategies, rate-limit-aware inference, and a systematic hard-case analysis to understand where LLM reasoning breaks down.

Key Highlights

Built and evaluated 10 LLM-based experimental pipelines
Processed 7,000+ real API inference calls
Designed robust prompt engineering strategies (CoT, ToT, Self-Consistency)
Implemented fault-tolerant parsing & checkpointing
Conducted systematic failure analysis on misclassified cases

Results

Model	Strategy	Accuracy	Precision	Recall	F1
llama-3.3-70b-versatile	Zero-shot	0.822	0.770	0.750	0.760
llama-3.3-70b-versatile	Self-consistency	0.818	0.805	0.681	0.738
llama-3.1-8b-instant	Zero-shot	0.806	0.733	0.761	0.747
llama-3.1-8b-instant	Self-consistency	0.806	0.736	0.755	0.745
llama-3.3-70b-versatile	CoT	0.802	0.795	0.638	0.708
llama-3.1-8b-instant	CoT	0.800	0.806	0.617	0.699
llama-3.3-70b-versatile	Five-shot	0.788	0.781	0.606	0.683
llama-3.3-70b-versatile	ToT	0.778	0.781	0.569	0.658
llama-3.1-8b-instant	ToT	0.652	0.532	0.612	0.569
llama-3.1-8b-instant	Five-shot	0.632	0.506	0.846	0.633

Best single result: llama-3.3-70b-versatile with zero-shot prompting → 82.2% accuracy, F1 = 0.760
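As a quick sanity check on the table, F1 is the harmonic mean of precision and recall. The sketch below recomputes it for the best row from the rounded table values. The helper name `f1_from` is illustrative and is not taken from the project notebook.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values from the llama-3.3-70b-versatile zero-shot row of the table.
best_f1 = f1_from(0.770, 0.750)
print(f"{best_f1:.3f}")  # prints 0.760
```

Because the inputs are already rounded to three decimals, recomputed values for other rows may differ from the table in the last digit.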

What I Built

5 prompting strategies × 2 LLMs = 10 experimental conditions, each run on the same 500 test passengers:
- 8 single-run conditions: 500 × 8 = 4,000 API calls
- 2 self-consistency conditions (3 runs each): 500 × 3 × 2 = 3,000 API calls
Total: 7,000 real API calls

Prompting Strategies Implemented

Strategy	Description	Key Design Choice
Zero-shot	Task description + historical context only	Tests raw LLM knowledge
Five-shot	5 demographically diverse labeled examples	Covers all key survival groups
Chain-of-Thought	5 examples + step-by-step reasoning traces	Forces factor-by-factor analysis
Self-Consistency	3 independent runs from different analytical perspectives + majority vote	Reduces variance in predictions
Tree-of-Thought	Explicit hypothesis exploration (survived vs. died) with confidence rating	Prevents premature commitment
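The majority vote used by the self-consistency strategy can be sketched as follows. This is a minimal illustration, assuming each run has already been parsed to a 0/1 label; the function name and the tie rule (ties fall back to 0) are assumptions, and with 3 runs per passenger a tie cannot occur.

```python
def majority_vote(votes: list) -> int:
    """Return 1 if strictly more than half of the 0/1 votes are 1, else 0."""
    if not votes:
        raise ValueError("at least one vote is required")
    # Assumption: an even split (possible only with an even number of runs)
    # resolves to 0, i.e. "did not survive".
    return int(sum(votes) * 2 > len(votes))


print(majority_vote([1, 0, 1]))  # prints 1
print(majority_vote([0, 0, 1]))  # prints 0
```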

Key Engineering Decisions

No data leakage: test set strictly separated from few-shot pool before any prompt construction
Robust label parsing: 5-tier fallback parser handles JSON, keyword, and digit-based model outputs
Checkpoint-based inference: results saved after every condition so long runs are resumable
Rate-limit-aware delays: exponential backoff with per-strategy tuned delays
Self-consistency diversity: 5 analytical personas (naval historian, sociologist, statistician, maritime expert, Edwardian historian) to maximise vote diversity
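To illustrate the fallback idea behind the label parser, here is a reduced sketch with three tiers (JSON, keyword, digit). It is not the notebook's 5-tier parser: the JSON key `prediction`, the keyword lists, and the function name `parse_label` are all assumptions made for this example.

```python
import json
import re
from typing import Optional


def parse_label(text: str) -> Optional[int]:
    """Extract a 0/1 survival label from raw model output, or None."""
    # Tier 1: a JSON object such as {"prediction": 1} anywhere in the text.
    match = re.search(r"\{.*?\}", text, flags=re.DOTALL)
    if match:
        try:
            value = str(json.loads(match.group(0)).get("prediction")).strip()
            if value in ("0", "1"):
                return int(value)
        except json.JSONDecodeError:
            pass  # malformed JSON: fall through to the next tier

    # Tier 2: keywords. Negative phrases are checked first because
    # "did not survive" also contains "survive".
    lowered = text.lower()
    if re.search(r"\b(did not survive|died|perished)\b", lowered):
        return 0
    if re.search(r"\bsurvived?\b", lowered):
        return 1

    # Tier 3: the last standalone 0 or 1 in the text.
    digits = re.findall(r"\b[01]\b", text)
    if digits:
        return int(digits[-1])
    return None  # caller decides how to treat unparseable output


print(parse_label('{"prediction": 1}'))   # prints 1
print(parse_label("Final answer: 0"))     # prints 0
```

Returning `None` rather than guessing keeps unparseable responses visible, so they can be retried or counted separately.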

Visual Results

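Metrics Comparison, All 10 Conditions (plots/metrics_comparison.png)
Model F1 Comparison (plots/model_comparison.png)
Confusion Matrices (plots/confusion_matrices.png)
Hard Case Analysis (plots/hard_cases_analysis.png)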

Hard Case Analysis

I identified 15 passengers misclassified across 5+ conditions and analyzed the root causes. Three systematic failure modes emerged:

Male survival blind spot — When a passenger is male, 3rd class, and paid a low fare, both models predict death with near certainty regardless of prompting strategy. Male 3rd-class survivors like Albimona, Andersson, and Niskanen were misclassified across all 10 conditions.
Female fatality failure — The strong female-survival prior makes it nearly impossible for the models to predict death for women. The Allison case (a 2-year-old 1st-class girl who died) was misclassified by every strategy on both models.
Missing feature collapse — When age is absent, the models lose one of their few potential override signals. Combined with male sex and 3rd-class travel, this made correct prediction essentially unreachable.

Key insight: Neither chain-of-thought nor tree-of-thought prompting was sufficient to override strong statistical priors. This suggests a ceiling on prompting-only approaches for tabular prediction; fine-tuning is likely required to break through it.
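The selection of hard cases (passengers misclassified in 5 or more conditions) can be sketched as below. This assumes rows shaped like results/titanic_predictions_real.csv, with a `PassengerId` column, a `true_label` column, and one prediction column per condition, and assumes every cell holds a parsed 0/1 label. The function names and the toy rows are illustrative, not project data.

```python
def count_errors(row: dict, id_col: str = "PassengerId",
                 label_col: str = "true_label") -> int:
    """Number of prediction columns that disagree with the true label."""
    truth = int(row[label_col])
    return sum(
        int(pred) != truth
        for col, pred in row.items()
        if col not in (id_col, label_col)
    )


def hard_cases(rows, min_errors: int = 5):
    """Return (PassengerId, error_count) pairs, most-misclassified first."""
    flagged = [(row["PassengerId"], count_errors(row)) for row in rows]
    kept = [item for item in flagged if item[1] >= min_errors]
    return sorted(kept, key=lambda item: -item[1])


# Toy rows with three made-up prediction columns (not real project results).
toy_rows = [
    {"PassengerId": 1, "true_label": 1, "cond_a": 0, "cond_b": 0, "cond_c": 0},
    {"PassengerId": 2, "true_label": 0, "cond_a": 0, "cond_b": 1, "cond_c": 0},
]
print(hard_cases(toy_rows, min_errors=2))  # prints [(1, 3)]
```

The same functions work on rows read with `csv.DictReader`, since the labels are converted with `int()` before comparison.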

Project Structure

.
├── notebooks/
│   └── Tabular-prediction.ipynb   # Complete experiment pipeline
├── plots/
│   ├── metrics_comparison.png    # All 4 metrics across 10 conditions
│   ├── model_comparison.png       # F1 comparison between models
│   ├── confusion_matrices.png  # Confusion matrices for all conditions
│   └── hard_cases_analysis.png    # Error distribution + top misclassified
├── report/
│   ├── Tabular_Prediction.docx    # Full written report
│   └── Instruction(1).pdf         # Original assignment specification
├── results/
│   ├── titanic_predictions_real.csv  # Raw predictions for all 10 conditions
│   ├── titanic_metrics.csv           # Computed metrics table
│   └── hard_cases.csv                # Misclassified passenger analysis
├── datasets/
│   └── Titanic-Dataset.csv        # Source dataset (891 passengers)
├── .env.example                   # API key template
├── .gitignore
└── requirements.txt

Getting Started

Reproducibility Note

Due to the stochastic nature of LLM outputs and API-based inference, results may vary slightly across runs. However, overall trends and relative performance across prompting strategies remain consistent.

Prerequisites

Python 3.10+
A free Groq API key (used for llama-3.1-8b-instant and llama-3.3-70b-versatile)

Installation

# 1. Clone the repository
git clone https://github.com/Umair-IITD/Tabular_Prediction_Using_LLM.git
cd Tabular_Prediction_Using_LLM

# 2. Install dependencies
pip install -r requirements.txt

# 3. Set up your API key
cp .env.example .env
# Edit .env and add your Groq API key:
# GROQ_API_KEY=your_key_here

Running the Experiment

# Open the notebook
jupyter notebook notebooks/Tabular-prediction.ipynb

# Run all cells in order (Step 1 through Step 11)
# The experiment auto-saves checkpoints after each condition
# so it is safe to interrupt and resume

Time estimate: The full run takes approximately 2–4 hours depending on API latency. Self-consistency is the most expensive strategy (3 runs per passenger per model).
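The resume behaviour and rate-limit handling mentioned in this README could look roughly like the sketch below. It is a simplified illustration, not the notebook's code: the function names, the JSON checkpoint layout, the `checkpoints` directory, and the delay values are assumptions, and a real run would catch the API client's rate-limit error rather than every exception.

```python
import json
import time
from pathlib import Path


def call_with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); after a failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:  # assumption: any error is treated as retryable
            if attempt == max_retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)


def run_condition(name, passenger_ids, predict, checkpoint_dir="checkpoints"):
    """Run one condition, or load its saved predictions if they exist."""
    path = Path(checkpoint_dir) / f"{name}.json"
    if path.exists():
        # A finished condition is never re-run, so interrupted
        # experiments resume at the first missing checkpoint.
        return json.loads(path.read_text())
    preds = {
        str(pid): call_with_backoff(lambda pid=pid: predict(pid))
        for pid in passenger_ids
    }
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(preds))
    return preds
```

Here `predict` stands in for whatever function sends one passenger's prompt to the model and returns a 0/1 label. Saving once per condition matches the granularity described above; saving more often would trade extra disk writes for less lost work on interruption.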

Reproducing Specific Results

If you only want to reproduce the metrics and charts without re-running inference:

# In any Python environment with the results/ folder present:
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

results = pd.read_csv('results/titanic_predictions_real.csv')
pred_cols = [c for c in results.columns if c not in ['PassengerId', 'true_label']]

for col in pred_cols:
    acc = accuracy_score(results['true_label'], results[col])
    f1  = f1_score(results['true_label'], results[col], zero_division=0)
    print(f'{col:<50} Acc: {acc:.3f}  F1: {f1:.3f}')

Technical Stack

Component	Technology
LLM Inference	Groq API
Models	llama-3.1-8b-instant, llama-3.3-70b-versatile
Data Processing	pandas, numpy
Evaluation Metrics	scikit-learn
Visualisation	matplotlib, seaborn
Progress Tracking	tqdm
Environment Management	python-dotenv
Notebook	Jupyter

Requirements

groq
pandas
numpy
scikit-learn
matplotlib
seaborn
tqdm
python-dotenv
jupyter

Install all with:

pip install -r requirements.txt

Findings Summary

Larger models are more robust to complex prompting. llama-3.3-70b stayed between 0.778 and 0.822 accuracy across all five strategies, although none of the advanced strategies beat its zero-shot baseline. llama-3.1-8b held up under CoT and self-consistency but degraded sharply with five-shot (0.632) and ToT (0.652), suggesting a capacity threshold for complex prompt structures.
Self-consistency is the most reliable advanced strategy. At the cost of 3x the API calls, it came closest to zero-shot on both models (accuracy 0.818 vs. 0.822 for the 70B model, 0.806 vs. 0.806 for the 8B model) and gave the 70B model its highest precision (0.805), without the instability seen in five-shot or ToT.
Zero-shot is surprisingly competitive. For well-represented domains (like the Titanic, which LLMs have seen extensively in pre-training), zero-shot with contextual hints matched or beat every other strategy on accuracy and F1 for both models.
Prompting appears to have a ceiling on tabular tasks. The hard case analysis shows that the models' statistical priors, particularly on sex and class, were not overridden by any prompting strategy when a passenger's features strongly align with dominant training patterns. Fine-tuning or retrieval-augmented approaches would likely be needed to push accuracy significantly higher.

Author

Md Umair Alam
B.Tech, IIT Delhi

GitHub: https://github.com/Umair-IITD
LinkedIn: https://www.linkedin.com/in/umair-alam-iitd/

License

MIT License — see LICENSE for details.

Acknowledgements

Dataset: Titanic - Machine Learning from Disaster (Kaggle)
Inference: Groq for ultra-fast LLaMA inference
Models: Meta LLaMA 3.1 and 3.3 families
