This project investigates whether advanced prompt engineering techniques can meaningfully improve the performance of Large Language Models (LLMs) on structured, tabular prediction tasks — an increasingly important challenge in real-world AI deployments such as risk assessment, medical triage, and customer churn prediction.
Using the classic Titanic survival dataset, I designed and evaluated 5 distinct prompting strategies across 2 production LLMs (llama-3.1-8b-instant and llama-3.3-70b-versatile), resulting in 10 experimental conditions evaluated on a 500-passenger held-out test set. All predictions were generated via real LLM API calls — no simulations, no shortcuts.
The project goes beyond simply calling an API. It involves careful prompt design, data leakage prevention, structured output parsing with fallback strategies, rate-limit-aware inference, and a systematic hard-case analysis to understand where LLM reasoning breaks down.
- Built and evaluated 10 LLM-based experimental pipelines
- Processed 7,000+ real API inference calls
- Designed robust prompt engineering strategies (CoT, ToT, Self-Consistency)
- Implemented fault-tolerant parsing & checkpointing
- Conducted systematic failure analysis on misclassified cases
| Model | Strategy | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| llama-3.3-70b-versatile | Zero-shot | 0.822 | 0.770 | 0.750 | 0.760 |
| llama-3.3-70b-versatile | Self-consistency | 0.818 | 0.805 | 0.681 | 0.738 |
| llama-3.1-8b-instant | Zero-shot | 0.806 | 0.733 | 0.761 | 0.747 |
| llama-3.1-8b-instant | Self-consistency | 0.806 | 0.736 | 0.755 | 0.745 |
| llama-3.3-70b-versatile | CoT | 0.802 | 0.795 | 0.638 | 0.708 |
| llama-3.1-8b-instant | CoT | 0.800 | 0.806 | 0.617 | 0.699 |
| llama-3.3-70b-versatile | Five-shot | 0.788 | 0.781 | 0.606 | 0.683 |
| llama-3.3-70b-versatile | ToT | 0.778 | 0.781 | 0.569 | 0.658 |
| llama-3.1-8b-instant | ToT | 0.652 | 0.532 | 0.612 | 0.569 |
| llama-3.1-8b-instant | Five-shot | 0.632 | 0.506 | 0.846 | 0.633 |
Best single result: llama-3.3-70b-versatile with zero-shot prompting → 82.2% accuracy, F1 = 0.760
5 prompting strategies × 2 LLMs = 10 experimental conditions
500 test passengers:
- 8 standard conditions → 500 × 8 = 4,000 API calls
- Self-consistency (3 runs × 2 models) → 500 × 3 × 2 = 3,000 API calls
Total: 7,000 real API calls
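The call budget above can be checked with a few lines of arithmetic. This is only a sketch: the constant and function names are illustrative and are not taken from the notebook.

```python
# Experiment dimensions as stated in this README.
N_PASSENGERS = 500
N_MODELS = 2
STRATEGIES = ["zero-shot", "five-shot", "cot", "self-consistency", "tot"]
SC_RUNS = 3  # self-consistency samples per passenger


def calls_per_strategy(strategy: str) -> int:
    """API calls needed for one strategy across both models."""
    runs = SC_RUNS if strategy == "self-consistency" else 1
    return N_PASSENGERS * N_MODELS * runs


budget = {s: calls_per_strategy(s) for s in STRATEGIES}
total_calls = sum(budget.values())

print(budget)
print("total:", total_calls)
```

Self-consistency alone accounts for 3,000 of the calls, which is why it dominates the run time.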
| Strategy | Description | Key Design Choice |
|---|---|---|
| Zero-shot | Task description + historical context only | Tests raw LLM knowledge |
| Five-shot | 5 demographically diverse labeled examples | Covers all key survival groups |
| Chain-of-Thought | 5 examples + step-by-step reasoning traces | Forces factor-by-factor analysis |
| Self-Consistency | 3 independent runs from different analytical perspectives + majority vote | Reduces variance in predictions |
| Tree-of-Thought | Explicit hypothesis exploration (survived vs. died) with confidence rating | Prevents premature commitment |
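The majority-vote step of self-consistency can be sketched as follows. This is a minimal illustration, not the notebook's code: the function name, the handling of unparseable votes (`None`), and the tie-breaking default are all assumptions.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(votes: Sequence[Optional[int]], default: int = 0) -> int:
    """Combine binary predictions (1 = survived, 0 = died) by majority.

    Votes that are not 0 or 1 (for example None from a failed parse) are
    ignored. If no valid vote remains, or the valid votes are tied, the
    assumed `default` label is returned.
    """
    valid = [v for v in votes if v in (0, 1)]
    if not valid:
        return default
    counts = Counter(valid)
    if counts[0] == counts[1]:
        return default
    return 1 if counts[1] > counts[0] else 0


# Three runs for one passenger, one per analytical perspective.
print(majority_vote([1, 0, 1]))
```

With three valid votes a tie cannot occur, so the default only matters when one or more runs fail to produce a usable label.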
- No data leakage: test set strictly separated from few-shot pool before any prompt construction
- Robust label parsing: 5-tier fallback parser handles JSON, keyword, and digit-based model outputs
- Checkpoint-based inference: results saved after every condition so long runs are resumable
- Rate-limit-aware delays: exponential backoff with per-strategy tuned delays
- Self-consistency diversity: 5 analytical personas (naval historian, sociologist, statistician, maritime expert, Edwardian historian) to maximise vote diversity
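To illustrate the idea of a fallback parser, here is a simplified three-tier sketch (JSON, then keywords, then a bare digit). It is not the project's 5-tier parser: the function name, the `"prediction"` JSON key, and the keyword patterns are assumptions made for the example.

```python
import json
import re
from typing import Optional

# Assumed keyword patterns; the real prompts may elicit different wording.
NEGATIVE = re.compile(r"\b(?:not\s+surviv\w*|died|dies|perished)\b")
POSITIVE = re.compile(r"\bsurviv(?:ed|es|e|or)\b")


def parse_label(text: str) -> Optional[int]:
    """Return 1 (survived), 0 (died), or None if no tier matches."""
    # Tier 1: a JSON object such as {"prediction": 1} anywhere in the text.
    match = re.search(r"\{.*?\}", text, flags=re.DOTALL)
    if match:
        try:
            value = str(json.loads(match.group(0)).get("prediction")).strip()
            if value in {"0", "1"}:
                return int(value)
        except json.JSONDecodeError:
            pass  # malformed JSON: fall through to the next tier

    # Tier 2: keywords, checking negations first so that
    # "did not survive" is not read as "survive".
    lowered = text.lower()
    if NEGATIVE.search(lowered):
        return 0
    if POSITIVE.search(lowered):
        return 1

    # Tier 3: a standalone 0 or 1 as a last resort.
    digit = re.search(r"\b([01])\b", text)
    return int(digit.group(1)) if digit else None


print(parse_label('Reasoning... {"prediction": "0"}'))
print(parse_label("The passenger most likely did not survive."))
print(parse_label("Answer: 1"))
```

Ordering the tiers from most to least structured means a well-formed answer is never overridden by stray words or digits in the model's reasoning text.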
I identified 15 passengers misclassified across 5+ conditions and analyzed the root causes. Three systematic failure modes emerged:
- Male survival blind spot: When a passenger is male, 3rd class, and paid a low fare, both models predict death with near certainty regardless of prompting strategy. Male 3rd-class survivors like Albimona, Andersson, and Niskanen were misclassified across all 10 conditions.
- Female fatality failure: The strong female-survival prior makes it nearly impossible for the models to predict death for women. The Allison case (a 2-year-old 1st-class girl who died) was misclassified by every strategy on both models.
- Missing feature collapse: When age is absent, the models lose one of their few potential override signals. Combined with male sex and 3rd-class travel, this made correct prediction essentially unreachable.
Key insight: Neither chain-of-thought nor tree-of-thought prompting was sufficient to override strong statistical priors. This suggests a ceiling on prompting-only approaches for tabular prediction; fine-tuning is likely required to break through it.
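The hard-case selection described above (passengers misclassified in 5 or more conditions) can be sketched with pandas. The `PassengerId` and `true_label` column names follow the layout of `results/titanic_predictions_real.csv`; the function name and the toy data are illustrative only.

```python
import pandas as pd


def find_hard_cases(results: pd.DataFrame, min_errors: int = 5) -> pd.DataFrame:
    """Return passengers misclassified in at least `min_errors` conditions."""
    pred_cols = [c for c in results.columns if c not in ("PassengerId", "true_label")]
    # True wherever a condition's prediction differs from the true label.
    errors = results[pred_cols].ne(results["true_label"], axis=0)
    out = results[["PassengerId", "true_label"]].copy()
    out["n_errors"] = errors.sum(axis=1)
    return out[out["n_errors"] >= min_errors].sort_values("n_errors", ascending=False)


# Toy example with two conditions instead of ten.
demo = pd.DataFrame(
    {
        "PassengerId": [1, 2, 3],
        "true_label": [1, 0, 1],
        "cond_a": [0, 0, 1],
        "cond_b": [0, 0, 0],
    }
)
hard = find_hard_cases(demo, min_errors=2)
print(hard)
```

On the real predictions file, the default `min_errors=5` corresponds to the 5+ conditions threshold used in the analysis.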
.
├── notebooks/
│ └── Tabular-prediction.ipynb # Complete experiment pipeline
├── plots/
│ ├── metrics_comparison.png # All 4 metrics across 10 conditions
│ ├── model_comparison.png # F1 comparison between models
│ ├── confusion_matrices.png # Confusion matrices for all conditions
│ └── hard_cases_analysis.png # Error distribution + top misclassified
├── report/
│ ├── Tabular_Prediction.docx # Full written report
│ └── Instruction(1).pdf # Original assignment specification
├── results/
│ ├── titanic_predictions_real.csv # Raw predictions for all 10 conditions
│ ├── titanic_metrics.csv # Computed metrics table
│ └── hard_cases.csv # Misclassified passenger analysis
├── datasets/
│ └── Titanic-Dataset.csv # Source dataset (891 passengers)
├── .env.example # API key template
├── .gitignore
└── requirements.txt
Due to the stochastic nature of LLM outputs and API-based inference, results may vary slightly across runs. However, overall trends and relative performance across prompting strategies remain consistent.
- Python 3.10+
- A free Groq API key (used for llama-3.1-8b-instant and llama-3.3-70b-versatile)
# 1. Clone the repository
git clone https://github.com/Umair-IITD/Tabular_Prediction_Using_LLM.git
cd Tabular_Prediction_Using_LLM
# 2. Install dependencies
pip install -r requirements.txt
# 3. Set up your API key
cp .env.example .env
# Edit .env and add your Groq API key:
# GROQ_API_KEY=your_key_here
# Open the notebook
jupyter notebook notebooks/Tabular-prediction.ipynb
# Run all cells in order (Step 1 through Step 11)
# The experiment auto-saves checkpoints after each condition
# so it is safe to interrupt and resume
Time estimate: The full run takes approximately 2–4 hours depending on API latency. Self-consistency is the most expensive strategy (3 runs per passenger per model).
If you only want to reproduce the metrics and charts without re-running inference:
# In any Python environment with the results/ folder present:
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score
results = pd.read_csv('results/titanic_predictions_real.csv')
pred_cols = [c for c in results.columns if c not in ['PassengerId', 'true_label']]
for col in pred_cols:
    acc = accuracy_score(results['true_label'], results[col])
    f1 = f1_score(results['true_label'], results[col], zero_division=0)
    print(f'{col:<50} Acc: {acc:.3f} F1: {f1:.3f}')
| Component | Technology |
|---|---|
| LLM Inference | Groq API |
| Models | llama-3.1-8b-instant, llama-3.3-70b-versatile |
| Data Processing | pandas, numpy |
| Evaluation Metrics | scikit-learn |
| Visualisation | matplotlib, seaborn |
| Progress Tracking | tqdm |
| Environment Management | python-dotenv |
| Notebook | Jupyter |
groq
pandas
numpy
scikit-learn
matplotlib
seaborn
tqdm
python-dotenv
jupyter
Install all with:
pip install -r requirements.txt
- Larger models tolerate complex prompting better. llama-3.3-70b stayed within 4.4 accuracy points of its zero-shot baseline under every strategy (0.778 to 0.822), whereas llama-3.1-8b dropped from 0.806 to 0.632 with five-shot and 0.652 with ToT, suggesting a capacity threshold for complex prompt structures.
- Self-consistency is the most reliable advanced strategy. At the cost of 3x the API calls, it came closest to zero-shot on both models (0.818 vs. 0.822 accuracy for 70B, 0.806 vs. 0.806 for 8B) and avoided the instability seen in five-shot and ToT. It did not outperform zero-shot on accuracy or F1.
- Zero-shot is surprisingly competitive. For well-represented domains (like the Titanic, which LLMs have likely seen extensively in pre-training), zero-shot with contextual hints matched or beat every multi-step strategy on both accuracy and F1.
- Prompting appears to have a hard ceiling on tabular tasks. The hard-case analysis shows that LLM statistical priors, particularly on sex and class, were not overridden by any of the five prompting strategies when features strongly aligned with dominant training patterns. Fine-tuning or retrieval-augmented approaches would likely be needed to push accuracy significantly higher.
Md Umair Alam
B.Tech, IIT Delhi
- GitHub: https://github.com/Umair-IITD
- LinkedIn: https://www.linkedin.com/in/umair-alam-iitd/
MIT License — see LICENSE for details.
- Dataset: Titanic - Machine Learning from Disaster (Kaggle)
- Inference: Groq for ultra-fast LLaMA inference
- Models: Meta LLaMA 3.1 and 3.3 families



