Natural Language Processing, 2024/25
As part of the Natural Language Processing course at Politecnico di Milano, our team, Red Chilli Models, set out to build a full-scale, end-to-end Retrieval-Augmented Generation (RAG) pipeline that goes beyond a simple proof-of-concept. We analyzed the RAG-Instruct dataset (~40,000 QA pairs) in depth, benchmarked multiple LLMs, and fine-tuned a FLAN-T5 model. As the project owner and a key contributor, I guided the team's strategy to explore advanced techniques, from two-stage retrieval systems to experimental multi-step generation with voting. This document details the architecture, methods, and findings of that work.
This project was a true collaborative effort, and its success is a testament to the talent and dedication of the entire team.
- Name: RAG-Instruct
- Source: Hugging Face
- Size: ~40,000 QA pairs
- Features: Each instance includes a question, a generated answer, and retrieved documents as supporting context.
This phase focused on thoroughly understanding the dataset's structure, content, and inherent characteristics using various NLP techniques.
- Briefly describe the data:
  - Structure & Task: The dataset is structured into `question`, `answer`, and `documents` fields. It was collected for Retrieval-Augmented Generation (RAG) tasks, where a Large Language Model (LLM) generates answers grounded in provided context.
  - Document Types & Counts: Contains ~40,541 QA pairs. The `documents` field comprises lists of strings (individual passages), which were combined for analysis.
  - Length Distribution (on a 10% sample): Questions (mean ~23 tokens), answers (mean ~105 tokens), and combined context documents (mean ~1166 tokens) exhibit distinct length distributions. The long context length highlights the need for context management strategies in LLM integration.
  - Vocabulary (on a 10% sample): The total collection vocabulary (unique tokens across all combined context documents in the sample) is large (~184,890 unique tokens). The average vocabulary size per combined document entry is ~491 unique tokens, indicating focused context blocks.
- Play around with documents using code from the early parts of the course:
  - Cluster the documents & visualize:
    - Combined document text from a 10% sample was vectorized using `TfidfVectorizer` (with `max_df=0.8`, `min_df=5`, `stop_words='english'`).
    - K-Means clustering (K=25, guided by Elbow method analysis) was applied to identify thematic groupings.
    - Interpretation: The clustering successfully identified diverse and interpretable themes such as Music, Linguistics, Chemistry, Computer Software, Web Technologies, Education/Academia, Programming, Sports, Politics/Warfare, Literature, Biology/Medicine, Data Management, History, Games, Religion, Industry/Manufacturing, Film/Television, Law/Legal System, and Demographics.
  - Index the documents for keyword search:
    - A PyTerrier index was created over the individual passages from the `documents` column.
    - Demonstrated keyword search using BM25 retrieval, including dynamic spelling correction and phrase search capabilities.
  - Train a Word2Vec embedding:
    - A Word2Vec model (`gensim.models.Word2Vec`) was trained on the tokenized text from the dataset.
    - Investigated embedding properties through `most_similar` terms and vector arithmetic (e.g., 'king' - 'man' + 'woman' ≈ 'queen').
    - A BM25 score visualization (Bar Chart) of sampled word embeddings was performed to illustrate semantic relationships.
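To make the BM25 ranking used in the keyword-search step more concrete, here is a minimal pure-Python sketch of the scoring formula. It is not PyTerrier's implementation: the whitespace tokenizer, the `bm25_scores` helper, the parameter values `k1=1.2` and `b=0.75`, and the three toy passages are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a stand-in for a real analyzer)."""
    return text.lower().split()


def bm25_scores(query, passages, k1=1.2, b=0.75):
    """Return one BM25 score per passage for the given query."""
    docs = [tokenize(p) for p in passages]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of passages that contain each term.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Smoothed IDF that stays positive even for very common terms.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy passages, invented for this example (not taken from RAG-Instruct).
passages = [
    "the guitar is a string instrument used in rock music",
    "python is a programming language used for data analysis",
    "the piano and the guitar are popular music instruments",
]
scores = bm25_scores("guitar music", passages)
ranking = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

The two matching passages contain the same query terms equally often, so the length normalization decides their order: the shorter passage scores slightly higher.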
This phase involved building and evaluating models for the RAG task, including fine-tuning pre-trained models and assessing LLM performance in various settings.
- Implement and evaluate a RAG pipeline:
  - Retriever System:
    - Utilized a two-stage retrieval approach: a `SentenceTransformer` (bi-encoder) for initial dense retrieval, and a `CrossEncoder` for re-ranking.
    - An `hnswlib` index was used for efficient Approximate Nearest Neighbor search over corpus embeddings.
    - A function `get_related_docs` retrieves and formats the top relevant documents into a context string for the LLM.
  - LLM Integration & Generation:
    - The "AITeamVN/Vi-Qwen2-3B-RAG" (a 3 Billion parameter model) was selected as the generator, loaded with 4-bit quantization (`bitsandbytes`) to optimize resource usage.
    - LLM Characterization: Its configuration (Qwen2 architecture, large context window) and Hugging Face model card were analyzed for RAG-specific tuning and capabilities.
    - Baseline Evaluation: Initial qualitative evaluations of the end-to-end RAG pipeline using sample queries revealed LLM performance was highly dependent on retriever quality. When retrieved context was poor, the LLM frequently defaulted to its pre-trained knowledge, leading to answers that were not grounded in the provided context. With good context, answers were relevant and grounded.
    - Prompt Engineering & Generation Parameter Tuning: Systematically experimented with various prompt structures (e.g., strict "answer-only-from-context" instructions, role-playing) and LLM generation parameters (`temperature`, `top_p`, `do_sample`, `max_new_tokens`).
    - Core RAG Generation Logic: A programmatic RAG flow was implemented to control answer sourcing. This flow attempts to answer strictly from the provided context. If the LLM's response indicates (via specific cues or brevity) that the context is insufficient, the system falls back to generating an answer from the LLM's general knowledge.
    - Context Length Handling: Strategies for managing very long retrieved contexts (e.g., truncation) were investigated to ensure they fit within the LLM's effective processing window.
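The retrieve-then-rerank control flow from the Retriever System can be sketched roughly as follows. This is a simplified stand-in rather than the project's code: exhaustive cosine search replaces the `hnswlib` index, a word-overlap function replaces the `CrossEncoder`, and the names `two_stage_retrieve`, `format_context`, and `overlap_score`, the hand-written vectors, and the toy corpus are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def two_stage_retrieve(query, query_vec, corpus, corpus_vecs, rerank_score,
                       first_stage_k=3, final_k=2):
    """Shortlist by embedding similarity, then reorder the shortlist."""
    # Stage 1: cheap similarity over the whole corpus, keep a shortlist.
    by_similarity = sorted(range(len(corpus)),
                           key=lambda i: cosine(query_vec, corpus_vecs[i]),
                           reverse=True)
    candidates = by_similarity[:first_stage_k]
    # Stage 2: a slower pairwise scorer is applied to the shortlist only.
    reranked = sorted(candidates,
                      key=lambda i: rerank_score(query, corpus[i]),
                      reverse=True)
    return [corpus[i] for i in reranked[:final_k]]


def format_context(passages):
    """Join passages into one numbered context string for the prompt."""
    return "\n\n".join(f"[{n}] {p}" for n, p in enumerate(passages, start=1))


def words(text):
    """Lowercased word set with basic punctuation stripped."""
    return {w.strip(".,?!").lower() for w in text.split()}


def overlap_score(question, passage):
    """Toy reranker: number of words shared by question and passage."""
    return len(words(question) & words(passage))


# Toy corpus and hand-written 3-d "embeddings", invented for this example.
corpus = [
    "Paris is the capital of France.",
    "France borders Spain and Italy.",
    "The Eiffel Tower is in Paris.",
    "Bananas are rich in potassium.",
]
corpus_vecs = [[0.8, 0.2, 0.1], [0.7, 0.3, 0.0], [0.9, 0.1, 0.0], [0.0, 0.1, 0.9]]
query = "What is the capital of France?"
query_vec = [1.0, 0.0, 0.0]

top = two_stage_retrieve(query, query_vec, corpus, corpus_vecs, overlap_score)
print(format_context(top))
```

In this toy setup the first stage places the Eiffel Tower passage first, and the second stage promotes the passage that actually answers the question, which is the reason for paying the extra cost of a pairwise scorer on a small shortlist.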
- Evaluate LLMs: zero-shot, few-shot, and one-shot performance:
  - Assessed the LLM's inherent capabilities to answer questions without, or with very limited, direct fine-tuning.
  - The "AITeamVN/Vi-Qwen2-3B-RAG" model and other general-purpose LLMs (e.g., FLAN-T5-small, FLAN-T5-base, Falcon-RW-1B, Mistral-7B-Instruct-v0.1) were evaluated in zero-shot, one-shot, and few-shot settings.
  - Qualitative and quantitative evaluations (ROUGE, BLEU, BERTScore) were performed to compare performance and assess in-context learning.
- Fine-tune a pretrained model:
  - A FLAN-T5 model (e.g., `google/flan-t5-small`) was selected and fine-tuned on the RAG-Instruct dataset.
  - The model was fine-tuned using (instruction, input/context, output/answer) pairs, adapting it to the specific QA task.
  - Standard fine-tuning techniques were applied using the Hugging Face `transformers` Trainer API.
  - The fine-tuned FLAN-T5 model's performance was evaluated against its pre-trained version and compared to the "AITeamVN/Vi-Qwen2-3B-RAG" model in a RAG setup.
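One plausible way to turn a dataset record into the (instruction, input/context, output/answer) form used for fine-tuning is sketched below. The field names `question`, `answer`, and `documents` come from the dataset description. The instruction wording, the prompt layout, the `build_example` helper, the word-level truncation budget, and the sample record are assumptions for illustration, not the template the project used.

```python
# Hypothetical instruction text; the project's actual wording is not documented here.
INSTRUCTION = "Answer the question using only the provided context."


def build_example(record, max_context_words=300):
    """Turn one dataset record into a (source, target) text pair."""
    # Passages in `documents` are joined into a single context block.
    context = " ".join(record["documents"])
    # Crude word-level truncation keeps the source within a fixed budget;
    # a real pipeline would truncate by tokenizer tokens instead.
    context = " ".join(context.split()[:max_context_words])
    source = (f"{INSTRUCTION}\n\n"
              f"Context: {context}\n\n"
              f"Question: {record['question']}\n\n"
              f"Answer:")
    target = record["answer"]
    return source, target


# Invented record with the same field names as the dataset.
record = {
    "question": "Which instrument has six strings?",
    "documents": ["The guitar typically has six strings.",
                  "The violin has four strings."],
    "answer": "The guitar, which typically has six strings.",
}
source, target = build_example(record)
print(source)
print(target)
```

The resulting `source` and `target` strings would then be tokenized as encoder input and decoder labels before being handed to a sequence-to-sequence trainer.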
This section outlines additional investigations performed or planned beyond the core project requirements.
- Programmatic Multi-Step RAG with Stage 1 Voting:
  - As an advanced refinement to the core RAG generation logic, an experimental Stage 1 ensemble/voting mechanism was explored.
  - This approach runs multiple conservative generation attempts (e.g., Greedy, VeryLowTemp, LowTemp) with the strict context prompt. A voting threshold (e.g., requiring 2 out of 3 votes) then determines if the context is deemed sufficient for a grounded answer.
  - Findings: This voting strategy showed improved robustness in identifying poor context and correctly falling back to general knowledge for many queries, but highlighted persistent challenges with extremely irrelevant context leading to confident, ungrounded hallucinations.
- Voice Interactive Chatbot: A voice-interactive chatbot interface for the RAG system, built on text-to-speech (TTS) and speech-to-text (STT) models. It integrates audio input/output with the existing text-based QA pipeline, allowing users to ask questions verbally and receive spoken answers.
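The Stage 1 voting mechanism described in the first item of this list can be sketched as follows. This is a minimal illustration under stated assumptions: `generate` is a hypothetical callable wrapping the LLM, the cue phrases, the `min_words` threshold, the prompt wording, and the temperature values behind the Greedy, VeryLowTemp, and LowTemp labels are invented, and `fake_generate` is a stub so the example runs without a model.

```python
# Hypothetical refusal cues; the project's actual cue list is not documented here.
INSUFFICIENT_CUES = ("i don't know", "not enough information", "cannot be answered")

# Assumed decoding settings for the three conservative attempts.
CONFIGS = [
    {"do_sample": False},                      # Greedy
    {"do_sample": True, "temperature": 0.1},   # VeryLowTemp
    {"do_sample": True, "temperature": 0.3},   # LowTemp
]


def context_is_sufficient(answer, min_words=3):
    """One vote: reject very short answers and answers with a refusal cue."""
    text = answer.strip().lower()
    if len(text.split()) < min_words:
        return False
    return not any(cue in text for cue in INSUFFICIENT_CUES)


def answer_with_voting(question, context, generate, configs, min_votes=2):
    """Return (answer, source), where source is 'context' or 'general_knowledge'."""
    strict_prompt = ("Answer only from the context. If the context is not "
                     "enough, say \"I don't know\".\n\n"
                     f"Context: {context}\n\nQuestion: {question}\nAnswer:")
    attempts = [generate(strict_prompt, **cfg) for cfg in configs]
    accepted = [a for a in attempts if context_is_sufficient(a)]
    if len(accepted) >= min_votes:
        return accepted[0], "context"
    # Too few votes: fall back to the model's own knowledge.
    fallback_prompt = ("Answer the question from your general knowledge.\n\n"
                       f"Question: {question}\nAnswer:")
    return generate(fallback_prompt, do_sample=False), "general_knowledge"


def fake_generate(prompt, **params):
    """Stand-in for an LLM call so the sketch runs without a model."""
    if "general knowledge" in prompt:
        return "Canberra is the capital of Australia."
    if "Canberra" in prompt:
        return "The capital of Australia is Canberra."
    return "I don't know."


question = "What is the capital of Australia?"
grounded = answer_with_voting(question, "Canberra is Australia's capital city.",
                              fake_generate, CONFIGS)
fallback = answer_with_voting(question, "Bananas are rich in potassium.",
                              fake_generate, CONFIGS)
print(grounded)
print(fallback)
```

Note that a vote only checks for refusal cues and brevity, so a confident but ungrounded answer still counts as a vote for the context. That is consistent with the limitation reported in the findings for extremely irrelevant context.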
- Python
- Pandas, NumPy
- NLTK, Scikit-learn (for TF-IDF, KMeans, SVD)
- Gensim (for Word2Vec)
- Hugging Face `transformers` (for LLMs, tokenizers, FLAN-T5 models, SentenceTransformers, CrossEncoders)
- `bitsandbytes` (for LLM quantization)
- `hnswlib` (for ANN indexing)
- `pyterrier` (for indexing and BM25 keyword search exploration)
- `evaluate`, `rouge-score`, `bert-score`, `bleu` (for metric calculation)
- `TTS`, `soundfile`, `espeak-ng` (for Text-to-Speech)
- Matplotlib, Seaborn, Plotly (for visualizations)
- Google Colab / Kaggle Notebooks (for development and GPU resources)
This project was assigned as part of the Natural Language Processing course (2024/25) at Politecnico di Milano. We extend our sincere gratitude to:
- Professor Mark Carman, for providing excellent guidance throughout the course.
- Teaching Assistant Nicolò Brunello, for their support and valuable feedback throughout the course.