Nagasantosh nagasantoshchavvakula

👋 Hello, I'm Nagasantosh!

🔍 Passionate about Data Science, Machine Learning, Deep Learning, Generative AI, Computer Vision, NLP, MLOps, Data Engineering, and Full-Stack Development
🚀 Currently building end-to-end AI, Machine Learning, Deep Learning, Generative AI (RAG), Data Engineering, and Analytics solutions using modern tools, frameworks, and cloud-ready architectures
🌱 Continuously expanding expertise in Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), Transformer-based NLP, Transfer Learning, MLOps, Cloud AI Services, and Scalable Data Pipelines
🤝 Open to collaborating on Machine Learning, Deep Learning, Generative AI, Computer Vision, NLP, Data Engineering, ETL Pipelines, Analytics Platforms, and Real-World AI Applications
💡 Experienced in developing projects involving Predictive Analytics, Recommendation Systems, Fraud Detection, Medical Image Classification, Document Intelligence, Sentiment Analysis, Workflow Automation, and Interactive Dashboards
📊 Skilled in Python, SQL, Scikit-learn, TensorFlow, PyTorch, Hugging Face, LangChain, Streamlit, Power BI, Tableau, MLflow, DVC, Docker, Git, GitHub, and Cloud-Based Data & AI Technologies
🏗️ Strong focus on building scalable, production-ready applications that combine data engineering, machine learning, deep learning, MLOps, and software engineering best practices
🎯 Aspiring to contribute to innovative teams solving complex business problems through AI-driven, data-centric solutions

🌟 Academic Background

🎓 Master of Science in Computer Science
University of Central Missouri, USA | Jan 2023 - May 2024
🎓 Bachelor of Technology in Electrical and Electronics Engineering
JB Institute of Engineering and Technology, India | Aug 2015 - July 2019

🏆 Certifications

Certified In

☁️ AWS Machine Learning Engineering
🚀 AWS Cloud Practitioner

🤝 Let’s Connect

🧰 Core Tech Stack

Tools & Technologies

🧠 Key Areas of Expertise

Generative AI & Large Language Models (LLMs) (Retrieval-Augmented Generation (RAG), Prompt Engineering, LangChain, Vector Databases, Semantic Search, Document Intelligence, AI-Powered Knowledge Retrieval)
Data Analytics & Business Intelligence (Exploratory Data Analysis, Statistical Analysis, KPI Reporting, Dashboard Development, Business Insights)
Machine Learning & Predictive Modeling (Regression, Classification, Clustering, Recommendation Systems, Fraud Detection, Churn Prediction)
Deep Learning & Computer Vision (CNNs, DNNs, Transfer Learning, Medical Image Classification, Ensemble Learning, Explainable AI)
Natural Language Processing (NLP) (Sentiment Analysis, Transformer Models, Hugging Face, Text Preprocessing, Information Retrieval, Context-Aware AI Systems)
Data Engineering & ETL Pipelines (Data Ingestion, Transformation, Workflow Automation, Batch Processing, Data Pipelines, Workflow Orchestration)
MLOps & ML Lifecycle Management (DVC, MLflow, CI/CD Pipelines, Automated Testing, Experiment Tracking, Model Versioning, Reproducibility)
Model Optimization & Evaluation (Hyperparameter Tuning, Cross Validation, ROC-AUC, Confusion Matrix, Performance Analysis)
Cloud & Scalable AI Systems (AWS, Azure, Docker, Streamlit, Cloud-Based ML Workflows, Deployment-Ready AI Applications)
Data Visualization & Interactive Dashboards (Power BI, Tableau, Plotly, Streamlit, Matplotlib, Seaborn)
Software Engineering & Application Development (Python, Java, Spring Boot, React, REST APIs, Git, GitHub Actions, Modular Architecture, Workflow Automation)

📊 Featured Projects as a Data Science Intern

🤖 Document QA ChatBot — Retrieval-Augmented Generation (RAG) Application

https://github.com/nagasantoshchavvakula/Document-QA-ChatBot.git

A comprehensive Retrieval-Augmented Generation (RAG) based Document Question Answering application designed to enable users to interact with PDF documents using natural language. The system combines Large Language Models (LLMs), vector databases, semantic search, document processing pipelines, and Generative AI techniques to deliver accurate, context-aware responses directly from uploaded documents.

This project simulates a real-world Enterprise Knowledge Assistant capable of extracting, indexing, retrieving, and generating insights from unstructured document data. By integrating modern AI frameworks and vector search technologies, the application demonstrates how organizations can build intelligent document understanding systems for research, compliance, legal analysis, customer support, and enterprise knowledge management use cases.

The project follows a complete end-to-end Generative AI and Retrieval-Augmented Generation lifecycle, integrating document ingestion, text extraction, text chunking, embedding generation, vector indexing, semantic retrieval, prompt engineering, LLM-powered response generation, environment management, debugging, and deployment-ready application development.

Status: ⚙️ In Progress
Core Stack: Python, Streamlit, LangChain, Groq Llama3, HuggingFace Embeddings, FAISS, PyPDFLoader, Sentence Transformers, python-dotenv, Git, GitHub

Generative AI, NLP & Document Intelligence Focus

Built an end-to-end Document QA ChatBot using Retrieval-Augmented Generation (RAG) architecture
Implemented PDF ingestion, text extraction, chunking, and semantic retrieval using PyPDFLoader, LangChain, and FAISS
Generated vector embeddings using HuggingFace BAAI/bge-small-en-v1.5 for efficient similarity search
Developed an interactive Streamlit interface for document-based question answering
Engineered prompt templates and retrieval workflows to deliver accurate, context-aware responses
Applied secure configuration management using python-dotenv and environment variables
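
The chunk, embed, retrieve, and prompt steps listed above can be sketched in plain Python. This is an illustrative sketch only, not the project's code: the bag-of-words `embed` function stands in for a real embedding model such as BAAI/bge-small-en-v1.5, the sorted scan stands in for a FAISS index, and every function name here is hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping character windows."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query (brute-force scan)."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query, context_chunks):
    """Assemble a grounded prompt from the retrieved chunks."""
    context = "\n---\n".join(context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
```

In a real pipeline the prompt returned by `build_prompt` would be sent to the LLM; the overlap between chunks keeps sentences that straddle a chunk boundary retrievable from at least one chunk.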

LLM Integration & AI Application Development

Integrated Groq-hosted Llama3-8B-8192 for document-grounded answer generation
Built semantic search pipelines using LangChain Retrieval Chains, vector search, and prompt engineering
Optimized document retrieval through chunking strategies and embedding-based nearest-neighbor search
Developed scalable PDF processing and AI-powered knowledge retrieval workflows
Followed software engineering best practices including modular architecture, dependency management, version control, and reproducible development workflows

Problem Solving & Engineering Highlights

Resolved LangChain version compatibility, Python 3.13 dependency conflicts, and Streamlit configuration issues
Migrated the project to Python 3.11 for stable package compatibility
Implemented secure API key management by moving credentials from source code to environment variables
Improved maintainability through structured project organization and dependency pinning

Skills Demonstrated

Generative AI: RAG, Prompt Engineering, LLM Integration
NLP: Semantic Search, Information Retrieval, Document Understanding
LLMs: Groq Llama3, Context-Aware Response Generation
Vector Databases: FAISS, Embedding-Based Search
AI Frameworks: LangChain, HuggingFace Embeddings
Document Processing: PDF Parsing, Text Chunking
Development: Python, Streamlit, API Integration
DevOps & Collaboration: Virtual Environments, Git, GitHub, Dependency Management

Goal: Deliver a production-ready Document Intelligence solution that combines Retrieval-Augmented Generation, semantic search, vector databases, and Large Language Models to demonstrate real-world AI engineering, knowledge retrieval, and enterprise-scale document understanding capabilities.

🩺 Image Classification for Medical Diagnosis — Deep Learning Pipeline Project

https://github.com/nagasantoshchavvakula/Image-Classification-For-Medical-Diagnosis.git

A comprehensive deep learning-based medical image classification pipeline designed to detect Pneumonia from Chest X-ray images using advanced neural networks, transfer learning architectures, explainable AI, and ensemble learning techniques. This project simulates a real-world healthcare diagnostic system for assisting medical professionals in making data-driven clinical predictions.

The project follows a complete end-to-end deep learning lifecycle, integrating image preprocessing, neural network training, transfer learning, optimization strategies, explainable AI visualizations, and ensemble-based prediction systems for improved classification performance and model interpretability.

Status: ⚙️ In Progress
Core Stack: Python, TensorFlow, Keras, NumPy, Pandas, Scikit-learn, Matplotlib, Seaborn, DVC, PyTest, GitHub Actions (CI/CD), Jupyter Notebook, Git

Deep Learning & Data Pipeline Focus

Implemented structured image data loading pipelines using TensorFlow and Keras ImageDataGenerator
Performed image preprocessing, normalization, augmentation, and dataset balancing using class weights
Built scalable and reusable deep learning workflows for training, evaluation, and inference
Applied data augmentation techniques including rotation, zoom, flipping, and width/height shifting for robust generalization
Managed dataset versioning and reproducibility using DVC (Data Version Control)
Conducted comprehensive experimentation and notebook-driven analysis for model comparison and optimization
Maintained modular project architecture with separate components for data loading, model building, training, and evaluation
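
As a small illustration of balancing with class weights, the sketch below computes weights with the common "balanced" heuristic, n_samples / (n_classes * class_count). Whether the project uses this exact formula is an assumption, and the function name is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency ('balanced' heuristic)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}
```

A dictionary like this can be passed to a training loop so that errors on the rarer class contribute more to the loss.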

Model Development, Evaluation & MLOps

Developed and optimized Baseline MLP, Deep Neural Networks, and Transfer Learning models using TensorFlow and Keras for medical image classification
Implemented pretrained architectures including MobileNetV2, VGG16, and ResNet50 with fine-tuning for enhanced model performance
Applied advanced training strategies using EarlyStopping, ModelCheckpoint, ReduceLROnPlateau, TensorBoard, and custom learning rate scheduling
Performed experimentation with activation functions, optimizers, regularization techniques, and hyperparameter optimization
Built ensemble learning models using soft-voting techniques to improve prediction accuracy and robustness
Designed end-to-end evaluation pipelines using Accuracy, Precision, Recall, F1-Score, ROC-AUC, Confusion Matrix, and ROC Curves
Implemented Grad-CAM based Explainable AI (XAI) techniques for model interpretability and medical image visualization
Integrated PyTest, DVC, Git, GitHub, modular pipelines, and MLOps best practices for scalable and reproducible deep learning workflows
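
Soft voting, as mentioned in the list above, averages the class probabilities predicted by several models and picks the most probable class. The sketch below is a minimal plain-Python version under assumed inputs (one list of per-sample probability rows per model); it is not the project's implementation.

```python
def soft_vote(prob_lists, weights=None):
    """Average per-model class probabilities, then take the argmax per sample.

    prob_lists: one entry per model, each a list of per-sample probability rows.
    weights: optional per-model weights; defaults to equal weighting.
    """
    n_models = len(prob_lists)
    weights = weights or [1.0] * n_models
    total = sum(weights)
    n_samples = len(prob_lists[0])
    n_classes = len(prob_lists[0][0])

    averaged = []
    for i in range(n_samples):
        row = [
            sum(w * probs[i][c] for w, probs in zip(weights, prob_lists)) / total
            for c in range(n_classes)
        ]
        averaged.append(row)

    labels = [max(range(n_classes), key=row.__getitem__) for row in averaged]
    return averaged, labels
```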

Skills Demonstrated

Deep Learning: Neural Networks, Transfer Learning, Ensemble Learning
Computer Vision: Medical Image Classification, Chest X-ray Analysis
Explainable AI: Grad-CAM Visualization
Machine Learning: Model Evaluation, Hyperparameter Optimization
Data Processing: Image Augmentation, Preprocessing, Dataset Balancing
Programming: Python, TensorFlow, Keras, NumPy, Pandas
Visualization: Matplotlib, Seaborn
MLOps: DVC, Automated Testing (PyTest), Workflow Management
Version Control: Git, GitHub
Analytical Thinking: Model Optimization, Diagnostic Performance Analysis

Goal: Deliver a production-ready medical image classification engine that combines deep learning, transfer learning, explainable AI, and ensemble learning techniques to demonstrate scalable healthcare AI workflows, strong analytical capabilities, and deployment-ready deep learning solutions.

🏡 Real Estate Price Prediction Engine — Machine Learning Pipeline Project

https://github.com/nagasantoshchavvakula/Real-Estate-Price-Prediction.git

A comprehensive regression-based machine learning pipeline designed to predict real estate property prices using advanced statistical modeling, unsupervised learning, and recommendation systems. This project simulates a real-world pricing platform for buyers, sellers, and agents to make data-driven property valuation decisions.

The project follows a complete end-to-end ML lifecycle, integrating regression modeling, clustering for market segmentation, recommendation systems, and ensemble learning for improved predictive performance.

Status: ✅ Completed
Core Stack: Python, Pandas, NumPy, Scikit-learn, XGBoost, LightGBM, Matplotlib, Seaborn, PCA (Dimensionality Reduction), PyTest, GitHub Actions (CI/CD), Streamlit, Joblib, Jupyter Notebook, Git

MLOps & Data Focus

Implemented structured data pipelines for preprocessing, feature engineering, and transformation
Performed comprehensive exploratory data analysis (EDA) to uncover pricing trends and feature relationships
Handled missing values, categorical encoding, and feature scaling for robust model performance
Applied Principal Component Analysis (PCA) for dimensionality reduction and multicollinearity handling
Built modular and reusable ML components for regression, clustering, and recommendation systems
Ensured reproducibility through organized workflows, version control, and notebook-based experimentation

Experimentation & Deployment

Trained and evaluated multiple regression models including Linear, Ridge, Lasso, ElasticNet, Random Forest, and Gradient Boosting
Performed hyperparameter tuning using GridSearchCV and RandomizedSearchCV
Designed evaluation pipelines using RMSE, MAE, and R² for regression performance comparison
Implemented clustering workflows (K-Means, Hierarchical, DBSCAN) with validation using Silhouette Score
Built recommendation systems (content-based, collaborative, hybrid) for personalized property suggestions
Developed ensemble models (Voting & Stacking) to improve predictive accuracy
Created an interactive Streamlit dashboard for real-time predictions and insights
Integrated automated testing using PyTest and CI/CD pipelines via GitHub Actions
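
For reference, the three regression metrics named above (RMSE, MAE, and R²) can be computed as follows. This is a plain-Python sketch with a hypothetical function name; the project itself presumably relies on Scikit-learn's metric functions.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, and R^2 for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
    }
```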

Skills Demonstrated

Machine Learning: Regression Modeling, Clustering, Recommendation Systems, Ensemble Learning
Data Science: EDA, Feature Engineering, Dimensionality Reduction (PCA)
Programming: Python (Pandas, NumPy, Scikit-learn)
Visualization: Matplotlib, Seaborn
MLOps: CI/CD Pipelines, Automated Testing (PyTest), Workflow Management
Deployment: Streamlit Dashboard, Model Serialization (Joblib)
Analytical Thinking: Model Evaluation, Performance Optimization, Business Insights

Goal: Deliver a production-ready real estate pricing engine that combines regression modeling, market segmentation, and recommendation systems, demonstrating scalable machine learning workflows, strong analytical capabilities, and deployment-ready solutions.

📉 Telco Customer Churn Prediction — Machine Learning Pipeline Project

https://github.com/nagasantoshchavvakula/Customer-Churn-Prediction

An end-to-end machine learning pipeline project to predict customer churn for a telecommunications company using real-world business data. The project follows the complete ML lifecycle, from exploratory data analysis and preprocessing to model development, hyperparameter tuning, and deployment-ready model serialization.

This project emphasizes reproducibility, automated testing, and CI/CD workflows, demonstrating best practices for building production-ready ML systems.

Status: ✅ Completed
Core Stack: Python, Pandas, NumPy, Scikit-learn (Preprocessing, Model Development & Evaluation), XGBoost, Matplotlib, Seaborn, SMOTE (Imbalanced-learn), PyTest, GitHub Actions (CI/CD), Joblib (Model Serialization), Jupyter Notebook, Git

Machine Learning & Analytics Focus

Conducted Exploratory Data Analysis (EDA) to understand customer behavior and churn patterns
Computed statistical summaries including distributions, correlations, and hypothesis testing
Built preprocessing pipelines for missing value handling, categorical encoding, and feature scaling
Implemented stratified train-test splitting to preserve class distribution
Developed multiple ML models: Logistic Regression, KNN, Decision Tree, Random Forest, XGBoost
Addressed class imbalance using SMOTE techniques
Evaluated models using Accuracy, Precision, Recall, F1-score, ROC-AUC, confusion matrices, and ROC curves
Performed hyperparameter tuning with GridSearchCV and RandomizedSearchCV
Selected and serialized the best-performing model and scaler for production use
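
Stratified splitting keeps the churn / non-churn ratio the same in the training and test sets. The sketch below shows the idea with index lists in plain Python; the function name and defaults are hypothetical, and in practice Scikit-learn's `train_test_split` with its `stratify` argument does this.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Split sample indices so each class keeps roughly the same share in both sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)
```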

Systems & MLOps Direction

Automated testing using PyTest
Implemented CI/CD pipelines with GitHub Actions for reproducibility and deployment readiness
Visualized model insights using Matplotlib and Seaborn
Tracked workflow versions and experiments using Git and Jupyter Notebook

Skills Demonstrated

Python programming for ML pipelines
Data preprocessing & feature engineering
Model evaluation & hyperparameter optimization
Ensemble learning with XGBoost
Imbalanced data handling and statistical analysis
CI/CD for ML workflows
Model serialization and deployment readiness

Goal: Build a production-ready ML pipeline that predicts customer churn, provides actionable business insights, and demonstrates reproducible, automated, and deployment-ready ML workflows.

🤖 ML Lifecycle & MLOps Sentiment Analysis System — End-to-End NLP Pipeline

https://github.com/nagasantoshchavvakula/Sentiment-Analysis-MLOps

An end-to-end Machine Learning Lifecycle (MLOps) project for sentiment analysis using HuggingFace transformer models. The project demonstrates industry-standard practices including data versioning, experiment tracking, automated CI/CD pipelines, and cloud-ready deployment. Transfer learning is applied to fine-tune a pre-trained NLP model for binary sentiment classification, ensuring scalable, reproducible, and production-ready ML workflows.

Status: ⚙️ In Progress
Core Stack: Python, HuggingFace Transformers (DistilBERT), PyTorch, Scikit-learn, DVC, MLflow, Flask, Docker, GitHub Actions (CI/CD)

MLOps & Data Focus

Implemented data lifecycle management using DVC for dataset and model artifact versioning
Built preprocessing pipelines for text cleaning and tokenization
Performed exploratory data analysis (EDA) on text data
Fine-tuned DistilBERT for sentiment classification
Developed reproducible ML pipelines with parameter tracking and dependency management

Experimentation & Deployment

Integrated MLflow for experiment tracking, hyperparameter logging, and model versioning
Designed evaluation workflows including accuracy, precision, recall, F1-score, and confusion matrix
Built RESTful Flask API for real-time and batch sentiment predictions
Created automated CI/CD pipelines using GitHub Actions for testing, training, and deployment
Containerized the application using Docker for consistent deployments
Implemented monitoring-ready endpoints for production health checks and model performance tracking
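
The evaluation metrics listed above all derive from the four cells of a binary confusion matrix. A minimal sketch follows, assuming labels encoded as 0 and 1; the function name is hypothetical and the code is not taken from the repository.

```python
def binary_classification_report(y_true, y_pred, positive=1):
    """Compute confusion-matrix counts and the metrics derived from them."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```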

Skills Demonstrated

Machine Learning & NLP: Transfer Learning, Sentiment Analysis, Transformer Models
MLOps Tools: DVC, MLflow, CI/CD pipelines
Programming: Python
Deployment & DevOps: Flask API, Docker, GitHub Actions
Data Engineering: Versioning, Pipeline Orchestration
Cloud & Production Concepts: Model Serving, Reproducibility

Goal: Deliver an industry-standard, production-ready MLOps pipeline for NLP sentiment analysis that demonstrates best practices in experiment tracking, deployment, and reproducibility.

🤖 Introduction to Machine Learning — Core ML Implementation Project

https://github.com/nagasantoshchavvakula/Intro_ML_Starter_Code_Implementation

A foundational machine learning implementation project designed to demonstrate the core components of a typical ML workflow, including data preprocessing, dataset splitting, model training, prediction, and evaluation.

The project focuses on building reusable Python functions that implement key machine learning operations using NumPy and Scikit-learn, providing a hands-on understanding of how regression and classification models are trained and evaluated in real-world data science pipelines.

This project emphasizes clean data preparation, modular ML pipeline design, and reliable model evaluation, reflecting practical machine learning development workflows used in analytics and AI systems.

Status: ✅ Completed
Core Stack: Python, NumPy, scikit-learn, pytest, Git, GitHub

Machine Learning Pipeline Focus

Implemented feature normalization using min–max scaling to rescale dataset inputs to a common range
Designed flexible missing value imputation strategies including mean, median, and zero replacement
Built dataset train-test splitting functions to support proper model validation
Trained Linear Regression models for continuous value prediction tasks
Developed Logistic Regression classifiers for binary classification problems
Created reusable prediction functions to generate outputs on unseen data
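
The imputation and min-max scaling steps above might look like the following. This sketch uses plain Python lists with `None` marking a missing value, which is an assumption; the repository works with NumPy arrays, and these function names are hypothetical.

```python
import statistics


def impute_missing(values, strategy="mean"):
    """Replace None entries with the mean, the median, or zero."""
    present = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(present)
    elif strategy == "median":
        fill = statistics.median(present)
    elif strategy == "zero":
        fill = 0.0
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

Imputing before scaling matters here: `min` and `max` cannot be computed on a column that still contains `None`.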

Model Evaluation & Validation

Implemented regression evaluation using Mean Squared Error (MSE)
Calculated classification performance using Accuracy metrics
Validated ML pipeline functionality using automated unit testing with pytest
Ensured reproducibility and correctness across preprocessing, training, and prediction stages

Engineering & Development Practices

Developed modular Python functions for reusable ML workflows
Applied automated testing to validate model pipelines
Used Git and GitHub for version control and project management
Maintained clean documentation with structured function docstrings and examples

Skills Demonstrated

Python programming for machine learning
Data preprocessing and feature scaling
Regression and classification model development
Model performance evaluation and testing
Building reliable and modular ML pipelines

Goal: Demonstrate a practical understanding of machine learning fundamentals by implementing a complete ML workflow from data preprocessing to model evaluation using industry-standard Python libraries.

🛒 Ecommerce Fraud Detection — End-to-End Pipeline

https://github.com/nagasantoshchavvakula/Ecommerce-Fraud-Detection-End-to-End-Data-Pipeline

A production-style data engineering project that implements an automated pipeline for detecting fraudulent e-commerce transactions. The system processes raw transaction data, performs cleaning and feature engineering, and loads analytics-ready datasets into a MySQL analytics layer to support fraud monitoring and BI dashboards.

This project simulates a real-world enterprise data pipeline, incorporating ETL orchestration, modular workflow design, and monitoring to ensure reliable processing of transactional data used for fraud analysis.

Status: ✅ Completed
Core Stack: Python (Pandas, SQLAlchemy), Prefect (Workflow Orchestration), MySQL (Staging & Analytics), SQL (Analytical Queries), Data Engineering (ETL Pipelines, Feature Engineering, Data Modeling)

Data Engineering Focus

Designed an automated ETL pipeline to ingest, transform, and load e-commerce transaction data
Processed raw CSV datasets and stored them in MySQL staging tables for controlled transformation
Built modular ETL tasks (Extract → Transform → Load) orchestrated using Prefect workflows
Developed analytics-ready datasets optimized for fraud detection queries and BI dashboards
Implemented data validation, schema standardization, and feature engineering during transformation

Fraud Analytics & Feature Engineering

Engineered fraud detection features such as promo misuse, device-location mismatch, and transaction anomalies
Generated key fraud metrics including fraud rate, suspicious user patterns, and high-risk country indicators
Designed SQL analytical queries to detect abnormal transaction behaviors and high-risk merchant categories
Created aggregated KPIs enabling drill-down analysis at the transaction and user levels
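
A fraud-rate KPI of the kind described above can be expressed as a single aggregate query. The sketch below uses an in-memory SQLite database from the standard library as a stand-in for the MySQL analytics layer; the table and column names are hypothetical.

```python
import sqlite3


def fraud_rate_by_country(rows):
    """Load (country, amount, is_fraud) rows and aggregate fraud rate per country."""
    conn = sqlite3.connect(":memory:")  # stand-in for the MySQL analytics schema
    conn.execute(
        "CREATE TABLE transactions ("
        "txn_id INTEGER PRIMARY KEY, country TEXT, amount REAL, is_fraud INTEGER)"
    )
    conn.executemany(
        "INSERT INTO transactions (country, amount, is_fraud) VALUES (?, ?, ?)",
        rows,
    )
    query = """
        SELECT country,
               COUNT(*)                                 AS txn_count,
               SUM(is_fraud)                            AS fraud_count,
               ROUND(1.0 * SUM(is_fraud) / COUNT(*), 4) AS fraud_rate
        FROM transactions
        GROUP BY country
        ORDER BY fraud_rate DESC, country
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result
```

Multiplying by `1.0` forces floating-point division, so integer counts do not truncate the rate to zero.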

Workflow Automation & Monitoring

Automated pipeline orchestration using Prefect with scheduling, retries, and workflow logging
Implemented modular task-based ETL architecture for scalable data processing
Enabled real-time monitoring and failure recovery using Prefect UI
Designed pipelines to support reliable batch processing for large transaction datasets

Skills Demonstrated

Python for data engineering (Pandas, SQLAlchemy)
ETL pipeline development and workflow orchestration
Prefect for scheduling, monitoring, and pipeline automation
MySQL database design (staging and analytics schemas)
SQL analytics and fraud detection metrics
Designing BI-ready datasets for reporting and dashboards

Goal: Build a scalable data engineering pipeline for fraud detection analytics, demonstrating how automated ETL workflows, feature engineering, and SQL analytics can transform raw transactional data into actionable fraud insights for business intelligence systems.

📱 YouTube–TikTok Short Form Video Analytics — End-to-End Data Analytics Pipeline

https://github.com/nagasantoshchavvakula/YouTube-TikTok-Short-Form-Video-Analytics

An end-to-end data analytics pipeline designed to analyze and visualize engagement trends from short-form video platforms such as YouTube Shorts and TikTok. The project demonstrates the complete analytics lifecycle — from raw data ingestion to interactive dashboard visualization — using MySQL, Python, DVC, and Streamlit.

The system focuses on transforming raw social media datasets into actionable engagement insights, enabling exploration of metrics such as views, likes, shares, comments, and trending patterns through interactive dashboards.

Status: ✅ Completed
Core Stack: Python, Pandas, NumPy, MySQL, Streamlit, Plotly, Matplotlib, DVC, Git, GitHub, Virtualenv, Power BI

Data Engineering & Analytics Focus

Designed an automated pipeline to ingest raw CSV datasets from Kaggle into a structured MySQL analytics database
Performed data cleaning, transformation, and feature engineering using Python (Pandas)
Built modular Python scripts for ingestion, processing, and analysis workflows
Conducted exploratory data analysis (EDA) to identify engagement patterns across short-form content
Generated analytics-ready datasets for dashboard visualization and reporting
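
One way to turn raw counts into a comparable engagement metric is an engagement rate per platform. The definition used below, (likes + comments + shares) / views, is an assumption made for illustration and may differ from the metric used in the project; the field names are hypothetical as well.

```python
def summarize_by_platform(videos):
    """Aggregate views and interactions per platform and derive an engagement rate."""
    totals = {}
    for video in videos:
        bucket = totals.setdefault(
            video["platform"], {"videos": 0, "views": 0, "interactions": 0}
        )
        bucket["videos"] += 1
        bucket["views"] += video["views"]
        bucket["interactions"] += (
            video["likes"] + video["comments"] + video["shares"]
        )
    return {
        platform: {
            "videos": b["videos"],
            "views": b["views"],
            # Assumed definition: total interactions divided by total views.
            "engagement_rate": b["interactions"] / b["views"] if b["views"] else 0.0,
        }
        for platform, b in totals.items()
    }
```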

Dashboard & Visualization Layer

Developed an interactive Streamlit dashboard to visualize engagement trends and performance metrics
Built a Power BI dashboard connected to MySQL, enabling deeper business insights through KPI tracking and visual analytics
Designed visualizations including trend lines, engagement comparisons, and category-based insights

Data Management & Reproducibility

Implemented Data Version Control (DVC) to track dataset versions and ensure reproducible data workflows
Organized project structure with modular scripts and reproducible pipelines
Managed source code, collaboration, and versioning using Git and GitHub

Engineering Practices

Modular pipeline architecture separating ingestion, processing, and analysis layers
Version-controlled datasets and scripts for reproducible analytics workflows
Designed the project to scale for larger social media datasets and additional analytics dashboards

Goal: Demonstrate an end-to-end data analytics workflow that integrates data engineering, analysis, and dashboard visualization, transforming raw social media datasets into actionable engagement insights.

🚗 Sales Performance Optimization Pipeline — End-to-End ETL & BI Analytics System

https://github.com/nagasantoshchavvakula/bmw-car-sales_Performance-Optimization-Pipeline

An end-to-end data engineering and business intelligence project designed to automate the extraction, transformation, and loading (ETL) of vehicle sales data into a structured MySQL database while generating interactive analytics dashboards in Power BI.

The project focuses on identifying key regional and vehicle factors that drive high sales performance, enabling data-driven decision-making through automated data pipelines and real-time business intelligence reporting.

Status: ✅ Completed
Core Stack: Python (Pandas, SQLAlchemy), Prefect, MySQL, Power BI, Excel

Data Engineering & ETL Pipeline

Designed and implemented a complete ETL pipeline using Python, Prefect, and MySQL for automated data ingestion and transformation
Built modular Prefect workflows using @task and @flow decorators to manage pipeline dependencies and data processing steps
Performed comprehensive data auditing and schema design to ensure clean, consistent, and analytics-ready database structures
Automated loading of transformed datasets into a centralized MySQL data warehouse
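
The extract, transform, and load steps above can be sketched as three plain functions. In the project each step would be a Prefect task composed inside a flow; the sketch below deliberately omits Prefect and uses SQLite in place of MySQL so it runs with the standard library alone. Column names and cleaning rules are hypothetical.

```python
import csv
import io
import sqlite3


def extract(csv_text):
    """Parse raw CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Normalize region names and drop rows with missing region or units."""
    cleaned = []
    for row in rows:
        region = row["region"].strip().title()
        units = row["units_sold"].strip()
        if not region or not units:
            continue
        cleaned.append((row["model"].strip(), region, int(units)))
    return cleaned


def load(records, conn):
    """Insert cleaned records into the sales table and return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(model TEXT, region TEXT, units_sold INTEGER)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    conn.commit()
    return len(records)


def run_pipeline(csv_text, conn):
    """Chain the three steps; in Prefect this function would be the flow."""
    return load(transform(extract(csv_text)), conn)
```

Keeping each step a pure function of its input makes the steps independently testable before any orchestrator is added.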

Analytics & Business Intelligence

Developed an interactive Power BI dashboard connected to the MySQL database for real-time sales analytics
Created KPI visualizations to track sales trends by region, vehicle model, and sales classification
Enabled business stakeholders to quickly identify top-performing regions and vehicle categories

Data Analysis Focus

Identified key drivers behind “High” sales classifications through exploratory data analysis
Applied structured data profiling and preprocessing using Excel and Python
Delivered actionable insights supporting sales performance optimization and strategic decision-making

Skills Demonstrated

Data pipeline design and ETL automation
Workflow orchestration with Prefect
Relational database schema design and SQL integration
Business intelligence dashboard development
Data auditing and structured data transformation

Goal: Build a scalable data pipeline that automates sales data processing while enabling business stakeholders to analyze regional and product-level sales performance through interactive dashboards.

💰 Personal Finance Tracker & Investment Portfolio Analyzer — Financial Analytics System

https://github.com/nagasantoshchavvakula/Personal-Finance-Tracker-and-Investment-Portfolio-Analyzer

A Python-based personal finance analytics system designed to track expenses, manage budgets, and analyze investment portfolios through automated data workflows. The project demonstrates the integration of Python programming, financial modeling, and workflow orchestration to transform raw financial data into actionable insights.

This system focuses on financial data analysis, automation, and reproducible data pipelines, enabling users to monitor spending behavior, evaluate investment performance, and generate analytical reports for better financial decision-making.

Status: ✅ Completed
Core Stack: Python, Object-Oriented Programming (OOP), NumPy, pandas, Matplotlib, Prefect, DVC, Git

Financial Analytics & Data Processing Focus

Tracked financial transactions, expenses, and budget allocations using structured data models
Performed statistical and time-series analysis on spending patterns and investment performance
Processed financial datasets using Pandas and NumPy for numerical and analytical computations
Generated reports and visualizations to evaluate budget adherence and portfolio growth
Extracted insights on spending trends and investment returns through analytical workflows

System Architecture & Engineering Practices

Developed modular Python architecture using Object-Oriented Programming (OOP) principles
Designed reusable classes for transactions, accounts, and investment portfolios
Implemented data validation and structured financial data models
Built reproducible data pipelines using Data Version Control (DVC)
Automated scheduled financial analysis using Prefect workflow orchestration
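
A rough idea of what reusable, validated classes for transactions, budgets, and holdings could look like is sketched below with dataclasses. The class names, fields, and the simple return formula are all hypothetical and are not taken from the repository.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class Transaction:
    on: date
    category: str
    amount: float  # expense amount; must be positive

    def __post_init__(self):
        if self.amount <= 0:
            raise ValueError("amount must be positive")


@dataclass
class Budget:
    limits: dict  # category -> spending limit
    transactions: list = field(default_factory=list)

    def add(self, txn):
        self.transactions.append(txn)

    def spent(self, category):
        return sum(t.amount for t in self.transactions if t.category == category)

    def remaining(self, category):
        return self.limits.get(category, 0.0) - self.spent(category)


@dataclass(frozen=True)
class Holding:
    symbol: str
    quantity: float
    cost_basis: float  # purchase price per unit


def portfolio_return(holdings, prices):
    """Simple return: (current value - total cost) / total cost."""
    cost = sum(h.quantity * h.cost_basis for h in holdings)
    value = sum(h.quantity * prices[h.symbol] for h in holdings)
    return (value - cost) / cost
```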

Skills Demonstrated

Python programming and OOP system design
Financial data analysis and modeling
Pandas and NumPy for structured data processing
Workflow automation using Prefect
Reproducible data pipelines with DVC
Data visualization and reporting with Matplotlib
Git-based version control and project organization

Goal: Demonstrate the design of an automated financial analytics system that combines data engineering, statistical analysis, and workflow automation to support personal finance monitoring and investment decision-making.

📊 Exploratory Data Analysis & Visualization of Student Performance — Interactive Analytics Dashboard

https://github.com/nagasantoshchavvakula/EDA_Student_Performance

An end-to-end data analysis and visualization project focused on exploring the factors influencing student exam performance using a Kaggle dataset. The project demonstrates how Python-based data analytics workflows can transform raw educational datasets into meaningful insights through exploratory data analysis, statistical visualization, and interactive dashboards.

The project integrates data preprocessing, statistical analysis, and interactive visualization to enable dynamic exploration of student performance metrics, helping identify patterns across demographic, socioeconomic, and academic variables.

Status: ✅ Completed
Core Stack: Python, Pandas, NumPy, Matplotlib, Seaborn, Plotly, Streamlit, GitHub, Cloud Deployment

Data Analysis & EDA Focus

Cleaned and preprocessed the Kaggle Student Performance in Exams dataset by handling missing values, duplicates, and data inconsistencies
Conducted exploratory data analysis to uncover patterns and relationships between exam scores and demographic variables
Performed statistical analysis and correlation analysis across math, reading, and writing scores
Engineered meaningful features and calculated key performance indicators (KPIs) for student performance evaluation
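
The cleaning, feature engineering, and correlation steps above might look roughly like the following sketch. It runs on a tiny in-memory sample rather than the real dataset. The column names (`math score`, `reading score`, `writing score`) are assumed, and the helpers `clean_scores` and `add_average_score` are hypothetical, not the project's actual functions.

```python
import pandas as pd

# Assumed column names; adjust to match the actual CSV header.
SCORE_COLUMNS = ["math score", "reading score", "writing score"]


def clean_scores(df: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicate rows and rows missing any exam score."""
    cleaned = df.drop_duplicates().dropna(subset=SCORE_COLUMNS)
    return cleaned.reset_index(drop=True)


def add_average_score(df: pd.DataFrame) -> pd.DataFrame:
    """Add a simple engineered feature: the mean of the three exam scores."""
    out = df.copy()
    out["average score"] = out[SCORE_COLUMNS].mean(axis=1)
    return out


# Tiny illustrative sample (made-up values, not real data).
sample = pd.DataFrame({
    "gender": ["female", "male", "male", "female", "male"],
    "math score": [72, 60, 60, 90, None],
    "reading score": [80, 55, 55, 95, 70],
    "writing score": [78, 50, 50, 93, 65],
})

cleaned = add_average_score(clean_scores(sample))
correlations = cleaned[SCORE_COLUMNS].corr()  # pairwise Pearson correlation

print(cleaned)
print(correlations)
```

In this sample, one duplicate row and one row with a missing math score are removed before the average and the correlation matrix are computed.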

Data Visualization & Insights

Built multiple visualization types including histograms, bar charts, scatter plots, box plots, and heatmaps
Used Seaborn, Matplotlib, and Plotly to visually communicate trends, correlations, and distribution patterns
Identified key performance drivers such as parental education level, lunch type, and test preparation courses

Interactive Dashboard Development

Developed a Streamlit-based interactive analytics dashboard for dynamic exploration of student performance data
Implemented filters and KPIs allowing users to explore results by gender, race/ethnicity, education level, and exam category
Designed an intuitive interface to make complex data insights accessible for non-technical users

Deployment & Data Application Engineering

Deployed the Streamlit application using Streamlit Community Cloud
Integrated GitHub for version control and automated deployment workflows
Enabled cloud-based access for interactive exploration of insights from anywhere

Skills Demonstrated

Exploratory Data Analysis (EDA)
Data Cleaning & Preprocessing
Statistical Analysis & Correlation Analysis
Interactive Data Visualization
Dashboard Development with Streamlit
Cloud Deployment & Version Control

Goal: Demonstrate the ability to perform end-to-end exploratory data analysis, build interactive dashboards, and deploy Python-based data applications that transform raw datasets into accessible and actionable insights.

🔄 DVC CSV Tracker — Data Version Control with Git & DVC

https://github.com/nagasantoshchavvakula/Data_Version_Control_With-DVC_And_Git

A practical data version control project demonstrating how to integrate DVC (Data Version Control) with Git to manage and track changes in structured CSV datasets. The project highlights how modern data teams maintain reproducibility, data integrity, and collaborative workflows by versioning datasets alongside code.

This project focuses on data versioning workflows commonly used in MLOps and data engineering pipelines, ensuring that dataset changes are traceable, reproducible, and synchronized with source code repositories.

Status: ✅ Completed
Core Stack: Git, GitHub, DVC (Data Version Control), CSV Data Management, MLOps Fundamentals, Reproducibility, Command-Line Tools

Data Versioning & MLOps Focus

Initialized and configured DVC within a Git repository to manage dataset versioning
Tracked structured CSV datasets using DVC data tracking mechanisms
Maintained reproducible dataset history while separating data artifacts from source code
Demonstrated reproducible workflows commonly used in machine learning and data science pipelines

Workflow & Engineering Practices

Used DVC commands such as dvc init and dvc add to track dataset changes
Managed dataset metadata files (.dvc) and repository configurations
Committed DVC metadata and configuration files to Git for version control
Simulated dataset updates and maintained version history through Git commits
Pushed repository updates to GitHub to enable collaborative data project workflows

Skills Demonstrated

Git & GitHub version control
Data Version Control (DVC)
Reproducible data science workflows
CSV dataset management
Command-line-based data engineering workflows
Collaboration practices in data-driven projects

Goal: Demonstrate how to build reproducible data science workflows by integrating Git and DVC for dataset versioning, enabling scalable collaboration and reliable data pipeline management.

📊 Student Performance Analysis — Python Data Exploration Project

https://github.com/nagasantoshchavvakula/Student-Performance-Analysis

A Python-based data analysis project designed to explore and summarize student performance datasets using Pandas. The project demonstrates how structured CSV data can be programmatically processed to extract meaningful insights through statistical analysis and exploratory data techniques.

This project focuses on data exploration, statistical computation, and structured data handling, highlighting how Python can be used to quickly analyze datasets and generate performance insights for decision-making.

Status: ✅ Completed
Core Stack: Python, Pandas, CSV Handling, Data Analysis, Descriptive Statistics, Exception Handling

Data Analysis Focus

Read and processed structured student datasets from CSV files using Pandas
Performed exploratory data analysis (EDA) on student attributes such as scores and age
Calculated statistical metrics including mean, median, standard deviation, minimum, and maximum values
Extracted key information such as student names and previewed top records for quick dataset inspection
Generated descriptive summaries to identify performance patterns within the dataset

Data Engineering & Code Quality Practices

Implemented structured data handling using Pandas DataFrames
Added robust exception handling for missing files and column inconsistencies
Designed reusable analysis scripts for quick dataset preview and insight generation
Demonstrated practical workflows for cleaning and summarizing real-world tabular datasets
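
A minimal sketch of the kind of script described above: load a CSV with Pandas, guard against a missing file or column, and compute descriptive statistics. The function name `summarize_scores`, the column name `score`, and the sample rows are hypothetical and chosen only for illustration.

```python
import io

import pandas as pd


def summarize_scores(source, column: str = "score"):
    """Return descriptive statistics for one numeric column, or None on failure."""
    try:
        df = pd.read_csv(source)
    except FileNotFoundError:
        print(f"File not found: {source}")
        return None
    if column not in df.columns:
        print(f"Missing column: {column!r}")
        return None
    values = df[column]
    return {
        "mean": values.mean(),
        "median": values.median(),
        "std": values.std(),  # sample standard deviation (ddof=1)
        "min": values.min(),
        "max": values.max(),
    }


# Made-up rows standing in for a real student CSV file.
csv_text = "name,age,score\nAsha,20,82\nBen,21,74\nChen,19,90\n"
stats = summarize_scores(io.StringIO(csv_text))
print(stats)
```

Returning `None` on failure keeps the script from crashing on bad input; a larger project might prefer raising a custom exception so callers cannot ignore the error.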

Skills Demonstrated

Python programming for data analysis
Pandas DataFrame manipulation
CSV data ingestion and transformation
Descriptive statistical analysis
Error handling and robust data processing

Goal: Demonstrate the ability to analyze structured datasets using Python and Pandas, perform statistical analysis, and extract actionable insights through efficient data exploration workflows.

🚀 Full-Stack Projects & Internship Experience — Role: Full Stack Web Development Intern

Gained hands-on experience in full-stack development using Spring Boot & React
Demonstrated strong problem-solving, technical skills, and project execution
Awarded a Certificate and Letter of Recommendation (📄 View Certificate & LOR)

Web Development Projects:

🔐 Secure User Authentication System

Built a JWT-based authentication system with login & registration
Implemented BCrypt password hashing and secure session handling
Designed Spring Security-based protected APIs
Integrated React frontend with protected routes

https://github.com/nagasantoshchavvakula/Secure-User-Authentication-System.git

👨‍💼 Employee Management System (EMS)

Developed a full-stack CRUD application with JWT authentication
Enabled admins to manage employee records securely
Implemented Spring Boot REST APIs + React frontend
Added validation and role-based access control

https://github.com/nagasantoshchavvakula/Employee-Management-System.git

🌐 Social Media Application

Built a full-stack app with posts, likes, comments, and follow system
Implemented JWT authentication and secure REST APIs
Developed responsive UI with React and backend with Spring Boot

https://github.com/nagasantoshchavvakula/Social-Media-App.git

💬 Real-Time Chat Application

Developed a real-time chat system using WebSockets
Added support for multiple chat rooms and persistent chat history
Implemented JWT-based authentication
Tech: Spring Boot, WebSocket, React, SockJS, StompJS

https://github.com/nagasantoshchavvakula/Real-Time-Chat-Application.git

🎯 Career Focus

I enjoy building data-driven systems that combine analytics, engineering, artificial intelligence, and machine learning to solve real-world business problems.

Current interests include:

Data analytics & visualization
Data engineering pipelines
Machine learning & AI applications
Deep learning & neural networks
MLOps and scalable ML systems
Cloud-based analytics and AI platforms

📂 Projects & Portfolio

Here are some of my featured projects demonstrating expertise in Data Analytics, Machine Learning, Deep Learning, MLOps, Data Engineering, and Full-Stack Development:

🤖 AI, Machine Learning & Deep Learning

Image Classification for Medical Diagnosis – End-to-end deep learning pipeline for pneumonia detection using CNNs, Transfer Learning (MobileNetV2, VGG16, ResNet50), Grad-CAM explainability, ensemble learning, DVC, and CI/CD workflows.
Document QA ChatBot – Retrieval-Augmented Generation (RAG) application leveraging Groq Llama3, LangChain, HuggingFace Embeddings, and FAISS for intelligent PDF document question answering, semantic search, context-aware response generation, and enterprise knowledge retrieval.
Real Estate Price Prediction Engine – Production-ready ML system combining regression models, clustering, recommendation systems, ensemble learning, Streamlit deployment, and automated testing.
Telco Customer Churn Prediction – End-to-end ML pipeline using XGBoost, SMOTE, hyperparameter tuning, CI/CD, and deployment-ready model serialization.
ML Lifecycle & MLOps Sentiment Analysis System – Transformer-based NLP solution using DistilBERT, DVC, MLflow, Docker, Flask APIs, and GitHub Actions.

⚙️ Data Engineering & Analytics

Ecommerce Fraud Detection Pipeline – Automated ETL pipeline using Python, Prefect, MySQL, and fraud analytics for detecting suspicious e-commerce transactions.
YouTube–TikTok Short Form Video Analytics – End-to-end analytics platform with MySQL, DVC, Streamlit, and Power BI dashboards for social media engagement analysis.
Sales Performance Optimization Pipeline – Automated ETL and BI reporting solution leveraging Python, Prefect, MySQL, and Power BI.
DVC CSV Tracker – Demonstrates reproducible data science workflows using Git and DVC for dataset versioning and management.

📊 Data Analytics & Financial Systems

Personal Finance Tracker & Investment Portfolio Analyzer – Python-based financial analytics system using OOP, NumPy, Pandas, Prefect, and DVC for budgeting and portfolio analysis.
Student Performance Analysis – Exploratory data analysis and statistical insights generation using Python and Pandas.
Student Performance Interactive Dashboard – Streamlit-powered analytics dashboard featuring interactive visualizations and educational performance insights.

🌐 Full-Stack Development

Employee Management System – Full-stack CRUD application with Spring Boot, React, JWT authentication, validation, and role-based access control.
Secure User Authentication System – Spring Security and JWT-based authentication platform featuring secure login, registration, BCrypt password hashing, and protected REST APIs.
Social Media Application – Full-stack social networking platform built with Spring Boot and React, supporting posts, likes, comments, user profiles, follow/unfollow functionality, JWT authentication, and secure REST APIs.
Real-Time Chat Application – Real-time messaging platform using Spring Boot, WebSockets, React, SockJS, and STOMP, featuring multiple chat rooms, persistent chat history, live communication, and JWT-based authentication.

📊 GitHub Stats

⭐ Thanks for visiting my GitHub! Feel free to explore my projects and connect if you'd like to collaborate.
