- Passionate about Data Science, Machine Learning, Deep Learning, Generative AI, Computer Vision, NLP, MLOps, Data Engineering, and Full-Stack Development
- Currently building end-to-end AI, Machine Learning, Deep Learning, Generative AI (RAG), Data Engineering, and Analytics solutions using modern tools, frameworks, and cloud-ready architectures
- Continuously expanding expertise in Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), Transformer-based NLP, Transfer Learning, MLOps, Cloud AI Services, and Scalable Data Pipelines
- Open to collaborating on Machine Learning, Deep Learning, Generative AI, Computer Vision, NLP, Data Engineering, ETL Pipelines, Analytics Platforms, and Real-World AI Applications
- Experienced in developing projects involving Predictive Analytics, Recommendation Systems, Fraud Detection, Medical Image Classification, Document Intelligence, Sentiment Analysis, Workflow Automation, and Interactive Dashboards
- Skilled in Python, SQL, Scikit-learn, TensorFlow, PyTorch, Hugging Face, LangChain, Streamlit, Power BI, Tableau, MLflow, DVC, Docker, Git, GitHub, and Cloud-Based Data & AI Technologies
- Strong focus on building scalable, production-ready applications that combine data engineering, machine learning, deep learning, MLOps, and software engineering best practices
- Aspiring to contribute to innovative teams solving complex business problems through AI-driven, data-centric solutions
- Master of Science in Computer Science
University of Central Missouri, USA | Jan 2023 - May 2024
- Bachelor of Technology in Electrical and Electronics Engineering
JB Institute of Engineering and Technology, India | Aug 2015 - July 2019
- AWS Machine Learning Engineering
- AWS Cloud Practitioner
- Generative AI & Large Language Models (LLMs) (Retrieval-Augmented Generation (RAG), Prompt Engineering, LangChain, Vector Databases, Semantic Search, Document Intelligence, AI-Powered Knowledge Retrieval)
- Data Analytics & Business Intelligence (Exploratory Data Analysis, Statistical Analysis, KPI Reporting, Dashboard Development, Business Insights)
- Machine Learning & Predictive Modeling (Regression, Classification, Clustering, Recommendation Systems, Fraud Detection, Churn Prediction)
- Deep Learning & Computer Vision (CNNs, DNNs, Transfer Learning, Medical Image Classification, Ensemble Learning, Explainable AI)
- Natural Language Processing (NLP) (Sentiment Analysis, Transformer Models, Hugging Face, Text Preprocessing, Information Retrieval, Context-Aware AI Systems)
- Data Engineering & ETL Pipelines (Data Ingestion, Transformation, Workflow Automation, Batch Processing, Data Pipelines, Workflow Orchestration)
- MLOps & ML Lifecycle Management (DVC, MLflow, CI/CD Pipelines, Automated Testing, Experiment Tracking, Model Versioning, Reproducibility)
- Model Optimization & Evaluation (Hyperparameter Tuning, Cross Validation, ROC-AUC, Confusion Matrix, Performance Analysis)
- Cloud & Scalable AI Systems (AWS, Azure, Docker, Streamlit, Cloud-Based ML Workflows, Deployment-Ready AI Applications)
- Data Visualization & Interactive Dashboards (Power BI, Tableau, Plotly, Streamlit, Matplotlib, Seaborn)
- Software Engineering & Application Development (Python, Java, Spring Boot, React, REST APIs, Git, GitHub Actions, Modular Architecture, Workflow Automation)
https://github.com/nagasantoshchavvakula/Document-QA-ChatBot.git
A comprehensive Retrieval-Augmented Generation (RAG) based Document Question Answering application designed to enable users to interact with PDF documents using natural language. The system combines Large Language Models (LLMs), vector databases, semantic search, document processing pipelines, and Generative AI techniques to deliver accurate, context-aware responses directly from uploaded documents.
This project simulates a real-world Enterprise Knowledge Assistant capable of extracting, indexing, retrieving, and generating insights from unstructured document data. By integrating modern AI frameworks and vector search technologies, the application demonstrates how organizations can build intelligent document understanding systems for research, compliance, legal analysis, customer support, and enterprise knowledge management use cases.
The project follows a complete end-to-end Generative AI and Retrieval-Augmented Generation lifecycle, integrating document ingestion, text extraction, text chunking, embedding generation, vector indexing, semantic retrieval, prompt engineering, LLM-powered response generation, environment management, debugging, and deployment-ready application development.
Status: In Progress
Core Stack: Python, Streamlit, LangChain, Groq Llama3, HuggingFace Embeddings, FAISS, PyPDFLoader, Sentence Transformers, python-dotenv, Git, GitHub
Generative AI, NLP & Document Intelligence Focus
- Built an end-to-end Document QA ChatBot using Retrieval-Augmented Generation (RAG) architecture
- Implemented PDF ingestion, text extraction, chunking, and semantic retrieval using PyPDFLoader, LangChain, and FAISS
- Generated vector embeddings using HuggingFace BAAI/bge-small-en-v1.5 for efficient similarity search
- Developed an interactive Streamlit interface for document-based question answering
- Engineered prompt templates and retrieval workflows to deliver accurate, context-aware responses
- Applied secure configuration management using python-dotenv and environment variables
LLM Integration & AI Application Development
- Integrated Groq-hosted Llama3-8B-8192 for document-grounded answer generation
- Built semantic search pipelines using LangChain Retrieval Chains, vector search, and prompt engineering
- Optimized document retrieval through chunking strategies and embedding-based nearest-neighbor search
- Developed scalable PDF processing and AI-powered knowledge retrieval workflows
- Followed software engineering best practices including modular architecture, dependency management, version control, and reproducible development workflows
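The chunking and nearest-neighbor retrieval steps listed above can be illustrated with a minimal, dependency-free sketch. It is not the project's code: it does not use LangChain, FAISS, or the bge embedding model, and the word-count `embed` function, the default chunk sizes, and every function name here are hypothetical stand-ins chosen only to show the shape of the technique.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into character windows where neighbours share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks, start, step = [], 0, chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


def embed(text: str) -> Counter:
    """Toy embedding (word counts); a real pipeline would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query (brute-force nearest neighbours)."""
    query_vec = embed(query)
    return sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)[:k]
```

Overlapping windows keep a sentence that straddles a chunk boundary retrievable from at least one chunk; a vector index such as FAISS replaces the brute-force sort once the number of chunks grows.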
Problem Solving & Engineering Highlights
- Resolved LangChain version compatibility, Python 3.13 dependency conflicts, and Streamlit configuration issues
- Migrated the project to Python 3.11 for stable package compatibility
- Implemented secure API key management by moving credentials from source code to environment variables
- Improved maintainability through structured project organization and dependency pinning
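Moving credentials out of source code usually comes down to reading them from the environment and failing loudly when they are missing. The sketch below assumes a variable named `GROQ_API_KEY` and a helper called `load_api_key`; both names are illustrative, not taken from the repository. With python-dotenv, a `.env` file would be loaded into the environment before this function runs.

```python
import os


def load_api_key(name: str = "GROQ_API_KEY") -> str:
    """Read a secret from the environment instead of hard-coding it."""
    key = os.environ.get(name, "").strip()
    if not key:
        # Fail early with a clear message rather than sending an empty key to the API.
        raise RuntimeError(f"{name} is not set; export it or add it to a .env file")
    return key
```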
Skills Demonstrated
- Generative AI: RAG, Prompt Engineering, LLM Integration
- NLP: Semantic Search, Information Retrieval, Document Understanding
- LLMs: Groq Llama3, Context-Aware Response Generation
- Vector Databases: FAISS, Embedding-Based Search
- AI Frameworks: LangChain, HuggingFace Embeddings
- Document Processing: PDF Parsing, Text Chunking
- Development: Python, Streamlit, API Integration
- DevOps & Collaboration: Virtual Environments, Git, GitHub, Dependency Management
Goal: Deliver a production-ready Document Intelligence solution that combines Retrieval-Augmented Generation, semantic search, vector databases, and Large Language Models to demonstrate real-world AI engineering, knowledge retrieval, and enterprise-scale document understanding capabilities.
https://github.com/nagasantoshchavvakula/Image-Classification-For-Medical-Diagnosis.git
A comprehensive deep learning-based medical image classification pipeline designed to detect pneumonia from chest X-ray images using advanced neural networks, transfer learning architectures, explainable AI, and ensemble learning techniques. This project simulates a real-world healthcare diagnostic system that assists medical professionals in making data-driven clinical predictions.
The project follows a complete end-to-end deep learning lifecycle, integrating image preprocessing, neural network training, transfer learning, optimization strategies, explainable AI visualizations, and ensemble-based prediction systems for improved classification performance and model interpretability.
Status: In Progress
Core Stack: Python, TensorFlow, Keras, NumPy, Pandas, Scikit-learn, Matplotlib, Seaborn, DVC, PyTest, GitHub Actions (CI/CD), Jupyter Notebook, Git
Deep Learning & Data Pipeline Focus
- Implemented structured image data loading pipelines using TensorFlow and Keras ImageDataGenerator
- Performed image preprocessing, normalization, augmentation, and dataset balancing using class weights
- Built scalable and reusable deep learning workflows for training, evaluation, and inference
- Applied data augmentation techniques including rotation, zoom, flipping, width/height shifting for robust generalization
- Managed dataset versioning and reproducibility using DVC (Data Version Control)
- Conducted comprehensive experimentation and notebook-driven analysis for model comparison and optimization
- Maintained modular project architecture with separate components for data loading, model building, training, and evaluation
Model Development, Evaluation & MLOps
- Developed and optimized Baseline MLP, Deep Neural Networks, and Transfer Learning models using TensorFlow and Keras for medical image classification
- Implemented pretrained architectures including MobileNetV2, VGG16, and ResNet50 with fine-tuning for enhanced model performance
- Applied advanced training strategies using EarlyStopping, ModelCheckpoint, ReduceLROnPlateau, TensorBoard, and custom learning rate scheduling
- Performed experimentation with activation functions, optimizers, regularization techniques, and hyperparameter optimization
- Built ensemble learning models using soft-voting techniques to improve prediction accuracy and robustness
- Designed end-to-end evaluation pipelines using Accuracy, Precision, Recall, F1-Score, ROC-AUC, Confusion Matrix, and ROC Curves
- Implemented Grad-CAM based Explainable AI (XAI) techniques for model interpretability and medical image visualization
- Integrated PyTest, DVC, Git, GitHub, modular pipelines, and MLOps best practices for scalable and reproducible deep learning workflows
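Soft voting, mentioned in the list above, averages each model's predicted probability before thresholding. The following is a small sketch in plain Python under stated assumptions: the inputs are per-model positive-class probabilities for the same samples, the 0.5 threshold is a default chosen for illustration, and the function names are hypothetical rather than the project's own.

```python
def average_probabilities(model_probs: list[list[float]]) -> list[float]:
    """Average positive-class probabilities across models, sample by sample."""
    if not model_probs:
        raise ValueError("at least one model is required")
    n_samples = len(model_probs[0])
    if any(len(probs) != n_samples for probs in model_probs):
        raise ValueError("every model must score the same number of samples")
    return [sum(column) / len(model_probs) for column in zip(*model_probs)]


def soft_vote(model_probs: list[list[float]], threshold: float = 0.5) -> list[int]:
    """Label a sample 1 when the averaged probability reaches the threshold."""
    return [int(p >= threshold) for p in average_probabilities(model_probs)]
```

Because probabilities are averaged rather than labels counted, a single confident model can outweigh two uncertain ones, which is the main difference from hard (majority) voting.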
Skills Demonstrated
- Deep Learning: Neural Networks, Transfer Learning, Ensemble Learning
- Computer Vision: Medical Image Classification, Chest X-ray Analysis
- Explainable AI: Grad-CAM Visualization
- Machine Learning: Model Evaluation, Hyperparameter Optimization
- Data Processing: Image Augmentation, Preprocessing, Dataset Balancing
- Programming: Python, TensorFlow, Keras, NumPy, Pandas
- Visualization: Matplotlib, Seaborn
- MLOps: DVC, Automated Testing (PyTest), Workflow Management
- Version Control: Git, GitHub
- Analytical Thinking: Model Optimization, Diagnostic Performance Analysis
Goal: Deliver a production-ready medical image classification engine that combines deep learning, transfer learning, explainable AI, and ensemble learning techniques to demonstrate scalable healthcare AI workflows, strong analytical capabilities, and deployment-ready deep learning solutions.
https://github.com/nagasantoshchavvakula/Real-Estate-Price-Prediction.git
A comprehensive regression-based machine learning pipeline designed to predict real estate property prices using advanced statistical modeling, unsupervised learning, and recommendation systems. This project simulates a real-world pricing platform for buyers, sellers, and agents to make data-driven property valuation decisions.
The project follows a complete end-to-end ML lifecycle, integrating regression modeling, clustering for market segmentation, recommendation systems, and ensemble learning for improved predictive performance.
Status: Completed
Core Stack: Python, Pandas, NumPy, Scikit-learn, XGBoost, LightGBM, Matplotlib, Seaborn, PCA (Dimensionality Reduction), PyTest, GitHub Actions (CI/CD), Streamlit, Joblib, Jupyter Notebook, Git
MLOps & Data Focus
- Implemented structured data pipelines for preprocessing, feature engineering, and transformation
- Performed comprehensive exploratory data analysis (EDA) to uncover pricing trends and feature relationships
- Handled missing values, categorical encoding, and feature scaling for robust model performance
- Applied Principal Component Analysis (PCA) for dimensionality reduction and multicollinearity handling
- Built modular and reusable ML components for regression, clustering, and recommendation systems
- Ensured reproducibility through organized workflows, version control, and notebook-based experimentation
Experimentation & Deployment
- Trained and evaluated multiple regression models including Linear, Ridge, Lasso, ElasticNet, Random Forest, and Gradient Boosting
- Performed hyperparameter tuning using GridSearchCV and RandomizedSearchCV
- Designed evaluation pipelines using RMSE, MAE, and R² for regression performance comparison
- Implemented clustering workflows (K-Means, Hierarchical, DBSCAN) with validation using Silhouette Score
- Built recommendation systems (content-based, collaborative, hybrid) for personalized property suggestions
- Developed ensemble models (Voting & Stacking) to improve predictive accuracy
- Created an interactive Streamlit dashboard for real-time predictions and insights
- Integrated automated testing using PyTest and CI/CD pipelines via GitHub Actions
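For reference, the three regression metrics named above can be computed directly from their definitions. This is an illustrative sketch in plain Python, not the project's evaluation code; the function name `regression_metrics` is hypothetical, and in practice the equivalent scikit-learn metric functions would normally be used.

```python
import math


def regression_metrics(y_true: list[float], y_pred: list[float]) -> dict[str, float]:
    """Compute RMSE, MAE, and R^2 for paired true and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        # R^2 is undefined when the target is constant.
        "r2": 1 - ss_res / ss_tot if ss_tot else float("nan"),
    }
```

RMSE penalizes large errors more heavily than MAE, so comparing the two gives a quick signal of whether a model's error is dominated by a few badly mispriced properties.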
Skills Demonstrated
- Machine Learning: Regression Modeling, Clustering, Recommendation Systems, Ensemble Learning
- Data Science: EDA, Feature Engineering, Dimensionality Reduction (PCA)
- Programming: Python (Pandas, NumPy, Scikit-learn)
- Visualization: Matplotlib, Seaborn
- MLOps: CI/CD Pipelines, Automated Testing (PyTest), Workflow Management
- Deployment: Streamlit Dashboard, Model Serialization (Joblib)
- Analytical Thinking: Model Evaluation, Performance Optimization, Business Insights
Goal: Deliver a production-ready real estate pricing engine that combines regression modeling, market segmentation, and recommendation systems, demonstrating scalable machine learning workflows, strong analytical capabilities, and deployment-ready solutions.
https://github.com/nagasantoshchavvakula/Customer-Churn-Prediction
An end-to-end machine learning pipeline project to predict customer churn for a telecommunications company using real-world business data. The project follows the complete ML lifecycle, from exploratory data analysis and preprocessing to model development, hyperparameter tuning, and deployment-ready model serialization.
This project emphasizes reproducibility, automated testing, and CI/CD workflows, demonstrating best practices for building production-ready ML systems.
Status: Completed
Core Stack: Python, Pandas, NumPy, Scikit-learn (Preprocessing, Model Development & Evaluation), XGBoost, Matplotlib, Seaborn, SMOTE (Imbalanced-learn), PyTest, GitHub Actions (CI/CD), Joblib (Model Serialization), Jupyter Notebook, Git
Machine Learning & Analytics Focus
- Conducted Exploratory Data Analysis (EDA) to understand customer behavior and churn patterns
- Computed statistical summaries including distributions, correlations, and hypothesis testing
- Built preprocessing pipelines for missing value handling, categorical encoding, and feature scaling
- Implemented stratified train-test splitting to preserve class distribution
- Developed multiple ML models: Logistic Regression, KNN, Decision Tree, Random Forest, XGBoost
- Addressed class imbalance using SMOTE techniques
- Evaluated models using Accuracy, Precision, Recall, F1-score, ROC-AUC, confusion matrices, and ROC curves
- Performed hyperparameter tuning with GridSearchCV and RandomizedSearchCV
- Selected and serialized the best-performing model and scaler for production use
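Stratified splitting, listed above, samples the test set separately within each class so that the churn ratio is preserved in both partitions. The sketch below shows the idea with the standard library only; the function name, the default `test_size`, and the seed are illustrative assumptions, and scikit-learn's `train_test_split` with its `stratify` argument is the usual tool in a real pipeline.

```python
import random
from collections import defaultdict


def stratified_split(labels: list, test_size: float = 0.25, seed: int = 42):
    """Return (train_indices, test_indices) that keep each class's share stable."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # local RNG so the split is reproducible
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # At least one test sample per class; a singleton class ends up test-only.
        n_test = max(1, round(len(indices) * test_size))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)
```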
Systems & MLOps Direction
- Automated testing using PyTest
- CI/CD pipelines implemented with GitHub Actions for reproducibility and deployment readiness
- Data visualization for model insights using Matplotlib and Seaborn
- Workflow version control and experiment tracking using Git and Jupyter Notebook
Skills Demonstrated
- Python programming for ML pipelines
- Data preprocessing & feature engineering
- Model evaluation & hyperparameter optimization
- Ensemble learning with XGBoost
- Imbalanced data handling and statistical analysis
- CI/CD for ML workflows
- Model serialization and deployment readiness
Goal: Build a production-ready ML pipeline that predicts customer churn, provides actionable business insights, and demonstrates reproducible, automated, and deployment-ready ML workflows.
https://github.com/nagasantoshchavvakula/Sentiment-Analysis-MLOps
An end-to-end Machine Learning Lifecycle (MLOps) project for sentiment analysis using HuggingFace transformer models. The project demonstrates industry-standard practices including data versioning, experiment tracking, automated CI/CD pipelines, and cloud-ready deployment. Transfer learning is applied to fine-tune a pre-trained NLP model for binary sentiment classification, ensuring scalable, reproducible, and production-ready ML workflows.
Status: In Progress
Core Stack: Python, HuggingFace Transformers (DistilBERT), PyTorch, Scikit-learn, DVC, MLflow, Flask, Docker, GitHub Actions (CI/CD)
MLOps & Data Focus
- Implemented data lifecycle management using DVC for dataset and model artifact versioning
- Built preprocessing pipelines for text cleaning and tokenization
- Performed exploratory data analysis (EDA) on text data
- Fine-tuned DistilBERT for sentiment classification
- Developed reproducible ML pipelines with parameter tracking and dependency management
Experimentation & Deployment
- Integrated MLflow for experiment tracking, hyperparameter logging, and model versioning
- Designed evaluation workflows including accuracy, precision, recall, F1-score, and confusion matrix
- Built RESTful Flask API for real-time and batch sentiment predictions
- Created automated CI/CD pipelines using GitHub Actions for testing, training, and deployment
- Containerized the application using Docker for consistent deployments
- Implemented monitoring-ready endpoints for production health checks and model performance tracking
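The evaluation metrics in the list above all derive from the four cells of a binary confusion matrix. A minimal sketch, assuming labels encoded as 0 and 1 and a hypothetical function name `binary_metrics`, might look like this:

```python
def binary_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Compute accuracy, precision, recall, and F1 from 0/1 labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Guard each ratio so an empty denominator yields 0.0 instead of an error.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "confusion_matrix": [[tn, fp], [fn, tp]],  # rows: true 0/1, columns: predicted 0/1
    }
```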
Skills Demonstrated
- Machine Learning & NLP: Transfer Learning, Sentiment Analysis, Transformer Models
- MLOps Tools: DVC, MLflow, CI/CD pipelines
- Programming: Python
- Deployment & DevOps: Flask API, Docker, GitHub Actions
- Data Engineering: Versioning, Pipeline Orchestration
- Cloud & Production Concepts: Model Serving, Reproducibility
Goal: Deliver an industry-standard, production-ready MLOps pipeline for NLP sentiment analysis that demonstrates best practices in experiment tracking, deployment, and reproducibility.
https://github.com/nagasantoshchavvakula/Intro_ML_Starter_Code_Implementation
A foundational machine learning implementation project designed to demonstrate the core components of a typical ML workflow, including data preprocessing, dataset splitting, model training, prediction, and evaluation.
The project focuses on building reusable Python functions that implement key machine learning operations using NumPy and Scikit-learn, providing a hands-on understanding of how regression and classification models are trained and evaluated in real-world data science pipelines.
This project emphasizes clean data preparation, modular ML pipeline design, and reliable model evaluation, reflecting practical machine learning development workflows used in analytics and AI systems.
Status: Completed
Core Stack: Python, NumPy, scikit-learn, pytest, Git, GitHub
Machine Learning Pipeline Focus
- Implemented feature normalization using min-max scaling to standardize dataset inputs
- Designed flexible missing value imputation strategies including mean, median, and zero replacement
- Built dataset train-test splitting functions to support proper model validation
- Trained Linear Regression models for continuous value prediction tasks
- Developed Logistic Regression classifiers for binary classification problems
- Created reusable prediction functions to generate outputs on unseen data
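The two preprocessing steps at the top of this list are simple enough to sketch without any libraries. The version below is an assumption-laden illustration, not the repository's code: the project uses NumPy, while this sketch works on plain lists, marks missing values with `None`, and uses hypothetical function names.

```python
from statistics import mean, median
from typing import Optional


def impute(values: list[Optional[float]], strategy: str = "mean") -> list[float]:
    """Replace None entries using the mean, median, or zero."""
    observed = [v for v in values if v is not None]
    if strategy == "zero":
        fill = 0.0
    elif strategy not in ("mean", "median"):
        raise ValueError(f"unknown strategy: {strategy}")
    elif not observed:
        raise ValueError("cannot compute a fill value from an all-missing column")
    else:
        fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]


def min_max_scale(values: list[float]) -> list[float]:
    """Rescale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # constant column: avoid dividing by zero
    return [(v - low) / (high - low) for v in values]
```

Imputation runs before scaling so that the fill value is included when the minimum and maximum are computed.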
Model Evaluation & Validation
- Implemented regression evaluation using Mean Squared Error (MSE)
- Calculated classification performance using Accuracy metrics
- Validated ML pipeline functionality using automated unit testing with pytest
- Ensured reproducibility and correctness across preprocessing, training, and prediction stages
Engineering & Development Practices
- Developed modular Python functions for reusable ML workflows
- Applied automated testing to validate model pipelines
- Used Git and GitHub for version control and project management
- Maintained clean documentation with structured function docstrings and examples
Skills Demonstrated
- Python programming for machine learning
- Data preprocessing and feature scaling
- Regression and classification model development
- Model performance evaluation and testing
- Building reliable and modular ML pipelines
Goal: Demonstrate a practical understanding of machine learning fundamentals by implementing a complete ML workflow from data preprocessing to model evaluation using industry-standard Python libraries.
https://github.com/nagasantoshchavvakula/Ecommerce-Fraud-Detection-End-to-End-Data-Pipeline
A production-style data engineering project that implements an automated pipeline for detecting fraudulent e-commerce transactions. The system processes raw transaction data, performs cleaning and feature engineering, and loads analytics-ready datasets into a MySQL analytics layer to support fraud monitoring and BI dashboards.
This project simulates a real-world enterprise data pipeline, incorporating ETL orchestration, modular workflow design, and monitoring to ensure reliable processing of transactional data used for fraud analysis.
Status: Completed
Core Stack: Python (Pandas, SQLAlchemy), Prefect (Workflow Orchestration), MySQL (Staging & Analytics), SQL (Analytical Queries), Data Engineering (ETL Pipelines, Feature Engineering, Data Modeling)
Data Engineering Focus
- Designed an automated ETL pipeline to ingest, transform, and load e-commerce transaction data
- Processed raw CSV datasets and stored them in MySQL staging tables for controlled transformation
- Built modular ETL tasks (Extract → Transform → Load) orchestrated using Prefect workflows
- Developed analytics-ready datasets optimized for fraud detection queries and BI dashboards
- Implemented data validation, schema standardization, and feature engineering during transformation
Fraud Analytics & Feature Engineering
- Engineered fraud detection features such as promo misuse, device-location mismatch, and transaction anomalies
- Generated key fraud metrics including fraud rate, suspicious user patterns, and high-risk country indicators
- Designed SQL analytical queries to detect abnormal transaction behaviors and high-risk merchant categories
- Created aggregated KPIs enabling drill-down analysis at the transaction and user levels
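To make the feature and KPI ideas above concrete, here is a small sketch of a device-location mismatch flag and a grouped fraud rate. The `Transaction` fields and function names are hypothetical and chosen for illustration; the project computes these in Pandas and SQL, and its actual schema is not shown here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Transaction:
    # Hypothetical columns; the real dataset's schema may differ.
    user_id: str
    amount: float
    device_country: str
    shipping_country: str
    is_fraud: bool


def device_location_mismatch(tx: Transaction) -> bool:
    """Flag orders placed from a device in a different country than the shipping address."""
    return tx.device_country != tx.shipping_country


def fraud_rate(transactions: list[Transaction]) -> float:
    """Share of transactions labelled as fraud (0.0 for an empty list)."""
    if not transactions:
        return 0.0
    return sum(tx.is_fraud for tx in transactions) / len(transactions)


def fraud_rate_by_country(transactions: list[Transaction]) -> dict[str, float]:
    """Fraud rate per shipping country, a simple drill-down KPI."""
    groups = defaultdict(list)
    for tx in transactions:
        groups[tx.shipping_country].append(tx)
    return {country: fraud_rate(group) for country, group in groups.items()}
```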
Workflow Automation & Monitoring
- Automated pipeline orchestration using Prefect with scheduling, retries, and workflow logging
- Implemented modular task-based ETL architecture for scalable data processing
- Enabled real-time monitoring and failure recovery using Prefect UI
- Designed pipelines to support reliable batch processing for large transaction datasets
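The retry behavior mentioned above is provided by Prefect in the project. As a rough, library-free illustration of what retrying a failing task means, the decorator below re-runs a function a fixed number of times and logs each failure. It is not Prefect's API; `with_retries` and its parameters are hypothetical names used only for this sketch.

```python
import functools
import logging
import time

logger = logging.getLogger("etl")


def with_retries(retries: int = 3, delay_seconds: float = 0.0):
    """Re-run the wrapped function up to `retries` extra times if it raises."""
    if retries < 0:
        raise ValueError("retries must be non-negative")

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(retries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    if attempt == retries:
                        raise  # out of attempts: surface the original error
                    logger.warning("%s failed (attempt %d): %s", func.__name__, attempt + 1, exc)
                    time.sleep(delay_seconds)
        return wrapper

    return decorator
```

An orchestrator adds scheduling, state tracking, and a UI on top of this basic loop, which is why the pipeline delegates retries to Prefect rather than hand-rolling them.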
Skills Demonstrated
- Python for data engineering (Pandas, SQLAlchemy)
- ETL pipeline development and workflow orchestration
- Prefect for scheduling, monitoring, and pipeline automation
- MySQL database design (staging and analytics schemas)
- SQL analytics and fraud detection metrics
- Designing BI-ready datasets for reporting and dashboards
Goal: Build a scalable data engineering pipeline for fraud detection analytics, demonstrating how automated ETL workflows, feature engineering, and SQL analytics can transform raw transactional data into actionable fraud insights for business intelligence systems.
https://github.com/nagasantoshchavvakula/YouTube-TikTok-Short-Form-Video-Analytics
An end-to-end data analytics pipeline designed to analyze and visualize engagement trends from short-form video platforms such as YouTube Shorts and TikTok. The project demonstrates the complete analytics lifecycle, from raw data ingestion to interactive dashboard visualization, using MySQL, Python, DVC, and Streamlit.
The system focuses on transforming raw social media datasets into actionable engagement insights, enabling exploration of metrics such as views, likes, shares, comments, and trending patterns through interactive dashboards.
Status: Completed
Core Stack: Python, Pandas, NumPy, MySQL, Streamlit, Plotly, Matplotlib, DVC, Git, GitHub, Virtualenv, Power BI
Data Engineering & Analytics Focus
- Designed an automated pipeline to ingest raw CSV datasets from Kaggle into a structured MySQL analytics database
- Performed data cleaning, transformation, and feature engineering using Python (Pandas)
- Built modular Python scripts for ingestion, processing, and analysis workflows
- Conducted exploratory data analysis (EDA) to identify engagement patterns across short-form content
- Generated analytics-ready datasets for dashboard visualization and reporting
Dashboard & Visualization Layer
- Developed an interactive Streamlit dashboard to visualize engagement trends and performance metrics
- Built a Power BI dashboard connected to MySQL, enabling deeper business insights through KPI tracking and visual analytics
- Designed visualizations including trend lines, engagement comparisons, and category-based insights
Data Management & Reproducibility
- Implemented Data Version Control (DVC) to track dataset versions and ensure reproducible data workflows
- Organized project structure with modular scripts and reproducible pipelines
- Managed source code, collaboration, and versioning using Git and GitHub
Engineering Practices
- Modular pipeline architecture separating ingestion, processing, and analysis layers
- Version-controlled datasets and scripts for reproducible analytics workflows
- Designed the project to scale for larger social media datasets and additional analytics dashboards
Goal: Demonstrate an end-to-end data analytics workflow that integrates data engineering, analysis, and dashboard visualization, transforming raw social media datasets into actionable engagement insights.
https://github.com/nagasantoshchavvakula/bmw-car-sales_Performance-Optimization-Pipeline
An end-to-end data engineering and business intelligence project designed to automate the extraction, transformation, and loading (ETL) of vehicle sales data into a structured MySQL database while generating interactive analytics dashboards in Power BI.
The project focuses on identifying key regional and vehicle factors that drive high sales performance, enabling data-driven decision-making through automated data pipelines and real-time business intelligence reporting.
Status: Completed
Core Stack: Python (Pandas, SQLAlchemy), Prefect, MySQL, Power BI, Excel
Data Engineering & ETL Pipeline
- Designed and implemented a complete ETL pipeline using Python, Prefect, and MySQL for automated data ingestion and transformation
- Built modular Prefect workflows using @task and @flow decorators to manage pipeline dependencies and data processing steps
- Performed comprehensive data auditing and schema design to ensure clean, consistent, and analytics-ready database structures
- Automated loading of transformed datasets into a centralized MySQL data warehouse
Analytics & Business Intelligence
- Developed an interactive Power BI dashboard connected to the MySQL database for real-time sales analytics
- Created KPI visualizations to track sales trends by region, vehicle model, and sales classification
- Enabled business stakeholders to quickly identify top-performing regions and vehicle categories
Data Analysis Focus
- Identified key drivers behind "High" sales classifications through exploratory data analysis
- Applied structured data profiling and preprocessing using Excel and Python
- Delivered actionable insights supporting sales performance optimization and strategic decision-making
Skills Demonstrated
- Data pipeline design and ETL automation
- Workflow orchestration with Prefect
- Relational database schema design and SQL integration
- Business intelligence dashboard development
- Data auditing and structured data transformation
Goal: Build a scalable data pipeline that automates sales data processing while enabling business stakeholders to analyze regional and product-level sales performance through interactive dashboards.
https://github.com/nagasantoshchavvakula/Personal-Finance-Tracker-and-Investment-Portfolio-Analyzer
A Python-based personal finance analytics system designed to track expenses, manage budgets, and analyze investment portfolios through automated data workflows. The project demonstrates the integration of Python programming, financial modeling, and workflow orchestration to transform raw financial data into actionable insights.
This system focuses on financial data analysis, automation, and reproducible data pipelines, enabling users to monitor spending behavior, evaluate investment performance, and generate analytical reports for better financial decision-making.
Status: Completed
Core Stack: Python, Object-Oriented Programming (OOP), NumPy, pandas, Matplotlib, Prefect, DVC, Git
Financial Analytics & Data Processing Focus
- Tracked financial transactions, expenses, and budget allocations using structured data models
- Performed statistical and time-series analysis on spending patterns and investment performance
- Processed financial datasets using Pandas and NumPy for numerical and analytical computations
- Generated reports and visualizations to evaluate budget adherence and portfolio growth
- Extracted insights on spending trends and investment returns through analytical workflows
System Architecture & Engineering Practices
- Developed modular Python architecture using Object-Oriented Programming (OOP) principles
- Designed reusable classes for transactions, accounts, and investment portfolios
- Implemented data validation and structured financial data models
- Built reproducible data pipelines using Data Version Control (DVC)
- Automated scheduled financial analysis using Prefect workflow orchestration
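The class design described above might be sketched as follows. The class names, fields, and the convention that expense amounts are positive are assumptions made for this example and may not match the repository's actual models.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class Transaction:
    day: date
    category: str
    amount: float  # assumed convention: expenses are positive amounts

    def __post_init__(self):
        # Basic data validation at construction time.
        if self.amount <= 0:
            raise ValueError("amount must be positive")


@dataclass
class Budget:
    limits: dict[str, float]
    transactions: list[Transaction] = field(default_factory=list)

    def add(self, tx: Transaction) -> None:
        self.transactions.append(tx)

    def spent(self, category: str) -> float:
        return sum(tx.amount for tx in self.transactions if tx.category == category)

    def remaining(self, category: str) -> float:
        """Budget left in a category; negative means overspent."""
        return self.limits.get(category, 0.0) - self.spent(category)


@dataclass
class Holding:
    symbol: str
    quantity: float
    cost_basis: float     # price paid per unit
    current_price: float  # latest price per unit


def portfolio_return(holdings: list[Holding]) -> float:
    """Simple return: (current value - total cost) / total cost."""
    cost = sum(h.quantity * h.cost_basis for h in holdings)
    value = sum(h.quantity * h.current_price for h in holdings)
    if cost == 0:
        raise ValueError("portfolio has no cost basis")
    return (value - cost) / cost
```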
Skills Demonstrated
- Python programming and OOP system design
- Financial data analysis and modeling
- Pandas and NumPy for structured data processing
- Workflow automation using Prefect
- Reproducible data pipelines with DVC
- Data visualization and reporting with Matplotlib
- Git-based version control and project organization
Goal: Demonstrate the design of an automated financial analytics system that combines data engineering, statistical analysis, and workflow automation to support personal finance monitoring and investment decision-making.
Exploratory Data Analysis & Visualization of Student Performance - Interactive Analytics Dashboard
https://github.com/nagasantoshchavvakula/EDA_Student_Performance
An end-to-end data analysis and visualization project focused on exploring the factors influencing student exam performance using a Kaggle dataset. The project demonstrates how Python-based data analytics workflows can transform raw educational datasets into meaningful insights through exploratory data analysis, statistical visualization, and interactive dashboards.
The project integrates data preprocessing, statistical analysis, and interactive visualization to enable dynamic exploration of student performance metrics, helping identify patterns across demographic, socioeconomic, and academic variables.
Status: Completed
Core Stack: Python, pandas, NumPy, matplotlib, seaborn, plotly, Streamlit, GitHub, Cloud Deployment
Data Analysis & EDA Focus
- Cleaned and preprocessed the Kaggle Student Performance in Exams dataset by handling missing values, duplicates, and data inconsistencies
- Conducted exploratory data analysis to uncover patterns and relationships between exam scores and demographic variables
- Performed statistical analysis and correlation analysis across math, reading, and writing scores
- Engineered meaningful features and calculated key performance indicators (KPIs) for student performance evaluation
Data Visualization & Insights
- Built multiple visualization types including histograms, bar charts, scatter plots, box plots, and heatmaps
- Used Seaborn, Matplotlib, and Plotly to visually communicate trends, correlations, and distribution patterns
- Identified key performance drivers such as parental education level, lunch type, and test preparation courses
Interactive Dashboard Development
- Developed a Streamlit-based interactive analytics dashboard for dynamic exploration of student performance data
- Implemented filters and KPIs allowing users to explore results by gender, race/ethnicity, education level, and exam category
- Designed an intuitive interface to make complex data insights accessible for non-technical users
Deployment & Data Application Engineering
- Deployed the Streamlit application using Streamlit Community Cloud
- Integrated GitHub for version control and automated deployment workflows
- Enabled cloud-based access for interactive exploration of insights from anywhere
Skills Demonstrated
- Exploratory Data Analysis (EDA)
- Data Cleaning & Preprocessing
- Statistical Analysis & Correlation Analysis
- Interactive Data Visualization
- Dashboard Development with Streamlit
- Cloud Deployment & Version Control
Goal: Demonstrate the ability to perform end-to-end exploratory data analysis, build interactive dashboards, and deploy Python-based data applications that transform raw datasets into accessible and actionable insights.
https://github.com/nagasantoshchavvakula/Data_Version_Control_With-DVC_And_Git
A practical data version control project demonstrating how to integrate DVC (Data Version Control) with Git to manage and track changes in structured CSV datasets. The project highlights how modern data teams maintain reproducibility, data integrity, and collaborative workflows by versioning datasets alongside code.
This project focuses on data versioning workflows commonly used in MLOps and data engineering pipelines, ensuring that dataset changes are traceable, reproducible, and synchronized with source code repositories.
Status: Completed
Core Stack: Git, GitHub, DVC (Data Version Control), CSV Data Management, MLOps Fundamentals, Reproducibility, Command-Line Tools
Data Versioning & MLOps Focus
- Initialized and configured DVC within a Git repository to manage dataset versioning
- Tracked structured CSV datasets using DVC data tracking mechanisms
- Maintained reproducible dataset history while separating data artifacts from source code
- Demonstrated reproducible workflows commonly used in machine learning and data science pipelines
Workflow & Engineering Practices
- Used DVC commands such as `dvc init` and `dvc add` to track dataset changes
- Managed dataset metadata files (.dvc) and repository configurations
- Committed DVC metadata and configuration files to Git for version control
- Simulated dataset updates and maintained version history through Git commits
- Pushed repository updates to GitHub to enable collaborative data project workflows
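To make the idea behind dataset tracking concrete, here is a simplified Python sketch of content-based versioning. It is an illustration only, not DVC's implementation: the `track` helper and the `.pointer.json` file name are hypothetical. Conceptually, DVC records a content hash of the dataset in a small metadata file that Git versions, while the data itself stays out of the Git history.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """Hash a file's contents in chunks so large datasets need not fit in memory."""
    digest = hashlib.md5(usedforsecurity=False)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(path: Path) -> Path:
    """Write a small pointer file (hypothetical format) describing the dataset."""
    pointer = path.parent / (path.name + ".pointer.json")
    meta = {"path": path.name, "md5": file_md5(path), "size": path.stat().st_size}
    pointer.write_text(json.dumps(meta, indent=2))
    return pointer


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "students.csv"
        data.write_bytes(b"name,score\nAda,91\n")
        before = json.loads(track(data).read_text())["md5"]
        data.write_bytes(b"name,score\nAda,91\nLin,84\n")  # simulate a dataset update
        after = json.loads(track(data).read_text())["md5"]
        print("hash changed:", before != after)
```

Committing only the small pointer file keeps the repository light while every dataset revision stays identifiable by its hash.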
Skills Demonstrated
- Git & GitHub version control
- Data Version Control (DVC)
- Reproducible data science workflows
- CSV dataset management
- Command-line based data engineering workflows
- Collaboration practices in data-driven projects
Goal: Demonstrate how to build reproducible data science workflows by integrating Git and DVC for dataset versioning, enabling scalable collaboration and reliable data pipeline management.
https://github.com/nagasantoshchavvakula/Student-Performance-Analysis
A Python-based data analysis project designed to explore and summarize student performance datasets using Pandas. The project demonstrates how structured CSV data can be programmatically processed to extract meaningful insights through statistical analysis and exploratory data techniques.
This project focuses on data exploration, statistical computation, and structured data handling, highlighting how Python can be used to quickly analyze datasets and generate performance insights for decision-making.
Status: Completed
Core Stack: Python, Pandas, CSV Handling, Data Analysis, Descriptive Statistics, Exception Handling
Data Analysis Focus
- Read and processed structured student datasets from CSV files using Pandas
- Performed exploratory data analysis (EDA) on student attributes such as scores and age
- Calculated statistical metrics including mean, median, standard deviation, minimum, and maximum values
- Extracted key information such as student names and previewed top records for quick dataset inspection
- Generated descriptive summaries to identify performance patterns within the dataset
Data Engineering & Code Quality Practices
- Implemented structured data handling using Pandas DataFrames
- Added robust exception handling for missing files and column inconsistencies
- Designed reusable analysis scripts for quick dataset preview and insight generation
- Demonstrated practical workflows for cleaning and summarizing real-world tabular datasets
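The loading, error-handling, and summary steps described above might look roughly like this. It is a sketch under assumptions rather than the repository's code: the function names `load_scores` and `summarize`, the `score` column, and the sample rows are all hypothetical, and the standard library's `csv` and `statistics` modules stand in for Pandas so the example has no dependencies.

```python
import csv
import statistics
import tempfile
from pathlib import Path


def load_scores(path, column="score"):
    """Return one column as floats, or [] if the file or the column is missing."""
    try:
        with open(path, newline="") as handle:
            reader = csv.DictReader(handle)
            if column not in (reader.fieldnames or []):
                print(f"Column '{column}' not found in {path}")
                return []
            # Assumes every value in the column is numeric.
            return [float(row[column]) for row in reader]
    except FileNotFoundError:
        print(f"File not found: {path}")
        return []


def summarize(values):
    """Descriptive statistics; the sample standard deviation needs two or more values."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std": statistics.stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "students.csv"
        sample.write_text("name,age,score\nAda,20,80\nLin,21,90\nSam,19,100\n")
        print(summarize(load_scores(sample)))
        load_scores(Path(tmp) / "missing.csv")        # reports a missing file
        load_scores(sample, column="grade")           # reports a missing column
```

Returning an empty list on failure keeps the calling script simple; raising a custom exception instead would be a reasonable alternative when callers need to distinguish the two error cases.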
Skills Demonstrated
- Python programming for data analysis
- Pandas DataFrame manipulation
- CSV data ingestion and transformation
- Descriptive statistical analysis
- Error handling and robust data processing
Goal: Demonstrate the ability to analyze structured datasets using Python and Pandas, perform statistical analysis, and extract actionable insights through efficient data exploration workflows.
- Gained hands-on experience in full-stack development using Spring Boot & React
- Demonstrated strong problem-solving, technical skills, and project execution
- Awarded Certificate & Letter of Recommendation (View Certificate & LOR)
Web Development Projects:
- Built a JWT-based authentication system with login & registration
- Implemented BCrypt password hashing and secure session handling
- Designed Spring Security-based protected APIs
- Integrated React frontend with protected routes
https://github.com/nagasantoshchavvakula/Secure-User-Authentication-System.git
- Developed a full-stack CRUD application with JWT authentication
- Enabled admins to manage employee records securely
- Implemented Spring Boot REST APIs + React frontend
- Added validation and role-based access control
https://github.com/nagasantoshchavvakula/Employee-Management-System.git
- Built a full-stack app with posts, likes, comments, and follow system
- Implemented JWT authentication and secure REST APIs
- Developed responsive UI with React and backend with Spring Boot
https://github.com/nagasantoshchavvakula/Social-Media-App.git
- Developed a real-time chat system using WebSockets
- Supports multiple chat rooms and persistent chat history
- Implemented JWT-based authentication
- Tech: Spring Boot, WebSocket, React, SockJS, StompJS
https://github.com/nagasantoshchavvakula/Real-Time-Chat-Application.git
I enjoy building data-driven systems that combine analytics, engineering, artificial intelligence, and machine learning to solve real-world business problems.
Current interests include:
- Data analytics & visualization
- Data engineering pipelines
- Machine learning & AI applications
- Deep learning & neural networks
- MLOps and scalable ML systems
- Cloud-based analytics and AI platforms
Here are some of my featured projects demonstrating expertise in Data Analytics, Machine Learning, Deep Learning, MLOps, Data Engineering, and Full-Stack Development:
- Image Classification for Medical Diagnosis: End-to-end deep learning pipeline for pneumonia detection using CNNs, Transfer Learning (MobileNetV2, VGG16, ResNet50), Grad-CAM explainability, ensemble learning, DVC, and CI/CD workflows.
- Document QA ChatBot: Retrieval-Augmented Generation (RAG) application leveraging Groq Llama3, LangChain, HuggingFace Embeddings, and FAISS for intelligent PDF document question answering, semantic search, context-aware response generation, and enterprise knowledge retrieval.
- Real Estate Price Prediction Engine: Production-ready ML system combining regression models, clustering, recommendation systems, ensemble learning, Streamlit deployment, and automated testing.
- Telco Customer Churn Prediction: End-to-end ML pipeline using XGBoost, SMOTE, hyperparameter tuning, CI/CD, and deployment-ready model serialization.
- ML Lifecycle & MLOps Sentiment Analysis System: Transformer-based NLP solution using DistilBERT, DVC, MLflow, Docker, Flask APIs, and GitHub Actions.
- Ecommerce Fraud Detection Pipeline: Automated ETL pipeline using Python, Prefect, MySQL, and fraud analytics for detecting suspicious e-commerce transactions.
- YouTube-TikTok Short Form Video Analytics: End-to-end analytics platform with MySQL, DVC, Streamlit, and Power BI dashboards for social media engagement analysis.
- Sales Performance Optimization Pipeline: Automated ETL and BI reporting solution leveraging Python, Prefect, MySQL, and Power BI.
- DVC CSV Tracker: Demonstrates reproducible data science workflows using Git and DVC for dataset versioning and management.
- Personal Finance Tracker & Investment Portfolio Analyzer: Python-based financial analytics system using OOP, NumPy, Pandas, Prefect, and DVC for budgeting and portfolio analysis.
- Student Performance Analysis: Exploratory data analysis and statistical insights generation using Python and Pandas.
- Student Performance Interactive Dashboard: Streamlit-powered analytics dashboard featuring interactive visualizations and educational performance insights.
- Employee Management System: Full-stack CRUD application with Spring Boot, React, JWT authentication, validation, and role-based access control.
- Secure User Authentication System: Spring Security and JWT-based authentication platform featuring secure login, registration, BCrypt password hashing, and protected REST APIs.
- Social Media Application: Full-stack social networking platform built with Spring Boot and React, supporting posts, likes, comments, user profiles, follow/unfollow functionality, JWT authentication, and secure REST APIs.
- Real-Time Chat Application: Real-time messaging platform using Spring Boot, WebSockets, React, SockJS, and STOMP, featuring multiple chat rooms, persistent chat history, live communication, and JWT-based authentication.
Thanks for visiting my GitHub! Feel free to explore my projects and connect if you'd like to collaborate.

