  • Texas, United States
  • LinkedIn in/nagas914

👋 Hello, I'm Nagasantosh!

  • 🔍 Passionate about Data Science, Machine Learning, Deep Learning, Generative AI, Computer Vision, NLP, MLOps, Data Engineering, and Full-Stack Development
  • 🚀 Currently building end-to-end AI, Machine Learning, Deep Learning, Generative AI (RAG), Data Engineering, and Analytics solutions using modern tools, frameworks, and cloud-ready architectures
  • 🌱 Continuously expanding expertise in Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), Transformer-based NLP, Transfer Learning, MLOps, Cloud AI Services, and Scalable Data Pipelines
  • 🤝 Open to collaborating on Machine Learning, Deep Learning, Generative AI, Computer Vision, NLP, Data Engineering, ETL Pipelines, Analytics Platforms, and Real-World AI Applications
  • 💡 Experienced in developing projects involving Predictive Analytics, Recommendation Systems, Fraud Detection, Medical Image Classification, Document Intelligence, Sentiment Analysis, Workflow Automation, and Interactive Dashboards
  • 📊 Skilled in Python, SQL, Scikit-learn, TensorFlow, PyTorch, Hugging Face, LangChain, Streamlit, Power BI, Tableau, MLflow, DVC, Docker, Git, GitHub, and Cloud-Based Data & AI Technologies
  • 🏗️ Strong focus on building scalable, production-ready applications that combine data engineering, machine learning, deep learning, MLOps, and software engineering best practices
  • 🎯 Aspiring to contribute to innovative teams solving complex business problems through AI-driven, data-centric solutions

🌟 Academic Background

  • 🎓 Master of Science in Computer Science
    University of Central Missouri, USA | Jan 2023 - May 2024
  • 🎓 Bachelor of Technology in Electrical and Electronics Engineering
    JB Institute of Engineering and Technology, India | Aug 2015 - July 2019

πŸ† Certifications

AWS ML Engineer Associate | AWS Cloud Practitioner

Certified In

  • ☁️ AWS Machine Learning Engineering
  • 🚀 AWS Cloud Practitioner

🤝 Let’s Connect

LinkedIn | GitHub | Email


🧰 Core Tech Stack

Tools & Technologies

Python SQL R Pandas NumPy SciPy Scikit-Learn TensorFlow Keras PyTorch Matplotlib Seaborn Plotly ggplot Tableau Power BI Excel Apache Airflow Prefect MySQL SQL Server MongoDB PostgreSQL Snowflake Redshift Apache Spark AWS Azure Machine Learning MLOps MLflow DVC HuggingFace Generative AI LLMs LangChain RAG Prompt Engineering FAISS Vector Databases Sentence Transformers Groq NLP Streamlit Git Git Bash GitHub Actions Docker Jupyter Anaconda VS Code Linux PowerShell XGBoost LightGBM OpenCV Computer Vision Deep Learning Transfer Learning Ensemble Learning Explainable AI Grad-CAM CNN MobileNetV2 VGG16 ResNet50 PCA PyTest CI/CD Joblib

🧠 Key Areas of Expertise

  • Generative AI & Large Language Models (LLMs) (Retrieval-Augmented Generation (RAG), Prompt Engineering, LangChain, Vector Databases, Semantic Search, Document Intelligence, AI-Powered Knowledge Retrieval)
  • Data Analytics & Business Intelligence (Exploratory Data Analysis, Statistical Analysis, KPI Reporting, Dashboard Development, Business Insights)
  • Machine Learning & Predictive Modeling (Regression, Classification, Clustering, Recommendation Systems, Fraud Detection, Churn Prediction)
  • Deep Learning & Computer Vision (CNNs, DNNs, Transfer Learning, Medical Image Classification, Ensemble Learning, Explainable AI)
  • Natural Language Processing (NLP) (Sentiment Analysis, Transformer Models, Hugging Face, Text Preprocessing, Information Retrieval, Context-Aware AI Systems)
  • Data Engineering & ETL Pipelines (Data Ingestion, Transformation, Workflow Automation, Batch Processing, Data Pipelines, Workflow Orchestration)
  • MLOps & ML Lifecycle Management (DVC, MLflow, CI/CD Pipelines, Automated Testing, Experiment Tracking, Model Versioning, Reproducibility)
  • Model Optimization & Evaluation (Hyperparameter Tuning, Cross Validation, ROC-AUC, Confusion Matrix, Performance Analysis)
  • Cloud & Scalable AI Systems (AWS, Azure, Docker, Streamlit, Cloud-Based ML Workflows, Deployment-Ready AI Applications)
  • Data Visualization & Interactive Dashboards (Power BI, Tableau, Plotly, Streamlit, Matplotlib, Seaborn)
  • Software Engineering & Application Development (Python, Java, Spring Boot, React, REST APIs, Git, GitHub Actions, Modular Architecture, Workflow Automation)

📊 Featured Projects as a Data Science Intern


🤖 Document QA ChatBot: Retrieval-Augmented Generation (RAG) Application

https://github.com/nagasantoshchavvakula/Document-QA-ChatBot.git

A comprehensive Retrieval-Augmented Generation (RAG) based Document Question Answering application designed to enable users to interact with PDF documents using natural language. The system combines Large Language Models (LLMs), vector databases, semantic search, document processing pipelines, and Generative AI techniques to deliver accurate, context-aware responses directly from uploaded documents.

This project simulates a real-world Enterprise Knowledge Assistant capable of extracting, indexing, retrieving, and generating insights from unstructured document data. By integrating modern AI frameworks and vector search technologies, the application demonstrates how organizations can build intelligent document understanding systems for research, compliance, legal analysis, customer support, and enterprise knowledge management use cases.

The project follows a complete end-to-end Generative AI and Retrieval-Augmented Generation lifecycle, integrating document ingestion, text extraction, text chunking, embedding generation, vector indexing, semantic retrieval, prompt engineering, LLM-powered response generation, environment management, debugging, and deployment-ready application development.

Status: ⚙️ In Progress
Core Stack: Python, Streamlit, LangChain, Groq Llama3, HuggingFace Embeddings, FAISS, PyPDFLoader, Sentence Transformers, python-dotenv, Git, GitHub

Generative AI, NLP & Document Intelligence Focus

  • Built an end-to-end Document QA ChatBot using Retrieval-Augmented Generation (RAG) architecture
  • Implemented PDF ingestion, text extraction, chunking, and semantic retrieval using PyPDFLoader, LangChain, and FAISS
  • Generated vector embeddings using HuggingFace BAAI/bge-small-en-v1.5 for efficient similarity search
  • Developed an interactive Streamlit interface for document-based question answering
  • Engineered prompt templates and retrieval workflows to deliver accurate, context-aware responses
  • Applied secure configuration management using python-dotenv and environment variables

LLM Integration & AI Application Development

  • Integrated Groq-hosted Llama3-8B-8192 for document-grounded answer generation
  • Built semantic search pipelines using LangChain Retrieval Chains, vector search, and prompt engineering
  • Optimized document retrieval through chunking strategies and embedding-based nearest-neighbor search
  • Developed scalable PDF processing and AI-powered knowledge retrieval workflows
  • Followed software engineering best practices including modular architecture, dependency management, version control, and reproducible development workflows
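
The chunking and embedding-based nearest-neighbor retrieval described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the repository's code: a bag-of-words count vector stands in for the BAAI/bge-small-en-v1.5 embeddings, a brute-force scan stands in for the FAISS index, and the function names (`chunk_text`, `embed`, `cosine`, `retrieve`) and chunk sizes are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=40, overlap=10):
    """Split text into overlapping character windows (sizes are illustrative)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def embed(text):
    """Toy embedding: word counts (assumption: stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query (brute-force scan)."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


docs = [
    "FAISS stores dense vectors for similarity search.",
    "Streamlit renders the chat interface.",
    "Invoices are due within thirty days.",
]
# The retrieved chunk would then be placed into the LLM prompt as context.
print(retrieve("When are invoices due?", docs, k=1))
```

In a real pipeline the overlap between chunks helps keep sentences that straddle a boundary retrievable from at least one chunk.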

Problem Solving & Engineering Highlights

  • Resolved LangChain version compatibility, Python 3.13 dependency conflicts, and Streamlit configuration issues
  • Migrated the project to Python 3.11 for stable package compatibility
  • Implemented secure API key management by moving credentials from source code to environment variables
  • Improved maintainability through structured project organization and dependency pinning

Skills Demonstrated

  • Generative AI: RAG, Prompt Engineering, LLM Integration
  • NLP: Semantic Search, Information Retrieval, Document Understanding
  • LLMs: Groq Llama3, Context-Aware Response Generation
  • Vector Databases: FAISS, Embedding-Based Search
  • AI Frameworks: LangChain, HuggingFace Embeddings
  • Document Processing: PDF Parsing, Text Chunking
  • Development: Python, Streamlit, API Integration
  • DevOps & Collaboration: Virtual Environments, Git, GitHub, Dependency Management

Goal: Deliver a production-ready Document Intelligence solution that combines Retrieval-Augmented Generation, semantic search, vector databases, and Large Language Models to demonstrate real-world AI engineering, knowledge retrieval, and enterprise-scale document understanding capabilities.


🩺 Image Classification for Medical Diagnosis: Deep Learning Pipeline Project

https://github.com/nagasantoshchavvakula/Image-Classification-For-Medical-Diagnosis.git

A comprehensive deep learning-based medical image classification pipeline designed to detect Pneumonia from Chest X-ray images using advanced neural networks, transfer learning architectures, explainable AI, and ensemble learning techniques. This project simulates a real-world healthcare diagnostic system for assisting medical professionals in making data-driven clinical predictions.

The project follows a complete end-to-end deep learning lifecycle, integrating image preprocessing, neural network training, transfer learning, optimization strategies, explainable AI visualizations, and ensemble-based prediction systems for improved classification performance and model interpretability.

Status: ⚙️ In Progress
Core Stack: Python, TensorFlow, Keras, NumPy, Pandas, Scikit-learn, Matplotlib, Seaborn, DVC, PyTest, GitHub Actions (CI/CD), Jupyter Notebook, Git

Deep Learning & Data Pipeline Focus

  • Implemented structured image data loading pipelines using TensorFlow and Keras ImageDataGenerator
  • Performed image preprocessing, normalization, augmentation, and dataset balancing using class weights
  • Built scalable and reusable deep learning workflows for training, evaluation, and inference
  • Applied data augmentation techniques including rotation, zoom, flipping, and width/height shifting for robust generalization
  • Managed dataset versioning and reproducibility using DVC (Data Version Control)
  • Conducted comprehensive experimentation and notebook-driven analysis for model comparison and optimization
  • Maintained modular project architecture with separate components for data loading, model building, training, and evaluation
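
Dataset balancing with class weights, mentioned above, can be illustrated with a short sketch. It assumes the common "balanced" heuristic, `n_samples / (n_classes * class_count)`, which may differ from the weighting actually used in the project; the function name and the label counts are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency (the 'balanced' heuristic)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical label counts: 1 = PNEUMONIA, 0 = NORMAL (not the real dataset sizes).
labels = [1] * 3 + [0] * 1
weights = balanced_class_weights(labels)
print(weights)  # the rarer class receives the larger weight
```

A dictionary of this shape is what Keras accepts through the `class_weight` argument of `model.fit`, so the minority class contributes more to the loss.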

Model Development, Evaluation & MLOps

  • Developed and optimized Baseline MLP, Deep Neural Networks, and Transfer Learning models using TensorFlow and Keras for medical image classification
  • Implemented pretrained architectures including MobileNetV2, VGG16, and ResNet50 with fine-tuning for enhanced model performance
  • Applied advanced training strategies using EarlyStopping, ModelCheckpoint, ReduceLROnPlateau, TensorBoard, and custom learning rate scheduling
  • Performed experimentation with activation functions, optimizers, regularization techniques, and hyperparameter optimization
  • Built ensemble learning models using soft-voting techniques to improve prediction accuracy and robustness
  • Designed end-to-end evaluation pipelines using Accuracy, Precision, Recall, F1-Score, ROC-AUC, Confusion Matrix, and ROC Curves
  • Implemented Grad-CAM based Explainable AI (XAI) techniques for model interpretability and medical image visualization
  • Integrated PyTest, DVC, Git, GitHub, modular pipelines, and MLOps best practices for scalable and reproducible deep learning workflows
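
Soft voting, as used for the ensemble above, averages the class probabilities predicted by each model and then takes the most probable class. The sketch below is a generic illustration with made-up probabilities; the function name `soft_vote` and the optional per-model weights are assumptions, not the project's implementation.

```python
def soft_vote(prob_lists, weights=None):
    """Average per-model class probabilities, then pick the argmax per sample.

    prob_lists: one list per model, each of shape [n_samples][n_classes].
    """
    n_models = len(prob_lists)
    weights = weights or [1.0] * n_models
    total = sum(weights)
    n_samples = len(prob_lists[0])
    n_classes = len(prob_lists[0][0])
    averaged = []
    for i in range(n_samples):
        row = [
            sum(w * probs[i][c] for w, probs in zip(weights, prob_lists)) / total
            for c in range(n_classes)
        ]
        averaged.append(row)
    labels = [max(range(n_classes), key=row.__getitem__) for row in averaged]
    return averaged, labels


# Made-up outputs of three models for two images: [P(normal), P(pneumonia)].
model_a = [[0.6, 0.4], [0.2, 0.8]]
model_b = [[0.3, 0.7], [0.4, 0.6]]
model_c = [[0.4, 0.6], [0.1, 0.9]]
averaged, labels = soft_vote([model_a, model_b, model_c])
print(labels)
```

Unlike hard (majority) voting, soft voting lets a confident model outweigh two uncertain ones, which is why it needs calibrated probability outputs to work well.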

Skills Demonstrated

  • Deep Learning: Neural Networks, Transfer Learning, Ensemble Learning
  • Computer Vision: Medical Image Classification, Chest X-ray Analysis
  • Explainable AI: Grad-CAM Visualization
  • Machine Learning: Model Evaluation, Hyperparameter Optimization
  • Data Processing: Image Augmentation, Preprocessing, Dataset Balancing
  • Programming: Python, TensorFlow, Keras, NumPy, Pandas
  • Visualization: Matplotlib, Seaborn
  • MLOps: DVC, Automated Testing (PyTest), Workflow Management
  • Version Control: Git, GitHub
  • Analytical Thinking: Model Optimization, Diagnostic Performance Analysis

Goal: Deliver a production-ready medical image classification engine that combines deep learning, transfer learning, explainable AI, and ensemble learning techniques to demonstrate scalable healthcare AI workflows, strong analytical capabilities, and deployment-ready deep learning solutions.


🏡 Real Estate Price Prediction Engine: Machine Learning Pipeline Project

https://github.com/nagasantoshchavvakula/Real-Estate-Price-Prediction.git

A comprehensive regression-based machine learning pipeline designed to predict real estate property prices using advanced statistical modeling, unsupervised learning, and recommendation systems. This project simulates a real-world pricing platform for buyers, sellers, and agents to make data-driven property valuation decisions.

The project follows a complete end-to-end ML lifecycle, integrating regression modeling, clustering for market segmentation, recommendation systems, and ensemble learning for improved predictive performance.

Status: ✅ Completed
Core Stack: Python, Pandas, NumPy, Scikit-learn, XGBoost, LightGBM, Matplotlib, Seaborn, PCA (Dimensionality Reduction), PyTest, GitHub Actions (CI/CD), Streamlit, Joblib, Jupyter Notebook, Git

MLOps & Data Focus

  • Implemented structured data pipelines for preprocessing, feature engineering, and transformation
  • Performed comprehensive exploratory data analysis (EDA) to uncover pricing trends and feature relationships
  • Handled missing values, categorical encoding, and feature scaling for robust model performance
  • Applied Principal Component Analysis (PCA) for dimensionality reduction and multicollinearity handling
  • Built modular and reusable ML components for regression, clustering, and recommendation systems
  • Ensured reproducibility through organized workflows, version control, and notebook-based experimentation

Experimentation & Deployment

  • Trained and evaluated multiple regression models including Linear, Ridge, Lasso, ElasticNet, Random Forest, and Gradient Boosting
  • Performed hyperparameter tuning using GridSearchCV and RandomizedSearchCV
  • Designed evaluation pipelines using RMSE, MAE, and R² for regression performance comparison
  • Implemented clustering workflows (K-Means, Hierarchical, DBSCAN) with validation using Silhouette Score
  • Built recommendation systems (content-based, collaborative, hybrid) for personalized property suggestions
  • Developed ensemble models (Voting & Stacking) to improve predictive accuracy
  • Created an interactive Streamlit dashboard for real-time predictions and insights
  • Integrated automated testing using PyTest and CI/CD pipelines via GitHub Actions
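
For reference, the three regression metrics named above can be computed from their definitions in a few lines. This is a plain-Python sketch with made-up prices, not the project's evaluation code (which presumably relies on Scikit-learn); the function name is hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, and R^2 computed directly from their definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all targets are identical")
    r2 = 1.0 - ss_res / ss_tot
    return {"rmse": rmse, "mae": mae, "r2": r2}


# Made-up prices in thousands, for illustration only.
print(regression_metrics([100, 200, 300], [110, 190, 300]))
```

RMSE penalizes large errors more heavily than MAE, so comparing the two gives a quick signal of whether a model's error is dominated by a few badly mispriced properties.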

Skills Demonstrated

  • Machine Learning: Regression Modeling, Clustering, Recommendation Systems, Ensemble Learning
  • Data Science: EDA, Feature Engineering, Dimensionality Reduction (PCA)
  • Programming: Python (Pandas, NumPy, Scikit-learn)
  • Visualization: Matplotlib, Seaborn
  • MLOps: CI/CD Pipelines, Automated Testing (PyTest), Workflow Management
  • Deployment: Streamlit Dashboard, Model Serialization (Joblib)
  • Analytical Thinking: Model Evaluation, Performance Optimization, Business Insights

Goal: Deliver a production-ready real estate pricing engine that combines regression modeling, market segmentation, and recommendation systems, demonstrating scalable machine learning workflows, strong analytical capabilities, and deployment-ready solutions.


📉 Telco Customer Churn Prediction: Machine Learning Pipeline Project

https://github.com/nagasantoshchavvakula/Customer-Churn-Prediction

An end-to-end machine learning pipeline project to predict customer churn for a telecommunications company using real-world business data. The project follows the complete ML lifecycle, from exploratory data analysis and preprocessing to model development, hyperparameter tuning, and deployment-ready model serialization.

This project emphasizes reproducibility, automated testing, and CI/CD workflows, demonstrating best practices for building production-ready ML systems.

Status: ✅ Completed
Core Stack: Python, Pandas, NumPy, Scikit-learn (Preprocessing, Model Development & Evaluation), XGBoost, Matplotlib, Seaborn, SMOTE (Imbalanced-learn), PyTest, GitHub Actions (CI/CD), Joblib (Model Serialization), Jupyter Notebook, Git

Machine Learning & Analytics Focus

  • Conducted Exploratory Data Analysis (EDA) to understand customer behavior and churn patterns
  • Computed statistical summaries (distributions and correlations) and ran hypothesis tests
  • Built preprocessing pipelines for missing value handling, categorical encoding, and feature scaling
  • Implemented stratified train-test splitting to preserve class distribution
  • Developed multiple ML models: Logistic Regression, KNN, Decision Tree, Random Forest, XGBoost
  • Addressed class imbalance using SMOTE techniques
  • Evaluated models using Accuracy, Precision, Recall, F1-score, ROC-AUC, confusion matrices, and ROC curves
  • Performed hyperparameter tuning with GridSearchCV and RandomizedSearchCV
  • Selected and serialized the best-performing model and scaler for production use
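
Stratified splitting, listed above, keeps the churn/non-churn ratio the same in the training and test sets. The sketch below shows the idea on index lists using only the standard library; it is an illustration under assumed label names and is not the Scikit-learn call the project presumably uses.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=42):
    """Split sample indices so each class keeps roughly the same test share."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # seeded for reproducible splits
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical imbalanced labels: 8 retained customers, 4 churned.
labels = ["stay"] * 8 + ["churn"] * 4
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Resampling methods such as SMOTE are normally applied to the training indices only, after a split like this, so that synthetic samples never leak into the test set.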

Systems & MLOps Direction

  • Automated testing using PyTest
  • CI/CD pipelines implemented with GitHub Actions for reproducibility and deployment readiness
  • Data visualization for model insights using Matplotlib and Seaborn
  • Workflow version control and experiment tracking using Git and Jupyter Notebook

Skills Demonstrated

  • Python programming for ML pipelines
  • Data preprocessing & feature engineering
  • Model evaluation & hyperparameter optimization
  • Ensemble learning with XGBoost
  • Imbalanced data handling and statistical analysis
  • CI/CD for ML workflows
  • Model serialization and deployment readiness

Goal: Build a production-ready ML pipeline that predicts customer churn, provides actionable business insights, and demonstrates reproducible, automated, and deployment-ready ML workflows.


🤖 ML Lifecycle & MLOps Sentiment Analysis System: End-to-End NLP Pipeline

https://github.com/nagasantoshchavvakula/Sentiment-Analysis-MLOps

An end-to-end Machine Learning Lifecycle (MLOps) project for sentiment analysis using HuggingFace transformer models. The project demonstrates industry-standard practices including data versioning, experiment tracking, automated CI/CD pipelines, and cloud-ready deployment. Transfer learning is applied to fine-tune a pre-trained NLP model for binary sentiment classification, ensuring scalable, reproducible, and production-ready ML workflows.

Status: ⚙️ In Progress
Core Stack: Python, HuggingFace Transformers (DistilBERT), PyTorch, Scikit-learn, DVC, MLflow, Flask, Docker, GitHub Actions (CI/CD)

MLOps & Data Focus

  • Implemented data lifecycle management using DVC for dataset and model artifact versioning
  • Built preprocessing pipelines for text cleaning and tokenization
  • Performed exploratory data analysis (EDA) on text data
  • Fine-tuned DistilBERT for sentiment classification
  • Developed reproducible ML pipelines with parameter tracking and dependency management

Experimentation & Deployment

  • Integrated MLflow for experiment tracking, hyperparameter logging, and model versioning
  • Designed evaluation workflows including accuracy, precision, recall, F1-score, and confusion matrix
  • Built a RESTful Flask API for real-time and batch sentiment predictions
  • Created automated CI/CD pipelines using GitHub Actions for testing, training, and deployment
  • Containerized the application using Docker for consistent deployments
  • Implemented monitoring-ready endpoints for production health checks and model performance tracking
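
The evaluation workflow above (accuracy, precision, recall, F1-score, confusion matrix) reduces to four counts for a binary sentiment task. The following is a minimal sketch with made-up predictions; the function name and the convention that `1` means positive sentiment are assumptions.

```python
def binary_classification_report(y_true, y_pred, positive=1):
    """Derive confusion-matrix counts and the usual metrics for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels: 1 = positive sentiment, 0 = negative sentiment.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
print(binary_classification_report(y_true, y_pred))
```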

Skills Demonstrated

  • Machine Learning & NLP: Transfer Learning, Sentiment Analysis, Transformer Models
  • MLOps Tools: DVC, MLflow, CI/CD pipelines
  • Programming: Python
  • Deployment & DevOps: Flask API, Docker, GitHub Actions
  • Data Engineering: Versioning, Pipeline Orchestration
  • Cloud & Production Concepts: Model Serving, Reproducibility

Goal: Deliver an industry-standard, production-ready MLOps pipeline for NLP sentiment analysis that demonstrates best practices in experiment tracking, deployment, and reproducibility.


🤖 Introduction to Machine Learning: Core ML Implementation Project

https://github.com/nagasantoshchavvakula/Intro_ML_Starter_Code_Implementation

A foundational machine learning implementation project designed to demonstrate the core components of a typical ML workflow, including data preprocessing, dataset splitting, model training, prediction, and evaluation.

The project focuses on building reusable Python functions that implement key machine learning operations using NumPy and Scikit-learn, providing a hands-on understanding of how regression and classification models are trained and evaluated in real-world data science pipelines.

This project emphasizes clean data preparation, modular ML pipeline design, and reliable model evaluation, reflecting practical machine learning development workflows used in analytics and AI systems.

Status: ✅ Completed
Core Stack: Python, NumPy, scikit-learn, pytest, Git, GitHub

Machine Learning Pipeline Focus

  • Implemented feature normalization using min–max scaling to bring dataset inputs onto a common scale
  • Designed flexible missing value imputation strategies including mean, median, and zero replacement
  • Built dataset train-test splitting functions to support proper model validation
  • Trained Linear Regression models for continuous value prediction tasks
  • Developed Logistic Regression classifiers for binary classification problems
  • Created reusable prediction functions to generate outputs on unseen data
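
The imputation strategies and min–max scaling listed above can be sketched for a single feature column as follows. This uses only the standard library for illustration (the project itself uses NumPy and Scikit-learn); the function names and the use of `None` for missing values are assumptions.

```python
import statistics


def impute(values, strategy="mean"):
    """Replace None entries using the mean, median, or zero."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "zero":
        fill = 0.0
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: avoid division by zero
    return [(v - lo) / (hi - lo) for v in values]


column = [2.0, None, 4.0, 10.0]
filled = impute(column, strategy="median")
print(min_max_scale(filled))  # [0.0, 0.25, 0.25, 1.0]
```

In a train/test setting the fill value and the min/max should be computed on the training split only and then reused on the test split.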

Model Evaluation & Validation

  • Implemented regression evaluation using Mean Squared Error (MSE)
  • Calculated classification performance using Accuracy metrics
  • Validated ML pipeline functionality using automated unit testing with pytest
  • Ensured reproducibility and correctness across preprocessing, training, and prediction stages

Engineering & Development Practices

  • Developed modular Python functions for reusable ML workflows
  • Applied automated testing to validate model pipelines
  • Used Git and GitHub for version control and project management
  • Maintained clean documentation with structured function docstrings and examples

Skills Demonstrated

  • Python programming for machine learning
  • Data preprocessing and feature scaling
  • Regression and classification model development
  • Model performance evaluation and testing
  • Building reliable and modular ML pipelines

Goal: Demonstrate a practical understanding of machine learning fundamentals by implementing a complete ML workflow from data preprocessing to model evaluation using industry-standard Python libraries.


πŸ›’ Ecommerce Fraud Detection β€” End-to-End Pipeline

https://github.com/nagasantoshchavvakula/Ecommerce-Fraud-Detection-End-to-End-Data-Pipeline

A production-style data engineering project that implements an automated pipeline for detecting fraudulent e-commerce transactions. The system processes raw transaction data, performs cleaning and feature engineering, and loads analytics-ready datasets into a MySQL analytics layer to support fraud monitoring and BI dashboards.

This project simulates a real-world enterprise data pipeline, incorporating ETL orchestration, modular workflow design, and monitoring to ensure reliable processing of transactional data used for fraud analysis.

Status: ✅ Completed
Core Stack: Python (Pandas, SQLAlchemy), Prefect (Workflow Orchestration), MySQL (Staging & Analytics), SQL (Analytical Queries), Data Engineering (ETL Pipelines, Feature Engineering, Data Modeling)


Data Engineering Focus

  • Designed an automated ETL pipeline to ingest, transform, and load e-commerce transaction data
  • Processed raw CSV datasets and stored them in MySQL staging tables for controlled transformation
  • Built modular ETL tasks (Extract → Transform → Load) orchestrated using Prefect workflows
  • Developed analytics-ready datasets optimized for fraud detection queries and BI dashboards
  • Implemented data validation, schema standardization, and feature engineering during transformation
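
The Extract → Transform → Load structure described above can be sketched as three small functions. To keep the example self-contained, the standard-library `sqlite3` module stands in for MySQL and plain function calls stand in for Prefect tasks; the column names, table name, and sample rows are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; the real dataset's columns may differ.
RAW_CSV = """transaction_id,amount,country
t1,120.50,US
t2,,GB
t3,75.00,us
"""


def extract(raw_text):
    """Read raw CSV rows into dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Validate and standardize rows: drop missing amounts, upper-case countries."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # validation rule (assumed): skip rows without an amount
        cleaned.append((row["transaction_id"], float(row["amount"]), row["country"].upper()))
    return cleaned


def load(rows, conn):
    """Write cleaned rows into an analytics table (idempotent on re-runs)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS transactions_clean "
        "(transaction_id TEXT PRIMARY KEY, amount REAL, country TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO transactions_clean VALUES (?, ?, ?)", rows)
    conn.commit()


def run_pipeline(raw_text, conn):
    """Run the three stages in order and return the number of loaded rows."""
    load(transform(extract(raw_text)), conn)
    return conn.execute("SELECT COUNT(*) FROM transactions_clean").fetchone()[0]


conn = sqlite3.connect(":memory:")
print(run_pipeline(RAW_CSV, conn))
```

Keeping each stage a pure function of its input makes it straightforward to wrap the stages as orchestrator tasks later and to retry a failed stage without repeating the others.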

Fraud Analytics & Feature Engineering

  • Engineered fraud detection features such as promo misuse, device-location mismatch, and transaction anomalies
  • Generated key fraud metrics including fraud rate, suspicious user patterns, and high-risk country indicators
  • Designed SQL analytical queries to detect abnormal transaction behaviors and high-risk merchant categories
  • Created aggregated KPIs enabling drill-down analysis at the transaction and user levels
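
As an illustration of the feature and KPI ideas above, the sketch below derives a device-location mismatch flag and computes a fraud rate per group. The field names (`billing_country`, `device_country`, `is_fraud`) and the sample records are hypothetical, and the project itself expresses these metrics as SQL queries rather than Python.

```python
from collections import defaultdict


def add_mismatch_flag(tx):
    """Flag transactions whose device country differs from the billing country."""
    flagged = dict(tx)
    flagged["device_location_mismatch"] = tx["device_country"] != tx["billing_country"]
    return flagged


def fraud_rate_by(transactions, key):
    """Share of fraudulent transactions within each value of `key`."""
    totals = defaultdict(int)
    frauds = defaultdict(int)
    for tx in transactions:
        totals[tx[key]] += 1
        frauds[tx[key]] += int(tx["is_fraud"])
    return {group: frauds[group] / totals[group] for group in totals}


# Made-up records for illustration only.
transactions = [
    {"id": "t1", "billing_country": "US", "device_country": "US", "is_fraud": False},
    {"id": "t2", "billing_country": "US", "device_country": "GB", "is_fraud": True},
    {"id": "t3", "billing_country": "DE", "device_country": "DE", "is_fraud": False},
    {"id": "t4", "billing_country": "DE", "device_country": "US", "is_fraud": True},
]
flagged = [add_mismatch_flag(tx) for tx in transactions]
print(fraud_rate_by(flagged, "device_location_mismatch"))
print(fraud_rate_by(flagged, "billing_country"))
```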

Workflow Automation & Monitoring

  • Automated pipeline orchestration using Prefect with scheduling, retries, and workflow logging
  • Implemented modular task-based ETL architecture for scalable data processing
  • Enabled real-time monitoring and failure recovery using Prefect UI
  • Designed pipelines to support reliable batch processing for large transaction datasets

Skills Demonstrated

  • Python for data engineering (Pandas, SQLAlchemy)
  • ETL pipeline development and workflow orchestration
  • Prefect for scheduling, monitoring, and pipeline automation
  • MySQL database design (staging and analytics schemas)
  • SQL analytics and fraud detection metrics
  • Designing BI-ready datasets for reporting and dashboards

Goal: Build a scalable data engineering pipeline for fraud detection analytics, demonstrating how automated ETL workflows, feature engineering, and SQL analytics can transform raw transactional data into actionable fraud insights for business intelligence systems.


📱 YouTube–TikTok Short Form Video Analytics: End-to-End Data Analytics Pipeline

https://github.com/nagasantoshchavvakula/YouTube-TikTok-Short-Form-Video-Analytics

An end-to-end data analytics pipeline designed to analyze and visualize engagement trends from short-form video platforms such as YouTube Shorts and TikTok. The project demonstrates the complete analytics lifecycle, from raw data ingestion to interactive dashboard visualization, using MySQL, Python, DVC, and Streamlit.

The system focuses on transforming raw social media datasets into actionable engagement insights, enabling exploration of metrics such as views, likes, shares, comments, and trending patterns through interactive dashboards.

Status: ✅ Completed
Core Stack: Python, Pandas, NumPy, MySQL, Streamlit, Plotly, Matplotlib, DVC, Git, GitHub, Virtualenv, Power BI

Data Engineering & Analytics Focus

  • Designed an automated pipeline to ingest raw CSV datasets from Kaggle into a structured MySQL analytics database
  • Performed data cleaning, transformation, and feature engineering using Python (Pandas)
  • Built modular Python scripts for ingestion, processing, and analysis workflows
  • Conducted exploratory data analysis (EDA) to identify engagement patterns across short-form content
  • Generated analytics-ready datasets for dashboard visualization and reporting

Dashboard & Visualization Layer

  • Developed an interactive Streamlit dashboard to visualize engagement trends and performance metrics
  • Built a Power BI dashboard connected to MySQL, enabling deeper business insights through KPI tracking and visual analytics
  • Designed visualizations including trend lines, engagement comparisons, and category-based insights

Data Management & Reproducibility

  • Implemented Data Version Control (DVC) to track dataset versions and ensure reproducible data workflows
  • Organized project structure with modular scripts and reproducible pipelines
  • Managed source code, collaboration, and versioning using Git and GitHub

Engineering Practices

  • Modular pipeline architecture separating ingestion, processing, and analysis layers
  • Version-controlled datasets and scripts for reproducible analytics workflows
  • Designed the project to scale for larger social media datasets and additional analytics dashboards

Goal: Demonstrate an end-to-end data analytics workflow that integrates data engineering, analysis, and dashboard visualization, transforming raw social media datasets into actionable engagement insights.


πŸš— Sales Performance Optimization Pipeline β€” End-to-End ETL & BI Analytics System

https://github.com/nagasantoshchavvakula/bmw-car-sales_Performance-Optimization-Pipeline

An end-to-end data engineering and business intelligence project designed to automate the extraction, transformation, and loading (ETL) of vehicle sales data into a structured MySQL database while generating interactive analytics dashboards in Power BI.

The project focuses on identifying key regional and vehicle factors that drive high sales performance, enabling data-driven decision-making through automated data pipelines and real-time business intelligence reporting.

Status: ✅ Completed
Core Stack: Python (Pandas, SQLAlchemy), Prefect, MySQL, Power BI, Excel

Data Engineering & ETL Pipeline

  • Designed and implemented a complete ETL pipeline using Python, Prefect, and MySQL for automated data ingestion and transformation
  • Built modular Prefect workflows using @task and @flow decorators to manage pipeline dependencies and data processing steps
  • Performed comprehensive data auditing and schema design to ensure clean, consistent, and analytics-ready database structures
  • Automated loading of transformed datasets into a centralized MySQL data warehouse

Analytics & Business Intelligence

  • Developed an interactive Power BI dashboard connected to the MySQL database for real-time sales analytics
  • Created KPI visualizations to track sales trends by region, vehicle model, and sales classification
  • Enabled business stakeholders to quickly identify top-performing regions and vehicle categories

Data Analysis Focus

  • Identified key drivers behind “High” sales classifications through exploratory data analysis
  • Applied structured data profiling and preprocessing using Excel and Python
  • Delivered actionable insights supporting sales performance optimization and strategic decision-making

Skills Demonstrated

  • Data pipeline design and ETL automation
  • Workflow orchestration with Prefect
  • Relational database schema design and SQL integration
  • Business intelligence dashboard development
  • Data auditing and structured data transformation

Goal: Build a scalable data pipeline that automates sales data processing while enabling business stakeholders to analyze regional and product-level sales performance through interactive dashboards.


💰 Personal Finance Tracker & Investment Portfolio Analyzer: Financial Analytics System

https://github.com/nagasantoshchavvakula/Personal-Finance-Tracker-and-Investment-Portfolio-Analyzer

A Python-based personal finance analytics system designed to track expenses, manage budgets, and analyze investment portfolios through automated data workflows. The project demonstrates the integration of Python programming, financial modeling, and workflow orchestration to transform raw financial data into actionable insights.

This system focuses on financial data analysis, automation, and reproducible data pipelines, enabling users to monitor spending behavior, evaluate investment performance, and generate analytical reports for better financial decision-making.

Status: ✅ Completed
Core Stack: Python, Object-Oriented Programming (OOP), NumPy, pandas, Matplotlib, Prefect, DVC, Git

Financial Analytics & Data Processing Focus

  • Tracking financial transactions, expenses, and budget allocations using structured data models
  • Performing statistical and time-series analysis on spending patterns and investment performance
  • Processing financial datasets using Pandas and NumPy for numerical and analytical computations
  • Generating reports and visualizations to evaluate budget adherence and portfolio growth
  • Extracting insights on spending trends and investment returns through analytical workflows
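
The spending-trend and investment-return calculations described above can be sketched roughly as follows. This is a minimal illustration and not the repository's code: it uses only the Python standard library instead of Pandas and NumPy so that it runs without dependencies, the sample figures are made up, and the helper names `monthly_totals` and `simple_return` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Made-up sample data for illustration: (ISO date, category, amount spent).
expenses = [
    ("2024-01-05", "rent", 1000.0),
    ("2024-01-12", "food", 250.0),
    ("2024-02-05", "rent", 1000.0),
    ("2024-02-14", "food", 350.0),
]


def monthly_totals(rows):
    """Sum spending per calendar month, keyed by 'YYYY-MM'."""
    totals = defaultdict(float)
    for day, _category, amount in rows:
        totals[day[:7]] += amount
    return dict(totals)


def simple_return(cost_basis, current_value):
    """Fractional gain or loss relative to the amount invested."""
    if cost_basis <= 0:
        raise ValueError("cost_basis must be positive")
    return (current_value - cost_basis) / cost_basis


totals = monthly_totals(expenses)
average_month = mean(totals.values())
growth = simple_return(cost_basis=2000.0, current_value=2200.0)
print(totals, average_month, growth)
```

Grouping by the `YYYY-MM` prefix keeps the sketch short. With Pandas, the same idea would typically be expressed as a group-by over a parsed date column.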

System Architecture & Engineering Practices

  • Developed modular Python architecture using Object-Oriented Programming (OOP) principles
  • Designed reusable classes for transactions, accounts, and investment portfolios
  • Implemented data validation and structured financial data models
  • Built reproducible data pipelines using Data Version Control (DVC)
  • Automated scheduled financial analysis using Prefect workflow orchestration
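
As a rough picture of the class design described above, the sketch below models transactions and accounts with basic validation. The class names, fields, and validation rules are assumptions made for illustration. The actual repository may structure these differently.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class Transaction:
    """A single dated entry (hypothetical model): positive = income, negative = expense."""
    when: date
    category: str
    amount: float

    def __post_init__(self):
        # Basic validation: reject blank categories and zero amounts.
        if not self.category.strip():
            raise ValueError("category must not be empty")
        if self.amount == 0:
            raise ValueError("amount must be non-zero")


@dataclass
class Account:
    """Holds transactions and derives totals from them."""
    name: str
    transactions: list = field(default_factory=list)

    def add(self, tx: Transaction) -> None:
        self.transactions.append(tx)

    def balance(self) -> float:
        return sum(tx.amount for tx in self.transactions)

    def spending_by_category(self) -> dict:
        # Only expenses (negative amounts) count as spending.
        totals = {}
        for tx in self.transactions:
            if tx.amount < 0:
                totals[tx.category] = totals.get(tx.category, 0.0) - tx.amount
        return totals


# Made-up example entries.
account = Account("checking")
account.add(Transaction(date(2024, 1, 5), "salary", 3000.0))
account.add(Transaction(date(2024, 1, 9), "groceries", -120.5))
account.add(Transaction(date(2024, 1, 20), "groceries", -79.5))
print(account.balance(), account.spending_by_category())
```

Validating inside `__post_init__` means an invalid transaction can never be constructed, so the reporting methods do not need to re-check their inputs.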

Skills Demonstrated

  • Python programming and OOP system design
  • Financial data analysis and modeling
  • Pandas and NumPy for structured data processing
  • Workflow automation using Prefect
  • Reproducible data pipelines with DVC
  • Data visualization and reporting with Matplotlib
  • Git-based version control and project organization

Goal: Demonstrate the design of an automated financial analytics system that combines data engineering, statistical analysis, and workflow automation to support personal finance monitoring and investment decision-making.


📊 Exploratory Data Analysis & Visualization of Student Performance – Interactive Analytics Dashboard

https://github.com/nagasantoshchavvakula/EDA_Student_Performance

An end-to-end data analysis and visualization project focused on exploring the factors influencing student exam performance using a Kaggle dataset. The project demonstrates how Python-based data analytics workflows can transform raw educational datasets into meaningful insights through exploratory data analysis, statistical visualization, and interactive dashboards.

The project integrates data preprocessing, statistical analysis, and interactive visualization to enable dynamic exploration of student performance metrics, helping identify patterns across demographic, socioeconomic, and academic variables.

Status: ✅ Completed
Core Stack: Python, pandas, NumPy, matplotlib, seaborn, plotly, Streamlit, GitHub, Cloud Deployment

Data Analysis & EDA Focus

  • Cleaned and preprocessed the Kaggle Student Performance in Exams dataset by handling missing values, duplicates, and data inconsistencies
  • Conducted exploratory data analysis to uncover patterns and relationships between exam scores and demographic variables
  • Performed statistical analysis and correlation analysis across math, reading, and writing scores
  • Engineered meaningful features and calculated key performance indicators (KPIs) for student performance evaluation
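
The correlation analysis across math, reading, and writing scores can be illustrated with a small Pearson correlation sketch. The scores below are made up and are not taken from the Kaggle dataset, the `pearson` helper is hypothetical, and the project itself presumably relies on pandas for this step rather than hand-written code.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / sqrt(var_x * var_y)


# Made-up scores for five students (illustration only).
scores = {
    "math": [55, 62, 70, 81, 90],
    "reading": [60, 58, 75, 85, 88],
    "writing": [58, 61, 72, 88, 91],
}

# Pairwise correlation matrix as a nested dict.
matrix = {a: {b: pearson(scores[a], scores[b]) for b in scores} for a in scores}
for subject, row in matrix.items():
    print(subject, {other: round(value, 3) for other, value in row.items()})
```

A matrix like this is what a correlation heatmap visualizes: each cell is the coefficient for one pair of score columns.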

Data Visualization & Insights

  • Built multiple visualization types including histograms, bar charts, scatter plots, box plots, and heatmaps
  • Used Seaborn, Matplotlib, and Plotly to visually communicate trends, correlations, and distribution patterns
  • Identified key performance drivers such as parental education level, lunch type, and test preparation courses

Interactive Dashboard Development

  • Developed a Streamlit-based interactive analytics dashboard for dynamic exploration of student performance data
  • Implemented filters and KPIs allowing users to explore results by gender, race/ethnicity, education level, and exam category
  • Designed an intuitive interface to make complex data insights accessible for non-technical users

Deployment & Data Application Engineering

  • Deployed the Streamlit application using Streamlit Community Cloud
  • Integrated GitHub for version control and automated deployment workflows
  • Enabled cloud-based access for interactive exploration of insights from anywhere

Skills Demonstrated

  • Exploratory Data Analysis (EDA)
  • Data Cleaning & Preprocessing
  • Statistical Analysis & Correlation Analysis
  • Interactive Data Visualization
  • Dashboard Development with Streamlit
  • Cloud Deployment & Version Control

Goal: Demonstrate the ability to perform end-to-end exploratory data analysis, build interactive dashboards, and deploy Python-based data applications that transform raw datasets into accessible and actionable insights.


🔄 DVC CSV Tracker – Data Version Control with Git & DVC

https://github.com/nagasantoshchavvakula/Data_Version_Control_With-DVC_And_Git

A practical data version control project demonstrating how to integrate DVC (Data Version Control) with Git to manage and track changes in structured CSV datasets. The project highlights how modern data teams maintain reproducibility, data integrity, and collaborative workflows by versioning datasets alongside code.

This project focuses on data versioning workflows commonly used in MLOps and data engineering pipelines, ensuring that dataset changes are traceable, reproducible, and synchronized with source code repositories.

Status: ✅ Completed
Core Stack: Git, GitHub, DVC (Data Version Control), CSV Data Management, MLOps Fundamentals, Reproducibility, Command-Line Tools

Data Versioning & MLOps Focus

  • Initialized and configured DVC within a Git repository to manage dataset versioning
  • Tracked structured CSV datasets using DVC data tracking mechanisms
  • Maintained reproducible dataset history while separating data artifacts from source code
  • Demonstrated reproducible workflows commonly used in machine learning and data science pipelines

Workflow & Engineering Practices

  • Used DVC commands such as dvc init and dvc add to track dataset changes
  • Managed dataset metadata files (.dvc) and repository configurations
  • Committed DVC metadata and configuration files to Git for version control
  • Simulated dataset updates and maintained version history through Git commits
  • Pushed repository updates to GitHub to enable collaborative data project workflows
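
The idea behind tracking a dataset through a small metadata file can be sketched conceptually: a `.dvc` file records a content hash of the data, so any change to the CSV produces a new hash that Git then versions. The snippet below only illustrates that hashing idea with the standard library. It does not call DVC, does not reproduce DVC's internals, and the file name `data.csv` and helper `file_md5` are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path, chunk_size=65536):
    """Hash a file's contents in chunks so large datasets need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as workdir:
    csv_path = Path(workdir) / "data.csv"  # hypothetical dataset name
    csv_path.write_text("id,score\n1,80\n")
    hash_v1 = file_md5(csv_path)  # fingerprint of the first dataset version

    # Simulate a dataset update by appending a row.
    with csv_path.open("a") as handle:
        handle.write("2,75\n")
    hash_v2 = file_md5(csv_path)  # fingerprint after the update

print("changed" if hash_v1 != hash_v2 else "unchanged")
```

Because only the small hash-bearing metadata file is committed to Git, the repository history stays lightweight while each commit still pins an exact dataset version.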

Skills Demonstrated

  • Git & GitHub version control
  • Data Version Control (DVC)
  • Reproducible data science workflows
  • CSV dataset management
  • Command-line based data engineering workflows
  • Collaboration practices in data-driven projects

Goal: Demonstrate how to build reproducible data science workflows by integrating Git and DVC for dataset versioning, enabling scalable collaboration and reliable data pipeline management.


📊 Student Performance Analysis – Python Data Exploration Project

https://github.com/nagasantoshchavvakula/Student-Performance-Analysis

A Python-based data analysis project designed to explore and summarize student performance datasets using Pandas. The project demonstrates how structured CSV data can be programmatically processed to extract meaningful insights through statistical analysis and exploratory data techniques.

This project focuses on data exploration, statistical computation, and structured data handling, highlighting how Python can be used to quickly analyze datasets and generate performance insights for decision-making.

Status: ✅ Completed
Core Stack: Python, Pandas, CSV Handling, Data Analysis, Descriptive Statistics, Exception Handling

Data Analysis Focus

  • Reading and processing structured student datasets from CSV files using Pandas
  • Performing exploratory data analysis (EDA) on student attributes such as scores and age
  • Calculating statistical metrics including mean, median, standard deviation, minimum, and maximum values
  • Extracting key information such as student names and previewing top records for quick dataset inspection
  • Generating descriptive summaries to identify performance patterns within the dataset

Data Engineering & Code Quality Practices

  • Implemented structured data handling using Pandas DataFrames
  • Added robust exception handling for missing files and column inconsistencies
  • Designed reusable analysis scripts for quick dataset preview and insight generation
  • Demonstrated practical workflows for cleaning and summarizing real-world tabular datasets
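
The pattern described above (load a CSV, compute descriptive statistics, handle missing files and columns) might look roughly like this. It is a sketch under stated assumptions: it uses the standard library's `csv` and `statistics` modules rather than Pandas so it runs without dependencies, and the function name `summarize_column`, the column names, and the sample rows are made up.

```python
import csv
import statistics
import tempfile
from pathlib import Path


def summarize_column(csv_path, column):
    """Return descriptive statistics for one numeric column, or None on failure."""
    try:
        with open(csv_path, newline="") as handle:
            values = [float(row[column]) for row in csv.DictReader(handle)]
    except FileNotFoundError:
        print(f"File not found: {csv_path}")
        return None
    except KeyError:
        print(f"Column not found: {column}")
        return None
    except (TypeError, ValueError):
        print(f"Column {column} contains non-numeric values")
        return None
    if len(values) < 2:
        print("Need at least two rows to compute a standard deviation")
        return None
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


with tempfile.TemporaryDirectory() as workdir:
    sample = Path(workdir) / "students.csv"  # made-up sample file
    sample.write_text("name,age,score\nAna,20,80\nBen,21,90\nCy,22,100\n")
    summary = summarize_column(sample, "score")
    missing_column = summarize_column(sample, "grade")

missing_file = summarize_column("no_such_file.csv", "score")
print(summary)
```

Returning `None` with a printed message keeps a quick exploration script running after a bad path or column name. A stricter design could re-raise the exception instead.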

Skills Demonstrated

  • Python programming for data analysis
  • Pandas DataFrame manipulation
  • CSV data ingestion and transformation
  • Descriptive statistical analysis
  • Error handling and robust data processing

Goal: Demonstrate the ability to analyze structured datasets using Python and Pandas, perform statistical analysis, and extract actionable insights through efficient data exploration workflows.


🚀 Full-Stack Projects & Internship Experience – Role: Full Stack Web Development Intern

  • Gained hands-on experience in full-stack development using Spring Boot & React
  • Demonstrated strong problem-solving, technical skills, and project execution
  • Awarded Certificate & Letter of Recommendation (📄 View Certificate & LOR)

Skills:

Java, Spring Boot, Spring Security, REST API, WebSockets, Apache Maven, React, JavaScript, npm, HTML5, CSS3, Postman, MySQL, JSON, JWT, Git, Bash, VS Code, IntelliJ IDE

Web Development Projects:

🔐 Secure User Authentication System

  • Built a JWT-based authentication system with login & registration
  • Implemented BCrypt password hashing and secure session handling
  • Designed Spring Security-based protected APIs
  • Integrated React frontend with protected routes

https://github.com/nagasantoshchavvakula/Secure-User-Authentication-System.git
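
To show what a JWT-based login actually exchanges, here is a minimal sketch of issuing and verifying an HS256-signed token. The project itself uses Spring Security in Java. This Python version uses only the standard library, is not the repository's code, and its function names `issue_token` and `verify_token` are hypothetical. A real application should use a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    # JWTs use URL-safe base64 without padding.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str, secret: bytes) -> bytes:
    return hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()


def issue_token(claims: dict, secret: bytes) -> str:
    """Build header.payload.signature for the given claims."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    return signing_input + "." + _b64url(_sign(signing_input, secret))


def verify_token(token: str, secret: bytes, now=None):
    """Return the claims if the signature is valid and the token is unexpired, else None."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        if not isinstance(header, dict) or header.get("alg") != "HS256":
            return None
        expected = _sign(f"{header_b64}.{payload_b64}", secret)
        # Constant-time comparison avoids leaking signature bytes via timing.
        if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
            return None
        claims = json.loads(_b64url_decode(payload_b64))
    except ValueError:
        return None
    if not isinstance(claims, dict):
        return None
    current = time.time() if now is None else now
    if "exp" in claims and current >= claims["exp"]:
        return None
    return claims


token = issue_token({"sub": "alice", "exp": 2000}, b"demo-secret")
print(verify_token(token, b"demo-secret", now=1000))
```

The server keeps no session state here: a protected API only needs the shared secret to check the signature and the `exp` claim on each request.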

👨‍💼 Employee Management System (EMS)

  • Developed a full-stack CRUD application with JWT authentication
  • Admin can manage employee records securely
  • Implemented Spring Boot REST APIs + React frontend
  • Includes validation and role-based access control

https://github.com/nagasantoshchavvakula/Employee-Management-System.git

🌐 Social Media Application

  • Built a full-stack app with posts, likes, comments, and follow system
  • Implemented JWT authentication and secure REST APIs
  • Developed responsive UI with React and backend with Spring Boot

https://github.com/nagasantoshchavvakula/Social-Media-App.git

💬 Real-Time Chat Application

  • Developed a real-time chat system using WebSockets
  • Supports multiple chat rooms and persistent chat history
  • Implemented JWT-based authentication
  • Tech: Spring Boot, WebSocket, React, SockJS, StompJS

https://github.com/nagasantoshchavvakula/Real-Time-Chat-Application.git


🎯 Career Focus

I enjoy building data-driven systems that combine analytics, engineering, artificial intelligence, and machine learning to solve real-world business problems.

Current interests include:

  • Data analytics & visualization
  • Data engineering pipelines
  • Machine learning & AI applications
  • Deep learning & neural networks
  • MLOps and scalable ML systems
  • Cloud-based analytics and AI platforms

📂 Projects & Portfolio

Here are some of my featured projects demonstrating expertise in Data Analytics, Machine Learning, Deep Learning, MLOps, Data Engineering, and Full-Stack Development:

🤖 AI, Machine Learning & Deep Learning

  • Image Classification for Medical Diagnosis – End-to-end deep learning pipeline for pneumonia detection using CNNs, Transfer Learning (MobileNetV2, VGG16, ResNet50), Grad-CAM explainability, ensemble learning, DVC, and CI/CD workflows.
  • Document QA ChatBot – Retrieval-Augmented Generation (RAG) application leveraging Groq Llama3, LangChain, HuggingFace Embeddings, and FAISS for intelligent PDF document question answering, semantic search, context-aware response generation, and enterprise knowledge retrieval.
  • Real Estate Price Prediction Engine – Production-ready ML system combining regression models, clustering, recommendation systems, ensemble learning, Streamlit deployment, and automated testing.
  • Telco Customer Churn Prediction – End-to-end ML pipeline using XGBoost, SMOTE, hyperparameter tuning, CI/CD, and deployment-ready model serialization.
  • ML Lifecycle & MLOps Sentiment Analysis System – Transformer-based NLP solution using DistilBERT, DVC, MLflow, Docker, Flask APIs, and GitHub Actions.

⚙️ Data Engineering & Analytics

📊 Data Analytics & Financial Systems

🌐 Full-Stack Development

  • Employee Management System – Full-stack CRUD application with Spring Boot, React, JWT authentication, validation, and role-based access control.

  • Secure User Authentication System – Spring Security and JWT-based authentication platform featuring secure login, registration, BCrypt password hashing, and protected REST APIs.

  • Social Media Application – Full-stack social networking platform built with Spring Boot and React, supporting posts, likes, comments, user profiles, follow/unfollow functionality, JWT authentication, and secure REST APIs.

  • Real-Time Chat Application – Real-time messaging platform using Spring Boot, WebSockets, React, SockJS, and STOMP, featuring multiple chat rooms, persistent chat history, live communication, and JWT-based authentication.


📊 GitHub Stats

[Images: Repo Stats, Commit Stats]


⭐ Thanks for visiting my GitHub! Feel free to explore my projects and connect if you'd like to collaborate.
