A complete, hands-on DevOps learning journey, from Linux fundamentals to a fully automated, production-grade IoT platform deployed on Kubernetes.

Built by Aaryan Dadhich | BTech CSE (IoT) @ MLVTEC, Bhilwara
| Phase | Topic | Status | Key Tools |
|---|---|---|---|
| Phase 0 | Linux, Bash, Git, Python | ✅ Complete | Bash, Python, Git, Linux |
| Phase 1 | Docker & Containerisation | ✅ Complete | Docker, docker-compose, Docker Hub |
| Phase 2 | CI/CD with GitHub Actions | ✅ Complete | GitHub Actions, pytest, flake8 |
| Phase 3 | AWS Fundamentals | ✅ Complete | AWS CLI, EC2, S3, IAM, VPC |
| Phase 4 | Infrastructure as Code | ✅ Complete | Terraform, HCL |
| Phase 5 | Kubernetes | ✅ Complete | minikube, kubectl, Helm |
| Phase 6 | Monitoring | ✅ Complete | Prometheus, Grafana, Alertmanager |
| Phase 7 | Capstone — NexusIoT | ✅ Complete | MQTT, Kafka, SHAP, FastAPI, K8s |
cloud-ops/
├── phase-0/ # Linux, Bash scripting, Python CLI
│   ├── bash/
│   │   ├── system_monitor.sh # CPU/memory/disk monitor with logging
│   │   └── backup_manager.sh # Timestamped backup with auto-pruning
│   └── python/
│       ├── fetcher.py # CLI API fetcher with argparse + logging
│       └── requirements.txt
│
├── phase-1/ # Docker & Containerisation
│   ├── Dockerfile
│   ├── docker-compose.yml
│   └── README.md
│
├── phase-2/ # CI/CD with GitHub Actions
│   ├── calculator.py
│   ├── test_calculator.py
│   ├── requirements.txt
│   └── README.md
│
├── phase-3/ # AWS Fundamentals
│   └── README.md
│
├── phase-4/ # Terraform: Infrastructure as Code
│   ├── main.tf
│   ├── variables.tf
│   ├── outputs.tf
│   └── README.md
│
├── phase-5/ # Kubernetes
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── configmap.yaml
│   └── README.md
│
├── phase-6/ # Monitoring: Prometheus + Grafana
│   ├── alert-rules.yaml
│   └── README.md
│
├── phase-7/ # Capstone: NexusIoT Platform
│   └── README.md
│
├── .github/
│   └── workflows/
│       └── python-tests.yml # CI pipeline: lint → test → build
│
└── README.md
Goal: Build strong foundations before touching any DevOps tool.
bash/system_monitor.sh: System health monitor
- Checks disk usage, CPU processes, and available memory
- Prints `WARNING` if disk > 80% or memory < 200MB
- Timestamps and appends every check to `monitor.log`
- Loops 3 times with 5-second intervals

chmod +x bash/system_monitor.sh
./bash/system_monitor.sh

bash/backup_manager.sh: Automated backup tool
- Creates timestamped `.tar.gz` archives of a target directory
- Auto-prunes backups older than 2 minutes to manage disk space
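The archive-then-prune behaviour described for `backup_manager.sh` can be sketched in Python as below. This is an illustration, not the script's actual code: the function names `create_backup` and `prune_backups`, the `backup_<timestamp>.tar.gz` naming pattern, and the 120-second default retention are assumptions chosen to mirror the description.

```python
import tarfile
import time
from pathlib import Path


def create_backup(source: Path, dest_dir: Path) -> Path:
    """Write a timestamped .tar.gz archive of `source` into `dest_dir`."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d_%H%M%S")
    archive = dest_dir / f"backup_{stamp}.tar.gz"  # hypothetical naming scheme
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source, arcname=source.name)
    return archive


def prune_backups(dest_dir: Path, max_age_seconds: float = 120) -> list[Path]:
    """Delete archives whose modification time is older than the cutoff."""
    cutoff = time.time() - max_age_seconds
    removed = []
    for archive in dest_dir.glob("backup_*.tar.gz"):
        if archive.stat().st_mtime < cutoff:
            archive.unlink()
            removed.append(archive)
    return removed
```

Pruning by modification time rather than by parsing the filename keeps the sketch short; a real script might prefer the timestamp embedded in the name so that copying an archive does not reset its age.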
python/fetcher.py: CLI API data fetcher
- Fetches posts from a public API with `requests`
- Filter by `--user-id` argument via `argparse`
- Full error handling: `ConnectionError`, `Timeout`, invalid inputs
- Structured logging: INFO / WARNING / ERROR levels

cd python
pip install -r requirements.txt
python fetcher.py --user-id 3

- Linux file permissions (`chmod`), process management (`ps`, `lsof`), disk inspection (`df`, `du`)
- Bash scripting: functions, loops, conditionals, `awk` for text parsing
- Python error handling with `requests`: graceful failures vs crashes
- `argparse` for CLI tools + Python `logging` module patterns
- Git workflow: feature branches, rebasing, `--no-ff` merges, `git revert` vs `git reset`
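The `argparse` plus `logging` pattern mentioned above might look roughly like this. It is a sketch under assumptions, not the contents of `fetcher.py`: the helper names (`positive_int`, `build_parser`, `filter_posts`) and the `userId` field are hypothetical, and the network call with `requests` is deliberately left out so the example stays standard-library only.

```python
import argparse
import logging
from typing import Optional

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("fetcher")


def positive_int(value: str) -> int:
    """argparse type callback: accept only integers greater than zero."""
    try:
        number = int(value)
    except ValueError:
        raise argparse.ArgumentTypeError(f"{value!r} is not an integer")
    if number <= 0:
        raise argparse.ArgumentTypeError("user id must be positive")
    return number


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Fetch posts from a public API")
    parser.add_argument("--user-id", type=positive_int,
                        help="only show posts written by this user")
    return parser


def filter_posts(posts: list[dict], user_id: Optional[int]) -> list[dict]:
    """Keep posts whose 'userId' matches; return everything if no filter is set."""
    if user_id is None:
        log.info("no --user-id given, returning all %d posts", len(posts))
        return posts
    matches = [post for post in posts if post.get("userId") == user_id]
    if not matches:
        log.warning("no posts found for user %d", user_id)
    return matches
```

Validating the argument inside a `type=` callback means invalid input is rejected by `argparse` itself with a usage message, before any request is attempted.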
Goal: Package any application to run consistently anywhere.
- `Dockerfile`: Production-ready image using `python:3.11-slim`
- Layer caching optimisation: deps installed before code copy
- `docker-compose.yml`: Multi-container setup (app + Postgres)
- Volume persistence: data survives container restarts
- Docker Hub push: image available publicly

docker build -t cloud-ops:v1 .
docker run cloud-ops:v1
docker-compose up
docker-compose down -v

- Container vs VM: why containers are faster and lighter
- Dockerfile layer caching: why `COPY requirements.txt` comes before `COPY . .`
- `ENTRYPOINT` vs `CMD`: fixed executable vs default arguments
- Docker bridge networking: containers finding each other by name, not IP
- Named volumes vs bind mounts: when to use each
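Why copying `requirements.txt` before the rest of the code helps can be shown with a toy model of layer caching. The sketch below is not Docker's real cache algorithm; it only assumes that each layer's cache key depends on its parent layer, its instruction, and the content of any files it copies. The step lists `DEPS_FIRST` and `CODE_FIRST` and the file contents are made up for illustration.

```python
import hashlib

# Each step is (instruction, files whose content the step copies into the image).
DEPS_FIRST = [
    ("FROM python:3.11-slim", []),
    ("COPY requirements.txt .", ["requirements.txt"]),
    ("RUN pip install -r requirements.txt", []),
    ("COPY . .", ["app.py", "requirements.txt"]),
]
CODE_FIRST = [
    ("FROM python:3.11-slim", []),
    ("COPY . .", ["app.py", "requirements.txt"]),
    ("RUN pip install -r requirements.txt", []),
]


def layer_keys(steps, files):
    """Return one cache key per step, chained to the key of the parent layer."""
    keys, parent = [], ""
    for instruction, copied in steps:
        digest = hashlib.sha256()
        digest.update(parent.encode())
        digest.update(instruction.encode())
        for name in sorted(copied):
            digest.update(name.encode())
            digest.update(files[name].encode())
        parent = digest.hexdigest()
        keys.append(parent)
    return keys


def reused_layers(before, after):
    """Count leading layers whose keys are unchanged between two builds."""
    count = 0
    for old, new in zip(before, after):
        if old != new:
            break
        count += 1
    return count
```

In this model, editing only `app.py` leaves the first three `DEPS_FIRST` layers reusable (including the slow `pip install`), while `CODE_FIRST` can reuse only the base image layer.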
Goal: Every code push automatically lints, tests, builds, and ships.
git push → flake8 lint → pytest (Python 3.10 + 3.11 matrix) → Docker build + push to Docker Hub
- lint-and-test: `flake8` + `pytest` on Python 3.10 AND 3.11 simultaneously
- build-and-push: runs only if tests pass; builds Docker image, tags with `:latest` and commit SHA, pushes to Docker Hub
on:
  push:
    branches: [main]
  pull_request:

Branch protection is enabled on main: no merge without passing CI.
- CI vs CD (Delivery vs Deployment): the real difference
- GitHub Actions YAML: `workflow → job → step → action`
- Why `actions/checkout@v3` must run before any step that reads the code: the runner's workspace starts without the repository
- Matrix builds: testing multiple Python versions in parallel
- GitHub Secrets: storing credentials securely
- `needs:` keyword: job dependency chains
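What the `pytest` stage actually executes is a set of plain test functions. The following is a hypothetical sketch of the kind of code `calculator.py` and `test_calculator.py` could contain; the function names `add` and `divide` and their behaviour are assumptions, since the repository's real functions are not shown here.

```python
# --- calculator.py (hypothetical contents) ---
def add(a: float, b: float) -> float:
    return a + b


def divide(a: float, b: float) -> float:
    if b == 0:
        raise ValueError("cannot divide by zero")
    return a / b


# --- test_calculator.py (hypothetical contents) ---
# pytest collects functions named test_* and treats a failed assert as a failure.
def test_add():
    assert add(2, 3) == 5


def test_divide():
    assert divide(6, 3) == 2


def test_divide_by_zero_is_rejected():
    try:
        divide(1, 0)
    except ValueError:
        return
    raise AssertionError("divide(1, 0) should raise ValueError")
```

With a matrix build, the same test functions run once per Python version, so a version-specific regression fails only the matching job.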
Goal: Deploy and manage cloud infrastructure using only the AWS CLI — zero console clicks.
Identity & Access Management (IAM): Kept the root account out of daily use by provisioning a dedicated IAM admin user and generating access keys for CLI work.
Compute (EC2): Provisioned, configured, and SSH-accessed instances from the terminal using cryptographic key pairs (.pem).
Networking (VPC): Engineered a custom VPC from scratch — public/private subnets, internet gateways, and custom route tables for network isolation.
Security Groups: Configured ingress rules that allow only HTTP (80) and SSH (22); security groups deny all other inbound traffic by default.
Object Storage (S3): Provisioned buckets and synced local directories to cloud storage via CLI.
FinOps: Implemented CloudWatch + SNS billing alarm — email alert triggered when costs exceed $1.00.
# Identity verification
aws configure
aws sts get-caller-identity
# Launch EC2
aws ec2 run-instances --image-id <ami> --instance-type t3.micro \
--key-name devops-key --security-group-ids <sg-id>
# Billing alarm (billing metrics are published only in us-east-1)
aws cloudwatch put-metric-alarm --alarm-name "Billing-1USD" \
  --region us-east-1 \
  --metric-name EstimatedCharges --namespace AWS/Billing \
  --dimensions Name=Currency,Value=USD \
  --statistic Maximum --period 21600 --evaluation-periods 1 \
  --threshold 1 --comparison-operator GreaterThanThreshold \
  --alarm-actions <SNS_TOPIC_ARN>

- IAM is AWS security rule #1: never use root for daily work
- Public subnet = route to internet gateway. Private subnet = no internet route
- CLI-first approach forces understanding of every parameter
- Always terminate resources: a running EC2 instance (such as the t3.micro launched above) is billed even when idle
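The public-versus-private subnet rule above comes down to what the route table contains. The sketch below uses Python's standard `ipaddress` module to carve a VPC range into subnets and to look up where a packet would be routed. The CIDR `10.0.0.0/16`, the target labels `local` and `igw`, and the function names are illustrative assumptions, not values taken from the actual deployment.

```python
import ipaddress
import itertools
from typing import Optional


def plan_subnets(vpc_cidr: str, new_prefix: int, count: int) -> list[str]:
    """Split a VPC CIDR block into `count` equally sized subnets."""
    network = ipaddress.ip_network(vpc_cidr)
    subnets = network.subnets(new_prefix=new_prefix)
    return [str(subnet) for subnet in itertools.islice(subnets, count)]


def route_target(destination: str, routes: dict[str, str]) -> Optional[str]:
    """Pick the most specific (longest-prefix) matching route, if any."""
    address = ipaddress.ip_address(destination)
    best = None
    for cidr, target in routes.items():
        network = ipaddress.ip_network(cidr)
        if address in network and (best is None or network.prefixlen > best[0]):
            best = (network.prefixlen, target)
    return best[1] if best else None


# A public subnet's table has a default route to an internet gateway;
# a private subnet's table only knows the VPC-local range.
PUBLIC_ROUTES = {"10.0.0.0/16": "local", "0.0.0.0/0": "igw"}
PRIVATE_ROUTES = {"10.0.0.0/16": "local"}
```

Looking up an external address such as `8.8.8.8` finds the gateway route in `PUBLIC_ROUTES` and nothing in `PRIVATE_ROUTES`, which is the whole difference between the two subnet types in this model.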
Goal: Translate manual AWS CLI architecture into version-controlled, declarative HCL.
Declarative Provisioning: Replaced manual scripting with stateful deployments — VPCs, Security Groups, EC2, and S3 defined in centralized .tf configuration.
Modular Architecture: Dynamic, reusable file structure: main.tf, variables.tf, outputs.tf — no hardcoded values.
Secrets Management: Isolated sensitive parameters into .tfvars files excluded from version control via .gitignore.
Remote State: Migrated terraform.tfstate to a centralized, encrypted AWS S3 backend — enabling multi-developer collaboration and CI/CD integration.
Infrastructure Lifecycle: Full Terraform workflow — plan dry-runs, safe applies, and clean destroys for zero-waste cloud usage.
# Initialize workspace and S3 backend
terraform init
# Format, validate, and dry-run
terraform fmt
terraform validate
terraform plan -var-file="secrets.tfvars"
# Deploy
terraform apply -var-file="secrets.tfvars"
# Clean teardown
terraform destroy -var-file="secrets.tfvars"

- Terraform state is the brain: never manually edit `.tfstate`
- `terraform plan` shows drift between desired and actual state
- `terraform refresh` syncs state with reality after manual AWS changes (newer Terraform releases deprecate it in favour of `terraform apply -refresh-only`)
- Remote state in S3 is essential for team workflows
- Variables + outputs make configs reusable and safe
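The idea that a plan is a diff between desired and actual state can be illustrated with a toy comparison. This is a conceptual sketch only and says nothing about Terraform's internals; the resource addresses and attributes in the usage note are invented for the example.

```python
def plan(desired: dict[str, dict], actual: dict[str, dict]) -> dict[str, list[str]]:
    """Compare declared resources with recorded ones and list the needed actions."""
    create = sorted(name for name in desired if name not in actual)
    destroy = sorted(name for name in actual if name not in desired)
    update = sorted(
        name for name in desired
        if name in actual and desired[name] != actual[name]
    )
    return {"create": create, "update": update, "destroy": destroy}
```

For instance, if the configuration declares an instance of type `t3.micro` while the recorded state says `t3.small`, the resource lands in `update`; a bucket present in state but removed from the configuration lands in `destroy`.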
Goal: Deploy containerised applications on a Kubernetes cluster with self-healing, rolling updates, and proper resource management.
Cluster Setup: Provisioned a local single-node cluster using minikube. Configured kubectl for cluster interaction.
Workloads: Deployed applications using Deployments with 3 replicas. Validated self-healing: the ReplicaSet controller automatically replaces deleted pods with new ones.
Networking: Exposed applications using Services (ClusterIP, NodePort). Understood label selectors for pod targeting.
Configuration Management: Injected non-sensitive config via ConfigMaps and sensitive data via Secrets as environment variables — never hardcoded in images.
Rolling Updates & Rollbacks: Updated deployments to new image versions with zero downtime. Simulated failed deployments and executed instant rollbacks.
Resource Management: Set CPU/memory requests and limits. Configured liveness and readiness probes for production-grade health checking.
Namespaces: Organised cluster workloads into dev and staging namespaces for environment isolation.
# Start cluster
minikube start
# Deploy application
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
# Monitor
kubectl get pods -w
kubectl describe pod <name>
kubectl logs <name>
# Rolling update
kubectl set image deployment/my-app my-app=image:v2
kubectl rollout status deployment/my-app
# Rollback
kubectl rollout undo deployment/my-app
# Scale
kubectl scale deployment my-app --replicas=5
# Access app
minikube service my-app-svc --url

- Pod vs Deployment: never run bare pods in production
- ReplicaSet controller maintains desired replica count automatically (self-healing)
- Services provide stable network identity — pods have dynamic IPs
- ConfigMaps for config, Secrets for sensitive data — never hardcode in images
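- Rolling updates replace pods in batches bounded by `maxSurge` and `maxUnavailable` (25% each by default), not strictly one at a time; this is what enables zero-downtime deployments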
- Resource limits prevent one app from starving others on the cluster
- Liveness probe = restart on failure. Readiness probe = remove from Service on failure
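Self-healing is a reconciliation loop: compare the desired replica count with what is running and close the gap. The toy function below imitates that idea in a few lines. It is not how the ReplicaSet controller is implemented, and the sequential pod names (`my-app-0`, `my-app-1`, ...) are a simplification; real pods get generated suffixes.

```python
def reconcile(desired: int, running: list[str], prefix: str = "my-app") -> list[str]:
    """Return the pod list after one reconciliation pass."""
    pods = list(running)
    next_id = 0
    # Too few pods: create replacements until the desired count is reached.
    while len(pods) < desired:
        candidate = f"{prefix}-{next_id}"
        if candidate not in pods:
            pods.append(candidate)
        next_id += 1
    # Too many pods (for example after scaling down): keep only `desired`.
    return pods[:desired]
```

Calling `reconcile` again on its own output changes nothing, which mirrors why a healthy Deployment stays quiet until a pod disappears or the replica count is edited.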
Goal: Full observability for the Kubernetes cluster — metrics collection, dashboards, and automated alerting.
Stack Deployment: Installed the full kube-prometheus-stack via Helm — Prometheus, Grafana, and Alertmanager deployed as a single chart in a dedicated monitoring namespace.
PromQL Queries: Wrote and executed production-grade queries — CPU usage rates, per-pod memory consumption, container resource utilisation, and pod status tracking.
Grafana Dashboards: Built custom dashboards with 3+ panels (CPU time series, memory gauges, pod status stats). Imported community dashboards (ID 3119) for cluster-wide visibility.
Alert Rules: Authored PrometheusRule CRDs to fire alerts when CPU usage exceeds thresholds for sustained periods. Validated alerting by stress-testing pods.
Architecture Understanding: Prometheus scrapes /metrics endpoints (pull-based). Alertmanager handles routing, deduplication, and silencing. Grafana is the visualisation layer.
# Install via Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack \
-n monitoring --create-namespace
# Access Prometheus UI
kubectl port-forward -n monitoring \
svc/prometheus-kube-prometheus-prometheus 9090:9090
# Access Grafana
kubectl port-forward -n monitoring \
svc/prometheus-grafana 3000:80
# Apply custom alert rules
kubectl apply -f alert-rules.yaml -n monitoring
# Stress test to trigger alerts
kubectl run stress --image=progrium/stress -- --cpu 2

# Target health
up
# CPU usage % per node (5 min window)
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))
# Container CPU usage by pod
sum(rate(container_cpu_usage_seconds_total{namespace="default"}[5m])) by (pod)
# Memory usage by pod (in MiB)
sum(container_memory_usage_bytes{namespace="default"}) by (pod) / 1048576
App pods ──── /metrics endpoint
                │
                ▼ (scrape every 15s)
           Prometheus
                │
      ┌─────────┴─────────┐
      │                   │
   Grafana          Alertmanager
(dashboards)       (routes alerts)
                          │
                ┌─────────┴─────────┐
              Slack               Email
- Prometheus is pull-based: it scrapes `/metrics` endpoints, apps don't push
- Metric types: Counter (only up), Gauge (up/down), Histogram (distribution)
- PromQL `rate()` function calculates per-second rate over a time window
- Grafana is just a visualisation layer: Prometheus is the data source
- Alertmanager handles routing, deduplication, and silencing, separate from Prometheus
- Pending → Firing: alert must breach threshold for the `for:` duration before firing
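What a per-second rate over a counter means can be sketched in a few lines of Python. This is a simplified approximation: it handles counter resets, but it omits the boundary extrapolation that Prometheus applies in `rate()`, so results will not match PromQL exactly. The function name `simple_rate` and the sample format are assumptions.

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase of a counter across time-ordered samples.

    Each sample is a (timestamp_in_seconds, counter_value) pair.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # The counter dropped, so it was reset (e.g. the pod restarted):
            # treat the new value as growth from zero.
            increase += current
    return increase / elapsed
```

Because counters only go up, a raw counter graph is rarely useful on its own; dividing the increase by the window length is what turns it into a readable "per second" series.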
Goal: Bring every skill together — build a production-grade industrial IoT platform from scratch.
NexusIoT — A production-grade industrial IoT telemetry platform with real-time streaming, explainable anomaly detection, and full Kubernetes orchestration.
🔗 Full Project Repository: github.com/MrDadhich456/NexusIoT
git push
→ GitHub Actions (lint + test + matrix build)
→ Docker image built + pushed (tagged with commit SHA)
→ Terraform provisions AWS infrastructure
→ Kubernetes deploys the new image (rolling update)
→ Prometheus monitors it
→ Grafana dashboard shows it's healthy
| Layer | Technology | Purpose |
|---|---|---|
| IoT Protocol | MQTT (Mosquitto) | Device-to-cloud messaging with QoS-1 |
| Message Bus | Apache Kafka | Durable, replayable event streaming |
| Database | TimescaleDB | Time-series optimised PostgreSQL |
| API | FastAPI + WebSocket | REST endpoints + real-time live streams |
| ML/XAI | SHAP + IsolationForest | Explainable anomaly detection |
| Orchestration | Kubernetes (minikube) | Container orchestration |
| IaC | Terraform | AWS infrastructure as code |
| CI/CD | GitHub Actions | Automated lint → test → build → deploy |
| Monitoring | Prometheus + Grafana | Metrics, dashboards, alerting |
Devices (MQTT) → Mosquitto → Kafka → Stream Processor → TimescaleDB
↓ ↓
SHAP Explainer FastAPI + WebSocket
↓ ↓
Anomaly Alerts Live Dashboard
Unlike typical IoT projects that just collect data, NexusIoT tells you why an anomaly was flagged — "spindle RPM drove 68% of this anomaly" — using SHAP explainability stored as JSONB alongside every alert.
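One way a statement like "spindle RPM drove 68% of this anomaly" could be derived is by normalising absolute SHAP values into percentage shares and serialising the result for a JSONB column. The sketch below assumes that approach; NexusIoT's actual calculation is not shown here, and the function names, feature names, and payload keys are hypothetical. It takes precomputed SHAP values as input rather than calling the `shap` library.

```python
import json


def contribution_shares(shap_values: dict[str, float]) -> dict[str, float]:
    """Convert signed SHAP values into percentage shares of total absolute impact."""
    total = sum(abs(value) for value in shap_values.values())
    if total == 0:
        return {name: 0.0 for name in shap_values}
    shares = {
        name: round(100 * abs(value) / total, 1)
        for name, value in shap_values.items()
    }
    # Largest contributor first, so the top driver is easy to read off.
    return dict(sorted(shares.items(), key=lambda item: item[1], reverse=True))


def alert_payload(shap_values: dict[str, float]) -> str:
    """Build a JSON string suitable for storing next to an alert row."""
    shares = contribution_shares(shap_values)
    top_feature = next(iter(shares), None)
    return json.dumps({"top_feature": top_feature, "contributions": shares})
```

Using absolute values means a feature that pushes the score strongly in either direction counts as influential; the sign could be stored separately if direction matters.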
- MQTT protocol — how IoT devices communicate with QoS guarantees
- Kafka as a message bus — decoupling producers from consumers, replay capability
- TimescaleDB hypertables — time-series partitioning for fast range queries
- WebSocket streaming — real-time data push from Kafka to browser (<50ms)
- SHAP explainability — making ML models interpretable (why was this flagged?)
- Kubernetes manifests — Deployments, StatefulSets, Services, HPA, PVCs
- Terraform provisioning — EC2 + security groups + IAM as code
- CI/CD pipelines — multi-stage GitHub Actions with Docker Hub + SSH deploy
- Prometheus custom metrics — instrumenting application code for observability
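The decoupling and replay properties listed for Kafka follow from one design choice: records live in an append-only log, and each consumer tracks its own read position. The toy classes below model a single partition in memory to show that idea. They are not the Kafka client API, and the names `TopicLog` and `Consumer` are invented for this sketch.

```python
class TopicLog:
    """Toy append-only log, loosely modelled on one Kafka partition."""

    def __init__(self) -> None:
        self._records: list = []

    def append(self, record) -> int:
        """Store a record and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset: int, max_records: int = 100) -> list:
        """Return up to `max_records` records starting at `offset`."""
        return self._records[offset:offset + max_records]


class Consumer:
    """Reads from a log and remembers its own position independently."""

    def __init__(self, log: TopicLog, offset: int = 0) -> None:
        self.log = log
        self.offset = offset

    def poll(self) -> list:
        batch = self.log.read(self.offset)
        self.offset += len(batch)
        return batch

    def seek(self, offset: int) -> None:
        """Move the read position, for example back to 0 to replay history."""
        self.offset = offset
```

Reading never removes records, so a second consumer, or the same consumer after `seek(0)`, sees the full history again without the producer being involved.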
Languages: Python, Bash
Version Control: Git, GitHub
Containerisation: Docker, Docker Compose
CI/CD: GitHub Actions
Cloud: AWS (EC2, S3, IAM, VPC)
IaC: Terraform
Orchestration: Kubernetes (kubectl, minikube, Helm)
Monitoring: Prometheus, Grafana, Alertmanager
IoT: MQTT (Mosquitto), Apache Kafka
ML/XAI: SHAP, IsolationForest
API: FastAPI, WebSocket
Database: TimescaleDB (PostgreSQL)
OS: Linux (Ubuntu)
Aaryan Dadhich — 2nd year BTech CSE (IoT) @ MLVTEC, Bhilwara
- 🐙 GitHub: MrDadhich456
- 💼 LinkedIn: linkedin.com/in/MrDadhich456
- 📧 Email: aaryandadhich2006@gmail.com
🎉 All 8 phases complete: from `chmod +x` to a production IoT platform on Kubernetes. Star it if you found it useful. ⭐