SrilakshmiSripathi · SrilakshmiSripathi · Mar 14, 2026 · Mar 14, 2026
diff --git a/....github/ISSUE_TEMPLATE/feature_request.md → .github/ISSUE_TEMPLATE/feature_request.md b/....github/ISSUE_TEMPLATE/feature_request.md → .github/ISSUE_TEMPLATE/feature_request.md
diff --git a/pipelines/ci/.github/bug_report.md → .github/bug_report.md b/pipelines/ci/.github/bug_report.md → .github/bug_report.md
diff --git a/infra/terraform/outpu.tf → .github/workflows/ci-cd.yml b/infra/terraform/outpu.tf → .github/workflows/ci-cd.yml
diff --git a/pipelines/ci/.github/workflows/*.yml → .github/workflows/cleanup.yml b/pipelines/ci/.github/workflows/*.yml → .github/workflows/cleanup.yml
diff --git a/.gitignore b/.gitignore
@@ -1,2 +1,4 @@
 .ds_store
-.venv
+.venv
+.env
+uv.lock
diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
@@ -0,0 +1,174 @@
+HelixScale: HelixScale: BioNeMo-AI HPC Fabric Orchestration Platform
+Version: 2.0 (Professional Portfolio)
+Classification: Enterprise-Simulation / Production Blueprint
+Target Roles: HPC Engineer, Platform Engineering, Data Platform Specialist
+License: MIT
+Date: March 2026
+
+## 📖 1. Executive Summary
+HelixScale is a hybrid infrastructure platform designed to orchestrate High-Performance Computing (HPC) and Artificial Intelligence (AI) workloads efficiently.
+
+This architecture demonstrates how to separate the Control Plane (local development on a non-linux and non-nvidia CUDA GPU Hardware, security scanning, orchestration logic) from the Compute Plane (utilizing cloud GPU nodes, Kubernetes schedulers, NVIDIA drivers), ensuring enterprise-grade reliability, cost control, and reproducibility across macOS, Linux, and Windows (WSL2).
+
+### Key Architectural Pillars:
+
+- Infrastructure-as-Code (IaC): Reproducible environments via Terraform and Ansible.
+- Supply-Chain Security: Automated vulnerability scanning (trivy, snyk) and secrets management integration.
+- FinOps Governance: Automatic resource reclamation to prevent budget overruns (Zero-Cost Idle Policy).
+- Hybrid Scheduling: Support for legacy HPC schedulers (Slurm) alongside cloud-native orchestration (Kubernetes/Volcano).
+
+## 🏗️ 2. High-Level System Design
+HelixScale operates on a Control vs Compute Plane model. This separation ensures that my local machine (Mac M-Series, Linux Workstation) never burns expensive resources while managing the cloud environment remotely.
+
+graph TD
+    Local[Control Plane] -->|SSH/Session Manager| Infra [Terraform IaC & Ansible]
+
+    subgraph "Cloud Compute Plane"
+        Terra -->|Create| VPC[AWS VPC + EKS Cluster]
+        VPC -->|Deploy| K8s[Kubernetes Nodes + GPU Drivers]
+        K8s -->|Schedule| Slurm[Slurm + Volcano Scheduler]
+        Slurm -->|Queue Jobs| Workload[BioNeMo Inference / MPI Jobs]
+        Workload -->|Store Results| S3[AWS S3 Bucket]
+        Infra -->|Monitor| Grafana[Grafana Dashboard + Prometheus Metrics]
+    end
+
+    subgraph "Local Orchestration"
+        Dev -->|Push Code| CI[GitHub Actions CI/CD Pipeline]
+        CI -->|Run Scans| Security[Trivy/Snyk Security Scan]
+        CI -->|Deploy Artifacts| Registry[AWS ECR Private Registry]
+        Security -->|Block If Failed| Dev
+    end
+
+    subgraph "Cost & Safety Controls"
+        FinOps[Cleanup Agent Script] -.->|Auto-terminate idle nodes| K8s
+        Budget[AWS Budget Alert ($2 Limit)] -.->|Stop New Provisions| Terra
+        Policy[Security Policy Rules] -.->|Enforce Pod Standards| K8s
+    end
+
+    S3 -->|Data Ingestion| Workload
+    Dev -->|View Results| Dashboard[Grafana Visualization]
+
+
+
+
+## 🔹 Control Plane (Local Development Machine)
+Responsible for managing infrastructure without heavy compute load.
+
+- OS: macOS (M-Series)
+- Tools: Terraform, Ansible, GitHub Actions, OrbStack/Docker.
+- Tasks: Code development, dependency locking (uv), security scanning, CI/CD triggers.
+🔸 Compute Plane (Cloud Host - AWS/Azure)
+Responsible for executing workloads with GPU acceleration.
+
+OS: RHEL 9 / Ubuntu Server 22.04 LTS.
+Tools: Kubernetes (EKS), Slurm, NVIDIA Container Toolkit.
+Tasks: HPC job scheduling (sbatch), BioNeMo inference, model training.
+
+🔶 Data & Storage Layer
+Manages biological datasets and artifacts securely.
+
+Primary Store: AWS S3 (Encrypted, Object Lock enabled).
+Cache Storage: EFS / GP2 for ephemeral file access during job runtime.
+Security: IAM Roles with least privilege, encryption at rest/in-transit.
+
+🔄 Architecture Diagram
+The following flow represents how data and tasks move through HelixScale:
+
+graph TD
+    Dev[Developer / Control Plane] -->|SSH/Session Manager| Infra[Terraform IaC & Ansible]
+
+    subgraph "Cloud Compute Plane"
+        Terra -->|Create| VPC[AWS VPC + EKS Cluster]
+        VPC -->|Deploy| K8s[Kubernetes Nodes + GPU Drivers]
+        K8s -->|Schedule| Slurm[Slurm + Volcano Scheduler]
+        Slurm -->|Queue Jobs| Workload[BioNeMo Inference / MPI Jobs]
+        Workload -->|Store Results| S3[AWS S3 Bucket]
+        Infra -->|Monitor| Grafana[Grafana Dashboard + Prometheus Metrics]
+    end
+
+    subgraph "Local Orchestration"
+        Dev -->|Push Code| CI[GitHub Actions CI/CD Pipeline]
+        CI -->|Run Scans| Security[Trivy/Snyk Security Scan]
+        CI -->|Deploy Artifacts| Registry[AWS ECR Private Registry]
+        Security -->|Block If Failed| Dev
+    end
+
+    subgraph "Cost & Safety Controls"
+        FinOps[Cleanup Agent Script] -.->|Auto-terminate idle nodes| K8s
+        Budget[AWS Budget Alert ($2 Limit)] -.->|Stop New Provisions| Terra
+        Policy[Security Policy Rules] -.->|Enforce Pod Standards| K8s
+    end
+
+    S3 -->|Data Ingestion| Workload
+    Dev -->|View Results| Dashboard[Grafana Visualization]
+
+
+🛠️ Technology Stack Breakdown
+Infrastructure as Code (IaC)
+Terraform: Defines VPC, EKS clusters, GPU instances, and S3 buckets.
+Ansible: Applies node hardening, driver installation, and Slurm configs via playbooks.
+AWS CLI: Manages regional resource deployment and billing alerts.
+Computing & Scheduling
+Orchestrator: Kubernetes (EKS).
+Scheduler: Slurm integrated with Volcano for hybrid workloads.
+Container Runtimes: Docker (Dev) + Apptainer (Production HPC compatibility).
+GPU Drivers: NVIDIA Container Toolkit (nvidia-docker) on managed nodes.
+Data & AI
+BioNeMo Stack: Integrated via NGC containers for biological model inference.
+Data Ingestion: Python scripts converting local PDB/FASTA files to cloud-friendly formats (Parquet/HDF5).
+Model Registry: MLFlow / S3 versioning for trained model weights.
+DevOps & CI/CD
+CI/CD: GitHub Actions with ci-cd.yaml workflow for linting, scanning, and deploying.
+Secrets Management: AWS Secrets Manager integration (no keys in code).
+Cleanup Automation: Scheduled GitHub Actions that scale GPU nodes to zero after N hours idle.
+Security & Observability
+Vulnerability Scanning: trivy integrated into CI pipeline for all Docker images.
+Network Policies: Restricts pod-to-pod traffic (CNI) and egress paths.
+Metrics: Prometheus + Grafana dashboards showing GPU utilization, job QPS, and cost metrics.
+🔐 Security & Compliance Model
+HelixScale implements enterprise-grade security policies to ensure production readiness:
+
+1. Supply Chain Security
+All Docker/Apptainer images are scanned for CVEs (Critical vulnerabilities) before deployment.
+uv.lock ensures Python dependency reproducibility and audit trails.
+2. Data Privacy
+PII/PHI Filtering: No user data is processed on local machines; all sensitive biological data resides in encrypted S3 buckets.
+IAM Least Privilege: Node Groups use minimal IAM roles with strict permissions (no admin access).
+3. Infrastructure Protection
+.gitignore: Automatically generated to exclude Terraform state files (*.tfstate) and AWS credential files (~/.aws/credentials).
+Terraform State Locking: Uses DynamoDB backend locks to prevent concurrent state modifications.
+💰 Cost Control (FinOps Strategy)
+Enterprise HPC platforms must optimize spend, especially when using Spot Instances:
+
+Spot Fleet Utilization: GPU nodes run on Spot Instances for up to 90% savings with auto-recovery logic if interruptions occur.
+Idle Detection: Cleanup agents monitor GPU utilization metrics; nodes are terminated after 30 minutes of inactivity.
+Budget Hard Limits: AWS Budgets set at $2/month with immediate notifications at 80% threshold.
+🚀 Local vs Cloud Development Workflow
+Phase 1: Blueprinting (Local)
+Goal: Create architecture, write Terraform plans, test Python logic locally on Mac/WSL.
+Resources: Zero cloud resources created; all terraform plan outputs reviewed first.
+Phase 2: Controlled Provisions (Cloud)
+Goal: Spin up specific compute resources for testing with approval gates (bootstrap.sh).
+Safety Gate: Must confirm via script or GitHub Actions before spinning GPU instances.
+Phase 3: Production Simulation (CI/CD)
+Goal: Full pipeline automation where code is built, scanned, and deployed automatically on push.
+📚 Missing Components & Future Roadmap
+While this blueprint is fully functional for Phase 1, here are recommended additions for Phase 2:
+
+Feature	Status	Reason for Future Addition
+Multi-Cluster Support	Incomplete	Add support for scaling to Azure or GCP clusters in future phases.
+MLflow Registry	Incomplete	Integrate full model versioning and experiment tracking for AI roles.
+InfiniBand Networking	Not Implemented	Simulate RDMA over Converged Ethernet (RoCE) performance testing for HPC.
+Disaster Recovery	Not Implemented	Implement cross-region S3 replication and EKS backup policies.
+📝 Version Control & Contributions
+Commit Message Standard: Use conventional commits (feat: add BioNeMo support, fix: update cleanup logic).
+License: MIT License (Free for personal portfolio use).
+Contributors: Add open-source community badges if PRs are merged to upstream projects (Volcano, Slurm-k8s).
+🧠 Summary for Hiring Managers
+HelixScale demonstrates:
+
+Control Plane Mastery: You know how to manage infrastructure without running expensive hardware locally.
+Cost Control: You understand FinOps and budget safety protocols before deploying any resources.
+Security Awareness: You automatically prevent secrets leakage via .gitignore and scanning tools.
+Enterprise-Ready: Your pipeline is built on industry standards (Terraform, Ansible, Kubernetes) that work across Linux/Windows/macOS environments.
+This portfolio piece proves you can design production-grade systems while adhering to the safety, cost, and compliance requirements demanded by enterprise HPC teams.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -0,0 +1,52 @@
+# HelixScale Changelog
+
+All notable changes to the HelixScale Platform will be documented in this file.
+
+> Not auto-generated
+
+# Version 2.0: HelixScale: BioNeMo-AI HPC Fabric Orchestration Platform
+
+Mainly, the idea evolved to differ streamlit interface to version 3.0. 
+
+┌──────────────────────────────────────────────────────────────┐
+│                     Orchestration Engine                     │
+│         Pipeline DAG  ←→  Scheduler Abstraction              │
+│        (dependency      (Slurm |  Local)                 │
+│         resolution)        GPU Allocator                     │
+├──────────────────────────────────────────────────────────────┤
+│                     Container Runtimes                       │
+│              Docker (T0/T1)  |  Apptainer (T2)               │
+├──────────────┬──────────────┬────────────────────────────────┤
+│  Terraform   │   Ansible    │   Chef InSpec                  │
+│  (provision) │  (configure) │   (validate)                   │
+├──────────────┴──────────────┴────────────────────────────────┤
+│                     Observability                            │
+│    Prometheus → Grafana    DCGM Exporter    Loki             │
+│    Slurm Exporter          Triton Metrics                    │
+└──────────────────────────────────────────────────────────────┘
+
+
+
+# [Version - 1.0] - Mar-04-2026 HelixScale: HPC-optimized BioNemo Orchestration  (MVP Phase 1)
+
+### 🎯 Features: 
+
+┌──────────────────────────────────────────────────────────────┐
+│                        CLI / Streamlit UI                    │
+│              helixscale [cluster|jobs|infra|monitor]         │
+├──────────────────────────────────────────────────────────────┤
+│                     Orchestration Engine                     │
+│         Pipeline DAG  ←→  Scheduler Abstraction              │
+│        (dependency      (Slurm | SGE | Local)                │
+│         resolution)        GPU Allocator                     │
+├──────────────────────────────────────────────────────────────┤
+│                     Container Runtimes                       │
+│              Docker (T0/T1)  |  Apptainer (T2)               │
+├──────────────┬──────────────┬────────────────────────────────┤
+│  Terraform   │   Ansible    │   Chef InSpec                  │
+│  (provision) │  (configure) │   (validate)                   │
+├──────────────┴──────────────┴────────────────────────────────┤
+│                     Observability                            │
+│    Prometheus → Grafana    DCGM Exporter    Loki             │
+│    Slurm Exporter          Triton Metrics                    │
+└──────────────────────────────────────────────────────────────┘
diff --git a/Claude.md b/Claude.md
@@ -1,4 +1,5 @@
-# Claude.md — HelixScale: BioNemo HPC Orchestration Showcase
+# Claude.md — HelixScale: BioNeMo-AI HPC Fabric Orchestration Platform
+
 
 ## Project Identity
 

diff --git a/README.md b/README.md
@@ -1,4 +1,4 @@
-# HelixScale: BioNeMo HPC Orchestration Platform
+# HelixScale: BioNeMo-AI HPC Fabric Orchestration Platform
 
 > Production-grade HPC platform engineering for GPU-accelerated computational biology workloads — from bare metal to cloud, single node to multi-cluster.
 

diff --git a/automation/ansible/cleanup.sh b/automation/ansible/cleanup.sh