Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
File renamed without changes.
File renamed without changes.
File renamed without changes.
4 changes: 3 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
@@ -1,2 +1,4 @@
.ds_store
.venv
.venv
.env
uv.lock
174 changes: 174 additions & 0 deletions ARCHITECTURE.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,174 @@
HelixScale: HelixScale: BioNeMo-AI HPC Fabric Orchestration Platform
Version: 2.0 (Professional Portfolio)
Classification: Enterprise-Simulation / Production Blueprint
Target Roles: HPC Engineer, Platform Engineering, Data Platform Specialist
License: MIT
Date: March 2026

## 📖 1. Executive Summary
HelixScale is a hybrid infrastructure platform designed to orchestrate High-Performance Computing (HPC) and Artificial Intelligence (AI) workloads efficiently.

This architecture demonstrates how to separate the Control Plane (local development on a non-linux and non-nvidia CUDA GPU Hardware, security scanning, orchestration logic) from the Compute Plane (utilizing cloud GPU nodes, Kubernetes schedulers, NVIDIA drivers), ensuring enterprise-grade reliability, cost control, and reproducibility across macOS, Linux, and Windows (WSL2).

### Key Architectural Pillars:

- Infrastructure-as-Code (IaC): Reproducible environments via Terraform and Ansible.
- Supply-Chain Security: Automated vulnerability scanning (trivy, snyk) and secrets management integration.
- FinOps Governance: Automatic resource reclamation to prevent budget overruns (Zero-Cost Idle Policy).
- Hybrid Scheduling: Support for legacy HPC schedulers (Slurm) alongside cloud-native orchestration (Kubernetes/Volcano).

## 🏗️ 2. High-Level System Design
HelixScale operates on a Control vs Compute Plane model. This separation ensures that my local machine (Mac M-Series, Linux Workstation) never burns expensive resources while managing the cloud environment remotely.

graph TD
Local[Control Plane] -->|SSH/Session Manager| Infra [Terraform IaC & Ansible]

subgraph "Cloud Compute Plane"
Terra -->|Create| VPC[AWS VPC + EKS Cluster]
VPC -->|Deploy| K8s[Kubernetes Nodes + GPU Drivers]
K8s -->|Schedule| Slurm[Slurm + Volcano Scheduler]
Slurm -->|Queue Jobs| Workload[BioNeMo Inference / MPI Jobs]
Workload -->|Store Results| S3[AWS S3 Bucket]
Infra -->|Monitor| Grafana[Grafana Dashboard + Prometheus Metrics]
end

subgraph "Local Orchestration"
Dev -->|Push Code| CI[GitHub Actions CI/CD Pipeline]
CI -->|Run Scans| Security[Trivy/Snyk Security Scan]
CI -->|Deploy Artifacts| Registry[AWS ECR Private Registry]
Security -->|Block If Failed| Dev
end

subgraph "Cost & Safety Controls"
FinOps[Cleanup Agent Script] -.->|Auto-terminate idle nodes| K8s
Budget[AWS Budget Alert ($2 Limit)] -.->|Stop New Provisions| Terra
Policy[Security Policy Rules] -.->|Enforce Pod Standards| K8s
end

S3 -->|Data Ingestion| Workload
Dev -->|View Results| Dashboard[Grafana Visualization]




## 🔹 Control Plane (Local Development Machine)
Responsible for managing infrastructure without heavy compute load.

- OS: macOS (M-Series)
- Tools: Terraform, Ansible, GitHub Actions, OrbStack/Docker.
- Tasks: Code development, dependency locking (uv), security scanning, CI/CD triggers.
🔸 Compute Plane (Cloud Host - AWS/Azure)
Responsible for executing workloads with GPU acceleration.

OS: RHEL 9 / Ubuntu Server 22.04 LTS.
Tools: Kubernetes (EKS), Slurm, NVIDIA Container Toolkit.
Tasks: HPC job scheduling (sbatch), BioNeMo inference, model training.

🔶 Data & Storage Layer
Manages biological datasets and artifacts securely.

Primary Store: AWS S3 (Encrypted, Object Lock enabled).
Cache Storage: EFS / GP2 for ephemeral file access during job runtime.
Security: IAM Roles with least privilege, encryption at rest/in-transit.

🔄 Architecture Diagram
The following flow represents how data and tasks move through HelixScale:

graph TD
Dev[Developer / Control Plane] -->|SSH/Session Manager| Infra[Terraform IaC & Ansible]

subgraph "Cloud Compute Plane"
Terra -->|Create| VPC[AWS VPC + EKS Cluster]
VPC -->|Deploy| K8s[Kubernetes Nodes + GPU Drivers]
K8s -->|Schedule| Slurm[Slurm + Volcano Scheduler]
Slurm -->|Queue Jobs| Workload[BioNeMo Inference / MPI Jobs]
Workload -->|Store Results| S3[AWS S3 Bucket]
Infra -->|Monitor| Grafana[Grafana Dashboard + Prometheus Metrics]
end

subgraph "Local Orchestration"
Dev -->|Push Code| CI[GitHub Actions CI/CD Pipeline]
CI -->|Run Scans| Security[Trivy/Snyk Security Scan]
CI -->|Deploy Artifacts| Registry[AWS ECR Private Registry]
Security -->|Block If Failed| Dev
end

subgraph "Cost & Safety Controls"
FinOps[Cleanup Agent Script] -.->|Auto-terminate idle nodes| K8s
Budget[AWS Budget Alert ($2 Limit)] -.->|Stop New Provisions| Terra
Policy[Security Policy Rules] -.->|Enforce Pod Standards| K8s
end

S3 -->|Data Ingestion| Workload
Dev -->|View Results| Dashboard[Grafana Visualization]


🛠️ Technology Stack Breakdown
Infrastructure as Code (IaC)
Terraform: Defines VPC, EKS clusters, GPU instances, and S3 buckets.
Ansible: Applies node hardening, driver installation, and Slurm configs via playbooks.
AWS CLI: Manages regional resource deployment and billing alerts.
Computing & Scheduling
Orchestrator: Kubernetes (EKS).
Scheduler: Slurm integrated with Volcano for hybrid workloads.
Container Runtimes: Docker (Dev) + Apptainer (Production HPC compatibility).
GPU Drivers: NVIDIA Container Toolkit (nvidia-docker) on managed nodes.
Data & AI
BioNeMo Stack: Integrated via NGC containers for biological model inference.
Data Ingestion: Python scripts converting local PDB/FASTA files to cloud-friendly formats (Parquet/HDF5).
Model Registry: MLFlow / S3 versioning for trained model weights.
DevOps & CI/CD
CI/CD: GitHub Actions with ci-cd.yaml workflow for linting, scanning, and deploying.
Secrets Management: AWS Secrets Manager integration (no keys in code).
Cleanup Automation: Scheduled GitHub Actions that scale GPU nodes to zero after N hours idle.
Security & Observability
Vulnerability Scanning: trivy integrated into CI pipeline for all Docker images.
Network Policies: Restricts pod-to-pod traffic (CNI) and egress paths.
Metrics: Prometheus + Grafana dashboards showing GPU utilization, job QPS, and cost metrics.
🔐 Security & Compliance Model
HelixScale implements enterprise-grade security policies to ensure production readiness:

1. Supply Chain Security
All Docker/Apptainer images are scanned for CVEs (Critical vulnerabilities) before deployment.
uv.lock ensures Python dependency reproducibility and audit trails.
2. Data Privacy
PII/PHI Filtering: No user data is processed on local machines; all sensitive biological data resides in encrypted S3 buckets.
IAM Least Privilege: Node Groups use minimal IAM roles with strict permissions (no admin access).
3. Infrastructure Protection
.gitignore: Automatically generated to exclude Terraform state files (*.tfstate) and AWS credential files (~/.aws/credentials).
Terraform State Locking: Uses DynamoDB backend locks to prevent concurrent state modifications.
💰 Cost Control (FinOps Strategy)
Enterprise HPC platforms must optimize spend, especially when using Spot Instances:

Spot Fleet Utilization: GPU nodes run on Spot Instances for up to 90% savings with auto-recovery logic if interruptions occur.
Idle Detection: Cleanup agents monitor GPU utilization metrics; nodes are terminated after 30 minutes of inactivity.
Budget Hard Limits: AWS Budgets set at $2/month with immediate notifications at 80% threshold.
🚀 Local vs Cloud Development Workflow
Phase 1: Blueprinting (Local)
Goal: Create architecture, write Terraform plans, test Python logic locally on Mac/WSL.
Resources: Zero cloud resources created; all terraform plan outputs reviewed first.
Phase 2: Controlled Provisions (Cloud)
Goal: Spin up specific compute resources for testing with approval gates (bootstrap.sh).
Safety Gate: Must confirm via script or GitHub Actions before spinning GPU instances.
Phase 3: Production Simulation (CI/CD)
Goal: Full pipeline automation where code is built, scanned, and deployed automatically on push.
📚 Missing Components & Future Roadmap
While this blueprint is fully functional for Phase 1, here are recommended additions for Phase 2:

Feature Status Reason for Future Addition
Multi-Cluster Support Incomplete Add support for scaling to Azure or GCP clusters in future phases.
MLflow Registry Incomplete Integrate full model versioning and experiment tracking for AI roles.
InfiniBand Networking Not Implemented Simulate RDMA over Converged Ethernet (RoCE) performance testing for HPC.
Disaster Recovery Not Implemented Implement cross-region S3 replication and EKS backup policies.
📝 Version Control & Contributions
Commit Message Standard: Use conventional commits (feat: add BioNeMo support, fix: update cleanup logic).
License: MIT License (Free for personal portfolio use).
Contributors: Add open-source community badges if PRs are merged to upstream projects (Volcano, Slurm-k8s).
🧠 Summary for Hiring Managers
HelixScale demonstrates:

Control Plane Mastery: You know how to manage infrastructure without running expensive hardware locally.
Cost Control: You understand FinOps and budget safety protocols before deploying any resources.
Security Awareness: You automatically prevent secrets leakage via .gitignore and scanning tools.
Enterprise-Ready: Your pipeline is built on industry standards (Terraform, Ansible, Kubernetes) that work across Linux/Windows/macOS environments.
This portfolio piece proves you can design production-grade systems while adhering to the safety, cost, and compliance requirements demanded by enterprise HPC teams.
52 changes: 52 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,52 @@
# HelixScale Changelog

All notable changes to the HelixScale Platform will be documented in this file.

> Not auto-generated

# Version 2.0: HelixScale: BioNeMo-AI HPC Fabric Orchestration Platform

Mainly, the idea evolved to differ streamlit interface to version 3.0.

┌──────────────────────────────────────────────────────────────┐
│ Orchestration Engine │
│ Pipeline DAG ←→ Scheduler Abstraction │
│ (dependency (Slurm | Local) │
│ resolution) GPU Allocator │
├──────────────────────────────────────────────────────────────┤
│ Container Runtimes │
│ Docker (T0/T1) | Apptainer (T2) │
├──────────────┬──────────────┬────────────────────────────────┤
│ Terraform │ Ansible │ Chef InSpec │
│ (provision) │ (configure) │ (validate) │
├──────────────┴──────────────┴────────────────────────────────┤
│ Observability │
│ Prometheus → Grafana DCGM Exporter Loki │
│ Slurm Exporter Triton Metrics │
└──────────────────────────────────────────────────────────────┘



# [Version - 1.0] - Mar-04-2026 HelixScale: HPC-optimized BioNemo Orchestration (MVP Phase 1)

### 🎯 Features:

┌──────────────────────────────────────────────────────────────┐
│ CLI / Streamlit UI │
│ helixscale [cluster|jobs|infra|monitor] │
├──────────────────────────────────────────────────────────────┤
│ Orchestration Engine │
│ Pipeline DAG ←→ Scheduler Abstraction │
│ (dependency (Slurm | SGE | Local) │
│ resolution) GPU Allocator │
├──────────────────────────────────────────────────────────────┤
│ Container Runtimes │
│ Docker (T0/T1) | Apptainer (T2) │
├──────────────┬──────────────┬────────────────────────────────┤
│ Terraform │ Ansible │ Chef InSpec │
│ (provision) │ (configure) │ (validate) │
├──────────────┴──────────────┴────────────────────────────────┤
│ Observability │
│ Prometheus → Grafana DCGM Exporter Loki │
│ Slurm Exporter Triton Metrics │
└──────────────────────────────────────────────────────────────┘
3 changes: 2 additions & 1 deletion Claude.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
# Claude.md — HelixScale: BioNemo HPC Orchestration Showcase
# Claude.md — HelixScale: BioNeMo-AI HPC Fabric Orchestration Platform


## Project Identity

Expand Down
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# HelixScale: BioNeMo HPC Orchestration Platform
# HelixScale: BioNeMo-AI HPC Fabric Orchestration Platform

> Production-grade HPC platform engineering for GPU-accelerated computational biology workloads — from bare metal to cloud, single node to multi-cluster.

Expand Down
Empty file added automation/ansible/cleanup.sh
Empty file.
Loading