wonitatts/AI-Observability-Platform-Architecture.md

AI-Observability-Platform-Architecture.md

This is my architectural plan to build out an AI network observability/security agent

AI-Driven Infrastructure Observability Platform

Engineering Architecture Document


1. Full System Architecture

The platform is a four-layer stack: Collection → Transport → Intelligence → Presentation.

Collection Layer: Lightweight agents deployed on every monitored VM, plus a dedicated network sensor. Node Exporter exposes host metrics over HTTP. Filebeat tails syslog, application logs, and auth logs. Zeek (or Suricata) performs passive network analysis on a mirrored interface, producing structured connection logs (conn.log, dns.log, http.log, etc.).

Transport Layer: All telemetry converges on a centralized ingestion bus. Prometheus scrapes metric endpoints on a 15-second interval. Filebeat ships logs to Elasticsearch via Logstash (which handles parsing, enrichment, and routing). Zeek logs are ingested either via Filebeat's Zeek module or by writing directly to a shared NFS mount consumed by the ML pipeline.

Intelligence Layer: A dedicated ML node runs scheduled Python jobs (orchestrated by Airflow or cron). Three model pipelines operate independently: an Isolation Forest anomaly detector on network telemetry, a Prophet time-series forecaster on system metrics, and a TF-IDF + Logistic Regression classifier on log text. Model outputs (anomaly scores, forecasts, log labels) are written back to Elasticsearch as enriched indices and exposed as custom Prometheus metrics by a Python exporter that pushes to the Pushgateway.

Presentation Layer: Grafana serves as the single pane of glass. It queries Prometheus for real-time metrics and forecasts and Elasticsearch for log classification results and anomaly events, and it evaluates alerting rules that trigger on ML-derived thresholds.

┌──────────────────────────────────────────────────────────────────────┐
│                          PRESENTATION LAYER                          │
│                    Grafana (dashboards + alerts)                     │
└──────────┬────────────────────┬──────────────────────────────────────┘
           │                    │
     Prometheus            Elasticsearch
           │                    │
┌──────────┴────────────────────┴──────────────────────────────────────┐
│                          INTELLIGENCE LAYER                          │
│  ┌──────────────┐  ┌───────────────────┐  ┌────────────────────┐     │
│  │ Anomaly Det. │  │ Capacity Forecast │  │ Log Classifier     │     │
│  │ IsolationFor │  │ Prophet           │  │ TF-IDF + LogReg    │     │
│  └──────────────┘  └───────────────────┘  └────────────────────┘     │
└──────────┬────────────────────┬──────────────────┬───────────────────┘
           │                    │                  │
┌──────────┴────────────────────┴──────────────────┴───────────────────┐
│                           TRANSPORT LAYER                            │
│  Prometheus scrape  │  Logstash pipeline  │  Zeek log shipping       │
└──────────┬────────────────────┬──────────────────┬───────────────────┘
           │                    │                  │
┌──────────┴────────────────────┴──────────────────┴───────────────────┐
│                           COLLECTION LAYER                           │
│  Node Exporter  │  Filebeat  │  Zeek/Suricata  │  Custom exporters   │
│                          (deployed per VM)                           │
└──────────────────────────────────────────────────────────────────────┘

2. Virtual Machine Topology

Six VMs are recommended. All run Ubuntu 22.04 LTS.

| VM Role | Hostname | vCPU | RAM | Disk | Purpose |
|---|---|---|---|---|---|
| Observability Server | obs-server | 4 | 8 GB | 100 GB SSD | Prometheus, Grafana, Alertmanager |
| Log Server | log-server | 4 | 12 GB | 200 GB SSD | Elasticsearch (single-node), Logstash, Kibana |
| Network Sensor | net-sensor | 2 | 4 GB | 50 GB | Zeek/Suricata, 2 NICs (mgmt + mirror/span) |
| ML Node | ml-node | 4 | 8 GB | 80 GB SSD | Python ML pipelines, Airflow scheduler, model storage |
| Telemetry Node A | telem-a | 2 | 4 GB | 40 GB | Simulated workload, Node Exporter, Filebeat |
| Telemetry Node B | telem-b | 2 | 4 GB | 40 GB | Simulated workload, Node Exporter, Filebeat |

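Total resource requirement: 18 vCPU, 40 GB RAM, 510 GB disk. Because the allocations add up to 40 GB, a workstation with 32 GB RAM can host the lab only if the VM allocations are trimmed (starting with the telemetry nodes) or memory is overcommitted; more host RAM leaves headroom for the host OS.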

Networking: All VMs on a shared VMware NAT or host-only network (172.16.0.0/24). The network sensor's second NIC is connected to a port mirror/promiscuous VLAN segment carrying traffic from both telemetry nodes. VMware Workstation supports promiscuous mode on virtual switches to enable this.


3. Telemetry Pipeline

Metrics Pipeline

  1. Node Exporter (port 9100) on every monitored VM exposes ~800 metrics: CPU per-core utilization, memory pressure, disk I/O latency, network bytes/packets, filesystem usage.
  2. Prometheus scrapes all exporters every 15s. Recording rules pre-compute 5m rate averages for CPU, memory, and disk. Retention set to 30 days.
  3. Custom Python exporter on ml-node pushes forecast metrics and anomaly scores to Prometheus Pushgateway, making ML outputs queryable via PromQL.
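
As a minimal sketch of step 3, the snippet below renders gauge samples in the Prometheus text exposition format and pushes them to a Pushgateway over HTTP using only the standard library. The metric name `ml_days_to_exhaustion`, the job name, and the `obs-server:9091` address are illustrative assumptions rather than names fixed by this plan; the `prometheus-client` package listed in the stack offers `push_to_gateway` for the same purpose.

```python
import urllib.request
from collections import defaultdict


def render_exposition(samples):
    """Render (name, labels, value) gauge samples as Prometheus text format.

    Samples for the same metric name are grouped together, as the format
    expects. Label values are assumed to need no escaping.
    """
    grouped = defaultdict(list)
    for name, labels, value in samples:
        grouped[name].append((labels, value))

    lines = []
    for name, series in grouped.items():
        lines.append(f"# TYPE {name} gauge")
        for labels, value in series:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            selector = f"{{{label_str}}}" if label_str else ""
            lines.append(f"{name}{selector} {float(value)}")
    return "\n".join(lines) + "\n"


def push(gateway_url, job, body, timeout=5):
    """PUT the rendered metrics to the Pushgateway group for `job`.

    PUT replaces every metric previously pushed under the same job.
    """
    url = f"{gateway_url.rstrip('/')}/metrics/job/{job}"
    request = urllib.request.Request(url, data=body.encode("utf-8"), method="PUT")
    request.add_header("Content-Type", "text/plain; version=0.0.4")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


body = render_exposition([
    ("ml_days_to_exhaustion", {"host": "telem-a", "resource": "disk"}, 12.5),
])
# push("http://obs-server:9091", "capacity_forecast", body)  # hypothetical address
```

Prometheus then scrapes the Pushgateway like any other target, so the pushed gauges become queryable via PromQL on the next scrape.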

Log Pipeline

  1. Filebeat on each telemetry node tails /var/log/syslog, /var/log/auth.log, /var/log/kern.log, and any application logs. Each event is tagged with host, log_source, and timestamp.
  2. Logstash (on log-server) receives Filebeat output, applies grok filters to extract structured fields (severity, service, PID, message body), and writes to Elasticsearch index logs-YYYY.MM.DD.
  3. ML enrichment: The log classifier reads raw logs from Elasticsearch via the Python client, classifies them, and writes results back to a logs-classified index with an added ml_category field.
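
To make the field extraction in step 2 concrete, here is a rough Python equivalent of a syslog grok pattern. It assumes the traditional `Mon DD HH:MM:SS host service[pid]: message` file format; that format does not carry a severity field, so severity would have to come from elsewhere (for example the rsyslog template). The function name `parse_syslog` is hypothetical.

```python
import re

# Traditional syslog file format: "Mar  3 12:00:01 host service[pid]: message"
SYSLOG_RE = re.compile(
    r"^(?P<timestamp>[A-Z][a-z]{2}\s+\d{1,2}\s\d{2}:\d{2}:\d{2})\s"
    r"(?P<host>\S+)\s"
    r"(?P<service>[^\s\[:]+)"
    r"(?:\[(?P<pid>\d+)\])?:\s?"
    r"(?P<message>.*)$"
)


def parse_syslog(line):
    """Return a dict of structured fields, or None if the line does not match."""
    match = SYSLOG_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    if fields["pid"] is not None:
        fields["pid"] = int(fields["pid"])
    return fields


event = parse_syslog(
    "Mar  3 12:00:01 telem-a sshd[1234]: Failed password for root from 10.0.0.5 port 22 ssh2"
)
```

Lines that fail to match return `None`, which mirrors how Logstash tags unparsed events instead of dropping them.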

Network Telemetry Pipeline

  1. Zeek on net-sensor runs in cluster mode (single worker for homelab scale) on the mirrored NIC. It produces structured TSV logs: conn.log (connection 5-tuples, durations, byte counts), dns.log, http.log, ssl.log, weird.log.
  2. Filebeat Zeek module ships these logs to Elasticsearch for archival and dashboard queries.
  3. ML consumption: The anomaly detector reads conn.log features (duration, orig_bytes, resp_bytes, protocol, service, conn_state) directly from disk or Elasticsearch, computes feature vectors, and scores them.
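
The "directly from disk" path in step 3 could look roughly like the sketch below, which reads a Zeek TSV log using its `#fields` header and derives a few of the numeric features. The sample log is trimmed to a handful of columns for illustration, and mapping Zeek's unset marker (`-`) to `0.0` is an assumption that should be revisited during feature engineering.

```python
import io


def read_zeek_tsv(handle):
    """Yield one dict per record from a Zeek TSV log, keyed by #fields names."""
    fields = None
    for raw in handle:
        line = raw.rstrip("\n")
        if line.startswith("#fields"):
            fields = line.split("\t")[1:]
            continue
        if not line or line.startswith("#"):
            continue  # other header/footer lines such as #separator or #close
        if fields is None:
            raise ValueError("record seen before the #fields header")
        yield dict(zip(fields, line.split("\t")))


def conn_features(record):
    """Build a small feature dict from one conn.log record."""
    def num(key):
        value = record.get(key, "-")
        return 0.0 if value in ("-", "(empty)") else float(value)

    duration = num("duration")
    orig_ip_bytes = num("orig_ip_bytes")
    return {
        "duration": duration,
        "orig_bytes": num("orig_bytes"),
        "resp_bytes": num("resp_bytes"),
        # Throughput rate; guarded against zero-length connections.
        "orig_rate": orig_ip_bytes / duration if duration > 0 else 0.0,
        "service": record.get("service", "-"),
        "conn_state": record.get("conn_state", "-"),
    }


# Trimmed sample; a real conn.log has more columns.
SAMPLE = (
    "#separator \\x09\n"
    "#fields\tts\tuid\tduration\torig_bytes\tresp_bytes\torig_ip_bytes\tservice\tconn_state\n"
    "1700000000.0\tC1\t2.0\t100\t300\t180\tdns\tSF\n"
    "1700000005.0\tC2\t-\t-\t-\t40\t-\tS0\n"
)
rows = [conn_features(r) for r in read_zeek_tsv(io.StringIO(SAMPLE))]
```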

4. Machine Learning Architecture

4.1 Network Anomaly Detection: Isolation Forest

Why Isolation Forest: Anomaly detection on network traffic is an unsupervised problem: you don't have labeled "attack" data in a homelab. Isolation Forest is specifically designed for this: it isolates outliers by randomly partitioning feature space, and anomalies require fewer partitions to isolate. It handles high-dimensional numeric data well, trains fast, and requires no distributional assumptions.

Features (derived from Zeek conn.log):

  • duration (connection length)
  • orig_bytes, resp_bytes (data volume)
  • orig_pkts, resp_pkts (packet counts)
  • orig_ip_bytes / duration (throughput rate)
  • service (one-hot encoded: dns, http, ssl, other)
  • conn_state (encoded: S0, S1, SF, REJ, RSTO, etc.)
  • hour_of_day, day_of_week (temporal features)

Training: Fit on 7 days of "normal" baseline traffic. Contamination parameter set to 0.01–0.05 (tuned via domain knowledge). Retrain weekly.

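Output: A continuous anomaly score per connection (lower means more anomalous) plus a binary label (-1 = anomaly, 1 = normal) derived from the contamination threshold. Connections scoring below the threshold are flagged and written to an anomalies Elasticsearch index with the original connection metadata.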
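
A minimal sketch of the train-then-score loop with scikit-learn follows. The baseline here is synthetic random data standing in for 7 days of featurized conn.log records, and the three-column feature layout is an assumption made for brevity.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-in for baseline traffic: [duration_s, orig_bytes, resp_bytes]
baseline = rng.normal(
    loc=[2.0, 500.0, 1500.0],
    scale=[0.5, 100.0, 300.0],
    size=(1000, 3),
)

model = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
model.fit(baseline)

new_connections = np.array([
    [2.1, 520.0, 1400.0],        # resembles the baseline
    [600.0, 5_000_000.0, 10.0],  # long-lived and upload-heavy
])

# decision_function: continuous score, negative values fall below the
# contamination threshold. predict: the same decision as -1 / 1 labels.
scores = model.decision_function(new_connections)
labels = model.predict(new_connections)
```

Storing the continuous score alongside the label keeps the option of re-thresholding later without rescoring.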

4.2 Capacity Forecasting: Facebook Prophet

Why Prophet: It handles daily/weekly seasonality natively (which infrastructure metrics exhibit: think cron jobs at midnight, lower weekend utilization). It's robust to missing data points, requires minimal hyperparameter tuning, and produces interpretable confidence intervals, which is exactly what an SRE needs for capacity planning.

Metrics forecasted:

  • CPU utilization (%): 5-minute averaged
  • Memory usage (%): absolute used vs. total
  • Disk usage (%): per mount point
  • Network throughput (bytes/sec): interface level

Training: 30 days of Prometheus data exported via the HTTP API (/api/v1/query_range). Each metric gets its own Prophet model. Models are retrained daily.
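
The export step might be sketched as follows: build a `/api/v1/query_range` URL and flatten the returned matrix into the `ds`/`y` rows that Prophet expects. The `obs-server:9090` address and the helper names are assumptions; timestamps are converted to timezone-naive UTC because Prophet rejects timezone-aware `ds` values. At a 5-minute step, 30 days is 8,640 points per series.

```python
import json
from datetime import datetime, timezone
from urllib.parse import urlencode


def build_query_range_url(base_url, promql, start, end, step="5m"):
    """Return a query_range URL for the given PromQL expression and window."""
    params = urlencode({
        "query": promql,
        "start": start.timestamp(),
        "end": end.timestamp(),
        "step": step,
    })
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


def matrix_to_prophet_rows(payload):
    """Flatten the first series of a 'matrix' result into ds/y rows."""
    if payload.get("status") != "success":
        raise ValueError("Prometheus query failed")
    result = payload["data"]["result"]
    if not result:
        return []
    rows = []
    for ts, value in result[0]["values"]:
        # Prometheus returns [unix_seconds, "value-as-string"] pairs.
        ds = datetime.fromtimestamp(float(ts), tz=timezone.utc).replace(tzinfo=None)
        rows.append({"ds": ds, "y": float(value)})
    return rows


url = build_query_range_url(
    "http://obs-server:9090",  # hypothetical address
    "up",
    datetime(2024, 1, 1, tzinfo=timezone.utc),
    datetime(2024, 1, 31, tzinfo=timezone.utc),
)

# Shape of a query_range response, with made-up values.
payload = json.loads(
    '{"status":"success","data":{"resultType":"matrix","result":'
    '[{"metric":{"instance":"telem-a:9100"},'
    '"values":[[1700000000,"41.5"],[1700000300,"43.0"]]}]}}'
)
rows = matrix_to_prophet_rows(payload)
```

The resulting rows can be loaded into a pandas DataFrame and passed to `Prophet().fit(...)`.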

Output: 7-day and 30-day forecasts with 80% and 95% confidence intervals. A "days to exhaustion" metric is computed by extrapolating when the upper confidence bound crosses 85% (warning) or 95% (critical). These values are pushed to Prometheus as custom gauge metrics.
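
One way to compute "days to exhaustion" is to scan the forecast's upper bound (Prophet's `yhat_upper` column) for the first date at or above a threshold. The sketch below assumes the forecast has already been reduced to date-sorted `(date, yhat_upper)` pairs; the linear series is made-up data.

```python
from datetime import date, timedelta


def days_to_exhaustion(forecast, threshold, today):
    """Days until the upper bound first reaches `threshold`.

    `forecast` is a date-sorted list of (date, yhat_upper) pairs.
    Returns None when the threshold is not reached within the horizon.
    """
    for day, upper in forecast:
        if day >= today and upper >= threshold:
            return (day - today).days
    return None


today = date(2024, 1, 1)
# Made-up disk usage forecast: upper bound grows 1.5 points per day from 70%.
forecast = [(today + timedelta(days=i), 70.0 + 1.5 * i) for i in range(30)]

warning_days = days_to_exhaustion(forecast, 85.0, today)
critical_days = days_to_exhaustion(forecast, 95.0, today)
```

Returning `None` rather than a sentinel number makes "no exhaustion within the horizon" explicit; the exporter can then skip the gauge or publish a large capped value.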

4.3 Log Classification: TF-IDF + Logistic Regression

Why TF-IDF + Logistic Regression: For structured log classification, this pipeline is well-proven and production-grade. TF-IDF converts log messages into sparse feature vectors that capture discriminative terms ("segfault," "OOM," "connection refused" vs. "session opened," "cron started"). Logistic Regression is fast to train, interpretable (you can inspect feature weights), and performs well with high-dimensional sparse inputs. More complex models (BERT, etc.) are overkill for this domain and impractical in a homelab.

Classes:

| Label | Description | Example patterns |
|---|---|---|
| critical | Service-impacting failures | OOM killer, segfault, disk full, kernel panic |
| warning | Degradation indicators | high latency, retries, connection timeout |
| informational | Normal operational events | service started, user login, cron executed |
| noise | Non-actionable entries | DHCP renewal, NTP sync, routine heartbeats |

Training data: Manually label 2,000–5,000 log lines from your homelab (this is realistic: a weekend of labeling). Use stratified sampling to ensure class balance. Augment with public datasets like the Loghub collection (https://github.com/logpai/loghub).

Pipeline: raw_text → regex cleanup (strip timestamps, PIDs) → TF-IDF vectorizer (max_features=10000, ngram_range=(1,2)) → Logistic Regression (C=1.0, class_weight='balanced').

Output: Each log line gets a predicted label and confidence score. Low-confidence predictions (< 0.6) are flagged for human review, creating a feedback loop for model improvement.
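
The pipeline and the 0.6 review threshold can be sketched with scikit-learn as below. The eight training lines are toy examples, far too few for a usable model; the `clean` and `classify` helpers and the `max_iter=1000` setting are assumptions added for the sketch.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def clean(text):
    """Strip a leading syslog timestamp and bracketed PIDs, then lowercase."""
    text = re.sub(r"^[A-Z][a-z]{2}\s+\d{1,2}\s\d{2}:\d{2}:\d{2}\s", "", text)
    text = re.sub(r"\[\d+\]", "", text)
    return text.lower()


# Toy labeled data; a real run would use the 2,000-5,000 labeled lines.
train_text = [
    "kernel: Out of memory: Killed process 4321 (python3)",
    "app[88]: segfault at 0 ip 00007f error 4",
    "nginx[412]: upstream timed out while connecting, retrying",
    "app[90]: connection timeout after 30s, retry 2 of 5",
    "systemd[1]: Started Daily apt download activities",
    "sshd[1234]: session opened for user admin",
    "dhclient[310]: DHCPACK renewal in 1800 seconds",
    "ntpd[220]: clock synchronized to time server",
]
train_labels = [
    "critical", "critical", "warning", "warning",
    "informational", "informational", "noise", "noise",
]

clf = make_pipeline(
    TfidfVectorizer(preprocessor=clean, max_features=10000, ngram_range=(1, 2)),
    LogisticRegression(C=1.0, class_weight="balanced", max_iter=1000),
)
clf.fit(train_text, train_labels)


def classify(line, min_confidence=0.6):
    """Return (label, confidence, needs_review) for one log line."""
    proba = clf.predict_proba([line])[0]
    best = int(proba.argmax())
    confidence = float(proba[best])
    return str(clf.classes_[best]), confidence, confidence < min_confidence


label, confidence, needs_review = classify(
    "Mar  3 12:00:09 telem-b kernel: Out of memory: Killed process 77"
)
```

With so little training data most predictions will fall under the 0.6 threshold and land in the review queue, which is the intended behavior for an immature model.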


5. Data Flow

TELEMETRY NODES                NETWORK SENSOR
  │ Node Exporter                  │ Zeek conn.log
  │ Filebeat (logs)                │ Filebeat (Zeek logs)
  │                                │
  ▼                                ▼
┌──────────────────────────────────────────────────┐
│              TRANSPORT / STORAGE                 │
│                                                  │
│  Prometheus ◄── metric scrape                    │
│  (TSDB, 30d)                                     │
│                                                  │
│  Logstash ◄── log events ──► Elasticsearch       │
│  (parse/enrich)              (raw logs index)    │
│                              (zeek logs index)   │
└───────┬──────────────────────────┬───────────────┘
        │                          │
        ▼                          ▼
┌──────────────────────────────────────────────────┐
│                ML NODE                           │
│                                                  │
│  1. Query Prometheus HTTP API ──► Prophet        │
│     (30d CPU/RAM/disk/net)       (forecast)      │
│        │                                         │
│        └──► Push forecast metrics to Pushgateway │
│                                                  │
│  2. Query Elasticsearch ──► TF-IDF + LogReg      │
│     (raw log text)           (classify)          │
│        │                                         │
│        └──► Write classified logs back to ES     │
│                                                  │
│  3. Read Zeek conn.log ──► Isolation Forest      │
│     (from ES or NFS)        (score anomalies)    │
│        │                                         │
│        └──► Write anomaly events to ES           │
│             Push anomaly scores to Pushgateway   │
└───────┬──────────────────────────────────────────┘
        │
        ▼
┌──────────────────────────────────────────────────┐
│              GRAFANA                             │
│                                                  │
│  Data sources:                                   │
│    Prometheus → metrics + forecasts + anomaly    │
│    Elasticsearch → logs + classifications +      │
│                    anomaly events + Zeek data    │
│                                                  │
│  Dashboards → Alerts → PagerDuty/Slack webhook   │
└──────────────────────────────────────────────────┘

6. Implementation Roadmap

Phase 1: Foundation (Week 1–2)

  1. Provision all 6 VMs in VMware Workstation. Assign static IPs on 172.16.0.0/24.
  2. Install and configure Node Exporter on telem-a and telem-b.
  3. Deploy Prometheus on obs-server. Validate scrape targets are up.
  4. Install Grafana on obs-server. Connect Prometheus data source. Import the Node Exporter Full dashboard (ID: 1860).
  5. Validate end-to-end: generate CPU load with stress-ng, confirm it appears in Grafana within 30 seconds.

Phase 2: Log Pipeline (Week 3)

  1. Deploy Elasticsearch (single-node) and Logstash on log-server.
  2. Install Filebeat on telem-a/b. Configure syslog and auth.log inputs.
  3. Build Logstash grok pipeline for syslog parsing. Validate structured output in Elasticsearch.
  4. Connect Elasticsearch as a Grafana data source. Build a basic log explorer dashboard.

Phase 3: Network Monitoring (Week 4)

  1. Configure net-sensor's second NIC in promiscuous mode.
  2. Install Zeek. Tune local.zeek for your network (set Site::local_nets).
  3. Validate Zeek is producing conn.log, dns.log, http.log.
  4. Ship Zeek logs to Elasticsearch via Filebeat Zeek module.
  5. Generate synthetic traffic with hping3, nmap, or curl loops to create a baseline dataset.

Phase 4: ML, Log Classification (Week 5)

  1. Export 5,000 log lines from Elasticsearch. Manually label them (use a simple CSV + spreadsheet workflow).
  2. Build the TF-IDF + Logistic Regression pipeline in a Jupyter notebook on ml-node.
  3. Train/evaluate with 80/20 split. Target F1 > 0.85 on each class.
  4. Productionize: write a Python script that queries ES for new logs, classifies them, and writes results back. Schedule with cron (every 5 minutes).

Phase 5: ML, Capacity Forecasting (Week 6)

  1. Query Prometheus API for 30 days of CPU, RAM, disk, network metrics.
  2. Fit Prophet models per metric per host.
  3. Generate 7-day and 30-day forecasts. Compute days-to-exhaustion.
  4. Push forecasts to Prometheus Pushgateway. Build Grafana forecast overlay dashboards.
  5. Set Grafana alerts: warn at < 14 days to exhaustion, critical at < 7 days.

Phase 6: ML, Network Anomaly Detection (Week 7)

  1. Extract and featurize 7 days of Zeek conn.log data.
  2. Train Isolation Forest. Tune contamination parameter.
  3. Score new connections in batch (every 10 minutes via cron).
  4. Write anomaly events to Elasticsearch. Build anomaly dashboard in Grafana.

Phase 7: Integration and Hardening (Week 8)

  1. Unify all Grafana dashboards into a single "Platform Overview" home dashboard.
  2. Configure Alertmanager with routing rules and a Slack/Discord webhook.
  3. Write a stress_test.sh script that simulates failures (disk fill, CPU spike, port scan, log flood) to validate end-to-end detection.
  4. Document everything. Write the README. Record a demo walkthrough.

7. Recommended Software Stack

| Layer | Tool | Version | Purpose |
|---|---|---|---|
| Metrics collection | Prometheus Node Exporter | 1.7+ | Host-level metrics |
| Metrics storage | Prometheus | 2.50+ | TSDB, scraping, PromQL |
| Metrics gateway | Prometheus Pushgateway | 1.7+ | ML-derived metric ingestion |
| Log shipper | Filebeat | 8.x | Lightweight log forwarding |
| Log processing | Logstash | 8.x | Parsing, enrichment, routing |
| Log/event store | Elasticsearch | 8.x (single-node) | Full-text search, analytics |
| Network monitor | Zeek | 6.x | Passive traffic analysis, structured logs |
| Alt network monitor | Suricata | 7.x | IDS/IPS with EVE JSON output |
| Packet analysis | Wireshark / tshark | 4.x | Ad hoc deep inspection |
| Visualization | Grafana | 10.x | Dashboards, alerting |
| Alerting | Alertmanager | 0.27+ | Alert routing, dedup, silencing |
| ML framework | scikit-learn | 1.4+ | Isolation Forest, Logistic Regression, TF-IDF |
| Forecasting | Prophet (via prophet) | 1.1+ | Time-series capacity forecasting |
| Data handling | pandas, numpy | latest | Feature engineering, data wrangling |
| ES client | elasticsearch-py | 8.x | Query/write Elasticsearch from Python |
| Prometheus client | prometheus-client | 0.20+ | Push custom metrics from Python |
| Scheduling | cron or Apache Airflow | - | ML pipeline orchestration |
| Notebooks | JupyterLab | 4.x | Model development, EDA |
| Traffic gen | hping3, nmap, stress-ng | - | Synthetic workload/traffic for testing |

8. GitHub Repository Structure

ai-infra-observability/
├── README.md                          # Project overview, architecture, setup guide
├── LICENSE
├── docs/
│   ├── architecture.md                # This document
│   ├── vm-setup.md                    # VM provisioning instructions
│   ├── runbook.md                     # Operational procedures
│   └── screenshots/                   # Dashboard screenshots for portfolio
│
├── infrastructure/
│   ├── ansible/                       # (Optional) Automated VM provisioning
│   │   ├── inventory.yml
│   │   ├── playbook-prometheus.yml
│   │   ├── playbook-elk.yml
│   │   ├── playbook-zeek.yml
│   │   └── roles/
│   ├── prometheus/
│   │   ├── prometheus.yml             # Scrape config
│   │   ├── recording_rules.yml        # Pre-computed PromQL expressions
│   │   └── alerting_rules.yml         # Alert definitions
│   ├── logstash/
│   │   ├── pipelines.yml
│   │   └── conf.d/
│   │       ├── 01-filebeat-input.conf
│   │       ├── 02-syslog-filter.conf
│   │       └── 03-elasticsearch-output.conf
│   ├── filebeat/
│   │   └── filebeat.yml               # Per-node Filebeat config
│   ├── zeek/
│   │   ├── local.zeek                 # Zeek site policy
│   │   └── node.cfg                   # Zeek cluster config
│   └── grafana/
│       └── provisioning/
│           ├── dashboards/            # JSON dashboard definitions
│           └── datasources/           # Prometheus + ES datasource configs
│
├── ml/
│   ├── requirements.txt               # Python dependencies
│   ├── common/
│   │   ├── config.py                  # Shared configuration (ES hosts, Prometheus URL)
│   │   ├── es_client.py               # Elasticsearch helper functions
│   │   └── prom_client.py             # Prometheus query helper
│   ├── anomaly_detection/
│   │   ├── feature_engineering.py     # Zeek conn.log → feature vectors
│   │   ├── train.py                   # Isolation Forest training
│   │   ├── detect.py                  # Batch scoring script (cron target)
│   │   └── models/                    # Serialized model files (.joblib)
│   ├── capacity_forecast/
│   │   ├── extract_metrics.py         # Prometheus API data extraction
│   │   ├── train.py                   # Prophet model training
│   │   ├── forecast.py                # Generate forecasts + push to Pushgateway
│   │   └── models/
│   ├── log_classification/
│   │   ├── label_studio_export.py     # (Optional) export from labeling tool
│   │   ├── preprocess.py              # Text cleaning, TF-IDF fitting
│   │   ├── train.py                   # Logistic Regression training
│   │   ├── classify.py                # Batch classification (cron target)
│   │   └── models/
│   └── notebooks/
│       ├── 01_eda_system_metrics.ipynb
│       ├── 02_eda_zeek_connections.ipynb
│       ├── 03_log_labeling_analysis.ipynb
│       ├── 04_anomaly_model_tuning.ipynb
│       ├── 05_forecast_validation.ipynb
│       └── 06_classifier_evaluation.ipynb
│
├── scripts/
│   ├── stress_test.sh                 # End-to-end validation: simulated failures
│   ├── generate_traffic.sh            # Synthetic network traffic generation
│   ├── export_training_data.py        # Bulk export from ES for labeling
│   └── cron_setup.sh                  # Install cron jobs for ML pipelines
│
└── tests/
    ├── test_feature_engineering.py
    ├── test_log_preprocessing.py
    └── test_es_integration.py

9. Observability Dashboards

Dashboard 1: Platform Health Overview (Home)

Single-pane summary. Four stat panels at top: total anomalies (24h), critical logs (24h), nearest capacity exhaustion date, overall system health score (composite). Below: a row of sparklines per host showing CPU, RAM, disk. A log volume time series broken down by ML classification. A network anomaly timeline. This is the "glance and know" dashboard.

Dashboard 2: Capacity Forecasting

One row per metric (CPU, RAM, disk, network). Each row contains: a time series panel showing 30 days of historical data overlaid with the Prophet forecast line and shaded confidence intervals (80% and 95%). A gauge showing current utilization. A stat panel showing computed "days to exhaustion." Alert annotations appear on the time series when thresholds are crossed.

Dashboard 3: Network Anomaly Detection

Top row: anomaly count over time (bar chart, 1-hour buckets), top 10 anomalous source IPs (table), protocol distribution of anomalous connections (pie chart). Bottom row: a scatter plot of connection duration vs. bytes transferred, colored by anomaly score. A filterable table of raw anomaly events with columns for timestamp, source IP, dest IP, port, service, duration, bytes, and anomaly score. Clicking a row links to the Zeek log entry in Elasticsearch.

Dashboard 4: Log Intelligence

Top row: log volume by classification over time (stacked area chart: critical in red, warning in amber, informational in blue, noise in gray). A stat panel showing classifier confidence distribution. Middle row: a table of critical and warning logs with timestamp, host, service, message, ML confidence. Bottom row: low-confidence predictions flagged for human review. A "classification accuracy" panel updated from periodic manual spot-checks.

Dashboard 5: Per-Host Deep Dive

Variable selector for hostname. Shows all Node Exporter metrics for that host: CPU per-core, memory breakdown (used/buffered/cached/free), disk I/O operations and latency, network interface traffic, filesystem usage per mount. Includes log volume from that host and any anomalies associated with its IP.


10. Evaluation Metrics

Network Anomaly Detection (Isolation Forest)

Since this is unsupervised with no ground-truth labels, evaluation uses a pragmatic approach:

  • Injection testing: Deliberately generate known-bad traffic (port scans via nmap, DNS exfiltration simulation, high-volume data transfers) and measure detection rate. Target: >90% of injected anomalies flagged.
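  • False positive rate: Manually review 100 random flagged anomalies per week and track the share that turn out to be false alarms. Because only flagged connections are reviewed, this measures the false discovery rate (1 - precision) rather than a true FPR. Target: < 20% (acceptable for alerting with human-in-the-loop).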
  • Silhouette score: Treat the predicted inliers and outliers as two clusters and measure their separation in the feature space, to validate that the model is finding meaningful structure, not random noise.
  • Contamination sensitivity analysis: Sweep contamination from 0.005 to 0.10 and plot precision/recall on the injected anomaly set to find the optimal operating point.
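
For the injection test, detection rate and the share of false alarms among flagged connections can be computed from two sets of connection IDs, as in this sketch. It assumes every flagged connection that was not injected counts as a false alarm, which only holds if the baseline traffic is otherwise clean; the Zeek-style UIDs are made up.

```python
def detection_metrics(flagged_ids, injected_ids):
    """Compare flagged connection IDs against the IDs of injected anomalies."""
    flagged, injected = set(flagged_ids), set(injected_ids)
    true_positives = len(flagged & injected)
    detection_rate = true_positives / len(injected) if injected else 0.0
    precision = true_positives / len(flagged) if flagged else 0.0
    return {
        "detection_rate": detection_rate,  # recall on the injected set
        "precision": precision,
        "false_discovery_rate": 1.0 - precision if flagged else 0.0,
    }


metrics = detection_metrics(
    flagged_ids={"C1", "C2", "C3", "C9"},
    injected_ids={"C1", "C2", "C3", "C4", "C5"},
)
```

Running the same function once per contamination value gives the precision/recall points for the sensitivity sweep.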

Capacity Forecasting (Prophet)

  • MAPE (Mean Absolute Percentage Error): Primary metric. Compute on a held-out 7-day test window. Target: MAPE < 10% for CPU and memory, < 15% for disk I/O and network (inherently noisier signals).
  • Coverage probability: Verify that the 95% confidence interval actually contains the true value ≥ 90% of the time. If coverage is significantly below 95%, the model is overconfident.
  • Residual analysis: Plot residuals over time. Check for autocorrelation (Durbin-Watson test); correlated residuals indicate the model is missing a pattern.
  • Backtesting: Use Prophet's built-in cross-validation (cross_validation() with initial='21 days', period='7 days', horizon='7 days') to compute rolling MAPE and coverage.
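
MAPE and interval coverage are simple enough to compute by hand on the held-out window, as sketched below with made-up numbers. Skipping zero actuals in MAPE is an assumption of this sketch (the metric is undefined there), which matters for metrics that can legitimately be zero.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent, skipping zero actuals."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("MAPE is undefined when all actual values are zero")
    return 100.0 * sum(abs((a - p) / a) for a, p in pairs) / len(pairs)


def coverage(actual, lower, upper):
    """Fraction of actual values that fall inside [lower, upper]."""
    hits = sum(1 for a, lo, hi in zip(actual, lower, upper) if lo <= a <= hi)
    return hits / len(actual)


# Made-up CPU utilization values for a four-point test window.
actual = [50.0, 40.0, 80.0, 100.0]
predicted = [55.0, 36.0, 80.0, 90.0]
lower = [45.0, 38.0, 70.0, 101.0]
upper = [60.0, 45.0, 90.0, 110.0]

cpu_mape = mape(actual, predicted)
cpu_coverage = coverage(actual, lower, upper)
```

Here three of the four points fall inside the interval, so coverage is 0.75, and the three 10% misses plus one exact hit average to a MAPE of 7.5%.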

Log Classification (TF-IDF + Logistic Regression)

  • Per-class precision, recall, F1-score: Reported via classification_report. Critical class is the most important: target recall > 0.95 (never miss a critical log). Acceptable to trade precision on noise class.
  • Macro-averaged F1: Overall model quality across all four classes. Target: > 0.85.
  • Confusion matrix: Visualize where misclassifications occur. The most costly error is critical → noise (missed incident); least costly is noise → informational.
  • Confidence calibration: Plot predicted probability vs. actual accuracy (reliability diagram). Use Platt scaling if the model is poorly calibrated.
  • Drift monitoring: Track weekly F1 on new data. If F1 drops > 5% from baseline, trigger retraining. Log new vocabulary terms (unseen tokens) as an early indicator of distribution shift.
  • Human-in-the-loop validation: Randomly sample 50 classified logs per week for manual review. Compute agreement rate between model and human labels. This is the ground-truth feedback loop.
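
The drift check can be sketched in plain Python: compute macro F1 on the weekly human-reviewed sample and compare it with the baseline. Reading "drops > 5%" as an absolute drop of 0.05 in F1 is an assumption of this sketch; a relative drop would need a different comparison.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """F1 per label, computed from true/false positive and false negative counts."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1
    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def needs_retraining(baseline_f1, current_f1, max_drop=0.05):
    """True when F1 has fallen more than `max_drop` below the baseline."""
    return (baseline_f1 - current_f1) > max_drop


# Made-up weekly sample: one critical log misclassified as noise.
human = ["critical", "critical", "warning", "noise"]
model = ["critical", "noise", "warning", "noise"]
weekly_f1 = macro_f1(human, model)
```

In practice `sklearn.metrics.f1_score(..., average="macro")` gives the same number; the hand-rolled version is shown only to make the calculation explicit.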

Cross-Cutting Evaluation Practices

  • All models are versioned with timestamps and stored as .joblib files in the models/ directory.
  • Every training run logs hyperparameters, dataset size, and evaluation metrics to a training_log.csv for reproducibility.
  • A monthly "model review" compares current model performance against the previous month to catch silent degradation.
  • The stress test script (scripts/stress_test.sh) runs an end-to-end integration test: inject known anomalies, fill disk to 90%, flood logs, then verify that all three ML systems detect and surface the events within their next processing cycle.
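
The training log could be as simple as an append-only CSV, sketched below. The column names and the choice to store hyperparameters and metrics as JSON strings are assumptions; only the `training_log.csv` filename comes from this plan.

```python
import csv
import json
import os
from datetime import datetime, timezone

FIELDS = ["timestamp", "model", "dataset_size", "hyperparameters", "metrics"]


def log_training_run(path, model, dataset_size, hyperparameters, metrics):
    """Append one training run to the CSV, writing the header on first use."""
    new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "model": model,
            "dataset_size": dataset_size,
            # JSON keeps nested values in a single CSV cell.
            "hyperparameters": json.dumps(hyperparameters, sort_keys=True),
            "metrics": json.dumps(metrics, sort_keys=True),
        })
```

A call such as `log_training_run("training_log.csv", "isolation_forest", 120000, {"contamination": 0.01}, {"detection_rate": 0.93})` would add one row per run, which is enough for the monthly model review to diff against.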
