This is my architectural plan for building an AI network observability/security agent.
The platform is a four-layer stack: Collection → Transport → Intelligence → Presentation.
Collection Layer: Lightweight agents and sensors deployed on every monitored VM. Node Exporter exposes host metrics over HTTP. Filebeat tails syslog, application logs, and auth logs. Zeek (or Suricata) performs passive network analysis on a mirrored interface, producing structured connection logs (conn.log, dns.log, http.log, etc.).
Transport Layer: All telemetry converges on a centralized ingestion bus. Prometheus scrapes metric endpoints on a 15-second interval. Filebeat ships logs to Elasticsearch via Logstash (which handles parsing, enrichment, and routing). Zeek logs are ingested either via Filebeat's Zeek module or by writing directly to a shared NFS mount consumed by the ML pipeline.
Intelligence Layer: A dedicated ML node runs scheduled Python jobs (orchestrated by Airflow or cron). Three model pipelines operate independently: an Isolation Forest anomaly detector on network telemetry, a Prophet time-series forecaster on system metrics, and a TF-IDF + Logistic Regression classifier on log text. Model outputs (anomaly scores, forecasts, log labels) are written back to Elasticsearch as enriched indices and exposed as Prometheus custom metrics via a Python Pushgateway exporter.
Presentation Layer: Grafana serves as the single pane of glass. It queries Prometheus for real-time metrics and forecasts and Elasticsearch for log classification results and anomaly events, and it evaluates alerting rules that trigger on ML-derived thresholds.
┌─────────────────────────────────────────────────────────────────────┐
│                         PRESENTATION LAYER                          │
│                    Grafana (dashboards + alerts)                    │
└──────────┬────────────────────┬─────────────────────────────────────┘
           │                    │
      Prometheus          Elasticsearch
           │                    │
┌──────────┴────────────────────┴─────────────────────────────────────┐
│                         INTELLIGENCE LAYER                          │
│  ┌──────────────────┐  ┌───────────────────┐  ┌─────────────────┐   │
│  │ Anomaly Det.     │  │ Capacity Forecast │  │ Log Classifier  │   │
│  │ Isolation Forest │  │ Prophet           │  │ TF-IDF + LogReg │   │
│  └──────────────────┘  └───────────────────┘  └─────────────────┘   │
└──────────┬────────────────────┬──────────────────┬──────────────────┘
           │                    │                  │
┌──────────┴────────────────────┴──────────────────┴──────────────────┐
│                           TRANSPORT LAYER                           │
│      Prometheus scrape │ Logstash pipeline │ Zeek log shipping      │
└──────────┬────────────────────┬──────────────────┬──────────────────┘
           │                    │                  │
┌──────────┴────────────────────┴──────────────────┴──────────────────┐
│                          COLLECTION LAYER                           │
│     Node Exporter │ Filebeat │ Zeek/Suricata │ Custom exporters     │
│                          (deployed per VM)                          │
└─────────────────────────────────────────────────────────────────────┘
Six VMs are recommended. All run Ubuntu 22.04 LTS.
| VM Role | Hostname | vCPU | RAM | Disk | Purpose |
|---|---|---|---|---|---|
| Observability Server | obs-server | 4 | 8 GB | 100 GB SSD | Prometheus, Grafana, Alertmanager |
| Log Server | log-server | 4 | 12 GB | 200 GB SSD | Elasticsearch (single-node), Logstash, Kibana |
| Network Sensor | net-sensor | 2 | 4 GB | 50 GB | Zeek/Suricata, 2 NICs (mgmt + mirror/span) |
| ML Node | ml-node | 4 | 8 GB | 80 GB SSD | Python ML pipelines, Airflow scheduler, model storage |
| Telemetry Node A | telem-a | 2 | 4 GB | 40 GB | Simulated workload, Node Exporter, Filebeat |
| Telemetry Node B | telem-b | 2 | 4 GB | 40 GB | Simulated workload, Node Exporter, Filebeat |
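Total resource requirement: 18 vCPU, 40 GB RAM, 510 GB disk. The allocated RAM exceeds 32 GB, so a workstation with 32+ GB of RAM fits this only with memory overcommit or if the telem nodes are kept lean (below their listed 4 GB).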
Networking: All VMs on a shared VMware NAT or host-only network (172.16.0.0/24). The network sensor's second NIC is connected to a port mirror/promiscuous VLAN segment carrying traffic from both telemetry nodes. VMware Workstation supports promiscuous mode on virtual switches to enable this.
- Node Exporter (port 9100) on every monitored VM exposes ~800 metrics: CPU per-core utilization, memory pressure, disk I/O latency, network bytes/packets, filesystem usage.
- Prometheus scrapes all exporters every 15s. Recording rules pre-compute 5m rate averages for CPU, memory, and disk. Retention set to 30 days.
- Custom Python exporter on ml-node pushes forecast metrics and anomaly scores to Prometheus Pushgateway, making ML outputs queryable via PromQL.
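The push step in the last bullet can be sketched as follows. This is a minimal illustration, not the project's exporter: the stack lists `prometheus-client` for this job, whereas the sketch talks to the Pushgateway's plain HTTP API with the standard library to show what goes over the wire. The metric names, label names, job name, and the assumption that the Pushgateway runs on `obs-server` at its default port 9091 are all hypothetical.

```python
import urllib.request


def pushgateway_url(base_url, job, instance):
    """Grouping-key URL for one job/instance pair on the Pushgateway."""
    return f"{base_url}/metrics/job/{job}/instance/{instance}"


def render_gauges(samples):
    """Render (name, labels, value) tuples as text-format gauge lines.

    Assumes samples for the same metric name are adjacent in the list.
    """
    lines, typed = [], set()
    for name, labels, value in samples:
        if name not in typed:
            lines.append(f"# TYPE {name} gauge")
            typed.add(name)
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {float(value)}")
    return "\n".join(lines) + "\n"


def push(url, body, timeout=10):
    """PUT replaces every metric previously pushed under this grouping key."""
    request = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Hypothetical metric and label names, for illustration only.
samples = [
    ("ml_days_to_exhaustion", {"host": "telem-a", "resource": "disk"}, 15),
    ("ml_anomaly_count", {"host": "telem-a"}, 3),
]
url = pushgateway_url("http://obs-server:9091", "capacity_forecast", "ml-node")
body = render_gauges(samples)
print(url)
print(body, end="")
```

Calling `push(url, body)` would send the payload; it is left out of the demo so the sketch runs without a live Pushgateway.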
- Filebeat on each telemetry node tails `/var/log/syslog`, `/var/log/auth.log`, `/var/log/kern.log`, and any application logs. Each event is tagged with `host`, `log_source`, and `timestamp`.
- Logstash (on log-server) receives Filebeat output, applies grok filters to extract structured fields (severity, service, PID, message body), and writes to Elasticsearch index `logs-YYYY.MM.DD`.
- ML enrichment: The log classifier reads raw logs from Elasticsearch via the Python client, classifies them, and writes results back to a `logs-classified` index with an added `ml_category` field.
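To make the field extraction concrete, here is a rough Python equivalent of what the grok stage does to one syslog line. It is only a sketch: it assumes the traditional BSD-style syslog file format (`Mon D HH:MM:SS host service[pid]: message`), the function name and the `parse_error` field are hypothetical, and severity is left out because that file format does not carry it.

```python
import re

# Traditional syslog file format: "Mar  3 14:02:11 host service[pid]: message"
SYSLOG_RE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2}\s\d{2}:\d{2}:\d{2})\s"
    r"(?P<host>\S+)\s"
    r"(?P<service>[^\s\[:]+)(?:\[(?P<pid>\d+)\])?:\s"
    r"(?P<message>.*)$"
)


def parse_syslog(line, log_source="syslog"):
    """Split one syslog line into structured fields, or mark it unparsed."""
    match = SYSLOG_RE.match(line)
    if match is None:
        return {"log_source": log_source, "message": line, "parse_error": True}
    event = match.groupdict()
    event["pid"] = int(event["pid"]) if event["pid"] else None
    event["log_source"] = log_source
    return event


auth_event = parse_syslog(
    "Mar  3 14:02:11 telem-a sshd[1042]: Failed password for invalid user admin "
    "from 172.16.0.50 port 51514 ssh2",
    log_source="auth",
)
kernel_event = parse_syslog(
    "Mar  3 14:02:12 telem-a kernel: Out of memory: Killed process 2211 (python3)",
    log_source="kern",
)
print(auth_event["service"], auth_event["pid"])
print(kernel_event["service"], kernel_event["pid"])
```

Lines that do not match are kept with a `parse_error` flag rather than dropped, so unexpected formats stay visible downstream.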
- Zeek on net-sensor runs in cluster mode (single worker for homelab scale) on the mirrored NIC. It produces structured TSV logs: `conn.log` (connection 5-tuples, durations, byte counts), `dns.log`, `http.log`, `ssl.log`, `weird.log`.
- Filebeat Zeek module ships these logs to Elasticsearch for archival and dashboard queries.
- ML consumption: The anomaly detector reads `conn.log` features (duration, orig_bytes, resp_bytes, protocol, service, conn_state) directly from disk or Elasticsearch, computes feature vectors, and scores them.
Why Isolation Forest: Anomaly detection on network traffic is an unsupervised problem, since you don't have labeled "attack" data in a homelab. Isolation Forest is specifically designed for this: it isolates outliers by randomly partitioning feature space, and anomalies require fewer partitions to isolate. It handles high-dimensional numeric data well, trains fast, and requires no distributional assumptions.
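The "fewer partitions" claim can be checked with a toy experiment. The sketch below is not scikit-learn's implementation (it skips subsampling and score normalization); it only counts how many random axis-aligned splits it takes to isolate a point, averaged over many trials. The synthetic data and all names are made up for illustration.

```python
import random


def path_length(point, data, rng, depth=0, max_depth=12):
    """Count random axis-aligned splits needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = rng.randrange(len(point))
    lo = min(row[dim] for row in data)
    hi = max(row[dim] for row in data)
    if lo == hi:
        return depth  # cannot split further on this dimension
    split = rng.uniform(lo, hi)
    # Keep only the rows on the same side of the split as the point.
    if point[dim] < split:
        same_side = [row for row in data if row[dim] < split]
    else:
        same_side = [row for row in data if row[dim] >= split]
    return path_length(point, same_side, rng, depth + 1, max_depth)


def mean_path_length(point, data, n_trees=200, seed=0):
    """Average isolation depth over `n_trees` independent random partitionings."""
    rng = random.Random(seed)
    return sum(path_length(point, data, rng) for _ in range(n_trees)) / n_trees


# Synthetic (duration_seconds, orig_bytes) pairs; values are invented.
gen = random.Random(42)
normal = [(gen.gauss(1.0, 0.2), gen.gauss(500.0, 50.0)) for _ in range(200)]
typical = (1.0, 500.0)      # sits in the middle of the normal cloud
outlier = (30.0, 90000.0)   # long, high-volume connection
data = normal + [typical, outlier]

outlier_depth = mean_path_length(outlier, data)
typical_depth = mean_path_length(typical, data)
print(f"outlier isolated after ~{outlier_depth:.1f} splits")
print(f"typical point isolated after ~{typical_depth:.1f} splits")
```

The outlier is usually cut off by the first split, while the central point survives many more, which is the signal Isolation Forest turns into an anomaly score.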
Features (derived from Zeek conn.log):
- `duration` (connection length)
- `orig_bytes`, `resp_bytes` (data volume)
- `orig_pkts`, `resp_pkts` (packet counts)
- `orig_ip_bytes / duration` (throughput rate)
- `service` (one-hot encoded: dns, http, ssl, other)
- `conn_state` (encoded: S0, S1, SF, REJ, RSTO, etc.)
- `hour_of_day`, `day_of_week` (temporal features)
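A minimal sketch of how these features might be derived from Zeek's TSV output is shown below. It assumes the default tab-separated format with a `#fields` header line and `-` for unset values; the sample header is trimmed to the columns used here, the function names are hypothetical, and `conn_state` is passed through as a string for a later encoding step.

```python
from datetime import datetime, timezone

SERVICES = ("dns", "http", "ssl")  # anything else falls into "other"


def parse_conn_log(lines):
    """Yield one dict per record from Zeek TSV lines, using the #fields header."""
    fields = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#fields"):
            fields = line.split("\t")[1:]
        elif line and not line.startswith("#"):
            yield dict(zip(fields, line.split("\t")))


def to_number(value):
    """Zeek writes '-' for unset fields; treat those as 0."""
    return 0.0 if value in ("-", "", None) else float(value)


def featurize(record):
    """Turn one conn.log record into the numeric features listed above."""
    duration = to_number(record.get("duration"))
    orig_ip_bytes = to_number(record.get("orig_ip_bytes"))
    ts = datetime.fromtimestamp(to_number(record.get("ts")), tz=timezone.utc)
    service = record.get("service", "-")
    features = {
        "duration": duration,
        "orig_bytes": to_number(record.get("orig_bytes")),
        "resp_bytes": to_number(record.get("resp_bytes")),
        "orig_pkts": to_number(record.get("orig_pkts")),
        "resp_pkts": to_number(record.get("resp_pkts")),
        # Guard against zero-length connections.
        "throughput": orig_ip_bytes / duration if duration > 0 else 0.0,
        "conn_state": record.get("conn_state", "-"),
        "hour_of_day": ts.hour,
        "day_of_week": ts.weekday(),
    }
    for name in SERVICES:
        features[f"service_{name}"] = 1.0 if service == name else 0.0
    features["service_other"] = 0.0 if service in SERVICES else 1.0
    return features


FIELDS = ["ts", "uid", "id.orig_h", "id.orig_p", "id.resp_h", "id.resp_p",
          "proto", "service", "duration", "orig_bytes", "resp_bytes",
          "conn_state", "orig_pkts", "orig_ip_bytes", "resp_pkts", "resp_ip_bytes"]
sample = [
    "#separator \\x09",
    "#fields\t" + "\t".join(FIELDS),
    "\t".join(["1700000000.000000", "CaB1", "172.16.0.11", "51514", "172.16.0.1",
               "53", "udp", "dns", "2.0", "800", "1600", "SF", "4", "1200", "4", "2000"]),
    "\t".join(["1700000060.000000", "CaB2", "172.16.0.12", "40000", "172.16.0.20",
               "22", "tcp", "-", "-", "-", "-", "S0", "1", "60", "0", "0"]),
]
rows = [featurize(record) for record in parse_conn_log(sample)]
print(rows[0])
```

Timestamps are interpreted as UTC here; if the baseline should follow local working hours, the timezone would need to change accordingly.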
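Training: Fit on 7 days of "normal" baseline traffic. Contamination parameter set between 0.01 and 0.05 (tuned via domain knowledge). Retrain weekly.
Output: A label and a score per connection. In scikit-learn, `predict` returns the label (-1 = anomaly, 1 = normal), while `decision_function` returns a continuous score where lower means more anomalous and negative values fall below the contamination-derived threshold. Connections scoring below the threshold are flagged and written to an `anomalies` Elasticsearch index with the original connection metadata.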
Why Prophet: It handles daily/weekly seasonality natively, which infrastructure metrics exhibit (think cron jobs at midnight, lower weekend utilization). It's robust to missing data points, requires minimal hyperparameter tuning, and produces interpretable confidence intervals, which is exactly what an SRE needs for capacity planning.
Metrics forecasted:
- CPU utilization (%): 5-minute averaged
- Memory usage (%): absolute used vs. total
- Disk usage (%): per mount point
- Network throughput (bytes/sec): interface level
Training: 30 days of Prometheus data exported via the HTTP API (/api/v1/query_range). Each metric gets its own Prophet model. Models are retrained daily.
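The export step might look roughly like the sketch below, which builds a `query_range` URL and reshapes a response into the `ds`/`y` rows Prophet expects. The base URL (Prometheus on `obs-server` at its default port 9090), the PromQL expression, and the helper names are assumptions for illustration; the sample response is hand-written and the sketch handles a single returned series.

```python
import json
from datetime import datetime, timezone
from urllib.parse import urlencode


def query_range_url(base_url, promql, start, end, step="5m"):
    """Build a /api/v1/query_range URL for the given time window."""
    params = {
        "query": promql,
        "start": start.timestamp(),
        "end": end.timestamp(),
        "step": step,
    }
    return f"{base_url}/api/v1/query_range?{urlencode(params)}"


def to_prophet_rows(response_body):
    """Convert the first series of a matrix response into ds/y rows.

    Prophet expects timezone-naive timestamps, so UTC times are made naive.
    """
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {payload}")
    series = payload["data"]["result"][0]
    rows = []
    for ts, value in series["values"]:
        ds = datetime.fromtimestamp(float(ts), tz=timezone.utc).replace(tzinfo=None)
        rows.append({"ds": ds, "y": float(value)})
    return rows


# Hypothetical CPU utilization query for one telemetry node.
promql = ('100 - avg by (instance) '
          '(rate(node_cpu_seconds_total{mode="idle",instance="telem-a:9100"}[5m])) * 100')
url = query_range_url(
    "http://obs-server:9090",
    promql,
    start=datetime(2023, 10, 15, tzinfo=timezone.utc),
    end=datetime(2023, 11, 14, tzinfo=timezone.utc),
)

# Hand-written sample in the shape of a query_range response.
sample_response = json.dumps({
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [{
            "metric": {"instance": "telem-a:9100"},
            "values": [[1700000000, "12.5"], [1700000300, "14.0"]],
        }],
    },
})
rows = to_prophet_rows(sample_response)
print(url)
print(rows)
```

At a 5-minute step, 30 days is 8,640 points per series, which should stay under Prometheus's per-series point limit for range queries; a finer step may need the window split into chunks.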
Output: 7-day and 30-day forecasts with 80% and 95% confidence intervals. A "days to exhaustion" metric is computed by extrapolating when the upper confidence bound crosses 85% (warning) or 95% (critical). These values are pushed to Prometheus as custom gauge metrics.
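One way to compute "days to exhaustion" is to walk the forecast's upper bound and interpolate where it first crosses the threshold. The sketch below assumes the forecast has already been reduced to `(days_ahead, upper_bound_percent)` pairs; the function name and the sample numbers are invented for illustration.

```python
def days_to_exhaustion(forecast, threshold):
    """Return the day offset at which the upper bound first reaches `threshold`.

    `forecast` is a list of (days_ahead, upper_bound_percent) sorted by day.
    Returns None if the threshold is never reached within the horizon.
    """
    prev_day, prev_val = None, None
    for day, upper in forecast:
        if upper >= threshold:
            if prev_day is None:
                return float(day)  # already at or above the threshold
            # Linear interpolation between the last point below and the first above.
            frac = (threshold - prev_val) / (upper - prev_val)
            return prev_day + frac * (day - prev_day)
        prev_day, prev_val = day, upper
    return None


# Invented upper-bound forecast for disk usage on one mount point.
forecast = [(0, 70.0), (7, 77.0), (14, 84.0), (21, 91.0), (28, 98.0)]
warning_days = days_to_exhaustion(forecast, 85.0)
critical_days = days_to_exhaustion(forecast, 95.0)
never = days_to_exhaustion(forecast, 99.0)
print(warning_days, critical_days, never)
```

With these numbers the warning threshold is crossed at roughly day 15 and the critical one at roughly day 25. Returning `None` for "not within the horizon" keeps that case distinct from a small positive value when the gauge is pushed.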
Why TF-IDF + Logistic Regression: For structured log classification, this pipeline is well-proven and production-grade. TF-IDF converts log messages into sparse feature vectors that capture discriminative terms ("segfault," "OOM," "connection refused" vs. "session opened," "cron started"). Logistic Regression is fast to train, interpretable (you can inspect feature weights), and performs well with high-dimensional sparse inputs. More complex models (BERT, etc.) are overkill for this domain and impractical in a homelab.
Classes:
| Label | Description | Example patterns |
|---|---|---|
| `critical` | Service-impacting failures | OOM killer, segfault, disk full, kernel panic |
| `warning` | Degradation indicators | high latency, retries, connection timeout |
| `informational` | Normal operational events | service started, user login, cron executed |
| `noise` | Non-actionable entries | DHCP renewal, NTP sync, routine heartbeats |
Training data: Manually label 2,000 to 5,000 log lines from your homelab (this is realistic: a weekend of labeling). Use stratified sampling to ensure class balance. Augment with public datasets like the Loghub collection (https://github.com/logpai/loghub).
Pipeline: raw_text → regex cleanup (strip timestamps, PIDs) → TF-IDF vectorizer (`max_features=10000`, `ngram_range=(1,2)`) → Logistic Regression (`C=1.0`, `class_weight='balanced'`).
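The regex cleanup stage could be sketched like this. Stripping timestamps and PIDs follows the pipeline description; masking IP addresses and bare numbers with placeholder tokens is an extra assumption added here so that per-event values do not inflate the TF-IDF vocabulary. The function name is hypothetical.

```python
import re

_PATTERNS = [
    # Leading syslog timestamp, e.g. "Mar  3 14:02:11 "
    (re.compile(r"^\w{3}\s+\d{1,2}\s\d{2}:\d{2}:\d{2}\s"), ""),
    # PIDs attached to the service name, e.g. "sshd[1042]"
    (re.compile(r"\[\d+\]"), ""),
    # Assumption: mask IPv4 addresses and standalone numbers.
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<ip>"),
    (re.compile(r"\b\d+\b"), "<num>"),
]


def clean(line):
    """Normalize one raw log line before TF-IDF vectorization."""
    text = line
    for pattern, replacement in _PATTERNS:
        text = pattern.sub(replacement, text)
    # Collapse whitespace and lowercase so casing does not split terms.
    return re.sub(r"\s+", " ", text).strip().lower()


raw = ("Mar  3 14:02:11 telem-a sshd[1042]: Failed password for invalid user "
       "admin from 172.16.0.50 port 51514 ssh2")
cleaned = clean(raw)
print(cleaned)
```

The same `clean` function has to run at training time and at classification time, otherwise the vectorizer sees tokens it was never fitted on.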
Output: Each log line gets a predicted label and confidence score. Low-confidence predictions (< 0.6) are flagged for human review, creating a feedback loop for model improvement.
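The review routing can be sketched in a few lines. The input shape (a message paired with a label-to-probability mapping, as `predict_proba` output could be arranged) and the function name are assumptions; the 0.6 threshold and the `ml_category` field come from the description above.

```python
REVIEW_THRESHOLD = 0.6


def route(predictions, threshold=REVIEW_THRESHOLD):
    """Split predictions into accepted records and ones needing human review.

    `predictions` is an iterable of (message, {label: probability}) pairs.
    """
    accepted, review = [], []
    for message, probs in predictions:
        label, confidence = max(probs.items(), key=lambda item: item[1])
        record = {"message": message, "ml_category": label, "confidence": confidence}
        (review if confidence < threshold else accepted).append(record)
    return accepted, review


# Invented probabilities for two log lines.
predictions = [
    ("kernel: out of memory: killed process",
     {"critical": 0.91, "warning": 0.06, "informational": 0.02, "noise": 0.01}),
    ("app: retrying upstream request",
     {"critical": 0.30, "warning": 0.45, "informational": 0.20, "noise": 0.05}),
]
accepted, review = route(predictions)
print(len(accepted), "accepted;", len(review), "for review")
```

Records in `review` still carry the model's best guess, so a reviewer only has to confirm or correct it.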
  TELEMETRY NODES            NETWORK SENSOR
        │ Node Exporter            │ Zeek conn.log
        │ Filebeat (logs)          │ Filebeat (Zeek logs)
        │                          │
        ▼                          ▼
┌─────────────────────────────────────────────────┐
│               TRANSPORT / STORAGE               │
│                                                 │
│  Prometheus ◄── metric scrape                   │
│  (TSDB, 30d)                                    │
│                                                 │
│  Logstash ◄── log events ──► Elasticsearch      │
│  (parse/enrich)              (raw logs index)   │
│                              (zeek logs index)  │
└───────┬──────────────────────────┬──────────────┘
        │                          │
        ▼                          ▼
┌─────────────────────────────────────────────────┐
│                     ML NODE                     │
│                                                 │
│  1. Query Prometheus HTTP API ──► Prophet       │
│     (30d CPU/RAM/disk/net)        (forecast)    │
│       │                                         │
│       └──► Push forecast metrics to Pushgateway │
│                                                 │
│  2. Query Elasticsearch ──► TF-IDF + LogReg     │
│     (raw log text)          (classify)          │
│       │                                         │
│       └──► Write classified logs back to ES     │
│                                                 │
│  3. Read Zeek conn.log ──► Isolation Forest     │
│     (from ES or NFS)       (score anomalies)    │
│       │                                         │
│       └──► Write anomaly events to ES           │
│            Push anomaly scores to Pushgateway   │
└───────┬─────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────┐
│                     GRAFANA                     │
│                                                 │
│  Data sources:                                  │
│    Prometheus → metrics + forecasts + anomaly   │
│    Elasticsearch → logs + classifications +     │
│                    anomaly events + Zeek data   │
│                                                 │
│  Dashboards → Alerts → PagerDuty/Slack webhook  │
└─────────────────────────────────────────────────┘
- Provision all 6 VMs in VMware Workstation. Assign static IPs on 172.16.0.0/24.
- Install and configure Node Exporter on telem-a and telem-b.
- Deploy Prometheus on obs-server. Validate scrape targets are up.
- Install Grafana on obs-server. Connect Prometheus data source. Import the Node Exporter Full dashboard (ID: 1860).
- Validate end-to-end: generate CPU load with `stress-ng`, confirm it appears in Grafana within 30 seconds.
- Deploy Elasticsearch (single-node) and Logstash on log-server.
- Install Filebeat on telem-a/b. Configure syslog and auth.log inputs.
- Build Logstash grok pipeline for syslog parsing. Validate structured output in Elasticsearch.
- Connect Elasticsearch as a Grafana data source. Build a basic log explorer dashboard.
- Configure net-sensor's second NIC in promiscuous mode.
- Install Zeek. Tune `local.zeek` for your network (set `Site::local_nets`).
- Validate Zeek is producing conn.log, dns.log, http.log.
- Ship Zeek logs to Elasticsearch via Filebeat Zeek module.
- Generate synthetic traffic with `hping3`, `nmap`, or `curl` loops to create a baseline dataset.
- Export 5,000 log lines from Elasticsearch. Manually label them (use a simple CSV + spreadsheet workflow).
- Build the TF-IDF + Logistic Regression pipeline in a Jupyter notebook on ml-node.
- Train/evaluate with 80/20 split. Target F1 > 0.85 on each class.
- Productionize: write a Python script that queries ES for new logs, classifies them, and writes results back. Schedule with cron (every 5 minutes).
- Query Prometheus API for 30 days of CPU, RAM, disk, network metrics.
- Fit Prophet models per metric per host.
- Generate 7-day and 30-day forecasts. Compute days-to-exhaustion.
- Push forecasts to Prometheus Pushgateway. Build Grafana forecast overlay dashboards.
- Set Grafana alerts: warn at < 14 days to exhaustion, critical at < 7 days.
- Extract and featurize 7 days of Zeek conn.log data.
- Train Isolation Forest. Tune contamination parameter.
- Score new connections in batch (every 10 minutes via cron).
- Write anomaly events to Elasticsearch. Build anomaly dashboard in Grafana.
- Unify all Grafana dashboards into a single "Platform Overview" home dashboard.
- Configure Alertmanager with routing rules and a Slack/Discord webhook.
- Write a `stress_test.sh` script that simulates failures (disk fill, CPU spike, port scan, log flood) to validate end-to-end detection.
- Document everything. Write the README. Record a demo walkthrough.
| Layer | Tool | Version | Purpose |
|---|---|---|---|
| Metrics collection | Prometheus Node Exporter | 1.7+ | Host-level metrics |
| Metrics storage | Prometheus | 2.50+ | TSDB, scraping, PromQL |
| Metrics gateway | Prometheus Pushgateway | 1.7+ | ML-derived metric ingestion |
| Log shipper | Filebeat | 8.x | Lightweight log forwarding |
| Log processing | Logstash | 8.x | Parsing, enrichment, routing |
| Log/event store | Elasticsearch | 8.x (single-node) | Full-text search, analytics |
| Network monitor | Zeek | 6.x | Passive traffic analysis, structured logs |
| Alt network monitor | Suricata | 7.x | IDS/IPS with EVE JSON output |
| Packet analysis | Wireshark / tshark | 4.x | Ad hoc deep inspection |
| Visualization | Grafana | 10.x | Dashboards, alerting |
| Alerting | Alertmanager | 0.27+ | Alert routing, dedup, silencing |
| ML framework | scikit-learn | 1.4+ | Isolation Forest, Logistic Regression, TF-IDF |
| Forecasting | Prophet (via `prophet`) | 1.1+ | Time-series capacity forecasting |
| Data handling | pandas, numpy | latest | Feature engineering, data wrangling |
| ES client | elasticsearch-py | 8.x | Query/write Elasticsearch from Python |
| Prometheus client | prometheus-client | 0.20+ | Push custom metrics from Python |
| Scheduling | cron or Apache Airflow | n/a | ML pipeline orchestration |
| Notebooks | JupyterLab | 4.x | Model development, EDA |
| Traffic gen | hping3, nmap, stress-ng | n/a | Synthetic workload/traffic for testing |
ai-infra-observability/
├── README.md  # Project overview, architecture, setup guide
├── LICENSE
├── docs/
│   ├── architecture.md  # This document
│   ├── vm-setup.md  # VM provisioning instructions
│   ├── runbook.md  # Operational procedures
│   └── screenshots/  # Dashboard screenshots for portfolio
│
├── infrastructure/
│   ├── ansible/  # (Optional) Automated VM provisioning
│   │   ├── inventory.yml
│   │   ├── playbook-prometheus.yml
│   │   ├── playbook-elk.yml
│   │   ├── playbook-zeek.yml
│   │   └── roles/
│   ├── prometheus/
│   │   ├── prometheus.yml  # Scrape config
│   │   ├── recording_rules.yml  # Pre-computed PromQL expressions
│   │   └── alerting_rules.yml  # Alert definitions
│   ├── logstash/
│   │   ├── pipelines.yml
│   │   └── conf.d/
│   │       ├── 01-filebeat-input.conf
│   │       ├── 02-syslog-filter.conf
│   │       └── 03-elasticsearch-output.conf
│   ├── filebeat/
│   │   └── filebeat.yml  # Per-node Filebeat config
│   ├── zeek/
│   │   ├── local.zeek  # Zeek site policy
│   │   └── node.cfg  # Zeek cluster config
│   └── grafana/
│       └── provisioning/
│           ├── dashboards/  # JSON dashboard definitions
│           └── datasources/  # Prometheus + ES datasource configs
│
├── ml/
│   ├── requirements.txt  # Python dependencies
│   ├── common/
│   │   ├── config.py  # Shared configuration (ES hosts, Prometheus URL)
│   │   ├── es_client.py  # Elasticsearch helper functions
│   │   └── prom_client.py  # Prometheus query helper
│   ├── anomaly_detection/
│   │   ├── feature_engineering.py  # Zeek conn.log → feature vectors
│   │   ├── train.py  # Isolation Forest training
│   │   ├── detect.py  # Batch scoring script (cron target)
│   │   └── models/  # Serialized model files (.joblib)
│   ├── capacity_forecast/
│   │   ├── extract_metrics.py  # Prometheus API data extraction
│   │   ├── train.py  # Prophet model training
│   │   ├── forecast.py  # Generate forecasts + push to Pushgateway
│   │   └── models/
│   ├── log_classification/
│   │   ├── label_studio_export.py  # (Optional) export from labeling tool
│   │   ├── preprocess.py  # Text cleaning, TF-IDF fitting
│   │   ├── train.py  # Logistic Regression training
│   │   ├── classify.py  # Batch classification (cron target)
│   │   └── models/
│   └── notebooks/
│       ├── 01_eda_system_metrics.ipynb
│       ├── 02_eda_zeek_connections.ipynb
│       ├── 03_log_labeling_analysis.ipynb
│       ├── 04_anomaly_model_tuning.ipynb
│       ├── 05_forecast_validation.ipynb
│       └── 06_classifier_evaluation.ipynb
│
├── scripts/
│   ├── stress_test.sh  # End-to-end validation: simulated failures
│   ├── generate_traffic.sh  # Synthetic network traffic generation
│   ├── export_training_data.py  # Bulk export from ES for labeling
│   └── cron_setup.sh  # Install cron jobs for ML pipelines
│
└── tests/
    ├── test_feature_engineering.py
    ├── test_log_preprocessing.py
    └── test_es_integration.py
Single-pane summary. Four stat panels at top: total anomalies (24h), critical logs (24h), nearest capacity exhaustion date, overall system health score (composite). Below: a row of sparklines per host showing CPU, RAM, disk. A log volume time series broken down by ML classification. A network anomaly timeline. This is the "glance and know" dashboard.
One row per metric (CPU, RAM, disk, network). Each row contains: a time series panel showing 30 days of historical data overlaid with the Prophet forecast line and shaded confidence intervals (80% and 95%). A gauge showing current utilization. A stat panel showing computed "days to exhaustion." Alert annotations appear on the time series when thresholds are crossed.
Top row: anomaly count over time (bar chart, 1-hour buckets), top 10 anomalous source IPs (table), protocol distribution of anomalous connections (pie chart). Bottom row: a scatter plot of connection duration vs. bytes transferred, colored by anomaly score. A filterable table of raw anomaly events with columns for timestamp, source IP, dest IP, port, service, duration, bytes, and anomaly score. Clicking a row links to the Zeek log entry in Elasticsearch.
Top row: log volume by classification over time (stacked area chart: critical in red, warning in amber, informational in blue, noise in gray). A stat panel showing classifier confidence distribution. Middle row: a table of critical and warning logs with timestamp, host, service, message, ML confidence. Bottom row: low-confidence predictions flagged for human review. A "classification accuracy" panel updated from periodic manual spot-checks.
Variable selector for hostname. Shows all Node Exporter metrics for that host: CPU per-core, memory breakdown (used/buffered/cached/free), disk I/O operations and latency, network interface traffic, filesystem usage per mount. Includes log volume from that host and any anomalies associated with its IP.
Since this is unsupervised with no ground-truth labels, evaluation uses a pragmatic approach:
- Injection testing: Deliberately generate known-bad traffic (port scans via nmap, DNS exfiltration simulation, high-volume data transfers) and measure detection rate. Target: >90% of injected anomalies flagged.
- False positive rate: Manually review 100 random flagged anomalies per week. Track the share of flagged events that turn out to be false alarms (strictly a false discovery rate, 1 - precision, since only flagged events are reviewed). Target: < 20% (acceptable for alerting with human-in-the-loop).
- Silhouette score: Treat the model's inlier/outlier labels as two clusters and measure their separation in the feature space to validate that the model is finding meaningful structure, not random noise.
- Contamination sensitivity analysis: Sweep contamination from 0.005 to 0.10 and plot precision/recall on the injected anomaly set to find the optimal operating point.
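The contamination sweep can be approximated offline once connections have scores and the injected ones are marked. The sketch below assumes scores where lower means more anomalous and flags the lowest-scoring fraction of the evaluation set for each contamination value, which is a simplification of how the trained model sets its threshold. All data and names are invented.

```python
def sweep(scores, injected, contaminations):
    """Compute precision and recall on injected anomalies per contamination value.

    `scores`: lower means more anomalous. `injected`: parallel list of bools.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    total_injected = sum(injected)
    results = []
    for contamination in contaminations:
        k = max(1, round(contamination * len(scores)))
        hits = sum(1 for i in order[:k] if injected[i])
        results.append({
            "contamination": contamination,
            "precision": hits / k,
            "recall": hits / total_injected if total_injected else 0.0,
        })
    return results


# 100 invented scores in ascending order; four of the lowest ten were injected.
scores = [i / 100 for i in range(100)]
injected = [i in {0, 1, 2, 7} for i in range(100)]
results = sweep(scores, injected, [0.01, 0.05, 0.10])
for row in results:
    print(row)
```

Plotting these rows gives the precision/recall trade-off; the operating point is wherever recall on injected traffic meets the target without precision collapsing.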
- MAPE (Mean Absolute Percentage Error): Primary metric. Compute on a held-out 7-day test window. Target: MAPE < 10% for CPU and memory, < 15% for disk I/O and network (inherently noisier signals).
- Coverage probability: Verify that the 95% confidence interval actually contains the true value ≥ 90% of the time. If coverage is significantly below 95%, the model is overconfident.
- Residual analysis: Plot residuals over time. Check for autocorrelation (Durbin-Watson test); correlated residuals indicate the model is missing a pattern.
- Backtesting: Use Prophet's built-in cross-validation (`cross_validation()` with `initial='21 days'`, `period='7 days'`, `horizon='7 days'`) to compute rolling MAPE and coverage.
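For reference, MAPE and interval coverage reduce to a few lines of arithmetic. The sketch below uses plain lists and invented numbers; skipping zero-valued actuals in MAPE is an assumption made here to avoid division by zero.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, skipping points where actual is 0."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    return 100.0 * sum(abs((a - p) / a) for a, p in pairs) / len(pairs)


def coverage(actual, lower, upper):
    """Fraction of actual values that fall inside [lower, upper]."""
    inside = sum(1 for a, lo, hi in zip(actual, lower, upper) if lo <= a <= hi)
    return inside / len(actual)


# Invented CPU utilization values (%) for a held-out window.
actual = [50.0, 60.0, 40.0, 80.0]
predicted = [55.0, 57.0, 44.0, 72.0]
lower = [45.0, 50.0, 41.0, 70.0]
upper = [60.0, 65.0, 50.0, 85.0]

error = mape(actual, predicted)
covered = coverage(actual, lower, upper)
print(f"MAPE = {error:.2f}%, coverage = {covered:.0%}")
```

Here the third point falls just below its lower bound, so coverage is 3 of 4; on a real 7-day window the same check is run against the 95% interval.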
- Per-class precision, recall, F1-score: Reported via `classification_report`. Critical class is the most important: target recall > 0.95 (never miss a critical log). Acceptable to trade precision on noise class.
- Macro-averaged F1: Overall model quality across all four classes. Target: > 0.85.
- Confusion matrix: Visualize where misclassifications occur. The most costly error is `critical → noise` (missed incident); least costly is `noise → informational`.
- Confidence calibration: Plot predicted probability vs. actual accuracy (reliability diagram). Use Platt scaling if the model is poorly calibrated.
- Drift monitoring: Track weekly F1 on new data. If F1 drops > 5% from baseline, trigger retraining. Log new vocabulary terms (unseen tokens) as an early indicator of distribution shift.
- Human-in-the-loop validation: Randomly sample 50 classified logs per week for manual review. Compute agreement rate between model and human labels. This is the ground-truth feedback loop.
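The drift and agreement checks above are simple enough to sketch directly. Two assumptions are made here: the "5%" drop is read as 0.05 absolute F1 points, and tokens are split on whitespace rather than with the vectorizer's own tokenizer. Function names are hypothetical.

```python
def needs_retraining(baseline_f1, weekly_f1, max_drop=0.05):
    """True when weekly F1 has fallen more than `max_drop` below the baseline."""
    return (baseline_f1 - weekly_f1) > max_drop


def unseen_token_rate(vocabulary, messages):
    """Share of tokens in new messages that the fitted vocabulary has never seen."""
    tokens = [token for message in messages for token in message.lower().split()]
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if token not in vocabulary) / len(tokens)


def agreement_rate(model_labels, human_labels):
    """Fraction of sampled logs where the model and the reviewer agree."""
    matches = sum(1 for m, h in zip(model_labels, human_labels) if m == h)
    return matches / len(human_labels)


vocabulary = {"session", "opened", "for", "user"}
rate = unseen_token_rate(vocabulary, ["session opened for user root", "nvme timeout"])
agreement = agreement_rate(
    ["critical", "noise", "warning", "noise"],
    ["critical", "noise", "informational", "noise"],
)
print(needs_retraining(0.90, 0.84), round(rate, 3), agreement)
```

A rising unseen-token rate tends to show up before F1 drops, since it needs no labels.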
- All models are versioned with timestamps and stored as `.joblib` files in the `models/` directory.
- Every training run logs hyperparameters, dataset size, and evaluation metrics to a `training_log.csv` for reproducibility.
- A monthly "model review" compares current model performance against the previous month to catch silent degradation.
- The stress test script (`scripts/stress_test.sh`) runs an end-to-end integration test: inject known anomalies, fill disk to 90%, flood logs, then verify that all three ML systems detect and surface the events within their next processing cycle.
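The versioning and training-log conventions could be implemented along these lines. The file-name pattern, CSV column names, and helper names are assumptions for illustration; the sketch writes to a temporary directory and does not serialize a real model.

```python
import csv
import json
import os
import tempfile
from datetime import datetime, timezone


def versioned_model_path(models_dir, name, now=None):
    """Build a timestamped .joblib path such as models/<name>_<UTC timestamp>.joblib."""
    now = now or datetime.now(timezone.utc)
    return os.path.join(models_dir, f"{name}_{now:%Y%m%dT%H%M%SZ}.joblib")


def log_training_run(log_path, model_path, hyperparams, dataset_size, metrics):
    """Append one row to training_log.csv, writing the header on first use."""
    row = {
        "model_path": model_path,
        "dataset_size": dataset_size,
        "hyperparams": json.dumps(hyperparams, sort_keys=True),
        "metrics": json.dumps(metrics, sort_keys=True),
    }
    write_header = not os.path.exists(log_path)
    with open(log_path, "a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(row))
        if write_header:
            writer.writeheader()
        writer.writerow(row)


with tempfile.TemporaryDirectory() as tmp:
    trained_at = datetime(2024, 1, 7, 3, 0, 0, tzinfo=timezone.utc)
    model_path = versioned_model_path(tmp, "isolation_forest", trained_at)
    model_name = os.path.basename(model_path)
    log_path = os.path.join(tmp, "training_log.csv")
    # Invented hyperparameters and metrics for two training runs.
    log_training_run(log_path, model_path, {"contamination": 0.02}, 120000,
                     {"detection_rate": 0.93})
    log_training_run(log_path, model_path, {"contamination": 0.03}, 125000,
                     {"detection_rate": 0.95})
    with open(log_path, newline="") as handle:
        logged = list(csv.DictReader(handle))

print(model_name)
print(logged[0]["hyperparams"], logged[1]["dataset_size"])
```

Storing hyperparameters and metrics as JSON strings keeps the CSV schema stable even when different pipelines log different keys.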