Drop a log file. Get ranked incidents, AI-explained verdicts, MITRE-mapped attack chains, and a forensic PDF report — in minutes.
NOCTRA AI is an open-source, browser-based Security Operations Center powered by Google Gemini AI. It ingests raw log files (CSV, JSON, syslog, EVTX, Windows Event, Apache, logfmt), runs 43 detection rules spanning the full MITRE ATT&CK kill-chain plus an XGBoost ML detector and a behavioral anomaly engine (UEBA), scores every alert with an explainable AI probability, collapses duplicate alerts before they ever reach the analyst, maps threats to MITRE techniques, and generates forensic PDF reports — all without storing a single byte to disk. A 5-phase ML self-upgrade pipeline continuously retrains thresholds and field aliases from real corpus data. Built for SOC analysts, blue teams, and cybersecurity learners who need enterprise-grade threat detection without enterprise-grade setup time.
Storageless · 43 rules across MITRE ATT&CK · XGBoost ML detector · Self-upgrading engine · Explainable AI · Evidence-bearing alerts · Auto-dedup · L1/L2 dual-mode · Dockerized
noctra-ai-autonomous-soc-platform.vercel.app
No signup required. Drop a log file or click "Run demo scenario" to see a synthetic multi-stage attack.
Note: The backend runs on Render's free tier — the first request after inactivity may take 30–50 seconds to wake up.
- What is a SOC?
- What NOCTRA does
- Why NOCTRA vs a normal SOC tool
- The detection pipeline
- Inside a detection rule (worked example)
- Anatomy of an alert
- The 43-rule catalogue at a glance
- Where AI is integrated
- How the AI attack score is calculated
- Noise reduction: how NOCTRA stops alert floods
- Walkthrough: log file → PDF report
- Architecture
- Deployment
- Local Development
- Glossary
- FAQ
A SOC (Security Operations Center) is the team and software inside a company that watches everything happening on the network — login attempts, file transfers, DNS queries, app errors — and tries to spot the activity that looks like an attacker rather than a normal user.
Think of a SOC like a hospital triage desk, but for cyber attacks. Most patients (events) walk in with a cold (noise). A few have something serious (an attack). The SOC's job is to figure out which is which, fast, with limited people.
| Tier | Role | Typical question |
|---|---|---|
| L1 — Triage Analyst | First responder. Decides if an alert is real (TP) or junk (FP). | "Is this worth waking someone up?" |
| L2 — Threat Analyst | Deep investigator. Reconstructs how an attacker moved. | "What did they touch, and how did they get in?" |
NOCTRA AI is a browser-based SOC that takes a raw log file (CSV / JSON / syslog / web access / EVTX / Windows Event / Apache / logfmt), runs 43 detection rules covering brute-force → lateral movement → exfiltration → cloud-identity abuse → EDR file-drops + an XGBoost ML detector + a behavioral anomaly engine (UEBA) + an AI classifier, collapses duplicates so one logical event = one alert, and gives the analyst a ranked queue of alerts with structured evidence and AI rationale. Behind the scenes, a 5-phase self-upgrade pipeline (corpus analyse → rule synthesise → parser extraction → model retrain) continuously improves thresholds, field aliases, and the ML model from labeled corpus data — triggered nightly or on demand via POST /admin/retrain. The analyst clicks through, the AI suggests verdicts and explains its reasoning, the platform auto-correlates related alerts into MITRE-mapped attack chains, and a one-click PDF incident report lands at the end. Nothing is stored on disk — all data lives in RAM and is wiped when the session ends.
| | Traditional SOC stack | NOCTRA AI |
|---|---|---|
| Deployment | Days to weeks — clusters, licenses, ingestion pipelines | Browser tab. No install. |
| Cost per investigation | $$ per GB ingested | Free per session |
| AI scoring | Usually a black-box "risk score" | 0–100 TP probability with the actual signals that produced it |
| Why this score? | Rarely shown | Click any score → list of weighted signals |
| MITRE ATT&CK mapping | Add-on / paid module | Built-in. Every rule maps to a technique + tactic |
| Attack-chain correlation | Custom SPL / KQL queries | Automatic. Related alerts stitched into kill-chain narratives |
| L1 vs L2 split | Same UI for everyone | Two purpose-built lenses |
| Behavioral profiling (UEBA) | Separate product | Built-in. Per-user + per-IP baselines with σ-deviation |
| Storage / compliance | Petabytes on disk | Zero bytes stored. Session lives in RAM, cleared on end |
Trade-off: NOCTRA is built for one log file per session — not a full enterprise SIEM. Best for: incident response, learning the SOC analyst role, demos, blue-team exercises, post-breach triage.
flowchart LR
A[01<br/>Ingest] --> B[02<br/>Normalize]
B --> C[03<br/>Detect]
C --> D[04<br/>ML Scan]
D --> E[05<br/>Score]
E --> F[06<br/>Enrich]
F --> G[07<br/>Chain]
G --> X[08<br/>Dedup]
X --> H[09<br/>Triage]
H --> I[10<br/>Report]
classDef stage fill:#1c1c20,stroke:#e11d48,color:#fff
class A,B,C,D,E,F,G,X,H,I stage
| # | Stage | What happens |
|---|---|---|
| 01 | Ingest | Auto-detect format (CSV/TSV, JSON/JSONL, Apache, syslog, Windows Event, logfmt) — format-detection signals from parser_hints.json (corpus-learned) are also consulted. Any unknown log falls back to a generic line parser, so ingestion never fails. |
| 02 | Normalize | Standardise columns to a canonical schema: timestamp, source_ip, dest_ip, dest_host, user, event_type, status, port, bytes. 95+ field aliases (40 built-in + 55 corpus-learned from parser_hints.json) cover camelCase cloud variants. Nested JSON is flattened so rules can read fields like alert_signature_id from a Suricata payload. |
| 03 | Detect | Run 43 deterministic rules (R001–R043) + UEBA IsolationForest + cross-event correlation. Rules group events by attacker context (IP, user, device) — one logical attack = one alert, not one per packet. Thresholds are hot-reloaded from rule_config.json (no restart needed). |
| 04 | ML Scan | XGBoost ML detector (ml_detector.py) scores every row with a 519-feature vector (500 TF-IDF + 12 hand-crafted + 7 format one-hots). Rows ≥ 70% confidence that weren't caught by deterministic rules emit additional ML-* alerts. |
| 05 | Score | AI assigns each alert a 0–1 TP probability with structured rationale + SHAP feature attribution. Heuristic fallback runs if Gemini is unavailable. |
| 06 | Enrich | IP reputation (AbuseIPDB / VirusTotal), geo, ASN, hash → MITRE technique. Lazy — only called when the analyst opens the alert. |
| 07 | Chain | Group related alerts into attack chains. Example: failed-login burst → successful login → privilege escalation → exfiltration = one kill-chain narrative. |
| 08 | Dedup | Safety net. Collapse identical alerts across rules and repeated uploads using (rule_id, source_ip, user, dest_ip) keys. Summed event_count, earliest timestamp, highest severity, and rolled_up_count surfaced in extra. |
| 09 | Triage | L1 queue with drawer, playbook, AI suggestion, keyboard nav. |
| 10 | Report | Generate L1 shift handover or L2 forensic dossier as PDF. |
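Stage 07 can be pictured as grouping alerts by actor and ordering them in time. The sketch below is only an illustration under assumptions: alerts are dicts carrying the `source_ip`, `timestamp`, `mitre_tactic`, and `alert_id` fields from the alert schema, and `build_chains` with its two-tactic minimum is a hypothetical helper, not NOCTRA's actual correlation logic.

```python
from collections import defaultdict
from datetime import datetime


def build_chains(alerts, min_tactics=2):
    """Group alerts by source_ip; keep groups that span several tactics."""
    by_actor = defaultdict(list)
    for alert in alerts:
        if alert.get("source_ip"):
            by_actor[alert["source_ip"]].append(alert)

    chains = []
    for actor, group in by_actor.items():
        # Timestamps are ISO-8601 with a trailing Z; rewrite it for fromisoformat.
        group.sort(key=lambda a: datetime.fromisoformat(a["timestamp"].replace("Z", "+00:00")))
        tactics = []
        for alert in group:
            if alert["mitre_tactic"] not in tactics:
                tactics.append(alert["mitre_tactic"])
        if len(tactics) >= min_tactics:
            chains.append({
                "actor": actor,
                "tactics": tactics,
                "alert_ids": [a["alert_id"] for a in group],
            })
    return chains


alerts = [
    {"alert_id": "a-3", "source_ip": "203.0.113.66", "timestamp": "2026-05-25T02:40:00Z", "mitre_tactic": "Exfiltration"},
    {"alert_id": "a-1", "source_ip": "203.0.113.66", "timestamp": "2026-05-25T02:31:14Z", "mitre_tactic": "Credential Access"},
    {"alert_id": "a-2", "source_ip": "203.0.113.66", "timestamp": "2026-05-25T02:35:00Z", "mitre_tactic": "Privilege Escalation"},
    {"alert_id": "a-9", "source_ip": "198.51.100.7", "timestamp": "2026-05-25T03:00:00Z", "mitre_tactic": "Discovery"},
]
chains = build_chains(alerts)
```

Here the three alerts from 203.0.113.66 form one chain ordered by time, while the lone Discovery alert stays a standalone alert.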
A separate 5-phase pipeline (the four scripts below plus a final engine hot-reload) runs nightly (03:00 UTC by default) or on demand via POST /admin/retrain:
flowchart LR
P1[Phase 1<br/>corpus_analyser] --> P2[Phase 2<br/>rule_synthesiser]
P2 --> P3[Phase 3<br/>parser_pattern_extractor]
P3 --> P4[Phase 4<br/>train_model]
P4 -->|hot-reload| E[(Engine)]
classDef ph fill:#1c1c20,stroke:#3b82f6,color:#fff
class P1,P2,P3,P4 ph
| Phase | Script | Output |
|---|---|---|
| 1 | `corpus_analyser.py` | `rule_insights.json`: F1-optimised thresholds + discriminative bigrams per rule |
| 2 | `rule_synthesiser.py` | Patches `rule_config.json`; only applies changes that improve F1 by ≥ 0.02 |
| 3 | `parser_pattern_extractor.py` | `parser_hints.json`: corpus-learned field aliases + format-detection signals |
| 4 | `train_model.py` | `models/ml_detector.pkl`: retrained XGBoost bundle (`tfidf` + `clf` keys) |
Poll progress: GET /admin/retrain. When ADMIN_SECRET is set, all admin endpoints require Authorization: Bearer <ADMIN_SECRET>; leaving it unset disables auth.
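For scripted use, the polling step might look like the sketch below. It relies only on the `running` key visible in the status response; the shape of a finished response, the helper names `fetch_status` and `wait_for_retrain`, and the poll interval are assumptions for illustration.

```python
import json
import time
import urllib.request


def fetch_status(base_url, admin_secret):
    """GET /admin/retrain with the bearer token and decode the JSON body."""
    req = urllib.request.Request(
        f"{base_url}/admin/retrain",
        headers={"Authorization": f"Bearer {admin_secret}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def wait_for_retrain(get_status, interval_s=10.0, max_polls=120, sleep=time.sleep):
    """Poll until the status no longer reports running, or give up."""
    for _ in range(max_polls):
        status = get_status()
        if not status.get("running"):
            return status
        sleep(interval_s)
    raise TimeoutError("retrain still running after max_polls checks")
```

A call such as `wait_for_retrain(lambda: fetch_status("https://your-backend", secret))` blocks until the pipeline finishes. Passing the fetcher in as a callable keeps the loop testable without a live backend.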
Every NOCTRA rule follows the same three-step shape: filter → aggregate → emit. Here's R001 — "Credential brute force":
filter events where status == FAILED and source_ip is set
group by source_ip + 60-second sliding window
threshold ≥ 5 failed logins in the same window
emit ONE alert per (source_ip, window)
severity = HIGH
mitre_technique = T1110
evidence = list of the log indices that triggered it
Why this shape matters:
- Per-row alert loops (the anti-pattern: emit one alert per failed login) are how SOC tools generate floods. NOCTRA never iterates `for row in failed_logins:`; it always groups first.
- Sliding time windows rule out coincidence. 5 failed logins over 6 months is not brute force; 5 in 60 seconds is.
- Evidence indices let the UI jump straight to the raw log lines that produced the alert — no "trust me" black box.
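The filter → aggregate → emit shape of R001 can be sketched in plain Python. This is an illustration, not the code in `engine/rules.py`: the `ts` field (epoch seconds), the function name, and the choice to extend one alert while a burst continues are all assumptions; the 60-second window, the threshold of 5, the severity, and T1110 come from the rule description above.

```python
from collections import defaultdict, deque


def detect_brute_force(events, window_s=60, threshold=5):
    """events: time-sorted dicts with 'ts', 'source_ip', 'status' (illustrative names)."""
    windows = defaultdict(deque)  # source_ip -> (ts, row index) of recent failures
    open_alerts = {}              # source_ip -> alert for the burst in progress
    alerts = []

    for index, event in enumerate(events):
        ip = event.get("source_ip")
        if event.get("status") != "FAILED" or not ip:
            continue  # filter: failed logins with a known source only
        window = windows[ip]
        window.append((event["ts"], index))
        # Slide the window: drop failures older than window_s.
        while event["ts"] - window[0][0] > window_s:
            window.popleft()
        if len(window) < threshold:
            open_alerts.pop(ip, None)  # burst over; a later one gets a new alert
            continue
        alert = open_alerts.get(ip)
        if alert is None:
            alert = {
                "rule_id": "R001",
                "severity": "HIGH",
                "mitre_technique": "T1110",
                "source_ip": ip,
                "related_log_indices": [i for _, i in window],
            }
            open_alerts[ip] = alert
            alerts.append(alert)
        else:
            alert["related_log_indices"].append(index)
        alert["event_count"] = len(alert["related_log_indices"])
    return alerts


burst = [{"ts": t, "source_ip": "203.0.113.66", "status": "FAILED"} for t in range(0, 40, 5)]
slow = [{"ts": day * 86400, "source_ip": "198.51.100.7", "status": "FAILED"} for day in range(5)]
events = sorted(burst + slow, key=lambda e: e["ts"])
alerts = detect_brute_force(events)
```

The eight failures inside 35 seconds collapse into a single alert with `event_count` 8, while five failures spread over five days never fill a window and emit nothing.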
Want to write your own? Use the in-app Rule Builder or drop a YAML rule into the DSL — same filter/group/threshold model, no Python required.
Every alert returned by POST /ingest is a JSON object with this shape:
{
"alert_id": "a-7f3c12",
"rule_id": "R001",
"rule_name": "Credential Brute Force",
"severity": "HIGH",
"tp_probability": 0.92,
"description": "8 failed logins from 203.0.113.66 in a 60-second window — credential compromise: SUCCEEDED",
"timestamp": "2026-05-25T02:31:14Z",
"source_ip": "203.0.113.66",
"user": "jdoe",
"event_count": 8,
"mitre_technique": "T1110",
"mitre_tactic": "Credential Access",
"related_log_indices": [12, 13, 15, 17, 19, 21, 22, 24],
"extra": {
"window_seconds": 60,
"succeeded_after": true,
"rolled_up_count": 1
},
"ai_rationale": "Burst of failed logins followed by success from same IP is a classic brute-force pattern.",
"shap_features": [
{"feature": "failed_login_count", "contribution": 0.41},
{"feature": "success_after_failures", "contribution": 0.28},
{"feature": "source_ip_reputation", "contribution": 0.13}
]
}

| Field | What it tells the analyst |
|---|---|
| `tp_probability` | "How likely is this real?" Range 0–1, blended from heuristic + Gemini. |
| `event_count` | How many raw log events were folded into this one alert. |
| `related_log_indices` | The exact rows of the source log that triggered this rule; click in the UI to jump to them. |
| `mitre_technique` / `mitre_tactic` | What attacker behaviour this is, in industry-standard ATT&CK vocabulary. |
| `extra.rolled_up_count` | If > 1, this alert is the merge of N near-identical alerts (dedup stage). |
| `shap_features` | Top signals the AI used to score this alert. Removes "black box" doubt. |
| `ai_rationale` | One-sentence English explanation tailored to this specific alert. |
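A client consuming this response might reduce each alert to a one-line triage summary. The sketch below reads only fields from the schema above; the `summarize` helper and its output layout are hypothetical, and sorting by `tp_probability` is just one plausible queue order.

```python
def summarize(alert):
    """One-line triage summary built only from fields in the alert schema."""
    rolled = alert.get("extra", {}).get("rolled_up_count", 1)
    top = max(alert.get("shap_features", []), key=lambda f: f["contribution"], default=None)
    parts = [
        f"[{alert['severity']}] {alert['rule_id']} {alert['rule_name']}",
        f"tp={alert['tp_probability']:.0%}",
        f"events={alert['event_count']}",
    ]
    if rolled > 1:
        parts.append(f"rolled_up={rolled}")
    if top is not None:
        parts.append(f"top_signal={top['feature']}")
    return " | ".join(parts)


alert = {
    "rule_id": "R001",
    "rule_name": "Credential Brute Force",
    "severity": "HIGH",
    "tp_probability": 0.92,
    "event_count": 8,
    "extra": {"window_seconds": 60, "succeeded_after": True, "rolled_up_count": 1},
    "shap_features": [
        {"feature": "failed_login_count", "contribution": 0.41},
        {"feature": "success_after_failures", "contribution": 0.28},
    ],
}
queue = sorted([alert], key=lambda a: a["tp_probability"], reverse=True)
line = summarize(queue[0])
```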
| Family | Rule IDs | Examples | MITRE tactic |
|---|---|---|---|
| Credential & Identity | R001, R006, R007, R010, R013, R015, R016, R020, R033 | Brute force, off-hours login, new admin account, multi-service attack, LSASS dump, cleartext creds, account lockout storm, RDP brute, Kerberoasting | Credential Access |
| Privilege Escalation | R003 | Normal user → admin within window | Privilege Escalation |
| Lateral Movement & Recon | R002, R004, R008, R022 | Port scan, multi-host auth, web fuzzing 404 burst, impossible travel | Discovery, Lateral Movement |
| Exfiltration & C2 | R005, R014, R021, R026, R027 | Large outbound transfer, DNS tunneling, C2 beaconing, port-knocking, internal scan | Exfiltration, Command & Control |
| Web & App Attacks | R024, R025, R043 | SQL injection, web shell / recon UA, IDOR enumeration (sequential ID access) | Initial Access, Discovery |
| Endpoint & EDR | R011, R012, R017, R018, R019, R023, R031, R032 | Suspicious PowerShell, process injection, suspicious persistence, event log cleared, security tool tampering, ransomware file writes, masquerading, script drops EXE | Execution, Defense Evasion, Impact |
| Email & Phishing | R028, R029 | Suspicious email auth fail, phishing with risky attachment | Initial Access |
| Cloud Identity (AWS / Entra / M365) | R030, R034, R035, R036, R037, R038, R039, R040, R042 | Cloud admin grant, console root login, CloudTrail tampering, OAuth consent grant, AWS API without MFA, S3 anomalous volume, SharePoint mass download, cloud recon | Persistence, Defense Evasion, Collection |
| Geo & Behavioral Anomaly | R041 | Sign-in from unexpected country (configurable baseline via `rule_config.json`) | Initial Access |
| Behavioral (UEBA) | `UEBA-*` | IsolationForest per-user/IP σ-deviation from baseline | Multiple |
| ML Detector | `ML-*` | XGBoost model catches attacks that regex rules miss: 519-feature vector, ≥ 70% confidence threshold | Multiple |
| # | Where | What the AI does | Fallback if unavailable |
|---|---|---|---|
| 1 | Detect | IsolationForest UEBA model scores each user/IP for deviation from baseline | Deterministic threshold rules |
| 2 | ML Scan | XGBoost classifier (trained on 68k labeled records) catches attack patterns rule regexes miss — 519 features, ≥ 70% threshold | Rule engine covers most detections |
| 3 | Score | Gemini classifier returns a 0–1 TP probability + rationale per alert | 10-signal heuristic scorer |
| 4 | Triage | AI generates alert-specific TP/FP reasons + tailored response playbook | Static reason library |
| 5 | Investigate | Autonomous agent produces verdict recommendation, key findings, reasoning steps | Manual investigation tabs |
| 6 | Chain | LLM writes a plain-English kill-chain narrative | Structured chain summary |
| 7 | Self-Upgrade | 5-phase pipeline (corpus analyse → rule synthesise → parser extraction → retrain) auto-tunes thresholds and retrains XGBoost nightly | Engine runs on last good config |
The ML detector (backend/engine/ml_detector.py) is a second, independent detection pass that runs after all 43 deterministic rules. It catches attack patterns that regexes can't express.
| Attribute | Value |
|---|---|
| Total labeled records | 68,655 |
| Log formats covered | syslog, JSON, WAF, CSV, Zeek, EVTX, generic |
| Label distribution | Balanced attack / benign split |
| Training script | noctra_training_data/train_model.py |
| Model output | backend/models/ml_detector.pkl (tfidf + clf keys) |
| Group | Count | Description |
|---|---|---|
| TF-IDF text features | 500 | Top 500 n-grams from the raw log line (first 1000 chars) |
| Hand-crafted features | 12 | Line length, digit ratio, special-char ratio, IP count, has_error, has_privesc, has_exfil, has_injection, has_user, has_timestamp, uppercase ratio, space ratio |
| Format one-hots | 7 | syslog, json, waf, csv, zeek, evtx, generic |
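The 19 dense features (12 hand-crafted + 7 format one-hots) that sit alongside the 500 TF-IDF columns could be computed roughly as follows. The feature order follows the table; the keyword regexes behind the `has_*` flags and the fallback to `generic` for unknown formats are illustrative guesses, not the patterns in `ml_detector.py`.

```python
import re

FORMATS = ["syslog", "json", "waf", "csv", "zeek", "evtx", "generic"]
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

# Hypothetical keyword patterns; the real detector's lists are not shown in this README.
FLAGS = {
    "has_error": re.compile(r"error|fail|denied", re.I),
    "has_privesc": re.compile(r"sudo|privilege|admin", re.I),
    "has_exfil": re.compile(r"upload|exfil|outbound", re.I),
    "has_injection": re.compile(r"union\s+select|<script|\.\./", re.I),
    "has_user": re.compile(r"\buser(name)?\b", re.I),
    "has_timestamp": re.compile(r"\d{4}-\d{2}-\d{2}|\d{2}:\d{2}:\d{2}"),
}


def dense_features(line, fmt):
    """Return the 12 hand-crafted values followed by the 7 format one-hots."""
    n = max(len(line), 1)

    def ratio(pred):
        return sum(1 for ch in line if pred(ch)) / n

    hand_crafted = [
        float(len(line)),                                     # line length
        ratio(str.isdigit),                                   # digit ratio
        ratio(lambda c: not c.isalnum() and not c.isspace()), # special-char ratio
        float(len(IP_RE.findall(line))),                      # IP count
        *[float(bool(p.search(line))) for p in FLAGS.values()],
        ratio(str.isupper),                                   # uppercase ratio
        ratio(str.isspace),                                   # space ratio
    ]
    fmt = fmt if fmt in FORMATS else "generic"
    one_hot = [float(fmt == name) for name in FORMATS]
    return hand_crafted + one_hot


vec = dense_features("Failed password for user root from 203.0.113.66", "syslog")
```

Concatenating this vector with a 500-column TF-IDF row gives the 519 features the table describes.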
| Confidence | Severity | Meaning |
|---|---|---|
| ≥ 92% | `CRITICAL` | High-certainty attack pattern |
| ≥ 80% | `HIGH` | Strong attack signal |
| ≥ 70% | `MEDIUM` | Probable attack, warrants review |
| < 70% | (not fired) | Below threshold — suppressed |
ML alerts carry rule IDs of the form ML-Rxxx (e.g. ML-R001) and include ml_confidence and raw_snippet in alert.extra. They are emitted only for rows not already covered by a deterministic rule — so the ML layer adds signal without duplicating.
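The confidence bands and the "only rows not already covered" rule combine into a small emit step. This sketch assumes the model's scores arrive as `(row_index, raw_line, confidence)` tuples and truncates `raw_snippet` to 200 characters; both are assumptions, as are the function names. Assignment of the `ML-Rxxx` ID is left out because its numbering scheme is not described here.

```python
def ml_severity(confidence):
    """Map a 0-1 model confidence to the severity bands in the table."""
    if confidence >= 0.92:
        return "CRITICAL"
    if confidence >= 0.80:
        return "HIGH"
    if confidence >= 0.70:
        return "MEDIUM"
    return None  # below threshold: suppressed


def emit_ml_alerts(scored_rows, covered_indices):
    """Emit ML alerts only for rows no deterministic rule already flagged."""
    alerts = []
    for index, raw_line, confidence in scored_rows:
        severity = ml_severity(confidence)
        if severity is None or index in covered_indices:
            continue
        alerts.append({
            "severity": severity,
            "related_log_indices": [index],
            "extra": {"ml_confidence": confidence, "raw_snippet": raw_line[:200]},
        })
    return alerts


scored = [(0, "a", 0.95), (1, "b", 0.85), (2, "c", 0.71), (3, "d", 0.69), (4, "e", 0.99)]
ml_alerts = emit_ml_alerts(scored, covered_indices={4})
```

Row 3 falls below 70% and row 4 is already covered by a rule, so only three alerts come out.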
The ML detector infers tactic/technique from the raw line using priority-ordered regex signals (credential failure → injection → privilege escalation → block/deny action → cloud events → exfiltration → PowerShell → discovery). Default fallback: Command and Control / T1071.
POST /admin/retrain
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ retrain_orchestrator.py │
│ │
│ Phase 1 → corpus_analyser.py │
│ • Reads 68k records from normalized/training_corpus.ndjson │
│ • Grid-searches threshold params (min_failures, min_ports…) │
│ to maximise per-rule F1 │
│ • Mines discriminative bigrams per rule (lift ≥ 30.0) │
│ • Outputs: rule_insights.json │
│ │
│ Phase 2 → rule_synthesiser.py │
│ • Reads rule_insights.json │
│ • Only applies threshold changes where ΔF1 ≥ 0.02 │
│ • Guards against generic words as IoC patterns │
│ • Patches rule_config.json + writes synthesis_report.json │
│ │
│ Phase 3 → parser_pattern_extractor.py │
│ • Mines field aliases per format (logfmt, json, csv…) │
│ • Generates format-detection signals (≥ 85% format purity) │
│ • Outputs: backend/engine/parser_hints.json │
│ │
│ Phase 4 → train_model.py │
│ • Rebuilds TF-IDF + XGBoost pipeline on full corpus │
│ • Saves backend/models/ml_detector.pkl │
│ │
│ Hot-reload → engine picks up new config + model on next call │
└─────────────────────────────────────────────────────────────────┘
Safety guards:
- Minimum F1 improvement gate (`MIN_F1_IMPROVEMENT = 0.02`): no regression from a noisy corpus
- Generic word blocklist prevents common tokens ("failed", "password", "scan", "type") from being injected as IoC patterns
- Minimum lift threshold (`MIN_LIFT_PATTERN = 30.0`): only patterns 30× more likely in attacks than benign are added
- Concurrent retrain rejected; status polled via `GET /admin/retrain`
- Each script has a 600-second timeout to prevent a hung pipeline
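The first three guards amount to a few comparisons. The sketch below reuses the two constants and the four blocklisted tokens named above; the function names, the rounding used to sidestep float noise, and the choice to reject a pattern only when every word in it is generic are assumptions.

```python
MIN_F1_IMPROVEMENT = 0.02
MIN_LIFT_PATTERN = 30.0
GENERIC_WORDS = {"failed", "password", "scan", "type"}  # tokens named in the list above


def accept_threshold(current_f1, candidate_f1):
    """Apply a new threshold only if F1 improves by at least the gate."""
    # Rounding avoids rejecting e.g. 0.80 -> 0.82 because of float error.
    return round(candidate_f1 - current_f1, 6) >= MIN_F1_IMPROVEMENT


def lift(attack_hits, attack_total, benign_hits, benign_total):
    """How many times more frequent a pattern is in attack lines than benign lines."""
    attack_rate = attack_hits / attack_total
    benign_rate = benign_hits / benign_total
    return float("inf") if benign_rate == 0 else attack_rate / benign_rate


def accept_pattern(pattern, pattern_lift):
    """Reject low-lift patterns and patterns made up solely of generic words."""
    words = pattern.lower().split()
    if not words or all(word in GENERIC_WORDS for word in words):
        return False
    return pattern_lift >= MIN_LIFT_PATTERN
```

For example, a bigram seen in 60 of 100 attack lines and 1 of 100 benign lines has a lift of about 60 and passes, while "failed password" is rejected regardless of its lift.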
Monitoring:
# Trigger a retrain
curl -X POST https://your-backend/admin/retrain \
-H "Authorization: Bearer $ADMIN_SECRET"
# Poll progress
curl https://your-backend/admin/retrain \
-H "Authorization: Bearer $ADMIN_SECRET"
# → {"running": true, "phase": "corpus_analyser", "progress_pct": 25, ...}

Every alert receives a 0–100 TP probability.
| Signal | Weight |
|---|---|
| `Severity = CRITICAL` | +25 |
| `Severity = HIGH` | +15 |
| Deterministic rule match | +10 |
| UEBA baseline deviation (>2σ) | +18 |
| Cross-event correlation hit | +12 |
| ≥ 2 MITRE techniques chained | +15 |
| Single MITRE technique mapped | +5 |
| IsolationForest anomaly > 0.6 | +10 |
| ≥ 5 correlated events on the same alert | +8 |
These are summed, clamped to 0–100, then blended with the Gemini classifier (70% AI / 30% heuristic when available).
| Score | Tier |
|---|---|
| ≥ 75% | HIGH CONFIDENCE TP |
| 45–74% | LIKELY TP |
| < 45% | LOW CONFIDENCE |
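Put together, the calculation is a weighted sum, a clamp, a 70/30 blend, and a tier lookup. The weights and cut-offs below come from the tables; the input field names other than `severity` and `event_count` are hypothetical, and treating the two MITRE signals as mutually exclusive is an assumption.

```python
def heuristic_score(alert):
    """Sum the weighted signals and clamp to 0-100."""
    score = 0
    if alert.get("severity") == "CRITICAL":
        score += 25
    elif alert.get("severity") == "HIGH":
        score += 15
    if alert.get("rule_match"):
        score += 10
    if alert.get("ueba_sigma", 0) > 2:
        score += 18
    if alert.get("correlation_hit"):
        score += 12
    techniques = alert.get("technique_count", 0)
    if techniques >= 2:
        score += 15
    elif techniques == 1:
        score += 5
    if alert.get("iforest_score", 0) > 0.6:
        score += 10
    if alert.get("event_count", 0) >= 5:
        score += 8
    return max(0, min(100, score))


def blend(heuristic, gemini_probability=None):
    """70% AI / 30% heuristic when Gemini answered, heuristic alone otherwise."""
    if gemini_probability is None:
        return float(heuristic)
    return 0.7 * (gemini_probability * 100) + 0.3 * heuristic


def tier(score):
    if score >= 75:
        return "HIGH CONFIDENCE TP"
    if score >= 45:
        return "LIKELY TP"
    return "LOW CONFIDENCE"


brute_force = {"severity": "HIGH", "rule_match": True, "technique_count": 1, "event_count": 8}
base = heuristic_score(brute_force)  # 15 + 10 + 5 + 8
final = blend(base, gemini_probability=0.95)
```

In this example the heuristic alone gives 38 (LOW CONFIDENCE); a Gemini probability of 0.95 lifts the blended score to roughly 77.9, which lands in HIGH CONFIDENCE TP.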
The #1 reason SOC analysts ignore their tools is alert fatigue — when one logical attack produces 100 alerts and the real signal drowns in repetition. NOCTRA fights this in four layers:
Every rule groups its matching events by attacker context (source_ip, user, device, sender) and emits one alert per group, not one per row. A ransomware run that drops 200 files = 1 alert with event_count: 200 and a sample of filenames in extra.
Volume-based rules (R001 brute force, R002 port scan, R008 fuzzing) require the threshold be hit inside a narrow window (60s, 30s, 5min). 20 HTTP 404s spread across a week is normal browsing noise; 20 in five minutes is fuzzing. This single check kills most "log file spans 7 days" false positives.
After all rules run, the ingest pipeline does one final sweep. Any alerts sharing (rule_id, source_ip, user, dest_ip) get merged into the earliest one — keeping the higher severity, the higher confidence, and recording rolled_up_count so the UI can show "5 duplicates suppressed". This catches anything the per-rule aggregation missed and prevents repeated uploads of the same activity from compounding.
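A minimal version of that sweep, assuming alerts are dicts shaped like the schema shown earlier and that timestamps share one UTC ISO-8601 format (so string comparison orders them correctly). The `LOW` severity level and the function name are assumptions.

```python
SEVERITY_RANK = {"LOW": 0, "MEDIUM": 1, "HIGH": 2, "CRITICAL": 3}


def dedup(alerts):
    """Merge alerts that share (rule_id, source_ip, user, dest_ip)."""
    merged = {}
    for alert in alerts:
        key = (alert["rule_id"], alert.get("source_ip"), alert.get("user"), alert.get("dest_ip"))
        kept = merged.get(key)
        if kept is None:
            kept = dict(alert)
            kept["extra"] = dict(alert.get("extra", {}))
            kept["extra"].setdefault("rolled_up_count", 1)
            merged[key] = kept
            continue
        # Fold the duplicate into the kept alert.
        kept["event_count"] += alert["event_count"]
        kept["timestamp"] = min(kept["timestamp"], alert["timestamp"])  # earliest
        if SEVERITY_RANK[alert["severity"]] > SEVERITY_RANK[kept["severity"]]:
            kept["severity"] = alert["severity"]
        kept["tp_probability"] = max(kept["tp_probability"], alert["tp_probability"])
        kept["extra"]["rolled_up_count"] += alert.get("extra", {}).get("rolled_up_count", 1)
    return list(merged.values())


raw_alerts = [
    {"rule_id": "R001", "source_ip": "203.0.113.66", "user": "jdoe", "dest_ip": None,
     "severity": "MEDIUM", "tp_probability": 0.60, "event_count": 5, "timestamp": "2026-05-25T02:35:00Z"},
    {"rule_id": "R001", "source_ip": "203.0.113.66", "user": "jdoe", "dest_ip": None,
     "severity": "HIGH", "tp_probability": 0.92, "event_count": 8, "timestamp": "2026-05-25T02:31:14Z"},
    {"rule_id": "R002", "source_ip": "203.0.113.66", "user": None, "dest_ip": "10.0.0.5",
     "severity": "MEDIUM", "tp_probability": 0.55, "event_count": 40, "timestamp": "2026-05-25T02:20:00Z"},
]
deduped = dedup(raw_alerts)
```

The two R001 alerts merge into one with `event_count` 13 and `rolled_up_count` 2; the R002 alert has a different key and passes through unchanged.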
The other half of "too many alerts" is "wrong alerts because fields were misparsed". NOCTRA's parser:
- Re-runs the status heuristic when the column is present-but-empty (a common CSV quirk where `keep_default_na=False` makes empty cells look populated).
- Carries 95+ field aliases across the canonical names (40 built-in + 55 corpus-learned from `parser_hints.json`): `sourceIPAddress`, `source_ip`, `srcip`, `ClientIp`, `remote_addr`, `caller_ip_address`, `initiatedBy.user.ipAddress`, `hostname`, `destination` and many more all collapse to their canonical counterparts.
- Flattens nested JSON so Suricata `alert.signature.id` and AWS `userIdentity.arn` end up as flat columns rules can read.
- Normalises every empty/`"none"`/`"null"` string to Python `None` so `.notna()` checks behave consistently across cloud schemas.
Net effect: a real attack fires the expected handful of distinct alerts. A clean log fires nothing. Repeated uploads don't multiply.
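Three of the parser behaviours listed above (alias collapse, JSON flattening, empty-string normalisation) fit in a short sketch. The alias table here holds only a handful of the `source_ip` spellings named above, the underscore separator follows the `alert_signature_id` example from the Normalize stage, and the function names are hypothetical.

```python
# A small subset of aliases; the real table is far larger and partly corpus-learned.
ALIASES = {
    "source_ip": ["sourceIPAddress", "source_ip", "srcip", "ClientIp", "remote_addr", "caller_ip_address"],
}
ALIAS_TO_CANONICAL = {
    alias.lower(): canonical for canonical, names in ALIASES.items() for alias in names
}
EMPTY_STRINGS = {"", "none", "null"}


def flatten(obj, prefix=""):
    """Turn nested dicts into flat keys joined by underscores."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def normalize(record):
    """Flatten, map aliases to canonical names, and turn empty markers into None."""
    out = {}
    for key, value in flatten(record).items():
        if isinstance(value, str) and value.strip().lower() in EMPTY_STRINGS:
            value = None
        name = ALIAS_TO_CANONICAL.get(key.lower(), key)
        if out.get(name) is None:  # never overwrite a real value with an empty one
            out[name] = value
    return out


row = normalize({
    "sourceIPAddress": "203.0.113.66",
    "userIdentity": {"arn": "arn:aws:iam::123456789012:user/jdoe"},
    "status": "null",
})
```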
sequenceDiagram
autonumber
actor A as Analyst
participant UI as Browser (NOCTRA UI)
participant API as FastAPI Backend
participant AI as Gemini AI
A->>UI: Drop log file on Upload page
UI->>API: POST /ingest
API-->>UI: Session ready — ranked alerts
A->>UI: Open Triage queue
A->>UI: Click alert → drawer opens
UI->>API: GET /verdict-assist
A->>UI: Confirm TP / Dismiss FP
A->>UI: Click "Run AI Agent"
UI->>API: POST /agent-investigate
API->>AI: Multi-step reasoning
AI-->>API: Verdict + findings
A->>UI: Export Report
UI-->>A: PDF incident dossier
flowchart TB
subgraph Browser["Browser (Vite + React 18)"]
L[Landing] & U[Upload] & T[Triage] & I[Investigation] & H[Hunt] & Rb[Rule Builder] & D[Dashboard]
end
subgraph Backend["FastAPI Backend (Python 3.11)"]
R[Routers] & E[Detection Engine] & S[Session Store] & AIS[AI Service] & TIS[Threat Intel]
end
subgraph External["External APIs"]
G[Google Gemini] & AB[AbuseIPDB] & VT[VirusTotal]
end
Browser <-->|REST / JSON| R
R --> E & S & AIS & TIS
AIS --> G
TIS --> AB & VT
| Layer | Platform | URL |
|---|---|---|
| Frontend | Vercel | noctra-ai-autonomous-soc-platform.vercel.app |
| Backend | Render | https://noctra-ai-autonomous-soc-platform.onrender.com |
| Setting | Value |
|---|---|
| Root Directory | frontend |
| Build Command | npm run build |
| Output Directory | dist |
| Install Command | npm install |
Environment variables (Vercel):
| Key | Value |
|---|---|
| `VITE_API_URL` | Your Render backend URL |
| Setting | Value |
|---|---|
| Root Directory | backend |
| Runtime | Python 3 |
| Build Command | pip install -r requirements.txt |
| Start Command | uvicorn main:app --host 0.0.0.0 --port $PORT |
Environment variables (Render):
| Key | Description |
|---|---|
| `GEMINI_API_KEY` | Google AI Studio key |
| `ABUSEIPDB_API_KEY` | AbuseIPDB key |
| `VIRUSTOTAL_API_KEY` | VirusTotal key |
| `CORS_ORIGIN` | Your Vercel frontend URL |
| `SESSION_TTL_MINUTES` | 30 |
| `MAX_UPLOAD_MB` | 25 |
| `ADMIN_SECRET` | Bearer token for POST /admin/retrain (optional; leave unset to disable auth) |
| `RETRAIN_SCHEDULE_HOUR_UTC` | UTC hour for nightly retrain (default 3) |
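On the backend side, these variables might be read along the following lines. This is a sketch only: `load_settings` is a hypothetical helper, and treating 30 and 25 as fallback defaults is an assumption based on the values listed in the table.

```python
import os


def load_settings(env=None):
    """Read backend settings from the environment, with the table's values as defaults."""
    env = os.environ if env is None else env
    return {
        "session_ttl_minutes": int(env.get("SESSION_TTL_MINUTES", "30")),
        "max_upload_mb": int(env.get("MAX_UPLOAD_MB", "25")),
        "retrain_hour_utc": int(env.get("RETRAIN_SCHEDULE_HOUR_UTC", "3")),
        # An unset ADMIN_SECRET disables auth on the admin endpoints.
        "admin_auth_enabled": bool(env.get("ADMIN_SECRET")),
        # Without a Gemini key the heuristic scorer is the only scoring path.
        "gemini_enabled": bool(env.get("GEMINI_API_KEY")),
    }


defaults = load_settings({})
custom = load_settings({"SESSION_TTL_MINUTES": "10", "ADMIN_SECRET": "s3cret"})
```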
# Copy and fill in your API keys
cp backend/.env.example backend/.env
# Start both services
docker compose up --build

Frontend: http://localhost:3000 · Backend: http://localhost:8000
See SETUP.txt for full manual setup instructions.
# Backend
cd backend
python -m venv venv && source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env
uvicorn main:app --reload --port 8000
# Frontend (new terminal)
cd frontend
npm install
npm run dev

Open http://localhost:5173.
cp .env.example .env.prod
# Fill in .env.prod with real API keys and URLs
docker compose --env-file .env.prod -f docker-compose.prod.yml up -d

| Term | Meaning |
|---|---|
| Alert | The platform flagging "this looks suspicious." A grouped event, not a single log line. |
| TP / FP | True Positive (real attack) / False Positive (noise). |
| Triage | Quickly sorting alerts into TP vs FP. |
| MITRE ATT&CK | Industry catalogue of attacker techniques. Every NOCTRA rule maps to one. |
| Technique vs Tactic | A tactic is the attacker's goal ("Credential Access"); a technique is how they do it ("T1110 – Brute Force"). |
| UEBA | User & Entity Behavior Analytics — flags deviations from baseline using IsolationForest. |
| Attack chain | A sequence of related alerts that together describe one attack story (e.g. brute-force → escalation → exfil). |
| Kill chain | Conceptual model of an attack's stages: recon → weaponise → deliver → exploit → install → C2 → actions on objectives. |
| IOC | Indicator of Compromise — an IP, domain, hash, or user seen in an attack. |
| SHAP | Technique that explains which features most affected an ML model's score. |
| XGBoost | Gradient-boosted tree ensemble used by the ML detector. 68k training records, 519 features (500 TF-IDF + 12 hand-crafted + 7 format one-hots), ≥70% confidence threshold. |
| TF-IDF | Term Frequency–Inverse Document Frequency — converts raw log text into a numeric vector. Top 500 n-grams form 96% of the ML feature vector. |
| Self-upgrade pipeline | 5-phase background job (corpus_analyser → rule_synthesiser → parser_pattern_extractor → train_model) that tunes detection automatically from labeled log data. Runs nightly or on demand via POST /admin/retrain. |
| L1 / L2 | Tier-1 (triage & respond) / Tier-2 (hunt & correlate). |
| Sliding window | A time range that moves with the events — "5 failed logins in any 60-second span" rather than "in the last fixed minute". |
| Aggregation | Collapsing many matching events into one alert with a count, instead of one alert per event. |
| Dedup / collapse | Pipeline-wide pass that merges alerts sharing rule + actor + target. Stops floods. |
| Evidence | The exact log-row indices that triggered a rule — lets the analyst verify, not just trust. |
| Field alias | Many log sources call the same thing different names (source_ip vs sourceIPAddress vs client_ip). Aliases collapse them to one canonical name. |
| Storageless | Nothing persists to disk. Session lives only in server RAM and is wiped after 30 min idle. |
Q. Does NOCTRA replace Splunk / Sentinel?
No. NOCTRA is for one log file per session — incident response, learning, demos, post-breach triage. Use a full SIEM for continuous enterprise monitoring.
Q. Does the AI send my raw logs to Google?
No. Only the alert envelope (rule name, MITRE tag, timestamps) is sent to Gemini. Raw log lines stay in your backend RAM.
Q. What if Gemini is down or I have no API key?
Everything still works. The platform falls back to a 10-signal deterministic scorer.
Q. How is "storageless" enforced?
Sessions live in a Python dict in process memory. A janitor task evicts them after 30 minutes of inactivity. No DB, no disk write.
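In outline, such a store is a dict plus an idle-time check. The class below is a simplified sketch, not the backend's session module: the class and method names are hypothetical, and the injectable clock exists only to make the eviction logic easy to test.

```python
import time


class SessionStore:
    """In-memory sessions that expire after a period of inactivity."""

    def __init__(self, ttl_seconds=30 * 60, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._sessions = {}  # session_id -> [last_access, data]; nothing touches disk

    def put(self, session_id, data):
        self._sessions[session_id] = [self._clock(), data]

    def get(self, session_id):
        entry = self._sessions.get(session_id)
        if entry is None or self._clock() - entry[0] > self._ttl:
            self._sessions.pop(session_id, None)
            return None
        entry[0] = self._clock()  # any access counts as activity
        return entry[1]

    def evict_expired(self):
        """Janitor pass: drop idle sessions and report how many were removed."""
        now = self._clock()
        stale = [sid for sid, (seen, _) in self._sessions.items() if now - seen > self._ttl]
        for sid in stale:
            del self._sessions[sid]
        return len(stale)
```

A background task would call `evict_expired()` on a timer; because the data only ever lives in the dict, a process restart also clears every session.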
Q. Can I add my own rules?
Yes — the Rule Builder ships with four templates. Compose multi-condition filters, assign severity, map a MITRE technique, and test-fire against the active session.
Q. I uploaded the same log twice and got the same alerts twice. Is that a bug?
No — each upload creates an independent session. Within a single session, NOCTRA dedups aggressively (Layer 3 above). Across sessions, history is intentionally isolated so demos and investigations don't bleed into each other.
Q. A rule didn't fire on a log I expected to trigger it. What do I check?
Three things, in order: (1) Did the parser map your column names correctly? Open the session detail page — if source_ip shows empty rows it means your log used a name not yet aliased. (2) Did the rule's threshold/window actually match? Volume rules need the burst inside their window. (3) Did the dedup pass collapse it into another alert? Look for extra.rolled_up_count > 1 on a neighbouring alert.
Q. Why "43 rules"? Will there be more?
43 is the current coverage across the MITRE ATT&CK matrix from credential access through cloud persistence, EDR detections, and IDOR enumeration (R001–R043). The ML self-upgrade pipeline (POST /admin/retrain) can synthesise new rule candidates from corpus data. Adding a rule manually is a single function in engine/rules.py.
Q. How does NOCTRA tell aggregation from suppression?
Aggregation happens inside a rule (group rows that match one rule together). Dedup happens across rules at the pipeline end (merge alerts that point at the same actor+target). Both preserve event_count so nothing is "lost" — only the per-row noise is.
Q. What log formats actually work today?
CSV / TSV (any delimiter, mixed case headers OK), JSON / JSONL / NDJSON (nested objects auto-flattened), Apache combined / common, syslog (RFC 3164 + 5424), Windows Event Log text export, logfmt key=value, generic free-text (line per event). Cloud-specific: AWS CloudTrail JSON, Entra Sign-In + Audit logs, M365 Unified Audit, Defender for Endpoint exports, Suricata EVE JSON.
NOCTRA AI · Autonomous SOC · v4.0 · 43 rules · XGBoost ML detector · Self-upgrading engine · Auto-dedup · Storageless by design · MIT License