security-kg

CI Dataset Update HuggingFace Python 3.12+ License Visualizer

Convert security data from 24 sources into Subject-Predicate-Object (SPO) knowledge-graph triples in Parquet format.

Sources: ATT&CK · CAPEC · CWE · CVE · CPE · D3FEND · ATLAS · CAR · ENGAGE · F3 · EPSS · KEV · Vulnrichment · GHSA · Sigma · ExploitDB · MISP Galaxies · LOLBAS · LOLDrivers · Atomic Red Team · NIST 800-53 · Nuclei · EUVD · OSV

Knowledge Graph Structure

---
config:
  layout: dagre
  theme: neutral
---
graph LR
    %% ATT&CK core
    C[Campaign]:::attack -->|attributed-to| G[Group]:::attack
    C -->|uses| T[Technique]:::attack
    G -->|uses| T
    G -->|uses| SW[Malware / Tool]:::attack
    SW -->|uses| T
    ST[Sub-technique]:::attack -->|subtechnique-of| T
    T -->|belongs-to-tactic| TAC[Tactic]:::attack
    MIT[Mitigation]:::attack -->|mitigates| T
    DC[DataComponent]:::attack -->|detects| T

    %% Defense & detection
    DT[DefensiveTechnique]:::d3fend -->|counters| T
    AN[Analytic]:::car -->|detects-technique| T
    AN -->|maps-to-d3fend| DT
    EA[EngagementActivity]:::engage -->|engages-technique| T
    AT[ATLAS Technique]:::atlas -->|related-attack-technique| T

    %% Red team, binary abuse & controls
    ART2[AtomicTest]:::atomic -->|tests-technique| T
    LB[LOLBinary]:::lolbas -->|maps-to-technique| T
    LD[LOLDriver]:::loldrivers -->|maps-to-technique| T
    SC[SecurityControl]:::nist -->|mitigates-technique| T

    %% Threat intel
    TA[ThreatActor]:::misp -->|related-attack-id| T
    TA -->|targets-country| CTR[Country]:::misp
    TA -->|targets-sector| SEC[Sector]:::misp

    %% CAPEC ↔ CWE bridge
    AP[Attack Pattern]:::capec -->|maps-to-technique| T
    AP <-->|related| W[Weakness]:::cwe

    %% Vulnerability chain
    V[Vulnerability]:::cve -->|related-weakness| W
    V -->|affects-cpe| P[Platform]:::cpe
    OV[OSVulnerability]:::osv -->|related-cve| V
    OV -->|affects-package| PKG[Package]:::osv
    V -.->|epss-score| ES((EPSS)):::epss
    V -.->|kev| KE((KEV)):::kev
    NT[NucleiTemplate]:::nuclei -->|related-cve| V
    EU[EUVulnerability]:::euvd -->|related-cve| V

    %% Standalone
    FT[F3 Technique]:::f3 -->|belongs-to-tactic| FTAC[F3 Tactic]:::f3

    classDef attack fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
    classDef capec fill:#fef3c7,stroke:#f59e0b,color:#78350f
    classDef cwe fill:#fce7f3,stroke:#ec4899,color:#831843
    classDef cve fill:#fee2e2,stroke:#ef4444,color:#7f1d1d
    classDef cpe fill:#e0e7ff,stroke:#6366f1,color:#312e81
    classDef d3fend fill:#d1fae5,stroke:#10b981,color:#064e3b
    classDef car fill:#fef9c3,stroke:#eab308,color:#713f12
    classDef engage fill:#ede9fe,stroke:#8b5cf6,color:#4c1d95
    classDef f3 fill:#fbcfe8,stroke:#ec4899,color:#831843
    classDef atlas fill:#cffafe,stroke:#06b6d4,color:#164e63
    classDef epss fill:#f3f4f6,stroke:#6b7280,color:#374151
    classDef kev fill:#f3f4f6,stroke:#6b7280,color:#374151
    classDef misp fill:#fdf2f8,stroke:#db2777,color:#831843
    classDef atomic fill:#fff7ed,stroke:#f97316,color:#7c2d12
    classDef lolbas fill:#fef2f2,stroke:#dc2626,color:#7f1d1d
    classDef loldrivers fill:#fef2f2,stroke:#b91c1c,color:#7f1d1d
    classDef nist fill:#f0fdf4,stroke:#16a34a,color:#14532d
    classDef nuclei fill:#eff6ff,stroke:#2563eb,color:#1e3a5f
    classDef euvd fill:#fdf4ff,stroke:#a855f7,color:#581c87
    classDef osv fill:#f0fdfa,stroke:#14b8a6,color:#134e4a

Legend: Blue = ATT&CK · Amber = CAPEC · Pink = CWE / F3 · Red = CVE · Indigo = CPE · Green = D3FEND / NIST · Cyan = ATLAS · Yellow = CAR · Violet = ENGAGE · Fuchsia = MISP Galaxies · Gray = EPSS / KEV · Orange = Atomic Red Team · Scarlet = LOLBAS / LOLDrivers · Royal Blue = Nuclei · Purple = EUVD · Teal = OSV
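
As a rough illustration of how the diagram maps onto SPO triples, the sketch below stores a few edges as (subject, predicate, object) tuples and looks up neighbors. The predicate names come from the diagram's edge labels. The tuple layout, the `example-*` identifiers, and the helper names are assumptions for illustration, not the project's actual schema.

```python
from collections import defaultdict

# Predicates mirror the diagram's edge labels. The "example-*" subjects are
# hypothetical placeholders, not real ATT&CK entries.
TRIPLES = [
    ("example-group", "uses", "T1059"),
    ("example-tool", "uses", "T1059"),
    ("T1059.001", "subtechnique-of", "T1059"),
    ("T1059", "belongs-to-tactic", "TA0002"),
    ("example-mitigation", "mitigates", "T1059"),
]


def index_by_object(triples):
    """Group (subject, predicate) pairs by the object they point at."""
    incoming = defaultdict(list)
    for subject, predicate, obj in triples:
        incoming[obj].append((subject, predicate))
    return incoming


def subjects_with(triples, predicate, obj):
    """Return subjects linked to obj via predicate, in input order."""
    return [s for s, p, o in triples if p == predicate and o == obj]


if __name__ == "__main__":
    # Everything that "uses" the technique node.
    print(subjects_with(TRIPLES, "uses", "T1059"))
```

Because every edge is a flat row, the same lookup works regardless of which source produced the triple.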

Usage

pip install -r requirements.txt

# Convert all 24 sources → output/*.parquet + combined.parquet
python src/convert.py

# Convert specific sources in parallel
python src/convert.py --sources cve epss kev --parallel --workers 8

All options

| Option | Description |
| --- | --- |
| --sources <src ...> | Sources to convert (default: all). Values: attack capec cwe cve cpe d3fend atlas car engage f3 epss kev vulnrichment ghsa sigma exploitdb misp_galaxy lolbas loldrivers atomic nist_800_53 nuclei euvd osv |
| --domains <dom ...> | ATT&CK domains: enterprise, mobile, ics (default: all) |
| --output-dir <dir> | Output directory (default: output/) |
| --cache-dir <dir> | Source file cache (default: source/) |
| --parquet-format v1\|v2 | v2 = Parquet 2.6 + snappy (default), v1 = 1.0 + gzip |
| --no-combined | Skip combined.parquet generation |
| --parallel | Run conversions in parallel |
| --workers <n> | Parallel workers (default: 4) |
| --force-download | Re-download source data even if cached version is up-to-date |
| --force-convert | Re-convert even if source data hasn't changed |
| --limit <n> | Limit each source to N triples (quick local testing) |
| --update-readme | Update hf_dataset/README.md with triple counts |
| --no-stats | Skip dashboard stats JSON generation |
| --log-dir <dir> | Log file directory (default: logs/) |
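
To make the interaction of --parallel, --workers, and --limit concrete, here is a minimal sketch of fanning conversions out over a worker pool. It is not the project's implementation: `convert_source` is a hypothetical stand-in that fabricates placeholder triples, and whether the real converter uses threads or processes is not stated in this README.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def convert_source(name, limit=None):
    """Hypothetical converter: returns placeholder triples for one source."""
    triples = [(f"{name}:{i}", "example-predicate", name) for i in range(5)]
    # Mirror the idea behind --limit: cap the triples kept per source.
    return triples[:limit] if limit is not None else triples


def convert_all(sources, workers=4, limit=None):
    """Run one conversion per source on a pool of `workers` threads."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(convert_source, s, limit): s for s in sources}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results


if __name__ == "__main__":
    out = convert_all(["cve", "epss", "kev"], workers=8, limit=2)
    print({name: len(triples) for name, triples in out.items()})
```

Each source is an independent task, so raising the worker count only helps while there are more sources than workers.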

Individual converters also run standalone: python src/convert_attack.py, python src/convert_cve.py, etc.

Source files are cached in source/ by default. Files are versioned using Last-Modified or ETag headers and only re-downloaded when the source has been updated.
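
The header-based versioning described above is commonly done with HTTP conditional requests. The sketch below shows that general pattern under stated assumptions: the `etag` and `last_modified` metadata keys and all function names are hypothetical, and the project's own cache code may work differently.

```python
import urllib.request


def conditional_headers(cached_meta):
    """Build validator headers from metadata saved alongside a cached file."""
    headers = {}
    if cached_meta.get("etag"):
        headers["If-None-Match"] = cached_meta["etag"]
    if cached_meta.get("last_modified"):
        headers["If-Modified-Since"] = cached_meta["last_modified"]
    return headers


def build_request(url, cached_meta):
    """Return a GET request that lets the server reply 304 Not Modified."""
    return urllib.request.Request(url, headers=conditional_headers(cached_meta))


def needs_download(status_code):
    """A 304 response means the cached copy is still current."""
    return status_code != 304


if __name__ == "__main__":
    meta = {"etag": '"abc123"', "last_modified": "Wed, 01 Jan 2025 00:00:00 GMT"}
    request = build_request("https://example.invalid/feed.json", meta)
    print(request.header_items())
```

With no cached metadata the helper returns no validators, so the request degrades to a plain download, which is roughly the behavior --force-download asks for.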

Output goes to output/:

| File | Source | Est. Triples |
| --- | --- | --- |
| enterprise.parquet | ATT&CK Enterprise | ~40-50K |
| mobile.parquet | ATT&CK Mobile | ~5-7K |
| ics.parquet | ATT&CK ICS | ~4-5K |
| attack-all.parquet | ATT&CK combined (deduplicated) | ~50-60K |
| capec.parquet | CAPEC attack patterns | ~8-10K |
| cwe.parquet | CWE weaknesses | ~14-16K |
| cve.parquet | CVE vulnerabilities | ~3-4M |
| cpe.parquet | CPE platform enumeration | ~10-15M |
| d3fend.parquet | D3FEND defensive techniques | ~8-10K |
| atlas.parquet | ATLAS AI/ML techniques | ~1-2K |
| car.parquet | CAR analytics | ~1-2K |
| engage.parquet | ENGAGE adversary engagement | ~1-2K |
| f3.parquet | F3 fraud techniques & tactics | ~1-2K |
| epss.parquet | EPSS exploit prediction scores | ~600-700K |
| kev.parquet | KEV known exploited vulns | ~15-20K |
| vulnrichment.parquet | CISA Vulnrichment (SSVC, CVSS, CWE) | ~500K-1M |
| ghsa.parquet | GitHub Security Advisories | ~300-400K |
| sigma.parquet | Sigma detection rules | ~30-40K |
| exploitdb.parquet | ExploitDB public exploits | ~300-400K |
| misp_galaxy.parquet | MISP Galaxy clusters | ~100-200K |
| lolbas.parquet | LOLBAS living-off-the-land binaries | ~5-10K |
| loldrivers.parquet | LOLDrivers vulnerable drivers | ~10-15K |
| atomic.parquet | Atomic Red Team test procedures | ~15-20K |
| nist_800_53.parquet | NIST 800-53 ATT&CK mappings | ~5-10K |
| nuclei.parquet | Nuclei detection templates | ~30-40K |
| euvd.parquet | EUVD EU vulnerability database | ~10-20K |
| osv.parquet | OSV open-source vulnerabilities | ~500K-1M+ |
| combined.parquet | All sources merged (deduplicated) | ~16-22M |
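
The "merged (deduplicated)" files can be pictured as a concatenation that keeps only the first copy of each identical triple. The following is a minimal in-memory sketch of that idea. The function name and sample rows are hypothetical, and the real pipeline presumably operates on Parquet data rather than Python lists.

```python
def merge_deduplicated(*sources):
    """Concatenate triple lists, keeping the first copy of each triple."""
    seen = set()
    merged = []
    for triples in sources:
        for triple in triples:
            if triple not in seen:
                seen.add(triple)
                merged.append(triple)
    return merged


if __name__ == "__main__":
    # Placeholder rows: the same triple appears in two per-domain outputs.
    enterprise = [("tactic-x", "example-predicate", "value-1"),
                  ("technique-a", "belongs-to-tactic", "tactic-x")]
    mobile = [("tactic-x", "example-predicate", "value-1"),
              ("technique-b", "belongs-to-tactic", "tactic-x")]
    print(len(merge_deduplicated(enterprise, mobile)))
```

This is one reason a merged estimate can come out lower than the sum of its per-source rows: triples shared between inputs are counted once.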

Cross-Source Links

ATT&CK <──> CAPEC <──> CWE <──> CVE <──> CPE
  ^                              ^
  ├── D3FEND (counters)          ├── EPSS (scores)
  ├── ATLAS (AI parallel)        ├── KEV (exploited)
  ├── CAR (detects)              ├── Vulnrichment (SSVC/CVSS)
  ├── ENGAGE (engages)           ├── GHSA (advisories)
  ├── F3 (fraud techniques)      ├── Nuclei (templates)
  ├── Sigma (detects)            ├── Sigma (related CVE)
  ├── LOLBAS (maps-to)           ├── ExploitDB (exploits)
  ├── LOLDrivers (maps-to)       ├── EUVD (EU vulns)
  ├── Atomic Red Team (tests)    └── OSV (open-source vulns) ──> Packages
  ├── NIST 800-53 (mitigates)
  └── MISP Galaxies (cross-refs)
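
One way to read the chain along the top row is as a sequence of lookups over triples. The sketch below walks CVE → CWE → CAPEC → ATT&CK using predicate names from the structure diagram (related-weakness, related, maps-to-technique). All identifiers are placeholders, and it assumes the CAPEC-to-CWE link is stored with CAPEC as the subject, which this README does not confirm.

```python
# Placeholder triples: identifiers are hypothetical, predicates follow the diagram.
TRIPLES = [
    ("CVE-EXAMPLE-1", "related-weakness", "CWE-EXAMPLE-A"),
    ("CAPEC-EXAMPLE-X", "related", "CWE-EXAMPLE-A"),
    ("CAPEC-EXAMPLE-X", "maps-to-technique", "T-EXAMPLE-1"),
    ("CVE-EXAMPLE-2", "related-weakness", "CWE-EXAMPLE-B"),
]


def objects(triples, subject, predicate):
    """Objects reachable from subject via predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def subjects(triples, predicate, obj):
    """Subjects that point at obj via predicate."""
    return [s for s, p, o in triples if p == predicate and o == obj]


def techniques_for_cve(triples, cve_id):
    """Follow CVE -> CWE -> CAPEC -> technique and collect unique techniques."""
    found = []
    for cwe in objects(triples, cve_id, "related-weakness"):
        for capec in subjects(triples, "related", cwe):
            for technique in objects(triples, capec, "maps-to-technique"):
                if technique not in found:
                    found.append(technique)
    return found


if __name__ == "__main__":
    print(techniques_for_cve(TRIPLES, "CVE-EXAMPLE-1"))
```

A CVE whose weakness has no linked attack pattern simply yields an empty list, which is the kind of coverage gap cross-source analysis is meant to surface.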

Examples

Graph Traversals

The SPO triples support real graph queries via DuckDB recursive CTEs — multi-hop traversals, hierarchy walks, and cross-source chain analysis without a graph database.

python examples/graph_traversals.py                          # all 8 queries
python examples/graph_traversals.py --query exploit-to-defense  # single query
python examples/graph_traversals.py --list                   # list queries

| Query | Description |
| --- | --- |
| attack-path | Technique → CAPEC → CWE multi-hop chain (recursive CTE) |
| defense-coverage | All CAR/Sigma/D3FEND/Engage defenses per technique |
| cwe-hierarchy | Walk CWE child-of tree to root pillar (recursive CTE) |
| vuln-risk | CVE risk profile across EPSS, KEV, CVSS, Vulnrichment |
| exploit-to-defense | Exploit → CVE → CWE → CAPEC → technique → defenses (5-hop) |
| threat-actor | Threat actors → ATT&CK techniques → target platforms |
| sigma-gap | ATT&CK techniques with vs without Sigma/CAR detection |
| stats | Cross-source relationship density statistics |
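
To show the shape of a recursive-CTE hierarchy walk such as cwe-hierarchy, here is a self-contained sketch. It uses Python's built-in sqlite3 instead of DuckDB so that it runs without extra dependencies. The `triples` table, its column names, and the CWE identifiers are assumptions for illustration. Only the `child-of` predicate is taken from the query description above.

```python
import sqlite3

# Placeholder hierarchy: leaf -> mid -> pillar.
ROWS = [
    ("CWE-LEAF", "child-of", "CWE-MID"),
    ("CWE-MID", "child-of", "CWE-PILLAR"),
    ("CWE-OTHER", "child-of", "CWE-PILLAR"),
]

# Start at one node and repeatedly follow child-of edges upward.
QUERY = """
WITH RECURSIVE walk(node, depth) AS (
    SELECT ?, 0
    UNION ALL
    SELECT t.object, walk.depth + 1
    FROM walk
    JOIN triples AS t
      ON t.subject = walk.node AND t.predicate = 'child-of'
)
SELECT node, depth FROM walk ORDER BY depth
"""


def ancestors(start):
    """Return (node, depth) rows from start up to the root of the hierarchy."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", ROWS)
    rows = conn.execute(QUERY, (start,)).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for node, depth in ancestors("CWE-LEAF"):
        print(depth, node)
```

The sketch assumes an acyclic hierarchy with one parent per node. With multiple parents, `UNION ALL` can return the same ancestor more than once, and a cycle would keep the recursion from terminating.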

Cross-Source Analysis Notebook

The cross-source visualizations notebook demonstrates 16 analyses across all 24 sources — including SSVC patch prioritization, defensive gap analysis, kill chain coverage, exploit weaponization timelines, supply chain risk scoring, and more.

pip install -e ".[viz]"
jupyter notebook examples/cross_source_visualizations.ipynb

Visualizer

Explore the Parquet files interactively at security-kg-viz.

Tests

python -m pytest tests/ -v --ignore=tests/test_integration.py  # unit tests
python -m pytest tests/test_integration.py -v                   # integration (network)

HuggingFace Dataset

The dataset is published at s0u9ata/security-kg on HuggingFace Hub and auto-updated weekly via GitHub Actions.

See the dataset card for schema details, example queries, and usage with the datasets library.

Future Data Sources

The following sources were researched and evaluated for inclusion. They are deferred for now but may be added in future versions.

Medium-Value Candidates

| Source | Format | Cross-links | License | Notes |
| --- | --- | --- | --- | --- |
| GTFOBins | YAML-in-Markdown (~400+ binaries) | ATT&CK via Navigator layer | GPL-3.0 | Linux counterpart to LOLBAS. Parsing slightly awkward (YAML front-matter in Markdown). |
| DISARM | CSV + STIX | Mirrors ATT&CK structure | CC-BY-SA-4.0 | Disinformation tactics & techniques. Niche domain (info ops, not cyber). STIX format eases integration. |
| Caldera Stockpile | YAML abilities | ATT&CK technique IDs | Apache-2.0 | Adversary emulation abilities mapped to ATT&CK. Smaller than Atomic Red Team, some overlap. |
| RE&CT | YAML (~200 actions) | Response actions → ATT&CK techniques | Apache-2.0 | Defensive complement: incident response actions that counter specific ATT&CK techniques. |
| VERIS | JSON Schema + CSV | VERIS actions → ATT&CK mapping | CC | Incident taxonomy (Verizon DBIR vocabulary). Schema/vocabulary rather than entity database. |
| OWASP ASVS | CSV | CWE mappings per requirement | CC-BY-SA-4.0 | Web-app security verification requirements. CWE cross-links need confirmation. |

International Sources Investigated

| Source | Country | Status |
| --- | --- | --- |
| JVN iPedia | Japan | RSS feeds available, CVE-linked, bilingual (JP/EN). Limited bulk structured data access. |
| ThaiCERT | Thailand | 504 APT group threat cards, structured. Niche coverage, limited API. |
| CNNVD / CNVD | China | Access restrictions for non-Chinese IPs, data quality concerns, significant latency vs NVD. |
| KrCERT / KNVD | South Korea | Limited public API, Korean-language only. |
| BSI | Germany | Advisories available, German-language, no bulk structured feed. |
| ANSSI | France | Advisories and IOC reports, French-language, limited machine-readable data. |
| CERT-In | India | CVE CNA, publishes advisories but no bulk structured data download. |
| AusCERT | Australia | RSS feeds available, English-language. Limited structured data beyond advisories. |
| CERT-EU | EU | Threat landscape reports, limited machine-readable data. |
| BDU (FSTEC) | Russia | Poor data quality, slow updates, access restrictions. |

Evaluated and Excluded

| Source | Why Excluded |
| --- | --- |
| MAEC | Malware attribute enumeration. Sparse community adoption, limited structured data available. |
| OVAL | Compliance-focused XML definitions. Very large, focused on system configuration rather than threat context. |
| CCE | Configuration enumeration (Excel format). Narrow scope, limited cross-linking potential. |
| Abuse.ch (ThreatFox/URLhaus/MalwareBazaar) | IOC feeds are ephemeral/high-volume and don't produce stable entity relationships for a KG. |
| Ransomware.live | API-only, rate-limited, no bulk download. |
| PhishTank | No cross-links to ATT&CK/CVE/CWE. Pure IOC feed. |
| Metasploit Modules | No machine-readable CVE mapping file. Would require Ruby AST parsing. |
| MITRE EMB3D | Very niche (OT/embedded). Cross-links to ATT&CK/CWE unclear. Worth revisiting as it matures. |
| CIS Controls | No freely downloadable machine-readable data. Proprietary. |
| VulnCheck KEV | No confirmed public bulk data repository. Commercial. |
| AttackIQ / SCYTHE / ANY.RUN / Triage | Commercial platforms, no open bulk data. |

Related Work

BRON is a linked threat knowledge graph developed at MIT that bridges ATT&CK, CAPEC, CWE, CVE, and D3FEND into a unified graph structure. Its goals overlap significantly with this project — both aim to connect disparate security ontologies into a queryable knowledge graph. This project was started independently and covers a broader set of sources (24 vs BRON's 5) with a flat SPO triple design stored as Parquet rather than a property graph.

This project grew out of the author's master's thesis, which applied KEPLER (a joint knowledge-embedding and language model) to classify ATT&CK techniques from cyber threat intelligence reports. The thesis converted TRAM training data into KG triples enriched with ATT&CK metadata — tactics, technique names, and procedure examples — and showed that the KG-enhanced model outperformed a text-only baseline, validating the value of structured security knowledge for downstream ML tasks. security-kg extends that data pipeline to 24 sources, providing the broad, structured KG foundation that such models need at scale.

Source Licensing & Attribution

This project is licensed under Apache 2.0. The underlying source data is provided under various licenses as detailed below.

| Source | License | Attribution |
| --- | --- | --- |
| ATT&CK | Custom royalty-free (MITRE) | © The MITRE Corporation. Reproduced and distributed with the permission of The MITRE Corporation. |
| CAPEC | Custom royalty-free (MITRE) | © The MITRE Corporation. Reproduced and distributed with the permission of The MITRE Corporation. |
| CWE | Custom royalty-free (MITRE) | © The MITRE Corporation. Reproduced and distributed with the permission of The MITRE Corporation. |
| CVE | Custom permissive (MITRE) | © The MITRE Corporation. CVE® is a registered trademark of The MITRE Corporation. |
| CPE / NVD | Public domain (NIST) | This product uses data from the NVD API but is not endorsed or certified by the NVD. |
| D3FEND | MIT License | © The MITRE Corporation. MITRE D3FEND™ is a trademark of The MITRE Corporation. |
| ATLAS | Apache 2.0 | © MITRE. |
| CAR | Apache 2.0 | © The MITRE Corporation. |
| ENGAGE | Apache 2.0 (GitHub repo) / Custom restrictive (website ToU) | © The MITRE Corporation. Reproduced and distributed with the permission of The MITRE Corporation. Note: the GitHub repo is licensed Apache 2.0, but the website terms restrict use to internal/non-commercial purposes. Clarification pending with MITRE. |
| F3 | Apache 2.0 | © MITRE Engenuity, Center for Threat-Informed Defense. |
| EPSS | Custom permissive (FIRST) | Jacobs, Romanosky, Edwards, Roytman, Adjerid (2021), Exploit Prediction Scoring System, Digital Threats Research and Practice, 2(3). See first.org/epss. |
| KEV | Public domain (U.S. Gov) | Source: CISA Known Exploited Vulnerabilities Catalog. |
| Vulnrichment | CC0 1.0 Universal | Source: CISA Vulnrichment. |
| GHSA | CC BY 4.0 | Source: GitHub Advisory Database. Licensed under CC BY 4.0. |
| Sigma | Detection Rule License 1.1 | Source: SigmaHQ. Licensed under DRL 1.1. Rule author attribution is preserved in triples. |
| ExploitDB | GPLv2+ | Source: OffSec ExploitDB. Derived factual metadata (IDs, CVE mappings, dates) extracted under GPLv2+. |
| MISP Galaxies | CC0 1.0 / BSD 2-Clause | Source: MISP Project. Dual-licensed under CC0 1.0 and BSD 2-Clause. |
| LOLBAS | GPL-3.0 | Source: LOLBAS Project. Licensed under GPL-3.0. |
| LOLDrivers | Apache 2.0 | Source: magicsword-io. |
| Atomic Red Team | MIT License | Source: Red Canary. Licensed under MIT. |
| NIST 800-53 Mappings | Apache 2.0 | © MITRE Engenuity, Center for Threat-Informed Defense. |
| Nuclei Templates | MIT License | Source: ProjectDiscovery. Licensed under MIT. |
| EUVD | Public (ENISA) | Source: European Union Agency for Cybersecurity (ENISA). |
| OSV | CC BY 4.0 | Source: Google OSV. Licensed under CC BY 4.0. |

License

Apache 2.0 — see Source Licensing & Attribution for individual source terms.