Convert security data from 24 sources into Subject-Predicate-Object (SPO) knowledge-graph triples in Parquet format.
Sources: ATT&CK · CAPEC · CWE · CVE · CPE · D3FEND · ATLAS · CAR · ENGAGE · F3 · EPSS · KEV · Vulnrichment · GHSA · Sigma · ExploitDB · MISP Galaxies · LOLBAS · LOLDrivers · Atomic Red Team · NIST 800-53 · Nuclei · EUVD · OSV
---
config:
layout: dagre
theme: neutral
---
graph LR
%% ATT&CK core
C[Campaign]:::attack -->|attributed-to| G[Group]:::attack
C -->|uses| T[Technique]:::attack
G -->|uses| T
G -->|uses| SW[Malware / Tool]:::attack
SW -->|uses| T
ST[Sub-technique]:::attack -->|subtechnique-of| T
T -->|belongs-to-tactic| TAC[Tactic]:::attack
MIT[Mitigation]:::attack -->|mitigates| T
DC[DataComponent]:::attack -->|detects| T
%% Defense & detection
DT[DefensiveTechnique]:::d3fend -->|counters| T
AN[Analytic]:::car -->|detects-technique| T
AN -->|maps-to-d3fend| DT
EA[EngagementActivity]:::engage -->|engages-technique| T
AT[ATLAS Technique]:::atlas -->|related-attack-technique| T
%% Red team, binary abuse & controls
ART2[AtomicTest]:::atomic -->|tests-technique| T
LB[LOLBinary]:::lolbas -->|maps-to-technique| T
LD[LOLDriver]:::loldrivers -->|maps-to-technique| T
SC[SecurityControl]:::nist -->|mitigates-technique| T
%% Threat intel
TA[ThreatActor]:::misp -->|related-attack-id| T
TA -->|targets-country| CTR[Country]:::misp
TA -->|targets-sector| SEC[Sector]:::misp
%% CAPEC ↔ CWE bridge
AP[Attack Pattern]:::capec -->|maps-to-technique| T
AP <-->|related| W[Weakness]:::cwe
%% Vulnerability chain
V[Vulnerability]:::cve -->|related-weakness| W
V -->|affects-cpe| P[Platform]:::cpe
OV[OSVulnerability]:::osv -->|related-cve| V
OV -->|affects-package| PKG[Package]:::osv
V -.->|epss-score| ES((EPSS)):::epss
V -.->|kev| KE((KEV)):::kev
NT[NucleiTemplate]:::nuclei -->|related-cve| V
EU[EUVulnerability]:::euvd -->|related-cve| V
%% Standalone
FT[F3 Technique]:::f3 -->|belongs-to-tactic| FTAC[F3 Tactic]:::f3
classDef attack fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
classDef capec fill:#fef3c7,stroke:#f59e0b,color:#78350f
classDef cwe fill:#fce7f3,stroke:#ec4899,color:#831843
classDef cve fill:#fee2e2,stroke:#ef4444,color:#7f1d1d
classDef cpe fill:#e0e7ff,stroke:#6366f1,color:#312e81
classDef d3fend fill:#d1fae5,stroke:#10b981,color:#064e3b
classDef car fill:#fef9c3,stroke:#eab308,color:#713f12
classDef engage fill:#ede9fe,stroke:#8b5cf6,color:#4c1d95
classDef f3 fill:#fbcfe8,stroke:#ec4899,color:#831843
classDef atlas fill:#cffafe,stroke:#06b6d4,color:#164e63
classDef epss fill:#f3f4f6,stroke:#6b7280,color:#374151
classDef kev fill:#f3f4f6,stroke:#6b7280,color:#374151
classDef misp fill:#fdf2f8,stroke:#db2777,color:#831843
classDef atomic fill:#fff7ed,stroke:#f97316,color:#7c2d12
classDef lolbas fill:#fef2f2,stroke:#dc2626,color:#7f1d1d
classDef loldrivers fill:#fef2f2,stroke:#b91c1c,color:#7f1d1d
classDef nist fill:#f0fdf4,stroke:#16a34a,color:#14532d
classDef nuclei fill:#eff6ff,stroke:#2563eb,color:#1e3a5f
classDef euvd fill:#fdf4ff,stroke:#a855f7,color:#581c87
classDef osv fill:#f0fdfa,stroke:#14b8a6,color:#134e4a
Legend: Blue = ATT&CK · Amber = CAPEC · Pink = CWE / F3 · Red = CVE · Indigo = CPE · Green = D3FEND / NIST · Cyan = ATLAS · Yellow = CAR · Violet = ENGAGE · Fuchsia = MISP Galaxies · Gray = EPSS / KEV · Orange = Atomic Red Team · Scarlet = LOLBAS / LOLDrivers · Royal Blue = Nuclei · Purple = EUVD · Teal = OSV
pip install -r requirements.txt
# Convert all 24 sources → output/*.parquet + combined.parquet
python src/convert.py
# Convert specific sources in parallel
python src/convert.py --sources cve epss kev --parallel --workers 8
All options
| Option | Description |
|---|---|
| `--sources <src ...>` | Sources to convert (default: all). Values: attack capec cwe cve cpe d3fend atlas car engage f3 epss kev vulnrichment ghsa sigma exploitdb misp_galaxy lolbas loldrivers atomic nist_800_53 nuclei euvd osv |
| `--domains <dom ...>` | ATT&CK domains: enterprise, mobile, ics (default: all) |
| `--output-dir <dir>` | Output directory (default: output/) |
| `--cache-dir <dir>` | Source file cache (default: source/) |
| `--parquet-format v1\|v2` | v2 = Parquet 2.6 + snappy (default), v1 = 1.0 + gzip |
| `--no-combined` | Skip combined.parquet generation |
| `--parallel` | Run conversions in parallel |
| `--workers <n>` | Parallel workers (default: 4) |
| `--force-download` | Re-download source data even if cached version is up-to-date |
| `--force-convert` | Re-convert even if source data hasn't changed |
| `--limit <n>` | Limit each source to N triples (quick local testing) |
| `--update-readme` | Update hf_dataset/README.md with triple counts |
| `--no-stats` | Skip dashboard stats JSON generation |
| `--log-dir <dir>` | Log file directory (default: logs/) |
Individual converters also run standalone: python src/convert_attack.py, python src/convert_cve.py, etc.
Source files are cached in source/ by default. Files are versioned using Last-Modified or ETag headers and only re-downloaded when the source has been updated.
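The freshness check described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual download code: `CacheEntry`, `conditional_headers`, and `needs_download` are hypothetical names, and the sketch assumes the validators from the previous download are stored next to the cached file.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class CacheEntry:
    """Validators saved at the last download (hypothetical layout)."""
    etag: Optional[str] = None
    last_modified: Optional[str] = None


def conditional_headers(entry: CacheEntry) -> dict:
    """Build conditional-request headers so the server can answer 304 Not Modified."""
    headers = {}
    if entry.etag:
        headers["If-None-Match"] = entry.etag
    if entry.last_modified:
        headers["If-Modified-Since"] = entry.last_modified
    return headers


def needs_download(entry: Optional[CacheEntry],
                   remote: Mapping[str, str],
                   force: bool = False) -> bool:
    """Return True when the cached copy should be refreshed.

    `remote` holds response headers (e.g. from a HEAD request), assumed
    here to use the canonical "ETag" / "Last-Modified" spellings.
    """
    if force or entry is None:
        return True
    remote_etag = remote.get("ETag")
    remote_modified = remote.get("Last-Modified")
    # Prefer the ETag when both sides have one.
    if entry.etag and remote_etag:
        return entry.etag != remote_etag
    if entry.last_modified and remote_modified:
        return entry.last_modified != remote_modified
    # No comparable validator: re-download to be safe.
    return True


cached = CacheEntry(etag='"abc"', last_modified="Mon, 01 Jan 2024 00:00:00 GMT")
print(needs_download(cached, {"ETag": '"abc"'}))
print(needs_download(cached, {"ETag": '"def"'}))
```

The `force` argument plays the role that `--force-download` has on the command line: it bypasses the comparison entirely.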
Output goes to output/:
| File | Source | Est. Triples |
|---|---|---|
| `enterprise.parquet` | ATT&CK Enterprise | ~40-50K |
| `mobile.parquet` | ATT&CK Mobile | ~5-7K |
| `ics.parquet` | ATT&CK ICS | ~4-5K |
| `attack-all.parquet` | ATT&CK combined (deduplicated) | ~50-60K |
| `capec.parquet` | CAPEC attack patterns | ~8-10K |
| `cwe.parquet` | CWE weaknesses | ~14-16K |
| `cve.parquet` | CVE vulnerabilities | ~3-4M |
| `cpe.parquet` | CPE platform enumeration | ~10-15M |
| `d3fend.parquet` | D3FEND defensive techniques | ~8-10K |
| `atlas.parquet` | ATLAS AI/ML techniques | ~1-2K |
| `car.parquet` | CAR analytics | ~1-2K |
| `engage.parquet` | ENGAGE adversary engagement | ~1-2K |
| `f3.parquet` | F3 fraud techniques & tactics | ~1-2K |
| `epss.parquet` | EPSS exploit prediction scores | ~600-700K |
| `kev.parquet` | KEV known exploited vulns | ~15-20K |
| `vulnrichment.parquet` | CISA Vulnrichment (SSVC, CVSS, CWE) | ~500K-1M |
| `ghsa.parquet` | GitHub Security Advisories | ~300-400K |
| `sigma.parquet` | Sigma detection rules | ~30-40K |
| `exploitdb.parquet` | ExploitDB public exploits | ~300-400K |
| `misp_galaxy.parquet` | MISP Galaxy clusters | ~100-200K |
| `lolbas.parquet` | LOLBAS living-off-the-land binaries | ~5-10K |
| `loldrivers.parquet` | LOLDrivers vulnerable drivers | ~10-15K |
| `atomic.parquet` | Atomic Red Team test procedures | ~15-20K |
| `nist_800_53.parquet` | NIST 800-53 ATT&CK mappings | ~5-10K |
| `nuclei.parquet` | Nuclei detection templates | ~30-40K |
| `euvd.parquet` | EUVD EU vulnerability database | ~10-20K |
| `osv.parquet` | OSV open-source vulnerabilities | ~500K-1M+ |
| `combined.parquet` | All sources merged (deduplicated) | ~16-22M |
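To illustrate what "merged (deduplicated)" means for SPO triples, here is a small sketch that concatenates several triple streams and drops exact duplicates. It is not the project's merge code; `merge_triples` is a hypothetical helper, the identifiers are toy placeholders, and it assumes a duplicate is a triple whose subject, predicate, and object are all identical.

```python
from typing import Iterable, Iterator, Tuple

# A triple is (subject, predicate, object); the field order is an assumption.
Triple = Tuple[str, str, str]


def merge_triples(sources: Iterable[Iterable[Triple]]) -> Iterator[Triple]:
    """Yield triples from each source in order, skipping exact duplicates."""
    seen = set()
    for triples in sources:
        for triple in triples:
            if triple not in seen:
                seen.add(triple)
                yield triple


# Toy rows with placeholder identifiers (not real dataset content).
enterprise = [
    ("T-A", "belongs-to-tactic", "TA-X"),
    ("G-1", "uses", "T-A"),
]
mobile = [
    ("G-1", "uses", "T-A"),               # same triple also appears in enterprise
    ("T-B", "belongs-to-tactic", "TA-X"),
]

merged = list(merge_triples([enterprise, mobile]))
print(len(merged), "unique triples")
```

Keeping a set of seen triples in memory works for a sketch; at the row counts listed in the table, a columnar engine's distinct operation would likely be the more practical choice.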
ATT&CK <──> CAPEC <──> CWE <──> CVE <──> CPE
  ^                              ^
  ├── D3FEND (counters)          ├── EPSS (scores)
  ├── ATLAS (AI parallel)        ├── KEV (exploited)
  ├── CAR (detects)              ├── Vulnrichment (SSVC/CVSS)
  ├── ENGAGE (engages)           ├── GHSA (advisories)
  ├── F3 (fraud techniques)      ├── Nuclei (templates)
  ├── Sigma (detects)            ├── Sigma (related CVE)
  ├── LOLBAS (maps-to)           ├── ExploitDB (exploits)
  ├── LOLDrivers (maps-to)       ├── EUVD (EU vulns)
  ├── Atomic Red Team (tests)    └── OSV (open-source vulns) ──> Packages
  ├── NIST 800-53 (mitigates)
  └── MISP Galaxies (cross-refs)
The SPO triples can be queried as a graph with DuckDB recursive CTEs: multi-hop traversals, hierarchy walks, and cross-source chain analysis all run directly on the Parquet files, without a graph database.
python examples/graph_traversals.py # all 8 queries
python examples/graph_traversals.py --query exploit-to-defense # single query
python examples/graph_traversals.py --list # list queries
| Query | Description |
|---|---|
| `attack-path` | Technique → CAPEC → CWE multi-hop chain (recursive CTE) |
| `defense-coverage` | All CAR/Sigma/D3FEND/Engage defenses per technique |
| `cwe-hierarchy` | Walk CWE child-of tree to root pillar (recursive CTE) |
| `vuln-risk` | CVE risk profile across EPSS, KEV, CVSS, Vulnrichment |
| `exploit-to-defense` | Exploit → CVE → CWE → CAPEC → technique → defenses (5-hop) |
| `threat-actor` | Threat actors → ATT&CK techniques → target platforms |
| `sigma-gap` | ATT&CK techniques with vs without Sigma/CAR detection |
| `stats` | Cross-source relationship density statistics |
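As a rough idea of what a recursive-CTE hierarchy walk over triples looks like, the sketch below follows `child-of` edges from a node up to its root. It is not the query shipped in examples/graph_traversals.py. To stay dependency-free it uses Python's built-in `sqlite3` as a stand-in for DuckDB (both accept `WITH RECURSIVE`), and the column names `subject`, `predicate`, `object` as well as the row values are assumptions made for illustration.

```python
import sqlite3

# In-memory table standing in for a triples Parquet file (assumed columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        # Toy placeholder rows, not real CWE entries.
        ("CWE-leaf", "child-of", "CWE-mid"),
        ("CWE-mid", "child-of", "CWE-pillar"),
        ("CWE-other", "child-of", "CWE-pillar"),
        ("CWE-leaf", "name", "toy leaf weakness"),
    ],
)

# Start at one node, then repeatedly hop along child-of edges.
# The depth guard stops runaway recursion if the data contains a cycle.
WALK_TO_ROOT = """
WITH RECURSIVE walk(node, depth) AS (
    SELECT ?, 0
    UNION ALL
    SELECT t.object, walk.depth + 1
    FROM walk
    JOIN triples AS t
      ON t.subject = walk.node AND t.predicate = 'child-of'
    WHERE walk.depth < 20
)
SELECT node, depth FROM walk ORDER BY depth
"""


def ancestors(start: str) -> list:
    """Return (node, depth) rows from `start` up to the root of its hierarchy."""
    return conn.execute(WALK_TO_ROOT, (start,)).fetchall()


for node, depth in ancestors("CWE-leaf"):
    print("  " * depth + node)
```

With DuckDB, the `triples` table would presumably be replaced by `read_parquet(...)` over one of the output files, with the recursive part of the query unchanged.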
The cross-source visualizations notebook demonstrates 16 analyses across all 24 sources — including SSVC patch prioritization, defensive gap analysis, kill chain coverage, exploit weaponization timelines, supply chain risk scoring, and more.
pip install -e ".[viz]"
jupyter notebook examples/cross_source_visualizations.ipynb
Explore the Parquet files interactively at security-kg-viz.
python -m pytest tests/ -v --ignore=tests/test_integration.py # unit tests
python -m pytest tests/test_integration.py -v # integration (network)
The dataset is published at s0u9ata/security-kg on HuggingFace Hub and auto-updated weekly via GitHub Actions.
See the dataset card for schema details, example queries, and usage with the datasets library.
The following sources were researched and evaluated for inclusion. They are deferred for now but may be added in future versions.
| Source | Format | Cross-links | License | Notes |
|---|---|---|---|---|
| GTFOBins | YAML-in-Markdown (~400+ binaries) | ATT&CK via Navigator layer | GPL-3.0 | Linux counterpart to LOLBAS. Parsing slightly awkward (YAML front-matter in Markdown). |
| DISARM | CSV + STIX | Mirrors ATT&CK structure | CC-BY-SA-4.0 | Disinformation tactics & techniques. Niche domain (info ops, not cyber). STIX format eases integration. |
| Caldera Stockpile | YAML abilities | ATT&CK technique IDs | Apache-2.0 | Adversary emulation abilities mapped to ATT&CK. Smaller than Atomic Red Team, some overlap. |
| RE&CT | YAML (~200 actions) | Response actions → ATT&CK techniques | Apache-2.0 | Defensive complement — incident response actions that counter specific ATT&CK techniques. |
| VERIS | JSON Schema + CSV | VERIS actions → ATT&CK mapping | CC | Incident taxonomy (Verizon DBIR vocabulary). Schema/vocabulary rather than entity database. |
| OWASP ASVS | CSV | CWE mappings per requirement | CC-BY-SA-4.0 | Web-app security verification requirements. CWE cross-links need confirmation. |
| Source | Country | Status |
|---|---|---|
| JVN iPedia | Japan | RSS feeds available, CVE-linked, bilingual (JP/EN). Limited bulk structured data access. |
| ThaiCERT | Thailand | 504 APT group threat cards, structured. Niche coverage, limited API. |
| CNNVD / CNVD | China | Access restrictions for non-Chinese IPs, data quality concerns, significant latency vs NVD. |
| KrCERT / KNVD | South Korea | Limited public API, Korean-language only. |
| BSI | Germany | Advisories available, German-language, no bulk structured feed. |
| ANSSI | France | Advisories and IOC reports, French-language, limited machine-readable data. |
| CERT-In | India | CVE CNA, publishes advisories but no bulk structured data download. |
| AusCERT | Australia | RSS feeds available, English-language. Limited structured data beyond advisories. |
| CERT-EU | EU | Threat landscape reports, limited machine-readable data. |
| BDU (FSTEC) | Russia | Poor data quality, slow updates, access restrictions. |
| Source | Why Excluded |
|---|---|
| MAEC | Malware attribute enumeration. Sparse community adoption, limited structured data available. |
| OVAL | Compliance-focused XML definitions. Very large, focused on system configuration rather than threat context. |
| CCE | Configuration enumeration (Excel format). Narrow scope, limited cross-linking potential. |
| Abuse.ch (ThreatFox/URLhaus/MalwareBazaar) | IOC feeds are ephemeral/high-volume and don't produce stable entity relationships for a KG. |
| Ransomware.live | API-only, rate-limited, no bulk download. |
| PhishTank | No cross-links to ATT&CK/CVE/CWE. Pure IOC feed. |
| Metasploit Modules | No machine-readable CVE mapping file. Would require Ruby AST parsing. |
| MITRE EMB3D | Very niche (OT/embedded). Cross-links to ATT&CK/CWE unclear. Worth revisiting as it matures. |
| CIS Controls | No freely downloadable machine-readable data. Proprietary. |
| VulnCheck KEV | No confirmed public bulk data repository. Commercial. |
| AttackIQ / SCYTHE / ANY.RUN / Triage | Commercial platforms, no open bulk data. |
BRON is a linked threat knowledge graph developed at MIT that bridges ATT&CK, CAPEC, CWE, CVE, and D3FEND into a unified graph structure. Its goals overlap significantly with this project — both aim to connect disparate security ontologies into a queryable knowledge graph. This project was started independently and covers a broader set of sources (24 vs BRON's 5) with a flat SPO triple design stored as Parquet rather than a property graph.
This project grew out of the author's master's thesis, which applied KEPLER (a joint knowledge-embedding and language model) to classify ATT&CK techniques from cyber threat intelligence reports. The thesis converted TRAM training data into KG triples enriched with ATT&CK metadata — tactics, technique names, and procedure examples — and showed that the KG-enhanced model outperformed a text-only baseline, validating the value of structured security knowledge for downstream ML tasks. security-kg extends that data pipeline to 24 sources, providing the broad, structured KG foundation that such models need at scale.
This project is licensed under Apache 2.0. The underlying source data is provided under various licenses as detailed below.
| Source | License | Attribution |
|---|---|---|
| ATT&CK | Custom royalty-free (MITRE) | © The MITRE Corporation. Reproduced and distributed with the permission of The MITRE Corporation. |
| CAPEC | Custom royalty-free (MITRE) | © The MITRE Corporation. Reproduced and distributed with the permission of The MITRE Corporation. |
| CWE | Custom royalty-free (MITRE) | © The MITRE Corporation. Reproduced and distributed with the permission of The MITRE Corporation. |
| CVE | Custom permissive (MITRE) | © The MITRE Corporation. CVE® is a registered trademark of The MITRE Corporation. |
| CPE / NVD | Public domain (NIST) | This product uses data from the NVD API but is not endorsed or certified by the NVD. |
| D3FEND | MIT License | © The MITRE Corporation. MITRE D3FEND™ is a trademark of The MITRE Corporation. |
| ATLAS | Apache 2.0 | © MITRE. |
| CAR | Apache 2.0 | © The MITRE Corporation. |
| ENGAGE | Apache 2.0 (GitHub repo) / Custom restrictive (website ToU) | © The MITRE Corporation. Reproduced and distributed with the permission of The MITRE Corporation. Note: the GitHub repo is licensed Apache 2.0, but the website terms restrict use to internal/non-commercial purposes. Clarification pending with MITRE. |
| F3 | Apache 2.0 | © MITRE Engenuity, Center for Threat-Informed Defense. |
| EPSS | Custom permissive (FIRST) | Jacobs, Romanosky, Edwards, Roytman, Adjerid (2021), Exploit Prediction Scoring System, Digital Threats Research and Practice, 2(3). See first.org/epss. |
| KEV | Public domain (U.S. Gov) | Source: CISA Known Exploited Vulnerabilities Catalog. |
| Vulnrichment | CC0 1.0 Universal | Source: CISA Vulnrichment. |
| GHSA | CC BY 4.0 | Source: GitHub Advisory Database. Licensed under CC BY 4.0. |
| Sigma | Detection Rule License 1.1 | Source: SigmaHQ. Licensed under DRL 1.1. Rule author attribution is preserved in triples. |
| ExploitDB | GPLv2+ | Source: OffSec ExploitDB. Derived factual metadata (IDs, CVE mappings, dates) extracted under GPLv2+. |
| MISP Galaxies | CC0 1.0 / BSD 2-Clause | Source: MISP Project. Dual-licensed under CC0 1.0 and BSD 2-Clause. |
| LOLBAS | GPL-3.0 | Source: LOLBAS Project. Licensed under GPL-3.0. |
| LOLDrivers | Apache 2.0 | Source: magicsword-io. |
| Atomic Red Team | MIT License | Source: Red Canary. Licensed under MIT. |
| NIST 800-53 Mappings | Apache 2.0 | © MITRE Engenuity, Center for Threat-Informed Defense. |
| Nuclei Templates | MIT License | Source: ProjectDiscovery. Licensed under MIT. |
| EUVD | Public (ENISA) | Source: European Union Agency for Cybersecurity (ENISA). |
| OSV | CC BY 4.0 | Source: Google OSV. Licensed under CC BY 4.0. |
Apache 2.0 — see Source Licensing & Attribution for individual source terms.