Framework Mapping

This document maps the conceptual contributions of the published paper to the modules in this implementation.

Reference: Mudusu, S. K., & Gentyala, S. (2026). Zero-Trust Data Pipelines for AI Systems: A Framework for Secure, Verifiable, and Auditable Data Engineering. Journal of Recent Trends in Computer Science and Engineering, 14(2), 10–25.


Paper section → implementation module

| Paper concept | Module / component | Notes |
| --- | --- | --- |
| Zero-trust ingestion boundary | ingestion.py | Checksum, extension guard, size limit |
| Data integrity verification | ingestion._sha256() | SHA-256 on raw bytes before parsing |
| Schema validation layer | validation.py | Required fields, null counts, duplicate check |
| Policy-driven access control | policy_engine.py + policies.yaml | Declarative YAML rules, per-rule decisions |
| PII detection and flagging | policy_engine, pii_columns rule | Flags presence, does not mask (extension point) |
| Data lineage capture | lineage.py, LineageTracker | SQLite, queryable history |
| Immutable audit trail | audit.py, AuditLogger | Append-only SQLite, JSONL export |
| AI-readiness / trust scoring | trust_score.py | Weighted 0–100 score, letter grade |
| Verifiable pipeline composition | examples/sample_pipeline.py | End-to-end stage orchestration |
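
To make the first two rows concrete, here is a minimal sketch of an ingestion boundary that combines an extension guard, a size limit, and a SHA-256 check on the raw bytes. It is not the code in ingestion.py: the names `check_ingestion` and `sha256_hex`, the allow-list, and the 50 MiB limit are all assumptions chosen for illustration.

```python
import hashlib
from pathlib import Path

# Hypothetical settings; the real module may use different values.
ALLOWED_EXTENSIONS = {".csv", ".json", ".parquet"}
MAX_BYTES = 50 * 1024 * 1024  # assumed 50 MiB limit


def sha256_hex(raw: bytes) -> str:
    """Return the SHA-256 hex digest of the raw, unparsed bytes."""
    return hashlib.sha256(raw).hexdigest()


def check_ingestion(path: Path, expected_sha256: str) -> list[str]:
    """Return a list of violations; an empty list means the file may proceed."""
    violations = []
    if path.suffix.lower() not in ALLOWED_EXTENSIONS:
        violations.append(f"extension {path.suffix!r} not allowed")
    if path.stat().st_size > MAX_BYTES:
        violations.append("file exceeds size limit")
        return violations  # do not read an oversized file into memory
    # Hash the bytes exactly as stored, before any parsing or decoding.
    if sha256_hex(path.read_bytes()) != expected_sha256:
        violations.append("checksum mismatch")
    return violations
```

Returning a list of violations rather than raising on the first failure lets a caller record every reason a file was rejected, which fits the per-rule reporting style used elsewhere in the table.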

Design decisions

Why YAML for policies? The paper argues that policy definitions should be separate from pipeline code and auditable as configuration artifacts. YAML satisfies both requirements: it is human-readable and version-controllable, and because it is parsed at runtime, policies can change without code changes.
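
The idea of declarative rules with per-rule decisions can be sketched as follows. The dictionary stands in for what parsing a policies.yaml file would produce (for example with PyYAML's `yaml.safe_load`); the rule types, field names, and the `evaluate` function are hypothetical and do not reflect the actual schema used by policy_engine.py.

```python
# Stand-in for a parsed policies.yaml; the rule schema here is an assumption.
POLICIES = {
    "rules": [
        {"name": "allowed_sources", "type": "allowed_values",
         "field": "source", "values": ["crm", "billing"]},
        {"name": "pii_columns", "type": "pii_columns",
         "columns": ["email", "ssn"]},
    ]
}


def evaluate(policies: dict, metadata: dict) -> list[dict]:
    """Evaluate every rule and return one decision record per rule."""
    decisions = []
    for rule in policies["rules"]:
        if rule["type"] == "allowed_values":
            ok = metadata.get(rule["field"]) in rule["values"]
            decisions.append({"rule": rule["name"],
                              "decision": "allow" if ok else "deny"})
        elif rule["type"] == "pii_columns":
            # Flag the presence of PII columns; masking is left to the caller.
            found = sorted(set(rule["columns"]) & set(metadata.get("columns", [])))
            decisions.append({"rule": rule["name"],
                              "decision": "flag" if found else "allow",
                              "columns": found})
        else:
            # Unknown rule types fail closed, in keeping with zero-trust defaults.
            decisions.append({"rule": rule["name"], "decision": "deny",
                              "reason": "unknown rule type"})
    return decisions
```

Because the rules are plain data, changing the allow-list or the PII column names is a configuration change that can be reviewed and versioned without touching the evaluation code.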

Why SQLite for lineage and audit? The implementation targets local and single-node deployments. SQLite gives us ACID semantics and queryability without requiring a database server. The LineageTracker and AuditLogger interfaces are thin enough that the storage backend can be swapped (e.g., to PostgreSQL or DuckDB) by changing the connection string.
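
As an illustration of an append-only SQLite log with JSONL export, the sketch below uses triggers to reject UPDATE and DELETE statements. The class name `SketchAuditLogger`, the table schema, and the trigger approach are assumptions made for this example; the actual AuditLogger may enforce immutability differently.

```python
import json
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema; triggers abort any attempt to modify or remove rows.
SCHEMA = """
CREATE TABLE IF NOT EXISTS audit_log (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,
    ts      TEXT NOT NULL,
    actor   TEXT NOT NULL,
    action  TEXT NOT NULL,
    success INTEGER NOT NULL
);
CREATE TRIGGER IF NOT EXISTS audit_no_update BEFORE UPDATE ON audit_log
BEGIN SELECT RAISE(ABORT, 'audit_log is append-only'); END;
CREATE TRIGGER IF NOT EXISTS audit_no_delete BEFORE DELETE ON audit_log
BEGIN SELECT RAISE(ABORT, 'audit_log is append-only'); END;
"""


class SketchAuditLogger:
    def __init__(self, db_path: str = ":memory:"):
        self.conn = sqlite3.connect(db_path)
        self.conn.executescript(SCHEMA)

    def log(self, actor: str, action: str, success: bool) -> None:
        """Append one event; the transaction commits on leaving the block."""
        with self.conn:
            self.conn.execute(
                "INSERT INTO audit_log (ts, actor, action, success) "
                "VALUES (?, ?, ?, ?)",
                (datetime.now(timezone.utc).isoformat(), actor, action, int(success)),
            )

    def export_jsonl(self) -> str:
        """Return all events as JSON Lines, oldest first."""
        rows = self.conn.execute(
            "SELECT id, ts, actor, action, success FROM audit_log ORDER BY id"
        )
        return "\n".join(
            json.dumps({"id": r[0], "ts": r[1], "actor": r[2],
                        "action": r[3], "success": bool(r[4])})
            for r in rows
        )
```

Note that triggers only guard against modification through SQL on this connection's schema; anyone with write access to the database file can still drop them, so stronger guarantees would need file permissions or an external store.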

Why separate lineage and audit stores? Lineage describes what happened to data; audit describes who did what and whether it succeeded. Mixing them conflates two distinct concerns. Keeping them separate simplifies querying and access control in multi-actor deployments.
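
To show what a lineage store holds when it is kept apart from the audit log, here is a small sketch that records which artifact was derived from which and answers "what is upstream of this dataset?" with a recursive query. The class name `SketchLineageTracker`, the table layout, and the artifact names are hypothetical and are not taken from lineage.py.

```python
import sqlite3


class SketchLineageTracker:
    def __init__(self, db_path: str = ":memory:"):
        # A separate database from the audit log: it stores data flow only,
        # with no actor or success information.
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS lineage ("
            "id INTEGER PRIMARY KEY, source TEXT NOT NULL, "
            "target TEXT NOT NULL, stage TEXT NOT NULL)"
        )

    def record(self, source: str, target: str, stage: str) -> None:
        """Record that `target` was produced from `source` by `stage`."""
        with self.conn:
            self.conn.execute(
                "INSERT INTO lineage (source, target, stage) VALUES (?, ?, ?)",
                (source, target, stage),
            )

    def upstream(self, artifact: str) -> list[str]:
        """Return every ancestor of `artifact`, sorted by name."""
        rows = self.conn.execute(
            """
            WITH RECURSIVE ancestors(name) AS (
                SELECT source FROM lineage WHERE target = ?
                UNION
                SELECT l.source FROM lineage l
                JOIN ancestors a ON l.target = a.name
            )
            SELECT name FROM ancestors ORDER BY name
            """,
            (artifact,),
        )
        return [r[0] for r in rows]
```

A query such as `upstream("scored")` touches only data-flow records, so read access to lineage can be granted without exposing who performed each action.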