This document maps the conceptual contributions of the published paper to the modules in this implementation.
Reference: Mudusu, S. K., & Gentyala, S. (2026). Zero-Trust Data Pipelines for AI Systems: A Framework for Secure, Verifiable, and Auditable Data Engineering. Journal of Recent Trends in Computer Science and Engineering, 14(2), 10–25.
| Paper concept | Module / component | Notes |
|---|---|---|
| Zero-trust ingestion boundary | ingestion.py | Checksum, extension guard, size limit |
| Data integrity verification | ingestion._sha256() | SHA-256 on raw bytes before parsing |
| Schema validation layer | validation.py | Required fields, null counts, duplicate check |
| Policy-driven access control | policy_engine.py + policies.yaml | Declarative YAML rules, per-rule decisions |
| PII detection and flagging | policy_engine (pii_columns rule) | Flags presence, does not mask (extension point) |
| Data lineage capture | lineage.py (LineageTracker) | SQLite, queryable history |
| Immutable audit trail | audit.py (AuditLogger) | Append-only SQLite, JSONL export |
| AI-readiness / trust scoring | trust_score.py | Weighted 0–100 score, letter grade |
| Verifiable pipeline composition | examples/sample_pipeline.py | End-to-end stage orchestration |
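
To make the first two rows of the table concrete, here is a minimal sketch of an ingestion boundary that applies an extension guard, a size limit, and a SHA-256 check on the raw bytes before any parsing. The names `ALLOWED_EXTENSIONS`, `MAX_BYTES`, `sha256_of`, and `ingest` are hypothetical, as are the specific limits. The actual `ingestion.py` may be organized differently.

```python
import hashlib
from pathlib import Path

# Hypothetical limits chosen for illustration only.
ALLOWED_EXTENSIONS = {".csv", ".json", ".parquet"}
MAX_BYTES = 50 * 1024 * 1024  # 50 MiB


def sha256_of(raw: bytes) -> str:
    """Hex SHA-256 digest of the raw, unparsed bytes."""
    return hashlib.sha256(raw).hexdigest()


def ingest(path: Path, expected_sha256: str) -> bytes:
    """Return the file's bytes only if every boundary check passes."""
    # Extension guard: reject file types the pipeline does not expect.
    if path.suffix.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError(f"extension not allowed: {path.suffix!r}")
    # Size limit: checked from file metadata before reading the content.
    size = path.stat().st_size
    if size > MAX_BYTES:
        raise ValueError(f"file too large: {size} bytes")
    # Integrity: hash the raw bytes before any parser touches them.
    raw = path.read_bytes()
    if sha256_of(raw) != expected_sha256:
        raise ValueError("checksum mismatch")
    return raw
```

Running the cheap checks (extension, size) before reading and hashing keeps rejected files from consuming memory. The expected digest is assumed to come from a trusted side channel such as a manifest.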
Why YAML for policies?
The paper argues that policy definitions should be separate from pipeline code and auditable as configuration artifacts. YAML supports both goals: it is human-readable and version-controllable, and because it is parsed at runtime, policies can change without code changes.
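As a rough illustration of declarative rules with per-rule decisions, the sketch below evaluates a policy that has already been parsed into a dictionary (for example with PyYAML's `yaml.safe_load`). Only the `pii_columns` rule name comes from this project. The schema, the `max_null_fraction` rule, the `Decision` class, and the `evaluate` function are assumptions made for the example.

```python
from dataclasses import dataclass

# What a policies.yaml file might parse to. The YAML shown here is hypothetical:
#
#   rules:
#     - name: pii_columns
#       columns: [email, ssn]
#       action: flag
#     - name: max_null_fraction
#       threshold: 0.1
#       action: deny
POLICY = {
    "rules": [
        {"name": "pii_columns", "columns": ["email", "ssn"], "action": "flag"},
        {"name": "max_null_fraction", "threshold": 0.1, "action": "deny"},
    ]
}


@dataclass(frozen=True)
class Decision:
    rule: str
    outcome: str  # "pass", "flag", or "deny"
    detail: str


def evaluate(policy: dict, columns: list[str], null_fraction: float) -> list[Decision]:
    """Return one decision per rule, in the order the rules are declared."""
    decisions = []
    for rule in policy["rules"]:
        name = rule["name"]
        if name == "pii_columns":
            # Flags presence only; masking is out of scope here.
            hits = sorted(set(columns) & set(rule["columns"]))
            outcome = rule["action"] if hits else "pass"
            decisions.append(Decision(name, outcome, f"pii columns present: {hits}"))
        elif name == "max_null_fraction":
            violated = null_fraction > rule["threshold"]
            outcome = rule["action"] if violated else "pass"
            decisions.append(Decision(name, outcome, f"null fraction {null_fraction:.2f}"))
        else:
            # Fail closed on rules this evaluator does not understand.
            decisions.append(Decision(name, "deny", "unknown rule"))
    return decisions
```

Because the rules are plain data, tightening a threshold or adding a PII column is a configuration change that shows up in version control, not a code change.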
Why SQLite for lineage and audit?
The implementation targets local and single-node deployments. SQLite gives us ACID semantics and queryability without requiring a database server. The LineageTracker and AuditLogger interfaces are thin enough that the storage backend can be swapped (e.g., to PostgreSQL or DuckDB) by changing the connection string.
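The following is a minimal sketch of what an append-only SQLite audit log with JSONL export could look like. It is not the project's actual `AuditLogger`: the table layout, the trigger-based enforcement, and the method names `log` and `export_jsonl` are all assumptions for illustration.

```python
import json
import sqlite3
from datetime import datetime, timezone


class AuditLogger:
    """Illustrative append-only audit log; not the project's actual class."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        # Triggers reject UPDATE and DELETE, so rows can only be appended.
        self.conn.executescript(
            """
            CREATE TABLE IF NOT EXISTS audit (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                ts TEXT NOT NULL,
                actor TEXT NOT NULL,
                action TEXT NOT NULL,
                success INTEGER NOT NULL
            );
            CREATE TRIGGER IF NOT EXISTS audit_no_update BEFORE UPDATE ON audit
            BEGIN SELECT RAISE(ABORT, 'audit log is append-only'); END;
            CREATE TRIGGER IF NOT EXISTS audit_no_delete BEFORE DELETE ON audit
            BEGIN SELECT RAISE(ABORT, 'audit log is append-only'); END;
            """
        )

    def log(self, actor: str, action: str, success: bool) -> None:
        ts = datetime.now(timezone.utc).isoformat()
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT INTO audit (ts, actor, action, success) VALUES (?, ?, ?, ?)",
                (ts, actor, action, int(success)),
            )

    def export_jsonl(self) -> str:
        """Serialize every row as one JSON object per line, oldest first."""
        keys = ("id", "ts", "actor", "action", "success")
        rows = self.conn.execute(
            "SELECT id, ts, actor, action, success FROM audit ORDER BY id"
        )
        return "\n".join(json.dumps(dict(zip(keys, row))) for row in rows)


logger = AuditLogger(sqlite3.connect(":memory:"))
logger.log("alice", "ingest", True)
print(logger.export_jsonl())
```

Passing the connection in from outside keeps the class independent of where the data lives. In this sketch the trigger DDL and `AUTOINCREMENT` are SQLite-specific, so a port to another backend would need equivalent schema statements. The triggers only guard against changes made through SQL; anyone with write access to the database file can still drop them.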
Why separate lineage and audit stores?
Lineage describes what happened to data; audit describes who did what and whether it succeeded. Mixing them conflates two distinct concerns. Keeping them separate simplifies querying and access control in multi-actor deployments.
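To illustrate the distinction, a lineage store might look like the sketch below. Records are keyed by dataset and stage and carry content hashes, with no actor or success field, because those belong to the audit store. The schema and the `record` and `history` methods are assumptions, not the project's actual `LineageTracker` API.

```python
import sqlite3


class LineageTracker:
    """Illustrative lineage store; table and column names are assumptions."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.conn.execute(
            """
            CREATE TABLE IF NOT EXISTS lineage (
                id INTEGER PRIMARY KEY,
                dataset TEXT NOT NULL,
                stage TEXT NOT NULL,
                input_sha256 TEXT NOT NULL,
                output_sha256 TEXT NOT NULL
            )
            """
        )

    def record(self, dataset: str, stage: str,
               input_sha256: str, output_sha256: str) -> None:
        """Store one transformation step for a dataset."""
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT INTO lineage (dataset, stage, input_sha256, output_sha256) "
                "VALUES (?, ?, ?, ?)",
                (dataset, stage, input_sha256, output_sha256),
            )

    def history(self, dataset: str) -> list[tuple[str, str, str]]:
        """Return (stage, input hash, output hash) rows in recorded order."""
        rows = self.conn.execute(
            "SELECT stage, input_sha256, output_sha256 FROM lineage "
            "WHERE dataset = ? ORDER BY id",
            (dataset,),
        )
        return rows.fetchall()


# A dedicated connection for lineage; an audit store would use its own
# database, which can then carry different access permissions.
tracker = LineageTracker(sqlite3.connect(":memory:"))
tracker.record("orders", "ingest", "aaa", "aaa")
tracker.record("orders", "validate", "aaa", "bbb")
print(tracker.history("orders"))
```

With the stores split this way, a question such as "which stages touched this dataset?" is a single query against lineage, while access to actor-level audit records can be restricted independently.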