Data Detector

Data Detector is a high-performance engine for detecting, redacting, and generating sensitive data (PII).

Motivation

"To be honest, data privacy wasn't exactly on my radar. But that changed when I started experimenting with RAG on a local LLM. I realized that using a personal AI agent could accidentally expose sensitive info. That 'aha' moment led me to build Data Detector. I really hope this tool helps keep people's data safe."

Installation

pip install data-detector

For more options, see the Installation Guide.

Troubleshooting Submodules

If you cloned the repository without submodules or downloaded the auto-generated GitHub source zip, you may encounter errors like:

  • ModuleNotFoundError: No module named 'verification'
  • FileNotFoundError: Pattern directory not found: .../pii-pattern-engine/...

To fix this:

  1. If using Git: Run the following command in the project root:
    git submodule update --init --recursive
  2. If using GitHub Releases: Do not use the default "Source code (zip/tar.gz)" files. Instead, download the data-detector-<version>-full.tar.gz asset which includes all submodule content.
  3. If using Docker: Ensure you have checked out submodules before building. The included Dockerfile is pre-configured to handle the necessary internal paths and symlinks.

Quick Start

Library Usage

from datadetector import Engine, load_registry

# Load patterns and initialize engine
registry = load_registry()
engine = Engine(registry)

# Find PII
results = engine.find("My phone: 010-1234-5678")

# Redact text
redacted = engine.redact("Contact me at test@example.com")
print(redacted.redacted_text)

NLP-Enhanced Detection (Korean, Chinese, Japanese)

For improved CJK PII detection with particle handling and word segmentation:

from datadetector import Engine, load_registry, NLPConfig

# Configure NLP for CJK processing
nlp_config = NLPConfig(
    enable_language_detection=True,
    enable_korean_particles=True,
    enable_chinese_segmentation=True,
    enable_japanese_segmentation=True
)

registry = load_registry()
engine = Engine(registry, nlp_config=nlp_config)

# Detects PII even with particles or without spaces
text = "私の電話번호는 090-1234-5678입니다"
results = engine.find(text, namespaces=["jp", "kr"])

Detection Process Steps

Here is how the engine processes text, illustrated with a Korean example:

  1. Original Text: 제 이름은 마크이고 전화번호는 010-1234-5678입니다.
  2. Tokenization: ['제', '이름', '은', '마크', '이고', '전화번호', '는', '010-1234-5678', '입니다']
    • The text is split into meaningful units (morphemes/words).
  3. No-word (Stopword Filtering): 이름 마크 전화번호 010-1234-5678
    • Particles (은, 는, 이고, 입니다) are removed to isolate the data.
  4. Regex Matching: 010-1234-5678
    • The pattern is now clearly visible and matched.
  5. Verification: Data Check (Valid)
    • The extracted number is verified against format rules.
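
The steps above can be sketched in plain Python. This is a toy illustration only: the token list is hard-coded, and the particle set, regex, and verification rule are stand-ins for the engine's real morpheme analysis and validators.

```python
import re

# Step 2 output, hard-coded (a real morpheme analyzer is assumed for Korean)
tokens = ['제', '이름', '은', '마크', '이고', '전화번호', '는',
          '010-1234-5678', '입니다']

PARTICLES = {'제', '은', '는', '이고', '입니다'}        # stopwords/particles
PHONE_RE = re.compile(r'^01\d-\d{3,4}-\d{4}$')         # KR mobile format

def detect(tokens):
    content = [t for t in tokens if t not in PARTICLES]   # step 3: filtering
    matches = [t for t in content if PHONE_RE.match(t)]   # step 4: regex
    # step 5: verification — here just a digit-count sanity check
    return [m for m in matches if len(m.replace('-', '')) in (10, 11)]

print(detect(tokens))   # ['010-1234-5678']
```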

Install NLP dependencies:

pip install data-detector[nlp]

See NLP Features Documentation for more details.

ML-Enhanced Detection (Transformer Classifiers)

Boost detection accuracy with fine-tuned DistilBERT classifiers that validate regex matches:

from datadetector import Engine, load_registry
from datadetector.models import TransformerConfig

config = TransformerConfig(enable_context_classifier=True)
engine = Engine(load_registry(), transformer_config=config)

results = engine.find("My SSN is 123-45-6789")
# Binary classifier confirms PII, category classifier validates type
# Scores are boosted/penalized based on ML confidence

Model                 Task              Accuracy  F1
Binary Classifier     PII vs Non-PII    96.2%     96.9%
Category Classifier   21 PII types      87.9%     86.5%

Install Transformer dependencies:

pip install data-detector[transformer]

See Context Analysis Guide for details.

Configurable Scoring (ScoringConfig)

Fine-tune detection sensitivity by adjusting scoring weights, initial scores, and filtering thresholds:

from datadetector import Engine, ScoringConfig, load_registry

# High-precision mode: only keep confident matches
scoring = ScoringConfig(
    min_score=0.7,                   # Drop low-confidence matches
    keyword_pre_close_boost=0.20,    # Reduce keyword influence
    filter_placeholders=True,        # Remove test data (default)
)
engine = Engine(load_registry(), scoring_config=scoring)

results = engine.find("Phone: 010-1234-5678", namespaces=["kr"])
for m in results.matches:
    print(f"{m.category.value}: score={m.score:.3f}, verified={m.verified}")

Key features:

  • Verified matches (Luhn, checksum) start at a score of 0.95 and skip the ML binary classifier
  • min_score filtering drops matches below a configurable threshold
  • Placeholder filtering automatically removes test data like 010-1234-5678
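
As an illustration of the checksum verification mentioned above, here is a standard Luhn implementation. The `luhn_valid` name and the score values are illustrative, not the library's internal verifier API.

```python
def luhn_valid(number: str) -> bool:
    """Standard Luhn checksum (used for card-style numbers)."""
    digits = [int(d) for d in number if d.isdigit()]
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:          # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

# A match that passes verification would start at score 0.95; this mirrors
# the behaviour described above, not the library's actual code path.
score = 0.95 if luhn_valid("4539 1488 0343 6467") else 0.5
```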

See Context Analysis Guide for the full parameter reference.

Resource Scanning: Search > Inventory > Lineage

Data Detector provides a three-stage pipeline for securing structured data resources. Each stage can be used independently or linked together:

Stage 1: Search for Security Information → Stage 2: Create Security Inventory → Stage 3: Security Data Lineage

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Stage 1        │     │  Stage 2        │     │  Stage 3        │
│  Data Explorer  │────▶│  Inventory      │────▶│  Lineage        │
│                 │     │  Generator      │     │  Tracer         │
│  Scan DB, Kafka,│     │                 │     │                 │
│  API, Files,    │     │  Create PII     │     │  Trace how PII  │
│  VectorDB, AI   │     │                 │     │                 │
│  sensitive data │     │  catalog &      │     │  flows across   │
│                 │     │  export reports │     │  resources      │
└─────────────────┘     └─────────────────┘     └─────────────────┘
   ResourceScanResult ──────▶ DataInventory ──────▶ LineageGraph
   (shared interface)

Each stage produces its own output and can be used alone:

  • Stage 1 only: Just scan for sensitive data
  • Stage 1 + 2: Scan and generate an inventory report
  • Stage 1 + 3: Scan and trace data lineage
  • Stage 1 + 2 + 3: Full pipeline

Stage 1: Search for Security Information

from datadetector import Engine, load_registry, DataExplorer
from datadetector import DataResource, ResourceType, ConnectionConfig
from datadetector.adapters.database import DatabaseAdapter

registry = load_registry()
engine = Engine(registry)
explorer = DataExplorer(engine)

resource = DataResource(
    name="my-db",
    resource_type=ResourceType.DATABASE,
    connection=ConnectionConfig(uri="postgresql://user:pass@localhost/mydb"),
)

with DatabaseAdapter(resource) as adapter:
    result = explorer.scan(adapter)
    print(f"Found {result.pii_fields} PII fields in {result.pii_containers} tables")

Supported resources: Database (SQLAlchemy), Kafka (Schema Registry), REST API (OpenAPI), File Storage (CSV/JSON/Parquet/Excel), Vector DB (ChromaDB), Training Data (JSONL/HuggingFace)

Stage 2: Create Security Inventory

from datadetector import DataInventoryGenerator, InventoryFormat

gen = DataInventoryGenerator()
gen.add_scan_result(result)        # From Stage 1
inventory = gen.generate()

# Export as HTML report, JSON, CSV, or YAML
with open("report.html", "w") as f:
    gen.export(inventory, InventoryFormat.HTML, output=f)

# Compare inventories over time
diff = DataInventoryGenerator.diff(old_inventory, inventory)
print(f"New PII: {len(diff.added)}, Removed: {len(diff.removed)}")

Stage 3: Security Data Lineage

from datadetector import DataLineageTracer

tracer = DataLineageTracer()
tracer.add_scan_result(db_result, db_adapter)     # DB with FK discovery
tracer.add_scan_result(kafka_result)               # Kafka topics

# Link fields across resources
tracer.add_cross_resource_link(
    "my-db", "users.email",
    "my-kafka", "user-events.email",
)

graph = tracer.build_graph()
print(tracer.to_mermaid())         # Visualize PII flow as diagram

Install resource adapter dependencies:

pip install data-detector[database]       # SQLAlchemy for DB scanning
pip install data-detector[kafka]          # Kafka + Schema Registry
pip install data-detector[file-storage]   # Parquet, Excel support
pip install data-detector[vector-db]      # ChromaDB for vector store scanning
pip install data-detector[training-data]  # HuggingFace datasets scanning
pip install data-detector[resources]      # All resource adapters

See Resource Scanning Guide for more details.

CLI Usage

# Find PII in text
data-detector find --text "010-1234-5678" --ns kr

# Redact a file
data-detector redact --in input.log --out redacted.log

# Start a REST API server
data-detector serve --port 8080

CLI Commands & Options

Command        Description                       Key Options
find           Search for PII in text or files.  --text, --in, --ns (namespace)
redact         Mask or tokenize sensitive data.  --in, --out, --format
validate       Validate text against a pattern.  --text, --pattern-id
list-patterns  Show all available PII patterns.  --ns, --category
serve          Run as an HTTP/gRPC server.       --port, --host, --workers

Use data-detector --help for a full list of options.

Chrome Extension

Monitor PII in real-time as you browse with the PII Detector Chrome Extension. It uses a hybrid approach combining fast client-side pattern matching with the Data-Detector API for accurate verification.

Features

  • Multi-Source Monitoring: Detect PII in form inputs, page content, and network requests
  • Real-Time Alerts: Visual highlights and notifications when PII is detected
  • Privacy-Preserving: Never stores actual PII values, only metadata
  • Hybrid Detection: Fast client-side matching with API verification for accuracy
  • Offline Fallback: Continues working even when API is unavailable

Quick Setup

  1. Start the API Server:

    data-detector serve --port 8080
  2. Load the Extension:

    • Open Chrome and go to chrome://extensions/
    • Enable "Developer mode"
    • Click "Load unpacked" and select the chrome-extension directory
  3. Configure Settings:

    • Click the extension icon
    • Go to Settings
    • Verify API endpoint is http://localhost:8080
    • Select namespaces (e.g., comm, us, kr)

For detailed instructions, architecture, and troubleshooting, see the Chrome Extension README.

Documentation

For detailed guides and references, see the guides linked in each section above (installation, NLP features, context analysis, resource scanning, and the Chrome extension).

CI/CD Integration

Data Detector can be integrated into your CI/CD pipeline to automatically block PII leaks.

# Example: Fail build if PII is found in changed files
data-detector find --file "changed_file.py" --on-match exit

License

Apache License 2.0 - see LICENSE file for details.

About

Data-detector is a Python-based PII detection and protection framework featuring multi-language NLP support, RAG security, and data tokenization capabilities.
