4 changes: 2 additions & 2 deletions dpsynth/CHANGELOG.md → CHANGELOG.md
@@ -16,8 +16,8 @@ private synthetic data.
This first release contains code for generating differentially private synthetic
tabular data using marginal measurement and Private-PGM inference, including:

- **Two execution modes**: In-memory local mode (via `dpsynth.generate()`,
tested up to ~100M rows) and a distributed Apache Beam mode for larger
- **Two execution modes**: In-memory local mode
(via `dpsynth.TabularSynthesizer`, tested up to ~100M rows) and a
distributed Apache Beam mode for larger
workloads.
- **Marginal-based mechanisms**: AIM, MST, Independent, and Direct mechanisms
for selecting and measuring marginals under differential privacy.
64 changes: 33 additions & 31 deletions docs/in_memory_api.md
@@ -11,32 +11,31 @@ within a single machine's RAM.

--------------------------------------------------------------------------------

## Python API: `dpsynth.generate`
## Python API: `dpsynth.TabularSynthesizer`

The primary entry point for in-memory synthesis is `dpsynth.generate()`. It
accepts a Pandas DataFrame alongside a dictionary of attribute domains and
returns a fully synthetic, differentially private DataFrame matching the exact
schema and data types of your input.
The primary entry point for in-memory synthesis is
`dpsynth.TabularSynthesizer`. It accepts a dictionary of attribute domains,
is calibrated with a privacy budget, and generates a fully synthetic,
differentially private DataFrame matching the exact schema and data types of
your input.

### Function Signature
### Usage

```python
import dpsynth
from dpsynth import discrete_mechanisms
import numpy as np
import pandas as pd

synthetic_df = dpsynth.generate(
data: pd.DataFrame,
domains: dict[str, dpsynth.domain.AttributeType],
epsilon: float,
delta: float,
*,
discrete_config: discrete_mechanisms.DiscreteMechanismConfig = discrete_mechanisms.MSTConfig(),
numerical_bins: int = 32,
one_way_marginal_budget_fraction: float = 0.1,
cross_attribute_constraints: list = (),
skip_compression: bool = False,
) -> pd.DataFrame
synth = dpsynth.TabularSynthesizer(
domains=domains,
discrete_mechanism=discrete_mechanisms.MSTMechanism(),
)
result = synth.calibrate(
epsilon=1.0,
delta=1e-6,
)(np.random.default_rng(), sensitive_df)
synthetic_df = result.synthetic_data
```
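
For callers migrating from the old one-call style, the construct, calibrate, and call steps above can be wrapped in a small helper. The sketch below is only an illustration: `generate_compat` and `EchoSynth` are hypothetical names that are not part of dpsynth, and the stand-in class exists only so the example runs without the library installed. The helper relies solely on the calls shown in the usage snippet (`calibrate(epsilon=..., delta=...)`, calling the result with `(rng, data)`, and reading `.synthetic_data`).

```python
from types import SimpleNamespace

import numpy as np
import pandas as pd


def generate_compat(synth, data, *, epsilon, delta, seed=None):
    """Hypothetical helper: wrap calibrate-then-generate in one call.

    `synth` is any object whose `calibrate(epsilon=..., delta=...)` returns a
    callable taking `(rng, data)` and returning an object with a
    `synthetic_data` attribute.
    """
    calibrated = synth.calibrate(epsilon=epsilon, delta=delta)
    result = calibrated(np.random.default_rng(seed), data)
    return result.synthetic_data


class EchoSynth:
    """Stand-in for illustration only. NOT private, NOT part of dpsynth."""

    def calibrate(self, *, epsilon, delta):
        def run(rng, data):
            # Shuffle the rows so the generator passed at call time is used.
            order = rng.permutation(len(data))
            shuffled = data.iloc[order].reset_index(drop=True)
            return SimpleNamespace(synthetic_data=shuffled)

        return run


df = pd.DataFrame({"amount": [10, 20, 30, 40]})
out = generate_compat(EchoSynth(), df, epsilon=1.0, delta=1e-6, seed=0)
print(out.shape)
```

With a real synthesizer, the first argument would be the configured `dpsynth.TabularSynthesizer` instance. Whether its output is fully determined by the generator passed at call time is not stated in this change, so treat the `seed` parameter as a convenience of the sketch rather than a reproducibility guarantee.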

### Key Arguments
@@ -70,6 +69,7 @@ synthetic records.
import dpsynth
from dpsynth import discrete_mechanisms
from dpsynth import domain
import numpy as np
import pandas as pd

# 1. Load sensitive tabular data into Pandas
@@ -78,23 +78,25 @@ sensitive_df = pd.read_csv("sensitive_transactions.csv")
# 2. Load domain schema from YAML
attribute_domains = domain.from_yaml_file("transaction_domain.yaml")

# 3. Configure the synthesis mechanism (AIM)
aim_config = discrete_mechanisms.AIMConfig(
seed=42,
rounds=50,
pgm_iters=1000,
)

# 4. Generate Differentially Private synthetic data
synthetic_df = dpsynth.generate(
data=sensitive_df,
# 3. Configure and calibrate the synthesizer (AIM)
synth = dpsynth.TabularSynthesizer(
domains=attribute_domains,
discrete_mechanism=discrete_mechanisms.AIMConfig(
seed=42,
rounds=50,
pgm_iters=1000,
),
)
calibrated = synth.calibrate(
epsilon=1.0,
delta=1e-6,
discrete_config=aim_config,
numerical_bins=16, # Use 16 quantile buckets for numerical columns
numerical_bins=16, # Use 16 quantile buckets for numerical columns
)

# 4. Generate Differentially Private synthetic data
result = calibrated(np.random.default_rng(), sensitive_df)
synthetic_df = result.synthetic_data

# 5. Save the synthetic dataframe
synthetic_df.to_csv("synthetic_transactions.csv", index=False)
print("Synthetic data successfully generated!")
@@ -139,7 +141,7 @@ python3 bin/main.py \

## Under the Hood: The In-Memory Lifecycle

When you invoke `dpsynth.generate()`, the library performs the following
When you call a calibrated `TabularSynthesizer`, the library performs the following
single-machine pipeline:

1. **Discretization**: Continuous numerical columns are bucketed into
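
The discretization step can be pictured with a plain quantile-bucketing sketch. This is an assumption-laden illustration, not the library's implementation: it uses `pandas.qcut` on synthetic values, the helper name `quantile_bucket` is hypothetical, and computing bucket edges directly from sensitive data as done here is not differentially private. How dpsynth chooses its bucket edges is not shown in this diff.

```python
import numpy as np
import pandas as pd


def quantile_bucket(values: pd.Series, bins: int):
    """Map a numeric column to integer bucket ids using quantile edges.

    Hypothetical helper for illustration; non-private.
    """
    codes, edges = pd.qcut(
        values, q=bins, labels=False, retbins=True, duplicates="drop"
    )
    return codes, edges


# Synthetic, skewed "transaction amounts" standing in for a numerical column.
rng = np.random.default_rng(0)
amounts = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=10_000), name="amount")

codes, edges = quantile_bucket(amounts, bins=16)
print(len(edges) - 1, "buckets")
print(codes.value_counts().sort_index().tolist())
```

Quantile edges give each bucket roughly the same number of rows even when the column is heavily skewed, which is the usual reason to prefer them over equal-width bins for a setting like `numerical_bins=16`.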
2 changes: 1 addition & 1 deletion docs/index.md
@@ -58,7 +58,7 @@ dataframes to massive distributed datasets across computing clusters:
└────────────────────────────────────────┘
```

### 1. In-Memory DataFrame API (`dpsynth.generate`)
### 1. In-Memory DataFrame API (`dpsynth.TabularSynthesizer`)

Optimized for rapid prototyping, research experimentation, and datasets that
easily fit within single-machine memory.
4 changes: 2 additions & 2 deletions docs/sitemap.md
@@ -11,7 +11,7 @@

* [Why DPSynth?](index.md#why-dpsynth)
* [Core APIs and Execution Models](index.md#core-apis-and-execution-models)
* [1. In-Memory DataFrame API (`dpsynth.generate`)](index.md#1-in-memory-dataframe-api-dpsynthgenerate)
* [1. In-Memory DataFrame API (`dpsynth.TabularSynthesizer`)](index.md#1-in-memory-dataframe-api-dpsynthtabularsynthesizer)
* [2. Scalable PipelineBackend API (`dpsynth.data_generation`)](index.md#2-scalable-pipelinebackend-api-dpsynthdata_generation)
* [Documentation Sitemap & Navigation](index.md#documentation-sitemap--navigation)
* [Supported Synthesis Algorithms](index.md#supported-synthesis-algorithms)
@@ -46,7 +46,7 @@
<details>
<summary>📁 <a href="in_memory_api.md">In-Memory DataFrame API Guide</a></summary>

* [Python API: `dpsynth.generate`](in_memory_api.md#python-api-dpsynthgenerate)
* [Python API: `dpsynth.TabularSynthesizer`](in_memory_api.md#python-api-dpsynthtabularsynthesizer)
* [Usage](in_memory_api.md#usage)
* [Key Arguments](in_memory_api.md#key-arguments)
* [End-to-End Python Example](in_memory_api.md#end-to-end-python-example)
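
Because the link fragments in this sitemap are derived from heading text, a renamed heading needs a matching fragment. The sketch below approximates the GitHub-style rule (lowercase, drop punctuation other than hyphens and underscores, turn spaces into hyphens). The function name `heading_slug` is hypothetical, and other Markdown renderers may slug headings differently, so treat the result as a check rather than a guarantee.

```python
import re


def heading_slug(heading: str) -> str:
    """Approximate a GitHub-style anchor for a Markdown heading."""
    text = heading.strip().lower()
    # Keep word characters, hyphens, and spaces; drop dots, backticks, parens, etc.
    text = re.sub(r"[^\w\- ]", "", text)
    # Each space becomes one hyphen, so a dropped "&" leaves a double hyphen.
    return text.replace(" ", "-")


for title in (
    "Python API: `dpsynth.TabularSynthesizer`",
    "1. In-Memory DataFrame API (`dpsynth.TabularSynthesizer`)",
    "Documentation Sitemap & Navigation",
):
    print(heading_slug(title))
```

Under this rule the dot in `dpsynth.TabularSynthesizer` is removed without leaving a hyphen, matching the existing `dpsynthdata_generation` and `documentation-sitemap--navigation` fragments in this file.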