gen3-metadata-simulator

Generate realistic, linked, schema-valid Gen3 metadata from a bundled Gen3 JSON schema. Point it at a Gen3 data dictionary and it produces one JSON file per node (plus a DataImportOrder.txt), with every foreign key resolving to a real parent record — then self-validates the result with gen3-validator.

Why

Standing up or testing a Gen3 commons needs example data that conforms to your dictionary and links together correctly. Hand-authoring it is tedious and error-prone. This tool reads the dictionary, works out the node dependency order, and fills every node with simulated records that pass validation.

Install

Requires Python ≥ 3.12.10 (a constraint inherited from gen3schemadev).

poetry install

Quickstart

poetry run gen3-metadata-simulator generate \
    --schema examples/jsonschema/acdc_schema_v1.1.5.json \
    --output-dir ./output \
    --num-records 30 \
    --project-code AusDiab_Simulated \
    --seed 1

This writes ./output/<node>.json for every node, plus DataImportOrder.txt, and prints 0 validation errors on success. Re-running with the same --seed reproduces byte-identical output. If validation fails, nothing is written.
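The reproducibility claim can be checked by generating into two directories with the same --seed and comparing file hashes. The following is a minimal sketch and not part of the tool; `digest_dir` and `same_output` are hypothetical helper names, and the directory names in the usage note are placeholders.

```python
import hashlib
from pathlib import Path


def digest_dir(directory: str) -> dict[str, str]:
    """Map each file name in `directory` to the SHA-256 of its bytes."""
    return {
        path.name: hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(Path(directory).iterdir())
        if path.is_file()
    }


def same_output(dir_a: str, dir_b: str) -> bool:
    """True when both directories hold the same file names with identical bytes."""
    return digest_dir(dir_a) == digest_dir(dir_b)
```

For example, after two runs with `--output-dir ./run_a` and `--output-dir ./run_b` and the same seed, `same_output("run_a", "run_b")` should return `True` if the output is byte-identical.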

Options for `generate`

| Flag | Default | Description |
| --- | --- | --- |
| `--schema`, `-s` | (required) | Path to the bundled Gen3 JSON schema. |
| `--output-dir`, `-o` | `./output` | Where to write the metadata files. |
| `--num-records`, `-n` | `30` | Records per node. |
| `--project-code`, `-p` | `simulated_project` | Project `code` that child records link to. |
| `--seed` | (none) | RNG seed for reproducible output. |
| `--array-size` | `0` | Elements per array property (`0` → `[]`). |
| `--skip-validation` | off | Write without self-validating first. |

Run poetry run gen3-metadata-simulator generate --help for the full list, or see docs/usage.md.

Validate an existing dataset

poetry run gen3-metadata-simulator validate \
    --schema examples/jsonschema/acdc_schema_v1.1.5.json \
    --metadata-dir ./output

What the output looks like

- project.json: a single JSON object identified by code.
- <node>.json: a JSON array of N records, each with type, a unique submitter_id, foreign-key objects ({"submitter_id": ...}, or {"code": ...} for links to the project), and schema-conforming property values.
- DataImportOrder.txt: node names in dependency order, one per line, ready to drive a sequential Gen3 submission.
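To make the file layout concrete, here is a rough sketch that loads an output directory and reports foreign keys that do not resolve. It is not the tool's own validation. It assumes that any property whose value is an object (or a list of objects) carrying `submitter_id` or `code` is a link; `load_dataset` and `dangling_links` are hypothetical names.

```python
import json
from pathlib import Path


def load_dataset(output_dir: str) -> dict[str, list[dict]]:
    """Load every <node>.json file in output_dir, keyed by node name."""
    dataset = {}
    for path in sorted(Path(output_dir).glob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        # project.json holds a single object; other node files hold arrays.
        dataset[path.stem] = data if isinstance(data, list) else [data]
    return dataset


def dangling_links(dataset: dict[str, list[dict]]) -> list[tuple[str, str, str]]:
    """Return (node, property, missing key) for each link with no matching record."""
    records = [r for recs in dataset.values() for r in recs]
    known = {
        "submitter_id": {r["submitter_id"] for r in records if "submitter_id" in r},
        "code": {r["code"] for r in records if "code" in r},
    }
    problems = []
    for node, recs in dataset.items():
        for record in recs:
            for prop, value in record.items():
                # Assumption: a link is a dict (or list of dicts) with one of these keys.
                for target in value if isinstance(value, list) else [value]:
                    if not isinstance(target, dict):
                        continue
                    for key, ids in known.items():
                        if key in target and target[key] not in ids:
                            problems.append((node, prop, target[key]))
    return problems
```

An empty list from `dangling_links(load_dataset("./output"))` would indicate that every detected link points at an existing record, under the link-detection assumption stated above.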

How it works

1. Resolve the schema (gen3-validator inlines every $ref).
2. Order nodes topologically so parents are generated before children.
3. Generate records per node, wiring links to real parents.
4. Validate the whole set with gen3_validator.validate_list_dict and refuse to write anything that fails.
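The ordering step can be illustrated with the standard library's `graphlib`. This is a sketch of the general technique, not the tool's actual code: the `parents` map and its node names are hypothetical, and a real map would be derived from each node's link definitions in the resolved schema.

```python
from graphlib import TopologicalSorter

# Hypothetical node -> parents map used only for illustration.
parents = {
    "project": set(),
    "subject": {"project"},
    "sample": {"subject"},
    "assay": {"sample", "subject"},
}


def import_order(parent_map: dict[str, set[str]]) -> list[str]:
    """Return node names so that every parent appears before its children."""
    # TopologicalSorter takes a mapping of node -> predecessors and
    # raises graphlib.CycleError if the links form a cycle.
    return list(TopologicalSorter(parent_map).static_order())


print(import_order(parents))  # ['project', 'subject', 'sample', 'assay']
```

Generating nodes in this order means a child's foreign keys can always be drawn from parent records that already exist.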

See docs/dev-notes.md for a full walkthrough of how it works and docs/usage.md for every flag.

Realistic values with an LLM (`--provider llm`)

By default (--provider random) values are random within schema bounds. The LLM provider instead asks a lightweight model for the semantic properties of each field and samples from them, so output looks believable while still validating:

- numeric: a distribution (mean ± stddev) and realistic limits, so month_birth stays in [1, 12] and bmi_baseline lands near 27 ± 5;
- dates: a real calendar date in a plausible window (no 3170-94-14), rendered to the schema's pattern;
- free text: domain-appropriate strings (an assay description reads like a real one) drawn from an LLM-supplied pool.

Works with Anthropic or OpenAI models. Enums, booleans, and pattern-constrained strings (UBERON / ORCID / md5sum) keep the random/regex behavior. Specs are cached to .cache/distributions.json, so repeat runs make no API calls and a fixed --seed is reproducible.
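As an illustration of the numeric case, the sketch below draws values from a normal distribution and rejects anything outside the limits, using a seeded generator so runs repeat. It is an assumption about one reasonable approach, not the tool's implementation: `NumericSpec` and `sample` are hypothetical names, and the limits for bmi_baseline and the mean and stddev for month_birth are made up for the example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class NumericSpec:
    """Hypothetical per-field spec of the kind an LLM might supply."""
    mean: float
    stddev: float
    minimum: float
    maximum: float
    integer: bool = False


def sample(spec: NumericSpec, rng: random.Random, max_tries: int = 100):
    """Draw from a normal distribution, rejecting values outside the limits."""
    for _ in range(max_tries):
        value = rng.gauss(spec.mean, spec.stddev)
        if spec.minimum <= value <= spec.maximum:
            break
    else:
        # Fall back to the mean clamped into range if every draw was rejected.
        value = min(max(spec.mean, spec.minimum), spec.maximum)
    return round(value) if spec.integer else value


rng = random.Random(1)  # a fixed seed gives the same sequence on every run
bmi = NumericSpec(mean=27, stddev=5, minimum=10, maximum=70)
month = NumericSpec(mean=6.5, stddev=3.5, minimum=1, maximum=12, integer=True)
print(sample(bmi, rng), sample(month, rng))
```

Rejection sampling keeps the shape of the distribution inside the limits, whereas clamping every out-of-range draw would pile values up at the bounds.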

Setup

Copy the example env file and fill in three values — the vendor, the model, and a path to a file holding your API key (the key never goes in .env or the repo):

cp .env.example .env
# edit .env:
#   LLM_PROVIDER=anthropic            # or: openai
#   LLM_MODEL=claude-haiku-4-5        # or e.g. gpt-4o-mini
#   LLM_API_KEY_FILE=/path/to/your/api_key

.env is gitignored. Then just select the LLM strategy — provider and model come from .env:

poetry run gen3-metadata-simulator generate \
    --schema examples/jsonschema/acdc_schema_v1.1.5.json \
    --provider llm --num-records 5 --seed 1

Override per run with --llm-provider anthropic|openai and --llm-model <id>. See docs/usage.md for all flags and docs/dev-notes.md for the design and the pluggable ValueProvider / SpecSource interfaces.
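The key-file indirection described in the setup can be sketched as follows. This shows the general pattern only and assumes the variables from .env are already present in the process environment; how the tool itself loads .env is not covered here, and `read_api_key` is a hypothetical helper.

```python
import os
from pathlib import Path


def read_api_key(env=None) -> str:
    """Read the API key from the file named by LLM_API_KEY_FILE."""
    env = os.environ if env is None else env
    key_file = env.get("LLM_API_KEY_FILE")
    if not key_file:
        raise RuntimeError("LLM_API_KEY_FILE is not set")
    # Strip the trailing newline that most editors add to the file.
    key = Path(key_file).expanduser().read_text(encoding="utf-8").strip()
    if not key:
        raise RuntimeError(f"{key_file} is empty")
    return key
```

Keeping only the path in .env means the secret itself never lands in a file that sits inside the repository tree.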

Documentation

- docs/dev-notes.md: start here. A ground-up, junior-dev-friendly walkthrough of how it all works: the pipeline, the value providers, a worked example, design decisions, a module map, and how to extend it.
- docs/usage.md: every CLI flag for generate and validate, with examples.

Development

poetry run python3 -m pytest    # run the test suite (fully offline)

The example dictionary in examples/jsonschema/ is the test fixture. The key tests are the round-trips (tests/test_roundtrip.py, tests/test_roundtrip_llm.py): generate → validate → assert zero errors. New to the codebase? Read docs/dev-notes.md first.
